Dataset added by mrflip
Stack Overflow Creative Commons Data Dump
We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license.
All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki. You are free:
- to Share — to copy, distribute, and transmit the work
- to Remix — to adapt the work
Under the following conditions:
- Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
- Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
The community has selflessly provided all this content in the spirit of sharing and helping each other. In that very same spirit, we are happy to return the favor by providing a database dump of public data.
We always intended to give the contributed content back to the community as a whole. Our primary concern was making sure we didn’t have an AOL-style “incident” where we accidentally released personally identifying information in so-called “sanitized” data. Stack Overflow user Greg Hewgill was kind enough to help us beta test several iterations of the data dump, ensuring that we didn’t release anything except content that is visible on the public website. He also suggested several improvements to the data dump, so that it contains as much useful public information as possible.
Cheers, Greg! Also, thanks to Stack Overflow Valued Associate #00003, Geoff Dalgas, who patiently worked through many iterations of this to get it together on our end.
The current anonymized public data dump is ~500 megabytes, 7zipped, and contains these files:
1. badges.xml
2. comments.xml
3. posts.xml
4. users.xml
5. votes.xml
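For anyone who wants to poke at the extracted files, here is a minimal Python sketch of reading one of them record by record. The post does not describe the XML layout, so this assumes each file holds a flat sequence of elements named `row` whose fields are stored as attributes; the tag name and the helper names `iter_rows` and `count_rows` are illustrative assumptions, not part of the dump's documentation.

```python
import xml.etree.ElementTree as ET


def iter_rows(source, row_tag="row"):
    """Yield each record's attributes as a dict.

    Assumes (not confirmed by the post) that records are elements
    named `row_tag` with their fields stored as XML attributes.
    """
    # iterparse reads the file incrementally instead of all at once,
    # which matters for a large file such as posts.xml.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            # Copy the attributes before clearing the element.
            yield dict(elem.attrib)
            # Release the element's attributes and children once consumed.
            elem.clear()


def count_rows(source, row_tag="row"):
    """Count the records in one dump file."""
    return sum(1 for _ in iter_rows(source, row_tag))
```

If the layout assumption holds, `count_rows("users.xml")` would give the number of user records in an extracted file; if the files use a different element name, pass it as `row_tag`.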
All four Trilogy sites are now included in the data dump: Stack Overflow, Server Fault, Super User, and Meta Stack Overflow.
Note that if you republish this data, we require attribution as described in this blog post. Most importantly, there should be hyperlinks back to the original question, and the profiles of all participants.
Our plan is to create a new data dump every month, reflecting all data in the system up to that month. We will seed the latest and greatest dump (at a low bitrate) as long as we can, ideally permanently.
And yes, it’s still fun to say “data dump”. We look forward to seeing what the community can do with this data!
Posted by Jeff Atwood on June 4th, 2009
Filed under cc-wiki-dump, community, legal