TL;DR: We are running a two-week data mining sprint from January 4–15, 2016, open to current Harvard students, based on early access to a brand new data set of American caselaw. To apply, send a resume and a brief statement of interest to jcushman at law dot harvard dot edu.
We recently announced Free the Law, our project to scan every legal decision ever published in the United States. We're generating the first consistent, comprehensive, and open database of American law, from the colonial era right up to 2015. The New York Times has covered the project.
By the end of this project we'll have millions of cases in the dataset — no one knows exactly how many. We're scanning and processing tens of thousands of pages a day, and will soon have entire states completed.
Now it's time to start exploring what to do with all that data. What new questions can we ask with millions of cases?
The questions span every discipline at Harvard:
Can a spam filter be retrained to guess which torts cases make the most interesting stories?
How much money are we willing to fight over—and does the answer offer an alternate inflation index?
How has the use of Latin in the law changed over time—are judges writing more or less like regular people?
How has defendants' choice of murder weapon changed? The gender balance of litigants? The reliance on scientific evidence?
Can we trace a family's history through the cases they were involved in?
Caselaw is the historical record of applied moral philosophy under the law. Unlocking its secrets will have an incredible impact on scholarship of all kinds.
Hence our challenge: pick a question you think caselaw might help you answer, perhaps drawn from one of your classes. Build a tool to help answer it—whether that means loading up your favorite ML library, configuring an off-the-shelf statistical tool, or writing code from scratch. In a two-week sprint, do your best to answer the question, and to generalize your tool to help other researchers answer similar questions. We'll help share the discoveries you make and the tools you build.
The data set we share with participants will cover a single state's complete published caselaw. It includes: (1) TIFF and JPEG2000 images for each scanned page; (2) ALTO XML files for each scanned page; and (3) structured XML files for each case.
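For a sense of what working with the per-page ALTO files might look like, here is a minimal Python sketch that pulls the OCR text out of a single page. ALTO stores each recognized word in the CONTENT attribute of a String element inside a TextLine. The function names and the sample page below are made up for illustration; the ALTO version and exact layout of the files in the data set are assumptions, so the code matches elements by local name rather than by a fixed namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_page_lines(xml_text):
    """Return the page's text as a list of strings, one per ALTO TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each word on the line is a String element with a CONTENT attribute.
        words = [
            child.attrib.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(w for w in words if w))
    return lines


# Tiny hand-written sample page, NOT taken from the actual data set.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The"/><SP/><String CONTENT="court"/><SP/><String CONTENT="held"/>
    </TextLine>
    <TextLine>
      <String CONTENT="res"/><SP/><String CONTENT="ipsa"/><SP/><String CONTENT="loquitur."/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_page_lines(SAMPLE):
        print(line)
```

A function like this could feed plain text into whatever analysis you have in mind. The structured per-case XML files are likely a better starting point for case-level questions, with the ALTO files useful when you need page layout or word positions.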
December 2015: Application period.
January 4, 2016: Delivery of data set to participants.
January 4, 6, 8, 11, 13, 15: The group will check in three times a week, either in person or remotely, to share notes, progress updates, and requests for help.
Week of January 18: Demo day (date TBD).
Send your resume and a brief statement of interest (such as a general idea of what sort of project you would like to work on) to jcushman at law dot harvard dot edu. If you would like to work with others, feel free to apply as a group.