Developing the CAP Research Community

Since launching the Caselaw Access Project API and bulk data service in October, we’ve been lucky to see a research community develop around this dataset. Today, we’re excited to share examples of how that community is using CAP data to create new kinds of legal scholarship.

Since October, we’ve seen our research community use data analysis to uncover themes in the Caselaw Access Project dataset, with examples like topic modeling U.S. Supreme Court cases and a quantitative breakdown of our data. We’ve also seen programmatic access to the law create a space to interface with the law in new ways, from creating data notebooks to text and excerpt generators.

Outside this space, data supported by the Caselaw Access Project has found its way into a course assignment to retrieve cases and citations with Python, a call to expand the growing landscape of Wikidata, and library guides on topics ranging from legal research to data mining.

We want to see what you build with Caselaw Access Project data! Keep us in the loop at info@case.law. Looking for ideas for getting started? Visit our gallery and the CAP examples repository.

The Network Librarian

Last year, Jack Cushman expressed a desire for a personal service similar to the one I perform here at LIL – not exactly the DevOps of my job title, but more generally the provision and maintenance of network and computing infrastructure. Jack's take on this idea is very much a personal one, I think: go to a person known to and trusted by you, not the proverbial faceless corporation, for whom you may be as much product as customer.

(I should say here that what follows is my riff on our discussions; errors in transmission are mine.)

As we began to discuss it, it struck me that the idea sounded a lot like some of the work I used to do as a reference librarian at the Public Library of Brookline. This included some formal training for new computer users, but was more often one-on-one, impromptu assistance with things like signing up for a first email account.

Jack's idea goes beyond tech support as traditionally practiced in libraries, but it shares with it the combination of technical knowledge, professional ethics – especially the librarian's rigorous adherence to patron confidentiality – and the personal relationship between patron and librarian.

At LIL, we like naming things whether or not there's actually a project, or, as in this case, before there's even a definition. In order not to keep talking about this vague "idea," I'll bring out the provisional name we came up with for the role we're beginning to imagine: the network librarian.

The network librarian expands on traditional tech support by consulting on computer and network security issues specifically; by advising on self-defense against surveillance where possible and activism where it isn't; and, in some cases, by going beyond the usual help with finding and accessing resources to provide resources directly. Finally, the practice should expand what's possible – in developing the kinds of self-reliance a network librarian will have to have, the library itself will become more self-reliant and less dependent on vendors.

One of the specific services a network librarian might provide is a virtual private network, or VPN. This article explains why a VPN is important and why it's difficult or impossible to evaluate the trustworthiness of commercial VPN providers. It goes on to explain how to set up a VPN yourself, but it's not trivial. What the network librarian has to offer here is not only technical expertise, but a headstart on infrastructure, like an account at a cloud hosting provider. As important, if not more so, is that you know and trust your librarian.

I've made a first cut at one end of this particular problem in setting up a WireGuard server with Streisand, a neat tool that automates the creation of a server running any of several VPNs and similar services. Almost all of my home and phone network traffic has gone through the WireGuard VPN since August, and I've distributed VPN credentials to several friends and family. Obviously, that isn't a real test of this idea, nor does it get at the potentially enormous issues of agreement, support, and liability you'd have to engage with, but it's an experiment in setting up a small-scale and fairly robust service for small effort and little money.

Even before providing infrastructure, the network librarian would suggest tools and approaches. I'd do the work I used to do differently now – for example, I'd strongly encourage a scheme of multiple backups. I'd be more explicit about how to mitigate the risks of using public computers and wireless networks. I'd encourage the use of encryption, for example via Signal or keybase.io. I would sound my barbaric yawp for the use of a password manager and multi-factor authentication.

Are you a network librarian? Do you know one? Do you have ideas about scope, or tools? Can you think of a better name, or does one already exist? Let me know – I look forward to hearing from you. I'm bsteinberg@law.harvard.edu.

Data Stories and CAP API Full-Text Search

Data sets have tales to tell. In the Caselaw Access Project API, full-text search is a way to find these stories in 300+ years of U.S. caselaw, from individual cases to larger themes.

This August, John Bowers began exploring this idea in the blog post Telling Stories with CAP Data: The Prolific Mr. Cartwright, writing: “In the hands of an interested researcher with questions to ask, a few gigabytes of digitized caselaw can speak volumes to the progress of American legal history and its millions of little stories.” Here, I wanted to use the CAP API full-text search as a path to find some of these stories using one keyword: pumpkins.

The CAP API full-text search option was one way to look at the presence of pumpkins in the history of U.S. caselaw. Viewing the CAP API Case List, I filtered cases using the Full-Text Search field to encompass only items that included the term “pumpkins”:

api.case.law/v1/cases/?search=pumpkins

This query returned 640 cases, the oldest decision dating to 1812 and the most recent in 2017. Next, I wanted to look at these cases in more detail. To view the full case text, I logged in and revisited that same query for “pumpkins”, filtering the search to display the full case text.
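
For anyone who would like to reproduce that step, here is a minimal sketch in Python using the requests library. The token is a placeholder, and the exact fields printed are illustrative; requesting full case text requires a logged-in API key and counts against the daily case allowance.

    import requests

    TOKEN = "your-api-token"  # placeholder: sign up at case.law for a real key

    # Full-text search for "pumpkins", asking for the full case text as well.
    resp = requests.get(
        "https://api.case.law/v1/cases/",
        params={"search": "pumpkins", "full_case": "true"},
        headers={"Authorization": f"Token {TOKEN}"},
    )
    resp.raise_for_status()
    data = resp.json()

    print(data["count"])  # total number of matching cases
    for case in data["results"]:
        print(case["decision_date"], case["name_abbreviation"])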

By running a full-text search, we can begin to pull out themes in Caselaw Access Project data. Of the 640 cases returned by our search for “pumpkins”, the jurisdictions that produced the most published cases including this word were Louisiana (30), followed by Georgia (22) and Illinois (21).
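
That jurisdiction tally can be reproduced with a short script that pages through the results and counts each case's jurisdiction. This is a rough sketch, assuming the paginated "next" links and the jurisdiction field behave as documented; no API key is needed for metadata-only results.

    from collections import Counter

    import requests

    url = "https://api.case.law/v1/cases/?search=pumpkins"
    jurisdictions = Counter()

    # Follow the paginated "next" links, tallying the jurisdiction of each case.
    while url:
        page = requests.get(url).json()
        for case in page["results"]:
            jurisdictions[case["jurisdiction"]["name_long"]] += 1
        url = page["next"]

    for name, count in jurisdictions.most_common(3):
        print(count, name)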

In browsing the full cases returned by our query, some stories stand out. One such case is Guyer v. School Board of Alachua County, decided outside Gainesville, Florida, in 1994. Centered around the question of whether Halloween decorations including "the depiction of witches, cauldrons, and brooms" in public schools were based on secular or religious practice and promotion of the occult, this case concluded with the opinion:

Witches, cauldrons, and brooms in the context of a school Halloween celebration appear to be nothing more than a mere “shadow”, if that, in the realm of establishment clause jurisprudence.

In searching the cases available through the Caselaw Access Project API, each query can tell a story. Try your own full-text query and share it with us at @caselawaccess.

Legal Tech Student Group Session Brings Quantitative Methods to U.S. Caselaw

This September we hosted a Legal Tech Gumbo session dedicated to using quantitative methods to find new threads in U.S. caselaw. The Legal Tech Gumbo is a collaboration between the Harvard Law & Technology Society and Harvard Library Innovation Lab (LIL).

The session kicked off by introducing data made available as part of the Caselaw Access Project API, a channel to navigate 6.4 million cases dating back 360 years. How can we use that data to advance legal scholarship? In this session Research Associate John Bowers shared how researchers can apply quantitative research methods to qualitative data sources, a theme which has shaped the past decade of research practices in the humanities.

This summer, Bowers shared a blog post outlining some of the themes he found in Caselaw Access Project data, focusing on the influence of judges active in the Illinois court system. Here, we had the chance to learn more about research based on this dataset and its supporting methodology. We applied these same practices to a new segment of data, viewing a century of Arkansas caselaw in ten-year intervals and using data analytics and visualization to find themes in U.S. legal history. Want to explore the data we looked at in this session? Take a look at this interactive repository (or, if you prefer, check out this read-only version).
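
The decade-by-decade view can be approximated with a few API calls. Here is a rough sketch in the same spirit as the session, not the notebook itself; it assumes "ark" is the Arkansas jurisdiction slug and uses the decision_date_min/decision_date_max filters.

    import requests

    BASE = "https://api.case.law/v1/cases/"

    # Count published Arkansas cases per decade across the twentieth century,
    # using the API's count field rather than downloading any case text.
    for start in range(1900, 2000, 10):
        params = {
            "jurisdiction": "ark",
            "decision_date_min": f"{start}-01-01",
            "decision_date_max": f"{start + 9}-12-31",
        }
        count = requests.get(BASE, params=params).json()["count"]
        print(f"{start}s: {count} cases")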

In this session, we learned new ways to find stories in U.S. caselaw. Have you used Caselaw Access Project data in your research? Tell us about it at info@case.law.

The Story of the case.law Domain

Recently we announced the launch of the Caselaw Access Project at case.law. But we want to highlight the story of the case.law domain itself.

That domain was generously provided by Carl Jaeckel, its previous owner. Carl agreed to transfer the domain to us in recognition and support of the vital public interest in providing free, open access to caselaw. We’re thrilled to have such a perfect home for the project.

Carl is the managing partner of Jaeckel Law, the Founder of ClassAction.org, and the Chief Operating Officer of Dot Law Inc. We can’t wait to see what he and other legal entrepreneurs, researchers, and developers will build based on case.law.

Witchcraft in U.S. Caselaw

Witchcraft in Law

Happy Halloween!

This Halloween is a special one at LIL, since we’re celebrating the release of the Caselaw Access Project and 360 years of digitized U.S. caselaw.

For a small project using our full text search functionality, we mapped out the usage of the term “witchcraft” in United States caselaw.

Here is the result: https://case.law/gallery/witchcraft

(For those unfamiliar with the sordid history of witchcraft in the United States, the Wikipedia entry for Salem Witch Trials is a good primer.)

Below are some steps used to get this data.

Since our metadata is available to anyone without limitations or a login, you can see the result of our search here.

As you can see, there are 503 cases in total that include the word “witchcraft”.

In order to get the context of the word (in our visualization, we display small excerpts), we need to create an account.

First, sign up for an API key.

Once you’ve created an account and logged in, you should head over to our API documentation for a primer on authentication.

Now, you can download the cases using one of several means (but be careful! Each time you download cases, whether in the browser or elsewhere, your daily case limit of 500 is decremented).

You can download cases:

  1. in your browser, by going to this link: read cases in the browser
  2. locally, using the terminal (see an example of a curl request here)
  3. with your favorite programming language (a sketch follows this list), or
  4. by checking out our cap-examples repository and setting up an examples environment. Once it’s set up, you can run the code used to make this visualization.
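
As an example of option 3, here is a rough Python sketch that grabs a page of cases with full case text and prints a short excerpt around the word, similar to what the visualization displays. It assumes the default structured casebody format, where each opinion carries a plain-text "text" field, and the token is a placeholder.

    import requests

    TOKEN = "your-api-token"  # placeholder: use your own API key

    resp = requests.get(
        "https://api.case.law/v1/cases/",
        params={"search": "witchcraft", "full_case": "true"},
        headers={"Authorization": f"Token {TOKEN}"},
    )

    # Print a short excerpt around the first mention of "witchcraft" in each case
    # on this page of results (full case text counts against the 500-case daily limit).
    for case in resp.json()["results"]:
        if case["casebody"]["status"] != "ok":
            continue
        for opinion in case["casebody"]["data"]["opinions"]:
            text = opinion["text"]
            i = text.lower().find("witchcraft")
            if i != -1:
                print(case["name_abbreviation"])
                print("...", text[max(0, i - 60):i + 70], "...")
                break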

Do you have any questions, concerns, ideas? We would love to hear from you! Please write to the Caselaw Access Project team at info@case.law.

Caselaw Access Project (CAP) Launches API and Bulk Data Service

Today the Library Innovation Lab at the Harvard Law School Library is excited to announce the launch of its Caselaw Access Project (CAP) API and bulk data service, which puts the full corpus of published U.S. case law online for anyone to access for free.

Between 2013 and 2018, the Library digitized over 40 million pages of U.S. court decisions, transforming them into a dataset covering almost 6.5 million individual cases. The CAP API and bulk data service puts this important dataset within easy reach of researchers, members of the legal community and the general public.

To learn more about the project, the data and how to use the API and bulk data service, please visit case.law.

The Caselaw Access Project is a project of the Library Innovation Lab at the Harvard Law School Library, and the digitization effort was completed through the partnership and support of Ravel Law, Inc.

Caselaw Access Project (case.law)
Caselaw Access Project API (api.case.law)

Telling Stories with CAP Data: The Prolific Mr. Cartwright

When I think about data, caselaw isn't the first thing that comes to mind.

The word “data” evokes tabulated click-through rates, aggregated housing statistics, and short, readily classifiable chunks of born-digital text. Multi-page 19th century legal documents don’t exactly fit the archetype.

Practically speaking, of course, all it takes for a body of material to become usable ‘data’ is a person or organization willing to make that material accessible to analysis. The HLS Library Innovation Lab’s Caselaw Access Project represents an effort to do just that for centuries of American caselaw. Through resources generated and maintained by the Caselaw Access Project, researchers can explore the rich legal history of the United States byte by byte.

If the rise of big data has taught us one thing, however, it is that “can” does not necessarily imply “should.” Indeed, the practice of subjecting core texts from the humanities and social sciences to data-driven analysis has been met with sharp resistance from some quarters. A widely discussed essay by Daniel Allington, Sarah Brouillette, and David Golumbia criticizing the “Digital Humanities” movement recently argued that the application of quantitative methods to such material has driven “the displacement of… humanities scholarship and activism in favor of the manufacture of digital tools and archives.” To the essay’s writers and others of similar mind, the expansion of data’s domain comes as a threat to the integrity of a long tradition of scholarship.

In this post, I present my experience working with the Caselaw Access Project’s publicly available Illinois dataset as evidence for a more optimistic narrative – namely, that applying quantitative techniques to corpora primarily associated with the qualitative disciplines can help us to uncover and relate stories which might otherwise go unnoticed.

I uncovered this particular story while messing around with measures of “prolificness” amongst Illinois judges between 1850 and the present. I had generated a plot tracking the number of opinions judges had published per year over the timespan (each point corresponds to one judge’s output over the course of one year):

Yearly output by judges in the dataset
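
A plot along these lines is straightforward to reproduce once the bulk data has been flattened. Here is a rough sketch, assuming a hypothetical illinois_opinions.csv with one row per published opinion and "author" and "year" columns; the actual analysis lives in the notebook linked at the end of this post.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Hypothetical input: one row per published opinion, with "author" and "year"
    # columns extracted from the CAP Illinois bulk data ahead of time.
    opinions = pd.read_csv("illinois_opinions.csv")

    per_judge_year = (
        opinions.groupby(["author", "year"]).size().reset_index(name="n_opinions")
    )

    # One point per judge per year, as in the plot described above.
    plt.scatter(per_judge_year["year"], per_judge_year["n_opinions"], s=10, alpha=0.4)
    plt.xlabel("Year")
    plt.ylabel("Opinions published")
    plt.show()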

I noticed an interesting trend – in a window of time between about 1890 and 1930, many justices were publishing upwards of 50 opinions per year (it’s worth noting that modern publication numbers have likely been pushed down by the rise of unpublished opinions, which are not indexed in reporters and therefore cannot be cited as precedent). Digging down a little further, I plotted yearly publication volume for the 5 Illinois judges who wrote the most opinions over the course of their careers.

The 5 most prolific judges in the dataset

All of these judges fall more or less into the timespan discussed, and all were justices of the Illinois Supreme Court. Running the numbers, it became apparent that one Mr. Justice Cartwright was firmly in the lead as the most prolific publisher of legal opinions in the history of the state of Illinois.

My efforts to investigate Cartwright’s life and times through internet research were largely unfruitful. Among the most complete sources I found was a short profile on a website dedicated to social reformer Florence Kelley, which cites just two brief articles about Cartwright – both of them published in the 1920s. A brief Wikipedia entry provides a portrait of the justice taken in 1919, about five years before his death.

Justice James H. Cartwright, 1919

From these paltry sources, I learned that Cartwright was born in Iowa Territory on December 1st, 1842. After serving with some distinction in the Civil War, he attended Michigan Law School starting in 1865. Between 1868 and 1876, he served as general attorney for a regional railroad company. After a period of private practice, Cartwright was elected as a circuit court judge in Oregon, Illinois in 1888. In 1895, he became a justice of the Illinois Supreme Court – a position which he held until his death on May 18th, 1924.

However, none of the sources I was able to locate shed much light on Cartwright’s amazing prolificness, though some of the articles written around the time of his death do reference it offhand. For further insights, I turned to the data. After cleaning and standardizing data corresponding to Cartwright and his peers on the Illinois Supreme Court across his almost 30-year career, I visualized the yearly output of each justice present in the dataset.

Cartwright's published opinion output relative to that of his peers

With a few exceptions, Cartwright was among the most prolific publishers on the court throughout his time as a justice. He was particularly active in his early years of service, with a marked drop off in the two years immediately preceding his death. However, it is clear from the visualization above that Cartwright wasn’t writing enormously more than his peers – there is often at least one justice who authors more than him in a given year, and he occasionally winds up in the middle of the pack. Where, then, does his dominance come from? To find out, I generated a cumulative plot of the number of opinions written by each justice in the span between 1895 and 1925.

Cumulative published opinion counts for Cartwright and peers

As we can see, it is not just Cartwright’s yearly rate of production that catapulted him to dominance – it is also his consistency. In the years between 1896 and 1922, just once did Cartwright have an annual output of fewer than 50 opinions. Over the course of a lengthy career, he kept up this breakneck pace with a degree of longevity and persistence that seems to have eluded his peers.

Perhaps a bit of this relative immunity to fatigue can be attributed to the style of Cartwright’s writing. Per the visualization below, Cartwright tended to write shorter opinions than the majority of his peers – his average opinion totaled about 1,724 words, as compared to the court-wide average of 1,949 words. Justice Orrin Carter, the second most prolific justice on the court in the period examined, averaged about 2,209 words per opinion. Carter’s 1,129 opinions contain 2,493,649 words in total, whereas Cartwright’s 1,978 opinions contain 3,411,869. Interestingly, Cartwright’s word counts were at their lowest during the beginning and end of his career.

Cartwright's average opinion lengths relative to those of his peers
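
The word-count comparison can be sketched in a few lines as well, again assuming a hypothetical per-opinion table with "author" and "text" columns; given the OCR noise noted at the end of the post, the numbers are approximate.

    import pandas as pd

    # Hypothetical input: one row per opinion, with "author" and "text" columns.
    opinions = pd.read_csv("illinois_supreme_court_opinions.csv")

    # Approximate word counts per opinion (OCR errors make these rough).
    opinions["words"] = opinions["text"].str.split().str.len()

    summary = opinions.groupby("author")["words"].agg(
        opinion_count="count", avg_words="mean", total_words="sum"
    )
    print(summary.sort_values("opinion_count", ascending=False).head(10))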

This basic investigation demonstrates just a few of the insights that this dataset offers into the professional life of Cartwright and his peers. In the hands of an interested researcher with questions to ask, a few gigabytes of digitized caselaw can speak volumes to the progress of American legal history and its millions of little stories.

The data used in this blog post can be downloaded from the Caselaw Access Project website: https://capapi.org/bulk-access/. An IPython notebook containing all of the analysis and visualization code used in this post can be found on GitHub here: https://github.com/john-bowers/capexamples/blob/master/CAPDemo.ipynb. Please note that this dataset contains OCR errors and was not cleaned completely – figures are approximate.

The CAP Tracking Tool

Evelin Heidel (@scannopolis on Twitter) recently asked me to document our Caselaw Access Project (website, video) digitization workflow, and open up the source for the CAP "Tracking Tool." I'll dig into our digitization workflow in my next post, but in this post, I'll discuss the Tracking Tool or TT for short. I created the TT to track CAP's physical and digital objects and their associated metadata. More specifically, it:

  • Tracked the physical book from receipt, to scanning, to temporary storage, to permanent storage
  • Served as a repository for book metadata, some of which was retrieved automatically through internal APIs, but most of which was keyed in by hand
  • Tracked the digital objects from scanning to QA, to upload, to receipt from our XML vendor
  • Facilitated sending automated delivery requests to the Harvard Depository, which stored most of our reporters
  • Provided reports on the progress of the project and the fitness of the data we were receiving from our XML vendor

If I might toot my own horn, I'd say it drastically improved the efficiency and accuracy of the project, so it's no wonder Evelin is not the first person to request I open up the source. If doing so were a trivial undertaking I certainly wouldn't hesitate, but it's not. While we have a policy of making all new projects public by default in LIL, that was not the case in the position I held when I created the tracking tool. And while there's nothing particularly sensitive in the code, I'm not comfortable releasing it without a thorough review. I also don't believe that, after all that work, the code would be particularly useful to people. There's so much technical debt, and it's so tightly coupled with our process, data, vendors, and institutional resources that I'm sure adapting it to a new project would take significantly more effort than starting over. I'm confident that development of Capstone — the tool which manages and distributes the fruits of this project — is a much better use of my time.

Please allow me to expound.

The Tracking Tool - The Not So Great Parts

During the project's conception in 2013, I conceived of the TT as a small utility to track metadata and log the receipt, scanning, and shipping of casebooks. Turning a small utility into a monolithic data management environment by continually applying ad hoc enhancements under significant time constraint is the perfect recipe for technical debt, and that's precisely what we ended up with.

S3 bucket names are hard-coded into models. Recipients' email addresses are hard-coded into automated reports. Tests? Ha!

The only flexibility I designed into the application, such as the ability to configure the steps each volume proceeds through during digitization, was there to mitigate the fact that I didn't know exactly what the workflow would look like when I started coding, not because I was trying to make a general-purpose tool. It was made, from the ground up, to work with our project-specific idiosyncrasies. For example, code peppered throughout the application handles a volume's reporter series, which is a critical part of this workflow but nonexistent in most projects. Significant bits of functionality are based on access to internal Harvard APIs, or on having data formatted in the CaseXML, VolumeXML, and ALTO formats.

If all of that wasn't enough, it's written in everybody's favorite language, PHP5, using Laravel 4, which was released in 2013 and isn't the most straightforward framework to upgrade. I maintain that this was a good design choice at the time, but it certainly isn't something I'd recommend adopting today.

Now that I've dedicated a pretty substantial chunk of this post to how the TT is a huge, flaming pile of garbage, let's jump right over to the "pro" column before I get fired.

The Tracking Tool - The Better Parts

Despite all of its hacky bits, the TT is functional, stable, and does its jobs well.

Barcodes

Each book is identified in the TT by its barcode, so users can quickly bring up a book's metadata/event log screen with the wave of a barcode scanner. Harvard's cataloging system assigned most of the barcodes, but techs could generate new CAP-only barcodes for the occasional exception, such as when we received a book from another institution. Regardless of the barcode's source, every book needs an entry in the TT's database. Techs could create those entries individually if necessary, but most often created them in bulk. If a book has a cataloging system barcode, the TT pulls some metadata, such as the volume number and publication year, from the cataloging API.

Reporters

A crucial part of the metadata and organization of this tool is the reporter table — a hand-compiled list of every reporter series in the scope of this project. Several expert law librarians constructed the table by combing through a few hundred years of Harvard cataloging data, which, after many generations of library management and cataloging systems, had varying levels of accuracy. If you're interested, check out our master reporter list on GitHub! The application guesses each volume's reporter based on its HOLLIS number — another internal cataloging identifier — but the guess needs to be double-checked by a tech.

Automated Expertise

Several data points created during the in-hand metadata analysis stage could trigger outside review. If a book was automatically determined to be rare, using a set of criteria determined by our Special Collections department, or the tech flagged it as needing bibliographic review, the TT included the barcode in its daily email to the appropriate group of specialists.

Process Steps and Book Logs

The system has a configurable set of process steps each volume must complete, such as in-hand metadata analysis or scanning, with configurable prerequisites. Such a system ensures all books proceed through all of the steps, in the intended order, and facilitates very granular progress reports. Each step is recorded in the book's log, which also contains:

  • Info Entries: e.g., user x changed the publication year for this book
  • Warnings: e.g., the scan job was put on hold
  • Exceptions: e.g., the scanned book failed the QA test

Control Flow

Each of those process steps has a configurable set of prerequisites. For example, to mark a book as "analyzed," it must have several metadata elements recorded. To mark it as "stored on X shelf," the log must contain a "scanned" event.
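
The TT itself is PHP, but the prerequisite idea is simple enough to illustrate. This little Python sketch is invented for this post; the step names and structure are made up, not lifted from the actual code.

    # Illustrative only: step names and structure are invented for this sketch.
    PREREQUISITES = {
        "analyzed": ["received"],
        "scanned": ["analyzed"],
        "stored_on_shelf": ["scanned"],
    }

    def can_record_step(step, book_log):
        """A step may be recorded only if all of its prerequisites are already logged."""
        return all(prereq in book_log for prereq in PREREQUISITES.get(step, []))

    log = {"received", "analyzed"}
    print(can_record_step("scanned", log))          # True
    print(can_record_step("stored_on_shelf", log))  # False: not scanned yet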

If a supervisor needs to track down a book during the digitization process, they can put that book "on hold." The next person to scan that barcode sees a prominent warning and must engage with a confirmation prompt before taking any action. Generally, the person who placed the volume on hold would put instructions in the volume's notes field.

Efficiency

Accessing each volume's page to record an event, such as receipt of a book from the repository, is terribly inefficient when there are more than a few books. In the streamlined mode, techs specify a process step, which they can then bulk-apply to any book by scanning its barcode. An audio cue indicates whether or not the status was applied correctly, so the technician doesn't even have to look at the computer unless there's a problem.

External Communication

The TT has a simple REST API to communicate with daemons that run on other systems. Through the API, external processes can trigger metadata uploads once file uploads are complete, monitor our scanner output, discover newly uploaded objects from our vendor, sync scan timestamps and QA status, and handle a few other things.

Quality Assurance

Within the TT lies a system to inspect the output received from our XML vendor. The user can view statistics about the number of volumes received per state or jurisdiction, drill down to see XML tag statistics at different levels of granularity, or even drill down to individual cases and view page images overlaid with interactive ALTO text. The higher-level overviews were quite useful in ironing out some vendor process problems.

The Long and Short of It

The tracking tool was an invaluable part of the CAP workflow, but vast swaths of the code would only be useful to people replicating this exact project, using Harvard's internal cataloging systems, using the same highly automated scanner configured precisely as ours was, and receiving XML in the exact format we designed for this project. While a subset of the TT's features would be pretty useful to most people doing book digitization, I am very confident that anybody interested in using it would be much better off creating a more straightforward, more generalizable tool from scratch, using a better language. I've considered starting a more generalizable, open source tool for digitization projects, but if someone else gets to it first, I'd be happy to discuss the architectural wisdom I've gained by writing the TT. If someone knows of another open source project already doing this, let me know; I'd love to check it out. Reach out to lil@law.harvard.edu with any questions, comments, or hate mail!

The 'Library' in Library Innovation Lab

A roomful of people sitting in armchairs with laptops may not appear at first glance to be a place where library work is happening. It could look more like a tech startup, or maybe a student lounge (modulo the ages of some of the people in the armchairs). You don't have to be a librarian to see it, but, as the only librarian presently working at LIL, I'll try to show how LIL's work is at the heart of librarianship.

Of our main projects, the Nuremberg project is the closest to a notion of traditional library work: scanning, optical character recognition, and metadata creation for trial documents and transcripts from the Nuremberg Military Tribunals, a collection of enormous historical interest. This is squarely in the realm of library collections, preservation, and access.

In its broad outline, the work on Nuremberg is similar to that of the Caselaw Access Project, the digitization of all U.S. case law. This project, however, is what Jonathan Zittrain has referred to as a systemic intervention. By making the law freely accessible online, we are not only going to alter the form of and access to the print collection, but we are going to transform the relationships of libraries, lawyers, courts, scholars, and citizens to the law. By freeing the law for a multitude of uses, the Caselaw Access Project will support efforts like H2O, LIL's free casebook platform, another intervention into the field of publishing.

Over the last forty years or so, as computers have become more and more essential to library work, libraries have ceded control to vendors. For example, not only does a library subscribing to an online journal database lose the ability to make collection development decisions autonomously (though LOCKSS, a distributed preservation system, helps address this), but, in relinquishing control of the platform, it relinquishes the power to protect patron confidentiality, and consequently intellectual freedom.

Perma.cc is an intervention of a different sort, a tool to combat link rot. As a means of permanently archiving web links, it's close to libraries' preservation efforts, but the point of action is generally the author or editor of a document, not an archivist, post-publication. Further, Perma's reliability rests on the authority of the library to maintain collections in perpetuity.

As library work, these interventions are radical, in the sense of at-the-root: they address core activities of the library, they engage long-standing problems in librarianship, and they expand on and distribute traditional library work.