The CAP Roadshow

In 2019 we embarked on the CAP Roadshow, sharing the Caselaw Access Project with new friends and colleagues at conferences and workshops.

Between February and May 2019, we made a series of stops at conferences and workshops.

Next stop on the road will be UNT Open Access Symposium from May 17 - 18 at University of North Texas College of Law. See you there!

On the road, we connected the Caselaw Access Project with new people. We shared where the data comes from, what kinds of questions machine-readable data lets us ask, and all the ways you're building and learning with Caselaw Access Project data to see the landscape of U.S. legal history anew.

The CAP Roadshow doesn’t stop here! Share Caselaw Access Project data with a colleague to keep the party going.


Colors in Caselaw

The prospect of the Caselaw Access Project dataset becoming public for the first time brings with it the obvious (and wholly necessary) ideas for data parsing: our dataset is vast and the metadata structured (read about the process to get to this), but the work of parsing the dataset is far from over. For instance, there's a lot of work to be done in parsing individual parties in CAP (like names of judges), we don't yet have a citator, and we still don't know who wins a case and who loses. For that matter, we don't really know what "winning" and "losing" even mean. If you are interested in working on any of these problems and more, start here:

At LIL we've also undertaken lighter explorations that highlight opportunities made possible by the data and help teach ways to get started parsing caselaw. To that end, we've written caselaw poetry with a limerick generator, discovered the most popular words in California caselaw with wordclouds, and found all instances of the word "witchcraft" for Halloween. We've also created an examples repository for anyone just starting out.

This particular project began as a quick look at a very silly question:

What, exactly, is the color of the law?

It turned, surprisingly, into a fairly deep introduction to NLP. In this blog post, I'm putting down some thoughts about my decisions, process, and things I learned along the way. Hopefully it will inspire someone looking into the CAP data to ask their own silly (or very serious) questions. This example might also be useful as a small tutorial for getting started on neural-based NLP projects.

Here is the resulting website, with pretty caselaw colors:

A note on the dataset

For the purposes of sanity and brevity, we will only be looking at the Illinois dataset in this blog post. It is also the dataset that was used for this project.

If you want to download your own, here are some links: Download cases here, or Extract cases using Python

How does one go about deciding on the color of the law?

One way to do it is to find all the mentions of colors in each case.

Since there is a finite number of labelled colors, we could look at each color and simply run a search through the dataset on each word. So let's say we start by looking at the color "green". But wait! We've immediately run into trouble. It turns out that "Green" is quite a popular last name. If we exclude every instance where the "G" is capitalized, we might miss important data, like sentences that start with the color green. Adding to the trouble, the lowercase "orange" is both a color and a fruit. Maybe we could start by looking at instances of the color words as adjectives?
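A tiny, made-up example shows the trouble with plain string matching:

```python
import re

text = "Mr. Green appealed. The wall was painted green. Green paint covered the orange."

# Naive case-insensitive search catches the surname "Green" too.
naive = re.findall(r"green", text, flags=re.IGNORECASE)

# Matching only lowercase avoids the surname, but also misses
# sentence-initial uses like "Green paint".
lowercase_only = re.findall(r"\bgreen\b", text)

print(len(naive))           # 3 matches, including the surname
print(len(lowercase_only))  # 1 match, missing "Green paint"
```

Neither heuristic is right, which is exactly why part-of-speech information helps.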

Enter Natural Language Processing

Natural Language Processing (NLP) is a field of computer science aimed at the understanding and parsing of texts.

While I'll be introducing NLP concepts here, if you want a more in-depth write-up on NLP as a field, I would recommend Adam Geitgey's series, Natural Language Processing is Fun!

A brief overview of some NLP concepts used

Tokenization: Tokenizing is the process of divvying up a wall of text into smaller components — typically, those are words (sometimes they are characters). Having word chunks allows us to do all kinds of parsing. This can be as simple as "break on space" but usually also treats punctuation as a token.
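A toy version of "break on space, but keep punctuation as tokens" takes only a line of regex (real tokenizers in nltk or spacy handle many more edge cases):

```python
import re

def tokenize(text):
    # Runs of word characters, or any single punctuation mark,
    # each emitted as a separate token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The law, in its majestic equality..."))
# ['The', 'law', ',', 'in', 'its', 'majestic', 'equality', '.', '.', '.']
```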

Parts of speech tagging: tagging words with their respective parts of speech (noun, adjective, verb, etc.). This is usually a built-in method in NLP tools like nltk and spacy. These tools use a pretrained model, often one built on top of large datasets that have been tediously and manually tagged (thanks to all ye hard workers of yesteryear who have made our glossing over this difficult work possible).

Root parsing: the grouping of syntactically cogent terms. We take the token chosen (in this case, we're only looking at adjectives) and the "parent" of this token (read this documentation to learn more).

Now what?

Unfortunately, we don't have a magical reference to every use of a color in the law, so we'll need to come up with some heuristics that will get us most of the way there. There are a couple of ways we could go about finding the colors:

The easiest route we can take is to match any adjective that appears in our colors list when we come across it and call it a day. The other way, more interesting to me, is to use root parsing to get the context pertinent to the color, to make sure that we get the right shade. "Baby pink" is very different from "hot pink", after all.

To get here, we can use the NLP library spacy. The result is a giant list of word pairings like "red pepper" and "blue leather". These may read as a food and a type of cloth rather than colors. As far as this project is concerned, however, we're treating these word pairings as specific shades. "Blue sky" might be a different shade than "blue leather". "Red pepper" might be a different shade than "red rose".
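The pairing logic itself is simple once spacy has done the tagging. Here's a sketch with stand-in tuples in place of a real spacy doc (with spacy you'd read token.text, token.pos_, and token.head.text from the parsed document):

```python
COLORS = {"red", "blue", "green", "orange", "pink"}

# Stand-ins for spacy tokens: (text, part_of_speech, head_text).
tagged = [
    ("the", "DET", "pepper"),
    ("red", "ADJ", "pepper"),
    ("pepper", "NOUN", "sat"),
    ("on", "ADP", "sat"),
    ("blue", "ADJ", "leather"),
    ("leather", "NOUN", "on"),
]

# Keep only adjectives from our color list, paired with their parents.
shades = [
    f"{text} {head}"
    for text, pos, head in tagged
    if pos == "ADJ" and text.lower() in COLORS
]

print(shades)  # ['red pepper', 'blue leather']
```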

But exactly what shade is "red pepper" and how would a machine interpret it?

To find out the answer, we turn to recent advances in NLP techniques using Neural Networks.

Recurrent Neural Networks, a too-brief overview

Neural Networks (NNs) are functions that are able to "learn" (more on that in a bit) from a large trove of data. NNs are used for lots of things: from simple classifiers (is it a picture of a dog? Or a cat?) to language translation, and so forth. Recurrent Neural Networks (RNNs) are a specific kind of NN: they are able to learn from past iterations by passing the results of a preceding output down the chain, meaning that running them multiple times should produce increasingly accurate results. One caveat: if we run too many epochs, or full training cycles (each epoch being a forward and backward pass through all of the data), there's a danger of "overfitting", where the RNN essentially memorizes the correct answers!

A contrived example of running an already fully-trained RNN over 2-length sequences of words might look something like this:

Input: "box of rocks"
Output: prediction of the word "rocks"

Step 1: RNN("", "box") -> 0% "rocks"
Step 2: RNN("box", "of") -> 0% "rocks"
Step 3: RNN("of", "rocks") -> 50% "rocks"

Notice that an RNN works over a fixed sequence length, and would only be able to understand word relationships bounded by this length. An LSTM (Long short term memory) is a special type of RNN that overcomes this by adding a type of "memory" which we won't get into here.

Crucially, the NN has two major components: forward and backward propagation. Forward propagation is responsible for getting the output of the model (as in, stepping forward in your network by running your model). An additional step is model evaluation: finding out how far our output is from our expectations (our labelled "truth" set), in other terms, getting the error, or loss. This also plays a role in backward propagation.

Backward propagation is responsible for stepping backward through the network and computing the derivative of the computed error with respect to the weights of the model. This derivative is used by the gradient descent function, an optimization that adjusts the weights to decrease the error by a small amount at each step. This is the "learning" part of the NN: by running it over and over, stepping forward, stepping backward, figuring out the derivative, running it through gradient descent, adjusting the weights to minimize the error, and repeating the cycle, the NN is able to learn from past mistakes and successes and move toward a more correct output.
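The forward/backward/update cycle can be seen in miniature with the smallest possible "network": a single weight fit to y = 2x by gradient descent. This is a toy (no RNN, no library), but each named piece is there:

```python
# Toy model: y_hat = w * x. Truth set: y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0
lr = 0.05  # learning rate: how far each gradient step moves w

for epoch in range(50):  # one epoch = a pass through all of the data
    for x, y in data:
        y_hat = w * x         # forward propagation
        error = y_hat - y     # loss is error**2
        grad = 2 * error * x  # d(loss)/dw: the backward step
        w -= lr * grad        # gradient descent update

print(round(w, 3))  # converges to 2.0
```

Real NNs do the same thing with millions of weights, using the chain rule to push the derivative back through each layer.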

For an excellent video series explaining Neural Networks in more depth, check out season 3 from 3Blue1Brown. Recurrent Neural Networks and LSTM is a nice write-up with more in-depth top-level concepts.

Colorful words

As luck would have it, I happened upon a white paper that solved the exact problem of figuring out the "correct" shade for an entered phrase, and a fantastic implementation of it (albeit one that needed a bit of tuning).

The resulting repo is here:

The basic steps to reproduce are these:

  1. We take a large set of color data. gives us access to about a million labeled, open source, community-submitted colors, everything from "dutch teal" (#1693A5) to a very popular color named "certain frogs" (#C3FF68).
  2. We create a truth set. This is important because we need to train the model against something that it treats as correct. For our purposes, we do have a sort of "truth" of colors, a largely agreed-upon set in the form of HTML color codes with their corresponding hex values. There are 148 of those that I've found.
  3. We convert all hex values to CIE LAB values (these are more conducive to an RNN's gradient learning, as they are easily mappable in 3D space).
  4. We tokenize each value on character ("blue" becomes "b", "l", "u", "e").
  5. We call in PyTorch to rescue us from the rest of the hard stuff, like creating character embeddings.
  6. We run our BiLSTM model (a bi-directional Long Short Term Memory model, which is a type of RNN that is able to remember inputs from current and previous iterations).
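Step 3, converting hex codes to CIE LAB, is plain math. A self-contained version might look like this (standard sRGB-to-LAB formulas with the D65 white point; the actual repo may use a color library instead):

```python
def hex_to_lab(hex_code):
    """Convert '#RRGGBB' to CIE LAB (D65), via linear RGB and XYZ."""
    r, g, b = (int(hex_code.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4))

    # Undo the sRGB gamma curve.
    def linearize(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = linearize(r), linearize(g), linearize(b)

    # Linear RGB -> XYZ (sRGB matrix), normalized by the D65 white point.
    x = (0.4124 * r + 0.3576 * g + 0.1805 * b) / 0.95047
    y = (0.2126 * r + 0.7152 * g + 0.0722 * b) / 1.00000
    z = (0.0193 * r + 0.1192 * g + 0.9505 * b) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16 / 116

    fx, fy, fz = f(x), f(y), f(z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

print([round(v) for v in hex_to_lab("#FFFFFF")])  # [100, 0, 0]
```

White comes out as L=100 with a and b near zero, and black as L=0, which is the "easily mappable in 3D space" property the model relies on.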

The results

The results live here: (sorted by date) or (sorted by luminosity). You can also see the RNN in action by going here.

Although this was a pretty whimsical look at a very serious dataset, we do see some stories start to emerge. My favorite of these is a look at the different colors of the word "hat" in caselaw.

Here are years 1867 to 1935: Illinois hats from 1867 to 1935

And years 1999 to 2011: Illinois hats from 1999 to 2011

Whereas the colors in the late 1800s are muted, mostly grays, browns, and tans, the colors in the 21st century are bright blues, reds, oranges, and greens. We seem to be getting a small window into the U.S.'s industrialization and the fashion of the times ("industrialization" is a latent factor, or a hidden neuron, here :-). Who would have thought we could do that by looking at caselaw?

When I first started working on this project, I had no expectations of what I would find. Looking at the data now, it is clear that some of the most commonly present colors are black, brown, and white, and from what I can tell, the majority of the mentions of those are race related. A deeper dive would require a different person to look at this subject, and there are many other more direct ways of approaching such a serious matter than looking at the colors of caselaw.

If you have any questions, any kooky ideas about caselaw, or any interest in exploring together, please let me know!

Launching CAP Search

Today we're launching CAP search, a new interface to search data made available as part of the Caselaw Access Project API. Since releasing the CAP API in Fall 2018, this is our first try at creating a more human-friendly way to start working with this data.

CAP search supports access to 6.7 million cases from 1658 through June 2018, digitized from the collections at the Harvard Law School Library. Learn more about CAP search and limitations.

We're also excited to share a new way to view cases, formatted in HTML. Here's a sample!

We invite you to experiment by building new interfaces to search CAP data. See our code as an example.

The Caselaw Access Project was created by the Harvard Library Innovation Lab at the Harvard Law School Library in collaboration with project partner Ravel Law.

Some Thoughts on Digital Preservation

One of the things people often ask about is how we ensure the preservation of Perma links. There are some answers in Perma's documentation, for example: was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials — now we can do the same for links.


How long will you keep my Links?

Links will be preserved as a part of the permanent collection of participating libraries. While we can't guarantee that these records will be preserved forever, we are hosted by university libraries that have endured for centuries, and we are planning to be around for the long term. If we ever do need to shut down, we have laid out a detailed contingency plan for preserving existing data.

The contingency plan is worth reading; I won't quote it here. (Here's a Perma link to it, in case we've updated it by the time you read this.) In any case, all three of these statements might be accused of a certain nonspecificity - not as who should say vagueness.

I think what people sometimes want to hear when they ask about preservation of Perma links is a very specific arrangement of technology. A technologically specific answer, however, can only be provisional at best. That said, here's what we do at present: Perma saves captures in the form of WARC files to an S3 bucket and serves them from there; within seconds of each capture, a server in Germany downloads a copy of the WARC; twenty-four hours after each capture, a copy of the WARC is uploaded to the Internet Archive (unless the link has been marked as private); also at the twenty-four hour mark, a copy is distributed to a private LOCKSS network. The database of links, users, registrars, and so on, in AWS, is snapshotted daily, and another snapshot of the database is dumped and saved by the server in Germany.

Here's why that answer can only be provisional: there is no digital storage technology whose lifespan approaches the centuries of acid-free paper or microfilm. Worse, the systems housing the technology will tend to become insecure on a timescale measured in days, weeks, or months, and, unattended, impossible to upgrade in perhaps a few years. Every part of the software stack, from the operating system to the programming language to its packages to your code, is obsolescing, or worse, as soon as it's deployed. The companies that build and host the hardware will decline and fall; the hardware itself will become unperformant, then unusable.

Mitigating these problems is a near-constant process of monitoring, planning, and upgrading, at all levels of the stack. Even if we were never to write another line of Perma code, we'd need to update Django and all the other Python packages it depends on (and a Perma with no new code would become less and less able to capture pages on the modern web); in exactly the same way, the preservation layers of Perma will never be static, and we wouldn't want them to be. In fact, their heterogeneity across time, as well as at a given moment, is a key feature.

The core of digital preservation is institutional commitment, and the means are people. They require dedication, expertise, and flexibility; the institution's commitment and its staff's dedication are constants, but their methods can't be. The resilience of a digital preservation program lies in their careful and constant attention, as in the commonplace, "The best fertilizer is the farmer's footprint."

Although I am not an expert in digital preservation, nor well-read in its literature, I'm a practitioner; I'm a librarian, a software developer, and a DevOps engineer. Whether or not you thought this was fertilizer, I'd love to hear from you. I'm

Developing the CAP Research Community

Since launching the Caselaw Access Project API and bulk data service in October, we’ve been lucky to see a research community develop around this dataset. Today, we’re excited to share examples of how that community is using CAP data to create new kinds of legal scholarship.

Since October, we’ve seen our research community use data analysis to uncover themes in the Caselaw Access Project dataset, with examples like topic modeling U.S. supreme court cases and a quantitative breakdown of our data. We’ve also seen programmatic access to the law create a space to interface with the law in new ways, from creating data notebooks to text and excerpt generators.

Outside this space, data supported by the Caselaw Access Project has found its way into a course assignment to retrieve cases and citations with Python, a call to expand the growing landscape of Wikidata, and library guides on topics ranging from legal research to data mining.

We want to see what you build with Caselaw Access Project data! Keep us in the loop at Looking for ideas for getting started? Visit our gallery and the CAP examples repository.

The Network Librarian

Last year, Jack Cushman expressed a desire for a personal service similar to the one I perform here at LIL – not exactly the DevOps of my job title, but more generally the provision and maintenance of network and computing infrastructure. Jack's take on this idea is very much a personal one, I think: go to a person known to and trusted by you, not the proverbial faceless corporation, for whom you may be as much product as customer.

(I should say here that what follows is my riff on our discussions; errors in transmission are mine.)

As we began to discuss it, it struck me that the idea sounded a lot like some of the work I used to do as a reference librarian at the Public Library of Brookline. This included some formal training for new computer users, but was more often one-on-one, impromptu assistance with things like signing up for a first email account.

Jack's idea goes beyond tech support as traditionally practiced in libraries, but it shares with it the combination of technical knowledge, professional ethics – especially the librarian's rigorous adherence to patron confidentiality – and the personal relationship between patron and librarian.

At LIL, we like naming things whether or not there's actually a project, or, as in this case, before there's even a definition. In order not to keep talking about this vague "idea," I'll bring out the provisional name we came up with for the role we're beginning to imagine: the network librarian.

The network librarian expands on traditional tech support by consulting on computer and network security issues specifically; by advising on self-defense against surveillance where possible and activism where it isn't; and in some cases going beyond the usual help with finding and accessing resources, to providing resources directly. Finally, the practice should expand what's possible – in developing the kinds of self-reliance a network librarian will have to have, the library itself will become more self-reliant and less dependent on vendors.

One of the specific services a network librarian might provide is a virtual private network, or VPN. This article explains why a VPN is important and why it's difficult or impossible to evaluate the trustworthiness of commercial VPN providers. It goes on to explain how to set up a VPN yourself, but it's not trivial. What the network librarian has to offer here is not only technical expertise, but a headstart on infrastructure, like an account at a cloud hosting provider. As important, if not more so, is that you know and trust your librarian.

I've made a first cut at one end of this particular problem in setting up a WireGuard server with Streisand, a neat tool that automates the creation of a server running any of several VPNs and similar services. Almost all of my home and phone network traffic has gone through the WireGuard VPN since August, and I've distributed VPN credentials to several friends and family. Obviously, that isn't a real test of this idea, nor does it get at the potentially enormous issues of agreement, support, and liability you'd have to engage with, but it's an experiment in setting up a small-scale and fairly robust service for small effort and little money.

Even before providing infrastructure, the network librarian would suggest tools and approaches. I'd do the work I used to do differently now – for example, I'd strongly encourage a scheme of multiple backups. I'd be more explicit about how to mitigate the risks of using public computers and wireless networks. I'd encourage the use of encryption, for example via Signal or I would sound my barbaric yawp for the use of a password manager and multi-factor authentication.

Are you a network librarian? Do you know one? Do you have ideas about scope, or tools? Can you think of a better name, or does one already exist? Let me know – I look forward to hearing from you. I'm

Data Stories and CAP API Full-Text Search

Data sets have tales to tell. In the Caselaw Access Project API, full-text search is a way to find these stories in 300+ years of U.S. caselaw, from individual cases to larger themes.

This August, John Bowers began exploring this idea in the blog post Telling Stories with CAP Data: The Prolific Mr. Cartwright, writing: “In the hands of an interested researcher with questions to ask, a few gigabytes of digitized caselaw can speak volumes to the progress of American legal history and its millions of little stories.” Here, I wanted to use the CAP API full-text search as a path to find some of these stories using one keyword: pumpkins.

The CAP API full-text search option was one way to look at the presence of pumpkins in the history of U.S. caselaw. Viewing the CAP API Case List, I filtered cases using the Full-Text Search field to encompass only items that included the term “pumpkins”:

This query returned 640 cases, the oldest decision dating to 1812 and the most recent to 2017. Next, I wanted to take a closer look at the cases themselves, so I logged in and revisited the same “pumpkins” query, this time filtering the search to display the full case text.
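Programmatically, the same query can be built against the API's cases endpoint. A minimal sketch (endpoint and parameter names per the CAP API v1 documentation as it existed at the time; no request is actually made here):

```python
from urllib.parse import urlencode

# Build the same query the browsable API form produces.
BASE = "https://api.case.law/v1/cases/"

params = {"search": "pumpkins", "page_size": 100}
url = f"{BASE}?{urlencode(params)}"

print(url)
# https://api.case.law/v1/cases/?search=pumpkins&page_size=100
```

Fetching that URL (with the `full_case=true` parameter and an API key, for the full text) returns JSON with a `count` field and a paginated `results` list.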

By running a full-text search, we can begin to pull out themes in Caselaw Access Project data. Of the 640 cases returned by our search that included the word “pumpkins”, the jurisdictions that produced the most published cases including this word were Louisiana (30) followed by Georgia (22) and Illinois (21).
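A tally like that is easy to reproduce client-side. Here's a sketch over stand-in result objects; the real API responses carry many more fields per case, so the exact shape shown here is illustrative:

```python
from collections import Counter

# Stand-ins for the API's JSON results: each case records its jurisdiction.
cases = [
    {"name_abbreviation": "Smith v. Jones", "jurisdiction": {"name_long": "Louisiana"}},
    {"name_abbreviation": "Doe v. Roe", "jurisdiction": {"name_long": "Georgia"}},
    {"name_abbreviation": "A v. B", "jurisdiction": {"name_long": "Louisiana"}},
]

# Count how many matching cases each jurisdiction published.
counts = Counter(c["jurisdiction"]["name_long"] for c in cases)
print(counts.most_common(2))  # [('Louisiana', 2), ('Georgia', 1)]
```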

In browsing the full cases returned by our query, some stories stand out. One such case is Guyer v. School Board of Alachua County, decided outside Gainesville, Florida, in 1994. Centered around the question of whether Halloween decorations including "the depiction of witches, cauldrons, and brooms" in public schools were based on secular or religious practice and promotion of the occult, this case concluded with the opinion:

Witches, cauldrons, and brooms in the context of a school Halloween celebration appear to be nothing more than a mere “shadow”, if that, in the realm of establishment clause jurisprudence.

In searching the cases available through the Caselaw Access Project API, each query can tell a story. Try your own full-text query and share it with us at @caselawaccess.

Legal Tech Student Group Session Brings Quantitative Methods to U.S. Caselaw

This September we hosted a Legal Tech Gumbo session dedicated to using quantitative methods to find new threads in U.S. caselaw. The Legal Tech Gumbo is a collaboration between the Harvard Law & Technology Society and Harvard Library Innovation Lab (LIL).

The session kicked off by introducing data made available as part of the Caselaw Access Project API, a channel to navigate 6.4 million cases dating back 360 years. How can we use that data to advance legal scholarship? In this session Research Associate John Bowers shared how researchers can apply quantitative research methods to qualitative data sources, a theme which has shaped the past decade of research practices in the humanities.

This summer, Bowers shared a blog post outlining some of the themes he found in Caselaw Access Project data, focusing on the influence of judges active in the Illinois court system. Here, we had the chance to learn more about research based on this dataset and its supporting methodology. We applied these same practices to a new segment of data, viewing a century of Arkansas caselaw in ten-year intervals using data analytics and visualization to find themes in U.S. legal history. Want to explore the data we looked at in this session? Take a look at this interactive repository (or, if you prefer, check out this read-only version).

In this session, we learned new ways to find stories in U.S. caselaw. Have you used Caselaw Access Project data in your research? Tell us about it at

The Story of the Domain

Recently we announced the launch of the Caselaw Access Project at But we want to highlight the story of the domain itself.

That domain was generously provided by Carl Jaeckel, its previous owner. Carl agreed to transfer the domain to us in recognition and support of the vital public interest in providing free, open access to caselaw. We’re thrilled to have such a perfect home for the project.

Carl is the managing partner of Jaeckel Law, the Founder of, and the Chief Operating Officer of Dot Law Inc. We can’t wait to see what he and other legal entrepreneurs, researchers, and developers will build based on

Witchcraft in U.S. Caselaw

Witchcraft in Law

Happy Halloween!

This Halloween is a special one at LIL, since we’re celebrating the release of the Caselaw Access Project and 360 years of digitized U.S. caselaw.

For a small project using our full text search functionality, we mapped out the usage of the term “witchcraft” in United States caselaw.

Here is the result:

(For those unfamiliar with the sordid history of witchcraft in the United States, the Wikipedia entry for Salem Witch Trials is a good primer.)

Below are some steps used to get this data.

Since our metadata is available to anyone without limitations or a login, you can see the result of our search here.

As you can see, there are 503 cases in total that include the word “witchcraft”.

In order to get the context of the word (in our visualization, we display small excerpts), we need to create an account.

First, sign up for an API key.

Once you’ve created an account and logged in, you should head over to our API documentation for a primer on authentication.

Now, you can download the cases using one of several means (but be careful! Each time you download cases, whether in the browser or elsewhere, your daily case limit of 500 gets decremented).

You can download cases:

  1. in your browser, by going to this link: read cases in the browser
  2. locally, using the terminal (see an example of a curl request here)
  3. using your favorite programming language, or
  4. by checking out our cap-examples repository and setting up an examples environment. Once it’s set up, you can run the code used to make this visualization.
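The programmatic route (option 3) can be sketched with Python's standard library. The `Token` authorization header follows the CAP API documentation; the endpoint and parameters are as they existed at the time, so treat the details as assumptions:

```python
import json
from urllib.request import Request, urlopen

def case_url(search_term):
    # Full-text search, with the full case text included in the response.
    return f"https://api.case.law/v1/cases/?search={search_term}&full_case=true"

def fetch_cases(search_term, api_key):
    """Fetch one page of full-text search results as parsed JSON.

    Careful: every call with full_case=true counts against your
    daily limit of 500 cases.
    """
    req = Request(case_url(search_term),
                  headers={"Authorization": f"Token {api_key}"})
    with urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a real key; this performs a network request):
# results = fetch_cases("witchcraft", "your-api-key")
# print(results["count"])
```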

Do you have any questions, concerns, ideas? We would love to hear from you! Please write to the Caselaw Access Project team at