Improving pip-compile --generate-hashes

Recently I landed a series of contributions to the Python package pip-tools:

pip-tools is a "set of command line tools to help you keep your pip-based [Python] packages fresh, even when you've pinned them." My changes help the pip-compile --generate-hashes command work for more people.

This isn't a lot of code in the grand scheme of things, but it's the largest set of contributions I've made to a mainstream open source project, so this blog post is a celebration of me! 🎁πŸ’₯πŸŽ‰ yay. But it's also a chance to talk about package manager security and open source contributions and stuff like that.

I'll start high-level with "what are package managers" and work my way into the weeds, so feel free to jump in wherever you want.

What are package managers?

Package managers help us install software libraries and keep them up to date. If I want to load a URL and print the contents, I can add a dependency on a package like requests …

$ echo 'requests' > requirements.txt
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/8f/ea/140f18072bbcd81885a9490abb171792fd2961fd7f366be58396f4c6d634/requests-2.0.1-py2.py3-none-any.whl (439kB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 440kB 4.1MB/s
Installing collected packages: requests
Successfully installed requests-2.0.1

… and let requests do the heavy lifting:

>>> import requests
>>> requests.get('http://example.com').text
'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title> ...'

But there's a problem – if I install exactly the same package later, I might get a different result:

$ echo 'requests' > requirements.txt
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 61kB 3.3MB/s
Collecting certifi>=2017.4.17 (from requests->-r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/60/75/f692a584e85b7eaba0e03827b3d51f45f571c2e793dd731e598828d380aa/certifi-2019.3.9-py2.py3-none-any.whl
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests->-r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/39/ec/d93dfc69617a028915df914339ef66936ea976ef24fa62940fd86ba0326e/urllib3-1.25.2-py2.py3-none-any.whl (150kB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 153kB 10.6MB/s
Collecting idna<2.9,>=2.5 (from requests->-r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->-r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Installing collected packages: certifi, urllib3, idna, chardet, requests
Successfully installed certifi-2019.3.9 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.2
<requirements.txt, pip install -r, import requests>

I got a different version of requests than last time, and I got some bonus dependencies (certifi, urllib3, idna, and chardet). Now my code might not do the same thing even though I did the same thing, which is not how anyone wants computers to work. (I've cheated a little bit here by showing the first example as though pip install had been run back in 2013.)

So the next step is to pin the versions of my dependencies and their dependencies, using a package like pip-tools:

$ echo 'requests' > requirements.in
$ pip-compile
$ cat requirements.txt
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile
#
certifi==2019.3.9         # via requests
chardet==3.0.4            # via requests
idna==2.8                 # via requests
requests==2.22.0
urllib3==1.25.2           # via requests

(There are other options I could use instead, like pipenv or poetry. For now I still prefer pip-tools, for roughly the reasons laid out by Hynek Schlawack.)

Now when I run pip install -r requirements.txt I will always get the same version of requests, and the same versions of its dependencies, and my program will always do the same thing.

… just kidding.

The problem with pinning Python packages

Unfortunately pip-compile doesn't quite lock down our dependencies the way we would hope! In Python land you don't necessarily get the same version of a package by asking for the same version number. That's because of binary wheels.

Up until 2015, it was possible to change a package's contents on PyPI without changing the version number, simply by deleting the package and reuploading it. That no longer works, but there is still a loophole: you can delete and reupload binary wheels.

Wheels are a new-ish binary format for distributing Python packages, including any precompiled programs written in C (or other languages) used by the package. They speed up installs and avoid the need for users to have the right compiler environment set up for each package. C-based packages typically offer a bunch of wheel files for different target environments – here's bcrypt's wheel files for example.

So what happens if a package was originally released as source, and then the maintainer wants to add binary wheels for the same release years later? PyPI will allow it, and pip will happily install the new binary files. This is a deliberate design decision: PyPI has "made the deliberate choice to allow wheel files to be added to old releases, though, and advise folks to use –no-binary and build their own wheel files from source if that is a concern."

That creates room for weird situations, like this case where wheel files were uploaded for the hiredis 0.2.0 package on August 16, 2018, three years after the source release on April 3, 2015. The package had been handed over without announcement from Jan-Erik Rediger to a new volunteer maintainer, ifduyue, who uploaded the binary wheels. ifduyue's personal information on Github consists of: a new moon emoji; an upside down face emoji; the location "China"; and an image of Lanny from the show Lizzie McGuire with spirals for eyes. In a bug thread opened after ifduyue uploaded the new version of hiredis 0.2.0, Jan-Erik commented that users should "please double-check that the content is valid and matches the repository."

ifduyue's user account on github.com

The problem is that I can't do that, and most programmers can't do that. We can't just rebuild the wheel ourselves and expect it to match, because builds are not reproducible unless one goes to great lengths like Debian does. So verifying the integrity of an unknown binary wheel requires rebuilding the wheel, comparing a diff, and checking that all discrepancies are benign – a time-consuming and error-prone process even for those with the skills to do it.

So the story of hiredis looks a lot like a new open source developer volunteering to help out on a project and picking off some low-hanging fruit in the bug tracker, but it also looks a lot like an attacker using the perfect technique to distribute malware widely in the Python ecosystem without detection. I don't know which one it is! As a situation it's bad for us as users, and it's not fair to ifduyue if in fact they're a friendly newbie contributing to a project.

(Is the hacking paranoia warranted? I think so! As Dominic Tarr wrote after inadvertently handing over control of an npm package to a bitcoin-stealing operation, "I've shared publish rights with other people before. … open source is driven by sharing! It's great! it worked really well before bitcoin got popular.")

This is a big problem with a lot of dimensions. It would be great if PyPI packages were all fully reproducible and checked to verify correctness. It would be great if PyPI didn't let you change package contents after the fact. It would be great if everyone ran their own private package index and only added packages to it that they had personally built from source that they personally checked, the way big companies do it. But in the meantime, we can bite off a little piece of the problem by adding hashes to our requirements file. Let's see how that works.

Adding hashes to our requirements file

Instead of just pinning packages like we did before, let's try adding hashes to them:

$ echo 'requests==2.0.1' > requirements.in
$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
requests==2.0.1 \
    --hash=sha256:8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c698b \
    --hash=sha256:f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e

Now when pip-compile pins our package versions, it also fetches the currently-known hashes for each requirement and adds them to requirements.txt (an example of the crypto technique of "TOFU" or "Trust On First Use"). If someone later comes along and adds new packages, or if the https connection to PyPI is later insecure for whatever reason, pip will refuse to install and will warn us about the problem:

$ pip install -r requirements.txt
...
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    requests==2.0.1 from https://files.pythonhosted.org/packages/8f/ea/140f18072bbcd81885a9490abb171792fd2961fd7f366be58396f4c6d634/requests-2.0.1-py2.py3-none-any.whl#sha256=f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e (from -r requirements.txt (line 7)):
        Expected sha256 8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c6981
        Expected     or f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f61
             Got        f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e

But there are problems lurking here! If we have packages that are installed from Github, then pip-compile can't hash them and pip won't install them:

$ echo '-e git+https://github.com/requests/requests@master#egg=requests' > requirements.in
$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
-e git+https://github.com/requests/requests@master#egg=requests
certifi==2019.3.9 \
    --hash=sha256:59b7658e26ca9c7339e00f8f4636cdfe59d34fa37b9b04f6f9e9926b3cece1a5 \
    --hash=sha256:b26104d6835d1f5e49452a26eb2ff87fe7090b89dfcaee5ea2212697e1e1d7ae
chardet==3.0.4 \
    --hash=sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae \
    --hash=sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691
idna==2.8 \
    --hash=sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407 \
    --hash=sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c
urllib3==1.25.2 \
    --hash=sha256:a53063d8b9210a7bdec15e7b272776b9d42b2fd6816401a0d43006ad2f9902db \
    --hash=sha256:d363e3607d8de0c220d31950a8f38b18d5ba7c0830facd71a1c6b1036b7ce06c
$ pip install -r requirements.txt
Obtaining requests from git+https://github.com/requests/requests@master#egg=requests (from -r requirements.txt (line 7))
ERROR: The editable requirement requests from git+https://github.com/requests/requests@master#egg=requests (from -r requirements.txt (line 7)) cannot be installed when requiring hashes, because there is no single file to hash.

That's a serious limitation, because -e requirements are the only way pip-tools knows to specify installations from version control, which are useful while you wait for new fixes in dependencies to be released. (We mostly use them at LIL for dependencies that we've patched ourselves, after we send fixes upstream but before they are released.)

And if we have packages that rely on dependencies pip-tools considers unsafe to pin, like setuptools, pip will refuse to install those too:

$ echo 'Markdown' > requirements.in
$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
markdown==3.1 \
    --hash=sha256:fc4a6f69a656b8d858d7503bda633f4dd63c2d70cf80abdc6eafa64c4ae8c250 \
    --hash=sha256:fe463ff51e679377e3624984c829022e2cfb3be5518726b06f608a07a3aad680
$ pip install -r requirements.txt
Collecting markdown==3.1 (from -r requirements.txt (line 7))
  Using cached https://files.pythonhosted.org/packages/f5/e4/d8c18f2555add57ff21bf25af36d827145896a07607486cc79a2aea641af/Markdown-3.1-py2.py3-none-any.whl
Collecting setuptools>=36 (from markdown==3.1->-r requirements.txt (line 7))
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    setuptools>=36 from https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl#sha256=c7769ce668c7a333d84e17fe8b524b1c45e7ee9f7908ad0a73e1eda7e6a5aebf (from markdown==3.1->-r requirements.txt (line 7))

This can be worked around by adding --allow-unsafe, but (a) that sounds unsafe (though it isn't), and (b) it won't pop up until you try to set up a new environment with a low version of setuptools, potentially days later on someone else's machine.

Fixing pip-tools

Those two problems meant that, when I set out to convert our Caselaw Access Project code to use --generate-hashes, I did it wrong a few times in a row, leading to multiple hours spent debugging problems I created for me and other team members (sorry, Anastasia!). I ended up needing a fancy wrapper script around pip-compile to rewrite our requirements in a form it could understand. I wanted it to be a smoother experience for the next people who try to secure their Python projects.

So I filed a series of pull requests:

Support URLs as packages

Support URLs as packages #807 and Fix –generate-hashes with bare VCS URLs #812 laid the groundwork for fixing --generate-hashes, by teaching pip-tools to do something that had been requested for years: installing packages from archive URLs. Where before, pip-compile could only handle Github requirements like this:

-e git+https://github.com/requests/requests@master#egg=requests

It can now handle requirements like this:

https://github.com/requests/requests/archive/master.zip

And zipped requirements can be hashed, so the resulting requirements.txt comes out looking like this, and is accepted by pip install:

https://github.com/requests/requests/archive/master.zip \
   	     --hash=sha256:3c3d84d35630808bf7750b0368b2c7988f89d9f5c2f2633c47f075b3d5015638

This was a long process, and began with resurrecting a pull request from 2017 that had first been worked on by nim65s. I started by just rebasing the existing work, fixing some tests, and submitting it in the hopes the problem had already been solved. Thanks to great feedback from auvipy, atugushev, and blueyed, I ended up making 14 more commits (and eventually a follow-up pull request) to clean up edge cases and get everything working.

Landing this resulted in closing two other pip-tools pull requests from 2016 and 2017, and feature requests from 2014 and 2018.

Warn when --generate-hashes output is uninstallable

The next step was Fix pip-compile output for unsafe requirements #813 and Warn when –generate-hashes output is uninstallable #814. These two PRs allowed pip-compile --generate-hashes to detect and warn when a file would be uninstallable for hashing reasons. Fortunately pip-compile has all of the information it needs at compile time to know that the file will be uninstallable and to make useful recommendations for what to do about it:

$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
# WARNING: pip install will require the following package to be hashed.
# Consider using a hashable URL like https://github.com/jazzband/pip-tools/archive/SOMECOMMIT.zip
-e git+https://github.com/jazzband/pip-tools@7d86c8d3ecd1faa6be11c7ddc6b29a30ffd1dae3#egg=pip-tools
click==7.0 \
    --hash=sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13 \
    --hash=sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7
first==2.0.2 \
    --hash=sha256:8d8e46e115ea8ac652c76123c0865e3ff18372aef6f03c22809ceefcea9dec86 \
    --hash=sha256:ff285b08c55f8c97ce4ea7012743af2495c9f1291785f163722bd36f6af6d3bf
markdown==3.1 \
    --hash=sha256:fc4a6f69a656b8d858d7503bda633f4dd63c2d70cf80abdc6eafa64c4ae8c250 \
    --hash=sha256:fe463ff51e679377e3624984c829022e2cfb3be5518726b06f608a07a3aad680
six==1.12.0 \
    --hash=sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c \
    --hash=sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73

# WARNING: The following packages were not pinned, but pip requires them to be
# pinned when the requirements file includes hashes. Consider using the --allow-unsafe flag.
# setuptools==41.0.1        # via markdown

Hopefully, between these two efforts, the next project to try using –generate-hashes will find it a shorter and more straightforward process than I did!

Things left undone

Along the way I discovered a few issues that could be fixed in various projects to help the situation. Here are some pointers:

First, the warning to use --allow-unsafe seems unnecessary – I believe that --allow-unsafe should be the default behavior for pip-compile. I spent some time digging into the reasons that pip-tools considers some packages "unsafe," and as best I can tell it is because it was thought that pinning those packages could potentially break pip itself, and thus break the user's ability to recover from a mistake. This seems to no longer be true, if it ever was. Instead, failing to use –allow-unsafe is unsafe, as it means different environments will end up with different versions of key packages despite installing from identical requirements.txt files. I started some discussion about that on the pip-tools repo and the pip repo.

Second, the warning not to use version control links with --generate-hashes is necessary only because of pip's decision to refuse to install those links alongside hashed requirements. That seems like a bad security tradeoff for several reasons. I filed a bug with pip to open up discussion on the topic.

Third, PyPI and binary wheels. I'm not sure if there's been further discussion on the decision to allow retrospective binary uploads since 2017, but the example of hiredis makes it seem like that has some major downsides and might be worth reconsidering. I haven't yet filed anything for this.

Personal reflections (and, thanks Jazzband!)

I didn't write a ton of code for this in the end, but it was a big step for me personally in working with a mainstream open source project, and I had a lot of fun – learning tools like black and git multi-author commits that we don't use on our own projects at LIL, collaborating with highly responsive and helpful reviewers (thanks, all!), learning the internals of pip-tools, and hopefully putting something out there that will make people more secure.

pip-tools is part of the Jazzband project, which is an interesting attempt to make the Python package ecosystem a little more sustainable by lowering the bar to maintaining popular packages. I had a great experience with the maintainers working on pip-tools in particular, and I'm grateful for the work that's gone into making Jazzband happen in general.

The CAP Roadshow

In 2019 we embarked on the CAP Roadshow. This year, we shared the Caselaw Access Project at conferences and workshops with new friends and colleagues.

Between February and May 2019, we made the following stops at conferences and workshops:

Next stop on the road will be UNT Open Access Symposium from May 17 - 18 at University of North Texas College of Law. See you there!

On the road we were able to connect the Caselaw Access Project with new people. We were able to share where data comes from, what kinds of questions we can ask when we have the machine readable data to do it, and all the new ways that you’re all building and learning with Caselaw Access Project data to see the landscape of U.S. legal history in new ways.

The CAP Roadshow doesn’t stop here! Share Caselaw Access Project data with a colleague to keep the party going.

CAP Roadshow

Colors in Caselaw

The prospect of having the Caselaw Access Project dataset become public for the first time brings with it the obvious (and wholly necessary) ideas for data parsing: our dataset is vast and the metadata structured (read about the process to get to this), but the work of parsing the dataset is far from over. For instance, there's a lot of work to be done in parsing individual parties in CAP (like names of judges), we don't yet have a citator, and we still don't know who wins a case and who loses. And for that matter, we don't really know what "winning" and "losing" even means (if you are interested in working on any of these problems and more, start here: https://case.law/tools/).

At LIL we've also undertaken lighter explorations that highlight opportunities made possible by the data and help teach ways to get started parsing caselaw. To that end, we've written caselaw poetry with a limerick generator, discovered the most popular words in California caselaw with wordclouds, and found all instances of the word "witchcraft" for Halloween. We have created an examples repository, for anyone just starting out, too.

This particular project began as a quick look at a very silly question:

What, exactly, is the color of the law?

It turned, surprisingly, into a somewhat deep dive of an introduction into NLP. In this blog post, I'm putting down some thoughts about my decisions, process, and things I learned along the way. Hopefully it will inspire someone looking into the CAP data to ask their own silly (or very serious) questions. This example might also be useful as a small tutorial for getting started on neural-based NLP projects.

Here is the resulting website, with pretty caselaw colors: https://colors.lil.tools/

A note on the dataset

For the purposes of sanity and brevity, we will only be looking at the Illinois dataset in this blog post. It is also the dataset that was used for this project.

If you want to download your own, here are some links: Download cases here, or Extract cases using python

How does one go about deciding on the color of the law?

One way to do it is to find all the mentions of colors in each case.

Since there is a finite number of labelled colors, we could look at each color and simply run a search though the dataset on each word. So let's say we start by looking at the color "green". But wait! We've immediately run into trouble. It turns out that "Green" is quite a popular last name. Excluding anywhere the "G" is capitalized, we might miss important data, like sentences that start with the color green. Adding to the trouble, the lower cased "orange" is both a color and a fruit. Maybe we could start by looking at the instances of the color words as adjectives?

Enter Natural Language Processing

Natural Language Processing (NLP) is a field of computer science aimed at the understanding and parsing of texts.

While I'll be introducing NLP concepts here, if you want a more in-depth write-up on NLP as a field, I would recommend Adam Geitgey's series, Natural Language Processing is Fun!

A brief overview of some NLP concepts used

Tokenization: Tokenizing is the process of divvying up a wall of text into smaller components β€” typically, those are words (sometimes they are characters). Having word chunks allows us to do all kinds of parsing. This can be as simple as "break on space" but usually also treats punctuation as a token.

Parts of speech tagging: tagging words with their respective parts of speech (noun, adjective, verb, etc). This is usually a built-in method in a lot of NLP tools (like nltk and spacy). The tools use a pretrained model, often one built on top of large datasets that had been tediously, and manually tagged (thanks to all ye hard workers of yesteryear that have made our glossing over this difficult work possible).

Root parsing: grouping of syntactically cogent terms. The token chosen (in this case, we're only looking at adjectives), and the "parent" of this token (read this documentation to learn more).

Now what?

Unfortunately, we don't have magical reference to every use of a color in the law, so we'll need to come up with some heuristics which will get us most of the way there. There are a couple ways we could go about finding the colors:

The easiest route we can take is to just match an adjective in the colors list that we have when we come across it and call it a day. The other, more interesting to me way, is to get the context pertinent to the color, using root parsing, to make sure that we get the right shade. "Baby pink" is very different from "hot pink", after all.

To get here, we can use the NLP library spacy. The result is a giant list of of word pairings like "red pepper" and "blue leather". This may read as a food and a type of cloth and not a color. As far as this project is concerned, however, we're treating these word pairings as specific shades. "Blue sky" might be a different shade than "blue leather". "Red pepper" might be a different shade than "red rose".

But exactly what shade is "red pepper" and how would a machine interpret it?

To find out the answer, we turn to recent advances in NLP techniques using Neural Networks.

Recurrent Neural Networks, a too-brief overview

Neural Networks (NNs) are functions that are able to "learn" (more on that in a bit) from a large trove of data. NNs are used for lots of things: from simple classifiers (is it a picture of a dog? Or a cat?) to language translation, and so forth. Recurrent Neural Networks (RNNs) are a specific kind of NN: they are able to learn from past iterations by passing the results of a preceding output down the chain, meaning that running them multiple times should produce increasingly more accurate results (with a caveat β€” if we run too many epochs, or full training cycles β€” each epoch being a forward and backward pass through all of the data), there's a danger of "overfitting", having the RNN essentially memorize the correct answers!.

A contrived example of running an already fully-trained RNN over 2-length sequences of words might look something like this: Input: "box of rocks", Output: prediction of word "rocks" Step1: RNN("", "box") -> 0% "rocks" Step2: RNN("box", "of") -> 0% "rocks" Step3: RNN("of", "rocks") -> 50% "rocks"

Notice that an RNN works over a fixed sequence length, and would only be able to understand word relationships bounded by this length. An LSTM (Long short term memory) is a special type of RNN that overcomes this by adding a type of "memory" which we won't get into here.

Crucially, the NN has two major components: forward and backward propagation. Forward propagation is responsible for getting the output of the model (as in, stepping forward in your network by running your model). An additional step is model evaluation, finding out how far from our expectations (our labelled "truth" set) our output is β€” in other terms, getting the error/loss. This also plays a role in backward propagation.

Backward propagation is responsible for stepping backward through the network, and computing a derivative between the computer error and the weights of the model. This derivative is used by the gradient descent function, an optimization that adjusts the weights to decrease the error by a small amount for each step. This is the "learning" part of NN β€” by running it over and over, stepping forward, backward, figuring out the derivative, running it through the gradient descent, adjusting the weights to minimize the error, and repeating the cycle, the NN is able to learn from past mistakes and successes, and move towards a more correct output.

For an excellent video series explaining Neural Networks in more depth, check out Season 3 by 3 Blue 1 Brown. Recurrent Neural Networks and LSTM is a nice write-up with more in-depth top-level concepts.

Colorful words

As luck would have it, I happened upon a white paper that solved the exact problem of figuring out the "correct" shade for an entered phrase, and a fantastic implementation of it (albeit one that needed a bit of tuning).

The resulting repo is here: https://github.com/anastasia/namecolor

The basic steps to reproduce are these: We take a large set of color data. https://www.colourlovers.com/api gives us access to about a million labeled, open source, community-submitted colors β€” everything from "dutch teal" (#1693A5) to a very popular color named "certain frogs" (#C3FF68). We create a truth set. This is important because we need to train the model against something that it treats as correct. For our purposes, we do have a sort of "truth" of colors, a largely agreed-upon set in the form of HTML color codes with their corresponding hex values. There are 148 of those that I've found. We convert all hex values to CIE LAB values (these are more conducive to an RNN's gradient learning as they are easily mappable in 3d space). We tokenize each value on character ("blue" becomes "b", "l", "u", "e"). We call in PyTorch to the rescue us from the rest of the hard stuff, like creating character embeddings And we run our BiLSTM model (a bi-directional Long Short Term Memory model, which is a type of RNN that is able to remember inputs from current and previous iterations)

The results

The results live here: https://colors.lil.tools/ (sorted by date) or https://colors.lil.tools/lum (sorted by luminosity). You can also see the RNN in action by going here https://colors.lil.tools/create

Although this was a pretty whimsical look at a very serious dataset, we do see some stories start to emerge. My favorite of these is a look at the different colors of the word "hat" in caselaw.

Here are years 1867 to 1935: Illinois hats from 1867 to 1935

And years 1999 to 2011: Illinois hats from 1999 to 2011

Whereas the colors in the late 1800s are muted, and mostly grays, browns, and tans, the colors in the 21st century are bright blues, reds, oranges, greens. We seem to be getting a small window into U.S.'s industrialization and the fashion of the times ("industrialization" is a latent factor (or a hidden neuron) here :-) Who would have thought we could do that by looking at caselaw?

When I first started working on this project, I had no expectations of what I would find. Looking at the data now, it is clear that some of the most commonly present colors are black, brown, and white, and from what I can tell, the majority of the mentions of those are race related. A deeper dive would require a different person to look at this subject, and there are many other more direct ways of approaching such a serious matter than looking at the colors of caselaw.

If you have any questions, any kooky ideas about caselaw, or any interest in exploring together, please let me know!

Launching CAP Search

Today we're launching CAP search, a new interface to search data made available as part of the Caselaw Access Project API. Since releasing the CAP API in Fall 2018, this is our first try at creating a more human-friendly way to start working with this data.

CAP search supports access to 6.7 million cases from 1658 through June 2018, digitized from the collections at the Harvard Law School Library. Learn more about CAP search and limitations.

We're also excited to share a new way to view cases, formatted in HTML. Here's a sample!

We invite you to experiment by building new interfaces to search CAP data. See our code as an example.

The Caselaw Access Project was created by the Harvard Library Innovation Lab at the Harvard Law School Library in collaboration with project partner Ravel Law.

Some Thoughts on Digital Preservation

One of the things people often ask about Perma.cc is how we ensure the preservation of Perma links. There are some answers in Perma's documentation, for example:

Perma.cc was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials β€” now we can do the same for links.

and:

How long will you keep my Perma.cc Links?

Links will be preserved as a part of the permanent collection of participating libraries. While we can't guarantee that these records will be preserved forever, we are hosted by university libraries that have endured for centuries, and we are planning to be around for the long term. If we ever do need to shut down, we have laid out a detailed contingency plan for preserving existing data.

The contingency plan is worth reading; I won't quote it here. (Here's a Perma link to it, in case we've updated it by the time you read this.) In any case, all three of these statements might be accused of a certain nonspecificity - not as who should say vagueness.

I think what people sometimes want to hear when they ask about preservation of Perma links is a very specific arrangement of technology. A technologically specific answer, however, can only be provisional at best. That said, here's what we do at present: Perma saves captures in the form of WARC files to an S3 bucket and serves them from there; within seconds of each capture, a server in Germany downloads a copy of the WARC; twenty-four hours after each capture, a copy of the WARC is uploaded to the Internet Archive (unless the link has been marked as private); also at the twenty-four hour mark, a copy is distributed to a private LOCKSS network. The database of links, users, registrars, and so on, in AWS, is snapshotted daily, and another snapshot of the database is dumped and saved by the server in Germany.

Here's why that answer can only be provisional: there is no digital storage technology whose lifespan approaches the centuries of acid-free paper or microfilm. Worse, the systems housing the technology will tend to become insecure on a timescale measured in days, weeks, or months, and, unattended, impossible to upgrade in perhaps a few years. Every part of the software stack, from the operating system to the programming language to its packages to your code, is obsolescing, or worse, as soon as it's deployed. The companies that build and host the hardware will decline and fall; the hardware itself will become unperformant, then unusable.

Mitigating these problems is a near-constant process of monitoring, planning, and upgrading, at all levels of the stack. Even if we were never to write another line of Perma code, we'd need to update Django and all the other Python packages it depends on (and a Perma with no new code would become less and less able to capture pages on the modern web); in exactly the same way, the preservation layers of Perma will never be static, and we wouldn't want them to be. In fact, their heterogeneity across time, as well as at a given moment, is a key feature.

The core of digital preservation is institutional commitment, and the means are people. They require dedication, expertise, and flexibility; the institution's commitment and its staff's dedication are constants, but their methods can't be. The resilience of a digital preservation program lies in their careful and constant attention, as in the commonplace, "The best fertilizer is the farmer's footprint."

Although I am not an expert in digital preservation, nor well-read in its literature, I'm a practitioner; I'm a librarian, a software developer, and a DevOps engineer. Whether or not you thought this was fertilizer, I'd love to hear from you. I'm bsteinberg@law.harvard.edu.

Developing the CAP Research Community

Since launching the Caselaw Access Project API and bulk data service in October, we’ve been lucky to see a research community develop around this dataset. Today, we’re excited to share examples of how that community is using CAP data to create new kinds of legal scholarship.

Since October, we’ve seen our research community use data analysis to uncover themes in the Caselaw Access Project dataset, with examples like topic modeling U.S. supreme court cases and a quantitative breakdown of our data. We’ve also seen programmatic access to the law create a space to interface with the law in new ways, from creating data notebooks to text and excerpt generators.

Outside this space, data supported by the Caselaw Access Project has found its way into a course assignment to retrieve cases and citations with Python, a call to expand the growing landscape of Wikidata, and library guides on topics ranging from legal research to data mining.

We want to see what you build with Caselaw Access Project data! Keep us in the loop at info@case.law. Looking for ideas for getting started? Visit our gallery and the CAP examples repository.

The Network Librarian

Last year, Jack Cushman expressed a desire for a personal service similar to the one I perform here at LIL – not exactly the DevOps of my job title, but more generally the provision and maintenance of network and computing infrastructure. Jack's take on this idea is very much a personal one, I think: go to a person known to and trusted by you, not the proverbial faceless corporation, for whom you may be as much product as customer.

(I should say here that what follows is my riff on our discussions; errors in transmission are mine.)

As we began to discuss it, it struck me that the idea sounded a lot like some of the work I used to do as a reference librarian at the Public Library of Brookline. This included some formal training for new computer users, but was more often one-on-one, impromptu assistance with things like signing up for a first email account.

Jack's idea goes beyond tech support as traditionally practiced in libraries, but it shares with it the combination of technical knowledge, professional ethics – especially the librarian's rigorous adherence to patron confidentiality – and the personal relationship between patron and librarian.

At LIL, we like naming things whether or not there's actually a project, or, as in this case, before there's even a definition. In order not to keep talking about this vague "idea," I'll bring out the provisional name we came up with for the role we're beginning to imagine: the network librarian.

The network librarian expands on traditional tech support by consulting on computer and network security issues specifically; by advising on self-defense against surveillance where possible and activism where it isn't; and in some cases going beyond the usual help with finding and accessing resources, to providing resources directly. Finally, the practice should expand what's possible – in developing the kinds of self-reliance a network librarian will have to have, the library itself will become more self-reliant and less dependent on vendors.

One of the specific services a network librarian might provide is a virtual private network, or VPN. This article explains why a VPN is important and why it's difficult or impossible to evaluate the trustworthiness of commercial VPN providers. It goes on to explain how to set up a VPN yourself, but it's not trivial. What the network librarian has to offer here is not only technical expertise, but a headstart on infrastructure, like an account at a cloud hosting provider. As important, if not more so, is that you know and trust your librarian.

I've made a first cut at one end of this particular problem in setting up a WireGuard server with Streisand, a neat tool that automates the creation of a server running any of several VPNs and similar services. Almost all of my home and phone network traffic has gone through the WireGuard VPN since August, and I've distributed VPN credentials to several friends and family. Obviously, that isn't a real test of this idea, nor does it get at the potentially enormous issues of agreement, support, and liability you'd have to engage with, but it's an experiment in setting up a small-scale and fairly robust service for small effort and little money.

Even before providing infrastructure, the network librarian would suggest tools and approaches. I'd do the work I used to do differently now – for example, I'd strongly encourage a scheme of multiple backups. I'd be more explicit about how to mitigate the risks of using public computers and wireless networks. I'd encourage the use of encryption, for example via Signal or keybase.io. I would sound my barbaric yawp for the use of a password manager and multi-factor authentication.

Are you a network librarian? Do you know one? Do you have ideas about scope, or tools? Can you think of a better name, or does one already exist? Let me know – I look forward to hearing from you. I'm bsteinberg@law.harvard.edu.

Data Stories and CAP API Full-Text Search

Data sets have tales to tell. In the Caselaw Access Project API, full-text search is a way to find these stories in 300+ years of U.S caselaw, from individual cases to larger themes.

This August, John Bowers began exploring this idea in the blog post Telling Stories with CAP Data: The Prolific Mr. Cartwright, writing: β€œIn the hands of an interested researcher with questions to ask, a few gigabytes of digitized caselaw can speak volumes to the progress of American legal history and its millions of little stories.". Here, I wanted to use the CAP API full-text search as a path to find some of these stories using one keyword: pumpkins.

The CAP API full-text search option was one way to look at the presence of pumpkins in the history of U.S. caselaw. Viewing the CAP API Case List, I filtered cases using the Full-Text Search field to encompass only items that included the term β€œpumpkins”:

api.case.law/v1/cases/?search=pumpkins

This query returned 640 cases, the oldest decision dating to 1812 and the most recent in 2017. Next, I wanted to take a closer look at these cases in detail. To view the full case text, I logged in, revisited that same query for β€œpumpkins”, filtering the search to display full case text.

By running a full-text search, we can begin to pull out themes in Caselaw Access Project data. Of the 640 cases returned by our search that included the word β€œpumpkins”, the jurisdictions that produced the most published cases including this word were Louisiana (30) followed by Georgia (22) and Illinois (21).

In browsing the full cases returned by our query, some stories stand out. One such case is Guyer v. School Board of Alachua County, decided outside Gainesville, Florida, in 1994. Centered around the question of whether Halloween decorations including "the depiction of witches, cauldrons, and brooms" in public schools were based on secular or religious practice and promotion of the occult, this case concluded with the opinion:

β€œWitches, cauldrons, and brooms in the context of a school Halloween celebration appear to be nothing more than a mere β€œshadow”, if that, in the realm of establishment cause jurisprudence."

In searching the cases available through the Caselaw Access Project API, each query can tell a story. Try your own full-text query and share it with us at @caselawaccess.

Legal Tech Student Group Session Brings Quantitative Methods to U.S. Caselaw

This September we hosted a Legal Tech Gumbo session dedicated to using quantitative methods to find new threads in U.S. caselaw. The Legal Tech Gumbo is a collaboration between the Harvard Law & Technology Society and Harvard Library Innovation Lab (LIL).

The session kicked off by introducing data made available as part of the Caselaw Access Project API, a channel to navigate 6.4 million cases dating back 360 years. How can we use that data to advance legal scholarship? In this session Research Associate John Bowers shared how researchers can apply quantitative research methods to qualitative data sources, a theme which has shaped the past decade of research practices in the humanities.

This summer, Bowers shared a blog post outlining some of the themes he found in Caselaw Access Project data, focusing on the influence of judges active in the Illinois court system. Here, we had the chance to learn more about research based on this dataset and its supporting methodology. We applied these same practices to a new segment of data, viewing a century of Arkansas caselaw in ten-year intervals using data analytics and visualization to find themes in U.S. legal history. Want to explore the data we looked at in this session? Take a look at this interactive repository (or, if you prefer, check out this read-only version).

In this session, we learned new ways to find stories in U.S. caselaw. Have you used Caselaw Access Project data in your research? Tell us about it at info@case.law.

The Story of the case.law Domain

Recently we announced the launch of the Caselaw Access Project at case.law. But we want to highlight the story of the case.law domain itself.

That domain was generously provided by Carl Jaeckel, its previous owner. Carl agreed to transfer the domain to us in recognition and support of the vital public interest in providing free, open access to caselaw. We’re thrilled to have such a perfect home for the project.

Carl is the managing partner of Jaeckel Law, the Founder of ClassAction.org, and the Chief Operating Officer of Dot Law Inc. We can’t wait to see what he and other legal entrepreneurs, researchers, and developers will build based on case.law.