Browse the Bookshelf of U.S. Case Law: Announcing the CAP Case Browser

Today we’re announcing the CAP case browser! Browse published U.S. case law from 1658 to 2018—all 40 million pages of it.

The CAP case browser is one way to browse and cite cases made available via the Caselaw Access Project API. The Caselaw Access Project shares cases digitized from the collections of the Harvard Law School Library.

Let’s take a quick tour. Starting the CAP Case Browser at cite.case.law:

Teaching Data Science for Lawyers with Caselaw Access Project Data

In the Spring of 2019, at the University of Iowa, I taught an experimental course called Introduction to Quantitative & Computational Legal Reasoning. The idea of the class was to introduce "data science" in the legal context. The course is taught in Python and focuses on introductory coding and statistics, with applications in the law (such as statistical evidence of discrimination).

Of course, for students with no prior technical background, it's unrealistic to expect a law school course to produce "data scientists" in the sense used in industry. But my observations of the growth in student skills by the end of the course suggest that it is realistic to produce young lawyers with the skills to solve simple problems with coding, understand data, avoid getting led astray by dubious scientific claims (especially with probability and statistics in litigation), and learn about potential pathways for further learning and career development in legal technology and analytics.

The Library Innovation Lab's Caselaw Access Project (CAP) is particularly well-suited for assignments and projects in such a course. I believe that much of the low-hanging fruit in legal technology is in wrangling the vast amounts of unstructured text that lawyers and courts produce—as is evidenced by the numerous commercial efforts focused on document production in discovery, contract assembly and interpretation, and similar textual problems faced by millions of lawyers daily. CAP offers a sizable trove of legal text accessible through a relatively simple and well-documented API (unlike other legal data APIs currently available). Moreover, the texts available through CAP are obviously familiar to every law student after their first semester, and their comfort with the format and style of such texts enables students to handle assignments that require them to combine their understanding of how law works with their developing technology skills.

To leverage these advantages, I included a CAP-based assignment in the first problem set for the course, due at the end of the programming intensive that occupies the initial few weeks of the semester. The problem, which is reproduced at the end of this post along with a simple example of code to successfully complete it, requires students to write a function that can call into the CAP API, retrieve an Illinois Supreme Court case (selected due to the lack of access restrictions) by citation, and return a sorted list of the unique U.S. Supreme Court citations appearing in the case they have retrieved.

While the task is superficially simple, students found it fairly complex, for it requires the use of a number of programming concepts, such as functions and control flow, that they had only recently learned. It also exposes students to common beginner's mistakes in Python programming, such as missing the difference between sorting a list in place with list.sort() and returning a new list with sorted(list). In my observation, the results of the problem set accurately distinguished the students who were taking to programming quickly and easily from those who required more focused assistance.
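For instance, the in-place versus copying distinction that tripped up several students comes down to this:

cites = ["419 U.S. 102", "387 U.S. 136"]

# sorted() returns a new, sorted list and leaves the original untouched
new_list = sorted(cites)   # ['387 U.S. 136', '419 U.S. 102']

# list.sort() sorts in place and returns None -- a frequent source of bugs
result = cites.sort()      # result is None; cites itself is now sorted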

In addition to such standard programming skills, this assignment requires students to practice slightly more advanced skills such as:

  • Reading and understanding API documentation;
  • Making network requests;
  • Processing text with regular expressions;
  • Using third-party libraries;
  • Parsing JSON data; and
  • Handling empty responses from external data sources.

With luck, this problem can encourage broader thinking about legal text as something that can be treated as data, and the structure inherent in legal forms. With even more luck, some students may begin to think about more intellectual questions prompted by the exercise, such as: can we learn anything about the different citation practices in majority versus dissent opinions, or across different justices?

I plan to teach the class again in Spring 2020; one recurrent theme in student feedback for the first iteration was the need for more practice in basic programming. As such, I expect that the next version of the course will include more assignments using CAP data. Projects that I'm considering include:

  • Write wrapper functions in Python for the CAP API (which the class as a whole could work on releasing as a library as an advanced project);
  • Come to some conclusions about the workload of courts over time or of judges within a court by applying data analysis skills to metadata produced by the API; or
  • Discover citation networks and identify influential cases and/or judges (a rough sketch of this idea appears below).
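As a hint at that last idea, here is a minimal sketch, assuming the (citing, cited) pairs have already been extracted from CAP opinion text (for instance, with the regex approach in the appendix below); the pairs and the use of networkx here are illustrative, not part of the assignment:

import networkx as nx

# Hypothetical (citing_case, cited_case) pairs pulled from CAP opinion text
citation_pairs = [
    ("231 Ill.2d 474", "387 U.S. 136"),
    ("215 Ill.2d 219", "387 U.S. 136"),
    ("215 Ill.2d 219", "467 U.S. 837"),
]

graph = nx.DiGraph()
graph.add_edges_from(citation_pairs)

# PageRank is one simple measure of "influence" in a citation network
influence = nx.pagerank(graph)
for case, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(case, round(score, 3))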

Appendix: A CAP-Based Law Student Programming Assignment

Write a function, named cite_finder, that takes one parameter, case, a string with a citation to an Illinois Supreme Court case, and returns the following:

A. None, if the citation does not correspond to an actual case.

B. An empty list, if the citation corresponds to an actual case, but the text of that case does not include any citations to the U.S. Supreme Court.

C. A Python list of unique U.S. Supreme Court citations that appear in the text of the case, if the citation corresponds to an actual case and the case contains any U.S. Supreme Court citation.

Rules and definitions for this problem:

  • "Unique" means a citation to a specific case from a specific reporter.

  • "Citation to an Illinois Supreme Court case" means a string reflecting a citation to the official reporter of the Illinois Supreme Court, in the form 12 Ill. 345 or 12 Ill.2d 345.

  • "U.S. Supreme Court citation" means any full citation (not supra, id, etc.) from the official U.S. Supreme Court reporter as abbreviated U.S.. Party names, years, and page numbers need not be included. Archaic citations (like to Cranch), S.Ct., and L.Ed. Citations should not be included. Subsequent cites/pin cites to a case of the form 123 U.S. at 456 should not be included.

  • "Text" of a case includes all opinions (majority, concurrence, dissent, etc.) but does not include syllabus or any other content.

  • Your function must use the Caselaw Access Project (case.law) API.

  • The list must be sorted using Python’s built-in list sorting functionality with default options.

  • Each citation must appear only once.

Example correct input and output:

  • cite_finder("231 Ill.2d 474") should return ['387 U.S. 136', '419 U.S. 102', '424 U.S. 1', '429 U.S. 252', '508 U.S. 520', '509 U.S. 43']

  • cite_finder("231 Ill.2d 475") should return None

  • cite_finder("215 Ill.2d 219") should return ['339 U.S. 594', '387 U.S. 136', '467 U.S. 837', '538 U.S. 803']

Sample Code to Complete Assignment

import requests, re
endpoint = "https://api.case.law/v1/cases/"
pattern = r"\d+ U\.S\. \d+"
# no warranties are made as to the correctness of this somewhat lazy regex

def get_opinion_texts(api_response):
    # Pull the list of opinions out of the API response, returning None if the
    # citation didn't match a case or the casebody isn't present
    try:
        ops = api_response["results"][0]["casebody"]["data"]["opinions"]
    except (KeyError, IndexError, TypeError):
        return None
    return [x["text"] for x in ops]

def cite_finder(cite):
    # Look the citation up via the CAP API, requesting the full case text
    resp = requests.get(endpoint, params={"cite": cite, "full_case": "true"}).json()
    opinions = get_opinion_texts(resp)
    if opinions:
        allcites = []
        for opinion in opinions:
            opcites = re.findall(pattern, opinion)
            allcites.extend(opcites)
        # Deduplicate, then sort with Python's default list sorting
        filtered = list(set(allcites))
        filtered.sort()
        return filtered
    return None
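Running it against the first example citation above should, assuming the CAP API is reachable, produce the expected output:

print(cite_finder("231 Ill.2d 474"))
# ['387 U.S. 136', '419 U.S. 102', '424 U.S. 1', '429 U.S. 252', '508 U.S. 520', '509 U.S. 43']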

The Caselaw Access Project Research Summit

Since launching the CAP API and Bulk Data Service in Fall 2018, we’ve been developing a research community around the Caselaw Access Project dataset.

Last week we hosted the first Caselaw Access Project Research Summit to bring together researchers from this community who had already made progress exploring data made available by the Caselaw Access Project.

Presenters shared research spanning a broad range of disciplines and perspectives. They explored the contents of court opinions and the evolution of language over time, examined text comprehension and language patterns, and took up themes like link rot and connecting legal data with other digital collections. They asked what words appear in this text corpus, how we can identify changes in the meaning of those words, and how changes in this legal corpus connect to the larger landscape. All of their work was interesting and important, and we’re excited to see what insights they continue to develop.

The Caselaw Access Project Research Summit was our first attempt to bring researchers together in person to meet, share and learn, and to help us better understand how we can support their work. We’re immensely grateful for their participation in the event, and we look forward to doing it again.

Are you using Caselaw Access Project data in your work? Share it with us at info@case.law.

Historical Trends at the Caselaw Access Project

Today we’re excited to share Historical Trends, a new way to explore U.S. case law made available by the Caselaw Access Project at Harvard Law School.

Historical Trends, Caselaw Access Project.

Historical Trends is a way to visualize word usage in court opinions over time. We want Historical Trends to help you ask new questions and understand the law in new ways. Let’s see how this works with some examples:

Want to build your own visualization? Here’s how to get started:

  • Let’s say you have a question about produce. You want to know which of apples, bananas, or oranges appears most often in the legal record.
  • Try it: go to https://case.law/trends/ and enter one or more keywords separated by commas. For now, let’s try “apple”, “banana”, and “orange”.
  • Refine your query or learn more by selecting “Advanced” or the gear icon shown above the visualization.
  • Select “Graph”.

The data underlying Historical Trends is drawn from the Harvard Law Library’s collection of roughly 6.7 million official, published opinions issued by state and federal courts throughout U.S. history and made available as part of the Caselaw Access Project.

Get started at https://case.law/trends/.

Improving pip-compile --generate-hashes

Recently I landed a series of contributions to the Python package pip-tools:

pip-tools is a "set of command line tools to help you keep your pip-based [Python] packages fresh, even when you've pinned them." My changes help the pip-compile --generate-hashes command work for more people.

This isn't a lot of code in the grand scheme of things, but it's the largest set of contributions I've made to a mainstream open source project, so this blog post is a celebration of me! 🎁💥🎉 yay. But it's also a chance to talk about package manager security and open source contributions and stuff like that.

I'll start high-level with "what are package managers" and work my way into the weeds, so feel free to jump in wherever you want.

What are package managers?

Package managers help us install software libraries and keep them up to date. If I want to load a URL and print the contents, I can add a dependency on a package like requests:

$ echo 'requests' > requirements.txt
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/8f/ea/140f18072bbcd81885a9490abb171792fd2961fd7f366be58396f4c6d634/requests-2.0.1-py2.py3-none-any.whl (439kB)
     |████████████████████████████████| 440kB 4.1MB/s
Installing collected packages: requests
Successfully installed requests-2.0.1

… and let requests do the heavy lifting:

>>> import requests
>>> requests.get('http://example.com').text
'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title> ...'

But there's a problem – if I install exactly the same package later, I might get a different result:

$ echo 'requests' > requirements.txt
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
     |████████████████████████████████| 61kB 3.3MB/s
Collecting certifi>=2017.4.17 (from requests->-r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/60/75/f692a584e85b7eaba0e03827b3d51f45f571c2e793dd731e598828d380aa/certifi-2019.3.9-py2.py3-none-any.whl
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests->-r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/39/ec/d93dfc69617a028915df914339ef66936ea976ef24fa62940fd86ba0326e/urllib3-1.25.2-py2.py3-none-any.whl (150kB)
     |████████████████████████████████| 153kB 10.6MB/s
Collecting idna<2.9,>=2.5 (from requests->-r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->-r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Installing collected packages: certifi, urllib3, idna, chardet, requests
Successfully installed certifi-2019.3.9 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.2

I got a different version of requests than last time, and I got some bonus dependencies (certifi, urllib3, idna, and chardet). Now my code might not do the same thing even though I did the same thing, which is not how anyone wants computers to work. (I've cheated a little bit here by showing the first example as though pip install had been run back in 2013.)

So the next step is to pin the versions of my dependencies and their dependencies, using a package like pip-tools:

$ echo 'requests' > requirements.in
$ pip-compile
$ cat requirements.txt
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile
#
certifi==2019.3.9         # via requests
chardet==3.0.4            # via requests
idna==2.8                 # via requests
requests==2.22.0
urllib3==1.25.2           # via requests

(There are other options I could use instead, like pipenv or poetry. For now I still prefer pip-tools, for roughly the reasons laid out by Hynek Schlawack.)

Now when I run pip install -r requirements.txt I will always get the same version of requests, and the same versions of its dependencies, and my program will always do the same thing.

… just kidding.

The problem with pinning Python packages

Unfortunately pip-compile doesn't quite lock down our dependencies the way we would hope! In Python land you don't necessarily get the same version of a package by asking for the same version number. That's because of binary wheels.

Up until 2015, it was possible to change a package's contents on PyPI without changing the version number, simply by deleting the package and reuploading it. That no longer works, but there is still a loophole: you can delete and reupload binary wheels.

Wheels are a new-ish binary format for distributing Python packages, including any precompiled programs written in C (or other languages) used by the package. They speed up installs and avoid the need for users to have the right compiler environment set up for each package. C-based packages typically offer a bunch of wheel files for different target environments – here's bcrypt's wheel files for example.

So what happens if a package was originally released as source, and then the maintainer wants to add binary wheels for the same release years later? PyPI will allow it, and pip will happily install the new binary files. This is a deliberate design decision: PyPI has "made the deliberate choice to allow wheel files to be added to old releases, though, and advise folks to use --no-binary and build their own wheel files from source if that is a concern."

That creates room for weird situations, like this case where wheel files were uploaded for the hiredis 0.2.0 package on August 16, 2018, three years after the source release on April 3, 2015. The package had been handed over without announcement from Jan-Erik Rediger to a new volunteer maintainer, ifduyue, who uploaded the binary wheels. ifduyue's personal information on Github consists of: a new moon emoji; an upside down face emoji; the location "China"; and an image of Lanny from the show Lizzie McGuire with spirals for eyes. In a bug thread opened after ifduyue uploaded the new version of hiredis 0.2.0, Jan-Erik commented that users should "please double-check that the content is valid and matches the repository."

ifduyue's user account on github.com

The problem is that I can't do that, and most programmers can't do that. We can't just rebuild the wheel ourselves and expect it to match, because builds are not reproducible unless one goes to great lengths like Debian does. So verifying the integrity of an unknown binary wheel requires rebuilding the wheel, comparing a diff, and checking that all discrepancies are benign – a time-consuming and error-prone process even for those with the skills to do it.

So the story of hiredis looks a lot like a new open source developer volunteering to help out on a project and picking off some low-hanging fruit in the bug tracker, but it also looks a lot like an attacker using the perfect technique to distribute malware widely in the Python ecosystem without detection. I don't know which one it is! As a situation it's bad for us as users, and it's not fair to ifduyue if in fact they're a friendly newbie contributing to a project.

(Is the hacking paranoia warranted? I think so! As Dominic Tarr wrote after inadvertently handing over control of an npm package to a bitcoin-stealing operation, "I've shared publish rights with other people before. … open source is driven by sharing! It's great! it worked really well before bitcoin got popular.")

This is a big problem with a lot of dimensions. It would be great if PyPI packages were all fully reproducible and checked to verify correctness. It would be great if PyPI didn't let you change package contents after the fact. It would be great if everyone ran their own private package index and only added packages to it that they had personally built from source that they personally checked, the way big companies do it. But in the meantime, we can bite off a little piece of the problem by adding hashes to our requirements file. Let's see how that works.

Adding hashes to our requirements file

Instead of just pinning packages like we did before, let's try adding hashes to them:

$ echo 'requests==2.0.1' > requirements.in
$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
requests==2.0.1 \
    --hash=sha256:8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c698b \
    --hash=sha256:f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e

Now when pip-compile pins our package versions, it also fetches the currently-known hashes for each requirement and adds them to requirements.txt (an example of the crypto technique of "TOFU" or "Trust On First Use"). If someone later comes along and adds new packages, or if the https connection to PyPI is later insecure for whatever reason, pip will refuse to install and will warn us about the problem:

$ pip install -r requirements.txt
...
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    requests==2.0.1 from https://files.pythonhosted.org/packages/8f/ea/140f18072bbcd81885a9490abb171792fd2961fd7f366be58396f4c6d634/requests-2.0.1-py2.py3-none-any.whl#sha256=f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e (from -r requirements.txt (line 7)):
        Expected sha256 8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c6981
        Expected     or f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f61
             Got        f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e
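For intuition, each of those hashes is simply the SHA-256 digest of one exact file on PyPI, so the check pip performs amounts to something like this sketch (simplified, not pip's actual code; the wheel filename is taken from the example above):

import hashlib

def file_sha256(path):
    # Hash the downloaded artifact exactly as it sits on disk
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The hashes pinned in requirements.txt for requests==2.0.1
allowed = {
    "8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c698b",
    "f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e",
}

if file_sha256("requests-2.0.1-py2.py3-none-any.whl") not in allowed:
    raise SystemExit("hash mismatch: refusing to install")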

But there are problems lurking here! If we have packages that are installed from Github, then pip-compile can't hash them and pip won't install them:

$ echo '-e git+https://github.com/requests/requests@master#egg=requests' > requirements.in
$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
-e git+https://github.com/requests/requests@master#egg=requests
certifi==2019.3.9 \
    --hash=sha256:59b7658e26ca9c7339e00f8f4636cdfe59d34fa37b9b04f6f9e9926b3cece1a5 \
    --hash=sha256:b26104d6835d1f5e49452a26eb2ff87fe7090b89dfcaee5ea2212697e1e1d7ae
chardet==3.0.4 \
    --hash=sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae \
    --hash=sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691
idna==2.8 \
    --hash=sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407 \
    --hash=sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c
urllib3==1.25.2 \
    --hash=sha256:a53063d8b9210a7bdec15e7b272776b9d42b2fd6816401a0d43006ad2f9902db \
    --hash=sha256:d363e3607d8de0c220d31950a8f38b18d5ba7c0830facd71a1c6b1036b7ce06c
$ pip install -r requirements.txt
Obtaining requests from git+https://github.com/requests/requests@master#egg=requests (from -r requirements.txt (line 7))
ERROR: The editable requirement requests from git+https://github.com/requests/requests@master#egg=requests (from -r requirements.txt (line 7)) cannot be installed when requiring hashes, because there is no single file to hash.

That's a serious limitation, because -e requirements are the only way pip-tools knows to specify installations from version control, which are useful while you wait for new fixes in dependencies to be released. (We mostly use them at LIL for dependencies that we've patched ourselves, after we send fixes upstream but before they are released.)

And if we have packages that rely on dependencies pip-tools considers unsafe to pin, like setuptools, pip will refuse to install those too:

$ echo 'Markdown' > requirements.in
$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
markdown==3.1 \
    --hash=sha256:fc4a6f69a656b8d858d7503bda633f4dd63c2d70cf80abdc6eafa64c4ae8c250 \
    --hash=sha256:fe463ff51e679377e3624984c829022e2cfb3be5518726b06f608a07a3aad680
$ pip install -r requirements.txt
Collecting markdown==3.1 (from -r requirements.txt (line 7))
  Using cached https://files.pythonhosted.org/packages/f5/e4/d8c18f2555add57ff21bf25af36d827145896a07607486cc79a2aea641af/Markdown-3.1-py2.py3-none-any.whl
Collecting setuptools>=36 (from markdown==3.1->-r requirements.txt (line 7))
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    setuptools>=36 from https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl#sha256=c7769ce668c7a333d84e17fe8b524b1c45e7ee9f7908ad0a73e1eda7e6a5aebf (from markdown==3.1->-r requirements.txt (line 7))

This can be worked around by adding --allow-unsafe, but (a) that sounds unsafe (though it isn't), and (b) it won't pop up until you try to set up a new environment with a low version of setuptools, potentially days later on someone else's machine.

Fixing pip-tools

Those two problems meant that, when I set out to convert our Caselaw Access Project code to use --generate-hashes, I did it wrong a few times in a row, leading to multiple hours spent debugging problems I created for myself and other team members (sorry, Anastasia!). I ended up needing a fancy wrapper script around pip-compile to rewrite our requirements in a form it could understand, and I wanted it to be a smoother experience for the next people who try to secure their Python projects.

So I filed a series of pull requests:

Support URLs as packages

Support URLs as packages #807 and Fix --generate-hashes with bare VCS URLs #812 laid the groundwork for fixing --generate-hashes, by teaching pip-tools to do something that had been requested for years: installing packages from archive URLs. Where before, pip-compile could only handle Github requirements like this:

-e git+https://github.com/requests/requests@master#egg=requests

It can now handle requirements like this:

https://github.com/requests/requests/archive/master.zip

And zipped requirements can be hashed, so the resulting requirements.txt comes out looking like this, and is accepted by pip install:

https://github.com/requests/requests/archive/master.zip \
    --hash=sha256:3c3d84d35630808bf7750b0368b2c7988f89d9f5c2f2633c47f075b3d5015638

This was a long process, and began with resurrecting a pull request from 2017 that had first been worked on by nim65s. I started by just rebasing the existing work, fixing some tests, and submitting it in the hopes the problem had already been solved. Thanks to great feedback from auvipy, atugushev, and blueyed, I ended up making 14 more commits (and eventually a follow-up pull request) to clean up edge cases and get everything working.

Landing this resulted in closing two other pip-tools pull requests from 2016 and 2017, and feature requests from 2014 and 2018.

Warn when --generate-hashes output is uninstallable

The next step was Fix pip-compile output for unsafe requirements #813 and Warn when --generate-hashes output is uninstallable #814. These two PRs allowed pip-compile --generate-hashes to detect and warn when a file would be uninstallable for hashing reasons. Fortunately pip-compile has all of the information it needs at compile time to know that the file will be uninstallable and to make useful recommendations for what to do about it:

$ pip-compile --generate-hashes
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --generate-hashes
#
# WARNING: pip install will require the following package to be hashed.
# Consider using a hashable URL like https://github.com/jazzband/pip-tools/archive/SOMECOMMIT.zip
-e git+https://github.com/jazzband/pip-tools@7d86c8d3ecd1faa6be11c7ddc6b29a30ffd1dae3#egg=pip-tools
click==7.0 \
    --hash=sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13 \
    --hash=sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7
first==2.0.2 \
    --hash=sha256:8d8e46e115ea8ac652c76123c0865e3ff18372aef6f03c22809ceefcea9dec86 \
    --hash=sha256:ff285b08c55f8c97ce4ea7012743af2495c9f1291785f163722bd36f6af6d3bf
markdown==3.1 \
    --hash=sha256:fc4a6f69a656b8d858d7503bda633f4dd63c2d70cf80abdc6eafa64c4ae8c250 \
    --hash=sha256:fe463ff51e679377e3624984c829022e2cfb3be5518726b06f608a07a3aad680
six==1.12.0 \
    --hash=sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c \
    --hash=sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73

# WARNING: The following packages were not pinned, but pip requires them to be
# pinned when the requirements file includes hashes. Consider using the --allow-unsafe flag.
# setuptools==41.0.1        # via markdown

Hopefully, between these two efforts, the next project to try using --generate-hashes will find it a shorter and more straightforward process than I did!

Things left undone

Along the way I discovered a few issues that could be fixed in various projects to help the situation. Here are some pointers:

First, the warning to use --allow-unsafe seems unnecessary – I believe that --allow-unsafe should be the default behavior for pip-compile. I spent some time digging into the reasons that pip-tools considers some packages "unsafe," and as best I can tell it is because it was thought that pinning those packages could potentially break pip itself, and thus break the user's ability to recover from a mistake. This seems to no longer be true, if it ever was. Instead, failing to use --allow-unsafe is unsafe, as it means different environments will end up with different versions of key packages despite installing from identical requirements.txt files. I started some discussion about that on the pip-tools repo and the pip repo.

Second, the warning not to use version control links with --generate-hashes is necessary only because of pip's decision to refuse to install those links alongside hashed requirements. That seems like a bad security tradeoff for several reasons. I filed a bug with pip to open up discussion on the topic.

Third, PyPI and binary wheels. I'm not sure if there's been further discussion on the decision to allow retrospective binary uploads since 2017, but the example of hiredis makes it seem like that has some major downsides and might be worth reconsidering. I haven't yet filed anything for this.

Personal reflections (and, thanks Jazzband!)

I didn't write a ton of code for this in the end, but it was a big step for me personally in working with a mainstream open source project, and I had a lot of fun – learning tools like black and git multi-author commits that we don't use on our own projects at LIL, collaborating with highly responsive and helpful reviewers (thanks, all!), learning the internals of pip-tools, and hopefully putting something out there that will make people more secure.

pip-tools is part of the Jazzband project, which is an interesting attempt to make the Python package ecosystem a little more sustainable by lowering the bar to maintaining popular packages. I had a great experience with the maintainers working on pip-tools in particular, and I'm grateful for the work that's gone into making Jazzband happen in general.

The CAP Roadshow

In 2019 we embarked on the CAP Roadshow. This year, we shared the Caselaw Access Project at conferences and workshops with new friends and colleagues.

Between February and May 2019, we made the following stops at conferences and workshops:

The next stop on the road will be the UNT Open Access Symposium, May 17-18, at the University of North Texas College of Law. See you there!

On the road we were able to connect the Caselaw Access Project with new people. We were able to share where the data comes from, what kinds of questions we can ask when we have machine-readable data, and all the ways that you’re building and learning with Caselaw Access Project data to see the landscape of U.S. legal history anew.

The CAP Roadshow doesn’t stop here! Share Caselaw Access Project data with a colleague to keep the party going.


Colors in Caselaw

The prospect of having the Caselaw Access Project dataset become public for the first time brings with it the obvious (and wholly necessary) ideas for data parsing: our dataset is vast and the metadata structured (read about the process to get to this), but the work of parsing the dataset is far from over. For instance, there's a lot of work to be done in parsing individual parties in CAP (like names of judges), we don't yet have a citator, and we still don't know who wins a case and who loses. And for that matter, we don't really know what "winning" and "losing" even means (if you are interested in working on any of these problems and more, start here: https://case.law/tools/).

At LIL we've also undertaken lighter explorations that highlight opportunities made possible by the data and help teach ways to get started parsing caselaw. To that end, we've written caselaw poetry with a limerick generator, discovered the most popular words in California caselaw with wordclouds, and found all instances of the word "witchcraft" for Halloween. We have created an examples repository, for anyone just starting out, too.

This particular project began as a quick look at a very silly question:

What, exactly, is the color of the law?

It turned, surprisingly, into a somewhat deep dive of an introduction into NLP. In this blog post, I'm putting down some thoughts about my decisions, process, and things I learned along the way. Hopefully it will inspire someone looking into the CAP data to ask their own silly (or very serious) questions. This example might also be useful as a small tutorial for getting started on neural-based NLP projects.

Here is the resulting website, with pretty caselaw colors: https://colors.lil.tools/

A note on the dataset

For the purposes of sanity and brevity, we will only be looking at the Illinois dataset in this blog post. It is also the dataset that was used for this project.

If you want to download your own, here are some links: Download cases here, or Extract cases using python

How does one go about deciding on the color of the law?

One way to do it is to find all the mentions of colors in each case.

Since there is a finite number of labelled colors, we could look at each color and simply run a search through the dataset on each word. So let's say we start by looking at the color "green". But wait! We've immediately run into trouble. It turns out that "Green" is quite a popular last name. If we exclude every instance where the "G" is capitalized, we might miss important data, like sentences that start with the color green. Adding to the trouble, the lower-cased "orange" is both a color and a fruit. Maybe we could start by looking at the instances of the color words as adjectives?

Enter Natural Language Processing

Natural Language Processing (NLP) is a field of computer science aimed at the understanding and parsing of texts.

While I'll be introducing NLP concepts here, if you want a more in-depth write-up on NLP as a field, I would recommend Adam Geitgey's series, Natural Language Processing is Fun!

A brief overview of some NLP concepts used

Tokenization: Tokenizing is the process of divvying up a wall of text into smaller components — typically, those are words (sometimes they are characters). Having word chunks allows us to do all kinds of parsing. This can be as simple as "break on space" but usually also treats punctuation as a token.

Parts of speech tagging: tagging words with their respective parts of speech (noun, adjective, verb, etc.). This is usually a built-in method in NLP tools like nltk and spacy. The tools use a pretrained model, often one built on top of large datasets that were tediously and manually tagged (thanks to all ye hard workers of yesteryear who have made our glossing over this difficult work possible).

Root parsing: grouping syntactically related terms: the chosen token (in this case, we're only looking at adjectives) together with the "parent", or head, of that token (read this documentation to learn more).

Now what?

Unfortunately, we don't have a magical reference to every use of a color in the law, so we'll need to come up with some heuristics that will get us most of the way there. There are a couple of ways we could go about finding the colors:

The easiest route is to match any adjective that appears in our list of colors and call it a day. The other way, more interesting to me, is to use root parsing to capture the context around the color, so that we get the right shade. "Baby pink" is very different from "hot pink", after all.

To get there, we can use the NLP library spacy. The result is a giant list of word pairings like "red pepper" and "blue leather". These may read as a food and a type of cloth rather than colors. As far as this project is concerned, however, we're treating these word pairings as specific shades. "Blue sky" might be a different shade than "blue leather". "Red pepper" might be a different shade than "red rose".
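A minimal sketch of that adjective-plus-head extraction with spacy might look like the following (it assumes the small English model, en_core_web_sm, is installed; the color list is truncated for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")
color_words = {"red", "orange", "green", "blue", "pink", "brown", "black", "white"}

def color_shades(text):
    # Pair each color adjective with its syntactic head ("parent"),
    # producing shade-like phrases such as "red pepper"
    doc = nlp(text)
    return [
        f"{token.text.lower()} {token.head.text.lower()}"
        for token in doc
        if token.pos_ == "ADJ" and token.text.lower() in color_words
    ]

print(color_shades("She wore a blue dress and ate a red pepper."))
# likely ['blue dress', 'red pepper']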

But exactly what shade is "red pepper" and how would a machine interpret it?

To find out the answer, we turn to recent advances in NLP techniques using Neural Networks.

Recurrent Neural Networks, a too-brief overview

Neural Networks (NNs) are functions that are able to "learn" (more on that in a bit) from a large trove of data. NNs are used for lots of things: from simple classifiers (is it a picture of a dog? Or a cat?) to language translation, and so forth. Recurrent Neural Networks (RNNs) are a specific kind of NN: they pass the output of each step along to the next, so they are able to learn from what came earlier in a sequence. Training runs the model over the data for multiple epochs (full training cycles, each a forward and backward pass through all of the data), and more epochs should produce increasingly accurate results, with a caveat: run too many and there's a danger of "overfitting", where the RNN essentially memorizes the correct answers!

A contrived example of running an already fully-trained RNN over 2-length sequences of words might look something like this:

  Input: "box of rocks"; output: prediction of the word "rocks"
  Step 1: RNN("", "box") -> 0% "rocks"
  Step 2: RNN("box", "of") -> 0% "rocks"
  Step 3: RNN("of", "rocks") -> 50% "rocks"

Notice that an RNN works over a fixed sequence length, and would only be able to understand word relationships bounded by this length. An LSTM (Long short term memory) is a special type of RNN that overcomes this by adding a type of "memory" which we won't get into here.

Crucially, the NN has two major components: forward and backward propagation. Forward propagation is responsible for getting the output of the model (as in, stepping forward in your network by running your model). An additional step is model evaluation, finding out how far from our expectations (our labelled "truth" set) our output is — in other terms, getting the error/loss. This also plays a role in backward propagation.

Backward propagation is responsible for stepping backward through the network and computing the derivative of the computed error with respect to the weights of the model. This derivative is used by gradient descent, an optimization that adjusts the weights to decrease the error by a small amount at each step. This is the "learning" part of an NN — by running it over and over, stepping forward, stepping backward, figuring out the derivative, running it through gradient descent, adjusting the weights to minimize the error, and repeating the cycle, the NN is able to learn from past mistakes and successes and move towards a more correct output.
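To make that forward/backward/update loop concrete, here is a tiny self-contained sketch of gradient descent using PyTorch's autograd on a made-up one-parameter model (purely illustrative, not the color model described below):

import torch

# A single weight we want to learn, starting from a bad guess
w = torch.tensor(0.0, requires_grad=True)
target = 3.0
learning_rate = 0.1

for step in range(50):
    # Forward pass: compute the output and the loss (squared error)
    loss = (w - target) ** 2
    # Backward pass: compute d(loss)/d(w)
    loss.backward()
    # Gradient descent: nudge the weight against the gradient, then reset it
    with torch.no_grad():
        w -= learning_rate * w.grad
        w.grad.zero_()

print(w.item())  # approaches 3.0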

For an excellent video series explaining Neural Networks in more depth, check out Season 3 by 3 Blue 1 Brown. Recurrent Neural Networks and LSTM is a nice write-up with more in-depth top-level concepts.

Colorful words

As luck would have it, I happened upon a white paper that solved the exact problem of figuring out the "correct" shade for an entered phrase, and a fantastic implementation of it (albeit one that needed a bit of tuning).

The resulting repo is here: https://github.com/anastasia/namecolor

The basic steps to reproduce are these:

  • We take a large set of color data. https://www.colourlovers.com/api gives us access to about a million labeled, open source, community-submitted colors — everything from "dutch teal" (#1693A5) to a very popular color named "certain frogs" (#C3FF68).
  • We create a truth set. This is important because we need to train the model against something that it treats as correct. For our purposes, we do have a sort of "truth" of colors, a largely agreed-upon set in the form of HTML color codes with their corresponding hex values. There are 148 of those that I've found.
  • We convert all hex values to CIE LAB values (these are more conducive to an RNN's gradient learning as they are easily mappable in 3D space).
  • We tokenize each value on character ("blue" becomes "b", "l", "u", "e").
  • We call in PyTorch to rescue us from the rest of the hard stuff, like creating character embeddings.
  • And we run our BiLSTM model (a bi-directional Long Short Term Memory model, which is a type of RNN that is able to remember inputs from current and previous iterations).
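A stripped-down version of that character-level BiLSTM in PyTorch might look like the following (the sizes and the crude character encoding are illustrative, not the repo's actual hyperparameters):

import torch
import torch.nn as nn

class ColorNamer(nn.Module):
    # Map a character sequence (a color name) to three CIE LAB values
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 3)  # L, a, b

    def forward(self, char_ids):
        embedded = self.embed(char_ids)      # (batch, seq, embed_dim)
        outputs, _ = self.lstm(embedded)     # (batch, seq, 2 * hidden_dim)
        return self.out(outputs[:, -1, :])   # predict from the last step

# Toy usage: encode "blue" as character ids (a real run would build a vocabulary)
char_ids = torch.tensor([[ord(c) - ord("a") for c in "blue"]])
model = ColorNamer(vocab_size=26)
print(model(char_ids).shape)  # torch.Size([1, 3])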

The results

The results live here: https://colors.lil.tools/ (sorted by date) or https://colors.lil.tools/lum (sorted by luminosity). You can also see the RNN in action by going here https://colors.lil.tools/create

Although this was a pretty whimsical look at a very serious dataset, we do see some stories start to emerge. My favorite of these is a look at the different colors of the word "hat" in caselaw.

Here are years 1867 to 1935: Illinois hats from 1867 to 1935

And years 1999 to 2011: Illinois hats from 1999 to 2011

Whereas the colors in the late 1800s are muted, mostly grays, browns, and tans, the colors in the 21st century are bright blues, reds, oranges, and greens. We seem to be getting a small window into the U.S.'s industrialization and the fashion of the times ("industrialization" is a latent factor, or a hidden neuron, here :-). Who would have thought we could do that by looking at caselaw?

When I first started working on this project, I had no expectations of what I would find. Looking at the data now, it is clear that some of the most commonly present colors are black, brown, and white, and from what I can tell, the majority of the mentions of those are race related. A deeper dive would require a different person to look at this subject, and there are many other more direct ways of approaching such a serious matter than looking at the colors of caselaw.

If you have any questions, any kooky ideas about caselaw, or any interest in exploring together, please let me know!

Launching CAP Search

Today we're launching CAP search, a new interface to search data made available as part of the Caselaw Access Project API. This is our first try, since releasing the CAP API in Fall 2018, at creating a more human-friendly way to start working with this data.

CAP search supports access to 6.7 million cases from 1658 through June 2018, digitized from the collections at the Harvard Law School Library. Learn more about CAP search and limitations.

We're also excited to share a new way to view cases, formatted in HTML. Here's a sample!

We invite you to experiment by building new interfaces to search CAP data. See our code as an example.

The Caselaw Access Project was created by the Harvard Library Innovation Lab at the Harvard Law School Library in collaboration with project partner Ravel Law.

Some Thoughts on Digital Preservation

One of the things people often ask about Perma.cc is how we ensure the preservation of Perma links. There are some answers in Perma's documentation, for example:

Perma.cc was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials — now we can do the same for links.

and:

How long will you keep my Perma.cc Links?

Links will be preserved as a part of the permanent collection of participating libraries. While we can't guarantee that these records will be preserved forever, we are hosted by university libraries that have endured for centuries, and we are planning to be around for the long term. If we ever do need to shut down, we have laid out a detailed contingency plan for preserving existing data.

The contingency plan is worth reading; I won't quote it here. (Here's a Perma link to it, in case we've updated it by the time you read this.) In any case, all three of these statements might be accused of a certain nonspecificity - not as who should say vagueness.

I think what people sometimes want to hear when they ask about preservation of Perma links is a very specific arrangement of technology. A technologically specific answer, however, can only be provisional at best. That said, here's what we do at present: Perma saves captures in the form of WARC files to an S3 bucket and serves them from there; within seconds of each capture, a server in Germany downloads a copy of the WARC; twenty-four hours after each capture, a copy of the WARC is uploaded to the Internet Archive (unless the link has been marked as private); also at the twenty-four hour mark, a copy is distributed to a private LOCKSS network. The database of links, users, registrars, and so on, in AWS, is snapshotted daily, and another snapshot of the database is dumped and saved by the server in Germany.
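As a purely illustrative sketch (not Perma's actual code; the bucket name and prefix are made up), the "mirror each WARC off-site" step of a pipeline like that might look something like this with boto3:

import boto3

s3 = boto3.client("s3")
BUCKET = "example-perma-warcs"   # hypothetical bucket name
PREFIX = "warcs/2019/"           # hypothetical key prefix

# List captured WARC files and mirror each one to local storage
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_path = key.replace("/", "_")
        s3.download_file(BUCKET, key, local_path)
        print("mirrored", key)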

Here's why that answer can only be provisional: there is no digital storage technology whose lifespan approaches the centuries of acid-free paper or microfilm. Worse, the systems housing the technology will tend to become insecure on a timescale measured in days, weeks, or months, and, unattended, impossible to upgrade in perhaps a few years. Every part of the software stack, from the operating system to the programming language to its packages to your code, is obsolescing, or worse, as soon as it's deployed. The companies that build and host the hardware will decline and fall; the hardware itself will become unperformant, then unusable.

Mitigating these problems is a near-constant process of monitoring, planning, and upgrading, at all levels of the stack. Even if we were never to write another line of Perma code, we'd need to update Django and all the other Python packages it depends on (and a Perma with no new code would become less and less able to capture pages on the modern web); in exactly the same way, the preservation layers of Perma will never be static, and we wouldn't want them to be. In fact, their heterogeneity across time, as well as at a given moment, is a key feature.

The core of digital preservation is institutional commitment, and the means are people. They require dedication, expertise, and flexibility; the institution's commitment and its staff's dedication are constants, but their methods can't be. The resilience of a digital preservation program lies in their careful and constant attention, as in the commonplace, "The best fertilizer is the farmer's footprint."

Although I am not an expert in digital preservation, nor well-read in its literature, I'm a practitioner; I'm a librarian, a software developer, and a DevOps engineer. Whether or not you thought this was fertilizer, I'd love to hear from you. I'm bsteinberg@law.harvard.edu.