Caselaw Access Project Links Case Citations

The Caselaw Access Project is taking its first steps to create links to case citations in our collection of 6.7 million cases.

This update makes case citations available as links in the CAP case browser: when you view a case, the citations it contains are displayed as links.

United States v. Kennedy, 573 F.2d 657 (1978) displaying in-text link to "407 F.2d 1391".

When you click on a citation, you’ll go directly to that case.

We also created a cites_to field in the Caselaw Access Project API. This new field shows which cases an opinion cites to. Here’s what that looks like.

United States v. Kennedy, 573 F.2d 657 (1978) showing "cites_to" field in Caselaw Access Project API.
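As a rough sketch of how the new field can be consumed, the snippet below pulls the cites_to entries out of a case record fetched from the CAP API. The JSON shape shown here is a trimmed, hypothetical illustration based on the API's usual citation format, not a verbatim response:

```python
# Sketch: extract cited citations from a CAP API case record.
# The exact JSON shape of "cites_to" is an assumption for illustration.

def cited_citations(case_record):
    """Return the list of citation strings a case cites to."""
    return [entry["cite"] for entry in case_record.get("cites_to", [])]

# A trimmed, hypothetical case record:
case_record = {
    "id": 621646,
    "name_abbreviation": "United States v. Kennedy",
    "cites_to": [
        {"cite": "407 F.2d 1391"},
        {"cite": "397 U.S. 742"},
    ],
}

print(cited_citations(case_record))  # ['407 F.2d 1391', '397 U.S. 742']
```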

This is only the beginning of our work with case citations. In the future, we hope to improve citation extraction and ultimately to offer researchers a citation graph.

Are you using the Caselaw Access Project to understand the relationship between cases with case citations? Tell us about it.

CAP Code Share: Caselaw Access Project API to CSV

Today we’re going to learn how to write case data from the Caselaw Access Project API to CSV. This post shows work from Jack Cushman, Senior Developer at the Harvard Library Innovation Lab.

The Caselaw Access Project makes 6.7 million individual cases freely available from Harvard Law School Library. With this code, we can create a script to get case data from the Caselaw Access Project API, and write that data to a spreadsheet with Python. This demo is made available as part of the CAP Examples repository on Github. Let’s get started!

How does this script find the data it’s looking for? It makes a call to the CAP API that retrieves all cases containing the words “first amendment”. Want to create your own CAP API call? Here’s how.
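That call can be reproduced in a few lines of Python. The endpoint and `search` parameter below follow the public CAP API's documented pattern, but treat the exact URL as an assumption to verify against the API documentation:

```python
from urllib.parse import urlencode

BASE = "https://api.case.law/v1/cases/"

def search_url(phrase, full_case=False):
    """Build a CAP API URL that finds cases containing a phrase."""
    params = {"search": phrase}
    if full_case:
        params["full_case"] = "true"  # include full case text, not just metadata
    return BASE + "?" + urlencode(params)

url = search_url("first amendment")
# Fetching is one call with the standard library:
#   import json, urllib.request
#   cases = json.load(urllib.request.urlopen(url))["results"]
print(url)
```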

The Caselaw Access Project has structured, case-level metadata. You can query parts of that data using CAP API fields, like “court” or “jurisdiction”. Here’s a rundown of the fields we have. This demo uses these fields to write case data to a CSV file: 'id', 'frontend_url', 'name', 'name_abbreviation', 'citation', 'decision_date', 'jurisdiction'. You can adapt this code and choose your own fields.

To run this script, find your CAP API key by creating an account or logging in, and viewing your user details.

This code is part of the CAP Examples repository on Github, a place to find and share code for working with data from the Caselaw Access Project. Do you have code to share? We want to see this resource grow.

Are you creating new things with code or data made available by the Caselaw Access Project? Send it our way. Our inbox is always open.

Caselaw Access Project Downloads Now Available

Today we're announcing CAP downloads, a new way to access select datasets relating to the Caselaw Access Project. While researchers can use our API and bulk data to access standardized metadata and text for all of the cases in the CAP dataset, we also want to make it possible to share specialized and derivative datasets.

How does it work?

Everything available for download is presented in a simple file directory that lets you navigate to the specific dataset or file you want. Each dataset or export comes with a README file that includes basic information about it.

What data do we have?

To view and access what's currently available, visit our downloads page.

What other datasets should we share?

If you have ideas or suggestions for other datasets you'd like us to share, we'd love to hear about them. Contact us!

Caselaw Access Project Shares Scanned Images for Open Jurisdictions

The Caselaw Access Project now has scanned images available for download as PDFs with selectable text for all open-access jurisdictions, including Arkansas, Illinois, North Carolina, and New Mexico. To download scanned images by volume, visit our downloads page and browse to the volume you seek.

Through our API and bulk data tools, researchers already have access to metadata and text files produced through OCR of the scanned images. With this new release, we're able to share the scanned images themselves in an enhanced form that enables text selection and search.

For this initial release, scanned images are available only for those jurisdictions that have taken the important step of ensuring that all of their current opinions are published and freely accessible online in an authoritative, machine-readable manner that avoids vendor-specific citation. As always, we're eager to work with other states seeking to take this step toward digital-first publishing. Here's how to get started.

Connecting Data with the Supreme Court Database (SCDB) and Caselaw Access Project

Last week we released an update to the Caselaw Access Project that adds case IDs and citations from the Supreme Court Database (SCDB) to our U.S. Supreme Court case metadata.

This update adds new, parallel citations to cases and makes it easy for people using data from the Caselaw Access Project to also take advantage of this rich dataset made available by the Supreme Court Database (SCDB). This represents one of the major benefits of open data - the ability to connect two datasets to enable new kinds of analysis and insight.

The Supreme Court Database (SCDB) is an outstanding project by Harold J. Spaeth, Lee Epstein, Ted Ruger, Jeffrey Segal, Andrew D. Martin and Sarah Benesh. A key resource in legal data, SCDB offers case-and-justice-specific metadata about every Supreme Court decision. Metadata made available by this resource covers a range of variables, like Majority Opinion Writer, Docket Number, Issue Area, Majority and Minority Votes, and more. To learn more about the Supreme Court Database (SCDB), their documentation is a great place to start.

Here are some ways to work with Supreme Court Database (SCDB) data and the Caselaw Access Project.

When viewing an individual case in the Caselaw Access Project, new citations and case IDs from SCDB are now visible in the citations field. Here’s a look!

Example of case citation field, showing: "Brown v. Board of Education, 347 U.S. 483, 98 L. Ed. 2d 873, 74 S. Ct. 686 (1954)".

When we retrieve cases with the Caselaw Access Project API, we can see the connection between our case metadata and data made available by the Supreme Court Database (SCDB). Try this example.

Caselaw Access Project API displaying citation metadata for case "Brown v. Board of Education, 347 U.S. 483, 98 L. Ed. 2d 873, 74 S. Ct. 686 (1954)"

You can retrieve cases from CAP Search and the CAP API with a Supreme Court Database (SCDB) ID. Here’s how to do it. In CAP Search, add your SCDB ID to the Citation field and run your search. Here’s an example! Want to do the same in the CAP API? Create an API call to retrieve a case by citation, and add the SCDB ID. Here’s what that looks like:
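The API call can be sketched in Python as below. The `cite` parameter is the CAP API's documented citation-lookup mechanism; the SCDB ID shown is a hypothetical placeholder, not a real identifier:

```python
from urllib.parse import quote

def cite_lookup_url(cite):
    """Build a CAP API call that retrieves cases by citation (or SCDB ID)."""
    return "https://api.case.law/v1/cases/?cite=" + quote(cite)

# An ordinary citation lookup:
print(cite_lookup_url("347 U.S. 483"))

# The same pattern with a hypothetical, placeholder SCDB ID in place of
# the citation:
print(cite_lookup_url("1953-999"))
```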

We’re also making a download available of cases matched from the Supreme Court Database (SCDB) to the Caselaw Access Project as a spreadsheet.

What can we learn with this data? Here’s one example. By using data from the Caselaw Access Project and the Supreme Court Database (SCDB) together, you can isolate opinions by particular justices, or opinions that involve particular legal issues. This can be the first step to understanding the appellate history of a Supreme Court case, and it is just one of the many new possibilities this connected data opens up.

This is our first cut at incorporating external data into the Caselaw Access Project, and there may be bugs we have not yet identified. For example, while we are able to match 28,090 out of 28,347 cases (~99%), there are a few we couldn’t match. We’ll be taking a look at those and updating the data as we go. If you find other errors, as always, reach out to tell us about them.

We’re excited about this update to the Caselaw Access Project and grateful for all the hard work the folks at the Supreme Court Database (SCDB) have done to make and share their data. We can’t wait to see what our community learns and creates with this resource. Working on something new? We’re looking forward to hearing about it.

Some Recent Perma Use

You may have seen Perma links in a number of documents of current interest, including the Trial Memorandum of the U.S. House of Representatives in the Impeachment Trial of President Donald J. Trump and the Trial Memorandum of President Donald J. Trump, both of which have been archived with Perma. Interestingly, both documents cite Perma links without citing the original URL that Perma archived; generally, you would include both in your citation.

As an exercise, I used Perma's public API to look up the URLs for the Perma links cited in these two documents; here are CSV files listing the 148 links in the House Memorandum (one ill-formed) and the 129 links in the President's memorandum. (Note that both CSV files include duplicates, as some links are repeated in each document; I'm leaving the duplicates in place in case you want to read along with the original documents.)
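For anyone who wants to repeat the exercise, Perma links have the form `https://perma.cc/XXXX-XXXX`, and the public API exposes each archive (including the URL it captured) at `https://api.perma.cc/v1/public/archives/{guid}/`. The snippet below is a sketch of that pattern, using a made-up GUID for illustration:

```python
import re

def perma_guid(link):
    """Extract the GUID from a Perma link like https://perma.cc/55VJ-CTBE."""
    match = re.search(r"perma\.cc/([A-Z0-9]{4}-[A-Z0-9]{4})", link)
    return match.group(1) if match else None

def archive_api_url(guid):
    """Public API endpoint describing one archive, including its source URL."""
    return f"https://api.perma.cc/v1/public/archives/{guid}/"

guid = perma_guid("https://perma.cc/55VJ-CTBE")  # made-up GUID for illustration
print(archive_api_url(guid))
# Fetching that endpoint returns JSON carrying the original URL (in a field
# such as "url"); writing the results out as CSV is then straightforward.
```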

North Carolina Joins Growing List of Open Access Jurisdictions

Today we're pleased to announce that North Carolina has joined Illinois, Arkansas, and New Mexico as the latest jurisdiction to make all of its appellate court decisions openly available online, in an authoritative, machine-readable format that is citable using a vendor-neutral citation.

As a result of North Carolina taking this important step, the Caselaw Access Project has removed all use and access restrictions from the North Carolina cases in its collection. You now can view or download the full text of all North Carolina cases without restriction. You can read individual cases with the CAP Case Browser or access many cases at once with the CAP API and Bulk Data. Here's an example!

We're delighted and inspired by the work of the North Carolina Supreme Court and the many dedicated professionals on the Court's staff. We hope that many other states will follow North Carolina’s example as soon as possible. Here's how to help make it happen.

The Open Casebook: Creating Casebooks with H2O and the Caselaw Access Project

What if you could create your own casebook with any case ever published? By connecting H2O open casebooks with the Caselaw Access Project, we can change how we read and create casebooks.

In higher education, open textbooks have created new ways to learn, share, and adapt knowledge - and to save students money in the process. With casebooks costing law students hundreds of dollars each, this gives law schools the opportunity to create casebooks that serve their communities.

What do open casebooks look like? From Contracts (Prof. Charles Fried), Criminal Law (Prof. Jeannie Suk-Gersen), Civil Procedure (Prof. I. Glenn Cohen), Torts (Prof. Jonathan Zittrain) and more, open casebooks are one way to create course content to support the future of legal education.

How can you create a new casebook with 6.7 million unique cases from the Caselaw Access Project? Here’s how!

  1. Create your casebook (Here’s more on how to get started).
  2. Select “Add Resource”.
  3. Under “Find Case”, import your case by citation (Example: 347 U.S. 483).
  4. Select the case. You’ve just added it to your casebook! Nice 😎.

With a collection of all published U.S. case law, what casebooks would you create, read, and share? Create your casebook with H2O Open Casebooks and the Caselaw Access Project.

Guest Post: Is the US Supreme Court in lockstep with Congress when it comes to abortion?

This guest post is part of the CAP Research Community Series. This series highlights research, applications, and projects created with Caselaw Access Project data.

Abdul Abdulrahim is a graduate student at the University of Oxford completing a DPhil in Computer Science. His primary interests are in the use of technology in government and law and developing neural-symbolic models that mitigate the issues around interpretability and explainability in AI. Prior to the DPhil, he worked as an advisor to the UK Parliament and a lawyer at Linklaters LLP.

The United States of America (U.S.) has seen declining public support for major political institutions, and a general disengagement with the processes or outcomes of the branches of government. According to Pew's Public Trust in Government survey earlier this year, "public trust in the government remains near historic lows," with only 14% of Americans stating that they can trust the government to do "what is right" most of the time. We believed this falling support could affect the relationship between the branches of government and the independence they might have.

One indication of this was a study on congressional law-making which found that Congress was more than twice as likely to overturn a Supreme Court decision when public support for the Court was at its lowest than when it was at its highest (Nelson & Uribe-McGuire, 2017). Furthermore, another study found that it was more common for Congress to legislate against Supreme Court rulings that ignored legislative intentions or rejected positions taken by federal, state, or local governments, owing to ideological differences (Eskridge Jr., 1991).

To better understand how the interplay between the U.S. Congress and the Supreme Court has evolved over time, we developed a method for tracking the ideological changes in each branch using word embeddings trained on the text corpora each branch generated. For the Supreme Court, we used the opinions for the cases provided in the CAP dataset, though we extended this to include other federal court opinions to ensure our results were stable. For Congress, we used the transcribed speeches of Congress from Stanford's Social Science Data Collection (SSDS) (Gentzkow & Taddy, 2018). We use the case study of reproductive rights (particularly, the target word "abortion"), which is arguably one of the more contentious topics ideologically divided Americans have struggled to agree on. Over the decades, we have seen shifts in the interpretation of rights by both the U.S. Congress and the Supreme Court that have arguably led to an expansion of reproductive rights in the 1960s and a contraction in the subsequent decades.

What are word embeddings? To track these changes, we use a quantitative method for tracking semantic shift from computational linguistics, based on the co-occurrence statistics of words in our corpora of Congress speeches and the Court's judicial opinions. The resulting representations are known as word embeddings: a distributed representation of text that is perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems. Using the text corpora as a proxy, word embeddings let us see how each institution has ideologically leaned over the years on the issue of abortion, and whether any particular case led to an ideological divide or alignment.

For a more detailed account on word embeddings and the different algorithms used, I highly recommend Sebastian Ruder's "On word embeddings".

Our experimental setup. In tracking the semantic shifts, we evaluated a couple of approaches based on the word2vec algorithm. Conceptually, we formulate the task of discovering semantic shifts as follows: given a time-sorted corpus (corpus 1, corpus 2, …, corpus n), we locate our target word and its meanings in the different time periods. We chose word2vec based on comparisons of the performance of different algorithms (count-based, prediction-based, or a hybrid of the two) on a corpus of U.S. Supreme Court opinions. We found that although there is variability in coherence and stability depending on the algorithm chosen, the word2vec models show the most promise in capturing the wider interpretation of our target word. Between the two word2vec algorithms, Continuous Bag of Words (CBOW) and Skip-Gram Negative Sampling (SGNS), we observed similar performance; however, the latter showed more promising results in capturing case law related to our target word at a specific time period.

Testing one algorithm in our experiments, a low-dimensional representation learned with SGNS, with both the incremental updates method (IN) and the diachronic alignment method (AL), gave us results for two models: SGNS (IN) and SGNS (AL). In our implementation, we use parts of the Python library gensim and supplement this with implementations by Dubossarsky et al. (2019) and Hamilton et al. (2016b) for tracking semantic shifts. For the SGNS (AL) model, we extract regular word-context pairs (w, c) for each time slice and train SGNS on each slice separately. For the SGNS (IN) model, we similarly extract regular word-context pairs (w, c), but rather than dividing the corpus and training on separate time bins, we train on the first time period and then incrementally add new words, updating and saving the model.

To tune our algorithm, we performed two main evaluations (intrinsic and extrinsic) on samples of our corpora, comparing performance across different hyperparameters (window size and minimum word frequency). Based on these results, the parameters used were MIN = 200 (minimum word frequency), WIN = 5 (symmetric window cut-off), DIM = 300 (vector dimensionality), CDS = 0.75 (context distribution smoothing), K = 5 (number of negative samples) and EP = 1 (number of training epochs).
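For readers who want to reproduce this with gensim, the paper's parameter abbreviations map onto `gensim.models.Word2Vec` keyword arguments roughly as shown below. The mapping is our reading of the gensim API, so verify it against the version you install:

```python
# Tuned hyperparameters from the text, keyed by the paper's abbreviations.
PARAMS = {
    "MIN": 200,   # minimum word frequency
    "WIN": 5,     # symmetric window cut-off
    "DIM": 300,   # vector dimensionality
    "CDS": 0.75,  # context distribution smoothing
    "K": 5,       # number of negative samples
    "EP": 1,      # number of training epochs
}

# Approximate gensim equivalent (requires gensim; shown for reference only):
#   from gensim.models import Word2Vec
#   model = Word2Vec(
#       sentences,
#       sg=1,                      # Skip-Gram
#       min_count=PARAMS["MIN"],
#       window=PARAMS["WIN"],
#       vector_size=PARAMS["DIM"],
#       ns_exponent=PARAMS["CDS"],
#       negative=PARAMS["K"],      # SGNS negative sampling
#       epochs=PARAMS["EP"],
#   )
print(PARAMS)
```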

What trends did we observe in our results? We observed some notable trends in the changes in the nearest neighbours to our target word. The nearest neighbours to abortion indicate how the speakers or writers who generated our corpus associate the word, and what connotations it might have within each group.
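The "nearest neighbours" here are simply the vocabulary words whose vectors have the highest cosine similarity to the vector for "abortion". A dependency-free sketch of that ranking, over a toy embedding space invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbours(target, vectors, k=10):
    """Rank words by cosine similarity to the target word's vector."""
    query = vectors[target]
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word != target]
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# Toy 2-d embedding space for illustration only:
vectors = {
    "abortion": [1.0, 0.1],
    "roe":      [0.9, 0.2],   # points the same way -> high similarity
    "tariff":   [-0.2, 1.0],  # nearly orthogonal -> low similarity
}
print(nearest_neighbours("abortion", vectors, k=2))  # ['roe', 'tariff']
```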

To better assess our results, we conducted an expert interview with a Women's and Equalities Specialist to categorise the words as: (i) a medically descriptive word, i.e., it relates to common medical terminology on the topic; (ii) a legally descriptive word, i.e., it relates to case, legislation, or opinion terminology; and (iii) a potentially biased word, i.e., it is not a legal or medical term and thus was chosen by the speaker or writer as a descriptor.

Nearest Neighbours Table Key. Description of keys used to classify words in the nearest neighbours by type of terminology. These were based on the insights derived from an expert interview.

Table showing "Category", "Colour Code", and "Description" for groups of words.

A key observation we made about the approaches to tracking semantic shifts is that depending on what type of cultural shift we intend to track, we might want to pick a different method. The incremental updates approach helps identify how parts of a word sense from preceding time periods change in response to cultural developments in the new time period. For example, we see how the relevance of Roe v. Wade (1973) changes across all time periods in our incremental updates model for the judicial opinions.

In contrast, the diachronic alignment approach better reflects the issues of a specific period in the top nearest neighbours. For instance, the case of Roe v. Wade (1973) appears in the nearest neighbours for the judicial opinions shortly after it is decided, in the decade up to 1975, but drops out of our top words until the decades up to 1995 and 2015, when the cases of Webster v. Reproductive Health Services (1989), Planned Parenthood v. Casey (1992) and Gonzales v. Carhart (2007) overruled aspects of Roe v. Wade (1973), hence the new references to it. This is useful for detecting the key issues of a specific time period, and explains why this approach has the highest overall detection performance of all our approaches.

Local Changes in U.S. Federal Court Opinions. The top 10 nearest neighbours to the target word "abortion" ranked by cosine similarity for each model.

Table displaying "Incremental Updates" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

Table displaying "Diachronic Alignment" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

Local Changes in U.S. Congress Speeches. The top 10 nearest neighbours to the target word "abortion" ranked by cosine similarity for each model.

Table displaying "Incremental Updates" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

Table displaying "Diachronic Alignment" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

These preliminary insights allow us to understand some of the interplay between the Courts and Congress on the topic of reproductive rights. The method also offers a way to identify bias and how it may feed into the process of lawmaking. As such, for future work, we aim to refine the methods to serve as a guide for operationalising word embeddings models to identify bias - as well as the issues that arise when applied to legal or political corpora.

Using Machine Learning to Extract Nuremberg Trials Transcript Document Citations

In Harvard's Nuremberg Trials Project, being able to link to cited documents in each trial's transcript is a key feature of site navigation. Each document submitted into evidence by prosecution and defense lawyers is introduced in the transcript and discussed, and at each document mention the site user can click open the document and view its contents and attendant metadata. While document references generally follow various standard patterns, deviations from the pattern, large and small, are numerous, and correctly identifying the type of a document reference (is this a prosecution or defense exhibit, for example?) can be quite tricky, often requiring teasing out contextual clues.

While manual linkage is highly accurate, it becomes infeasible over a corpus of 153,000 transcript pages and more than 100,000 document references to manually tag and classify each mention of a document, whether it be a prosecution or defense trial exhibit, or a source document from which the former were often chosen. Automated approaches offer the most likely promise of a scalable solution, with strategic, manual, final-mile workflows responsible for cleanup and optimization.

Initial prototyping by Harvard of automated document reference capture focused on the use of pattern matching in regular expressions. Targeting only the most frequently found patterns in the corpus, Harvard was able to extract more than 50,000 highly reliable references. While continuing with this strategy could have found significantly more references, it was not clear that once identified, a document reference could be accurately typed without manual input.
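The regular-expression stage can be pictured with a pattern like the one below. Harvard's actual patterns are not published in this post, so this is a simplified stand-in for the most common form of exhibit mention:

```python
import re

# Simplified stand-in for the kind of pattern used to capture exhibit
# mentions; real transcript references are far more varied.
EXHIBIT = re.compile(
    r"(?P<side>Prosecution|Defense)\s+Exhibit\s+(?:No\.\s*)?(?P<number>\d+)",
    re.IGNORECASE,
)

text = ("I offer Prosecution Exhibit No. 435 in evidence. "
        "Counsel then turned to Defense Exhibit 12.")

refs = [(m.group("side"), m.group("number")) for m in EXHIBIT.finditer(text)]
print(refs)  # [('Prosecution', '435'), ('Defense', '12')]
```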

At this point Harvard connected with Tolstoy, a natural language processing (NLP) AI startup, to ferret out the rest of the tags and identify them by type. Employing a combination of machine learning and rule-based pattern matching, Tolstoy was able to extract and classify the bulk of remaining document references.

Background on Machine Learning

Machine learning is a comprehensive branch of artificial intelligence. It is, essentially, statistics on steroids. Working from a “training set” – a set of human-labeled examples – a machine learning algorithm identifies patterns in the data that allow it to make predictions. For example, a model that is supplied many labeled pictures of cats and dogs will eventually find features of the cat images that correlate with the label “cat,” and likewise, for “dog.” Broadly speaking, the same formula is used by self-driving cars learning how to respond to traffic signs, pedestrians, and other moving objects.

In Harvard’s case, a model was needed that could learn, from a labeled training set, to extract and classify document references in the court transcripts. To enable this, one of the main features used was surrounding context, including possible trigger words that help determine whether a given trial exhibit was submitted by the prosecution or the defense. To be most useful, the classifier needed to be highly accurate (references correctly labeled as either prosecution or defense), precise (minimal false positives), and high in recall (few missed references).

Feature Engineering

The first step in any machine learning project is to produce a thorough, unbiased training set. Since Harvard staff had already identified 53,000 verified references, Tolstoy used those, along with an additional set generated using more precise heuristics, to train a baseline model.

The model is the predictive algorithm. There are many different families of models a data scientist can choose from. For example, one might use a support vector machine (SVM) if there are fewer examples than features, a convolutional neural net (CNN) for images, or a recurrent neural net (RNN) for processing long passages requiring memory. That said, the model is only a part of the entire data processing pipeline, which includes data pre-processing (cleaning), feature engineering, and post-processing.

Here, Tolstoy used a "random forest" algorithm. This method uses a series of decision-tree classifiers with nodes, or branches, representing points at which the training data is subdivided based on feature characteristics. The random forest classifier aggregates the final decisions of a suite of decision trees, predicting the class most often output by the trees. The entire process is randomized as each tree selects a random subset of the training data and random subset of features to use for each node.
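The aggregation step is simple to picture: each tree votes, and the forest predicts the majority class. The toy below fakes three "trees" as hand-written rules purely to show the voting mechanics; real forests learn their splits from training data (e.g. with scikit-learn's RandomForestClassifier), and these feature names are invented for illustration:

```python
from collections import Counter

# Three hand-written stand-ins for decision trees, purely to show the
# majority-vote aggregation; real trees are learned from training data.
def tree_a(f): return "prosecution" if f["speaker_is_prosecutor"] else "defense"
def tree_b(f): return "defense" if f["mentions_defendant_name"] else "prosecution"
def tree_c(f): return "prosecution" if f["has_country_abbrev"] else "defense"

def forest_predict(features, trees=(tree_a, tree_b, tree_c)):
    """Predict the class most often output by the trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

features = {
    "speaker_is_prosecutor": True,
    "mentions_defendant_name": False,
    "has_country_abbrev": True,
}
print(forest_predict(features))  # prosecution
```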

Models work best when they are trained on the right features of the data. Feature engineering is the process by which one chooses the most predictive parts of available training data. For example, predicting the price of a house might take into account features such as the square footage, location, age, amenities, recent remodeling, etc.

In this case, we needed to predict the type of document reference involved: was it a prosecution or defense trial exhibit? The exact same sequence of characters, say "Exhibit 435," could be either defense or prosecution, depending on – among other things – the speaker and how they introduced it. Tolstoy used features such as the speaker, the presence or absence of prosecution or defense attorneys' names (or that of the defendant), and the presence or absence of country name abbreviations to classify the references.
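A context-window feature extractor of that flavour might look like the sketch below. The name lists are placeholders, not the project's actual lexicons, and the feature set is illustrative:

```python
# Placeholder lexicons; the real pipeline drew these from trial metadata.
PROSECUTORS = {"taylor", "dodd"}
DEFENSE_COUNSEL = {"stahmer", "horn"}

def reference_features(context, speaker):
    """Boolean features describing the context around an exhibit mention."""
    words = {w.strip(".,:;").lower() for w in context.split()}
    return {
        "speaker_is_prosecutor": speaker.lower() in PROSECUTORS,
        "speaker_is_defense": speaker.lower() in DEFENSE_COUNSEL,
        "mentions_prosecutor": bool(words & PROSECUTORS),
        "mentions_defense_counsel": bool(words & DEFENSE_COUNSEL),
        "has_submission_verb": bool(words & {"offer", "offered", "submitted"}),
    }

feats = reference_features("I offer Exhibit 435 in evidence.", "Dodd")
print(feats["speaker_is_prosecutor"], feats["has_submission_verb"])  # True True
```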


Machine learning is a great tool in a predictive pipeline, but in order to gain very high accuracy and recall rates, one often needs to combine it with heuristics-based methods as well. For example, in the transcripts, phrases like “submitted under” or “offered under” may precede a document reference. These phrases were used to catch references that had previously been missed. Other post-processing included catching and removing tags from false positives, such as years (e.g. January 1946) or descriptions (300 Germans). These techniques allowed us to preserve high precision while maximizing recall.
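The false-positive cleanup can likewise be sketched as a filter over candidate matches. The two rules below mirror the examples in the text (years like "January 1946", quantity phrases like "300 Germans") and are illustrative only:

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
YEAR = re.compile(rf"\b(?:{MONTHS})\s+\d{{4}}\b")    # e.g. "January 1946"
QUANTITY = re.compile(r"\b\d+\s+[A-Z][a-z]+s\b")     # e.g. "300 Germans"

def is_false_positive(candidate):
    """Drop candidate references that are really dates or quantities."""
    return bool(YEAR.search(candidate) or QUANTITY.search(candidate))

candidates = ["Document 1946", "January 1946", "300 Germans", "Exhibit 435"]
kept = [c for c in candidates if not is_false_positive(c)]
print(kept)  # ['Document 1946', 'Exhibit 435']
```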

Collaborative, Iterative Build-out

In the build-out of the data processing pipeline, it was important for both Tolstoy and Harvard to carefully review interim results, identify and discuss error patterns and suggest next-step solutions. Harvard, as a domain expert, was able to quickly spot areas where the model was making errors. These iterations allowed Tolstoy to fine-tune the features used in the model, and amend the patterns used in identifying document references. This involved a workflow of tweaking, testing and feedback, a cycle repeated numerous times until full process maturity was reached. Ultimately, Tolstoy was able to successfully capture more than 130,000 references throughout the 153,000 pages, with percentages in the high 90s for accuracy and low 90s for recall. After final data filtering and tuning at Harvard, these results will form the basis for the key feature enabling interlinkage between the two major data domains of the Nuremberg Trials Project: the transcripts and evidentiary documents. Working together with Tolstoy and machine learning has significantly reduced the resources and time otherwise required to do this work.