Caselaw Access Project Cite Grid

Today we’re sharing Cite Grid, a first visualization of our citation graph data. Citation graphs are a way to see relationships between cases, and to answer questions like “What’s the most cited jurisdiction?” and “What year was the most influential in U.S. case law?”

You can explore this visualization in two ways. The map view allows you to select a jurisdiction and view its inbound and outbound citations; states more likely to cite that jurisdiction are shown in a darker color. For example, when viewing Texas, Missouri and California appear as the states most likely to cite it.

Map view showing inbound citations to Texas, with Missouri and California shown as most likely to cite that state.

The grid view allows you to view the percentage of citations by and to each state. Here’s an example! When we select one square, we can see that 1.4% of cases from Colorado cite California.

Grid view showing 1.4% of cases from Colorado citing to California.

Do you want to create your own visualization with the data supporting this tool? We’re sharing the dataset here. If you’re using our citation graph data, we want to hear about it, and help you spread the word!

Guest Post: An Empirical Study of Statutory Interpretation in Tax Law

This guest post is part of the CAP Research Community Series. This series highlights research, applications, and projects created with Caselaw Access Project data.

Jonathan H. Choi is a Fellow at the New York University School of Law and will join the University of Minnesota Law School as an Associate Professor in August 2020. This post summarizes an article recently published in the May 2020 issue of the New York University Law Review, titled An Empirical Study of Statutory Interpretation in Tax Law, available here on SSRN.

Do agencies interpret statutes using the same methodologies as courts? Have agencies and courts changed their interpretive approaches over time? And do different interpretive tools apply in different areas of law?

Tax law provides a good case study for all these questions. It has ample data points for comparative analysis: the IRS is one of the biggest government agencies and has published a bulletin of administrative guidance on a weekly basis for more than a hundred years, while the Tax Court (which hears almost all federal tax cases) has been active since 1942. By comparing trends in interpretive methodology at the IRS and Tax Court, we can see how agency and court activity has evolved over time.

The dominant theoretical view among administrative law scholars is that agencies ought to take a more purposivist approach than courts—that is, agencies are more justified in examining indicia of statutory meaning like legislative history, rather than focusing more narrowly on the text of the statute (as textualists would). Moreover, most administrative law scholars believe that judicial deference (especially Chevron) allows agencies to select their preferred interpretation of the statute on normative grounds, when choosing between multiple competing interpretations of statutes that are “reasonable.”

On top of this, a huge amount of tax literature has discussed “tax exceptionalism,” the view that tax law is different and should be subject to customized methods of interpretation. This has a theoretical component (the tax code’s complexity, extensive legislative history, and specialized drafting process) as well as a cultural component (the tax bar, from which both the IRS and the Tax Court draw, is famously insular).

That’s the theory—but does it match empirical reality? To find out, I created a new database of Internal Revenue Bulletins and combined it with Tax Court decisions from the Caselaw Access Project. I used Python to measure the frequency of terms associated with different interpretive methods in documents produced by the IRS, the Tax Court, and other federal courts. For example, “statutory” terms discuss the interpretation of statutes, “normative” terms discuss normative values like fairness and efficiency, “purposivist” terms discuss legislative history, and “textualist” terms discuss the language canons and dictionaries favored by textualists.
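The term-counting approach described above can be sketched roughly as follows. The term lists here are short hypothetical examples for illustration only; the paper's actual dictionaries are far larger and more carefully constructed.

```python
import re

# Hypothetical term lists -- stand-ins for the paper's real dictionaries.
PURPOSIVIST_TERMS = ["legislative history", "committee report", "conference report"]
TEXTUALIST_TERMS = ["plain meaning", "dictionary", "noscitur a sociis"]

def normalized_frequency(text, terms, per=1_000_000):
    """Count occurrences of each term, normalized per million words."""
    text = text.lower()
    word_count = len(re.findall(r"\w+", text))
    if word_count == 0:
        return 0.0
    hits = sum(text.count(term) for term in terms)
    return hits / word_count * per

opinion = (
    "The committee report and legislative history make clear that "
    "Congress intended a broad reading, whatever the plain meaning suggests."
)
print(normalized_frequency(opinion, PURPOSIVIST_TERMS))
print(normalized_frequency(opinion, TEXTUALIST_TERMS))
```

Comparing these normalized frequencies across years of IRS bulletins or Tax Court opinions yields trend lines like the ones in the graphs below.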

It turns out that the IRS has indeed shifted toward considering normative issues rather than statutory ones:

Graph showing "Statutory and Normative Terms in IRS Publications" and the relationship between year and Normalized Term Frequency.

In contrast, the Tax Court has fluctuated over time but has been stable in the relative mix of normative and statutory terms:

Graph showing "Statutory and Normative Terms in Tax Court Decisions" and the relationship between year and Normalized Term Frequency.

On the choice between purposivism and textualism, we can compare the IRS and the Tax Court with the U.S. Supreme Court. The classic story at the Supreme Court is that purposivism rose up during the 1930s and 1940s, peaked around the 1970s, and then declined from the 1980s onward, as the new textualism of Justice Scalia and his conservative colleagues began to dominate jurisprudence at the Supreme Court:

Graph showing "Purposivist and Textualist Terms in Supreme Court Decisions" and the relationship between year and Normalized Term Frequency.

Has the IRS followed the new textualism? Not at all—it shifted toward purposivism in the 1930s and 1940s, but has basically ignored the new textualism:

Graph showing "Purposivist and Textualist Terms in IRS Publications" and the relationship between year and Normalized Term Frequency.

In contrast, the Tax Court has completely embraced the new textualism, albeit with a lag compared to the Supreme Court:

Graph showing "Purposivist and Textualist Terms in Tax Court Decisions" and the relationship between year and Normalized Term Frequency.

Overall, the IRS has shifted toward making decisions on normative grounds and has remained purposivist, as administrative law scholars have argued. The Tax Court has basically followed the path of other federal courts toward the new textualism, sticking with its fellow courts rather than its fellow tax specialists.

That said, even though the Tax Court has shifted toward textualism like other federal trial courts, it might still differ in the details—it could favor some specific interpretive tools (e.g., certain kinds of legislative history, certain language canons) over others. To test this, I used Python’s scikit-learn package to train an algorithm to distinguish between opinions written by the Tax Court, the Court of Federal Claims (a federal court specializing in money claims against the federal government), and federal District Courts. The algorithm used a simple logistic-regression classifier, with tf-idf transformation, in a bag-of-words model that vectorized each opinion using a restricted dictionary of terms related to statutory interpretation.
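The classification setup can be sketched in scikit-learn along these lines. The toy documents, labels, and six-word vocabulary are illustrative stand-ins, not the paper's actual corpus or dictionary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Restricted dictionary: only interpretation-related terms are vectorized.
vocab = ["legislative", "history", "plain", "meaning", "canon", "dictionary"]

docs = [
    "the legislative history shows congressional purpose",  # toy Tax Court
    "legislative history and committee purpose control",    # toy Tax Court
    "the plain meaning and dictionary definition govern",   # toy District Court
    "under the canon, plain meaning controls",              # toy District Court
]
labels = ["tax", "tax", "district", "district"]

# Bag-of-words model with tf-idf weighting feeding a logistic regression.
clf = make_pipeline(TfidfVectorizer(vocabulary=vocab), LogisticRegression())
clf.fit(docs, labels)

print(clf.predict(["the committee's legislative history is clear"]))
```

With the real corpus, the same pipeline is trained on thousands of opinions and evaluated with bootstrapped confidence intervals, as described below.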

The algorithm performed reasonably well—for example, here are bootstrapped confidence intervals reflecting the performance of the algorithm in classifying opinions between the Tax Court and the district courts, showing Matthews correlation coefficient, accuracy, and F1 score. The white dots represent median performance over the bootstrapped sample; the blue bars show the 95-percent confidence interval, the green bars show the 99-percent confidence interval, and the red line shows the null hypothesis (performance no better than random). The algorithm performed statistically significantly better than random, even at a 99-percent confidence level.

Confidence intervals for the algorithm's performance in classifying opinions between the Tax Court and the district courts (Matthews correlation coefficient, accuracy, and F1 score).

Because the classifier used logistic regression, we can also analyze individual coefficients to see which particular terms more strongly indicated a Tax Court decision or a District Court decision. The graph of these terms is below, with terms more strongly associated with the District Courts below the line in red, and the terms more strongly associated with the Tax Court above the line in green. These terms were all statistically significant using bootstrapped significance tests and correcting for multiple comparisons (using Šidák correction).

Graph showing individual terms and the strength of their relationship to District Courts or Tax Court.

Finally, I used regression analysis (two-part regression to account for distributional issues in the data) to test whether the political party of the Tax Court judge and/or the case outcome could predict whether an opinion was written in more textualist or purposivist language. The party of the Tax Court judge was strongly predictive of methodology; but case outcome (whether the taxpayer won or the IRS won) was not.

Table showing "Regression Results for Party Affiliation in Tax Court Opinions, 1942 - 2015" including dependent variables for purposivist and textualist terms per million words, for "Democrat", "Year Judge Appointed", "Taxpayer Wins", "Opinion Year Fixed Effects", and "N".

The published paper contains much more detail about data, methods, and findings. I’m currently writing another paper using similar methodology to test the causal effect of Chevron deference on agency decisionmaking, so any comments on the methods in this paper are always appreciated!

Data Science for Case Law: A Course Collaboration

We just wrapped up a unique, semester-long collaboration between the Library and the data science program at SEAS.

This semester Jack Cushman and I joined the instructors of Advanced Topics in Data Science (CS109b) to lead a course module called Data Science for Case Law. Working closely with instructors, we challenged the students by asking them to apply data science methods to generate case summaries (aka "headnotes") with cases from CAP.

The course partnered with schools across campus to create six course modules, from predicting how disease spreads with machine learning, to understanding what galaxies look like using neural networks. We introduced our module by reviewing and discussing a case, and framed our goal around the need for freely available case summaries.

This challenge was a highlight of the semester. Students presented their work at the end of the term, which included multiple approaches to creating case summaries - like supervised and unsupervised models for machine learning and more.

We’re looking forward to new collaborations in the future, and want to hear from you. Have ideas? Let’s talk!

Caselaw Access Project Nominated for a Webby: Vote for Us!

The Caselaw Access Project has been nominated for one of the 24th Annual Webby Awards. We’re honored to be named alongside this year’s other nominees, including friends and leaders in the field like the Knight First Amendment Institute.

CAP makes 6.7 million cases freely available online from the collections of Harvard Law School Library. We’re creating new ways to access the law, such as our case browser, bulk data and downloads for research scholars, and graphs that show how words are used over time.

Brown v. Board of Education, 347 U.S. 483, 98 L. Ed. 2d 873, 74 S. Ct. 686 (1954)

If you like what we're doing, we would greatly appreciate a minute of your time to vote for the Webby People’s Voice Award in the category Websites: Law.

Do you have ideas to share with us? Send them our way. We’re looking forward to hearing from you.

Caselaw Access Project Citation Graph

The Caselaw Access Project is now sharing a citation graph of the 6.7 million cases in our collection from Harvard Law School Library. This update makes available a CSV file that lists case IDs and the cases they cite to. Here’s where you can find it: case.law/download/citation_graph

This citation graph shows us how cases are connected; it lets us find relationships between cases, like identifying the most influential cases and jurisdictions. This update is a new resource for finding those patterns. In the future, we want to use the CAP citation graph to create visualizations to show these relationships. We’re excited for you to do the same.
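As a quick sketch, one pattern the graph supports is counting inbound citations to find heavily cited cases. The column names and inline sample below are assumptions for illustration; check the README that ships with the download for the actual schema.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded CSV: one row per case, listing the IDs it cites.
sample = """\
case_id,cited_ids
1001,"2002,2003"
1002,"2003"
1003,"2003"
"""

inbound = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    for cited in row["cited_ids"].split(","):
        inbound[cited.strip()] += 1

# The most-cited case in this toy graph.
print(inbound.most_common(1))
```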

Have something to share? Send it our way! We’re looking forward to hearing from you.

Caselaw Access Project Shares PDFs for All Cases

The Caselaw Access Project is now making scanned PDFs available for every case in our collection.

This update makes all cases in the CAP case browser available as PDF, digitized from the collections of Harvard Law School Library. When viewing a case, just select the “view PDF” option above the title.

We’re also making volume-level PDFs available as part of CAP downloads. This will let users access PDF files for entire volumes, organized by jurisdiction and reporter.

Case and volume PDFs are available without restriction for our open jurisdictions (Illinois, Arkansas, New Mexico, and North Carolina). PDF files from closed jurisdictions are restricted to 500 cases per person, per day, with volume-level PDF access limited to authorized researchers.

This update creates new ways to read cases, online, for free. Are you using the Caselaw Access Project to read case law? We’re looking forward to hearing about it.

Caselaw Access Project Links Case Citations

The Caselaw Access Project is taking its first steps to create links to case citations in our collection of 6.7 million cases.

This update makes case citations available as links in the CAP case browser: when viewing a case, each citation appears as a link.

United States v. Kennedy, 573 F.2d 657 (1978) displaying in-text link to "407 F.2d 1391".

When you click on a citation, you’ll go directly to that case.

We also created a cites_to field in the Caselaw Access Project API. This new field shows which cases an opinion cites to. Here’s what that looks like.
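Reading the cites_to field from a case response can be sketched like this. The trimmed JSON below is a hypothetical approximation of the response shape, using the Kennedy citation shown above; fetch a real case from api.case.law/v1/cases/ to see the full payload.

```python
import json

# Hypothetical, trimmed API response for illustration.
response_body = """
{
  "name_abbreviation": "United States v. Kennedy",
  "cites_to": [
    {"cite": "407 F.2d 1391"}
  ]
}
"""

case = json.loads(response_body)
cited = [citation["cite"] for citation in case["cites_to"]]
print(cited)
```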

United States v. Kennedy, 573 F.2d 657 (1978) showing "cites_to" field in Caselaw Access Project API.

This is only the beginning of our work with case citations. In the future, we hope to improve citation extraction and ultimately to offer researchers a citation graph.

Are you using the Caselaw Access Project to understand the relationship between cases with case citations? Tell us about it.

CAP Code Share: Caselaw Access Project API to CSV

Today we’re going to learn how to write case data from the Caselaw Access Project API to CSV. This post shows work from Jack Cushman, Senior Developer at the Harvard Library Innovation Lab.

The Caselaw Access Project makes 6.7 million individual cases freely available from Harvard Law School Library. With this code, we can create a script to get case data from the Caselaw Access Project API, and write that data to a spreadsheet with Python. This demo is made available as part of the CAP Examples repository on Github. Let’s get started!

How does this script find the data it’s looking for? It makes a call to the CAP API that retrieves all cases including the words “first amendment”: api.case.law/v1/cases/?search=first+amendment. Want to create your own CAP API call? Here’s how.

The Caselaw Access Project has structured, case-level metadata. You can query parts of that data using CAP API endpoints, like “court” or “jurisdiction”. Here’s a rundown of the endpoints we have. This demo pulls these fields from each case and writes them to a CSV file: 'id', 'frontend_url', 'name', 'name_abbreviation', 'citation', 'decision_date', 'jurisdiction'. You can adapt this code and choose your own fields.

To run this script, find your CAP API key by creating an account or logging in, and viewing your user details.
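The flow above can be condensed to a sketch like this. The live request is left commented out so the example runs offline, and the sample record is a made-up stand-in for one API result; note that in live responses some fields (such as citations and jurisdiction) come back as nested objects, so you may need to flatten them first.

```python
import csv
import io
# import requests  # uncomment to fetch live data with your API key

FIELDS = ["id", "frontend_url", "name", "name_abbreviation",
          "citation", "decision_date", "jurisdiction"]

def rows_from_cases(cases):
    """Reduce each API case dict to just the chosen fields."""
    for case in cases:
        yield {field: case.get(field, "") for field in FIELDS}

# cases = requests.get("https://api.case.law/v1/cases/",
#                      params={"search": "first amendment"}).json()["results"]
cases = [{  # hypothetical stand-in for one API result
    "id": 12345,
    "frontend_url": "https://cite.case.law/ill/1/1/",
    "name": "Example v. Example",
    "name_abbreviation": "Example",
    "citation": "1 Ill. 1",
    "decision_date": "1820-12-01",
    "jurisdiction": "Illinois",
}]

out = io.StringIO()  # swap in open("cases.csv", "w", newline="") to save a file
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows_from_cases(cases))
print(out.getvalue())
```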

This code is part of the CAP Examples repository on Github, a place to find and share code for working with data from the Caselaw Access Project. Do you have code to share? We want to see this resource grow.

Are you creating new things with code or data made available by the Caselaw Access Project? Send it our way. Our inbox is always open.

Caselaw Access Project Downloads Now Available

Today we're announcing CAP downloads, a new way to access select datasets relating to the Caselaw Access Project. While researchers can use our API and bulk data to access standardized metadata and text for all of the cases in the CAP dataset, we also want to make it possible to share specialized and derivative datasets.

How does it work?

Everything available for download is presented in a simple file directory that lets you navigate to the specific dataset or file you want. Each dataset or export comes with a README file that includes basic information about it.

What data do we have?

To view and access what's currently available, visit case.law/download. We're starting with:

What other datasets should we share?

If you have ideas or suggestions for other datasets you'd like us to share, we'd love to hear about it. Contact us at case.law/contact/!