Welcome Molly White: Library Innovation Lab Fellow

The Harvard Library Innovation Lab is delighted to welcome Molly White as an academic fellow.

Molly has become a leading critic of cryptocurrency and web3 through her blog Web3 is Going Just Great as well as her long-form research and essays, talks, guest lectures, social media, and interviews. Her rigorous, independent, and accessible work has had a substantial impact on legislators, media, and the public debate.

In her work at the Library Innovation Lab, Molly will further examine goals she shares with the web3 community she studies—such as orchestrating community-governed projects, returning power to users from advertisers and tech companies, and increasing access to financial services and reducing wealth inequality—and consider what attracts individuals to web3 and what can be learned from their experience about how to advance those causes.

We are excited for Molly to join the Lab because we share her interest in reinventing traditional institutions to better empower people, while being deeply critical of proposed replacements that may misfire or disempower people. We also admire her skill in translating technical insights into effective advocacy. As libraries continue to rethink our role—as a cultural memory, information network, and social safety net—we have a great deal to learn from her work.

Molly will be joining us for one year, as part of our Democratizing Open Knowledge program supported by the Filecoin Foundation for the Decentralized Web. Welcome, Molly!

H2O usability study: do students want physical casebooks?

This summer, one of our research assistants, Seonghee Lee, ran a study among current law students that is helping us reconsider some longstanding assumptions about student reading preferences and informing future development of the H2O Open Casebook platform.

H2O was launched in an early form in 2012, and for years we worked under the assumption that most books written with H2O would eventually be read in a print format. We have put a lot of work into improving the export experience so that professors can create a book using the H2O platform, export it as a Word document, format it as they like, and distribute it to students as a low-cost, print-on-demand book or as a printable PDF. Our expectations of reading formats began to evolve as we heard authors start to ask for multimedia options like video in their H2O casebooks, but we still heard strong feedback from many professors that they needed a print option for their students.

However, as with so many things, 2020 may have changed what we thought we knew by resetting students' expectations and preferences for their learning materials. This summer we spoke to 21 current law students who had not used H2O, and more than half told us that they prefer digital casebooks over physical texts. Cost was an obvious factor—if a digital book costs less than a physical book, they want the digital book—but many also cited their use of digital notetaking and writing tools as well as the clunkiness and inconvenience of heavy, printed books in their backpacks. Nine of those 21 students talked to us over Zoom (the others completed a survey), and once we were able to show those nine how H2O worked, they all said they could see themselves reading and annotating H2O directly on the platform.

While these conversations cut against the conventional wisdom about student reading preferences, they align with anecdotes I've been hearing from students. When chatting with some current students at a library event at Harvard Law School earlier this week, most told me that when a professor assigned an H2O book they read it on the H2O platform, even when the professor had created a print option.

Of course, all of these conversations put together still add up to a small number of law students, and even if it is only a minority of students who prefer physical books, we want to make sure H2O is a platform that can meet those students' learning needs as well. We will continue to support professors who want to create printable versions of their H2O books for their students.

But these early conversations about how students prefer to read and to learn are forcing us to ask new questions, too. What tools and capabilities do we owe students who are reading H2O directly on the platform? How can we work with professors to better understand their students' reading preferences? What expertise can we learn from in designing a digital reading platform that is as effective as (or better than!) physical reading?

Some early answers come directly from the usability study—students were most concerned with whether a digital reading platform like H2O has embedded annotation tools they can use to mark up cases and inform the outlines they make for their classes. While many students thought they could use the annotation features already built into H2O, this feedback may point to a separate, student-centric set of annotation tools in H2O down the line. For now, we've added some improved UI to better direct readers of H2O casebooks to the annotation tools already there.

Read a summary of Seonghee's work here, and if you have ideas for us, let us know at info@opencasebook.org.

Web Archiving: Opportunities and Challenges of Client-Side Playback

Historically, the complexity of the backend infrastructure needed to play back and embed rich web archives on a website has limited how we explore, showcase, and tell stories with the archived web. Client-side playback is an exciting emerging technology that lifts a lot of those constraints.

The replayweb.page software suite developed by our long-time partner Webrecorder is, to date, the most advanced client-side playback technology available, allowing for the easy embedding of rich web archive playbacks on a website without the need for a complex backend infrastructure. However, entirely delegating to the browser the work of downloading, parsing, and rendering web archives also means transferring new responsibilities to the client, which comes with its own set of challenges.

In this post, we'll reflect on our experience deploying replayweb.page on Perma.cc and provide general security, performance and practical recommendations on how to embed web archives on a website using client-side playback.

Security model

Conceptually, embedding a web archive on a web page is equivalent to embedding a third-party website: the embedder has limited control over what is embedded, and the embedded content should therefore be as isolated as possible from its parent context.

Although the software replaying a web archive can attempt to prevent replayed JavaScript from escaping its context, we believe embedding should be implemented in a way that benefits as much as possible from the built-in protections the browser offers for such use cases: namely, the same-origin policy.

Embedding third-party code from a web archive can go wrong in a few ways: there could be an intentional cross-site scripting attack, where JavaScript code is added to a web archive with the intent of accessing or modifying information on the top-level document. There could be an accidental cookie rewrite, where the archive creates a new cookie overwriting one already in use by the embedding website. There could also be proxying conflicts, where a URL of the embedding website ends up being caught by the proxying system of the playback software, making it harder to reach.

Our experience so far tells us that these "context clashes" are more easily prevented by instructing the browser to isolate the archive replay as much as possible.

For that reason, although it is entirely possible—and convenient—to mix web archive content directly into a top-level HTML page, our recommendation is to use an iframe to do so, pointing at a subdomain of the embedding website.

<!-- On www.example.com -->
<iframe
  src="https://warc.example.com/replay/{id}"
  sandbox="allow-scripts allow-same-origin allow-modals allow-forms">
</iframe>

In this example, www.example.com uses an iframe to embed warc.example.com/replay/{id}, which serves an HTML document containing an instance of replayweb.page, pointing at an archive file identified by {id}.

A few reasons for that recommendation:

  • warc.example.com is a different origin, so the browser will greatly restrict interactions between the embedded replay and its parent, helping prevent context leaks that the playback system might not have accounted for. This should remain true even though the embedding iframe's sandbox attribute needs to include both allow-scripts and allow-same-origin for the playback system to work properly.
  • It is still on the example.com domain, however, so the browser will allow this frame to install a service worker. Service workers are subject to the same restrictions as cookies in a third-party embedding context: if third-party cookies are blocked by the browser (which is becoming the default in most browsers), so are third-party service workers.

Client-side performance and caching

The transition from server-side to client-side playback also forces us to reconsider performance and caching strategies, informed by the client's network access characteristics and the limitations of their browsers. The following recommendations are specific to replayweb.page, but are likely applicable, to a certain extent, to other client-side playback solutions.

By default, replayweb.page will try to store every payload it pulls from the archive in IndexedDB for caching purposes. Different browsers have different storage allowances and eviction mechanisms, and that allowance can plausibly run out after only a few archive playbacks. This is a problem we faced with Safari on Perma.cc, and recovery mechanisms proved difficult to implement efficiently.

While this caching feature is helpful to reduce bandwidth usage for returning visitors, turning it off via the noCache attribute may make sense.

There seems to be a strong-enough correlation between browsers giving limited storage allowances and browsers not supporting the StorageManager.estimate API to formulate the following recommendation: noCache should be added if StorageManager.estimate is either not available, or indicates that storage usage is above a certain threshold.

It should be noted that, even when using noCache, replayweb.page needs to store content indexes and other information about the archives in IndexedDB to function. As such, determining how much space should be left for that purpose is context-specific, and we are unfortunately unable to make a general recommendation on this topic.
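The noCache recommendation above can be sketched roughly as follows. This is a minimal illustration rather than replayweb.page's own logic: the function names and the 0.8 usage threshold are assumptions made for the example, and the right threshold depends on how much room your deployment needs to leave for content indexes.

```javascript
// Hypothetical threshold: the fraction of the storage quota already in use
// above which caching is skipped. The right value is context-specific.
const USAGE_THRESHOLD = 0.8;

// Pure decision: given the result of StorageManager.estimate() (or null when
// the API is unavailable), should the player run in noCache mode?
function shouldDisableCache(estimate, threshold = USAGE_THRESHOLD) {
  if (!estimate) return true; // API unavailable: assume a small allowance
  const { usage, quota } = estimate;
  if (typeof usage !== "number" || typeof quota !== "number" || quota <= 0) {
    return true; // unusable estimate: err on the side of reliability
  }
  return usage / quota > threshold;
}

// Reads the estimate from a navigator-like object, or returns null when
// StorageManager.estimate is missing or fails.
async function readStorageEstimate(nav) {
  if (!nav || !nav.storage || typeof nav.storage.estimate !== "function") {
    return null;
  }
  try {
    return await nav.storage.estimate();
  } catch {
    return null;
  }
}

// Combines both steps. In a browser, pass the global `navigator`.
async function needsNoCache(nav) {
  return shouldDisableCache(await readStorageEstimate(nav));
}
```

On an embed page, the result of needsNoCache(navigator) could then decide whether the noCache attribute is set on the player element before it is added to the document.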

Alternatively, always using noCache may be considered an acceptable trade-off, if bandwidth usage matters less than reliability for your use case.

Storing and serving archive files

Retrieving and parsing archive files directly within the browser means that client-side constraints now apply to this set of operations. The following recommendations focus on the use case of serving archive files over HTTP for use with replayweb.page, or similar client-side playback solutions.


Replayweb.page uses the Fetch API to download archive files, which enforces the Cross-Origin Resource Sharing policy. Pointing replayweb.page's source attribute at a resource hosted on a different domain will trigger a preflight request, which will fail unless the target file bears sufficiently permissive CORS headers:

  • Access-Control-Allow-Origin should at least allow the embedder's origin.
  • Access-Control-Allow-Methods should allow HEAD and GET.
  • Access-Control-Allow-Headers should be permissive.
  • Access-Control-Expose-Headers should include headers needed for range request support, such as Content-Range and Content-Length. Content-Encoding should likely also be exposed.
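As a rough illustration of the header set listed above, here is a small helper that a file server or proxy could use to build CORS response headers. The function name, the allow-list argument, and the example origin are hypothetical. The "*" value is one way to read "permissive", and the Vary header is an addition of ours (commonly sent when the allowed origin is echoed back) rather than something the list above requires.

```javascript
// Builds CORS response headers for an archive file request.
// `allowedOrigins` is a hypothetical allow-list of embedder origins.
function corsHeadersFor(requestOrigin, allowedOrigins) {
  if (!allowedOrigins.includes(requestOrigin)) {
    return {}; // unknown origin: send no CORS headers, the browser will block
  }
  return {
    "Access-Control-Allow-Origin": requestOrigin,
    "Access-Control-Allow-Methods": "HEAD, GET",
    // "*" is one permissive choice; a stricter deployment could list headers.
    "Access-Control-Allow-Headers": "*",
    // Lets the player read the headers it needs for range requests.
    "Access-Control-Expose-Headers":
      "Content-Range, Content-Length, Content-Encoding",
    // Not from the list above: tells caches the response varies by origin.
    "Vary": "Origin",
  };
}

// Example: the embed page lives on https://warc.example.com (assumed).
const exampleHeaders = corsHeadersFor("https://warc.example.com", [
  "https://warc.example.com",
]);
```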


Archive files generally need to explicitly state their MIME type for the player to properly identify them. We recommend populating the Content-Type headers with the following values when serving archive files:

  • application/x-gzip for .warc.gz files
  • application/wacz for .wacz files
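The mapping above is simple enough to express as a lookup on the file name. The helper below is only a sketch with a hypothetical name; returning null for unknown extensions is our choice, leaving the fallback to the server.

```javascript
// Returns the Content-Type recommended above for a given archive file name,
// or null when the extension is not one of the two covered here.
function archiveContentType(filename) {
  const name = filename.toLowerCase();
  if (name.endsWith(".warc.gz")) return "application/x-gzip";
  if (name.endsWith(".wacz")) return "application/wacz";
  return null;
}
```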

Support for range requests and range-request caching

Replayweb.page makes extensive use of HTTP range requests to more efficiently retrieve resources from a given archive without having to download the entire file. This is especially true for wacz files, which were designed specifically for this purpose.

As a result, and although there is a "standard" fallback mode for warc.gz files in replayweb.page, servers hosting files for client-side playback should support range requests, or go through a proxy to compensate for the absence of that feature.

That shift from single whole-file HTTP requests to a large number of partial HTTP requests may have an impact on billing with certain cloud storage providers. Although this problem is likely vendor-specific, our experiments so far indicate that using a proxy-cache may be a viable option to deal with the issue.

That said, caching range requests efficiently is notoriously difficult and implementations vary widely from provider to provider. To our knowledge, for the use case of client-side web archive playback, slice-by-slice range request caching appears to be the most efficient approach.
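One way to read "slice-by-slice" caching is that a proxy splits the file into fixed-size, aligned slices, fetches and caches each slice independently, and assembles the client's response from them. The sketch below only illustrates the alignment arithmetic under that assumption. The 1 MiB slice size and the function names are hypothetical, and only simple "bytes=start-end" headers are handled.

```javascript
// Hypothetical slice size (1 MiB); real deployments tune this value.
const SLICE_SIZE = 1024 * 1024;

// Parses a single closed range of the form "bytes=start-end".
// Open-ended ("bytes=100-") and suffix ("bytes=-500") forms are not handled.
function parseRange(header) {
  const match = /^bytes=(\d+)-(\d+)$/.exec(header);
  if (!match) return null;
  const start = Number(match[1]);
  const end = Number(match[2]);
  return start <= end ? { start, end } : null;
}

// Lists the aligned slices that cover the inclusive range [start, end].
// Each slice is a stable cache key, however clients cut their requests.
// The last slice may extend past the end of the file; a real proxy would
// clamp it to the file size.
function slicesFor({ start, end }, sliceSize = SLICE_SIZE) {
  const slices = [];
  const first = Math.floor(start / sliceSize);
  const last = Math.floor(end / sliceSize);
  for (let i = first; i <= last; i++) {
    slices.push({ start: i * sliceSize, end: (i + 1) * sliceSize - 1 });
  }
  return slices;
}
```

With a slice size of 10 bytes, for instance, a request for bytes 5 to 25 maps to the three slices 0-9, 10-19, and 20-29, and any later request overlapping those slices can be answered from the cache.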

Other recommendations

No playback outside of cross-origin iframes

As a way to ensure that an archive replay is not taken out of context and that it is executed in a cross-origin iframe, we recommend checking that properties of parent.window are not accessible before injecting <replay-web-page> in the document.
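A check along these lines could look like the sketch below. It relies on the browser throwing a SecurityError when a script reads a protected property of a cross-origin parent, here location.href. The function name is hypothetical, and the window object is passed in as a parameter so the logic can be exercised outside a browser.

```javascript
// Returns true only when `win` is framed by a parent whose properties are
// not readable, which is what the same-origin policy enforces for
// cross-origin parents. In a browser, pass the global `window`.
function isInCrossOriginIframe(win) {
  if (win.parent === win) {
    return false; // top-level document: not embedded at all
  }
  try {
    // Reading location.href on a cross-origin parent throws a SecurityError.
    void win.parent.location.href;
    return false; // readable, so the parent is same-origin
  } catch {
    return true;
  }
}
```

The embed page would then only inject <replay-web-page> into the document when isInCrossOriginIframe(window) returns true.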

Replayweb.page and Apple Safari

What appears to be a bug in the way certain versions of Safari handle state partitioning in Web Workers spawned from Service Workers in the context of cross-origin iframes may cause replayweb.page to freeze.

This problem should be fixed in Safari 16. In the meantime, we recommend using replayweb.page's noWebWorker option with problematic versions of Safari, which can be identified in JavaScript by the presence of window.GestureEvent and the absence of window.SharedWorker.
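The feature test described above fits in a single expression. In this sketch the function name is hypothetical and the window object is passed in as a parameter so the check can be tried with stand-in objects.

```javascript
// True for browsers that expose GestureEvent but not SharedWorker, the
// combination used above to spot the affected Safari versions.
// In a browser, pass the global `window`.
function needsNoWebWorker(win) {
  return "GestureEvent" in win && !("SharedWorker" in win);
}
```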

Warc-embed: our experimental boilerplate

Warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore the recommendations described in this article. It consists of: a basic web server configuration for storing, proxying, caching and serving web archive files; and a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file.

Source code and documentation on GitHub: https://github.com/harvard-lil/warc-embed.

These notes have been compiled as part of a new chapter exploring this technology, but the foundation of our insight was built long ago by Rebecca Cremona as she spearheaded the integration of client-side playback into Perma.cc.

2022-10-07 update: We're happy to report that version 1.7.0 of Webrecorder's replayweb.page implements some of the recommendations outlined in this blog post.


  • Automatically using noCache mode in browsers that do not support the StorageManager.estimate API.
  • Automatically using noWebWorker mode for Safari 15 and older.
  • Addition of an optional requireSubDomainIframe attribute to ensure the player won't start unless it's embedded in a cross-origin <iframe>.

What Legal Hackers Can Learn From Libraries

This is a lightly edited transcript of a talk I gave at the 2022 Legal Hackers International Summit on September 10, 2022.

Hello, everyone! I'm Jack Cushman. I'm the director of the Harvard Library Innovation Lab.

Jameson encouraged us to include a big idea in these talks. And we're here at Legal Hackers, whose mission is to work on "the most pressing issues at the intersection of law and technology."

So the big idea I wanted to bring to you as legal hackers is: the most pressing issue at the intersection of law and technology is that we don't know how to have a civilization anymore.

Larry Lessig famously said that what's at the intersection of law and technology is us: we're this pathetic dot at the middle, being regulated by law, by tech, by markets, by norms.

And the Internet has disrupted all of those! It's made all of those start to regulate us in much faster, less predictable ways. So we're now exploring what it means to be a civilization, what our options are, much faster than we ever did before, and we don't know if any of that works yet.

We don't know if we can have a civilization in the presence of the Internet yet.

What it means to have a job is changing incredibly fast right now. We can no longer assume that the same kind of jobs will exist at the end of our careers as the start of our careers.

What it means to form a consensus truth is changing incredibly fast right now.

What it means to choose a government is changing incredibly fast right now, and we don't know if it works yet.

What I want to bring to you beyond that moment of panic is to say, hey, I work at a library.

I work at a law library and I want all of you legal hackers, all of us legal hackers who are reinventing how the world works — that's what legal hacking is! — to steal more from libraries. Steal more ideas from libraries.

Ideas like, libraries are places that help us remember who we are, and they help us remember generationally. They help us remember, at a scale of decades and centuries, who we are and where we came from and where we're going. Steal that idea.

Libraries, especially public libraries, are the places of last resort where you go when you just don't know what to do next. Whether you're in a domestic violence situation or you don't know how to file your taxes or you just don't know what to read next, libraries are places with a person with an ethical commitment to help you out as best they can. It's an extraordinary resource. Let's borrow that idea.

Libraries are an essential part of the speech network that we maintain as societies. Even a tiny town will pay to have a public library, because the public library is a core part of how we form consensus truth. We need to pay attention to those networks that help tell us who we are.

Libraries are little anti-capitalist experiments! You have your economy working along in whatever way it does, and then within the walls of the library they're like, "it all works differently in here! Let's try this other thing for a while!" Whatever economy you're in, libraries are a chance to try something else to experiment and learn. They help you stabilize the change that's happening in your society by experimenting.

And libraries are places that think about citizens and not consumers or users. Libraries call you "patrons." And what we mean by patrons is sort of like citizens of your community — not citizens on a government list, but in the sense of people who are part of this community that we're trying to build, people who are part of our civic infrastructure.

That's how your library sees you.

They don't see you as a user, they don't see you as a resource to exploit. They see you as someone they can help be whatever it is you're trying to be.

We need to borrow that idea.

We need to borrow all those ideas because, after fifty years of the internet, libraries are the one information technology I know of that actually scales. Meaning, the more it grows the more it helps knit your social fabric together instead of tearing it apart. [OK, I didn't say this line in the talk, but I meant to.]

If we are to answer this pressing question of, like, "can we have civilization together anymore," now that we can all talk to each other all the time and don't know what to say — if we are to answer that, I think libraries are one of the core tools that we can use to do it.

And since I'm here from a library, I wanted to pass that along.

That was only three minutes and 45 seconds. So let me tell you very quickly a few of the things that I would love to talk with you about that we're working on at the Harvard Library Innovation Lab, and the very small part of the "saving civilization" problem that we're thinking about:

How do we collaboratively update the legal curriculum? I mean questions like, how do we teach criminal law? We have to start moving faster and including more people in that question. Tools like our Open Casebook platform can help professors collaboratively decide what to teach.

How do we make core legal data open and computable — like our Caselaw Access Project, which scanned all of the precedential legal cases in the United States. And what happens when we do, and who gets exposed, and is that good or bad or both?

How do we preserve data for the next fifty years? The internet is only fifty years old and we don't know if we can remember things from generation to generation yet. Websites break within months of posting them; they need constant maintenance. We need to make websites that last for decades. We need to make data that lasts for centuries. Let's figure out how to do that together.

We're thinking about how to get more people included in that cultural record. The question of whether you are remembered, whether you are part of that generational memory the libraries offer, has always depended on how legally precarious you are. I'm thinking of examples like the sex worker advocacy movement that responded to the SESTA-FOSTA debate, that is now at risk of being forgotten already because the platforms where the movement happened were removed by the law that the movement was about. What gets remembered in the record depends a lot on who you are, and the law has a lot to say about that, and technology does too. So we're thinking about those sorts of precarious archives that are legally in danger.

And we're thinking about, how do we help internet communities grow into civic communities?

As we move from, "my people are on Main Street, my civic life is on Main Street, my civic sustenance is on Main Street," to where my people are in a Slack group, or maybe they're a group of people I talk to on Twitter, but maybe they don't talk to each other — there's a sense of hollowness that comes from what we left behind, and haven't figured out how to bring along yet.

I get to think about that from the library perspective, because libraries are one of those core resources in a small town. I think they might be a core resource in our new civic life as well, in those Slack groups and the other ways that we build a civic society online — but libraries certainly are not the only one. What else does it take to build a government out of a pile of online communities, to build a people, a society, a civilization out of online communities?

Finally, since we are coming from a bunch of law schools, how do we involve students in this conversation? When we're teaching classes about innovation, beyond the design thinking stuff — which is really important, but it's just a tool they can use — what conversation are we trying to have with students about this saving-the-world stuff? Many of them won't just go out and work at law firms anymore, so what other perspectives should we be bringing to them?

So that's what's on my mind. Thank you so much.

IIPC 2022 Recap

This past week was the annual gathering of the International Internet Preservation Consortium. This year, the event was hosted online by the Library of Congress, and we were excited to be able to attend sessions from folks all over the globe.

The programming will be available in full in a couple of weeks (we will send the links out with our next newsletter!), but here are some highlights from the live event that we think our community would find particularly relevant:

Arquivo404: This project from Portuguese archive Arquivo uses Memento protocols to allow website administrators to back up pages with various web archives. "This presentation will show use cases of the Arquivo404 service, detail the technologies it uses and provide some insight on the configurations it allows, namely the addition of other web archives for the search"

Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives: Our friends from the Internet Archive and Web Science & Digital Libraries Research Group at Old Dominion University have been conducting research on the speed of archival replay. "We discovered that some replayed web pages cause recurrent requests that lead to unnecessary traffic for the web archive. We looked at the network traffic on numerous archived web pages and found, for example, an archived page that made 945 requests per minute on average."

WARC Collection Summarization: We send copies of our Perma collection to the Internet Archive as part of our preservation plan - and have worked with the team at the Internet Archive to optimize the way that we share our collection. This presentation is by our collaborator on their team, and is related to our work together. "Items in the Internet Archive’s Petabox collections of various media types like image, video, audio, book, etc. have rich metadata, representative thumbnails, and interactive hero elements. However, web collections, primarily containing WARC files and their corresponding CDX files, often look opaque. We created an open-source CLI tool called 'CDX Summary' to process sorted CDX files and generate reports."

The Evolving Treatment of Wayback Machine Evidence by U.S. Federal Courts: Friend of LIL Nicholas Taylor took a deep dive into how U.S. federal courts have been evaluating the efficacy of Wayback Machine content for use in court. This chart outlines the four different ways that lawyers have argued for the use of a web archive as evidence:

[Chart: web archive evidentiary success]

Keep an eye out for recordings of the full sessions as well as Q&A sessions! Thanks to IIPC and the Library of Congress for pulling all of this together!

2021 Research Associates

Like most things at LIL, our visiting researcher program has taken many forms over the years. This year, despite our team being spread across the East and Midwest Coasts (shout out to Lake Michigan) we were thrilled to welcome five research associates to the virtual LILsphere, to explore their interests through the lens of our projects and mission.

In addition to joining us for our daily morning standups, RAs attended project meetings and brainstorming sessions, and had access to all of the resources the Harvard Library system has to offer. Their individual research was based on questions they had or ideas they wanted to explore in the realm of each of our three tentpole projects: the Caselaw Access Project, H2O, and Perma.cc.

Each of our visitors tackled an exceptionally interesting corner of our work; some helped propel us forward in terms of platform functionality, others prompted us to reconsider some of our base assumptions around our users. They produced things from new software features to teaching materials, design briefs, and research documentation. Below are brief descriptions of their work and links to their individual outputs.

Rachel Auslander

Using technology to empower research and information access is a central tenet of the LIL mission. Another value we have as a group is that of collaboration. This summer, Rachel explored what it would mean to be able to fuse external datasets into CAP via metadata in a way that would bring context and texture to caselaw.

Her design brief, which will guide future LILers in integrating these ideas into the CAP interface, can be viewed here.

Ashley Fan

We got double the fun from Ashley this summer! Initially, she was interested in working on collections of caselaw that would empower journalists on various beats to apply a legal lens to their writing. Using a new feature available from CAP Labs, Ashley put together a series of Chronolawgic timelines for three different beats: education, health, and environment.

You can read her post about all of these timelines and find links to them here.

Then, in true LIL fashion, Ashley found herself swept up in an interesting problem that happened to come up during her time with the team. The power of the CAP dataset is that it makes accessing caselaw exponentially easier, but caselaw, by nature, can contain sensitive content about individuals involved in specific cases. This tension often manifests itself in requests by those individuals to remove their information from our database of cases, and Ashley jumped in alongside our team to research and formalize a process for decision-making and action.

Follow this link to learn more about this question, and Ashley's research.

Andy Gu

The scope of possibilities surrounding the Caselaw Access Project is so vast, we're really just starting to see how it can change the way scholars look at and study the law. This summer, Andy worked to create further flexibility in our built-in visualization features and expand users' ability to explore trends, particularly in relation to an extremely important aspect of the law: inter-case citation.

In a series of blog posts, Andy sets out how he extended the Trends tool using the Cases endpoint of the API; a powerful application of a new feature; and the design work that was done to integrate these upgrades into the general search interface of CAP.

Adaeze Ibeanu

Undergraduate curricula were the focus of Adaeze's summer. Where and how is the law taught to students who aren't explicitly attending law school? Via a thorough survey of undergraduate curricula and conversations with students, Adaeze presented our team with a summary of legal teaching in an undergraduate setting, and took a deeper dive into legal teaching in the social and natural science fields. Her research explored the potential impact of legal texts and open educational resources in completely new settings.

Aadi Kulkarni

Since 2018, our team has been integrating primary legal documents, including caselaw and the U.S. Code, directly into H2O, our open casebook platform, to make the creation of legal teaching materials even more seamless and powerful. This summer, Aadi continued that work by exploring ways in which H2O could include state code in a casebook—extending content capabilities for all of our users. Along the way, Aadi learned a lot about open-source communities and the process of integrating public materials into our platform.

If you're interested in our visiting research opportunities, make sure to follow us on Twitter. You should also feel free to reach out to us at lil@law.harvard.edu!

Interface Upgrade | Integrating Queries into Search and Case View

With expanded feature capabilities, users may find writing these queries to be more difficult, especially as researchers increase the complexity of their investigations. To make usage easier, we have integrated the Trends query language into the Search and Case View features. From a search query, users can click the Trends button, upon which our servers will automatically convert an existing query into a Trends timeline.

[Gif: search results converted into a Trends timeline]

Additionally, users can now view the citation history of a particular case from that case's page by clicking the "View citation history in trends" button.

[Gif: citation history of an individual case displayed on a Trends timeline]

Our exploration of timeline generation for empirical legal scholarship has inspired us to reimagine how people reason about CAP's corpus of American caselaw. In the future, we hope to restructure the search page further and empower people to quickly ask complex questions about American caselaw over time.

We believe that citation-based analysis can significantly enrich our understanding of American caselaw, and we are excited to see how these tools can expose insights both in the law itself and in quantitative techniques for its exploration. If you have any ideas for how we can further expand on these features, please do not hesitate to reach out to us at info@case.law.

This is part of a series of posts by Andy Gu, a visiting researcher who joined the LIL team in summer 2021. We were inspired to build these features after recognizing the power of the Caselaw Access Project's case and citation data to analyze and explore caselaw. We hope that these features will make empirical study of caselaw both faster and more accessible for researchers.

New Feature | Flexible Citation Queries

Expand your ability to visualize citation practices with the latest support added to our Trends tool. In addition to the existing ways of filtering cases, Trends now supports flexible queries about how cases cite other cases. By appending the name of any accepted filter parameter to the cites_to__ prefix, in the form cites_to__{parameter name here}, users can retrieve all cases that cite cases matching that filter. The parameter name, like before, can be any parameter accepted by the Cases API.

For instance, the following query graphs the number of cases that cite a case where Justice Cardozo wrote the majority opinion against the number of cases that cite a case where Justice Brandeis wrote the majority opinion.

comparison of majority opinion authors over time displayed on a graph
Figure 1 query: api(cites_to__author_type=cardozo:majority), api(cites_to__author_type=brandeis:majority)
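As a minimal sketch of how such a query is assembled, the snippet below builds the Figure 1 query string from its parts. The helper names (cites_to, api_term, trends_query) are hypothetical and exist only for illustration; they are not part of the Trends tool or the Cases API.

```python
def cites_to(param: str, value: str) -> str:
    """Return one filter of the form cites_to__<param>=<value>."""
    return f"cites_to__{param}={value}"


def api_term(*filters: str) -> str:
    """Wrap one or more filters in an api(...) term, joined by '&'."""
    return "api(" + "&".join(filters) + ")"


def trends_query(*terms: str) -> str:
    """Join several terms into one comma-separated Trends query."""
    return ", ".join(terms)


# Rebuild the Figure 1 query: one series per cited majority-opinion author.
figure_1 = trends_query(
    api_term(cites_to("author_type", "cardozo:majority")),
    api_term(cites_to("author_type", "brandeis:majority")),
)
print(figure_1)
```

Swapping author_type for another filter name gives the other cites_to__ variants, as long as the filter is one the Cases API accepts.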

The cites_to__ feature provides users the power to flexibly reason about case citation patterns. For instance, if a user were interested in how the Supreme Court of California cited authority from its own jurisdiction in comparison to authority from other jurisdictions, they could write the following query:

comparison of citations within jurisdiction versus outside displayed on a graph
Figure 2 query: api(court=cal-1&cites_to__jurisdiction__exclude=cal), api(court=cal-1&cites_to__jurisdiction=cal)

This set of parameters can be integrated with any other parameters compatible with the Cases API. For instance, we can filter the above timeline only to citations of cases that mention the term 'technology':

comparison of citations within jurisdiction versus outside filtered by topic displayed on a graph
Figure 3 query: api(court=cal-1&cites_to__jurisdiction__exclude=cal&cites_to__search=technology), api(court=cal-1&cites_to__jurisdiction=cal&cites_to__search=technology)
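Because the filters inside each api() term are Cases API parameters, a Trends query like the one in Figure 3 can, in principle, be turned into Cases API request URLs. The sketch below does only the string conversion and makes no network request. The base URL is an assumption that does not come from this post, and the function name is hypothetical; check the Cases API documentation before relying on either.

```python
import re
from typing import List
from urllib.parse import parse_qsl, urlencode

# ASSUMPTION: base URL of the Cases API; not stated in this post.
CASES_API_URL = "https://api.case.law/v1/cases/"

# Matches the text between "api(" and the next ")".
API_TERM = re.compile(r"api\(([^)]*)\)")


def api_terms_to_urls(trends_query: str) -> List[str]:
    """Return one Cases API URL per api(...) term found in a Trends query."""
    urls = []
    for raw_params in API_TERM.findall(trends_query):
        # parse_qsl keeps parameter order and repeated parameter names.
        pairs = parse_qsl(raw_params, keep_blank_values=True)
        urls.append(CASES_API_URL + "?" + urlencode(pairs))
    return urls


figure_3 = (
    "api(court=cal-1&cites_to__jurisdiction__exclude=cal&cites_to__search=technology), "
    "api(court=cal-1&cites_to__jurisdiction=cal&cites_to__search=technology)"
)
for url in api_terms_to_urls(figure_3):
    print(url)
```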

Users may also use the parameters within the api() tag to query the Cases API directly. One caveat to the cites_to__ feature: if more than 20,000 cases fulfill a cites_to__ condition, our system will randomly select 20,000 of those cases to match against. For more information about all the parameters we support, please feel free to consult our Cases API documentation.
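The 20,000-case cap can be pictured with a short sketch. It assumes uniform random sampling without replacement, which the post does not specify, and the function and constant names are hypothetical rather than taken from the CAP codebase.

```python
import random

MAX_CITED_CASES = 20_000  # cap described above


def cap_cited_cases(case_ids, limit=MAX_CITED_CASES, rng=None):
    """Return every id when within the limit, else a random sample of `limit` ids.

    ASSUMPTION: uniform sampling without replacement; the real system's
    sampling method is not described in the post.
    """
    rng = rng or random.Random()
    ids = list(case_ids)
    if len(ids) <= limit:
        return ids
    return rng.sample(ids, limit)


small = cap_cited_cases(range(5))       # under the cap: returned unchanged
large = cap_cited_cases(range(25_000))  # over the cap: sampled down
print(len(small), len(large))
```

Under this reading, a very broad cites_to__ condition is matched against a sample rather than the full set, so the resulting counts may be approximate.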

If you're interested in exploring this data in a different way, make sure you've checked out Cite Grid.

This is part of a series of posts by Andy Gu, a visiting researcher who joined the LIL team in summer 2021. We were inspired to build these features after recognizing the power of the Caselaw Access Project's case and citation data to analyze and explore caselaw. We hope that these features will make empirical study of caselaw both faster and more accessible for researchers.

Feature Update | Extension of Trend Search Capability

Today, we are announcing an update to the Caselaw Access Project (CAP) API and Trends tool to help users better investigate changes in the law over time. These new features enable users to easily generate timelines of cases and explore patterns in case citations. We hope that they can help researchers uncover new insights about American caselaw.

Previously, the project's Historical Trends tool permitted users to graph word and phrase frequencies in cases over time. For instance, the following graph displays the frequency of the terms 'lobster' and 'gold' over time in cases in Maine and California.

historical trends results displayed on graph
Figure 1 query: me: lobster, cal: gold

We have extended the Trends tool so that users can generate timelines of cases for any parameter accepted by the Cases API endpoint. As a result, users can ask broad questions about the Caselaw Access Project's dataset and quickly retrieve timelines of cases that follow the queried pattern.

For instance, the following query presents timelines of cases which cite Mapp v. Ohio since 1961, split by jurisdiction.

query results displayed on graph
Figure 2 query: *: api(cites_to=367 U.S. 643)
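The example queries so far suggest a simple shape: comma-separated series, each with an optional jurisdiction prefix (such as me, cal, or *) followed by a word, phrase, or api(...) term. The sketch below parses that shape. It is inferred from the examples in this post only, not from the Trends tool's actual grammar, and the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# A prefix is a short token (letters, digits, '*', '-') followed by ':'.
PREFIX = re.compile(r"^([\w*-]+)\s*:\s*(.+)$")


def split_top_level(query: str) -> List[str]:
    """Split on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_series(query: str) -> List[Tuple[Optional[str], str]]:
    """Return (jurisdiction prefix or None, term) for each series."""
    series = []
    for part in split_top_level(query):
        match = PREFIX.match(part)
        if match:
            series.append((match.group(1), match.group(2)))
        else:
            series.append((None, part))
    return series


print(parse_series("me: lobster, cal: gold"))
print(parse_series("*: api(cites_to=367 U.S. 643)"))
```

Colons inside an api(...) term, as in author_type=scalia:dissent, are not mistaken for a prefix because the prefix token must be followed directly by the colon.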

The breadth of available filters drastically increases the number of ways a researcher can explore case data. For example, we can use the author_type parameter in the Cases API to graph the number of cases where Justice Scalia wrote a dissenting opinion against the number of cases where he wrote a majority opinion. By clicking into the timeline, users can retrieve granular information about the qualifying cases.

results filtered by author, displayed on a graph
Figure 3 query: api(author_type=scalia:dissent), api(author_type=scalia:majority)

The power of this flexible query language increases with each parameter supplied to the Trends query. If a user wanted to compare the frequency of Supreme Court cases where Justice Scalia dissented and Justice Breyer wrote the majority opinion with cases where Justice Breyer dissented and Justice Scalia wrote the majority opinion, they could draft the following search:

graphed results of specific opinion author queries
Figure 4 query: api(author_type=scalia:dissent&author_type=breyer:majority&court=us), api(author_type=scalia:majority&author_type=breyer:dissent&court=us)
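The Figure 4 query repeats the author_type parameter within a single api() term. As a hedged sketch, the snippet below assembles that query from lists of (name, value) pairs using Python's standard urlencode. A list of pairs is used instead of a dict because a dict cannot hold the same key twice. The helper name encode_api_term is hypothetical.

```python
from urllib.parse import urlencode


def encode_api_term(filters):
    """Encode (name, value) pairs as an api(...) term, leaving ':' unescaped."""
    return "api(" + urlencode(filters, safe=":") + ")"


# Scalia dissenting from a Breyer majority opinion, Supreme Court only.
scalia_dissents = encode_api_term([
    ("author_type", "scalia:dissent"),
    ("author_type", "breyer:majority"),
    ("court", "us"),
])

# Breyer dissenting from a Scalia majority opinion, Supreme Court only.
breyer_dissents = encode_api_term([
    ("author_type", "scalia:majority"),
    ("author_type", "breyer:dissent"),
    ("court", "us"),
])

figure_4 = ", ".join([scalia_dissents, breyer_dissents])
print(figure_4)
```

Each extra pair narrows the set of matching cases, so a third or fourth filter can be added to the same list without changing the rest of the code.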

We have also updated our underlying database to allow users to reason over the citation patterns of individual opinions, in addition to those of the case as a whole. If a user wanted to see how many times Justice Scalia specifically cited Mapp v. Ohio in an opinion, they could do so with the following query:

number of times a case was cited by a specific author over time, displayed on a graph
Figure 5 query: api(author__cites_to_id=1785580&author=scalia), api(author__cites_to_id=1785580&author=breyer)

We believe that these features will empower researchers to quickly conduct rich explorations of American caselaw, and we are excited to see how they can expose new insights about our corpus of cases. If you have any ideas for how we can further expand on these features, please do not hesitate to reach out to us at info@case.law.

This is part of a series of posts by Andy Gu, a visiting researcher who joined the LIL team in summer 2021. We were inspired to build these features after recognizing the power of the Caselaw Access Project's case and citation data to analyze and explore caselaw. We hope that these features will make empirical study of caselaw both faster and more accessible for researchers.

Download PDFs of Cases by Citation with CAP

Today we're announcing Fetch PDFs, a simple tool to find case citations in text and give you links to scanned PDFs of those cases from CAP.

Why is this helpful? Courts and law reviews often use print case reporters to confirm exact quotes from legal citations. For people who don't have print reporters — or don't have easy access to them from home — doing this kind of cite checking can be a challenge.

Fetch PDFs lets you extract case citations from your text and read scanned PDFs of those cases or download them all as a zip file. Our PDFs come from print case reporters from the collections of Harvard Law School Library.

Here's how it works! You can start by adding your own text or list of citations. We'll use a snippet from Miranda v. Arizona:

Screenshot of Fetch PDFs showing text box containing excerpt from Miranda v. Arizona.

Select "Find Citations" to show all cases cited in your text:

Screenshot of Fetch PDFs showing the list of cases cited in the excerpt, and the option to download PDFs of those cases as a zip file.

Click the case name of any case to read it as HTML, or click "PDF" to go right to the PDF. Click "Download Zip" to download all of the selected cases.

We want to hear from you! Do you have ideas, stories, or feedback about using CAP for cite checking, access to print reporters, and more? We're looking forward to your message.