AI Projects at LIL

These projects are all in progress. If you’d like to keep up with them please subscribe to our newsletter.

Generative AI offers a defining moment where information archives gain the ability to answer questions about themselves; to act without human intervention; and even to simulate the processes that created them. From our perspective at the lab - where we bring library principles to technological frontiers - anyone who is concerned with information, access, and the law should be aware of and feel empowered to influence the use of generative AIs. We have embarked upon a series of projects in support of that belief.

Collaborative Open Legal Data (COLD) Dataset

Today we are proud to announce COLD Cases, a research data set to support the open law community.

The legal nonprofit Free Law Project maintains a wide variety of web crawlers to collect and publish an ever-growing dataset of public domain law at CourtListener.com.

We have collaborated with FLP to release COLD Cases as a dataset suitable for machine learning available on HuggingFace along with a Data Nutrition Label which explains the source of the data and gives guidelines for ethical use. The result is information about over 8 million court cases available for batch processing in the context of data science, AI experiments, and legal tool building.

One Million Books

Building on our work on the COLD Dataset, we are working with the Harvard Libraries to identify more public domain datasets that can support computational research and AI experimentation for the public good. One project underway would release a large cache of digitized books and accurate metadata, in collaboration with library technologists and with researchers exploring data set consent and bias issues.

Caselaw Access Project and the Right to Access Edicts of Law

In 2017, the Harvard Law Library kicked off the Caselaw Access Project (CAP), by digitizing 360 years of U.S. case law. The Library Innovation Lab (LIL) transformed it into the largest U.S. legal database of its time. Much of this data is public and discoverable for free on LIL's site case.law.

At the end of February 2024, this collection will go fully open as a dataset structured for reading by humans and machines. We are collaborating with friends at LexisNexis, Fastcase, and the Free Law Project to ensure the maximum social benefit of this launch. We expect part of the launch will include a public declaration and call for a human right to access the edicts of law.

Open French Law Chatbot

This experiment explores whether an open-source LLM (Llama 2) can act as both translator and French law expert if provided with the entirety of French codes as a vector database. The case study will delve into the potential and limitations of open-source LLMs and associated toolchains, and to what extent embedding models and vector databases can augment and “ground” the responses provided by LLMs. In addition, LIL will publish the embeddings database containing the entirety of French law.

Model Academic Research Agreement

We have drafted and are seeking initial partners for a Model Academic Research Agreement, which would make it easier, faster, and safer for academic researchers to provide advice on unreleased products and models, benefiting both academia and industry.

Library Data Trust

Research into speculative funding structures for libraries and their collections. The Library Data Trust would set up a structure whereby libraries and archives digitize their collections and make them available for a fee to commercial users and for free to everyone else. The goals would be to make more collections available to more people while securing the financial future of libraries.

LLMs and Book Ban Benchmark

Are LLMs champions of the freedom to read? The impetus of this experiment was the news that an Iowa school district relied on ChatGPT’s responses to determine which books to remove from the library. We asked five different LLMs to provide a justification for removing Toni Morrison’s The Bluest Eye from library shelves with the objective of demonstrating how “guardrails” differ among models and assessing the impact of altering the temperature parameter. About 75% of the responses across all five LLMs justified removing the book from library shelves. The case study will be published as part of Banned Books Week in October.

LIL Vector

lil_vector is our community server for experimenting with generative AI technology. Located in LIL space, this machine learning server allows us to freely explore the potential of open-source AI models. But more than a shared compute resource, it is a nascent community hub on which technologists from LIL and beyond share resources and experiments, as we collectively make sense of this AI moment.

Harvard Library Innovation Lab AI Fund

To support this and other work, we have launched a LIL AI Fund to accept gifts and coordinate collaboration with law firms, legal technologists, foundations, and others working at the cutting edge of law, AI, libraries, and society. LexisNexis is the first supporter of the Fund; contact us to discuss participation.