AI Projects at LIL

Generative AI marks a defining moment in which information archives gain the ability to answer questions about themselves, to act without human intervention, and even to simulate the processes that created them. From our perspective at the lab - where we bring library principles to technological frontiers - anyone concerned with information, access, and the law should be aware of generative AI and feel empowered to influence how it is used. We have embarked upon a series of projects in support of that belief.

These projects are all in progress. If you’d like to keep up with them, please subscribe to our newsletter.

Collaborative Open Legal Data (COLD) Data Set

We are proud to announce COLD Cases, a research data set for the open law community.

The legal nonprofit Free Law Project maintains a wide variety of web crawlers to collect and publish an ever-growing data set of public domain law at CourtListener.com.

We have collaborated with FLP to release COLD Cases, a data set suitable for machine learning, on HuggingFace, along with a Data Nutrition Label that explains the source of the data and gives guidelines for ethical use. The result is information about more than 8 million court cases, available for batch processing in data science, AI experiments, and legal tool building.
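
For those who want to work with the data directly, here is a minimal sketch of loading COLD Cases with the Hugging Face datasets library; the dataset identifier and streaming option shown are assumptions, so check the dataset card on HuggingFace for the authoritative details.

    # Minimal sketch: stream COLD Cases records with the Hugging Face datasets
    # library. The dataset identifier below is an assumption; the dataset card
    # on HuggingFace is the authoritative reference.
    from datasets import load_dataset

    # Streaming avoids downloading all 8 million+ cases up front.
    cold_cases = load_dataset("harvard-lil/cold-cases", split="train", streaming=True)

    # Inspect a few records; each one is a plain Python dict of case fields.
    for i, case in enumerate(cold_cases):
        print(case)
        if i >= 2:
            break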

One Million Books

Building on our work with the COLD Data Set, we are working with the Harvard Libraries to identify more public domain data sets that can support computational research and AI experimentation for the public good. One project underway would release a large cache of digitized books and accurate metadata, in collaboration with library technologists and researchers exploring consent and bias issues.

Caselaw Access Project and the Right to Access Edicts of Law

In 2017, the Harvard Law Library kicked off the Caselaw Access Project (CAP) by digitizing 360 years of U.S. case law. The Library Innovation Lab transformed it into the largest U.S. legal database of its time. Much of this data is public and discoverable for free on LIL’s site case.law.

At the end of February 2024, this collection will become fully open as a data set structured for reading by humans and machines. We are collaborating with friends at LexisNexis, Fastcase, and the Free Law Project to ensure the maximum social benefit of this launch. We expect part of the launch will include a public declaration and call for a human right to access the edicts of law.

Open French Law Chatbot

This experiment explores whether open-source Large Language Models (LLMs) can — with the help of Retrieval Augmented Generation (RAG) techniques — be used as both translators and jurists when provided with a foreign law knowledge base.

This project explores the potential and limitations of this emerging technology to improve access to law across language barriers, and will result in:

  • A case study putting this concept to the test with the help of our custom, multilingual RAG pipeline
  • An AI-ready open French law dataset

We want academics and nonprofits at the table, discovering the next generation of legal interfaces and helping to close the justice gap. It is not at all clear yet which legal AI tools and interfaces will work effectively for people with different levels of skill, what kinds of guardrails they need, and what kinds of matters they can help with.

That’s why we’re releasing OLAW, a common framework for scholarly researchers to build novel interfaces and run experiments. In technical terms, OLAW is a simple, well-documented, and extensible framework for legal AI researchers to build services using tool-based retrieval-augmented generation.
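
As a rough illustration of the pattern - not OLAW’s actual interfaces - the sketch below shows tool-based retrieval-augmented generation in miniature: the model first turns a question into a search query, a retrieval tool runs that query, and the model then answers from the retrieved excerpts. The complete() and search_case_law() helpers are hypothetical placeholders for an LLM client and a legal search API.

    # Hypothetical sketch of tool-based retrieval-augmented generation, the pattern
    # OLAW is built around. complete() and search_case_law() are placeholders for
    # an LLM client and a legal search API; they are not OLAW's actual interfaces.

    def complete(prompt: str) -> str:
        """Placeholder: send the prompt to an LLM and return its text output."""
        raise NotImplementedError

    def search_case_law(query: str) -> list[str]:
        """Placeholder: query a legal search API and return text excerpts."""
        raise NotImplementedError

    def answer(question: str) -> str:
        # 1. Tool call: ask the model to turn the question into a search query.
        query = complete(f"Write a case law search query for this question:\n{question}")

        # 2. Retrieval: run the search tool.
        excerpts = search_case_law(query)

        # 3. Generation: answer using only the retrieved excerpts, citing them.
        context = "\n\n".join(excerpts)
        return complete(
            "Answer the question using only the excerpts below, and cite them.\n\n"
            f"Excerpts:\n{context}\n\nQuestion: {question}"
        )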

More info: Release Post, OLAW on GitHub

WARC-GPT

Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections?
That question, part of our ongoing exploration of how artificial intelligence changes our relationship to knowledge, led us to develop and release WARC-GPT: an experimental, open-source Retrieval Augmented Generation tool for exploring collections of WARC files using AI.
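
For a sense of what the indexing step behind such a tool can look like - this is a generic sketch, not WARC-GPT’s actual code - the snippet below reads records from a WARC file with warcio, keeps the HTML responses, and embeds their text with sentence-transformers for later vector search. The library choices and model name are assumptions made for illustration.

    # Generic sketch of indexing a web archive for retrieval augmented generation:
    # read WARC records, keep HTML responses, and embed their text for vector
    # search. Not WARC-GPT's actual code; warcio, sentence-transformers, and the
    # model name are assumptions made for illustration.
    from warcio.archiveiterator import ArchiveIterator
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    texts, urls = [], []
    with open("collection.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if "text/html" not in record.http_headers.get_header("Content-Type", ""):
                continue
            texts.append(record.content_stream().read().decode("utf-8", errors="ignore"))
            urls.append(record.rec_headers.get_header("WARC-Target-URI"))

    # These embeddings can then go into a vector store and be queried alongside an LLM.
    embeddings = model.encode(texts)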

More info: Release Post, WARC-GPT on GitHub

Model Academic Research Agreement

We have drafted and are seeking initial partners for a Model Academic Research Agreement, which would make it easier, faster, and safer for academic researchers to provide advice on unreleased products and models, benefiting both academia and industry.

LLMs and Book Ban Benchmark

Are LLMs champions of the freedom to read? The impetus for this experiment was news that an Iowa school district had relied on ChatGPT’s responses to determine which books to remove from its library. We asked five different LLMs to provide a justification for removing Toni Morrison’s The Bluest Eye from library shelves, with the goals of demonstrating how “guardrails” differ among models and assessing the impact of altering the temperature parameter. About 75% of the responses across all five LLMs justified removing the book from library shelves. The case study was published prior to Banned Books Week in October 2023.
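
To give a concrete sense of what altering the temperature parameter looks like in practice, here is an illustrative sketch using OpenAI’s Python client as one example provider; the model name and prompt wording are placeholders, not the study’s actual protocol, and the study itself queried five different LLMs.

    # Illustrative sketch of repeating a prompt at different temperature settings,
    # using OpenAI's Python client as one example provider. The model name and
    # prompt wording are placeholders, not the case study's actual protocol.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = "Provide a justification for removing this book from a school library."

    for temperature in (0.0, 0.7, 1.0):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": PROMPT}],
            temperature=temperature,
        )
        print(f"temperature={temperature}:\n{response.choices[0].message.content}\n")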

Creative Writing LLMs

This year, post-doc Katy Gero is working at the Library Innovation Lab and the Variation Lab at Harvard SEAS to investigate the implications of large language models for the creative writing field. Over the course of her time with us, Katy will be researching what circumstances - if any - would lead literary writers to want their own work included as training data in a large language model. How would they want to be credited and/or compensated? Would they require restrictions on uses of the model? Are such restrictions feasible? What notion of consent is appropriate in this context?

She will be conducting interviews within literary communities and adjacent fields, with a secondary goal of producing and releasing a data set based on the findings from contributing authors. If time allows, the data set could be used to train a Transformer model and to begin investigating its utility compared to other available models.

LIL Vector

lil_vector is our community server for experimenting with generative AI technology. Located in LIL space, this machine learning server allows us to freely explore the potential of open-source AI models. More than a shared computing resource, it is a nascent community hub where technologists from LIL and beyond share resources and experiments as we collectively make sense of this AI moment.

Harvard Library Innovation Lab AI Fund

To support this and other work, we have launched a LIL AI Fund to accept gifts and coordinate collaboration with law firms, legal technologists, foundations, and others working at the cutting edge of law, AI, libraries, and society. LexisNexis is the first supporter of the Fund; contact us to discuss participation.