Generative AI marks a defining moment in which information archives gain the ability to answer questions about themselves, to act without human intervention, and even to simulate the processes that created them. From our perspective at the lab, where we bring library principles to technological frontiers, anyone concerned with information, access, and the law should be aware of how generative AI is used and feel empowered to influence it. We have embarked on a series of projects in support of that belief.
These projects are all in progress. If you’d like to keep up with them, please subscribe to our newsletter.
We are proud to announce COLD Cases, a research data set for the open law community.
We have collaborated with the Free Law Project (FLP) to release COLD Cases as a data set suitable for machine learning. It is available on HuggingFace along with a Data Nutrition Label, which explains the source of the data and gives guidelines for ethical use. The result is information about more than 8 million court cases, available for batch processing in data science, AI experiments, and legal tool building.
Building on our work with the COLD Cases data set, we are working with the Harvard Libraries to identify more public domain data sets that can support computational research and AI experimentation for the public good. One project underway would release a large cache of digitized books and accurate metadata, in collaboration with library technologists and researchers exploring consent and bias issues.
In 2017, the Harvard Law Library kicked off the Caselaw Access Project (CAP) by digitizing 360 years of U.S. case law. The Library Innovation Lab transformed it into the largest U.S. legal database of its time. Much of this data is public and discoverable for free on LIL’s site case.law.
At the end of February 2024, this collection will become fully open as a data set structured for reading by humans and machines. We are collaborating with friends at LexisNexis, Fastcase, and the Free Law Project to ensure the maximum social benefit of this launch. We expect part of the launch will include a public declaration and call for a human right to access the edicts of law.
This experiment explores whether open-source Large Language Models (LLMs) can — with the help of Retrieval Augmented Generation (RAG) techniques — be used as both translators and jurists when provided with a foreign law knowledge base.
This project explores the potential and limitations of this emergent technology to improve access to law across language barriers, and will result in:
Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections?
This is the question we asked ourselves as part of our ongoing exploration of how artificial intelligence changes our relationship to knowledge. It led us to develop and release WARC-GPT, an experimental open-source Retrieval Augmented Generation tool for exploring collections of WARC files using AI.
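For readers new to Retrieval Augmented Generation, the general idea can be sketched in a few lines of Python. This is a toy illustration, not WARC-GPT's code: the passages are made up, the function names (`retrieve`, `build_prompt`) are hypothetical, and a simple word-overlap score stands in for the embedding search a real tool would more plausibly use. It shows the two steps the technique is named for: retrieve the passages most relevant to a question, then place them in the prompt so the model's answer is grounded in them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=2):
    """Return the k passages whose word counts best match the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that puts the retrieved passages before the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Made-up passages standing in for text extracted from an archive.
passages = [
    "The library reading room opens at nine each morning.",
    "Web archives are commonly stored as WARC files.",
    "The cafeteria serves soup on Fridays.",
]

prompt = build_prompt("How are web archives stored?", passages, k=1)
print(prompt)
```

The resulting prompt would then be sent to a language model of your choice; that step is left out here because it depends on the model and API in use.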
We have drafted and are seeking initial partners for a Model Academic Research Agreement, which would make it easier, faster, and safer for academic researchers to provide advice on unreleased products and models, benefiting both academia and industry.
Are LLMs champions of the freedom to read? The impetus for this experiment was the news that an Iowa school district relied on ChatGPT’s responses to determine which books to remove from the library. We asked five different LLMs to provide a justification for removing Toni Morrison’s The Bluest Eye from library shelves. Our objectives were to demonstrate how “guardrails” differ among models and to assess the impact of altering the temperature parameter. About 75% of the responses across all five LLMs justified removing the book from library shelves. The case study was published prior to Banned Books Week in October 2023.
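As background on the temperature parameter: in most LLM sampling setups, the model's raw scores for candidate next tokens are divided by the temperature before being turned into probabilities. The sketch below illustrates that general mechanism only. The scores are invented, the function name is hypothetical, and it does not reproduce the settings or models used in the case study.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, so responses become more repeatable; higher temperatures flatten the distribution, so the same prompt can yield more varied responses.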
This year, post-doc Katy Gero is working at the Library Innovation Lab and the Variation Lab at Harvard SEAS to investigate the implications of large language models in the creative writing field. During her time with us, Katy will research which circumstances, if any, would lead literary writers to want their own work included as training data in a large language model. How would they want to be credited and/or compensated? Would they require restrictions on uses of the model? Are such restrictions feasible? What notion of consent is appropriate in this context?
She will be conducting interviews within literary communities and their adjacent fields, with a secondary goal of producing a data set and releasing it based on the findings from contributing authors. If time allows, the data set could be used to train a Transformer model and begin investigations into its utility compared to other available models.
lil_vector is our community server for experimenting with generative AI technology. Located in LIL space, this machine learning server allows us to freely explore the potential of open-source AI models. But more than a shared computing resource, it is a nascent community hub on which technologists from LIL and beyond share resources and experiments, as we collectively make sense of this AI moment.
To support this and other work, we have launched a LIL AI Fund to accept gifts and coordinate collaboration with law firms, legal technologists, foundations, and others working at the cutting edge of law, AI, libraries, and society. LexisNexis is the first supporter of the Fund; contact us to discuss participation.