Institutional Data Initiative

The Library Innovation Lab is preparing to launch the Institutional Data Initiative (IDI), a new effort to unlock and refine high-quality training data at library, academic, and government institutions across the world.

By bridging the gap between model-makers and institutions through a world-class data practice, IDI stands to increase access to knowledge, empower contributing institutions and the cultures they represent, and advance the state of the art for all builders of AI.

We are currently building partnerships with institutions and model-makers ahead of a public launch. Reach out to join us: lil@law.harvard.edu

Background

Research efforts like Textbooks Are All You Need are shining a light on the importance of data quality in model training, but acquiring a critical mass of high-quality datasets remains a challenge. While libraries and academic institutions have spent centuries cultivating exactly this type of information, the same institutional structures and studious practices that have been critical to preserving it for generations present a challenge for model-makers developing a rapidly expanding technology.

Harvard Library alone contains 20M volumes, 400M manuscripts, 10M photographs, and 1M maps, each of which was hand-selected for reasons of quality and uniqueness before being labeled for context, with most residing in the public domain. Of these, over 6M have already been digitized and made available for à la carte viewing by patrons. Similar digitization and browsing efforts, like Boston Public Library’s 1M-item Digital Commonwealth, exist at hundreds of university and public libraries across the world.

Because AI is poised to change the way society interacts with knowledge, these collections are not merely information on which models may be trained; they are critical snapshots of cultures and worldviews that deserve representation within them. But these corpora have yet to be made widely accessible in the shape and scale required for model training. Model-makers are each left to negotiate access with disparate institutional stakeholders, slowing access and leaving datasets inaccessible to the broader AI ecosystem. And once the data is finally acquired, each model-maker must further refine it for training. Institutions, meanwhile, are often understaffed and lack knowledge of the norms and needs of model-makers, leaving them with limited bandwidth, reservations about engaging with model-makers, and repeated inquiries for specific data preparation tasks they have yet to develop into repeatable pipelines.

About the Library Innovation Lab

As an organization at the intersection of libraries and technology, the Library Innovation Lab (LIL) at Harvard Law School Library is in a unique position to bridge this gap. When faced with these hurdles in accessing legal knowledge, LIL worked with the Harvard Law Library to launch the Caselaw Access Project (CAP), a multi-year effort in which 360 years of U.S. case law was scanned, parsed, and structured into a first-of-its-kind dataset. The resulting corpus of 7M cases and 17B tokens has become the backbone of every major legal training dataset.

Developing CAP required LIL’s rare breadth of knowledge, resources, and relationships to navigate institutional stakeholders, gain access to a one-of-a-kind corpus, work with legal domain experts on processing the data, and build technical work-streams for scanning, analysis, and hosting. Harvard’s reputation for excellence and LIL’s place as a trusted partner, guided by public interest library values and driven to apply them using new technologies, allow these skills to be applied beyond the walls of Harvard and to unlock corpora across the globe in meaningful ways.

LIL does not work alone. Its connection to the Berkman Klein Center facilitates access to a wide range of domain experts who can advise on processing corpora, while its relationship with the global Network of Centers provides an inroad to a vast array of institutions and their collections.

Next Steps

The Institutional Data Initiative is currently engaging institutions and model-makers to join as collaborators and contributing partners as we embark on this mission. Partner institutions will work with IDI to identify promising corpora within their collections for refinement and release, with the goal of accelerating and amplifying the mission of those institutions at each step of the process. Likewise, model-makers will contribute data refinement expertise, computational resources, and funding to enable this work. Reach out today to join us: lil@law.harvard.edu