AI Book Bans: Testing LLMs Against the Freedom to Read

What happens when large language models are asked to provide justifications for book bans? Do the same built-in guardrails that prevent them from generating pipe-bomb recipes kick in, or do models do their best to comply with the user’s request? How do models go about providing such justifications while navigating their “knowledge” of library principles? And what could this all mean for the future of our freedom to read?

AI promises to transform the way we navigate our increasingly complex world by augmenting our capacity to access and process information. The widely reported case of the Iowa school district that, under pressure from new state legislation and a tight compliance deadline, relied on ChatGPT’s answers to decide which books to remove from its library collections is a manifestation of a deeper, tectonic, and sometimes ill-informed shift in our relationship to knowledge that this AI moment is driving.

While the perceived affordances of AI can be alluring, the technology also carries inherent risks. These recent developments have inspired, if not alarmed, us, prompting an experimental study to address some of these increasingly pressing questions and to advocate for the emergence of a “Librarianship of AI” that emphasizes the necessity of testing, documenting, and reporting on the behavior of collections of models, guided by library principles.

Read more of "AI Book Bans: Testing LLMs Against the Freedom to Read"

Mysterious Search Algorithms

Lawyers use search algorithms on a daily or even hourly basis, but the way they work often remains mysterious. Users receive pages and pages of results from searches that ostensibly are based on some relevancy standard, seemingly guaranteeing that the most important results are all found. But that may not always be the case. This post explores the mystery of search algorithms from a legal research perspective. It examines what is wrong with algorithms being mysterious, explores our current knowledge of how they work, and makes recommendations for the future.

Read more of "Mysterious Search Algorithms"

“Did ChatGPT really say that?”: Provenance in the age of Generative AI.

Prompt: “Write a single sentence summarizing why cryptographically-signed provenance information matters in the context of AI-generated content.”

ChatGPT: “Cryptographically-signed provenance information matters in the context of AI-generated content to establish trust, transparency, and authenticity by providing a verifiable record of the content’s origin, authorship, and creation process.”

That’s a great point, but did ChatGPT really say that, or did I make that up to trick you into reading this article? I could show you a screenshot of that exchange to try to convince you …

Capture of ChatGPT featuring the above dialogue.

… but what good would it do, when it’s so easy to produce convincing fakes?

Read more of ""Did ChatGPT really say that?": Provenance in the age of Generative AI."

Witnessing the web is hard: Why and how we built the Scoop web archiving capture engine 🍨

There is no one-size-fits-all when it comes to web archiving techniques, and the variety of tools and services available to capture web content illustrates the wide, ever-growing set of needs in the web archiving community. As these needs evolve, so do the web and the underlying challenges and opportunities that capturing it presents. Our decade of experience running Perma.cc has given our team a vantage point to identify emerging challenges in witnessing the web that we believe extend well beyond our core mission of preserving citations in the legal record. In an effort to expand the utility of our own service and contribute to the wider array of core tools in the web archiving community, we’ve been working on a handful of Perma Tools.

In this blog post, we’ll go over the driving principles and architectural decisions we’ve made while designing the first major release from this series: Scoop, a high-fidelity, browser-based, single-page web archiving capture engine for witnessing the web. As with many of these tools, Scoop is built for general use but represents our particular stance, cultivated while working with legal scholars, US courts, and journalists to preserve their citations. Namely, we prioritize their needs for specificity, accuracy, and security. These are qualities we believe are important to a wide range of people interested in standing up their own web archiving system. As such, Scoop is an open-source project which can be deployed as a standalone building block, hopefully lowering a barrier to entry for web archiving.

A capture of the homepage made with Scoop on April 5 2023.
Read more of "Witnessing the web is hard: Why and how we built the Scoop web archiving capture engine 🍨"

Introducing Reading Mode for H2O

We are excited to announce the release of our “reading mode”, a new casebook view that offers students a cohesive digital format to facilitate deep reading.

We think better design of digital reading environments can capture the benefits of dynamic online books while orienting readers to an experience that encourages deeper analysis. Pairing that vision with our finding that more students are seeking digital reading options, we identified an opportunity to develop a digital reading experience that is streamlined, centralized, and most likely to encourage deep reading.

Read more of "Introducing Reading Mode for H2O"

IIPC Technical Speaker Series: Archiving Twitter

I was invited by the International Internet Preservation Consortium (IIPC) to give a webinar on the topic of “Archiving Twitter” on January 12.

During this talk, I presented what we’ve learned building thread-keeper, the experimental open-source software behind social.perma.cc, which allows for making high-fidelity captures of URLs as “sealed” PDFs.

Read more of "IIPC Technical Speaker Series: Archiving Twitter"

Towards “deep fake” web archives? Trying to forge WARC files using ChatGPT.

Chatbots such as OpenAI’s ChatGPT are becoming impressively good at understanding complex requests in “natural” language and generating convincing blocks of text in response, using the vast quantity of information the models they run were trained on.
Garnering massive amounts of mainstream attention and rapidly making their way through the second phase of the Gartner Hype Cycle, ChatGPT and its potential amaze and fascinate as much as they bewilder and worry. In particular, more and more people seem concerned by its propensity to make “cheating” both easier to do and harder to detect.

My work at LIL focuses on web archiving technology, and the tool we’ve created, Perma.cc, is relied on to maintain the integrity of web-based citations in court opinions, news articles, and other trusted documents.
Since web archives are sometimes used as proof that a website looked a certain way at a certain time, I started to wonder what AI-assisted “cheating” would look like in the context of web archiving. After all, WARC files are mostly made of text: are ChatGPT and the like able to generate convincing “fake” web archives? Do they know enough about the history of web technologies and the WARC format to generate credible artifacts?
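
To make the "mostly made of text" point concrete, here is a minimal Python sketch that assembles a single WARC `response` record by hand. It is only an illustration of the record layout (a version line, named header fields, a blank line, then the captured HTTP message), not the method explored in the article. The URL, the HTML body, and the helper name `build_warc_record` are assumptions made up for this example, and real capture tools typically add further fields such as payload digests.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Assemble one WARC 'response' record by hand (illustrative sketch only)."""
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header lines end with CRLF; a blank line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLF sequences.
    return head + http_payload + b"\r\n\r\n"


# Hypothetical captured HTTP response, written out as plain text.
body = b"<html><body>Hello from an imaginary page</body></html>"
payload = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body
)

record = build_warc_record("http://example.com/", payload)
print(record.decode("utf-8"))
```

Nothing in this record proves when or whether the page was actually served: every field, including the date, is whatever the author of the file chose to write.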

Let’s ask ChatGPT to find out.

Read more of "Towards "deep fake" web archives? Trying to forge WARC files using ChatGPT."

ChatGPT: Poems and Secrets

I’ve been asking ChatGPT to write some poems. I’m doing this because it’s a great way to ask ChatGPT how it feels about stuff, and doing that is a great way to understand all the secret layers that go into a ChatGPT output. After looking at where ChatGPT’s opinions come from, I’ll argue that secrecy is a problem for this kind of model, because it weighs the risk that we’ll misuse this tool more heavily than the risk that we won’t understand what we’re doing in the first place.

Read more of "ChatGPT: Poems and Secrets"

Ethical Collaborative Storytelling

I started with the idea of a computer-generated story in which audience participation creates copies of the narrative from each person’s point of view. The story would evolve in real time as new users joined and each person’s copy would update accordingly. I called the story Forks. After some initial trials, I decided not to launch it because I did not believe the project was securable against harm.

The framing plot was this: A person begins a journey to a new home. They forge a trail through the landscape that subsequent travelers can follow. Each person has a set of randomly assigned traits: a profession, a place where they live, a time of day when they join the procession, and a type of item they leave behind for others to follow their path. They may directly follow the original traveler or a later traveler. At the end of the story, the total number of travelers is described as having arrived at their new home.

A generated story would look something like this:

Read more of "Ethical Collaborative Storytelling"

Welcome Molly White: Library Innovation Lab Fellow

The Harvard Library Innovation Lab is delighted to welcome Molly White as an academic fellow.

Molly has become a leading critic on cryptocurrency and web3 issues through her blog Web3 is Going Just Great as well as her long-form research and essays, talks, guest lectures, social media, and interviews. Her rigorous, independent, and accessible work has had a substantial impact on legislators, media, and the public debate.

Read more of "Welcome Molly White: Library Innovation Lab Fellow"