I just made a minor change to WARC-GPT, the tool Matteo Cargnelutti wrote for querying web archives with AI. I’ll explain a little bit of the background, and what the change is for.

What are we trying to do here

The basic idea is that we want to combine web archives with an existing large language model, so that the model will answer questions using the contents of the web archive as well as its inherent knowledge. I have for many years run a wiki for myself and a few friends, which has served variously as social venue, surrogate memory, place to pile up links, storehouse of enthusiasms. When Matteo first announced WARC-GPT, it struck me that the wiki would be a good test: would the tool accurately reflect the content, which I know well? Would it be able to tell me anything surprising? And more prosaically, could I run it on my laptop? Even though the wiki is exposed to the world, and I assume has been crawled by AI companies for inclusion in their models (despite the presence of a restrictive robots.txt), I don’t want to send those companies either the raw material or my queries.

What is a web archive

Briefly, a web archive is a record of all the traffic between a web browser and a web server for one or more pages, from which you can recreate and play back those pages. My first task was to generate a list of the 1,116 pages in the wiki, then to create an archive of them with browsertrix-crawler, which produced the 40-megabyte file crawls/collections/wiki/wiki_0.warc.gz. This is a WARC file, a standard format for web archives.

Ingest the web archive

We now turn to WARC-GPT. The next step is to ingest the web archive; an optional step after that is to visualize the resulting data; and the last step is to run the application, with which we can ask a question and get an answer.

I installed WARC-GPT by running

git clone https://github.com/harvard-lil/warc-gpt.git
cd warc-gpt
poetry env use 3.11
poetry install

I copied .env.example to .env and made a couple of changes recommended by Matteo (of which more later), then copied wiki_0.warc.gz into the warc/ subdirectory of the repo. The command to process the archive is

poetry run flask ingest

which… took a long time. This is when I started looking at and trying to understand the code, specifically in commands/ingest.py.

What is actually going on here

In the ingest step, WARC-GPT takes the text content of each captured page, splits it into chunks, then uses a sentence transformer to turn each chunk into an embedding: a vector of numbers. Later, when a question comes in, those vectors are used to pick the source material that best matches it, and that material is in turn handed to the text generation model as context for producing the answer.
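
To make that concrete, here is a bare-bones sketch of the chunk-and-embed pattern, written against the sentence-transformers library directly. This is not WARC-GPT’s code; the crude fixed-size chunking and the names here are my own, just to show the shape of the data.

# Illustration of the general chunk-and-embed pattern, not WARC-GPT's code.
from sentence_transformers import SentenceTransformer

def chunk_text(text, max_chars=2000):
    # Crude fixed-size chunking for illustration; real chunkers split on
    # sentence or token boundaries.
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

model = SentenceTransformer("BAAI/bge-m3")  # the embedding model I ended up with; more on this below

page_text = "text extracted from one captured page " * 200
chunks = chunk_text(page_text)

# Each chunk becomes a fixed-length vector of floats.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (number of chunks, embedding dimension)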

This is perhaps the moment to point out that AI terminology can be confusing. As I’ve been discussing all this with Matteo, we continually return to the ideas that the material is opaque, the number of knobs to turn is very large, and the documentation tends to assume a lot of knowledge on the part of the reader.

The first setting Matteo had me change in the .env file was VECTOR_SEARCH_SENTENCE_TRANSFORMER_MODEL, which I changed from "intfloat/e5-large-v2" to "BAAI/bge-m3". This is one of the knobs to turn; it’s the model used to create the embeddings. Matteo said, “I think this new embedding model might be better suited for your collection…. Main advantage: embeddings encapsulate text chunks up to 8K tokens.” That is, the vectors can represent longer stretches of text. (A token is a word or a part of a word, roughly.)
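
In .env, that change amounts to a single line (give or take the exact quoting in the file):

VECTOR_SEARCH_SENTENCE_TRANSFORMER_MODEL="BAAI/bge-m3"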

One of the other knobs to turn, of course, is where you run the ingest process. Matteo has been doing most of his work on lil-vector, our experimental AI machine, which is made for this kind of work and is much more performant than a Mac laptop. When I ran an ingest with BAAI/bge-m3, the encoding of multi-chunk pages was very slow, and Matteo pointed out that the parallelization built into the encoding function must be running up against the limits of my computer.

I turned an additional knob, changing VECTOR_SEARCH_SENTENCE_TRANSFORMER_DEVICE from "cpu" to "mps". Roughly, this setting picks the hardware backend that does the encoding; "mps" is Apple’s Metal Performance Shaders, the stand-in on a Mac for the dedicated GPU, or graphics processing unit, that does this work on machines like lil-vector. I didn’t see a big improvement, though, so I set out to make the change I mentioned at the beginning of this post. The idea is to keep track of encoding times for one-chunk pages, and if encoding a multi-chunk page takes a disproportionately long time, to stop attempting to encode multiple chunks in parallel.
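
The actual change is in the pull request; what follows is only a sketch of the idea, with invented names and thresholds (encode_page, the factor of two) and none of WARC-GPT’s real plumbing.

# Sketch of the idea, not the code in the pull request: time single-chunk
# pages to get a baseline, and if a batched encode of a multi-chunk page
# blows well past that baseline, stop batching and encode chunk by chunk.
import time
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

single_chunk_times = []  # seconds per one-chunk page, observed so far
use_batched = True       # flips to False once batching proves too slow

def encode_page(chunks):
    global use_batched

    if len(chunks) == 1:
        start = time.perf_counter()
        vectors = model.encode(chunks)
        single_chunk_times.append(time.perf_counter() - start)
        return vectors

    if use_batched:
        start = time.perf_counter()
        vectors = model.encode(chunks)  # one batched call; may parallelize internally
        elapsed = time.perf_counter() - start
        if single_chunk_times:
            baseline = sum(single_chunk_times) / len(single_chunk_times)
            # "Disproportionately long": here, more than twice what the
            # per-chunk baseline predicts. The threshold is my invention.
            if elapsed > 2 * baseline * len(chunks):
                use_batched = False
        return vectors

    # Fallback: encode one chunk at a time and stack the results.
    return np.vstack([model.encode([chunk]) for chunk in chunks])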

This worked; ingest times (on my laptop, for my 1,116-page web archive) went from over an hour to about 38 minutes. Success! But note that I still don’t have a clear picture of how ingest time is really related to all the variables of hardware, settings, and for that matter, what else is happening on the machine. Further improvements might well be possible.

Also note that the pull request contains changes other than those described here: I moved this part of the code into a function, mainly for legibility, and changed some for-loops to list comprehensions, mainly for elegance. I experimented with a few different arrangements, and settled on this one as fastest and clearest, but I have not done a systematic experiment in optimization. I’m currently working on adding a test suite to this codebase, and plan to include in it a way to assess different approaches to encoding.
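
The loop-to-comprehension change is the generic Python rewrite rather than anything particular to this codebase; in miniature, and with made-up data, it looks like this:

# Not the actual diff, just the shape of the rewrite.
chunks = ["alpha", "beta", "gamma"]

# Before: build a list with a for-loop.
lengths = []
for chunk in chunks:
    lengths.append(len(chunk))

# After: the same list as a comprehension.
lengths = [len(chunk) for chunk in chunks]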

Coda

You will notice that we have not actually run the web application, nor asked a question of the model.

When Matteo suggested the change to the sentence transformer model, he added, “But you’ll also want to use a text generation model with a longer context window, such as: yarn-mistral.” The point Matteo is making here is that when the sentence transformer encodes the input in larger pieces, the text generation model should be able to handle larger pieces of text. The implicit point is that the text generation model is external to WARC-GPT; the application has to call out to another service. In this case, since I wanted to keep everything on my computer, I am running Ollama, an open-source tool for running large language models, and set OLLAMA_API_URL to "http://localhost:11434", thereby pointing WARC-GPT at my local instance. (I could also have pointed to an instance of Ollama running elsewhere, say on lil-vector, or, with a different variable, pointed the system at OpenAI or an OpenAI-compatible provider of models.)
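
Before pointing WARC-GPT at it, it’s worth confirming that something is actually answering at that address. Ollama exposes an HTTP API, so a quick check from Python looks roughly like this (the model name is a placeholder for whichever model you’ve pulled):

# Quick sanity check against a local Ollama instance.
# Assumes a model has already been pulled, e.g. with `ollama pull mistral`.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",  # whichever model you have pulled
        "prompt": "Say hello in five words or fewer.",
        "stream": False,     # return a single JSON object, not a stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])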

Once Ollama was running, and I’d run the ingest step, I could run

poetry run flask run

and visit the application at http://localhost:5000/. I can pick any of the models I’ve pulled with Ollama; these vary pretty dramatically in speed and in quality of response. This is, obviously, another of the knobs to turn, along with several other settings in the interface. So far, I’ve had the best luck with mistral:7b-instruct-v0.2-fp16, a version of the Mistral model that is optimized for chat interfaces. (Keep an eye out for models that have been quantized: model parameters have been changed from floating-point numbers to integers in order to save space, time, and energy, at some cost in accuracy. They often have names including q2, q3, q4, etc.) The question you ask in the interface is yet another knob to turn, as is the system prompt specified in .env.

I haven’t learned anything earth-shattering from WARC-GPT yet. I was going to leave you with some light-hearted output from the system, maybe a proposal for new wiki page titles that could be used as an outro for a blog post on querying web archives with AI, or a short, amusing paragraph on prompt engineering and the turning of many knobs, but I haven’t come up with a combination of model and prompt that delivers anything fun enough. Stay tuned.