Announcing the COLD French Law Dataset

The Collaborative Open Legal Data collection has a new addition: a set of more than 800,000 articles extracted from the LEGI dataset, one of France’s official open law repositories, and programmatically identified by our pipeline as “currently applicable French law.”

This dataset, formatted as a single CSV file and openly available on Hugging Face, contains the original texts from the LEGI dataset as well as machine-generated French-to-English translations, produced with the participation of the CoCounsel team at Casetext, part of Thomson Reuters.

COLD French Law was initially compiled to be used in a forthcoming experiment at the Lab. We are releasing it broadly today as part of our commitment to open knowledge. We see this dataset as a contribution to the quickly expanding field of legal AI, and hope it will help researchers, builders, and tinkerers of all kinds in their endeavors.

The Process

As part of these release notes, we would like to share details about the process used to translate the articles contained in the dataset.

In a field where the volume of data is so important, it’s useful to understand how feasible it is to work with a dataset in one language using an LLM trained in another. This process revealed some techniques for translating a large set of documents not only reliably but also efficiently. We do not plan to maintain this dataset outside of the needs of our experiments, and are therefore sharing the details of the pipeline so that others may update the data in the future if needed.

Over the course of two months, the CoCounsel team ran all ~800,000 articles through a translation pipeline that took each individual entry and translated it from its original French into English using OpenAI’s GPT-4 large language model. One hurdle was that much of the important metadata attached to each entry was also in French, combined with a desire to retain each article in its fullest form.

Via GPT-4’s function-calling feature, the pipeline was able to translate the full entries, with every column of an entry translated in a single call (or a couple of calls in the limited cases where entries were longer than 2,500 tokens). This saved weeks of processing. Additionally, this technique outputs an individual JSON file for each law article.
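As a rough illustration of that last point, the sketch below shows how the arguments of a forced function call could be parsed and written to one JSON file per article. The response shape follows OpenAI's Chat Completions format for function calls. The helper name `save_translated_article`, the output directory, and the sample values (including the identifier) are assumptions made for illustration, not the CoCounsel team's actual code.

```python
import json
import tempfile
from pathlib import Path


def save_translated_article(response: dict, out_dir: Path) -> Path:
    """Parse a forced function call and write its arguments to <article_identifier>.json."""
    # With function calling, the model's output arrives as a JSON string in
    # message.function_call.arguments rather than as free text.
    arguments = response["choices"][0]["message"]["function_call"]["arguments"]
    article = json.loads(arguments)  # raises json.JSONDecodeError if malformed

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{article['article_identifier']}.json"
    out_path.write_text(
        json.dumps(article, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "function_call": {
                    "name": "set_article",
                    "arguments": json.dumps(
                        {
                            "article_identifier": "LEGIARTI000000000001",
                            "texte_titre": "Civil Code",
                        }
                    ),
                },
            }
        }
    ]
}

saved_path = save_translated_article(sample_response, Path(tempfile.mkdtemp()))
```

Because every column comes back inside one structured object, a malformed reply fails loudly at `json.loads` instead of silently producing a partial translation.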

With this approach, we were able to run the pipeline for just a few hours each night, and the structure of the dataset remained intact.
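One way to confirm that the structure stayed intact is to compare each translated entry with its source. The following is a minimal sketch, not part of the published pipeline: the function name `structure_problems` and the sample entries are hypothetical, and the list of untranslated keys is taken from the prompt shared in these notes.

```python
# Keys that the prompt asks the model to copy through without translating.
UNTRANSLATED_KEYS = ("article_identifier", "article_num", "texte_nature", "texte_num")


def structure_problems(original: dict, translated: dict) -> list:
    """Return human-readable problems; an empty list means the entry looks intact."""
    problems = []

    missing = sorted(set(original) - set(translated))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))

    extra = sorted(set(translated) - set(original))
    if extra:
        problems.append("unexpected keys: " + ", ".join(extra))

    # Identifiers and numbers should come back exactly as they went in.
    for key in UNTRANSLATED_KEYS:
        if key in original and key in translated and original[key] != translated[key]:
            problems.append(f"{key} changed: {original[key]!r} -> {translated[key]!r}")

    return problems


# Made-up entries for illustration only.
source_entry = {
    "article_identifier": "LEGIARTI000000000001",
    "article_num": "1",
    "texte_titre": "Code civil",
}
good_translation = {
    "article_identifier": "LEGIARTI000000000001",
    "article_num": "1",
    "texte_titre": "Civil Code",
}
bad_translation = {
    "article_identifier": "LEGIARTI000000000001",
    "article_num": "Article 1",
}

good_report = structure_problems(source_entry, good_translation)
bad_report = structure_problems(source_entry, bad_translation)
```

Here `good_report` is empty, while `bad_report` flags the missing `texte_titre` and the altered `article_num`.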

Over the course of this process, adjustments were made to the prompt based on the expertise of the CoCounsel team and on feedback provided by Timothée Charmeil, an LL.M. candidate at HLS, who quality-tested samples of the initial outputs.

The final prompt that was engineered by our colleagues is shared below.

The Prompt

	# JSON Schema for the "set_article" function: the model must return the
	# nine fields of an article as strings. Defined first because the request
	# payload below refers to it.
	schema = {
		"type": "object",
		"properties": {
			"article_identifier": {"type": "string"},
			"article_num": {"type": "string"},
			"texte_nature": {"type": "string"},
			"texte_num": {"type": "string"},
			"texte_ministere": {"type": "string"},
			"texte_titre": {"type": "string"},
			"texte_titre_court": {"type": "string"},
			"texte_contexte": {"type": "string"},
			"article_contenu_markdown": {"type": "string"}
		}
	}

	# GPT_4_MODEL_NAME and query_text are defined elsewhere in the pipeline:
	# the model identifier and the key/value pairs of the entry to translate.
	data = {
		"messages": [
			{
				"role": "system",
				"content": "You are a helpful assistant."
			},
			{
				"role": "user",
				"content": "Below I am going to provide you with at least nine key value pairs with each of the nine keys: article_identifier, article_num, texte_nature, texte_num, texte_ministere, texte_titre, texte_titre_court, texte_contexte and article_contenu_markdown. Each key has a corresponding value. Ignore any keys other than the nine keys I listed. Translate each value except for article_identifier, article_num, texte_nature and texte_num into English. Double check that it is in English, and not French. Make sure you return valid JSON. There can be no JSON escape characters in your response. Here are the key/value pairs: " + query_text
			}
		],
		"model": GPT_4_MODEL_NAME,
		"functions": [{"name": "set_article", "parameters": schema}],
		# Force the model to answer through set_article so the reply matches the schema.
		"function_call": {"name": "set_article"},
		"max_tokens": 2500,
		"temperature": 0,
		"logprobs": None,
		"priority": "low",
		"skip_recorder": False
	}
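The prompt ends by appending `query_text`, the key/value pairs of the entry being translated. How the pipeline actually builds that string is not shown here, so the sketch below is only one plausible approach: it keeps the nine keys named in the prompt and serializes them as JSON. The function name `build_query_text` and the sample row are hypothetical.

```python
import json

# The nine keys named in the prompt, in the order it lists them.
ARTICLE_KEYS = (
    "article_identifier",
    "article_num",
    "texte_nature",
    "texte_num",
    "texte_ministere",
    "texte_titre",
    "texte_titre_court",
    "texte_contexte",
    "article_contenu_markdown",
)


def build_query_text(row: dict) -> str:
    """Serialize the nine expected fields of one entry for inclusion in the prompt."""
    # Missing or empty fields become empty strings so that all nine keys are present.
    selected = {key: row.get(key) or "" for key in ARTICLE_KEYS}
    # ensure_ascii=False keeps accented French characters readable in the prompt.
    return json.dumps(selected, ensure_ascii=False)


# Made-up row for illustration; "some_other_column" stands in for any extra column.
sample_row = {
    "article_identifier": "LEGIARTI000000000001",
    "article_num": "1",
    "texte_nature": "CODE",
    "texte_titre": "Code civil",
    "article_contenu_markdown": "Les lois sont exécutoires dans tout le territoire français.",
    "some_other_column": "ignored",
}

query_text = build_query_text(sample_row)
```

Filtering to the nine keys before sending keeps the request small and matches the prompt's instruction to ignore any other keys.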


Links

COLD French Law dataset on Hugging Face

COLD French Law CLI pipeline on GitHub