Interviewer
If you were given unlimited funding to design a system for storing and preserving digital information for at least a century, what would you do?
Matteo Cargnelutti
That’s a really good question, and assuming I have unlimited budget and time, I think the first thing I would focus on is scaling up the new storage formats that currently exist only at the lab stage.
I’ve been very encouraged by some of the projects I’ve seen on developing new mediums for storing data long term. I’m thinking of things like glass or ceramic-based storage. Something that is sturdy in a way, but also fragile because of the material that is used. If I had an infinite budget, my focus would be on moving this from the lab bench or the cloud (because right now I believe that’s the main use case) to something that can be used by anybody locally.
Institutions have a critical role to play in archiving things “forever,” on the principle that “lots of copies keep stuff safe.” But if I’m thinking about storing things for the long term, I would want as many copies as possible. So if we were able to have this sort of technology, designed from the beginning to store things for centuries and more, it would need to be broadly accessible, and cheap. This is where I would start. I think this is a massive undertaking, because it’s research and development for something that is extremely challenging. Practices have changed and the cloud is the default for many people. But I would want as many actors as possible involved in archiving things and making their own curatorial decisions. So I would want this type of technology to be broadly available.
Interviewer
How do you think about users when designing or considering an archival system or tool? It’s such a different use-case than many other types of software and hardware.
Matteo Cargnelutti
For me, the user in an archiving system can be the archivist and it could be the institution holding archives. That’s a core use case, and we need to keep that in mind. But we’ve also seen more and more homegrown archives, especially since the mid-2010s. So I also think of the user as the amateur archivist preserving things just because it matters to them personally. Eventually, what they preserved might matter for society in general, sometimes even for the whole of humanity.
So I don’t think I have a precise answer to that question. It’s more about how many different use cases you want to enable as opposed to just focusing on one. In my work in general, I try to make sure I don’t over-optimize for a single use case. I try to be as use case agnostic as possible, which is a major challenge.
Interviewer
At the Institutional Data Initiative, you work with very large datasets, these extremely big corpora that are large repositories of information. How do you think about handling and communicating the context of that data, in addition to preserving the individual items themselves?
Matteo Cargnelutti
So I think there are two key things to consider when working with large collections.
First, the data itself gives you information about the collection’s curation and collection processes. Not everything, not the full picture, but being able to see what is in a large collection tells you about some of what went into assembling it. Something to remember when you make use of those collections is that they are a product of humans selecting things.
The other is that a lot of the work we do at IDI aims to help users, to the extent possible, make better informed decisions about their use of a given dataset.
We think that getting insight into the nature of a collection is extremely important for that purpose. What I find challenging, on top of just getting the data and interpreting it, is that we’re all also necessarily making choices in that process in terms of what insight to get, how to collect that information, how to refine it, and how to publish it. This is also a sort of curatorial process by itself.
There’s always humans in the loop when assembling collections and datasets. Trying to get signal about the contents of the collection is also a humans-in-the-loop process, and the product of collective decisions about what needs to be reflected, and what could help end users make decisions.
Interviewer
How do you approach the stewardship of these collections at IDI?
Matteo Cargnelutti
We work mainly with collections that come from libraries, and we want to make sure that stewardship from libraries continues once the dataset is released: it’s not just about publishing data. I think provenance is one aspect of that. Making clear where the data comes from, how it was processed, and providing as much context as possible about the process itself, the origin of the collection and its contents, et cetera.
IDI works directly with libraries. It’s not libraries giving us access to data and us independently doing something with it. It’s a collaboration on every single collection. So in our context specifically, the idea is that libraries extend their stewardship to a version of the collection that is now a dataset, as opposed to it being completely separate.
Interviewer
When you have a huge amount of information in a corpus, what do you think about the everyday risks of misinterpretation?
Matteo Cargnelutti
We need to contextualize as much as possible, but we also cannot make decisions on behalf of the user. I think the best thing we can do is, again, to provide as much insight and context as possible. Some of it will come from the work that librarians have done in assembling a collection. Some of it will come from the analysis we were able to perform on a given volume or on a given subset of a collection.
I don’t think we can make all those decisions by ourselves. It’s more of a community effort in figuring out how to provide context as effectively as possible.
Interviewer
What else should we talk about that we haven’t yet?
Matteo Cargnelutti
I know that everyone talks about AI, but something that is top of mind for me is the role AI could play not only in preservation, but also in improving access to what we archive.
There are two things that come to mind when I think about the interaction of AI with archives and access to those archives.
There’s of course using machine learning models to improve transcriptions of source materials. That’s something I focus on on a daily basis, and I do see new use cases emerging there.
In terms of access, I also see that people are asking more and more questions to LLMs and increasingly relying on those generated responses. What I find interesting here is that it’s still somewhat unclear how LLMs learn, retain, and recall information. I know prompt engineering is no longer trendy, but there’s something about this in my mind: if you ask the question “right,” you might get the information you’re looking for, because it might be present in the model’s weights and biases.
Very broadly, when I think about the interaction of access and LLMs, I think there’s potential, but it’s still unclear to me at this stage what the practice is here, and what the standards around that practice are.