MICHELLE LEE

Michelle Lee is product lead at Protocol Labs and Executive Director of the IPFS Foundation.

Interviewer

If you were given unlimited funding to design a system for storing and preserving digital information for at least a century, what would you do?

Michelle Lee

I would do two things, and with both I wouldn't quite follow the instructions. First, I would think about storing for 20 years and for 500 years, not 100. The reason for 20 is that if you look at a lot of media formats, like vinyl, then cassette, then CD-ROM, then MP3, those tend to have 20- or 30-year arcs, and it's toward the end of the arc that the tools to access that data start to fall apart. The medium is intact, the medium is fine, but it's the ability to use and access the data that tends to degrade. Secondly, I would spend three quarters of that funding on projects that enable access.

I think sometimes in the data world we compare data archiving to the seed bank in Svalbard. The purpose of the seed bank isn't really to preserve the seeds for their own sake; it's to safeguard our capacity to grow and regenerate the planet in case of catastrophe. Data at rest is like a seed: both are compact and stable, but hunkered in their shells, a little bit dry and crunchy if you try to eat them raw.

The long-term value of data lies in our ability to read and learn from it, whether that's processing it, replicating it, or porting it to other formats. That's the credible exit lens on data preservation. There are a lot of fantastic, very thorough ways to keep data around, but if you look at what practitioners want to do with that data, that's where the ecosystem is weaker.

Interviewer

Obviously this is part of your everyday work, but it also relates to this conversation: how are you currently thinking about the trade-offs between decentralization and centralization?

Michelle Lee

IPFS as a project is 10 years old. In terms of software, it's time to have a midlife crisis. In the very beginning, Juan Benet, its inventor, had this vision that if you create these self-certifying primitives, you can recombine them in a global peer-to-peer network and have a fully alternate network for sharing information across the internet. I think what we're seeing more recently is developers and community members using these primitives and recombining them in hybrid architectures.

And I think what we're coming around to is that transitions take time, and transitions often pass through a messy middle. Now a lot of our energy goes into making sure these primitives are interoperable, self-certifying, and easy to use in whatever architecture you choose. Bluesky, for example, uses IPFS content-addressing tooling, but it runs on traditional web servers. That's why what matters is the interoperability of the data, being able to reuse it in other systems and other networks, rather than dictating that the only way to exchange information is peer-to-peer.
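
To make the "self-certifying" idea concrete: a content address is derived from the bytes themselves, so anyone holding the data can check the identifier without trusting whichever server or peer delivered it. Here is a minimal sketch in Go that uses a plain SHA-256 digest in place of a real IPFS CID (which additionally layers multihash and multibase encoding on top of the digest):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentAddress derives an identifier from the bytes themselves.
// Real IPFS CIDs wrap the digest in multihash/multibase metadata;
// this sketch keeps only the core idea: identifier = hash(content).
func contentAddress(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verify re-hashes retrieved bytes and compares them to the address,
// so the data certifies itself regardless of how it was transported.
func verify(data []byte, addr string) bool {
	return contentAddress(data) == addr
}

func main() {
	doc := []byte("hello, distributed web")
	addr := contentAddress(doc)
	fmt.Println("address: ", addr)
	fmt.Println("intact:  ", verify(doc, addr))
	fmt.Println("tampered:", verify([]byte("hello, tampered web"), addr))
}
```

Because the check depends only on the bytes and the address, the same data can live on a peer-to-peer network, a traditional web server, or both, which is what makes the hybrid architectures described above workable.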

Interviewer

What do you think of as the biggest challenges to ensuring access to data that’s been preserved? Even outside of specific format examples, in terms of social behavior, what makes it hard?

Michelle Lee

With IPFS, we talk about the data life cycle of finding, then retrieving, then verifying. Finding continues to be a challenge across the whole data preservation world, right? You have these long-respected institutions, like Harvard, that can keep an archive. You have organizations like the Internet Archive. But there are a lot of other institutions and communities out there who have downloaded YouTube videos or public health data sets or whatever they're preserving; fanfic writers are archiving tons and tons of content. There are very few ways to find that data, find where it exists, and have it come with some sort of reputation: its lineage, how it was archived, where it came from.
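
For the "retrieving" and "verifying" steps of that life cycle, content addressing means a client can fetch from any source, an HTTP gateway, a mirror, a peer, and check the bytes locally. A rough sketch, with the caveat that the URL and digest below are hypothetical placeholders and a real publisher would distribute the expected address alongside the data:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
)

// fetchAndVerify retrieves bytes from any URL and checks them against
// an expected SHA-256 digest, so trust rests on the hash, not the host.
func fetchAndVerify(url, expectedHex string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != expectedHex {
		return nil, fmt.Errorf("digest mismatch: content differs from what was archived")
	}
	return data, nil
}

func main() {
	// Hypothetical mirror URL and digest, for illustration only.
	data, err := fetchAndVerify(
		"https://example.org/archive/dataset.csv",
		"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
	)
	if err != nil {
		fmt.Println("verification failed:", err)
		return
	}
	fmt.Printf("verified %d bytes\n", len(data))
}
```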

As a space evolves, some subsets of the field mature faster than others. I think right now, discoverability is a big challenge, the biggest barrier to practical access to preserved data. Ian Milligan talked about the symbiotic relationship between different web archiving organizations, and how discoverability and legibility of those artifacts is a huge challenge. It's so interesting thinking about the challenges for different types of preservers in this context.

Interviewer

I’d love to hear more about how you think about longevity and how you think about creating and fostering systems, environments, and cultures that can deal with stewardship and governance over long periods.

Michelle Lee

I love that you mentioned the word culture, because I've been talking a little bit with people, and it feels like when you have a credible exit from whatever proprietary system, you also need something to exit to; you need a hospitable other world. You need an ecosystem of tools to find, read, inspect, and analyze data. You need a community of people who know how to build and repair and evolve those tools. Right now, we're really cultivating a community of people who are absolute nerds on data formats and data structures.

If we look at the way academic fields emerge and grow, are there going to be schools or programs teaching these techniques? (It turns out there is more of an active hand in those decisions than I realized during my time in academia.) Are there the social or governance structures to keep up with permissions for data? As I've spent more and more time in this world, data, to me, feels like a garden rather than a museum. You need to constantly tend it. You can't just lock it underground for 100 years and come back to it.