Recently I landed a series of contributions to the Python package pip-tools:
- Support URLs as packages #807
- Fix --generate-hashes with bare VCS URLs #812
- Fix pip-compile output for unsafe requirements #813
- Warn when --generate-hashes output is uninstallable #814
pip-tools is a "set of command line tools to help you keep your pip-based [Python] packages fresh, even when you've pinned them." My changes help the pip-compile --generate-hashes command work for more people.
This isn't a lot of code in the grand scheme of things, but it's the largest set of contributions I've made to a mainstream open source project, so this blog post is a celebration of me! 🎁💥🎉 yay. But it's also a chance to talk about package manager security and open source contributions and stuff like that.
I'll start high-level with "what are package managers" and work my way into the weeds, so feel free to jump in wherever you want.
What are package managers?
Package managers help us install software libraries and keep them up to date. If I want to load a URL and print the contents, I can add a dependency on a package like requests and let requests do the heavy lifting.
But there's a problem – if I install exactly the same package later, I might get a different result:
I got a different version of requests than last time, and I got some bonus dependencies (chardet). Now my code might not do the same thing even though I did the same thing, which is not how anyone wants computers to work. (I've cheated a little bit here by showing the first example as though pip install had been run back in 2013.)
So the next step is to pin the versions of my dependencies and their dependencies, using a package like pip-tools.
(There are other options I could use instead, like poetry. For now I still prefer pip-tools, for roughly the reasons laid out by Hynek Schlawack.)
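To make "pinned" concrete, here's a minimal sketch (mine, not anything pip-tools ships) that flags requirement lines that don't name an exact version. The helper name unpinned_requirements and the version numbers are made up for illustration, and the parsing is deliberately simplistic.

```python
import re

# Hypothetical helper for illustration; not part of pip or pip-tools.
# A line counts as "pinned" here if it looks like name==version.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?==[^=\s]+")


def unpinned_requirements(requirements_text):
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and option lines such as -r or -e.
        if not line or line.startswith("-"):
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Illustrative inputs; the version numbers are placeholders.
loose = "requests\nchardet>=3.0\n"
pinned = "requests==2.21.0  # a comment\nchardet==3.0.4\n"

print(unpinned_requirements(loose))
print(unpinned_requirements(pinned))
```

The first call reports both lines as unpinned; the second reports none, which is the state a pinning tool is trying to get a requirements file into.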
Now when I run pip install -r requirements.txt I will always get the same version of requests, and the same versions of its dependencies, and my program will always do the same thing.
… just kidding.
The problem with pinning Python packages
pip-compile doesn't quite lock down our dependencies the way we would hope! In Python land you don't necessarily get the same version of a package by asking for the same version number. That's because of binary wheels.
Up until 2015, it was possible to change a package's contents on PyPI without changing the version number, simply by deleting the package and reuploading it. That no longer works, but there is still a loophole: you can delete and reupload binary wheels.
Wheels are a new-ish binary format for distributing Python packages, including any precompiled programs written in C (or other languages) used by the package. They speed up installs and avoid the need for users to have the right compiler environment set up for each package. C-based packages typically offer a bunch of wheel files for different target environments – here are bcrypt's wheel files, for example.
So what happens if a package was originally released as source, and then the maintainer wants to add binary wheels for the same release years later? PyPI will allow it, and pip will happily install the new binary files. This is a deliberate design decision: PyPI has "made the deliberate choice to allow wheel files to be added to old releases, though, and advise folks to use --no-binary and build their own wheel files from source if that is a concern."
That creates room for weird situations, like this case where wheel files were uploaded for the hiredis 0.2.0 package on August 16, 2018, three years after the source release on April 3, 2015. The package had been handed over without announcement from Jan-Erik Rediger to a new volunteer maintainer, ifduyue, who uploaded the binary wheels. ifduyue's personal information on GitHub consists of: a new moon emoji; an upside down face emoji; the location "China"; and an image of Lanny from the show Lizzie McGuire with spirals for eyes. In a bug thread opened after ifduyue uploaded the new version of hiredis 0.2.0, Jan-Erik commented that users should "please double-check that the content is valid and matches the repository."
The problem is that I can't do that, and most programmers can't do that. We can't just rebuild the wheel ourselves and expect it to match, because builds are not reproducible unless one goes to great lengths like Debian does. So verifying the integrity of an unknown binary wheel requires rebuilding the wheel, comparing a diff, and checking that all discrepancies are benign – a time-consuming and error-prone process even for those with the skills to do it.
So the story of hiredis looks a lot like a new open source developer volunteering to help out on a project and picking off some low-hanging fruit in the bug tracker, but it also looks a lot like an attacker using the perfect technique to distribute malware widely in the Python ecosystem without detection. I don't know which one it is! As a situation it's bad for us as users, and it's not fair to ifduyue if in fact they're a friendly newbie contributing to a project.
(Is the hacking paranoia warranted? I think so! As Dominic Tarr wrote after inadvertently handing over control of an npm package to a bitcoin-stealing operation, "I've shared publish rights with other people before. … open source is driven by sharing! It's great! it worked really well before bitcoin got popular.")
This is a big problem with a lot of dimensions. It would be great if PyPI packages were all fully reproducible and checked to verify correctness. It would be great if PyPI didn't let you change package contents after the fact. It would be great if everyone ran their own private package index and only added packages to it that they had personally built from source that they personally checked, the way big companies do it. But in the meantime, we can bite off a little piece of the problem by adding hashes to our requirements file. Let's see how that works.
Adding hashes to our requirements file
Instead of just pinning packages like we did before, let's try adding hashes to them:
Now when pip-compile pins our package versions, it also fetches the currently-known hashes for each requirement and adds them to requirements.txt (an example of the crypto technique of "TOFU" or "Trust On First Use"). If someone later comes along and adds new package files to a pinned release, or if the HTTPS connection to PyPI is later insecure for whatever reason, pip will refuse to install and will warn us about the problem.
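Conceptually, the install-time check looks something like the sketch below. This is not pip's actual code; verify_download, HashMismatch, and the file names are hypothetical, and I'm assuming sha256 digests like the one shown later in this post.

```python
import hashlib


class HashMismatch(Exception):
    """Raised when a downloaded file matches none of the pinned hashes."""


def sha256_hex(data: bytes) -> str:
    """Hex-encoded sha256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_download(name, data, pinned_hashes):
    """Accept the file only if its digest was recorded at pin time."""
    digest = sha256_hex(data)
    if digest not in pinned_hashes:
        raise HashMismatch(f"{name}: sha256 {digest} is not in the pinned set")
    return digest


# Trust on first use: record the digest of the file seen when pinning.
original = b"contents of the file seen at pin time"
pinned = {sha256_hex(original)}

# The same bytes are accepted later.
verify_download("example-1.0.tar.gz", original, pinned)

# A file that appeared after pinning has an unknown digest and is rejected.
try:
    verify_download("example-1.0-new.whl", b"a file uploaded later", pinned)
except HashMismatch as err:
    print("refusing to install:", err)
```

The point of the design is that the trusted set is fixed at pin time, so anything uploaded afterwards fails closed instead of being installed silently.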
But there are problems lurking here! If we have packages that are installed from GitHub, then pip-compile can't hash them and pip won't install them:
That's a serious limitation, because -e requirements are the only way pip-tools knows to specify installations from version control, which are useful while you wait for new fixes in dependencies to be released. (We mostly use them at LIL for dependencies that we've patched ourselves, after we send fixes upstream but before they are released.)
And if we have packages that rely on dependencies pip-tools considers unsafe to pin, like setuptools, pip will refuse to install those too:
This can be worked around by adding --allow-unsafe, but (a) that sounds unsafe (though it isn't), and (b) it won't pop up until you try to set up a new environment with a low version of setuptools, potentially days later on someone else's machine.
Those two problems meant that, when I set out to convert our Caselaw Access Project code to use --generate-hashes, I did it wrong a few times in a row, leading to multiple hours spent debugging problems I created for myself and other team members (sorry, Anastasia!). I ended up needing a fancy wrapper script around pip-compile to rewrite our requirements in a form it could understand. I wanted it to be a smoother experience for the next people who try to secure their Python projects.
So I filed a series of pull requests:
Support URLs as packages
Support URLs as packages #807 and Fix --generate-hashes with bare VCS URLs #812 laid the groundwork for fixing --generate-hashes, by teaching pip-tools to do something that had been requested for years: installing packages from archive URLs. Where before, pip-compile could only handle GitHub requirements given as editable (-e) version control links, it can now handle plain archive URLs such as https://github.com/requests/requests/archive/master.zip.
And zipped requirements can be hashed, so the resulting requirements.txt comes out looking like this, and is accepted by pip:

    https://github.com/requests/requests/archive/master.zip \
        --hash=sha256:3c3d84d35630808bf7750b0368b2c7988f89d9f5c2f2633c47f075b3d5015638
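The reason an archive URL can be hashed is that it points at one fixed file whose bytes can be digested. Here's a minimal sketch of producing a line in that shape; hashed_requirement is a hypothetical helper, not a pip-tools function, and the URL and bytes in the usage lines are placeholders.

```python
import hashlib


def hashed_requirement(url: str, archive_bytes: bytes) -> str:
    """Format a URL requirement followed by the sha256 of its archive."""
    digest = hashlib.sha256(archive_bytes).hexdigest()
    # Backslash-newline continues the requirement onto an indented line.
    return f"{url} \\\n    --hash=sha256:{digest}"


# Placeholder URL and contents, for illustration only.
print(hashed_requirement("https://example.invalid/pkg/archive/master.zip", b"archive bytes"))
```

In real use the bytes would come from downloading the archive once at compile time, which is the "first use" that the hash then locks in.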
This was a long process, and began with resurrecting a pull request from 2017 that had first been worked on by nim65s. I started by just rebasing the existing work, fixing some tests, and submitting it in the hopes the problem had already been solved. Thanks to great feedback from auvipy, atugushev, and blueyed, I ended up making 14 more commits (and eventually a follow-up pull request) to clean up edge cases and get everything working.
Landing this resulted in closing two other pip-tools pull requests from 2016 and 2017, and feature requests from 2014 and 2018.
--generate-hashes output is uninstallable
The next step was Fix pip-compile output for unsafe requirements #813 and Warn when --generate-hashes output is uninstallable #814. These two PRs allowed pip-compile --generate-hashes to detect and warn when a file would be uninstallable for hashing reasons. Fortunately pip-compile has all of the information it needs at compile time to know that the file will be uninstallable and to make useful recommendations for what to do about it.
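Here's a rough sketch of that kind of compile-time check. It is not the code from those PRs: hashing_warnings, the ASSUMED_UNSAFE set, and the warning wording are all my own stand-ins, and the real checks in pip-tools may well be structured differently.

```python
# Illustrative only: the real list of "unsafe" packages may differ.
# setuptools is the example named in this post; pip is my assumption.
ASSUMED_UNSAFE = {"setuptools", "pip"}


def hashing_warnings(requirement_lines, resolved_names, allow_unsafe=False):
    """Collect reasons a hashed requirements file would not install."""
    warnings = []

    # Version control requirements have no fixed file to hash.
    for line in requirement_lines:
        req = line.strip()
        if req.startswith(("-e ", "git+")):
            warnings.append(f"unhashable version control requirement: {req}")

    # Unsafe packages that are needed but left out of the pinned output.
    if not allow_unsafe:
        for name in sorted(ASSUMED_UNSAFE.intersection(resolved_names)):
            warnings.append(f"{name} is needed but left unpinned; consider --allow-unsafe")

    return warnings


# Placeholder inputs: one editable VCS link and a tree that needs setuptools.
lines = ["-e git+https://example.invalid/repo.git#egg=pkg", "requests==2.21.0"]
for warning in hashing_warnings(lines, {"requests", "setuptools"}):
    print("warning:", warning)
```

Everything the function needs (the requirement lines and the resolved dependency names) is available before anything is installed, which is why the warning can fire at compile time rather than days later on someone else's machine.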
Hopefully, between these two efforts, the next project to try using --generate-hashes will find it a shorter and more straightforward process than I did!
Things left undone
Along the way I discovered a few issues that could be fixed in various projects to help the situation. Here are some pointers:
First, the warning to use --allow-unsafe seems unnecessary – I believe that --allow-unsafe should be the default behavior for pip-compile. I spent some time digging into the reasons that pip-tools considers some packages "unsafe," and as best I can tell it is because it was thought that pinning those packages could potentially break pip itself, and thus break the user's ability to recover from a mistake. This seems to no longer be true, if it ever was. Instead, failing to use --allow-unsafe is unsafe, as it means different environments will end up with different versions of key packages despite installing from identical requirements.txt files. I started some discussion about that on the pip-tools repo and the pip repo.
Second, the warning not to use version control links with --generate-hashes is necessary only because of pip's decision to refuse to install those links alongside hashed requirements. That seems like a bad security tradeoff for several reasons. I filed a bug with pip to open up discussion on the topic.
Third, PyPI and binary wheels. I'm not sure if there's been further discussion on the decision to allow retrospective binary uploads since 2017, but the example of hiredis makes it seem like that has some major downsides and might be worth reconsidering. I haven't yet filed anything for this.
Personal reflections (and, thanks Jazzband!)
I didn't write a ton of code for this in the end, but it was a big step for me personally in working with a mainstream open source project, and I had a lot of fun – learning tools like black and git multi-author commits that we don't use on our own projects at LIL, collaborating with highly responsive and helpful reviewers (thanks, all!), learning the internals of pip-tools, and hopefully putting something out there that will make people more secure.
pip-tools is part of the Jazzband project, which is an interesting attempt to make the Python package ecosystem a little more sustainable by lowering the bar to maintaining popular packages. I had a great experience with the maintainers working on pip-tools in particular, and I'm grateful for the work that's gone into making Jazzband happen in general.