When you’re building a production machine learning system, reproducibility is a proxy for the effectiveness of your development process. But without locking all your Python dependencies, your builds are not actually repeatable. If you work in a Python project without locking long enough, you will eventually get a broken build because of a transitive dependency (that is, a dependency of a dependency).
But a broken build isn’t the most dangerous problem. A transitive dependency can change in a way that affects an algorithm’s results without you ever knowing it. Imagine being unable to reproduce an algorithm result because an ancestor dependency that never got pinned changes in an unexpected way. Not good.
Enter Poetry. Poetry is an environment management tool for Python projects. Admittedly, there are a lot of environment management tools for Python and there’s no guarantee that Poetry will win. But the Python development community does seem to be slowly standardizing around Poetry.
What is Poetry?
Virtual environment tools like conda have been an important part of modern Python development for a while. But there are a few problems:
- Different tools and approaches don’t play nicely with each other
- The standards for configuration are inconsistent
- Dependency resolution can fail in mysterious ways
Ideally, you want all your builds to be precisely repeatable and you want a straightforward solution that you can use across all your projects. This is where Poetry shines.
Think of Poetry like npm for Python. It’s centered around pyproject.toml, a single config file that defines your entire Python environment, and it comes with a lock file, poetry.lock, which pins all dependencies (including transitive dependencies).
Poetry extends the pyproject.toml file (introduced in PEP 518) to specify build requirements, project dependencies and development dependencies. It also includes project-level configuration (things like the name, description and license), scripts, extra dependencies and more.
Importantly, the Poetry spec for pyproject.toml works for both Python packages and Python applications, so you don’t need a different approach for each kind of project. This matters for machine learning development because reusable utilities (config handling, or basic calculations like bounding box math) often fit better in Python packages than in your actual machine learning application.
The poetry.lock file is similar in concept to running pip freeze > requirements.txt every time you add or remove a package, except that Poetry does it for you whenever you add or remove a dependency.
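As a rough illustration of the kind of snapshot pip freeze takes, the sketch below lists every installed distribution as an exact name==version pin using the standard-library importlib.metadata module. The freeze function is a hypothetical helper written for this article. It is an approximation only: the real pip freeze handles editable installs and other special cases, and poetry.lock records more than a flat list of pins.

```python
from importlib.metadata import distributions


def freeze() -> list[str]:
    """Return 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip installs whose metadata is missing a name
            pins.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the output is stable between runs.
    return sorted(pins, key=str.lower)


for pin in freeze():
    print(pin)
```

Note that this captures transitive dependencies too, because it reads what is actually installed, not what you asked for.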
And Poetry comes with a good dependency resolver. Since some packages on PyPI don’t specify all their dependencies in their metadata, Poetry sometimes has to download a package to inspect it, which can make adding a new package a little slower. But it makes broken builds far less likely down the road as dependencies evolve, which is important.
To install the latest Poetry into your Python environment, run:
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
After installation, the script will print instructions for adding poetry to your path.
To initialize a new project, run:
poetry init
This will guide you interactively through the process of setting up a new Python project with Poetry. You will add information about the project, the version of Python to use and (optionally) dependencies and development dependencies.
Say you want to add a dependency, like NumPy, after your project has been initialized. You can do that with the poetry add command:
poetry add numpy@~1.20
This will install NumPy with version >=1.20.0,<1.21.0. Alternatively, you could run poetry add numpy@^1.20 to install version >=1.20.0,<2.0.0. When you run this command, Poetry does a few things:
- Update pyproject.toml to specify the new dependency.
- Use the dependency resolver to find the set of package versions that best fit the configuration. In this case, that’s just numpy 1.20.3, but it will also include other packages if they’re specified.
- Install all the packages found by the resolver. By default, these will be installed into a virtual environment managed by Poetry. But you can also tell Poetry to install dependencies directly to the system Python.
- Freeze all dependencies and save them to poetry.lock so the exact build can be repeated in the future.
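The tilde (~) and caret (^) constraints used with poetry add above can be expanded by hand. The following is a simplified sketch of those rules, not Poetry’s own implementation: expand_constraint is a hypothetical helper, and it only handles plain numeric versions with up to three components (no pre-release tags, wildcards or combined constraints).

```python
def expand_constraint(spec: str) -> str:
    """Expand a caret or tilde constraint such as '~1.20' into an explicit range."""
    operator, version = spec[:1], spec[1:]
    if operator not in ("^", "~"):
        raise ValueError(f"expected a constraint starting with ^ or ~, got {spec!r}")

    parts = [int(p) for p in version.split(".")]
    lower = parts + [0] * (3 - len(parts))  # pad to major.minor.patch

    if operator == "~":
        # Tilde: with a minor version given, only patch releases may change;
        # with just a major version, minor releases may change too.
        bump = 1 if len(parts) >= 2 else 0
    else:
        # Caret: bump the left-most non-zero component, or the last
        # component given if they are all zero (e.g. ^0.0 -> <0.1.0).
        nonzero = [i for i, p in enumerate(parts) if p != 0]
        bump = nonzero[0] if nonzero else len(parts) - 1

    upper = parts[:bump] + [parts[bump] + 1]
    upper += [0] * (3 - len(upper))  # zero out everything to the right

    def fmt(v: list[int]) -> str:
        return ".".join(str(n) for n in v)

    return f">={fmt(lower)},<{fmt(upper)}"


print(expand_constraint("~1.20"))  # >=1.20.0,<1.21.0
print(expand_constraint("^1.20"))  # >=1.20.0,<2.0.0
```

The practical difference: a tilde constraint on 1.20 only lets patch releases through, while a caret constraint accepts any release up to the next major version.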
If you add the --dev flag to the poetry add command, Poetry treats the dependency as a development dependency. For example:
poetry add black --dev
This distinction is useful because for production builds, you can install only the non-development dependencies by adding the --no-dev flag:
poetry install --no-dev
This prevents unnecessary bloat in your production environment where you don’t need tools like formatters and linters.
Finally, you can enter a shell with the virtual environment managed by Poetry activated:
poetry shell
As mentioned above, Poetry manages virtual environments for you and gives you a dedicated shell. But for production machine learning projects, I think it’s usually better to develop inside a Docker container.
In this case, I recommend turning off Poetry’s virtual environment management. You can do that with a config command:
poetry config virtualenvs.create false
Or you can use an environment variable, for example:
POETRY_VIRTUALENVS_CREATE=false poetry install
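If you want to confirm that the interpreter inside your container really is the system Python and not a virtual environment, one quick check is to compare sys.prefix with sys.base_prefix. This is a minimal sketch: in_virtualenv is a hypothetical helper, and the check assumes environments created by the standard venv module or a recent virtualenv release.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the underlying Python installation.
    return sys.prefix != sys.base_prefix


print("virtualenv" if in_virtualenv() else "system Python", "at", sys.prefix)
```

Run inside the container after installing with virtual environment creation turned off, this would be expected to report the system Python.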
And now you should know everything you need to get started using Poetry to manage the Python environment for your machine learning projects! Of course, there’s a ton more to learn about Poetry. Fortunately, their documentation is pretty good.
Happy ML engineering!