Poetry for Package Management in Machine Learning Projects

When you’re building a production machine learning system, reproducibility is a proxy for the effectiveness of your development process. But unless you lock all your Python dependencies, your builds are not actually repeatable. Work in a Python project without a lock file for long enough and you will eventually get a broken build because of a transitive dependency (that is, a dependency of a dependency).

But a broken build isn’t the most dangerous problem. A transitive dependency can change in a way that affects an algorithm’s results without you ever knowing it. Imagine being unable to reproduce a result because a transitive dependency that was never pinned changed in an unexpected way. Not good.

Enter Poetry. Poetry is a dependency management and packaging tool for Python projects. Admittedly, there are a lot of such tools for Python and there’s no guarantee that Poetry will win. But the Python development community does seem to be slowly standardizing around Poetry.

What is Poetry?

Virtual environment tools like virtualenv and conda have been an important part of modern Python development for a while. But there are a few problems:

Different tools and approaches don’t play nicely with each other
The standards for configuration are inconsistent
Dependency resolution can fail in mysterious ways

Ideally, you want all your builds to be precisely repeatable and you want a straightforward solution that you can use across all your projects. This is where Poetry shines.

Think of Poetry like npm for Python. It’s centered around pyproject.toml, a single config file that defines your entire Python environment, and it comes with a lock file, poetry.lock, which pins all dependencies (including transitive dependencies).

Poetry configuration

Poetry extends the pyproject.toml file (introduced in PEP 518) to specify build requirements, project dependencies and development dependencies. It also includes project-level configuration (things like the name, description and license), scripts, extra dependencies and more.

Importantly, the Poetry spec for pyproject.toml works for both Python packages and Python applications, so you don’t need a different approach for each kind of project. This matters for machine learning development because reusable utilities (config handling, or basic calculations like bounding box math) often fit better in Python packages than in your actual machine learning application.

Lock file

The poetry.lock file is similar in concept to running pip freeze > requirements.txt every time you add or remove a package. It’s just that Poetry manages this for you when you add or remove a dependency.
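For intuition about what "freezing" means, the sketch below produces pip-freeze-style pins for whatever is installed in the current interpreter, using only the standard library. It illustrates the idea of pinning exact versions and is not how Poetry builds its lock file; poetry.lock records more than this (file hashes, for example). The function name freeze is made up for this example.

```python
from importlib import metadata


def freeze() -> list[str]:
    """Return name==version pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for line in freeze():
    print(line)
```

Every line pins one package to one exact version, transitive dependencies included, which is the property that makes a later install repeat the same environment.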

And Poetry comes with a good dependency resolver. Because some packages on PyPI don’t specify all their dependencies in their metadata, Poetry sometimes has to download a package to inspect it, which can make adding a new package a little slower. But resolving everything up front makes it far less likely that your build breaks down the road as dependencies evolve, which is important.

Quick Start

To install the latest Poetry into your Python environment, run:

curl -sSL https://install.python-poetry.org | python3 -

After installation, the script will print instructions to your screen to add poetry to your path.

To initialize a new project, run:

poetry init

This will guide you interactively through the process of setting up a new Python project with Poetry. You will add information about the project, the version of Python to use and (optionally) dependencies and development dependencies.

Say you want to add a dependency, like NumPy, after your project has been initialized. You can do that with the poetry add command:

poetry add numpy@~1.20

This will install NumPy at a version satisfying >=1.20.0,<1.21.0. Alternatively, you could run poetry add numpy@^1.20 to allow any version satisfying >=1.20.0,<2.0.0. When you run this command, Poetry does a few things:

Update pyproject.toml to specify the new dependency.
Use the dependency resolver to find the set of package versions that best fit the configuration. In this case, just numpy 1.20.3, but this will also include other packages if they’re specified.
Install all the packages found by the resolver. By default, these will be installed into a virtual environment managed by Poetry. But you can also tell Poetry to install dependencies directly to the system Python.
Freeze all dependencies and save them to poetry.lock so the exact build can be repeated in the future.
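The ~ and ^ constraints passed to poetry add can be read as shorthand for explicit version ranges. The sketch below expands them for plain numeric versions. It is a simplified illustration of the documented rules and not Poetry's own parser; the function name expand_constraint is hypothetical, and pre-release tags, wildcards and other operators are not handled.

```python
def expand_constraint(constraint: str) -> str:
    """Expand a '~' or '^' constraint into an explicit '>=...,<...' range."""
    operator, version = constraint[0], constraint[1:]
    given = [int(part) for part in version.split(".")]
    lower = given + [0] * (3 - len(given))  # pad to major.minor.patch

    if operator == "~":
        # Tilde: bump the minor version if one was given, else the major.
        bump = 1 if len(given) >= 2 else 0
    elif operator == "^":
        # Caret: bump the leftmost non-zero component; if every given
        # component is zero, bump the last one that was written.
        nonzero = [i for i, part in enumerate(given) if part != 0]
        bump = nonzero[0] if nonzero else len(given) - 1
    else:
        raise ValueError(f"unsupported constraint: {constraint!r}")

    upper = lower[:bump] + [lower[bump] + 1] + [0] * (len(lower) - bump - 1)

    def fmt(parts: list[int]) -> str:
        return ".".join(str(part) for part in parts)

    return f">={fmt(lower)},<{fmt(upper)}"


print(expand_constraint("~1.20"))  # >=1.20.0,<1.21.0
print(expand_constraint("^1.20"))  # >=1.20.0,<2.0.0
print(expand_constraint("^0.2.3"))  # >=0.2.3,<0.3.0
```

The tilde form only lets the last written component float, while the caret form accepts anything up to the next release that semantic versioning would treat as breaking. For numerical libraries whose results you need to reproduce, the tighter tilde range is the more conservative choice.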

If you add the --dev flag to the poetry add command, Poetry treats the dependency as a development dependency. (Poetry 1.2 and later deprecate --dev in favor of --group dev.) For example:

poetry add black --dev

This distinction is useful because for production builds, you can install only the non-development dependencies by adding the --no-dev flag (--without dev on Poetry 1.2 and later):

poetry install --no-dev

This prevents unnecessary bloat in your production environment where you don’t need tools like formatters and linters.

Finally, you can enter a shell with the virtual environment managed by Poetry activated:

poetry shell
python

Recommended Setup

As mentioned above, Poetry manages virtual environments for you and gives you a dedicated shell. But for production machine learning projects, I think it’s usually better to develop inside a Docker container.

In this case, I recommend turning off Poetry’s virtual environment management. You can do that with a config command:

poetry config virtualenvs.create false

Or you can use an environment variable, for example:

POETRY_VIRTUALENVS_CREATE=false poetry install
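If you want to confirm that packages really are going into the container's own interpreter and not into a virtual environment, a small check like the following can help. It is a sketch using only the standard library; the function name in_virtualenv is made up for this example, and the check relies on how venv and recent virtualenv releases set sys.prefix.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases point sys.prefix at the environment,
    # while sys.base_prefix keeps pointing at the base installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

Run inside a container where virtualenvs.create is false, you would expect the second line to report False; run under poetry shell on your laptop, you would expect True.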

And now you should know everything you need to get started using Poetry to manage the Python environment for your machine learning projects! Of course, there’s a ton more to learn about Poetry. Fortunately, the documentation is pretty good.

Happy ML engineering!