Machine Learning Pipelines Should Be Python Packages: Transfer Learning

Ben Cook • Posted 2019-04-23 • Last updated 2021-07-08

When machine learning pipelines are well-formed Python packages, transfer learning is much easier!

This post is my second stab at convincing people that ML pipelines should be Python packages. A previous post argued (among other things) that Python packages make it easier to develop and understand an ML pipeline. Here I want to make the case that Python packages make it easier to develop and understand future ML pipelines. That is, Python packages dramatically simplify transfer learning because they’re composable. This may seem obvious if you’ve used something like Keras Applications, but are you actually writing Python packages when you build machine learning models…?

The test case for this argument is an ASCII letter classifier that starts from some MNIST feature weights. In this admittedly contrived example, starting from feature weights is important because I have far fewer labeled examples of ASCII letters (100 per class, to be precise) than MNIST digits.

The bulk of this post will describe how the MNIST pipeline package (the parent ML pipeline) makes both similarities and differences easier to handle in the ASCII letter pipeline (the child ML pipeline). I recommend keeping the parent and child repos open in separate tabs.

What I got for free

Importing logic from the parent ML pipeline gave me several things for free in the ASCII letter pipeline.

Model architecture: the entire model is used without modification. The extrapolated principle here is that if you're building a classifier pipeline for RGB images, you probably shouldn't be writing any code for your model architecture. Instead, check out the excellent Keras Applications, which are exposed in TensorFlow as the tf.keras.applications subpackage. PyTorch has a very similar package called TorchVision.
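For example, here's a minimal sketch of what reusing a pretrained architecture looks like with tf.keras.applications. The model choice, input size, and classifier head are illustrative, not part of this pipeline:

import tensorflow as tf

# Reuse pretrained ImageNet features instead of writing a conv stack by hand.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base.trainable = False  # freeze the pretrained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),  # your own classifier head
])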

Config class: I subclass the MNIST config in the ASCII letter pipeline. That way, the only things that need to be written are the bits that differ in the child pipeline:

from dataclasses import dataclass
from pathlib import Path

from mnist import MnistConfig


@dataclass
class AsciiLetterConfig(MnistConfig):
    # Defaults for the ASCII letter (child) pipeline; everything else
    # is inherited from MnistConfig.
    n_classes: int = 26
    artifact_directory: str = str(Path.home()/'.mlpipes/ascii_letter')
    seed: int = 12345
    n_train_samples: int = 2080
    n_test_samples: int = 520
    image_height: int = 32
    image_width: int = 32
    batch_size: int = 128
    n_epochs: int = 1
    learning_rate: float = 0.005
    pretrained_features: bool = True

This gives me some useful properties and methods such as image_shape (which returns a tuple with image_height, image_width, and n_channels) and from_yaml() (which reads config overrides from a YAML file) without needing to re-implement them.
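As a quick sketch of how that plays out in practice (the YAML file name and the exact from_yaml() signature are illustrative; the real implementations live in the parent package):

config = AsciiLetterConfig.from_yaml('overrides.yaml')
print(config.image_shape)  # e.g. (32, 32, 1), depending on n_channels
print(config.n_classes)    # 26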

Data batch generator: all the logic to read and preprocess data is imported directly from the MNIST pipeline. When you write a good function, you shouldn't have to write it again in the future. Here's the MNIST dataset loader that reads from TFRecord files with the TensorFlow Dataset API. The logic isn't horrible, but it's not something I want to copy and paste for each new ML pipeline – better to just import it directly in the ASCII letter pipeline. At some point, it probably makes sense to pull some of this functionality out into dedicated packages. I'm planning to do this for the config class soon. Dataset-fu could be another candidate for a separate package.
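For reference, a TFRecord loader built on the tf.data API boils down to something like the sketch below. The feature keys and image encoding here are assumptions, not the pipeline's actual schema:

import tensorflow as tf

def load_dataset(tfrecord_path, image_height, image_width, batch_size):
    # Illustrative feature spec; the real pipeline defines its own schema.
    feature_spec = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(example_proto):
        features = tf.io.parse_single_example(example_proto, feature_spec)
        image = tf.io.decode_raw(features['image'], tf.uint8)
        image = tf.reshape(image, (image_height, image_width, 1))
        image = tf.cast(image, tf.float32) / 255.0
        return image, features['label']

    return (tf.data.TFRecordDataset(tfrecord_path)
            .map(parse)
            .shuffle(1024)
            .batch(batch_size))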

What’s different

So what’s different in the ASCII letter classifier? Really just a couple of things, and the structure of the parent ML pipeline makes them easy to accomplish.

Different hyperparameters: the images go from 28×28 to 32×32 and the number of classes goes from 10 to 26. The only code required to make these changes is setting image_height, image_width, and n_classes in the subclassed config. You can take a look at the model module to see how I took advantage of the model architecture defined in the MNIST pipeline. I couldn’t use the MNIST function directly because the weight files need to live at different locations.
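Conceptually, the child model module amounts to something like this sketch (the layer stack and weight-loading details are assumptions, not the repo's actual code). Note that conv-layer weights don't depend on the spatial input size, which is why features trained on 28×28 digits can be loaded into a 32×32 model:

import tensorflow as tf

def build_model(n_classes=26, image_shape=(32, 32, 1), feature_weights=None):
    # Same idea as the parent pipeline: a small conv feature extractor
    # followed by a dense classifier head sized for the new label set.
    features = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=image_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
    ], name='features')
    if feature_weights is not None:
        # Start from feature weights exported by the MNIST (parent) pipeline.
        features.load_weights(feature_weights)
    return tf.keras.Sequential([
        features,
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(n_classes, activation='softmax'),
    ])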

Different raw data format: in the MNIST pipeline, the raw data came for free from tf.keras.datasets. For the ASCII letters, I store examples in folders labeled by class and then compress them into a tarball. This all happens in a Jupyter notebook, and the only part of the ASCII letter pipeline that cares about this difference is the save_datasets() function. Everything else in dataset preparation and preprocessing is reused from the parent pipeline.
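To make that concrete, here is a rough sketch of converting class-labeled folders into a TFRecord file. The folder layout, file names, and PNG encoding are assumptions for illustration, not the repo's actual save_datasets():

import tarfile
from pathlib import Path

import tensorflow as tf

def save_datasets(tarball_path, output_path, extract_dir='raw'):
    # Sketch only: assumes a tarball of <class_letter>/<image>.png folders.
    with tarfile.open(tarball_path) as tar:
        tar.extractall(extract_dir)
    classes = sorted(p.name for p in Path(extract_dir).iterdir() if p.is_dir())
    with tf.io.TFRecordWriter(output_path) as writer:
        for label, name in enumerate(classes):
            for image_file in sorted(Path(extract_dir, name).glob('*.png')):
                # Images are stored as PNG bytes here and would be decoded
                # with tf.io.decode_png on the way back in.
                example = tf.train.Example(features=tf.train.Features(feature={
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(
                        value=[image_file.read_bytes()])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(
                        value=[label])),
                }))
                writer.write(example.SerializeToString())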

That’s it. The whole thing is 165 lines of Python code and took less than 4 hours to write. Check it out on GitHub!

One other point: in real life, you could consider adding something like the ASCII letter classifier to the MNIST pipeline repo. You could have multiple config classes and multiple save-datasets functions (maybe save_mnist_datasets() and save_ascii_letter_datasets()). The pros and cons would depend on the actual use case. But there’s nothing stopping you from putting more than one model in the same ML pipeline if they share lots of code.

In my next post, I’ll deploy this toy model to TensorFlow.js in order to do the really useful task of classifying ASCII letters in the browser. Stay tuned!