Everything is a Pipeline

Ben Cook • Posted 2020-05-21 • Last updated 2021-03-24

Imagine this workflow:

  1. Download a multi-file dataset
  2. Convert each file to fit your preferred input format
  3. Run simulation code on each formatted input file
  4. Run an analysis to aggregate the results

This pattern is a directed acyclic graph (aka DAG, aka pipeline).

DAGs can fan out and fan in, but they don’t have any cycles. That is, as you move through the workflow, you can split work into parallel jobs at one step and merge the results back together at a later step. You just can’t go backwards.
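
To make the shape concrete, here’s one way to sketch the example workflow as a plain Python mapping from each job to the jobs that depend on it. The file names and job labels are made up: each file gets its own branch of per-file work, and everything fans back in at a single aggregation step.

```python
files = ["part-0.bin", "part-1.bin", "part-2.bin"]  # hypothetical dataset files

# Each job maps to the jobs that depend on it. The per-file branches can run
# in parallel (fan out); the single aggregate job waits on all of them (fan in).
dag = {f"download:{f}": [f"convert:{f}"] for f in files}
dag.update({f"convert:{f}": [f"simulate:{f}"] for f in files})
dag.update({f"simulate:{f}": ["aggregate"] for f in files})
dag["aggregate"] = []  # terminal job: nothing downstream, no way to go back
```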

When you’re working on a pipeline, it’s helpful to think about each step as an independent job with a separate orchestration layer responsible for running all the steps. Even if you plan on running all steps from your laptop, this approach will make your code more modular.

In the example workflow, downloading a single file of the dataset would be one function, converting a single file to your preferred format would be a second, running the simulation on a single input would be a third, and aggregating all the results would be a fourth. Then you would have a main loop that calls these steps in the right order.
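
As a rough sketch (the function names, arguments, and file handling are placeholders, not a prescribed structure), that might look something like this:

```python
from pathlib import Path

def download_file(url, dest):
    """Fetch one file of the dataset and return its local path."""
    ...  # placeholder body

def convert_file(raw_path):
    """Convert one raw file into the preferred input format."""
    ...  # placeholder body

def run_simulation(input_path):
    """Run the simulation on one formatted input and return a result path."""
    ...  # placeholder body

def aggregate_results(result_paths):
    """Combine all simulation results into a single summary."""
    ...  # placeholder body

def main(urls, workdir=Path("data")):
    """The simplest orchestration layer: a loop that calls each step in order."""
    results = []
    for url in urls:
        raw = download_file(url, workdir / Path(url).name)
        formatted = convert_file(raw)
        results.append(run_simulation(formatted))
    return aggregate_results(results)
```

Each step only knows about its own inputs and outputs, which is exactly what lets you swap the loop out for a real orchestrator later.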

Another advantage of thinking about these steps independently is that it opens you up to more powerful batch processing tools like Apache Airflow and AWS Batch. These systems serve as the orchestration layer: they can execute code on different servers (allowing you to parallelize things like fetching datasets), handle retries, give you a view into the pipeline’s progress, and much more.
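
For instance, assuming Airflow 2.x, the same four steps might be wired up roughly like this. The DAG id, the `pipeline_steps` module, and the step functions are carried over from the hypothetical sketch above, and real tasks would also pass their arguments (for example via `op_kwargs`).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module holding the step functions sketched earlier.
from pipeline_steps import download_file, convert_file, run_simulation, aggregate_results

with DAG(
    dag_id="simulation_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger runs manually
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download", python_callable=download_file)
    convert = PythonOperator(task_id="convert", python_callable=convert_file)
    simulate = PythonOperator(task_id="simulate", python_callable=run_simulation)
    aggregate = PythonOperator(task_id="aggregate", python_callable=aggregate_results)

    # The orchestrator now owns the ordering, retries, and parallel execution.
    download >> convert >> simulate >> aggregate
```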

Start thinking about your projects as pipelines now and you will set yourself up to scale efficiently in the future.