
Why Can’t I Just Cron My Batch Pipeline?

So many batch pipeline orchestrators, which one to use? (hint: not cron)

*Figure: so many orchestrators*

Say you’ve created an application that extracts value out of data. Now you need to rerun a series of steps periodically to generate new data for your application, taking new data as input on each rerun. Figuratively, you want to create a pipe through which new data can flow: a pipeline, if you will.

Just run it on a schedule!

crontab is the tool many people reach for. What it does is simple: run a command (i.e. a job or task) on a given schedule!

For example, `5 4 * * *` indicates that a particular command should be run whenever the clock hits 4:05 (i.e. daily). See crontab.guru or just type `man crontab` in your terminal to get started. This simple approach brings about two major issues.
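To make the schedule concrete, here is a quick sketch using the third-party croniter library (just one convenient way to evaluate cron expressions in Python; it is not part of cron itself):

```python
from datetime import datetime
from croniter import croniter  # third-party: pip install croniter

# evaluate the example schedule "5 4 * * *" starting from a fixed point in time
schedule = croniter("5 4 * * *", datetime(2023, 1, 1))
print(schedule.get_next(datetime))  # 2023-01-01 04:05:00
print(schedule.get_next(datetime))  # 2023-01-02 04:05:00, i.e. daily at 4:05
```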

Retrying failures

Jobs can fail for any number of reasons: a spot instance being reclaimed by the cloud provider mid-run, expired credentials, and so on.

Recovering from intermittent errors is easy: a simple retry will do. Ensuring that a retry succeeds, however, requires that jobs be idempotent (i.e. each run produces exactly the same results). This is usually assumed rather than verified, so that already-completed jobs can be skipped on a retry. In luigi, this is achieved by setting up a task’s output property.
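As a minimal sketch (the task name and file path are invented for illustration): luigi checks whether the target returned by `output()` already exists, and skips the task if it does.

```python
import luigi

class DailyExtract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # luigi treats the task as complete if this target exists,
        # so a retry of the pipeline skips work that already finished
        return luigi.LocalTarget(f"data/extract_{self.date}.csv")

    def run(self):
        # LocalTarget writes via a temp file, so a killed run leaves no partial output
        with self.output().open("w") as f:
            f.write("new data goes here\n")
```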

Dependencies

In many cases, some jobs depend on prior jobs succeeding. One way to handle this with crontab is simply to observe past runtimes and schedule downstream jobs to start after their upstream jobs are expected to have completed.

Basically it’s just a Gantt chart: downstream tasks start at their scheduled times, and we cross our fingers and hope everything upstream has already completed.

*Figure: gantt*

The downside of this approach is that benchmarking past runtimes and setting appropriate schedules can be an involved process. Moreover, a schedule loose enough to accommodate most historical runtimes can take much longer end-to-end than necessary, while a tight schedule is prone to breaking whenever a job takes longer than anticipated.

A common way to manage dependencies is to specify them explicitly, using relationships such as `requires` (which defines upstream dependencies in luigi) or `set_upstream` and `set_downstream` (in Apache Airflow).
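For instance, a minimal luigi sketch (task names invented for illustration) where the downstream task declares its upstream via `requires`:

```python
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/extract.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data\n")

class Transform(luigi.Task):
    def requires(self):
        # upstream dependency: luigi runs Extract first if its output is missing
        return Extract()

    def output(self):
        return luigi.LocalTarget("data/transform.csv")

    def run(self):
        # self.input() is the output target of the required task
        with self.input().open() as raw, self.output().open("w") as out:
            out.write(raw.read().upper())
```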

Just specifying these relationships doesn’t mean that the jobs can actually be run, since there may be cyclical dependencies. If one plots the dependencies between jobs, with jobs as nodes and upstream requirements as edges, they should form a DAG (i.e. a directed acyclic graph), meaning there are no cycles (i.e. no job eventually requires itself).

This DAG represents the same dependencies as the Gantt chart above.

```mermaid
graph LR
A --> B
A --> C
B --> D
C --> D
```
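In Apache Airflow (assuming 2.x; the dag_id, start date, and echo commands are placeholders), the same DAG could be declared as follows, with `>>` as shorthand for `set_downstream`:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    a = BashOperator(task_id="A", bash_command="echo A")
    b = BashOperator(task_id="B", bash_command="echo B")
    c = BashOperator(task_id="C", bash_command="echo C")
    d = BashOperator(task_id="D", bash_command="echo D")

    # equivalent to a.set_downstream(b), a.set_downstream(c), etc.
    a >> [b, c]
    [b, c] >> d
```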

Checking for cycles is a classic topological-sorting exercise; this example on Leetcode solves exactly this problem.
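A minimal sketch of that check in plain Python, using Kahn’s algorithm: repeatedly peel off jobs with no remaining upstream requirements; any job never reached must sit on (or behind) a cycle.

```python
from collections import defaultdict, deque

def is_dag(edges: list[tuple[str, str]]) -> bool:
    """Return True if the (upstream, downstream) dependency graph has no cycles."""
    downstream = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for up, down in edges:
        downstream[up].append(down)
        indegree[down] += 1
        nodes.update((up, down))

    # start with jobs that have no upstream requirements
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for dep in downstream[node]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                ready.append(dep)

    # a node that was never reached sits on a cycle
    return visited == len(nodes)

# the DAG pictured above
assert is_dag([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
# A requires B requires A: not runnable
assert not is_dag([("A", "B"), ("B", "A")])
```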

Monitoring

Alerts are typically delivered through group-managed emails or notifications on an internal messaging platform such as Slack or Teams. A few main areas of monitoring are common, such as alerting on failed runs that need intervention and on runs that take much longer than expected.
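As one sketch of failure alerting in Airflow (again assuming 2.x; `notify_failure` and its print statement are placeholders for a real Slack/Teams/email hook):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_failure(context):
    # placeholder: swap the print for your messaging-platform webhook
    ti = context["task_instance"]
    print(f"ALERT: task {ti.task_id} failed for run {context['ds']}")

with DAG(
    dag_id="monitored_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args={"on_failure_callback": notify_failure},
) as dag:
    # a command that always fails, just to show the callback firing
    BashOperator(task_id="extract", bash_command="exit 1")
```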

What to look out for

To summarize, here are a few things to look out for (on top of running through the documentation): how failures are retried and whether jobs can be made idempotent, how dependencies are declared and validated, and what monitoring comes built in.

Trifecta of orchestration

Remember that orchestration requires all of the following (ideally as distinct entities): a schedule for triggering jobs, dependency management with retries, and monitoring.

*Figure: trifecta of orchestration*