Luigi

Luigi is an open-source Python package that simplifies the plumbing of long-running batch pipelines. Rather than relying on external XML files or opaque DSLs, Luigi expresses dependency graphs in Python code, so tasks, parameters and date arithmetic can be composed programmatically. Originally developed at Spotify and now maintained by the community, the project is designed for pipelines that may comprise thousands of tasks and run for hours, days or even weeks.

At its core Luigi provides dependency resolution, workflow scheduling, task orchestration and a built-in visualiser. You write tasks as Python classes that declare their outputs and dependencies; Luigi's scheduler and workers coordinate which tasks should run, when, and on which workers. Common operational features include retries, per-task timeouts, configurable resource limits, parallel scheduling, atomic file operations for targets, and robust failure handling. Luigi can also re-check external dependencies during a run, group failures into batched notifications, and send error alerts via SMTP, SendGrid, Amazon SES/SNS or other configured channels.

Luigi ships with a toolbox of task templates and integrations that make common jobs easier. Out of the box it supports Hadoop streaming, Spark submit jobs, Hive queries, HDFS and local file targets, and database targets for Postgres, MySQL and Redshift. There are also Kubernetes Job task implementations and options for using efficient HDFS clients such as snakebite or webhdfs. Metrics and observability integrations include Datadog and Prometheus collectors, and logging and notification behaviour is extensively configurable.

Luigi is installable via pip (pip install luigi, or pip install luigi[toml] for TOML-based configuration), supports modern Python 3.x runtimes, and can be configured through multiple levels of configuration files covering schedulers, workers and task behaviour. Typical use cases include ETL orchestration, nightly or periodic data aggregation, large-scale report generation, and machine-learning pipelines where preprocessing, training and evaluation steps are chained together with complex dependencies. Because tasks are defined in Python, you can easily trigger non-Python jobs (JVM Hadoop jobs, Spark in Scala, containerised workloads, CLI tools) while keeping the graph and orchestration in one place. Teams use Luigi to stitch together Hadoop/Spark jobs, move data to and from relational stores, run recurring analytics, and coordinate workflows that must recover gracefully from partial failures.

Operational details and configuration are a strong point: Luigi provides a central scheduler to track progress and worker processes that pull and run tasks. Configuration can control scheduling timeouts, worker heartbeats, retry policies, resource quotas, history storage, and the scheduler's persistence path. The visualiser gives an interactive dependency-graph view showing completed, pending and running tasks, which is useful when debugging or explaining pipeline topology. For production deployments you can tune parallel scheduling, caching of completion checks and batch e-mailing of failures, and integrate with monitoring systems. Because state files and history can be persisted, administrators should plan clean shutdown/restart procedures and be mindful of compatibility when upgrading.
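To make that operational tuning concrete, a luigi.cfg might look roughly like the sketch below. The section and option names follow Luigi's documented configuration scheme, but the specific values, addresses and paths here are illustrative placeholders, not recommendations:

```ini
# luigi.cfg -- illustrative values only

[scheduler]
# Seconds a failed task waits before it becomes runnable again.
retry_delay = 600
# Persist task history to the database configured in [task_history].
record_task_history = true
# Where the scheduler pickles its state between restarts.
state_path = /var/lib/luigi-server/state.pickle

[worker]
# Keep the worker alive waiting for new work instead of exiting.
keep_alive = true
# Heartbeat interval to the central scheduler, in seconds.
ping_interval = 20
# Kill tasks that run longer than this many seconds.
timeout = 14400
# Re-check external dependencies during a run.
retry_external_tasks = true

[resources]
# At most two concurrently running tasks may hold the "hive" resource.
hive = 2

[email]
# Error alerts: smtp, sendgrid, ses or sns.
method = smtp
receiver = data-alerts@example.com
sender = luigi@example.com

[task_history]
db_connection = sqlite:///luigi-task-history.db
```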
If you’re getting started: define small Task classes that declare outputs and requires() dependencies (a minimal sketch appears after this paragraph), test them locally using local file targets, then scale up by introducing database or HDFS targets and deploying a central scheduler and workers. Use the visualiser to inspect dependency graphs and failure points, then extend with Spark/Hive/Hadoop templates or Kubernetes job tasks as your infrastructure requires. Luigi is intentionally lightweight and extensible: it doesn't replace lower-level processing frameworks, but it excels at stitching them together into resilient, maintainable workflows.
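Here is a minimal, self-contained sketch of that kind of starter pipeline. The task names and file names are invented for illustration, but the output()/requires()/run() structure and the luigi.build() call follow Luigi's standard API:

```python
import luigi


class GenerateNumbers(luigi.Task):
    """Hypothetical first task: write the numbers 0-9 to a local file."""

    def output(self):
        # LocalTarget writes are atomic: the file only appears once run() succeeds.
        return luigi.LocalTarget("numbers.txt")

    def run(self):
        with self.output().open("w") as out:
            for i in range(10):
                out.write(f"{i}\n")


class SumNumbers(luigi.Task):
    """Depends on GenerateNumbers and writes the sum of its lines."""

    def requires(self):
        # Declaring the dependency is all Luigi needs to build the graph.
        return GenerateNumbers()

    def output(self):
        return luigi.LocalTarget("sum.txt")

    def run(self):
        with self.input().open() as src:
            total = sum(int(line) for line in src)
        with self.output().open("w") as out:
            out.write(f"{total}\n")


if __name__ == "__main__":
    # Run everything in-process with the local scheduler; once the pipeline
    # grows beyond local testing, point workers at a running luigid instead.
    luigi.build([SumNumbers()], local_scheduler=True)
```

The same graph can also be triggered from the command line with luigi --module <your_module> SumNumbers --local-scheduler, and switched over to a central luigid scheduler by dropping the --local-scheduler flag.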