August 11, 2020 • 11 minute read
Dagster: The Data Orchestrator
As machine learning, analytics, and data processing become more complex and central to organizations, improving the software behind them becomes more urgent.
Data within organizations is disorganized and not trusted. Engineers and practitioners feel unproductive and mired in drudgery. Collaboration between data scientists, data engineers, analysts, and other roles that build complex data systems is painful. The software that processes and produces data is unreliable and resistant to change.
This state of affairs is why Dagster exists, which we first discussed publicly a year ago.
Dagster is a new type of workflow engine: a data orchestrator. Moving beyond just managing the ordering and physical execution of data computations, Dagster introduces a new primitive: a data-aware, typed, self-describing, logical orchestration graph. This graph explicitly models an implicit, pre-existing structure in every data application and platform. We believe this graph is integral to the entire application lifecycle and, when made accessible and operable over an API, can form the basis of an entire ecosystem of tools and libraries.
Dagster is a new type of workflow engine: a data orchestrator.
Dagster targets two primary audiences. The first is full-stack engineers and data scientists responsible for an entire end-to-end data application and infrastructure. The second is platform teams that enable other data teams to efficiently and autonomously deliver features and capabilities. They both care deeply about the interactions and interfaces between roles, tools, and infrastructure. Technology choices do not limit them: Dagster is flexible and adaptable, designed to run any tool, use any storage, and deploy to any infrastructure.
What also unites these audiences is the belief that there is something “off” in today’s data systems. Code is under-abstracted and under-tested. Unproductive development environments and repetitive infrastructure management bedevil the developer experience. The systems are too unreliable and difficult to change. The way out is to introduce new abstractions and tools that make these systems more testable, reliable, and fun to build.
Since our launch, we’ve worked with many users building remarkable systems with Dagster. Our latest release, 0.9.0 — code-named “Laundry Service” — represents the culmination of that year of hard work and learning. This post discusses what we’ve learned, what we’ve built, and the design principles we’ve used to guide the system’s evolution.
Data Applications and the Orchestration Graph
After a year, we’re more confident in the broad applicability of the concept of a data application, which unifies a set of software practices, artifacts, and systems — such as ETL, ELT, and ML training — that have different historical roots but convergent needs and characteristics. We define a data application as a graph of computations that consume and produce data assets.
The shared characteristics of these systems translate into challenges that data teams feel today:
- Local development with fast feedback cycles
- Effective testing before deployment
- Integration with the diverse tools used by data practitioners
- Collaboration between the different personas that construct them
- Debugging when things go wrong
- Linking data assets to the code that produced them
- Managing application complexity in concert with changing and increasingly demanding requirements
The net result is systems of stunning complexity, whose formidable needs are not met by the software ecosystem that supports them today.
In the status quo, traditional workflow engines such as Airflow work with a purely operational dependency graph: ensuring proper execution order, managing retries, consolidating logging, and so forth. These systems were a huge step forward over loosely coupled cron jobs and other solutions that did not formally define dependencies. A narrow focus on execution allowed those systems to be maximally general while demanding minimal change to the code that they orchestrated.
Dagster makes different tradeoffs, enabling a more structured programming model that exposes a richer, semantically aware graph. The data computations within Dagster are:
- Modeled as coarse-grained functions with typed inputs and outputs, not with parameter-less tasks
- Connected with data dependencies, not with pure execution dependencies
- Parametrized by schematized configuration separate from code, not tightly coupled, loosely structured code and configuration
- Executable in multiple environments (dev, test, and prod), not bound to a specific deployment
- Observable via stream of metadata events, not complete “black box” computations
- Viewable in tools in your local environment with no infrastructure requirements, not only after deployment and execution
We call this opinionated structure the orchestration graph. This graph and the tools built on its API enable a dramatically improved way to structure data systems. It can be tested, deployed, executed, reused, and debugged flexibly. A software engineering process can be built around it in which multiple teams and personas can be effective, and it naturally becomes the system of record for your data processes.
The orchestration graph is the abstraction that connects all practitioners.
We designed this graph and system with broader principles in mind. These principles encode our beliefs about the underlying nature of data applications and the direction of the broader ecosystem.
- Acknowledge and manage complexity: Data application construction is an extraordinarily and subtly difficult endeavor, and the status quo is under-abstracted software that results in chaos and untamed complexity. The complexity cannot be wished away. Instead, acknowledge it and adopt the abstractions and engineering processes needed to manage it.
- Embrace heterogeneity: Data tools and systems are heterogeneous for a good reason: people with diverse skill sets, needs, and tool preferences build them. They are deployed and execute on a wide range of infrastructure. Rather than forcing homogeneity and vertical integration, embrace heterogeneity and manage it with cross-cutting tools and software abstractions.
- Data-aware orchestration: Computations in data applications consume and produce data. The orchestrator should be aware of this. Typed data dependencies mean correct, testable, and understandable software. Structured metadata events emitted by computations are a definitive log of operational data and asset creation, forming a critical base layer for data catalogs, data lineage, and self-service ops.
The orchestration graph is the common abstraction that connects all practitioners. Practitioners may use different computational runtimes, storage systems, programming languages, and tools, but the data assets they consume and produce must come from somewhere and go somewhere. This universality makes it a natural point of leverage for shared tools, collaboration, and managing system complexity. The orchestration graph should not only order computation: it should structure, organize, and interrelate computations and the assets they produce.
Acknowledge and Manage Complexity
This is software.
It bears repeating that data applications are software applications, and many of the principles of software engineering have their corollaries in the data domain. But attempts to blindly apply traditional software engineering techniques to data often fall short of their promise, especially compared to the impact of nontraditional tools like notebooks. We need to adapt software engineering concepts to this adjacent domain, where uncontrolled inputs and heavy-weight side effects are the rule rather than the exception.
We think data engineering today is in a similar position as web frontend engineering was a decade ago: wrestling with a novel and complex domain, dramatically under-tooled, and often regarded with unjustified disdain by practitioners in better-understood domains with more mature tooling, like systems programming. At that time, frontend engineering needed changes in web standards and browser APIs, new programming models and runtimes, and investment in tools. Today, even novice frontend developers work in sophisticated frameworks — built on what was considered arcane functional programming principles — with high-quality tools supported by a healthy ecosystem.
Just like web frontend ten years ago, we believe that data application engineering needs new approaches. Dagster offers a set of novel abstractions that lead to more resilient, testable, reusable code.
Take this example, which incorporates a number of our abstractions into a data computation executed in Pandas:
This example uses Pandas to load data about cereals, compute a new feature, sugar_per_cup, and then filter out cereals below a certain, configurable percentile of that feature. We then save that result as a file in a data lake.
This code demonstrates a few notable properties of Dagster:
Functional Computations: The core abstraction is a solid, a functional unit of computation in the orchestration graph. These solids describe their data requirements (in the form of typed inputs and outputs), configuration (with config_schema), and environmental requirements (expressed as required_resource_keys).
Data Dependencies: Solids are connected using data-aware dependencies. The system takes responsibility for marshaling data (or pointers to data) from the outputs of one solid to the inputs of another. This code builds the orchestration graph structure but does not directly execute it.
Gradually Typed Inputs and Outputs: Dagster uses a gradual, optional type system. Note that the input into add_sugar_per_cup is a DataFrame type with no constraints on its internal structure. However, its output defines column schema and other constraints verified at runtime. These are flexible user-defined types specified in software.
Configurable: Solids (as well as other artifacts) can also declare a typed configuration schema with embedded documentation, defaults, and other features. The schema ensures that passed config values conform to a particular shape and type. A typed schema means earlier error detection, high-quality error messages, and easier use and reuse. It also enables tooling support, like our autocompleting config editor in Dagit.
Abstracted Environment: The context object abstracts the environment. Any heavy external dependency can be modeled as a user-provided resource and attached to the system-provided context. Resources model cross-cutting infrastructure concerns in a pipeline and provide a layer of indirection so the user can swap in a different implementation of that resource for testing or local development.
Event Stream: Solids produce a structured stream of metadata events during computation. (A solid like add_sugar_per_cup that returns a single value is just shorthand for one that yields a single output.) This event stream is persisted and serves as an immutable record of everything that has ever happened in your system. The event log explicitly links data to the computations that produce them and contains operational data.
Adopting this API makes the orchestration graph accessible by graphical tools like Dagit for local development and production operations:
These abstractions have an ecosystem effect, which we are just beginning to feel. Take this example of a PySpark solid:
Note that this job is written using the native PySpark API and defines only business logic. What is remarkable is that this code, unmodified, can be executed on a local Spark cluster on your laptop, a remote EMR cluster in AWS, or the Databricks runtime. The only thing that changes is the configuration of the pyspark_step_launcher resource. Libraries contain specific implementations of this resource. Indeed, the Databricks variant is community-contributed. Extending this to GCP, Azure, Qubole, or other runtimes would be relatively straightforward. Dagster provides enough structure so that these infrastructure concerns can be abstracted away from business logic using resources and then shared across the broader ecosystem.
Dagster provides enough structure so that these infrastructure concerns can be abstracted away from business logic using resources and then shared across the broader ecosystem.
Proper use of these abstractions means code that’s easier to test, more likely to be reused, more observable, and more straightforward to execute in different environments. Our users commonly build libraries of reusable solids and resources, accelerating development in their data platform. We are also starting to feel an ecosystem-wide reuse effect, which is exciting. Solids become reusable components of data processing, and resources become reusable components encapsulating infrastructure concerns.
Multi-persona, multi-tool, multi-team, multi-environment
Historically, non-engineers authored ETL workflows in vertically integrated, graphical tools. Some, such as Informatica PowerCenter or Talend, are still in use today. These are tightly constrained development environments, typically not incorporated into formal engineering processes.
These tools existed in this form for a good reason: it is unreasonable for all subject matter experts and business users in all data domains to become formally trained software engineers. But the complexity and integration requirements of data applications outstripped the capability of these closed, proprietary systems long ago. Today, companies typically build internal data platforms, composed of a wide variety of tools, centrally managed by software engineers.
These centralized systems unleash another organizational pathology: the data practitioners who know the domain do not own the end-to-end capabilities. These domain experts must then offload productionization to a completely different team with specialized technical skills but limited domain knowledge. End-to-end ownership in this siloed structure is simply not possible.
Transforming every data practitioner into a formally trained software engineer is neither a reasonable nor desirable goal. But to build reliable multi-persona applications, data practitioners need to be supported with proper tools and integrated into a broader system that allows them to become participants in a software engineering process.
Heterogeneous Data Tools
dbt is a stellar example of a software development tool designed to let users that are not formally trained software engineers build data applications. The team behind dbt is transforming countless analysts into analytics engineers. By using a thoughtfully designed tool, SQL-speaking analysts can construct modular, testable, repeatable software.
Another tool in this category is Papermill, an open-source library for parametrizing and executing Jupyter notebooks. Jupyter is a broadly used tool in the data science community that offers an interactive programming environment with inline visualizations. Notebooks have an earned reputation as hotbeds of throwaway, non-reusable code executable only in the original author’s environment. Papermill offers a novel middle ground: parameterized notebooks, invocable as coarse-grained functions, that can be put under test and scheduled in production workflows.
Users of these tools do not have to learn every concept in the Dagster programming model. Integrations provided by ecosystem libraries or internal data platform teams can adapt the tools to the Dagster environment. Our team and our users utilized this strategy for dbt, Papermill, and other tool integrations. Computations authored in these domain-specific tools are incorporated into a broader orchestration graph that defines the entire multi-persona data application.
Here is an example of our integration with Papermill, called Dagstermill.
This integration makes those notebooks solids, usable in Dagster tools:
All data processing tools consume data, perform a computation, and produce a result. Dagster is flexible enough to integrate any of them into the orchestration graph. We anticipate more integrations with domain-specific tools as our ecosystem grows.
Data platforms inevitably serve many teams that want some level of operational isolation and independence. We previously noted that the orchestration graph touches every team and tool that manages data. It can devolve into a centralized, unmanaged dumping ground with minimal structure and isolation.
Teams want to use their own tools, with their own concrete dependencies, deployed on their own schedule, while still leveraging shared infrastructure.
The practical problem of conflicting Python dependencies is often the first manifestation of this unwanted interdependence. Pipelines authored by the data science team should not have to tangle with the dependencies of PySpark and the JVM because the data engineering team uses it elsewhere in the system. Teams want to use their own tools, with their own concrete dependencies, deployed on their own schedule, while still leveraging shared infrastructure.
As our users deployed Dagster across multiple teams, both they and we experienced this pain firsthand. Previously we loaded user code directly into process in our tools, resulting in the commingling of the system environment with user-defined environments and a monolithic deployment schedule. Errant user code could also crash or destabilize our services.
We now enforce strict process isolation between our tools and the user-defined pipelines, communicating with those processes over a gRPC interface:
Process isolation means teams can use different Python environments or even different Python versions. It means that user code is less likely to introduce instability into the core infrastructure. Team-specific code can be deployed and versioned independently from other teams and the platform, while still sharing critical infrastructure and ops tools.
The heterogeneity of the data domain does not just apply to tools. It also applies to infrastructure and deployment. Just as Dagster interfaces with any data tool — Spark, Python, etc. — it can also deploy to any cloud and execution substrate — such as Kubernetes, on-premise bare-metal nodes, or a custom PaaS.
Dagster has libraries and tools for deploying to common infrastructure stacks. For example, because of the broad adoption of Kubernetes as an execution substrate, we made significant investments to provide a prefabricated solution for deploying Dagster and its constituent parts using Helm.
However, while we enable out-of-the-box deployment to Kubernetes, we do not require Kubernetes. Although it is a popular technology with undeniable momentum, we do not believe Kubernetes will be the universal answer to managing computation. Indeed, we think that many teams jump too quickly to Kubernetes and that there will be a counter-movement back to simpler managed computational substrates like PaaS and FaaS platforms.
Even if you’re deploying to Kubernetes, you don’t want to run entire clusters to test your pipelines. Deployment should be independent from testing. Dagster’s pluggability means that you can develop and test independently of your deployment target.
Early users used our pre-built Kubernetes infrastructure to deploy production instances of Dagster quickly to modern cloud environments. But the flexible nature of the architecture also allowed our users to deploy Dagster in many configurations: on a custom PaaS, allocating ephemeral per-run computational resources, in an air-gapped data center using Dask as an executor, and with Docker-in-Docker to run Dagster steps in isolated containers. We expect more deployment strategies to blossom and library support as the ecosystem matures.
We’ve described three dimensions of heterogeneity in the data ecosystem: processing tools, teams, and infrastructure. But the full heterogeneity of the data world is far beyond the scope of this post. Investors have attempted to capture the ecosystem in landscape diagrams like this one:
It is utterly bewildering, not due to any lack of effort or cleverness by the authors, but because of the underlying reality of the world. This isn’t going to change anytime soon. There will be no single data warehouse, storage system, computational runtime, or vertically integrated platform to rule them all.
The data ecosystem is extraordinarily heterogenous, and this isn’t going to change anytime soon.
The ecosystem spans different kinds of data and access patterns: big data and small data, graph data and log data, analytic and transactional workloads, streaming and parallel algorithms, and many others. Diversity necessitates heterogeneity in storage and compute systems. The ergonomics appropriate to different personas also demand variation in tooling. A data engineer writing spark, a data scientist authoring a notebook, and an analyst writing SQL: these practitioners use dramatically different tools and processes to author computations in the same logical data application. While daunting to manage, this kind of variation is ultimately a strength.
Workflow engines correctly order and manage computations, but are generally unaware of what they are doing, both before execution and at runtime. We firmly believe that this is a lost opportunity. Embracing data-awareness at the orchestration layer can improve productivity and outcomes at all phases of the application lifecycle.
Dagster connects the elements of the orchestration graph with data dependencies. This property facilitates testability, marshaling of intermediate data between processes, and improved correctness via typing. However, this is not the only dimension of data awareness.
Embracing data-awareness at the orchestration layer can improve productivity and outcomes at all phases of the application lifecycle.
Dagster computations also emit a structured stream of metadata events at runtime. Some of these events are system-generated, but they can also be user-provided. They are vended to our runtime, stored in our infrastructure, and streamed back to our tools. This event log is the basis for our live, reactive tools, used for local development and production operations.
This immutable event log also serves as the definitive record of all activity in your data system. Without such a record, reliably answering questions such as “Was my reporting table updated before or after the most recent data import?” is surprisingly difficult. We anticipate that this event log will become an essential piece of data infrastructure, and the foundation of an entire suite of useful tools.
One event type reports asset materializations. Our definition of an asset includes potentially any produced entity that outlives the scope of the computation. Assets can range from traditional, such as tables in data warehouses and files in object stores, to unorthodox, like Slack messages and GitHub pull requests.
Over this store of asset metadata, we built a new tool, the asset manager. It provides an index over the assets produced by computations. The asset manager’s unique value is deep linkage between assets and the computations that produce them. You can look up a pipeline or an individual run and see the assets it has created. Likewise, you can look up an asset and quickly ascertain what pipelines touched or produced it, and when.
We believe this verifiable, trusted linkage, enabled by our programming model, is an important, missing building block of data observability and self-service operations. This out-of-the-box linkage is a feature that a data-aware orchestrator can uniquely provide. It serves as a base layer of data cataloging without having to incorporate another tool. Our goal is not to be the universal data catalog, and we anticipate integrations with more sophisticated data catalogs like Amundsen and Marquez.
A user can inspect a pipeline within the asset manager and immediately see the assets that it touches, rather than examining code or asking a question on a Slack channel. Conversely, a user can look up an asset and see what computations produced it and when without leaving the tool.
This opens up operations and monitoring to the class of users that thinks about assets first, and many not even know what pipeline or computation produced it. They can look it up in the asset manager and, at a minimum, they can quickly locate and contact the person or team responsible for maintaining and producing that data asset. Our users also report that the asset manager enables self-serve operations, resulting in end-to-end ownership of asset production by domain experts with little to no intervention needed by platform teams.
A Data Orchestrator for the Whole Lifecycle
Over the past decade, there have been huge advances in data technology, especially around managing pure scale. Advanced computational runtimes and cloud data warehouses built on infinite, cheap storage and elastic compute are available to any organization with the right tools and sufficient resources. Massive scale is available to the masses.
The primary challenges now reside higher in the stack: productivity, testing, integration, collaboration, correctness, and debugging. Addressing these challenges is not a distributed systems problem. This is about how to organize and structure software and how you use it to implement processes and influence organizational design.
We believe there is an opportunity for a new type of workflow engine, a data orchestrator. At the center of it is a new artifact, the orchestration graph.
The orchestration graph is a powerful point of leverage in data systems. It manages every piece of business logic, invokes every data processing runtime, and, by extension, writes and reads from every storage system in an enterprise. Every data practitioner must interact with this artifact.
Rather than managing narrow execution concerns after deployment, we believe the graph should be a rich artifact, central to every phase of the application lifecycle:
Local Development: Dagster can run locally with minimal system dependencies, and the graph is viewable before deployment and execution.
Test: The graph has a programming model designed for testability. Parametrizable computations and a pluggable environment mean execution on different data and in different contexts.
Deployment: Process isolation between system code and user repositories enable platform and user teams to deploy reliably and independently.
Operations: Our operational tools built on a structured event stream enable the linkage of data and computation, resulting in faster debugging and end-to-end ownership by data teams.
Our users report improved individual productivity, earlier defect detection, increased stakeholder autonomy, better collaboration, increased code reuse, more comprehensible systems, better data tracking, and more reliable software. Considered individually, each dimension listed has considerable value. Taken together, the benefits multiple and compound. The net result is a system that allows you to effectively manage complexity that would be impossible in a less structured system.
The Road Ahead
This release marks a new level of quality and maturity for Dagster. We’re confident that it can power data applications at scale and that the ecosystem can support many more users. We’re looking forward to communicating more openly and publicly about the project. We hope we’ve made the case that an operable, data-aware orchestration graph and the metadata it produces naturally enable all the tool development to come.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a Github discussion. If you run into any bugs, let us know with a Github issue. And if you're interested in working with us, check out our open roles!