Why We Built Dagster for the Data Decade

Announcing our $14M Series A led by Index Ventures, alongside Sequoia Capital, Slow Ventures, Coatue, Amplify Partners, OSS Capital, and others.

Dagster has been built out in the open, but we’ve been quietly building Dagster Labs —previously known as Elementl—, the company behind it, along the way.

Today, I’m excited to announce that Elementl raised a $14M Series A led by Mike Volpi of Index Ventures, alongside Sequoia Capital, Slow Ventures, Coature, Neo, Amplify Partners, OSS Capital, and other leading venture capital investors.

Soon we will also be announcing our hosted product, Dagster+, which unifies all of your data tools into a productive, seamless, enterprise-grade data platform. Sign up here.

Why I Started Elementl

Before I started Elementl, I worked at Facebook, where I founded a team called Product Infrastructure.

Initially, we built internal-facing tools and abstractions to make application developers at Facebook more efficient and productive. But as time went on, our mission expanded to include open-source projects like React, React Native, Relay––and GraphQL, which I co-created along with Lee Byron and Dan Schafer. That experience taught me the awesome power of working on developer tools with the right team, focusing on the right problem, empowering the right community. The worldwide adoption and impact of those technologies continue to stun everyone who participated in their creation.

When I left Facebook and started figuring out what to do next, I spoke to companies inside and outside of Silicon Valley, asking what their biggest technical challenges were. Everywhere I went, I heard the same thing: data and ML infrastructure. I dug in and found the biggest mismatch between the complexity and criticality of a problem domain and the tools to support that domain I had ever seen in my entire career. The problem mattered, the developer experience was excruciating, and opportunity for impact on organizations and industry was massive.

I was hooked.

A Problem That Matters: The Decade of Data

Data assets drive human and automated decision-making at organizations of all sizes. Whether it’s a business leader deciding what customers to focus on, a government agency deciding where to target its spending, or a machine learning model deciding what products to recommend to customers, those decisions are only as good as the data asset they are based on.

But developing high-quality, trusted data assets is a challenging engineering problem. Data is ingested from dozens––if not hundreds––of external sources, SaaS apps, and operational systems of record. Ripped from its original context, data must be integrated and transformed by practitioners into usable and meaningful forms. These data assets must be managed and served to downstream stakeholders and processes: business decision-makers, machine learning models, other data practitioners, and production applications.

The process of transforming, managing, and making sense of data crosses personas and skillsets. It involves many different types of tools and technologies and is riddled with organizational, operational, and technical complexity. And yet, it is critical to the success of every modern organization.

Every organization has a data platform that manages its data assets. The only question is whether it is formally acknowledged and staffed or not.

We call this a data platform: the tools, services, and applications that store, transform, manage, and serve data outside its original source or system of record.

Every organization has a data platform that manages its data assets. The only question is whether it is formally acknowledged and staffed or not.

Productivity Pain: From “Big Data” to “Big Complexity”

Organizations are desperately struggling to manage their data platforms. Despite large amounts of time, energy, and investment, the results are still unsatisfactory. Organizations don’t feel in control of their data; practitioners are overwhelmed and unproductive; and stakeholders too often can’t trust the data assets that inform their decisions.

The last decade of investment in data infrastructure was dominated by “Big Data,” solving the technical problem of successfully performing computations over the massive amounts of data created by modern systems. This was an awe-inspiring engineering achievement. With the right tools and team, today you can infinitely scale computation and storage and efficiently query data sets of arbitrarily large size. The rise of the cloud data warehouses, the IPO of Snowflake, and the imminent IPO of Databricks mark the successful conclusion of the Big Data era.

The era of Big Data is over; the time of Big Complexity has come.

Now that computation and storage at scale are possible, we need to ensure that data assets are productively used and efficiently managed. Today, our problem isn’t the physical processing of data at scale, but managing the complexity of the data platform. Practitioner productivity, data management, and other issues “higher in the stack” will dominate data infrastructure development and tooling for the next decade. The era of Big Data is over; the time of Big Complexity has come.

Our mission is to empower every organization to build a productive, scalable data platform.

Big Complexity can’t be wished away. It must be managed and tamed by making engineers, practitioners, and stakeholders more productive and efficient. We think this is one of the most consequential and formidable software engineering challenges of the next decade. The benefits of productivity and taming complexity compound and scale, not only making existing processes more efficient, but entirely new, unimagined things possible. And that’s why Dagster exists. Our mission: to empower every organization to build a productive, scalable data platform.

The Path to Impact: Data Orchestration

Data orchestration is the beating heart of any organization’s data platform. Any practitioner putting a data asset into production must interact with the orchestrator because all data must come from somewhere and go somewhere. Orchestration invokes every data processing tool, and those tools, in turn, touch every storage system. Orchestration is the layer that interconnects all tools, data, practitioners, and stakeholders.

{' '} Orchestration is the layer that interconnects all tools, data, practitioners, and stakeholders within the data platform.

Historically, orchestrators and workflow engines focused exclusively on operational concerns: the reliable scheduling and ordering of deployed computations. They weren’t focused on practitioner productivity, weren’t designed as multi-tenant platforms, and weren’t specifically geared towards data asset development and production.

{' '} Dagster is an orchestration platform for the development, production, and observation of data assets.

Dagster is a lot more than a workflow engine: it’s an orchestration platform for the development, production, and observation of data assets. We built Dagster to bring support for the full engineering lifecycle to the orchestrator and explicitly target the data management use case.

We think this approach is critical to taming the complexity of modern data platforms. When organizations adopt Dagster, they see huge wins along three dimensions of value:

Productivity: A programming model built for dev and test, and better tooling for observation, operation, and debugging. This means data pracitioners move faster, and data platform engineers enable the end-to-end ownership of computations and data assets across the entire organization.
Scale: Flexible, scalable, multi-tenant infrastructure lets multiple teams within an organization independently deploy, schedule, and execute their data computations in a highly reliable environment.
Context: Since Dagster is data-asset aware, it can provide context around those assets that no other system can. Assets are linked to computations, and consumer-grade tools become the single pane of glass where all stakeholders can understand and operate the data platform.

We believe that this more integrated approach to data management and orchestration will be as self-evident and impactful in hindsight as the vertical integration of compute and storage that has made cloud data warehouses as dominant as they are today.

Not Just Engineering

This is about more than the software infrastructure problems at big companies. In the modern world, data-driven decision-making determines the answer to who qualifies for credit and who doesn’t, what news people see, and how healthcare resources are allocated. Those decisions can only be as good as the data they are based on. Without better tools, we will continue to live in a world that feels out of control, where decisions are based on false data and assumptions, and therefore made in ways we fundamentally cannot explain and understand. Simply put, data management affects everyone.

It’s a privilege to wake up every day and work on this important problem. If you're interested in working with us, Elementl is hiring, and Dagster's open-source community is thriving. I look forward to sharing more about our progress soon.

Have feedback or questions? Start a discussion in Slack or Github.

Interested in working with us? View our open roles.

Want more content like this? Follow us on LinkedIn.