Data Engineering Glossary of Terms with Python code examples

Data Engineering Terms Explained

A guide to key terms used in data engineering. Entries with the icon include useful code examples in Python.
For installation instructions for the packages used in the examples, visit the packages page.

For a complete list of Data Engineering terms all data engineers should know, please check out the terms index.


Dagster Glossary code icon

Aggregate

Combine data from multiple sources into a single dataset.
Dagster Glossary code icon

Align

Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Dagster Glossary code icon

Anomaly Detection

Identify data points or events that deviate significantly from expected patterns or behaviors.
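
One simple way to flag such deviations is a z-score check. The sketch below is illustrative only: the function name `find_anomalies`, the threshold of 2.0, and the sample readings are assumptions, and real pipelines often need more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 95]
anomalies = find_anomalies(readings)
```

With these sample readings only the value 95 lies more than two standard deviations from the mean.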
Dagster Glossary code icon

Anonymize

Remove personal or identifying information from data.
Dagster Glossary code icon

Append

Adding or attaching new records or data items to the end of an existing dataset, database table, file, or list.

Archive

Move rarely accessed data to a low-cost, long-term storage solution to reduce costs. Store data for long-term retention and compliance.
Dagster Glossary code icon

AsyncIO

Speed up execution with asynchronous I/O.
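
As a minimal sketch of the idea, the snippet below runs two simulated I/O waits concurrently with Python's `asyncio`. The coroutine `fetch` and its delays are hypothetical stand-ins for real network or disk calls.

```python
import asyncio


async def fetch(name, delay):
    """Simulate an I/O-bound call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def main():
    # Both waits overlap, so the total time is close to the longest delay,
    # not the sum of the delays.
    return await asyncio.gather(fetch("a", 0.2), fetch("b", 0.1))


results = asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed in, regardless of which one finishes first.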
Dagster Glossary code icon

Augment

Add new data or information to an existing dataset to enhance its value.

Auto-materialize

The automatic execution of computations and the persistence of their results.
Dagster Glossary code icon

Backpressure

A mechanism to handle situations where data is produced faster than it can be consumed.
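
One common form of backpressure is a bounded buffer that refuses new items when full. The sketch below uses the standard-library `queue.Queue`; the buffer size of 2 and the helper `try_produce` are assumptions made for illustration.

```python
import queue

# Bounded buffer between a producer and a consumer.
buffer = queue.Queue(maxsize=2)


def try_produce(item):
    """Return True if the item was accepted, False if the buffer is full."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False  # signal the producer to slow down or retry later


accepted = [try_produce(i) for i in range(3)]  # the third item is rejected
buffer.get_nowait()                            # the consumer frees one slot
accepted_after_consume = try_produce(99)       # now there is room again
```

A blocking `buffer.put(item)` would achieve the same effect by pausing the producer instead of returning False.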

Backup

Create a copy of data to protect against loss or corruption.
Dagster Glossary code icon

Batch Processing

Process large volumes of data all at once in a single operation or batch.

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.
Dagster Glossary code icon

Cache

Store expensive computation results so they can be reused, not recomputed.
Dagster Glossary code icon

Categorize

Organizing and classifying data into different categories, groups, or segments.
Dagster Glossary code icon

Clean or Cleanse

Remove invalid or inconsistent data values, such as empty fields or outliers.
Dagster Glossary code icon

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.
Dagster Glossary code icon

Compact

Reducing the size of data while preserving its essential information.
Dagster Glossary code icon

Compress

Reduce the size of data to save storage space and improve processing performance.
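
A minimal sketch of lossless compression using the standard-library `gzip` module. The sample payload is made up and deliberately repetitive, which is the kind of data that compresses well.

```python
import gzip

# Hypothetical, highly repetitive CSV-like payload.
raw = b"timestamp,value\n" + b"2024-01-01,42\n" * 1000

compressed = gzip.compress(raw)         # smaller byte string
restored = gzip.decompress(compressed)  # identical to the original
```

The compression ratio depends heavily on the data; already-compressed or random bytes may not shrink at all.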
Dagster Glossary code icon

Consolidate

Combine multiple datasets into one to create a more comprehensive view of the data.
Dagster Glossary code icon

Cosine Similarity

A measure of similarity between two vectors, computed as the cosine of the angle between them. Used in text analysis, natural language processing, etc.
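
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below illustrates the formula; the function name is an assumption, and production code would more likely use a numerical library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


parallel = cosine_similarity([1, 2, 3], [2, 4, 6])  # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])      # perpendicular
```

Vectors pointing the same way score close to 1, and perpendicular vectors score 0, regardless of their magnitudes.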
Dagster Glossary code icon

Curate

Select, organize, and annotate data to make it more useful for analysis and modeling.
Dagster Glossary code icon

De-identify

Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.
Dagster Glossary code icon

Deduplicate

Identify and remove duplicate records or entries to improve data quality.
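
A minimal sketch of key-based deduplication that keeps the first occurrence of each record. The helper `deduplicate` and the sample records are hypothetical.

```python
def deduplicate(records, key):
    """Keep the first record seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


customers = [
    {"email": "a@example.com", "name": "Ann"},
    {"email": "b@example.com", "name": "Bob"},
    {"email": "a@example.com", "name": "Ann (duplicate)"},
]
unique_customers = deduplicate(customers, key="email")
```

Which duplicate to keep (first, last, most complete) is a business decision; this sketch simply keeps the first.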
Dagster Glossary code icon

Denoise

Remove noise or artifacts from data to improve its accuracy and quality.
Dagster Glossary code icon

Denormalize

Optimize data for faster read access by reducing the number of joins needed to retrieve related data.
Dagster Glossary code icon

Derive

Extracting, transforming, and generating new data from existing datasets.

Deserialize

Deserialization is essentially the reverse process of serialization. See: 'Serialize'.
Dagster Glossary code icon

Dimensionality

Analyzing the number of features or attributes in the data to improve performance.
Dagster Glossary code icon

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.
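
As an illustration, the sketch below bins a continuous age value into labeled ranges using `bisect`. The bin edges and labels are assumptions chosen for the example.

```python
from bisect import bisect_right

# Hypothetical bin edges: each edge is the lower bound of the next bin.
BIN_EDGES = [18, 35, 65]
LABELS = ["minor", "young adult", "adult", "senior"]


def discretize(age):
    """Map a continuous age to one of the discrete labels above."""
    return LABELS[bisect_right(BIN_EDGES, age)]


binned = [discretize(age) for age in (17, 18, 35, 70)]
```

Because `bisect_right` is used, a value equal to an edge falls into the bin that starts at that edge.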
Dagster Glossary code icon

Downsample

Reduce the amount of data for analysis, storage, or processing.
Dagster Glossary code icon

ETL

Extract, transform, and load data between different systems.
Dagster Glossary code icon

Encapsulate

The bundling of data with the methods that operate on that data.
Dagster Glossary code icon

Encode

Convert categorical variables into numerical representations for ML algorithms.
Dagster Glossary code icon

Enrich

Enhance data with additional information from external sources.
Dagster Glossary code icon

Explore

Understand the data, identify patterns, and gain insights.

Export

Extract data from a system for use in another system or application.
Dagster Glossary code icon

Extrapolate

Predict values outside a known range, based on the trends or patterns identified within the available data.

Fan-Out

A pipeline design in which one operation is broken into, or results in, many parallel downstream tasks.
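
A minimal sketch of fan-out with a thread pool: one input is split into chunks, and each chunk is handled by a parallel task. The helpers `process_chunk` and `fan_out`, the chunk size, and the worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Stand-in for a downstream task; here it just sums the chunk."""
    return sum(chunk)


def fan_out(data, chunk_size):
    """Split `data` into chunks and process each chunk in parallel."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(process_chunk, chunks))


partial_sums = fan_out(list(range(10)), chunk_size=3)
```

Combining the partial results back into one value (here, `sum(partial_sums)`) would be the matching fan-in step.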
Dagster Glossary code icon

Feature Extraction

Identify and extract relevant features from raw data for use in analysis or modeling.
Dagster Glossary code icon

Feature Selection

Identify and select the most relevant and informative features for analysis or modeling.
Dagster Glossary code icon

Filter

Extract a subset of data based on specific criteria or conditions.
Dagster Glossary code icon

Fragment

Break data down into smaller chunks for storage and management purposes.
Dagster Glossary code icon

Geospatial Analysis

Analyze data that has geographic or spatial components to identify patterns and relationships.
Dagster Glossary code icon

Graph Theory

A powerful tool to model and understand intricate relationships within our data systems.
Dagster Glossary code icon

Hash

Convert data into a fixed-length code to improve data security and integrity.
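
A minimal sketch using SHA-256 from the standard-library `hashlib` module. The helper name `sha256_hex` is an assumption, and SHA-256 is only one of several possible hash functions.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


digest_a = sha256_hex(b"order-1001")
digest_b = sha256_hex(b"order-1002")
```

The digest length is fixed regardless of input size, the same input always yields the same digest, and even a one-character change produces a different digest, which is what makes hashes useful for integrity checks.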
Dagster Glossary code icon

Homogenize

Make data uniform, consistent, and comparable.
Dagster Glossary code icon

Idempotent

An operation that produces the same result no matter how many times it is performed with the same input.
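
The contrast below is a small sketch: setting a key to a value can be repeated safely, while appending to a log cannot. Both helper functions and the sample data are hypothetical.

```python
def set_status(store, order_id, status):
    """Idempotent: repeating the call leaves the store unchanged."""
    store[order_id] = status
    return store


def append_event(log, event):
    """Not idempotent: every call grows the log."""
    log.append(event)
    return log


store = {}
set_status(store, "order-1", "shipped")
set_status(store, "order-1", "shipped")  # a retry has no extra effect

log = []
append_event(log, "shipped")
append_event(log, "shipped")  # a retry creates a duplicate entry
```

This is why idempotent writes matter for pipelines that retry failed steps: a rerun should not change the outcome.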
Dagster Glossary code icon

Impute

Fill in missing data values with estimated or imputed values to facilitate analysis.
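
Mean imputation is one of the simplest strategies. The sketch below assumes missing values are represented as `None`; the function name `impute_mean` is hypothetical, and the mean is not always an appropriate estimate.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


filled = impute_mean([1, None, 3, None])
```

Here the two missing entries are replaced by 2, the mean of the observed values 1 and 3.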
Dagster Glossary code icon

Index

Create an optimized data structure for fast search and retrieval.
Dagster Glossary code icon

Ingest

The initial collection and import of data from various sources into your processing environment.
Dagster Glossary code icon

Integrate

Combine data from different sources to create a unified view for analysis or reporting.
Dagster Glossary code icon

Interpolate

Use known data values to estimate unknown data values.
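
Linear interpolation is the simplest case: estimate a value on the straight line between two known points. The function name `lerp` and the sample points are assumptions for illustration.

```python
def lerp(x, x0, y0, x1, y1):
    """Estimate y at x on the line through (x0, y0) and (x1, y1)."""
    if x1 == x0:
        raise ValueError("x0 and x1 must differ")
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known points: (1, 10) and (3, 30). Estimate the value at x = 2.
estimate = lerp(2, 1, 10, 3, 30)
```

The estimate at the midpoint is 20.0, halfway between the two known values.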
Dagster Glossary code icon

Lineage

Understand how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.
Dagster Glossary code icon

Linearizability

Ensure that each individual operation on a distributed system appears to occur instantaneously.
Dagster Glossary code icon

Linearize

Transforming the relationship between variables to make datasets approximately linear.
Dagster Glossary code icon

Load

Insert data into a database, a data warehouse, or your pipeline for processing.
Dagster Glossary code icon

Mask

Obfuscate sensitive data to protect its privacy and security.
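
A minimal sketch of masking that hides all but the last few characters of a value. The helper `mask_value`, the choice of four visible characters, and the sample card number are assumptions.

```python
def mask_value(value, visible=4, mask_char="*"):
    """Hide all but the last `visible` characters of `value`."""
    text = str(value)
    if visible <= 0:
        return mask_char * len(text)
    hidden = max(len(text) - visible, 0)
    return mask_char * hidden + text[hidden:]


masked = mask_value("4111111111111111")  # well-known test card number
```

Note that values no longer than `visible` pass through unmasked in this sketch, so a real implementation would need a policy for short inputs.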
Dagster Glossary code icon

Materialize

Executing a computation and persisting the results into storage.
Dagster Glossary code icon

Memoize

Store the results of expensive function calls and reuse them when the same inputs occur again.
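
In Python, `functools.lru_cache` offers a ready-made way to memoize a function. In the sketch below, `slow_square` and the call counter are hypothetical and exist only to show that the second call is served from the cache.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def slow_square(n):
    """Stand-in for an expensive computation."""
    global call_count
    call_count += 1
    return n * n


first = slow_square(12)   # computed
second = slow_square(12)  # returned from the cache
```

Memoization is only safe for functions whose result depends solely on their arguments, and the arguments must be hashable for `lru_cache` to work.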
Dagster Glossary code icon

Merge

Combine data from multiple datasets into a single dataset.
Dagster Glossary code icon

Mine

Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.
Dagster Glossary code icon

Model

Create a conceptual representation of data objects.

Monitor

Track data processing metrics and system health to ensure high availability and performance.
Dagster Glossary code icon

Multiprocessing

Optimize execution time with multiple parallel processes.

Munge

See 'wrangle'.
Dagster Glossary code icon

Named Entity Recognition

Locate and classify named entities in text into pre-defined categories.
Dagster Glossary code icon

NoSQL

Non-relational databases designed for scalability, schema flexibility, and optimized performance in specific use-cases.
Dagster Glossary code icon

Normality Testing

Assess the normality of data distributions to ensure validity and reliability of statistical analysis.
Dagster Glossary code icon

Normalize

Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
Dagster Glossary code icon

Obfuscate

Make data unintelligible or difficult to understand.
Dagster Glossary code icon

Parallelize

Boost execution speed of large data processing by breaking the task into many smaller concurrent tasks.
Dagster Glossary code icon

Parse

Interpret and convert data from one format to another.
Dagster Glossary code icon

Partition

Data partitioning is a technique that data engineers and ML engineers use to divide data into smaller subsets for improved performance.
Dagster Glossary code icon

Pickle

Convert a Python object into a byte stream for efficient storage.
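
A minimal round trip with the standard-library `pickle` module. The sample record is made up.

```python
import pickle

record = {"id": 7, "tags": ["a", "b"], "score": 0.5}

payload = pickle.dumps(record)    # Python object -> bytes
restored = pickle.loads(payload)  # bytes -> equivalent Python object
```

Only unpickle data from sources you trust: loading a pickle can execute arbitrary code.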

Pre-aggregate

See 'aggregate'.
Dagster Glossary code icon

Prep

Transform your data so it is fit-for-purpose.
Dagster Glossary code icon

Preprocess

Transform raw data before data analysis or machine learning modeling.
Dagster Glossary code icon

Profile

Generate statistical summaries and distributions of data to understand its characteristics.
Dagster Glossary code icon

Purge

Delete data that is no longer needed or relevant to free up storage space.
Dagster Glossary code icon

Rebalance

Redistributing data across nodes or partitions for optimal performance.
Dagster Glossary code icon

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.
Dagster Glossary code icon

Repartition

Redistribute data across multiple partitions for improved parallelism and performance.
Dagster Glossary code icon

Replicate

Create a copy of data for redundancy or distributed processing.
Dagster Glossary code icon

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.
Dagster Glossary code icon

Sample

Extract a subset of data for exploratory analysis or to reduce computational complexity.

Scaling

Increasing the capacity or performance of a system to handle more data or traffic.
Dagster Glossary code icon

Schema Inference

Automatically identify the structure of a dataset.
Dagster Glossary code icon

Schema Mapping

Translate data from one schema or structure to another to facilitate data integration.
Dagster Glossary code icon

Scrape

Extract data from a website or another source.
Dagster Glossary code icon

Secondary Index

Improve the efficiency of data retrieval in a database or storage system.
Dagster Glossary code icon

Secure

Protect data from unauthorized access, modification, or destruction.
Dagster Glossary code icon

Sentiment Analysis

Analyze text data to identify and categorize the emotional tone or sentiment expressed.
Dagster Glossary code icon

Serialize

Convert data into a linear format for efficient storage and processing.
Dagster Glossary code icon

Shard

Partitioning a database into smaller, more manageable pieces.
Dagster Glossary code icon

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.
Dagster Glossary code icon

Shuffle

Randomize the order of data records to improve analysis and prevent bias.
Dagster Glossary code icon

Skew

An imbalance in the distribution or representation of data.
Dagster Glossary code icon

Software-defined Asset

A declarative design pattern that represents a data asset through code.
Dagster Glossary code icon

Spill

Temporarily transfer data that exceeds available memory to disk.
Dagster Glossary code icon

Split

Divide a dataset into training, validation, and testing sets for machine learning model training.
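
A minimal sketch of a shuffled three-way split using only the standard library. The function name, the 80/10/10 proportions, and the fixed seed are assumptions; ML libraries typically provide their own split utilities.

```python
import random


def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle `rows` reproducibly and split into train, validation, test."""
    rows = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
```

Shuffling before splitting avoids ordering bias, and fixing the seed makes the split repeatable across runs.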
Dagster Glossary code icon

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.
Dagster Glossary code icon

Stored Procedure

Precompiled and stored SQL statements and procedural logic for easy database operations and complex data manipulations.

Synchronize

Ensure that data in different systems or databases are in sync and up-to-date.
Dagster Glossary code icon

Thread

Enable concurrent execution in Python by decoupling tasks which are not sequentially dependent.
Dagster Glossary code icon

Time Series Analysis

Analyze data over time to identify trends, patterns, and relationships.
Dagster Glossary code icon

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.
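
For text, a very simple tokenizer lowercases the input and extracts runs of word characters. The regex-based helper below is an illustrative assumption; NLP libraries use far more sophisticated tokenizers.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


tokens = tokenize("Data pipelines, simplified!")
```

The sample sentence yields three tokens: "data", "pipelines", and "simplified".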

Transform

Convert data from one format or structure to another.

Unstructured Data Analysis

Analyze unstructured data, such as text or images, to extract insights and meaning.
Dagster Glossary code icon

Upsert

Update a record or insert a new record if it does not yet exist.
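
A minimal sketch of an upsert against an in-memory SQLite database. The `users` table and the helper `upsert_user` are hypothetical, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer; other databases use different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")


def upsert_user(user_id, email):
    """Insert the user, or update the email if the id already exists."""
    conn.execute(
        "INSERT INTO users (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        (user_id, email),
    )


upsert_user(1, "a@example.com")  # no row with id 1 yet: inserts
upsert_user(1, "b@example.com")  # row exists: updates in place
rows = conn.execute("SELECT id, email FROM users").fetchall()
```

After both calls the table holds a single row for id 1 with the most recent email.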
Dagster Glossary code icon

Validate

Check data for completeness, accuracy, and consistency.
Dagster Glossary code icon

Vectorize

Executing a single operation on multiple data points simultaneously.
Dagster Glossary code icon

Version

Maintain a history of changes to data for auditing and tracking purposes.
Dagster Glossary code icon

Wrangle

Convert unstructured data into a structured format.

View the Index

Can't find what you are looking for? Check the complete Index here →

About the artwork.

The art you see throughout the glossary was generated thanks to Midjourney and curated by the Dagster Labs team. It was inspired by some of the great artists of the 20th century (and some from earlier periods). See if you can recognize the ‘work’ of Marcel Duchamp, Frederic Remington, Keith Haring, Claes Oldenburg, Roy Lichtenstein, Wassily Kandinsky, and others.

Left: Daggy, as seen by René Magritte.
