Data Engineering Index of Terms

Data Engineering Terms Explained

Terms and Definitions You Need to Know as a Data Engineer

A/B Testing

A statistical hypothesis test for a randomized experiment with two variants, A and B, used to compare two models or strategies and determine which performs better.

ACID Properties

The set of properties of database transactions intended to guarantee validity even in the event of errors or failures, encompassing Atomicity, Consistency, Isolation, and Durability.

Aggregation

Combining data from multiple sources into a single dataset.

Agile Methodology

An iterative approach to software development and project management that prioritizes flexibility and customer satisfaction, often used by data engineering teams to manage projects.

Alation

A machine learning data catalog that helps people find, understand, and trust the data.

Aligning

Aligning data can mean one of three things: matching corresponding records across datasets, conforming data to business rules, or arranging data elements on memory boundaries.

Amazon DynamoDB

A managed NoSQL database service provided by Amazon Web Services.

Amazon Kinesis

A platform to stream data on AWS, offering powerful services to make it easy to load and analyze streaming data.

Amazon Redshift

A fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.

Amazon Web Services (AWS)

Amazon's cloud platform, offering a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, and more.

Annotation

The process of adding metadata or explanatory notes to data, often used in machine learning to create labeled data for training models.

Anomaly Detection

The identification of items, events, or observations which do not conform to an expected pattern or other items in a dataset, crucial in fraud detection, network security, and fault detection.

Anonymize

Remove personal or identifying information from data.

Apache Airflow

A platform to programmatically author, schedule, and monitor workflows of tasks.

Apache Arrow

A cross-language development platform for in-memory data that specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Apache Atlas

A scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop.

Apache Camel

An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.

Apache Flink

A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Apache Hadoop

A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Apache Kafka

A distributed streaming platform capable of handling trillions of events a day.

Apache NiFi

A tool designed to automate the flow of data between software systems.

Apache Pulsar

A highly scalable, low-latency messaging platform running on commodity hardware.

Apache Samza

A stream processing framework for running applications that process data as it is created.

Apache Spark

A fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python, and R.

Apache Storm

A free and open-source distributed real-time computation system.

API (Application Programming Interface)

A set of rules and definitions that allow different software entities to communicate with each other.

Append

The process of adding new, updated, or corrected information to an existing database or list.

Archive

Move rarely accessed data to a low-cost, long-term storage solution to reduce costs and to retain it for compliance.

Argo

An open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes.

Association Rule Mining

A machine learning method aimed at identifying interesting relations between variables (items or events) in large databases, frequently used for market basket analysis.

Asyncio

A Python library for asynchronous I/O. It is built around the coroutines of Python and provides tools to manage them and handle the I/O in an efficient way.
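
A minimal sketch of the idea: two coroutines are scheduled concurrently on the event loop, with asyncio.sleep standing in for real non-blocking I/O; the fetch coroutine and its arguments are illustrative only.

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Simulate a non-blocking I/O call (e.g., an HTTP request).
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Run both coroutines concurrently and collect their results.
    results = await asyncio.gather(fetch("job-a", 1.0), fetch("job-b", 0.5))
    print(results)

asyncio.run(main())
```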

Augment

The technique of increasing the diversity of your training dataset by modifying the existing data points, often used in training deep learning models to improve model generalization.

Augmented Data Management

The use of AI and ML technologies to optimize and enhance data management tasks, improving data quality and metadata development.

Automated Machine Learning (AutoML)

The process of automating the end-to-end process of applying machine learning to real-world problems, facilitating the development of ML models by experts and non-experts alike.

Avro

A compact, fast binary serialization format developed within the Apache Hadoop project, suitable for serializing large amounts of data. It uses JSON to define data types and protocols, and serializes data in a compact binary format.

AWS Step Functions

A service that enables you to coordinate AWS components, applications, and microservices using visual workflows.

Backend-as-a-Service (BaaS)

A cloud computing service model that serves as the middleware that provides developers with ways to connect their web and mobile applications to cloud services via application programming interfaces (APIs) and software developers' kits (SDKs).

Backpressure

A mechanism to handle situations where data is produced faster than it can be consumed.

Backup

Create a copy of data to protect against loss or corruption.

Batch Processing

The processing of data in a batch or group where the entire batch is processed before any individual item in the batch is considered processed.

Big Data

Refers to extremely large datasets that can be analyzed for patterns, trends, and associations, typically involving varied and complex structures. What constitutes 'big' is debated, but a rule of thumb is a volume of data that cannot be analyzed on a single machine.

Big Data Processing

Process large volumes of data in parallel and distributed computing environments to improve performance.

Big O Notation

A mathematical notation used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, primarily used to classify algorithms by how they respond to changes in input size.

Binary Tree

A tree data structure in which each node has at most two children, referred to as the left child and the right child.
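
For illustration, a small sketch of a binary search tree variant, where each node keeps smaller values in its left subtree and larger values in its right subtree; the Node and insert names are made up for this example.

```python
class Node:
    # A binary tree node with at most two children.
    def __init__(self, value):
        self.value = value
        self.left = None   # left child
        self.right = None  # right child

def insert(root, value):
    # Binary search tree insertion: smaller values go left, larger go right.
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)
```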

Bitwise Operation

Operations that manipulate one or more bits at the level of their individual binary representation.
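
A quick illustration of the common bitwise operators on small integer values (the literals here are arbitrary):

```python
a, b = 0b1100, 0b1010

print(bin(a & b))   # AND        -> 0b1000
print(bin(a | b))   # OR         -> 0b1110
print(bin(a ^ b))   # XOR        -> 0b110
print(bin(a << 1))  # left shift -> 0b11000
```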

Blend

A term coined by data analytics vendors to describe the process of combining data from multiple sources to create a cohesive, unified dataset. Typically used in the context of data analysis and business intelligence.

Blockchain

A system of recording information in a way that makes it difficult or impossible to change, hack, or cheat the system. A blockchain is a digital ledger of transactions that is duplicated and distributed across the entire network of computer systems on the blockchain.

Broadcast

A method in parallel computing where data is sent from one point (a root node) to all other nodes in the topology.

Broadcasting

A method in distributed computing to send the same message to all nodes in a network.

BSON (Binary JSON)

A binary-encoded serialization of JSON-like documents used to store documents and make remote procedure calls in MongoDB. BSON supports embedded documents and arrays, offering additional data types not supported by JSON.

Bucketing

A method for dividing a dataset into discrete buckets or bins to separate it into roughly equal parts based on some characteristic.

Bulk Extract

The process of extracting large amounts of data from a database in a single transaction.

Business Intelligence (BI)

A set of strategies and technologies used by enterprises for the data analysis of business information, helping companies make more informed business decisions.

Cache Invalidation

A process in a computing system where entries in a cache are replaced or removed due to change in the underlying data.

Caching

The process of storing copies of files in a cache, or temporary storage location, so that they can be accessed more quickly.

Callback

A piece of executable code that is passed as an argument to other code and is expected to execute at a given time.
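
A minimal sketch: process accepts a caller-supplied function and invokes it once the work is finished; the function names and data are hypothetical.

```python
def on_complete(result):
    # Callback invoked once the work finishes.
    print(f"finished with {result}")

def process(data, callback):
    # Do some work, then hand the result to the caller-supplied callback.
    total = sum(data)
    callback(total)

process([1, 2, 3], on_complete)  # prints "finished with 6"
```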

CAP Theorem

A theorem in computer science stating that it is impossible for a distributed system to simultaneously provide more than two of the three guarantees: Consistency, Availability, and Partition Tolerance.

Cap’n Proto

A data interchange format similar to Protobuf, but faster. Instead of parsing the data and then unpacking it, the data is directly accessed in the binary form in which it is stored, reducing processing time.

Capacity Planning

The process used to determine how much hardware and software is required to meet future workload demands.

Cassandra

A highly scalable NoSQL database designed to handle large amounts of data.

Categorical Data

A type of data that can take on one of a limited and usually fixed number of possible values, representing the membership of an object in a group, such as ‘male’ or ‘female’.

Categorize

Organizing and classifying data into different categories, groups, or segments.

Causal Inference

A process used to make conclusions about one variable’s effect on another, critical in understanding relationships in data and making informed decisions based on those relationships.

CBOR (Concise Binary Object Representation)

A binary format encoding data in a more efficient and compact manner than JSON. It is designed to efficiently serialize and deserialize complex data structures without losing the schema-free property of JSON.

Chaining

Linking two or more computing tasks together so that, as soon as one task is finished, the next task immediately begins.

Character Encoding

A method used to represent a repertoire of characters by some kind of encoding system, e.g., ASCII or UTF-8.

Checkpoint

A snapshot of the state of a system at a specific point in time, usually used to recover from failures.

Checkpointing

The process of saving the state of a system at specific points, so it can be returned to that state in case of failure.

Circular Dependency

A relation between two or more modules which either directly or indirectly depend on each other to function properly.

Class Variable

A variable that is shared by all instances of a class, belonging to the class rather than any object instance.

Class-Method

A method that is bound to the class and not the instance of the class.
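
A small sketch illustrating both of the previous two entries: pool_size is a class variable shared by all instances, while from_url is a class method bound to the class and used here as an alternative constructor (the Connection class and URL are invented for the example).

```python
class Connection:
    # Class variable: shared by every instance of Connection.
    pool_size = 10

    def __init__(self, host):
        self.host = host  # instance variable, unique per object

    @classmethod
    def from_url(cls, url):
        # Class method: receives the class itself (cls), not an instance.
        host = url.split("://", 1)[-1]
        return cls(host)

conn = Connection.from_url("postgres://db.example.com")
print(Connection.pool_size, conn.host)  # 10 db.example.com
```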

Classify

The process of organizing data by relevant categories for efficient use and secure data management.

Clean Code

Code that is easy to understand and easy to change, adhering to good programming principles and practices.

Clean or Cleanse

The process of identifying and correcting (or removing) errors and inconsistencies in datasets to improve their quality.

Cloud Computing

The delivery of various services over the Internet, such as storage, processing, and networking resources.

Cloudera

A provider of software for data engineering, data warehousing, machine learning, and analytics.

Cluster

Group data points based on similarities or patterns to facilitate analysis and modeling.

Cluster Analysis

A group of algorithms used to categorize data into groups, or clusters, where objects in the same group are more similar to each other than to those in other groups.

Coalesce

A SQL function that returns the first non-null value in a list.
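
A small demonstration using Python's built-in sqlite3 module (the table and column names are made up): COALESCE falls back to the next value in the list whenever the preceding one is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, nickname TEXT)")
conn.execute("INSERT INTO users VALUES ('Ada', NULL), ('Grace', 'Gracie')")

# COALESCE returns the first non-null value: the nickname if present, else the name.
for (display_name,) in conn.execute("SELECT COALESCE(nickname, name) FROM users"):
    print(display_name)  # Ada, then Gracie
```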

Cold storage

A storage strategy for data that is accessed infrequently and is primarily for archival purposes, offering cost-efficiency at the expense of retrieval speed.

Columnar Database

A database optimized for reading and writing columns of data as opposed to rows of data, often used for analytics and reporting.

Combinatorial Explosion

A phenomenon in computer science where the number of possible solutions or combinations in a problem grows exponentially with the size of the problem.

Command-Line Interface (CLI)

A text-based user interface used to interact with software by entering commands into the interface.

Comment

A programming language feature allowing the insertion of human-readable descriptions or annotations in the source code.

Commit

The act of saving changes in a database, version control system, or transactional system, making them permanent.

Common Gateway Interface (CGI)

A standard protocol for web servers to execute programs and generate dynamic content, often used for form processing.

Compilation

The process of translating a high-level programming language into machine language or bytecode that can be executed by a computer’s CPU.

Compound Key

A key that consists of multiple attributes to uniquely identify an entity in a database.

Compress

Reduce the size of data, typically to save storage space, speed up transmission, or improve processing performance.

Compression

The process of reducing the size of data, usually to save space or speed up transmission over networks.

Computed Column

A virtual column in a database table that is based on a calculation or expression using other columns in the table.

Concurrency Control

Techniques to manage simultaneous operations in a database system, ensuring consistency and resolving conflicts.

Concurrent Processing

A computing concept where several tasks are executed during overlapping time periods, enabling more efficient use of computing resources.

Configuration File

A file used to configure the initial settings of software programs, usually written in XML, JSON, or YAML.

Configuration Management

The process of systematically managing, organizing, and controlling the changes in the documents, codes, and other entities during the development process.

Connection Pool

A cache of database connections maintained to be reused by future requests, reducing the overhead of opening and closing connections.

Consensus Algorithm

A process used in computer science to achieve agreement on a single data value among distributed processes or systems.

Consolidate

Combine multiple datasets into one to create a more comprehensive view of the data.

Container

A lightweight, stand-alone, and executable software package that includes everything needed to run a piece of software, including the code, runtime, and system libraries.

Containerization

The practice of packaging software with everything needed to run it, including the code, runtime, system tools, and libraries, so that it runs reliably across computing environments.

Continuous Delivery

A software development discipline where software is built in such a way that it can be released to production at any time.

Continuous Deployment (CD)

A software engineering approach in which software functionalities are delivered and deployed continuously and automatically into production, after passing a series of automated tests.

Continuous Integration (CI)

A development practice where developers integrate code into a shared repository frequently, ideally several times a day, to detect errors quickly.

Control Flow

The order in which individual statements, instructions, or function calls are executed within a program.

Convergence

The state where different nodes (or systems) update their internal states to a common value, usually used in the context of iterative algorithms and distributed systems.

Convolutional Neural Network (CNN)

A class of deep learning neural networks, most commonly applied to analyzing visual imagery, used in image recognition and classification tasks.

Covariance

A statistical measure that indicates the extent to which two variables change together.

Crash Recovery

The process by which an operating system or application restarts operation after a crash, possibly recovering lost data.

CRON

A time-based job scheduler in Unix-like computer operating systems for scheduling periodic jobs at fixed times, dates, or intervals.

Cron Job

A scheduled task in Unix-based operating systems, used to automate repetitive tasks.

Cross-Join

A SQL join that returns the Cartesian product of the joined tables, meaning every row of the first table is combined with every row of the second table.

Cross-Validation

A statistical method used to estimate the skill of machine learning models, primarily used in applied machine learning to assess a predictive modeling algorithm's performance when there is no separate test dataset available.

Cryptography

The practice and study of techniques for securing communication and data from third parties or the public.

CSV (Comma Separated Values)

A simple, plain-text file format used to store tabular data, where each line represents a data record, and each record consists of one or more fields, separated by commas. Suitable for a wide range of applications due to its simplicity, but lacks a standard schema, which can lead to parsing errors.
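
A short sketch of parsing CSV text with Python's standard csv module; the sample rows are invented, and DictReader keys each record by the header row.

```python
import csv
import io

raw = "id,name,city\n1,Ada,London\n2,Grace,New York\n"

# DictReader parses each line into a dict keyed by the header row.
for record in csv.DictReader(io.StringIO(raw)):
    print(record["id"], record["city"])
```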

Curate

Select, organize and annotate data to make it more useful for analysis and modeling.

CURL

A command-line tool and library for transferring data with URLs, supporting various protocols like HTTP, FTP, and more.

Cursor

A database object used to traverse the results of a SQL query, allowing individual rows to be accessed.

Cybersecurity

The practice of protecting systems, networks, and programs from digital attacks aimed at accessing, changing, or destroying sensitive information.

Dagster

An open-source solution for defining, building, and managing critical data assets.

Data Aggregation

The process of gathering and summarizing information in a specified form, often used in statistical analysis.

Data Allocation

The assignment of storage space to specific data, often in the context of distributed databases where data is allocated across multiple nodes.

Data Analytics

The science of analyzing raw data to make conclusions about that information.

Data Annotation

The process of adding explanatory notes or comments to data, often used in the context of machine learning to create labeled training data.

Data Append

The process of adding new, updated, or corrected information to an existing database or list.

Data Architecture

The overall structure, organization, and rules used to manage and use data within an organization, including the arrangement of data and data processing.

Data Block

The smallest unit of data storage in a database, storing a set of rows or a subset of a table's columns.

Data Catalog

A centralized repository that allows for the management, collaboration, discovery, and consumption of organizational datasets, serving as a metadata inventory.

Data Degradation

The gradual loss or deterioration of data quality over time.

Data Dictionary

A collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them.

Data Drift

A phenomenon where the statistical properties of incoming data change over time, potentially impacting model performance and accuracy.

Data Fabric

A unified architecture that provides a consistent and coherent set of capabilities and services across different environments.

Data Federation

The process of aggregating data from different sources to create a single, unified view.

Data Fusion

The process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.

Data Governance

The overall management of the availability, usability, integrity, and security of data employed in an enterprise, involving a set of practices and policies.

Data Lake

A centralized storage repository that allows storing structured and unstructured data at any scale, usually used for big data and real-time analytics.

Data Lakehouse

A modern data architecture that combines the best elements of data lakes and data warehouses, enabling efficient handling of both structured and unstructured data.

Data Lifecycle

The journey that data goes through from creation and initial storage to the time it becomes obsolete and is deleted.

Data Lifecycle Management

The process of managing the flow of data throughout its lifecycle from creation and initial storage to the time it is archived or deleted.

Data Lineage

The visualization of the flow and transformation of data as it moves through the various stages of a data pipeline, crucial for understanding and maintaining complex data systems.

Data Mart

A subset of a data warehouse that is designed for a specific line of business or department within an organization.

Data Marts

Subsets of data warehouses designed to provide data for specific business lines or departments.

Data Mesh

A decentralized approach to data architecture and organizational structure that treats data as a product and emphasizes domain-oriented decentralized data ownership and architecture.

Data Ops

An automated, process-oriented methodology used to improve the quality and reduce the cycle time of data analytics.

Data Pipeline

A series of data processing steps involved in the flow of data from the source to its final destination, usually used in the context of ETL and data integration.

Data Provenance

Information that helps to trace the origins, processing, and use of data, helping to determine the quality and reliability of the dataset.

Data Quality

A comprehensive way of maintaining the accuracy, reliability, and consistency of data over its entire life cycle.

Data Redundancy

The existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data.

Data Reservoir

An expansive storage repository that allows for the integration and storage of data from various sources in its native format.

Data Silo

A repository of data isolated or segregated from other parts of the organization's data system.

Data Stewardship

Responsible management and oversight of an organization's data to help provide business users with high-quality data.

Data Vault Modeling

A database modeling method specifically designed for top-down data warehouses with a focus on long-term historical storage, traceability, and scalability.

Data Volume

The amount of data available for analysis, usually referred to in the context of Big Data.

Data Warehouse

A central repository of integrated data from disparate sources, used to store and manage large volumes of historical data and enable fast, complex queries across all the consolidated data.

Database Indexing

The use of special data structures that improve the speed of operations in a table, such as search, filter, and sort.

Database Management System (DBMS)

A software package designed to define, manipulate, retrieve, and manage data in a database.

Database Mirroring

A technique used to increase data availability by maintaining two copies of a single database that must reside on different server instances of SQL Server Database Engine.

Database Normalization

A systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like insertion, update, and deletion anomalies.

Database Schema

The structure or blueprint of a database that outlines the way data is organized and how the data entities relate to one another.

De-identify

Remove personally identifiable information (PII) from data to protect privacy and comply with regulations.

Deadlock

A condition where two or more database transactions are unable to proceed because each is waiting for the other to release a lock, leading to a cyclic waiting condition.

Decision Tree

A tree-like model of decisions used to make predictions, especially in machine learning algorithms.

Deduplicate

Identify and eliminate redundant copies of data or duplicate records to improve data quality and reduce storage overhead.

Deep Learning

A subset of machine learning that utilizes neural networks with many layers (hence “deep”) to analyze various factors of data and to learn and make intelligent decisions.

Delta Lake

An open-source storage layer that brings reliability to data lakes, ensuring ACID transactions, scalable metadata handling, and unifying streaming and batch data processing.

Denoise

Remove noise or artifacts from data to improve its accuracy and quality.

Denormalize

The process of attempting to optimize the performance of a database by adding redundant data or by grouping data.

Dependency Parsing

A Natural Language Processing (NLP) technique to analyze the grammatical structure of a sentence to establish relationships between words.

Derive

Extracting, transforming, and generating new data from existing datasets.

Deserialize

Deserialization is essentially the reverse process of serialization. See: 'Serialize'.

DevOps

A set of practices that combines software development (Dev) and IT operations (Ops), aiming to shorten the systems development life cycle and provide continuous delivery.

Differential Privacy

A system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.

Dimension Table

A table in a star schema of a data warehouse that stores categorical, descriptive, hierarchical, or textual attributes of data.

Dimensional Modeling

A design technique used in data warehousing to map and visualize data in a way that’s intuitive to business users, typically using facts and dimensions.

Dimensionality

The number of features or attributes in a dataset; high dimensionality can degrade performance and often motivates dimensionality reduction.

Dimensionality Reduction

The process of reducing the number of random variables under consideration by obtaining a set of principal variables, crucial for dealing with the “curse of dimensionality” in high-dimensional spaces.

DAG (Directed Acyclic Graph)

A finite directed graph with no directed cycles, used extensively in representing data flow in data processing systems like Apache Airflow.
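
As one way to see why the acyclic structure matters, the sketch below orders a handful of hypothetical pipeline tasks with a topological sort, the kind of scheduling a workflow engine performs over a DAG (it uses graphlib from the Python 3.9+ standard library; the task names are made up).

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields the tasks in a valid execution order.
print(list(TopologicalSorter(dag).static_order()))
# ['extract', 'transform', 'load', 'report']
```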

Discretize

Transform continuous data into discrete categories or bins to simplify analysis.

Distributed Computing

A model in which components located on networked computers communicate and coordinate their actions by passing messages to achieve a common goal, crucial for handling large-scale data and computation.

Distributed Ledger Technology

A decentralized database managed by multiple participants, across multiple nodes.

Distributed Ledger Technology (DLT)

A digital system for recording the transaction of assets wherein transactions and their details are recorded in multiple places at the same time, the most common form being blockchain technology.

Distributed System

A system where components located on networked computers communicate and coordinate their actions by passing messages.

Docker

A platform used to develop, ship, and run applications inside containers, promoting software reliability and scalability.

Document Store Database

A type of NoSQL database designed to store, manage, and retrieve document-oriented information, also known as semi-structured data.

Domain-Driven Design (DDD)

An approach to software development that centers the design and development process on the business domain, ensuring that the software solves real business problems.

Downsample

The process of reducing the amount of data in a dataset, primarily by reducing the number of points in the data or reducing the precision of the data.

Drift Detection

Identifying when the statistical properties of the target variable, which the model is trying to predict, change.

Dynamic Data

Data that change frequently and are usually generated in real-time, such as stock prices or sensor data.

Eager Execution

A programming environment that evaluates operations immediately, instead of building graphs to run later, typically used in TensorFlow for debugging and interactive development.

Early Stopping

A form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent, by stopping the training process before it completes all iterations.

Edge Computing

A distributed computing paradigm that brings computation and data storage closer to the sources of data generation, improving response times and saving bandwidth.

Elasticity

The ability of a system to efficiently allocate resources to meet demand and then deallocate resources when they are no longer needed.

Elasticsearch

A search engine based on the Lucene library, providing a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

Embedded Analytics

The integration of analytical capabilities and content within the business process applications.

Embedding Layer

A layer within a neural network that learns to map the input data (such as words in text) into fixed-size dense vectors of continuous values, usually as the first layer in a network processing sequential or textual data.

Encode

Convert categorical variables into numerical representations for ML algorithms.

Encrypt

The process of converting data into a code to prevent unauthorized access.

Enrich

The process of enhancing, refining, and improving raw data by adding information to it.

Ensemble Learning

A technique used in machine learning that combines several models to solve a single predictive problem, enhancing the performance and robustness of the model.

Entity Resolution

The process of identifying and linking mentions of the same entity across different data sources, critical for creating a unified view of entities from disparate data sources.

Entity-Relationship Model

A data model for describing a database in an abstract way, using entities, relationships, and attributes.

Ephemeral Storage

Temporary storage that is provisioned for a short period of time and is deleted when the instance using it is terminated.

ETL (Extract, Transform, Load)

A type of data integration that refers to the three steps used to blend data from multiple sources.
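
A minimal, illustrative sketch of the three steps using only the standard library; the sales.csv file, its columns, and the sales table are all hypothetical.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw records from a CSV source.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and reshape, e.g. tidy names and cast types.
    return [(r["id"], r["name"].strip().title(), float(r["amount"])) for r in rows]

def load(records, db):
    # Load: write the transformed records into the target database.
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id TEXT, name TEXT, amount REAL)")
load(transform(extract("sales.csv")), db)
```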

ETL Testing

The process of validating, verifying, and qualifying data while preventing duplicate records and data loss, conducted during the ETL process.

Event-driven Architecture

A software architecture paradigm promoting the production, detection, consumption of, and reaction to events.

Evolutionary Algorithm

A subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm used to find approximate solutions to optimization and search problems.

Exabyte

A unit of information or computer storage equal to one quintillion bytes (1 billion gigabytes).

Exascale Computing

Computing systems capable of at least one exaFLOP, or a billion billion calculations per second, representing a thousandfold increase over petascale.

Explainable AI (XAI)

An area in AI that develops methods and techniques to help human users understand and trust the output and operations of machine learning models.

Explore

Understand the data, identify patterns, and gain insights.

Export

Extract data from a system for use in another system or application.

Extract

The process of retrieving data out of unstructured data sources for further processing or storage.

Extract, Load, Transform (ELT)

A variant of ETL in which extracted data is loaded into the target system and then transformed.

Extrapolate

Predict values outside a known range, based on the trends or patterns identified within the available data.

Factory Pattern

A design pattern that provides an interface for creating objects while letting subclasses decide which class to instantiate.

Fan-Out

A pipeline design in which one operation is broken into - or results in - many parallel downstream tasks.

Fault Tolerance

The property that enables a system to continue operating properly in the event of the failure of some of its components.

Feather

A binary columnar serialization format optimized for use with DataFrames in analytics. It is language agnostic, though it is most commonly used with Python and R. Ideal for fast, lightweight reading and writing of data frames.

Feature Engineering

The process of using domain knowledge to create new features from the existing ones, improving the performance of machine learning models.

Feature Extraction

Identify and extract relevant features from raw data for use in analysis or modeling.

Feature Scaling

A method used to normalize the range of independent variables or features of data.
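
For example, min-max scaling rescales a feature to the [0, 1] range (the values below are arbitrary):

```python
# Min-max scaling: rescale a feature to the [0, 1] range.
values = [3.0, 7.0, 10.0, 15.0]

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.333..., 0.583..., 1.0]
```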

Feature Selection

The process of selecting a subset of relevant features (variables, predictors) for use in model construction, reducing overfitting and improving model generalization.

Feature Store

A centralized repository for storing, serving, and sharing machine learning features, allowing for the consistent use of features across different models.

Federated Learning

A machine learning approach that trains an algorithm across multiple decentralized devices or servers holding local data samples and without exchanging them.

Federated Query

A type of query in database computing, spanning multiple databases, possibly using different database management systems.

File Format

The way in which data is stored in a file, designated by a file extension.

Filter

Extract a subset of data based on specific criteria or conditions.

Flink

An open-source stream-processing framework for high-throughput, fault-tolerant, and scalable processing of data streams.

Flume

A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Foreign Key

A set of one or more columns used to establish a link between the data in two tables by referencing a unique key in another table.

Fragment

Break data down into smaller chunks for storage and management purposes.

Full Stack Development

The development of both front end (client-side) and back end (server-side) portions of a web application.

Function as a Service (FaaS)

A category of cloud services that provides a platform allowing customers to develop, run, and manage application functionalities without complex infrastructure.

Functional Programming

A programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing state and mutable data.

Garbage Collection

Automatic memory management, the process by which a program runs in the background to identify and delete objects that are no longer needed by the program.

Gated Recurrent Unit (GRU)

A variant of the Recurrent Neural Network (RNN), designed to capture dependencies for sequences of varied lengths without using a fixed-size time step.

Genetic Algorithm

A search heuristic that is inspired by Charles Darwin’s theory of natural evolution, used to find approximate solutions to optimization and search problems.

Geo-replication

Replication of datasets across geographical locations, primarily for data resilience and availability purposes.

Geospatial Analysis

The gathering, display, and manipulation of imagery, GPS, satellite photographs, and historical data represented in terms of geographic coordinates.

Git

A free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

GitHub

A web-based platform that provides hosting for software development and a community of developers to work together and share code.

Google BigQuery

A fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure.

Google Cloud Platform (GCP)

A provider of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, and YouTube.

Gradient Boosting

A machine learning technique for regression and classification problems, which builds a model in a stage-wise fashion, optimizing for predictive accuracy.

Graph Database

A database designed to treat the relationships between data as equally important to the data itself, used to store data whose relations are best represented as a graph.

Graph Processing

A type of data processing that uses graph theory to analyze and visually represent data relationships.

Graph Theory

A field in discrete mathematics that studies graphs, which are mathematical structures used to model pairwise relations between objects, important in understanding the structure of various kinds of networks, including data networks.

Greedy Algorithm

An algorithmic paradigm that makes locally optimal choices at each stage with the hope of finding the global optimum.

Grid Computing

A form of distributed computing whereby a 'super and virtual computer' is composed of clustered, networked, loosely coupled computers acting in parallel to perform very large tasks.

Grid Search

An approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.

GZip

A file format and a software application used for file compression and decompression.

Hadoop Distributed File System (HDFS)

A distributed file system designed to run on commodity hardware, providing high-throughput access to application data and fault tolerance.

Hash

Convert data into a fixed-length code to improve data security and integrity.

Hash Function

A function that converts an input into a fixed-size string of bytes, typically a digest that is unique to the given input.
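
A short illustration using SHA-256 from Python's hashlib: whatever the input length, the digest has a fixed size (the sample record is invented).

```python
import hashlib

record = b"user:42|alice@example.com"

# SHA-256 maps arbitrary-length input to a fixed-size 32-byte digest.
digest = hashlib.sha256(record).hexdigest()
print(digest)       # 64 hex characters, regardless of input size
print(len(digest))  # 64
```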

Hashing

The process of transforming input of any length into a fixed-size string of text, typically using a hash function.

HDF5 (Hierarchical Data Format version 5)

A file format and set of tools for managing complex data. It is designed for flexible, efficient I/O and for high volume and complex data sets and supports an unlimited variety of datatypes.

Heap

A specialized tree-based data structure that satisfies the heap property, used in computer memory management and for heapsort algorithm.

Helm

A package manager for Kubernetes that allows developers and operators to more easily package, configure, and deploy applications and services onto Kubernetes clusters.

Heterogeneous Database System

A system that uses middleware to connect databases that are not alike and are running on different DBMSs, possibly on different platforms.

Hierarchical Database Model

A data model where data is organized into a tree-like structure with a single root, to which all other data is linked in a hierarchy.

High Availability

A characteristic of a system aiming to ensure an agreed level of operational performance for a higher than normal period.

High Cardinality

A term used to define the uniqueness of data values contained in a column. If a column has a high number of unique values, it is said to have high cardinality.

High-Availability Systems

Systems designed to be operational and accessible for longer periods, minimizing downtime and ensuring continuous service.

Homogeneous Database System

A system where all databases are based on the same DBMS technology.

Homogenize

Make data uniform, consistent, and comparable.

Horizontal Scaling

Adding more machines to a network to improve the capability to handle more load and perform better, also known as scaling out.

Hortonworks

A provider of comprehensive solutions for data management and analytics.

Hot storage

The immediate, high-speed storage of data that is frequently accessed and modified, enabling rapid retrieval and updates.

HTML Parsing

Analyzing HTML code to extract relevant information and understand the structure of the content, often used in web scraping.

Huge Pages

Memory pages that are larger than the standard memory page size, beneficial in managing large amounts of memory.

Hybrid Cloud

An IT architecture that incorporates some degree of workload portability, orchestration, and management across a mix of on-premises data centers, private clouds, and public clouds.

Hyperparameter

A configuration value that is external to the model and cannot be estimated from the data; hyperparameters are set before training and guide the process of estimating model parameters.

Hyperparameter Tuning

The process of optimizing the configuration parameters of a machine learning model, called hyperparameters, to improve model performance on a given metric.

Hypervisor

A piece of software, firmware, or hardware that creates and runs virtual machines (VMs).

Idempotence

A property of certain operations in mathematics and computer science, whereby they can be applied multiple times without changing the result beyond the initial application.
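
A tiny sketch of the property: applying the same update once or many times leaves the record in the same state (the set_status function and record fields are hypothetical).

```python
def set_status(record: dict, status: str) -> dict:
    # Idempotent update: re-applying it does not change the result further.
    record["status"] = status
    return record

row = {"id": 1, "status": "new"}
once = set_status(dict(row), "processed")
twice = set_status(set_status(dict(row), "processed"), "processed")
assert once == twice  # same result no matter how many times it is applied
```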

Immutable Data

Data that once created, cannot be changed. Any modification necessitates the creation of a new instance.

Impala

An open-source, native analytic database for Apache Hadoop, providing high-performance, low-latency SQL queries on Hadoop data.

Imputation

The process of replacing missing data with substituted values, allowing more robust analysis when dealing with incomplete datasets.

Impute

Fill in missing data values with estimated substitutes to facilitate analysis.

In-Memory Database (IMDB)

A database management system that primarily relies on main memory for computer data storage, faster than disk storage-based databases.

Index

Create an optimized data structure for fast search and retrieval.

Indexing

The process of creating a data structure (an index) to improve the speed of data retrieval operations on a database.

Informatica

A provider of closed-source data management and data integration solutions.

Information Retrieval

The process of obtaining information from a repository, often concerning text-based search.

Infrastructure as Code (IaC)

A key DevOps practice that involves managing and provisioning computing infrastructure through machine-readable script files, rather than through physical hardware configuration or interactive configuration tools.

Ingest

The initial collection and import of data from various sources into your processing environment.

Ingestion

The process of importing, transferring, loading, and processing data for later use or storage in a database.

Input/Output Operations Per Second (IOPS)

A common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid-state drives (SSD), and storage area networks (SAN).

Instance

A single occurrence of an object, often referring to virtual machines (VMs) or individual database items.

Integrate

The process of combining data from different sources and providing users with a unified view of them.

Integration Testing

A level of software testing where individual units are combined and tested as a group, to expose faults in the interaction between integrated units.

Integrity Constraints

Rules applied to maintain the quality and accuracy of the data inside a database, such as uniqueness, referential integrity, and check constraints.

Interactive Query

A query mechanism allowing users to ask spontaneous questions and receive rapid responses, used in analyzing datasets.

Interoperability

The ability of different IT systems, software applications, and devices to communicate, exchange, and use information effectively.

Interpolate

Use known data values to estimate unknown data values.

Interval Data Type

A type of data that represents a duration between two datetime values, such as the span of time between a start-time and an end-time.

Inversion of Control (IoC)

A design principle in which the custom-written portions of a computer program receive the flow of control from a generic, reusable library.

Isolation Levels

Different configurations used in databases to trade off consistency for performance, such as Read Uncommitted, Read Committed, Repeatable Read, and Serializable.

Iterative Model

A software development model that involves repeating the same set of activities for each portion of the project, allowing refinement with each iteration.

Java Database Connectivity (JDBC)

An API for the Java programming language that defines how a client may access a database, providing methods to query and update data in a database.

Jenkins

An open-source automation server, helping to automate parts of the software development process.

Join Operation

A SQL operation used to combine rows from two or more tables based on a related column between them.

Joins

An SQL operation performed to connect rows from two or more tables based on a related column.

JSON (JavaScript Object Notation)

A lightweight, text-based, and human-readable data interchange format used for representing structured data. It is based on a subset of the JavaScript Programming Language and is easy for humans to read and write and for machines to parse and generate.
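
A quick round-trip with Python's json module, serializing a dict to a JSON string and parsing it back (the event payload is made up):

```python
import json

event = {"id": 42, "type": "click", "tags": ["web", "mobile"]}

# Serialize a Python dict to a JSON string, then parse it back.
text = json.dumps(event)
parsed = json.loads(text)
assert parsed == event
print(text)  # {"id": 42, "type": "click", "tags": ["web", "mobile"]}
```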

Jupyter Notebook

An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

Just-In-Time Compilation (JIT)

A way of executing computer code that involves compilation during the execution of a program at runtime rather than prior to execution, improving the execution efficiency.

K-Means Clustering

A partitioning method that divides a dataset into subsets (clusters), where each data point belongs to the cluster with the nearest mean.
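
A brief sketch, assuming scikit-learn is available; the toy 2-D points are invented and form two loose groups that KMeans assigns to two clusters.

```python
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups.
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
     [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each point
print(model.cluster_centers_)  # the two cluster means
```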

K-Nearest Neighbors (KNN)

A simple, supervised machine learning algorithm used for classification and regression, which predicts the classification or value of a new point based on the K nearest points.

Kafka

An open-source stream processing platform developed by LinkedIn and donated to the Apache Software Foundation, designed for high-throughput, fault-tolerance, and scalability.

Key Performance Indicator (KPI)

A type of performance measurement that evaluates the success of an organization, employee, etc., in achieving objectives.

Key-Value Store

A type of NoSQL database that uses a simple key/value method to store data, suitable for storing large amounts of data.

Kibana

An open-source data visualization dashboard for Elasticsearch, providing visualization capabilities on top of the content indexed in Elasticsearch clusters.

Kinesis

A platform provided by Amazon Web Services (AWS) to collect, process, and analyze real-time, streaming data.

Knowledge Graph

A knowledge base used to store complex structured and unstructured information used by machines and humans to enhance search and understand relationships and properties of the data.

Kubernetes

An open-source platform designed to automate deploying, scaling, and operating application containers, allowing for easy management of containerized applications across multiple hosts.

Lambda Architecture

A data processing architecture designed to handle massive quantities of data by combining batch processing and stream processing, providing a balance between latency, throughput, and fault-tolerance.

Late Binding

Delaying the binding of referenced attributes and methods until runtime.

Latent Semantic Analysis (LSA)

A technique in natural language processing and information retrieval to discover relationships between words and the concepts they form.

Lazy Loading

A design pattern used in computer programming to defer initialization of an object until the point at which it is needed.

Lineage

An understanding of how data moves through a pipeline, including its origin, transformations, dependencies, and ultimate consumption.

Linear Regression

A statistical method used to model the relationship between a dependent variable and one or more independent variables, predicting outcomes.

Linked Data

A method of publishing structured data so that it can be interlinked and become more useful, leveraging the structure of the data to enhance its usability and discoverability.

Load

The process of transferring data from one location, format, or application to another, typically into a database.

Load Balancer

A device or software function that distributes network or application traffic across multiple servers, optimizing resource use, maximizing throughput, minimizing response time, and avoiding overload.

Load Shedding

The process of reducing the load on a system by restricting the amount of incoming requests.

Load Testing

A type of non-functional testing conducted to understand the behavior of the application under a specific expected load, identifying the maximum operating capacity of an application and any bottlenecks.

Localization

The process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text.

Locking

A mechanism employed by RDBMSs to regulate data access in multi-user environments, ensuring the integrity of data by preventing multiple users from altering the same data at the same time.

Log Files

Files that record either events that occur in an operating system or other software runs, or messages between different users of a communication software.

Log Mining

A process that involves analyzing log files from different sources to uncover insights, which can be used for various purposes such as security, performance monitoring, and user behavior analysis.

Logistic Regression

A statistical method used to analyze a dataset and predict binary outcomes, utilizing a logistic function to model a binary dependent variable.

Logstash

A server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a 'stash' like Elasticsearch.

Long Short-Term Memory (LSTM)

A special kind of RNN capable of learning long-term dependencies, particularly useful for learning from important experiences separated by very long time lags.

Long-Polling

A web communication technique where the client requests information from the server, and the server holds the request open until new information is available.

Looker

A data exploration and discovery business intelligence platform.

Lookup Table

A table with one or more columns, where you look up a value in the table based on the value in one or more columns.

Loss Function

A function used in optimization to measure the difference between the predicted value and the actual value, guiding the model training process.

Low Latency

Characterized by a short delay from input into a system to the desired outcome, crucial in systems requiring real-time response.

Luigi

An older Python module that helps you build basic pipelines of batch jobs.

Machine Learning

A method of data analysis that automates analytical model building, enabling systems to learn from data, identify patterns, and make decisions.

Machine Learning Operations (MLOps)

A practice for collaboration and communication between data scientists and operations professionals to help manage the production machine learning lifecycle.

Machine Learning Pipeline

A sequence of data processing and machine learning tasks, assembled to create a model, with each step in the sequence processing the data and passing it on to the next step.

Machine-to-Machine (M2M)

Direct communication between devices using any communications channel, including wired and wireless.

Map

The process of defining relationships between two distinct data models.

MapR

Offers a comprehensive data platform with the speed, scale, and reliability required by enterprise-grade applications.

MapReduce

A programming model for processing and generating large datasets in parallel with a distributed algorithm on a cluster, initially developed by Google.
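
A word-count sketch of the idea in plain Python, with a map phase emitting (word, 1) pairs and a reduce phase summing them by key; in a real MapReduce system these phases run distributed across a cluster (the documents are invented).

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big pipelines", "data pipelines at scale"]

# Map: emit (word, 1) pairs from each document.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle/Reduce: group by key and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'pipelines': 2, 'at': 1, 'scale': 1}
```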

Markdown

A lightweight markup language with plain text formatting syntax designed for creating rich text using a plain text editor.

Mask

The method of protecting sensitive information in non-production environments by altering data records so that the structure remains similar while the information itself is changed.

Master Data Management (MDM)

A method that defines and manages the critical data of an organization to provide a single point of reference across the organization.

Materialize

Executing a computation and persisting the results into storage.

Materialized View

A database object that contains the results of a query, providing indirect access to table data by storing the results of the query in a separate schema object.

Mean Squared Error (MSE)

A measure of the average of the squares of the errors, used as a risk metric corresponding to the expected value of the squared (quadratic) error or loss.

Median

A measure of central tendency representing the middle value of a sorted list of numbers, separating the higher half from the lower half of the data set.

Memoize

Store the results of expensive function calls and reuse them when the same inputs occur again.
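
One common way to do this in Python is functools.lru_cache, shown here on a naive Fibonacci function so repeated sub-calls are computed only once:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Results are cached, so each distinct n is computed only once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(40))           # fast, thanks to memoization
print(fib.cache_info())  # hit/miss statistics from the cache
```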

Merge

Combine data from multiple datasets into a single dataset.

Message Passing

A method by which information is communicated between distributed or parallel processes in a computer system.

Message Queue

A form of asynchronous service-to-service communication used in serverless and microservices architectures.

MessagePack

A binary format efficiently encoding objects and their fields in a compact binary representation. It is more efficient and compact compared to JSON, used when performance and bandwidth are concerns.

Metadata

Data that provides information about other data, such as data structure or content details.

Metadata Management

The administration of data that describes other data, involving establishing and managing descriptions, definitions, scope, ownership, and other characteristics of metadata.

Micro-Batching

A data processing method that deals with relatively small batches of data, providing a middle ground between batch processing and stream processing.

Microservices

A software development technique that structures an application as a collection of loosely coupled services, allowing for improved scalability and ease of updates.

Microservices Architecture

An architectural style that structures an application as a collection of services, which are highly maintainable and testable, loosely coupled, independently deployable, and precisely scoped.

Microsoft Azure

A cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers.

Microsoft SQL Server

A relational database management system developed by Microsoft.

Microsoft SSIS (SQL Server Integration Services)

A platform for data integration and workflow applications.

Middleware

Software that acts as a bridge between an operating system or database and applications, enabling communication and data management.

Migrate

The process of transferring data between storage types, formats, or computer systems, usually performed programmatically.

Mine

Extract useful information, patterns or insights from large volumes of data using statistics and machine learning.

Model

The process of creating abstract representations of the structure and relationship between various data items in an application or database.

Model Deployment

The integration of a machine learning model into an existing production environment to make practical business decisions based on data.

Model Selection

The task of selecting a statistical model from a set of candidate models, based on the performance of the models on a given dataset.

Model Validation

The process of assessing how well your model performs at making predictions on new data, by using various metrics and statistical methods.

MongoDB

A popular NoSQL database, utilizing a document-oriented database model.

Monitor

Track data processing metrics and system health to ensure high availability and performance.

Monitoring

The process of observing and checking the quality or content of data over a period, aimed at detecting patterns, performance, failures, or other attributes.

Multi-Cloud

The use of multiple cloud computing and storage services in a single network architecture, utilized by businesses to spread computing resources and minimize the risk of data loss or downtime.

Multi-tenancy

A mode of operation of software in which multiple independent instances of one or more applications operate in a shared environment.

Multidimensional Scaling (MDS)

A means of visualizing the level of similarity of individual cases of a dataset, used in information visualization to detect patterns in high-dimensional data.

Multilabel Classification

A type of classification task where each instance (or data point) can belong to multiple classes, as opposed to just one in the traditional case.

Multilayer Perceptron (MLP)

A class of feedforward artificial neural network consisting of at least three layers of nodes, used for classification and regression.

Multiprocessing

Optimize execution time with multiple parallel processes.
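
A small sketch of process-based parallelism using Python's standard multiprocessing module (the square function and pool size are illustrative):

```python
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":            # guard required when new processes are spawned
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)                    # [0, 1, 4, 9, ...]
```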

Multithreading

The ability of a CPU, or a single core in a multi-core processor, to provide multiple threads of execution concurrently.

Munge

See 'wrangle'.

Mutability

The capability of an object to be altered or changed, often used in contrast with immutability, which refers to the incapacity to be changed.

MySQL

A popular open-source relational database management system.

N+1 Query Problem

A common performance problem in applications that use ORMs to fetch data; it occurs when the system retrieves related objects with a separate query for each parent object, leading to a large number of executed SQL queries.
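
An illustrative sketch of the problem and its fix using Python's built-in sqlite3 module and a hypothetical authors/books schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO books VALUES (1, 1, 'Notes'), (2, 2, 'Compilers');
""")

# N+1 pattern: 1 query for the authors, then 1 extra query per author.
for author_id, name in conn.execute("SELECT id, name FROM authors"):
    titles = conn.execute(
        "SELECT title FROM books WHERE author_id = ?", (author_id,)
    ).fetchall()

# Fix: fetch the same data with a single JOIN query.
rows = conn.execute(
    "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
).fetchall()
```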

Naïve Bayes Classifier

A family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.

Named Entity Recognition (NER)

A subtask of information extraction that classifies named entities in text into pre-defined categories such as person names, organizations, locations, etc.

Namespace

A container that holds a set of identifiers to help avoid collisions between identifiers with the same name.

Natural Language Processing (NLP)

A field of artificial intelligence that focuses on the interaction between computers and humans through natural language, enabling computers to understand, interpret, and generate human language.

Network Partition

A network failure that divides a network into two or more disconnected sub-networks due to the failure of network devices.

Neural Network

A set of algorithms, modeled loosely after the human brain, designed to recognize patterns in data through machine learning.

Normality Testing

Assess the normality of data distributions to ensure validity and reliability of statistical analysis.

Normalization

The process of organizing the columns (attributes) and tables (relations) of a relational database to reduce redundancy and dependency.

Normalize

Standardize data values to facilitate comparison and analysis. Organize data into a consistent format.
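
A small sketch of min-max normalization with NumPy, rescaling values into the [0, 1] range (assumes NumPy is available; the sample values are illustrative):

```python
import numpy as np

x = np.array([3.0, 8.0, 10.0, 15.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # values now fall in [0, 1]
print(x_norm)
```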

NoSQL Database

A non-relational database that allows for storage and processing of large amounts of unstructured data and is designed for distributed data stores where very large-scale processing is needed.

Null Hypothesis

A general statement or default position that there is no relationship between two measured phenomena, to be tested and refuted in the process of statistical hypothesis testing.

NumPy

A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
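
A brief sketch of what working with NumPy arrays looks like (the values are illustrative):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.mean(axis=0))   # column means -> [2. 3.]
print(a @ a.T)          # matrix product of a with its transpose
```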

Obfuscate

The technique of disguising data by replacing, encrypting, or removing sensitive information to protect the data subject.

Object Storage

A storage architecture that manages data as objects, as opposed to other storage architectures like file systems or block storage.

Object-Relational Mapping (ORM)

A programming technique to convert data between incompatible type systems in object-oriented programming languages.

Observability

The ability to understand the internal state of a system from its external outputs, crucial in modern computing environments to ensure the reliability, availability, and performance of systems.

OLAP (Online Analytical Processing)

A category of software tools that allows users to analyze data from multiple database dimensions.

OLAP Cube

A multi-dimensional array of data used for complex calculations, enabling users to drill down into multiple levels of hierarchical data, making it a key technology for data analysis and reporting.

OLTP (Online Transaction Processing)

A type of processing that facilitates and manages transaction-oriented applications.

One-Hot Encoding

A process of converting categorical variables into binary indicator columns, one column per category, so they can be provided to machine learning algorithms to improve predictions.
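
A minimal sketch using pandas, which provides one common way to one-hot encode a column (the color column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])   # one indicator column per distinct value
print(encoded)
```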

Online Analytical Processing (OLAP)

A category of software tools that analyze data from various database perspectives and enable users to interactively analyze multidimensional data from multiple perspectives.

Ontology

A representation of a set of concepts within a domain and the relationships between those concepts, used to reason about the entities within that domain.

Open Database Connectivity (ODBC)

A standard application programming interface (API) for accessing database management systems.

Operating System (OS)

Software that manages computer hardware and provides various services for computer programs, serving as a bridge between users and the computer hardware.

Operational Data Store (ODS)

A database designed to integrate data from multiple sources for additional operations on the data, serving as an intermediary between the operational source systems and the data warehouse.

Optimistic Concurrency Control

A type of concurrency control method applied on transactional systems to handle simultaneous updates.

Optimization

The process of adjusting a system to improve its efficiency or use of resources, usually in the context of improving the performance of algorithms and models.

Oracle Database

A multi-model database management system.

ORC (Optimized Row Columnar)

A columnar storage file format optimized for heavy read access and is highly suitable for storing and processing big data workloads. It is highly compressed and efficient, reducing the amount of storage space needed for large datasets.

Orchestration

Automated configuration, coordination, and management of computer systems, middleware, and services.

Outlier Detection

The identification of rare items, events, or observations in a data set that raise suspicions due to differences in pattern or behavior from the majority of the data.

Overfitting

A modeling error that occurs when a function is too closely tailored to the training dataset; hence, the model performs well on the training dataset but poorly on new, unseen data.

P-value

A measure in statistical hypothesis testing that helps in determining the strength of the evidence that the null hypothesis can be rejected.

Page Cache

A transparent cache for the pages originating from a secondary storage device such as a hard disk drive.

PageRank

An algorithm used by Google Search to rank web pages in its search results, based on the number and quality of links pointing to each page.

Pandas

A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of Python.
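
A brief sketch of a typical pandas workflow, loading tabular data into a DataFrame and aggregating it (the sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "NYC", "SF"], "sales": [10, 20, 5]})
print(df.groupby("city")["sales"].sum())   # total sales per city
```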

Parallel Processing

A type of computation in which many calculations or processes are carried out simultaneously, suitable for tasks where many operations are independent of each other.

Parallelize

Boost execution speed of large data processing by breaking the task into many smaller concurrent tasks.

Parameter Tuning

The adjustment of a model's internal parameters, such as weights, during the training process, with the aim of improving model accuracy.

Parquet

A columnar storage file format optimized for use with big data processing frameworks. It is highly efficient for both storage and processing, especially for complex nested data structures, and it supports schema evolution, allowing users to modify Parquet schema after data ingestion.

Parse

Interpret and convert data from one format to another.

Partition

The process of dividing a database into smaller, more manageable pieces, usually for improving performance, manageability, and availability.

Partitioning

A database design technique to improve performance, manageability, or availability by splitting tables into smaller, more manageable pieces.

Pattern Recognition

A branch of machine learning that focuses on the recognition of patterns and regularities in data.

Payload

The part of the transmitted data that is the actual intended message, excluding any headers or metadata sent mainly for the purpose of the delivery of the payload.

Peer-to-Peer Network

A decentralized network where each connected computer has equal status and can interact with each other without a central server.

Percentile

A statistical measure that indicates the value below which a given percentage of observations fall in a group of observations.

Performance Tuning

The improvement of system performance, typically in computer systems and networks, by adjusting various underlying parameters and configurations.

Permutation Test

A type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.

Persistence Layer

The data access layer in a software application that stores and retrieves data from databases, files, and other storage locations.

Pickle

Convert a Python object into a byte stream for efficient storage.

Pipeline

A set of tools and processes chained together to automate the flow of data from source to storage, allowing for stages of transformation and analysis in between.

Polyglot Persistence

The use of various, often complementary database technologies to handle varying data storage needs within a given software application.

Polynomial Regression

A type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial.

PostgreSQL

Advanced, open-source object-relational database management system.

Power BI

A business analytics service by Microsoft, providing interactive visualizations with self-service business intelligence capabilities.

Pre-aggregate

See 'aggregate'.

Precision

A metric in classification that measures the number of true positive results divided by the number of all results predicted as positive, i.e., true positives plus false positives.

Predictive Analytics

The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.

Predictive Modeling

The process of creating, testing, and validating a model to best predict the probability of an outcome.

Prep

Transform your data so it is fit-for-purpose.

Preprocess

Transform raw data before data analysis or machine learning modeling.

Primary Key

A unique identifier for a record in a database table, ensuring that each record can be uniquely identified and retrieved.

Principal Component Analysis (PCA)

A dimensionality reduction technique used to emphasize variation and bring out strong patterns in a dataset, often used before fitting a machine learning model to the data.
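
A minimal sketch using scikit-learn's PCA implementation, assuming scikit-learn and NumPy are installed (the random matrix stands in for real feature data):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)              # 100 samples, 5 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project onto the top 2 principal components
print(pca.explained_variance_ratio_)    # share of variance captured by each component
```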

Probabilistic Data Structure

A high-performance, low-memory data structure that provides approximations to set operations, often used for tasks like membership tests, frequency counting, and finding heavy hitters.

Process

Manipulation of data to convert it from one form to another or to reduce it to a more manageable state.

Process Isolation

A form of data security which prevents running processes from interacting with each other, often used in multitasking operating systems to increase security and stability.

Profile

The process of examining, analyzing, and reviewing data to collect statistics and information about the quality and the nature of the data items.

Programmatic Advertising

The automated buying and selling of online advertising, optimizing based on algorithms and data.

Projection

A database operation that returns a set of columns (attributes) in a table, reducing the number of columns in the resultant relation.

Protobuf (Protocol Buffers)

A method developed by Google for serializing structured data, serving a similar purpose to XML and JSON but simpler and more efficient than both. Protobuf is language-agnostic, making it highly versatile across different systems.

Prototyping

The process of quickly creating a working model (a prototype) of a part of a system, allowing for faster and more efficient final design and development.

Pseudonymization

A data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers.

Pub/Sub (Publish/Subscribe)

A messaging pattern where senders of messages (publishers) do not send messages directly to specific receivers (subscribers); instead, messages are grouped into classes or topics, and subscribers receive only the messages from the topics they subscribe to.

Pull Request

A method of submitting contributions to an open development project, often used in collaborative development to manage changes from multiple contributors.

Purge

The process of permanently and irreversibly deleting old and irrelevant records from a database.

Push Notification

A message that pops up on a mobile device or desktop from an app or website, typically used to deliver updates, news, or promotions.

Python Pickle

A module in Python used for serializing and de-serializing Python object structures, converting Python objects into a byte stream.
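
A minimal sketch of serializing and deserializing an object with the pickle module (the record dictionary is a made-up example):

```python
import pickle

record = {"user_id": 42, "events": [1, 2, 3]}
blob = pickle.dumps(record)       # Python object -> byte stream
restored = pickle.loads(blob)     # byte stream -> Python object
assert restored == record
```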

PyTorch

An open-source machine learning library for Python, developed by Facebook’s AI Research lab.

QlikView

A Business Intelligence (BI) tool ideal for data visualization, analytics development, and reporting.

Quantile

A data point or set of data points in a dataset that divide your data into “parts” of equal probability, such as the median, quartiles, percentiles, etc.

Quantum Computing

A type of computation that takes advantage of the quantum states of particles to store information, potentially allowing for the solving of complex problems much faster than classical computers can.

Query Language

A type of computer language that requests and retrieves data from database management systems.

Query Optimization

The process of choosing the most efficient means of executing a SQL statement, usually involving the optimization of SQL queries and projections, and the choice of optimal query plans.

Query Plan

A sequence of steps used to access data in a SQL relational database management system, important for optimizing database queries and improving system performance.

Rack Awareness

A concept applied in distributed computing to minimize the latency and use of resources while retrieving data and to ensure data availability during component failures.

Radial Basis Function (RBF)

A function whose value depends on the distance between the input and some fixed point, typically used in various areas such as function approximation, time series prediction, and classification.

Random Forest

An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Range Query

A type of query that retrieves data based on a range of values, typically used in the context of numerical or datetime values.

Real-Time Bidding (RTB)

A means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction.

Real-Time Processing

The processing of data that continuously enters a system and obtains results within a timeframe short enough to affect the sources of the incoming data.

Recommender System

A subclass of information filtering system that seeks to predict the 'rating' or 'preference' a user would give to an item.

Reconcile

The process of ensuring that two or more datasets are consistent with each other, identifying any discrepancies and resolving them.

Record Linkage

The process of finding entries that refer to the same entity in different data sources.

Recurrent Neural Network (RNN)

A class of artificial neural networks designed for sequence prediction problems and other tasks where data points have connections to previous points, such as time series analysis and natural language processing.

Redis

An in-memory data structure store, used as a database, cache, and message broker.

Reduce

The process of reducing the amount of raw data, either by aggregating it, choosing representative subsets, or transforming it into a more compact representation.

Reduce

Convert a large set of data into a smaller, more manageable form without significant loss of information.

Redundancy

The duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe.

Referential Integrity

A property of data stating that all its references are valid and ensures that the relationship between tables remains consistent.

Regression Analysis

A statistical process for estimating the relationships among variables, often used for prediction and forecasting, where one variable is dependent on one or more independent variables.

Regular Expression (Regex)

A sequence of characters defining a search pattern, typically used by string-searching algorithms for 'find' or 'find and replace' operations on strings, crucial for data cleaning and transformation.
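
A short sketch of typical regex use in data cleaning with Python's re module (the sample strings and pattern are illustrative):

```python
import re

text = "Order #123 shipped on 2024-05-01"
match = re.search(r"\d{4}-\d{2}-\d{2}", text)   # find an ISO-style date
if match:
    print(match.group())                        # "2024-05-01"

cleaned = re.sub(r"\s+", " ", "too   many    spaces")   # collapse runs of whitespace
```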

Regularization

A technique used to prevent overfitting in a machine learning model by adding a penalty term to the model’s loss function; commonly used forms are L1 and L2 regularization.

Reinforcement Learning

A type of machine learning where an agent learns how to behave in an environment by performing certain actions and receiving rewards or penalties in return.

Relational Algebra

A theoretical set of mathematical principles and concepts forming the foundational basis for implementing and optimizing queries in Relational Database Management Systems.

Relational Database

A type of database that stores data in structured tables and is based on the relational model.

Relational Model

A database model based on first-order predicate logic, serving as the basis for relational databases, where all data is represented in terms of tuples, grouped into relations.

Repartition

Redistribute data across multiple partitions for improved parallelism and performance.

Replica Set

A group of database nodes that maintains the same data set, providing redundancy and increasing data availability with multiple copies of data on different database servers.

Replicate

The process of copying data from a database in one server or computer to a database in another so that all users share the same level of information.

Representation Learning

An area of machine learning where automatic feature learning from raw data is explored, aimed at identifying better representations and improving model generalization.

Request-Response

A message exchange pattern in which a requester sends a request message to a replier system, which then sends a response message in return.

Reshape

Change the structure of data to better fit specific analysis or modeling requirements.

Resilient Distributed Dataset (RDD)

A fault-tolerant collection of elements that can be processed in parallel; it is the fundamental data structure of Apache Spark.

Response Variable

The variable that is being predicted or modeled, often denoted as the dependent variable or output variable.

RESTful API

An architectural style for designing networked applications, utilizing stateless, cacheable communications protocols, typically HTTP.

Ridge Regression

A regularization technique for analyzing multiple regression data that suffer from multicollinearity, shrinking the coefficients of the model towards zero to stabilize them.

Risk Analysis

The process of identifying and analyzing potential issues that could negatively impact key business initiatives or projects.

Rollback

The operation which undoes partially completed transactions by the database management system after a failed transaction.

Root Mean Square Error (RMSE)

A standard way to measure the error of a model in predicting quantitative data, it’s the square root of the average squared differences between the predicted and observed actual outcomes.
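
A small worked sketch of the RMSE computation with NumPy (the observed and predicted values are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # square root of the mean squared error
print(rmse)
```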

Routing

The process of selecting a path for traffic in a network or between or across multiple networks, based on routing table information.

Row-Level Security (RLS)

A method of restricting access at the database row level, based on parameters such as user roles or identity, enabling fine-grained access control.

Ruby on Rails

A server-side web application framework written in Ruby, it is a model-view-controller (MVC) framework, providing default structures for a database, a web service, and web pages.

Sample

Extract a subset of data for exploratory analysis or to reduce computational complexity.

Sampling

The process of selecting a subset of elements from a larger set to approximate the properties of the whole set, often used for statistical analysis.

Sandboxing

A security mechanism used to run an application in a confined environment, isolating it from the system, preventing it from causing harm or accessing sensitive data.

Scalability

The capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.

Scalar

A quantity represented by a single element in the corresponding field, usually a single number, as opposed to a vector or matrix.

Scaling

Increasing the capacity or performance of a system to handle more data or traffic.

Schema

The organization or structure for a database, defining tables, fields, relationships, indexes, etc.

Schema Evolution

The ability of a database system to handle changes in a database schema, especially relevant for systems that require flexibility and adaptability to changing data requirements.

Schema Mapping

Translate data from one schema or structure to another to facilitate data integration.

Schema-on-Read

A strategy where data structure is inferred at read time, typically used in big data processing where data is not predefined and is instead interpreted when it is analyzed.

Schema-on-Write

A strategy where data structure is defined before writing data, typically used in relational databases where data must conform to a known schema before it's written to disk.

Scikit-learn

A free software machine learning library for the Python programming language. It features various classification, regression, clustering algorithms, and efficient tools for data mining and data analysis.

SciPy

An open-source Python library used for scientific and technical computing.

Scrape

Extract data from a website or another source.

Scraping

The process of extracting data from websites, converting it from unstructured to structured form.

Scrub

A process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated, also known as data cleansing.

Search Engine

A software application designed to search for information in a database, with requested information returned to the user as search results.

Search Engine Optimization (SEO)

The practice of optimizing content to be discovered through a search engine’s organic search results, affecting the visibility of a website or a web page.

Secure

Protect data from unauthorized access, modification, or destruction.

Segmentation

The process of dividing a data set into distinct and meaningful groups, usually to perform more specific analysis, or to target specific subsets of users.

Semantic Analysis

The process of analyzing the meanings of words, texts, and sentences, typically used in NLP to understand the context and intent behind the words.

Semi-Supervised Learning

A class of machine learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.

Sentiment Analysis

The use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

Sequential Pattern Mining

A method of discovering frequent subsequences or patterns in a sequence of items or events, usually in datasets of customer transactions or other sequence data.

Serialize

The process of converting complex data structures into a format that can be easily stored or transmitted and then reconstructed later.

Serverless Computing

A cloud-computing execution model where the cloud provider runs the server and dynamically manages the allocation of machine resources, allowing developers to focus on individual functions.

Service-Oriented Architecture (SOA)

An architectural pattern in software design where services are provided to the other components by application components, through a communication protocol over a network.

Shard

A method of splitting and storing a single logical dataset in multiple databases to spread the load, enhancing the performance and enabling horizontal scaling.

Shred

Break down large datasets into smaller, more manageable pieces for easier processing and analysis.

Shuffle

Randomize the order of data records to improve analysis and prevent bias.

Similarity Measure

A numeric measure of how alike two data objects are, often used in clustering, classification, or nearest neighbor analysis.

Single Source of Truth (SSOT)

A practice of structuring information models and associated schema such that every data element is mastered in only one place.

Site Reliability Engineering (SRE)

A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems, aiming for creating scalable and highly reliable software systems.

Skew

A condition in which the distribution of data is not uniform, impacting the performance of data processing in parallel computing environments.

Skewness

A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean, indicating whether the data points are skewed to the left or right.

Sliding Window

A technique used in analyzing or processing sequences of data, where a window of specified size moves across the data, and for each position of the window, a computation is performed.
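
A minimal Python sketch of a sliding-window computation, here a rolling mean over a fixed window size (the values and window size are illustrative):

```python
from collections import deque

def sliding_mean(values, size):
    """Yield the mean of each full window of `size` consecutive values."""
    window = deque(maxlen=size)
    for v in values:
        window.append(v)
        if len(window) == size:
            yield sum(window) / size

print(list(sliding_mean([1, 2, 3, 4, 5], 3)))   # [2.0, 3.0, 4.0]
```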

Snappy

A fast and efficient data compression and decompression library developed by Google, designed to balance processing speed and compression ratio. It is often used to compress data stored in Hadoop environments and for other similar applications.

Snapshot

A point-in-time copy of data that can be used as a backup for recovery purposes.

Snapshot Isolation

A guarantee provided by some database systems that all reads made in a transaction will see a consistent snapshot of the database, and the transaction itself will successfully commit only if no updates it has made conflict with any concurrent updates made since that snapshot.

Snowflake

A cloud-based data warehouse service designed for high-performance analytics.

Snowflake Schema

A normalized form of Star Schema in a Data Warehouse, reducing redundancy and improving data integrity, with a central fact table connected to multiple normalized dimension tables.

Social Graph

A graph that depicts personal relations of internet users, representing the interconnection of relationships in an online social network.

Soft Delete

A data removal strategy where records are marked as deleted but are not physically removed from the database, enabling potential recovery.

Software as a Service (SaaS)

A cloud computing service model that provides access to software and its functions remotely as a web-based service, allowing users to access software applications over the internet.

Sorting Algorithm

An algorithm that puts elements of a list in a certain order, often numerical or lexicographical.

Sparse Matrix

A matrix mostly containing zero values, represented and stored efficiently in memory by only storing the non-zero elements.

Spatial Database

A database optimized to store and query data representing objects defined in a geometric space, often used for storing and analyzing geographical or spatial information.

Spatial Index

A data structure that allows for accessing a spatial object efficiently, essential in spatial databases and geodatabases.

Spatial Indexing

A data structure that allows for accessing a spatial object in a database in a more efficient manner, crucial in GIS systems, spatial databases, and spatial data processing.

Speculative Execution

An optimization technique where a computer system performs some tasks before it knows whether these tasks will be needed, to reduce latency and improve throughput.

Spill

Temporarily transfer data that exceeds available memory to disk.

Split

Divide a dataset into training, validation, and testing sets for machine learning model training.
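
A brief sketch using scikit-learn's train_test_split to hold out a test set, assuming scikit-learn is installed (the toy X and y are illustrative; a validation set can be carved out by splitting again):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [value % 2 for value in X]

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```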

SQL (Structured Query Language)

A standardized programming language used for managing and querying relational databases.

SQL Injection

A code injection technique, used to attack data-driven applications, in which malicious SQL statements are inserted into an entry field for execution.

SQLite

A C library that provides a lightweight, disk-based database.

Stack

A data structure that stores a collection of elements, with two main principal operations: Push, which adds an element to the collection, and Pop, which removes the most recently added element.

Standardize

Transform data to a common unit or format to facilitate comparison and analysis.

Star Schema

The simplest style of data warehouse schema that organizes data in a single fact table linked to one or more dimension tables, enabling easy and efficient data retrieval.

Stateful Application

An application that saves client data from the activities of one session for use in the next session.

Stateless Application

An application that does not save client data generated in one session for use in the next session with that client.

Stateless Protocol

A communications protocol that treats each request as an independent transaction, without requiring the server to retain session information or status about each communicating partner for the duration of multiple requests.

Stemming

The process of reducing an inflected or derived word to its word stem, base, or root form, typically by stripping suffixes and prefixes; the dictionary form of a word is known as a lemma.

Strategic Information Systems

Information systems that are developed in response to corporate business initiatives to give competitive advantage to organizations.

Stream Processing

The real-time processing of data continuously, concurrently, and record by record, often used in applications that require real-time response and analytics.

Streaming Data

Data that is generated continuously by thousands of data sources, sending data records simultaneously and in small sizes.

Structured Data

Data that is organized and formatted in a way that is easily searchable, often residing in relational databases and including data types such as numbers, dates, and strings.

Structured Query Language (SQL)

A standard programming language specifically for managing and querying data in relational databases.

Subquery

A SQL query nested inside a larger query, used to retrieve data that will be used in the main query as a condition to further restrict the data to be retrieved.

Support Vector Machine (SVM)

A supervised machine learning algorithm, used for classification or regression analysis, that separates data into classes by finding the hyperplane that maximizes the margin between the classes.

Surrogate Key

A unique identifier for a record in a database table that serves as a substitute for natural primary keys and is typically auto-generated.

Swarm Intelligence

The collective behavior of decentralized, self-organized systems, typically inspired by nature, like ant colonies, bird flocking, and fish schooling, used in artificial intelligence for problem-solving and optimization.

Synchronization

The coordination of events to operate a system in unison, ensuring that multiple threads or processes do not interfere with each other.

Synchronize

The process of establishing consistency among data from a source to a target data storage and vice versa.

Syntactic Sugar

Syntax within a programming language that is designed to make things easier to read or to express.

Syntax Analysis

The analysis of the symbols or statements in a computer program to ensure their correct arrangement, often used in compilers to check the syntax of the programming code.

Synthetic Data

Data that's artificially created, rather than being generated by actual events, often used for testing and training machine learning models when real data is scarce or sensitive.

Systematic Sampling

A statistical method involving the selection of elements from an ordered sampling frame, selecting every kth (where k is a constant) item in the frame.

Systems Development Life Cycle (SDLC)

The process of creating or altering systems, and the models and methodologies that development teams use to develop systems.

T-distributed Stochastic Neighbor Embedding (t-SNE)

A machine learning algorithm for dimensionality reduction, particularly well suited for the visualization of high-dimensional datasets.

T-distribution

A type of probability distribution that is symmetrical and bell-shaped, like the normal distribution, but has heavier tails.

Tableau

A data visualization tool that is used for converting raw, unstructured data into an understandable or readable format.

Tagging

The practice of labeling data with tags that categorize or annotate it, often used in organizing content or in natural language processing to identify parts of speech.

Talend

A software integration vendor that provides data integration, data management, enterprise application integration, and big data software and services.

Temporal Database

A database that is optimized to manage data relating to time instances, maintaining information about the times at which certain data is valid.

Tensor

A mathematical object represented as arrays of higher dimensions, extended from matrices and used in machine learning and deep learning models, particularly in neural networks.

TensorFlow

An open-source software library for dataflow and differentiable programming across a range of tasks, developed by the Google Brain team.

Terabyte (TB)

A unit of information or computer storage equal to one trillion (10^12) bytes; in binary usage it is often treated as 1,024 gigabytes.

Teradata

Offers products related to data warehousing, including a powerful, scalable, and reliable data warehousing solution.

Text Mining

The process of deriving meaningful information from natural language text, involves the preprocessing (cleaning and transforming) of text data and the application of natural language processing (NLP) techniques.

Thread

Enable concurrent execution in Python by decoupling tasks that are not sequentially dependent.
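
A minimal sketch using Python's standard threading module to run a few tasks concurrently (the worker function is a toy example):

```python
import threading

def worker(name):
    print(f"{name} running")

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for all threads to finish
```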

Throughput

The amount of data transferred or processed in a specified time period, often used as a measure of system or network performance.

Time Complexity

A concept in computer science that describes the amount of time an algorithm takes to run as a function of the length of the input.

Time Series Analysis

A statistical technique that deals with time series data, or trend analysis, involving the use of various methods to analyze time series data and extract meaningful statistics and characteristics about the data.

Time Series Database (TSDB)

A database optimized for handling time series data, which are data points indexed in time order, commonly used for analyzing, storing, and querying time series data.

Tokenization

The process of converting input text into smaller units, or tokens, typically words or phrases, used in natural language processing to understand the structure of the text.

Tokenize

Convert data into tokens or smaller units to simplify analysis or processing.

Top-Down Design

A design methodology that begins with specifying the high-level structure of a system and decomposes it into its components, focusing on the system as a whole before examining its parts.

Topology

In networking, it refers to the arrangement of different elements (links, nodes, etc.) in a computer network. In data analysis, it refers to the study of geometric properties and spatial relations.

Training Set

A subset of a dataset used to train machine learning models, helping the models make predictions or decisions without being explicitly programmed to perform the task.

Transactional Database

A type of database that manages transaction-oriented applications, ensuring ACID properties (Atomicity, Consistency, Isolation, Durability) to maintain reliability in every transaction.

Transfer Learning

A research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.

Transform

The process of converting data from one format, structure, or type to another.

Transformation

The process of converting data from one format or structure into another, often involving cleaning, aggregating, enriching, and reformatting the data.

Tree Structure

A hierarchical structure used in computer science to represent relationships between individual data points or nodes, where each node is connected to one parent node and zero or more child nodes.

Triggers

Procedural code automatically executed in response to certain events on a particular table or view in a database, often used to maintain the integrity of the data.

Tuple

An ordered list of elements, often used to represent a single row in a relational database table, or a single record in a dataset.

Turing Machine

A mathematical model of computation that defines an abstract machine, which manipulates symbols on a strip of tape according to a table of rules, foundational in the theory of computation.

Type Casting

The process of converting a variable from one data type to another, such as changing a float to an integer or a string to a number.

Undirected Graph

A graph in which edges have no orientation, meaning the edge from vertex A to vertex B is identical to the edge from vertex B to vertex A.

Union

An operation in SQL that combines the result sets of two or more queries into a single distinct result set, removing duplicate rows.

Unique Constraint

A constraint applied on a field to ensure that it cannot have duplicate values.

Univariate Analysis

The simplest form of analyzing data with one variable, without regard to any other variable, focusing on describing and summarizing the underlying patterns in the data.

Unstructured Data

Information that doesn't reside in a traditional row-column database and is often text-heavy.

Unstructured Data Analysis

Analyze unstructured data, such as text or images, to extract insights and meaning.

Unsupervised Learning

A type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses, often for clustering or association.

Update Anomaly

A data inconsistency that occurs when not all instances of a redundant piece of data are updated, leading to inconsistent and inaccurate data in a database.

Upsert

A database operation that either inserts a row into a database table if a corresponding row does not exist, or updates the row if it does exist.

Upstream

In data processing, refers to the tasks, operations, or stages of processing occurring or located before a particular stage in a specified direction or flow.

URL Encoding

A method of encoding information in a Uniform Resource Identifier (URI) where certain characters are replaced by corresponding hexadecimal values, used in the submission of form data in HTTP requests.

User-Defined Function (UDF)

A function provided by the user of a program or environment, allowing for the creation of functions that are not included in the original software.

Validate

The process of ensuring that a program operates on clean, correct, and useful data, checking the accuracy and quality of the input data before it is processed.

Variable Selection

The process of selecting the most relevant features (variables, predictors) for use in model construction, reducing dimensionality and improving model performance.

Variance Inflation Factor (VIF)

A measure used to quantify how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

Variational Autoencoder (VAE)

A type of autoencoder with added constraints on the encoded representations being learned, often used for generating new data that's similar to the training data.

Vectorization

The process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time, improving performance by exploiting data-level parallelism.
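
A small sketch contrasting a scalar loop with the equivalent vectorized NumPy operation (the array and the doubling operation are illustrative):

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Scalar loop: processes one element per iteration.
total_loop = 0.0
for v in values:
    total_loop += v * 2.0

# Vectorized: one NumPy operation over the whole array, typically far faster.
total_vec = (values * 2.0).sum()
```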

Vectorize

Execute a single operation on multiple data points simultaneously.

Version

The approach of managing changes and history of data in a dataset, useful for reproducing results, rolling back changes, and understanding changes in data over time.

Version Control

The management of changes to documents, computer programs, large websites, and other collections of information, allowing for revisions and variations to be tracked and managed efficiently.

Vertex

In graph theory, a vertex is a point where two or more curves, lines, or edges meet, representing entities in graph-based storage and analysis systems.

Vertical Scaling

Adding more resources, such as CPU and memory, to an existing server, or replacing the server with a more powerful one.

View

A virtual table based on the result-set of an SQL statement, often used to focus, simplify, and customize the perception each user has of the database.

Virtual Private Network (VPN)

A technology that creates a safe and encrypted connection over a less secure network, such as the internet, allowing for secure remote access to network resources.

Virtualization (in analytics)

A data integration process to provide a unified, real-time, and consistent view of data across different data sources without having to move or replicate the data.

Virtualization

The process of creating a virtual version of something, including virtual computer hardware systems, storage devices, and network resources.

Visualization

The graphical representation of information and data, using visual elements like charts, graphs, and maps.

Volatile Memory

Computer memory that requires power to maintain the stored information; all data is lost when the system’s power is turned off or interrupted.

Volume Testing

A type of software testing that checks the system’s performance and behavior under high volumes of data, ensuring the software can handle large data quantities effectively.

Vulnerability Assessment

The process of identifying, quantifying, and prioritizing the vulnerabilities in a system, involving the evaluation of system or software weaknesses and potential threats.

Warehouse Modeling

The process of developing abstract representations of a data warehouse system, typically structured in a way that helps in understanding, analyzing, and designing the data warehouse.

Web Application Firewall (WAF)

A security policy enforcement point positioned between a web application and the client endpoint, monitoring, and controlling communications to protect against attacks.

Web Crawling

The automated process of browsing the web to collect information about websites and their pages, often used by search engines to index web content.

Web Framework

A software framework designed to aid the development of web applications including web services, web resources, and web APIs.

Web Scraping

An automated method used to extract large amounts of data from websites quickly, used in data mining where you extract useful information or knowledge from data.

Web Services

Standardized software systems designed to communicate over the Internet using standardized protocols, allowing different applications to talk to each other.

Weighted Graph

A graph in which a number (the weight) is assigned to each edge, representing quantities such as cost, length, or capacity, depending on the problem at hand.

Whitespace Tokenization

The process of breaking up text into tokens based on whitespace characters such as spaces, tabs, and newline characters, commonly used in natural language processing.
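
A minimal sketch of whitespace tokenization in Python, where str.split() with no argument splits on any run of whitespace (the sample text is illustrative):

```python
text = "Data engineering\tturns raw data\ninto usable assets"
tokens = text.split()   # splits on spaces, tabs, and newlines alike
print(tokens)
```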

Wide Column Store

A type of NoSQL database that uses tables, rows, and columns, but unlike a relational database, names and format of the columns can vary from row to row in the same table.

Wildcard Character

A character used to replace or represent one or more characters in string comparisons, often used in search operations to represent unknown characters in the search pattern.

Window Function

In SQL, a type of function that performs a calculation across a set of table rows related to the current row, providing access to rows at a specified physical offset without using a self-join.

Workflow

The sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion, automated by software in many cases.

Wrangle

The process of transforming raw data into a more usable or appropriate format.

Wrapper

A function, method, or class that contains a piece of existing code and typically adds some additional functionality or converts inputs or outputs.

Write-Ahead Logging (WAL)

A method where changes are written to a log before they are applied, ensuring data integrity and consistency by providing a recovery mechanism in case of system failures.

XML (eXtensible Markup Language)

A markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. Widely used for the representation of arbitrary data structures such as those used in web services.

XML Database

A database that stores data in a structured format, typically XML, allowing for complex and hierarchical data relationships.

XML Parsing

The process of analyzing an XML document to read the codes and to access or modify data, used in various applications to interact with XML data.

XOR (Exclusive Or)

A logical operator that outputs true only when inputs differ (one is true, the other is false).

XPath

A query language for selecting nodes from an XML document, providing a way to navigate through elements and attributes in XML documents.

YARN (Yet Another Resource Negotiator)

A resource-management technology in Hadoop, allocating resources to various applications and managing resource consumption and task scheduling.

Yottabyte

A unit of information or computer storage equal to one septillion bytes.

Z-Index

A property specifying the stack order of elements, commonly used in web development to manage overlaying of elements.

Z-Score

A statistical measurement that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean.
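
A small worked sketch of computing z-scores with NumPy (the sample values are made up):

```python
import numpy as np

x = np.array([4.0, 7.0, 9.0, 10.0])
z = (x - x.mean()) / x.std()   # each value expressed in standard deviations from the mean
print(z)
```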

Zero Trust Security

A security concept centered on the belief that organizations should not automatically trust anything inside or outside their perimeters and must verify everything trying to connect to their systems before granting access.

Zero-Copy

A method of transferring data in computer systems so that it does not need to be copied from one buffer or memory location to another.

Zero-Day Exploit

An attack that targets software vulnerabilities that are unknown to the vendor or for which no patch is yet available.

Zettabyte

A unit of digital information storage used to denote the size of data. It is equivalent to one sextillion (10^21) bytes or 1000 exabytes.

Zone Replication

The process of replicating data across different zones in a multi-zone environment, usually for data redundancy and availability.

Zoning

In storage area networking, zoning is the process of allocating resources in a network to communicate only with each other and isolated from other resources, improving security and performance.

Zookeeper

An open-source technology that provides a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.