Overview
In AI systems, a data pipeline (of which the classic ETL — extract, transform, load — pattern is one example) is the 'plumbing' that ensures high-quality data is consistently available to models. It automates the journey from raw data to model-ready datasets.
Stages
- Ingestion: Collecting data from databases, APIs, or logs.
- Cleaning: Removing duplicates, handling missing values, and fixing errors.
- Transformation: Converting data into the right format (e.g., normalization, encoding).
- Loading: Storing the processed data in a data warehouse or feature store.
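The four stages above can be sketched as plain functions composed into a pipeline. This is a minimal, self-contained illustration (the CSV data, field names, and in-memory "store" are invented for the example, not from the original text):

```python
import io
import csv

# Toy "raw source": a CSV with a duplicate row and a missing value.
RAW = """user_id,age
1,34
2,
1,34
3,51
"""

def ingest(text):
    # Ingestion: read rows from a source (here, an in-memory CSV).
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    # Cleaning: drop exact duplicates and rows with missing values.
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen or any(v == "" for v in row.values()):
            continue
        seen.add(key)
        out.append(row)
    return out

def transform(rows):
    # Transformation: cast types and min-max normalize age to [0, 1].
    ages = [int(r["age"]) for r in rows]
    lo, hi = min(ages), max(ages)
    return [
        {"user_id": int(r["user_id"]), "age_norm": (int(r["age"]) - lo) / (hi - lo)}
        for r in rows
    ]

def load(rows, store):
    # Loading: write processed rows to a destination (here, a dict
    # keyed by user_id standing in for a warehouse or feature store).
    for row in rows:
        store[row["user_id"]] = row
    return store

store = load(transform(clean(ingest(RAW))), {})
```

Real pipelines swap each function for a connector (database reader, validation library, warehouse writer), but the stage boundaries stay the same.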
Tools
- Apache Airflow
- dbt (data build tool)
- AWS Glue
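With an orchestrator such as Airflow, the same stages become scheduled, dependency-ordered tasks. A minimal DAG definition sketch follows (assumes Airflow ≥ 2.4; the task names, callables, and schedule are illustrative, not from the original text):

```python
# Illustrative Airflow DAG configuration; the stage callables are stubs.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():     ...  # pull raw data from a database, API, or logs
def clean():      ...  # deduplicate, handle missing values
def transform():  ...  # normalize, encode
def load():       ...  # write to warehouse or feature store

with DAG(
    dag_id="etl_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [("ingest", ingest), ("clean", clean),
                         ("transform", transform), ("load", load)]
    ]
    # Chain the stages: ingest >> clean >> transform >> load
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream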