Overview

In AI, a data pipeline (often implemented as an ETL pipeline: extract, transform, load) is the plumbing that ensures high-quality data is consistently available to models. It automates the journey from raw sources to model-ready datasets.

Stages

  1. Ingestion: Collecting data from databases, APIs, or logs.
  2. Cleaning: Removing duplicates, handling missing values, and fixing errors.
  3. Transformation: Converting data into the right format (e.g., normalization, encoding).
  4. Loading: Storing the processed data in a data warehouse or feature store.
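The four stages above can be sketched end to end with only the Python standard library. This is a minimal illustration, not a production pipeline: the record fields, the cleaning rules (dedupe by id, drop rows with missing values), and the min-max scaling are illustrative assumptions, and the "warehouse" is an in-memory SQLite table.

```python
import sqlite3

# 1. Ingestion: in practice this pulls from databases, APIs, or logs;
#    here we stand in an in-memory list of raw records.
raw = [
    {"id": 1, "age": "34", "city": "Berlin"},
    {"id": 2, "age": None, "city": "Paris"},    # missing value
    {"id": 1, "age": "34", "city": "Berlin"},   # duplicate
    {"id": 3, "age": "29", "city": "  paris "}, # messy value
]

# 2. Cleaning: drop duplicate ids and rows with missing values.
seen, cleaned = set(), []
for row in raw:
    if row["id"] in seen or row["age"] is None:
        continue
    seen.add(row["id"])
    cleaned.append(row)

# 3. Transformation: cast types, normalize strings, min-max scale age.
for row in cleaned:
    row["age"] = int(row["age"])
    row["city"] = row["city"].strip().lower()
ages = [r["age"] for r in cleaned]
lo, hi = min(ages), max(ages)
for row in cleaned:
    row["age_scaled"] = (row["age"] - lo) / (hi - lo)

# 4. Loading: write processed rows to a table (an in-memory SQLite
#    database stands in for a warehouse or feature store).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (id INTEGER, age INTEGER, age_scaled REAL, city TEXT)"
)
conn.executemany(
    "INSERT INTO features VALUES (:id, :age, :age_scaled, :city)", cleaned
)
print(conn.execute("SELECT id, city, age_scaled FROM features ORDER BY id").fetchall())
```

Running the sketch yields two clean, scaled rows out of the four raw records; a real pipeline would wrap each stage in its own task so an orchestrator can schedule, retry, and monitor them independently.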

Tools

  • Apache Airflow
  • dbt (data build tool)
  • AWS Glue

Related Terms