Overview
In AI systems, a data pipeline (of which the classic ETL — extract, transform, load — pattern is one example) is the 'plumbing' that ensures high-quality data is consistently available to models. It automates the journey from raw data to model-ready datasets.
Stages
- Ingestion: Collecting data from databases, APIs, or logs.
- Cleaning: Removing duplicates, handling missing values, and fixing errors.
- Transformation: Converting data into the right format (e.g., normalization, encoding).
- Loading: Storing the processed data in a data warehouse or feature store.
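The four stages above can be sketched as plain functions composed into a pipeline. This is a minimal, self-contained illustration (the CSV data, field names, and in-memory "store" are invented for the example, not from the original text):

```python
import io
import csv

# Toy "raw source": a CSV with a duplicate row and a missing value.
RAW = """user_id,age
1,34
2,
1,34
3,51
"""

def ingest(text):
    # Ingestion: read rows from a source (here, an in-memory CSV).
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    # Cleaning: drop exact duplicates and rows with missing values.
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen or any(v == "" for v in row.values()):
            continue
        seen.add(key)
        out.append(row)
    return out

def transform(rows):
    # Transformation: cast types and min-max normalize age to [0, 1].
    ages = [int(r["age"]) for r in rows]
    lo, hi = min(ages), max(ages)
    return [
        {"user_id": int(r["user_id"]), "age_norm": (int(r["age"]) - lo) / (hi - lo)}
        for r in rows
    ]

def load(rows, store):
    # Loading: write processed rows to a destination (here, a dict
    # keyed by user_id standing in for a warehouse or feature store).
    for row in rows:
        store[row["user_id"]] = row
    return store

store = load(transform(clean(ingest(RAW))), {})
```

Real pipelines swap each function for a connector (database reader, validation library, warehouse writer), but the stage boundaries stay the same.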
Tools
- Apache Airflow
- dbt (data build tool)
- AWS Glue
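With an orchestrator such as Airflow, the same stages become scheduled, dependency-ordered tasks. A minimal DAG definition sketch follows (assumes Airflow ≥ 2.4; the task names, callables, and schedule are illustrative, not from the original text):

```python
# Illustrative Airflow DAG configuration; the stage callables are stubs.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():     ...  # pull raw data from a database, API, or logs
def clean():      ...  # deduplicate, handle missing values
def transform():  ...  # normalize, encode
def load():       ...  # write to warehouse or feature store

with DAG(
    dag_id="etl_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [("ingest", ingest), ("clean", clean),
                         ("transform", transform), ("load", load)]
    ]
    # Chain the stages: ingest >> clean >> transform >> load
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream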