Overview
Data cleansing (or data cleaning) is the process of detecting and correcting (or removing) corrupt or inaccurate records in a dataset. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them.
Common Techniques
- Parsing: Breaking data into components (e.g., splitting a full name into first and last).
- Standardization: Ensuring data follows a consistent format (e.g., writing every date as YYYY-MM-DD).
- De-duplication: Identifying and removing duplicate records.
- Verification: Checking data against known-good reference sources (e.g., validating country codes against an accepted list).
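
The four techniques above can be combined in a single pass over the records. The following is a minimal sketch in Python using only the standard library, not a definitive implementation. The field names (name, signup, country), the accepted date formats, and the KNOWN_COUNTRIES reference set are all assumptions made up for illustration; a real pipeline would choose its own fields, formats, and reference sources.

```python
from datetime import datetime

# Hypothetical reference set used for the verification step.
KNOWN_COUNTRIES = {"US", "DE", "FR"}


def parse_name(full_name):
    """Parsing: split a full name into (first, last) at the first space."""
    first, _, last = full_name.strip().partition(" ")
    return first.title(), last.strip().title()


def standardize_date(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")):
    """Standardization: return the date as YYYY-MM-DD, or None if unparseable."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Apply parsing, standardization, verification, and de-duplication."""
    seen = set()
    cleaned = []
    for rec in records:
        first, last = parse_name(rec["name"])
        date = standardize_date(rec["signup"])
        country = rec["country"].strip().upper()

        # Verification: drop records with an unparseable date or a
        # country code that is not in the reference set.
        if date is None or country not in KNOWN_COUNTRIES:
            continue

        # De-duplication: treat records with the same normalized
        # name and date as the same entity and keep only the first.
        key = (first.lower(), last.lower(), date)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append(
            {"first": first, "last": last, "signup": date, "country": country}
        )
    return cleaned


raw = [
    {"name": "Ada Lovelace", "signup": "2023-01-05", "country": "us"},
    {"name": " ada lovelace ", "signup": "05/01/2023", "country": "US"},
    {"name": "Alan Turing", "signup": "03.03.2023", "country": "XX"},
    {"name": "Grace Hopper", "signup": "not a date", "country": "DE"},
]
result = clean(raw)
print(result)
```

In this sketch, records that fail verification are dropped; depending on the use case, it may be preferable to flag them for review or repair instead. Note also that de-duplication happens after standardization, so that records differing only in formatting are recognized as duplicates.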
Importance
Clean data is essential for accurate reporting, effective machine learning models, and reliable business decision-making.