Overview

Data cleansing (or data cleaning) is the process of detecting and correcting (or removing) corrupt or inaccurate records in a dataset. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them.

Common Techniques

  • Parsing: Breaking data into components (e.g., splitting a full name into first and last names).
  • Standardization: Ensuring data follows a consistent format (e.g., writing every date as YYYY-MM-DD).
  • De-duplication: Identifying and removing duplicate records.
  • Verification: Checking data against known-good reference sources (e.g., comparing country codes with an authoritative list).
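
The four techniques above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the record fields (name, joined, country), the KNOWN_COUNTRIES reference set, the list of accepted date formats, and the helper names are all hypothetical and chosen only for the example.

```python
from datetime import datetime

# Hypothetical "known good source" used for verification.
KNOWN_COUNTRIES = {"US", "DE", "FR"}

# Date formats this sketch accepts; an assumption, not an exhaustive list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d")


def parse_name(full_name):
    """Parsing: split a full name into (first, last) at the last space."""
    parts = full_name.strip().split()
    if not parts:
        return ("", "")
    if len(parts) == 1:
        return (parts[0], "")
    return (" ".join(parts[:-1]), parts[-1])


def standardize_date(text):
    """Standardization: return the date as YYYY-MM-DD, or None if unreadable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Apply all four techniques to a list of dict records."""
    seen = set()
    cleaned = []
    for rec in records:
        first, last = parse_name(rec["name"])
        row = {
            "first": first,
            "last": last,
            "joined": standardize_date(rec["joined"]),
            "country": rec["country"].strip().upper(),
        }
        # Verification: flag values that are absent from the reference set.
        row["country_ok"] = row["country"] in KNOWN_COUNTRIES
        # De-duplication: compare on normalized fields, keep the first match.
        key = (first.lower(), last.lower(), row["joined"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


sample = [
    {"name": "  Ada Lovelace ", "joined": "2021-03-05", "country": "us"},
    {"name": "ada lovelace", "joined": "05/03/2021", "country": "US"},
    {"name": "Alan Turing", "joined": "not a date", "country": "xx"},
]
for row in clean(sample):
    print(row)
```

In this sample the first two records differ only in letter case and date format, so after standardization they share the same key and the second is dropped as a duplicate. The third record is kept, but its unreadable date becomes None and its country is flagged as unverified rather than silently deleted.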

Importance

Clean data is essential for accurate reporting, effective machine learning models, and reliable business decisions, because errors in the input propagate into every result derived from it.

Related Terms