Data Distribution Changes


Data Drift and Skew

  • Drift is a change in data over time: the statistical properties of the features differ between time periods or between successive data collections.
  • Skew is the difference between two static versions of a dataset that come from different sources, typically the training data and the serving data. It can be caused by a change in the data schema or in the data distribution.
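
As a minimal sketch of what "a change in statistical properties" can look like, the snippet below compares one numeric feature across two time periods by measuring how far its mean has moved, in units of the reference standard deviation. The feature values, the function names, and the threshold of 2.0 are all illustrative assumptions, not a standard test.

```python
import statistics

def summarize(values):
    """Return basic statistical properties of one numeric feature."""
    return {"mean": statistics.fmean(values), "stdev": statistics.pstdev(values)}

def mean_shift_in_stdevs(reference, current):
    """Shift of the current mean, measured in reference standard deviations."""
    ref = summarize(reference)
    cur = summarize(current)
    if ref["stdev"] == 0:
        # A constant reference feature: any change in the mean is a shift.
        return 0.0 if cur["mean"] == ref["mean"] else float("inf")
    return abs(cur["mean"] - ref["mean"]) / ref["stdev"]

# Hypothetical values of one feature collected in two time periods.
january = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
june = [13.0, 14.0, 12.5, 13.5, 12.0, 13.0]

shift = mean_shift_in_stdevs(january, june)
drifted = shift > 2.0  # illustrative threshold, chosen only for this example
print(f"mean shift: {shift:.2f} stdevs, drifted: {drifted}")
```

The same comparison applies to skew if the two inputs are the training data and the serving data rather than two time periods.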

Model Decay and Data Issues

  • Model decay is the decline in model performance over time, often caused by data drift, concept drift, or both.
  • Data drift is a change in the statistical properties of the features between training and serving.
  • Concept drift is a change in the relationship between the input and output variables, so the mapping the model learned no longer matches reality.
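
To make the distinction concrete, here is a small sketch of concept drift without data drift: the inputs are identical in both periods, but the true relationship between input and label changes, so a fixed model loses accuracy. The toy model, the data, and the decision boundaries (5 and 7) are hypothetical values chosen for illustration.

```python
def model(x):
    """A fixed 'trained' model: predicts 1 when the feature exceeds 5."""
    return 1 if x > 5 else 0

def accuracy(inputs, labels):
    """Fraction of inputs for which the model's prediction matches the label."""
    correct = sum(1 for x, y in zip(inputs, labels) if model(x) == y)
    return correct / len(inputs)

# The same inputs are seen in both periods, so there is no data drift.
inputs = list(range(10))  # 0, 1, ..., 9

# Hypothetical ground truth: the true decision boundary moves from 5 to 7.
labels_before = [1 if x > 5 else 0 for x in inputs]
labels_after = [1 if x > 7 else 0 for x in inputs]

acc_before = accuracy(inputs, labels_before)
acc_after = accuracy(inputs, labels_after)
print(f"accuracy before: {acc_before}, after: {acc_after}")
```

Because the feature values never change, statistics computed on the inputs alone would not reveal this kind of decay; it only shows up once labels are compared with predictions.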

Types of Drift

[Figure: types of drift]

Detecting Data Issues

Drift Detection: General Framework

[Figure: general framework for drift detection]

Data Skew Detection Workflow

The following workflow can be used to detect data skew:

  1. Compute baseline statistics and reference schema from the training data.
  2. Generate descriptive statistics and schema from the serving data.
  3. Compare the serving statistics and schema with the training baseline.
  4. Look for significant differences indicating skew or drift.
  5. Trigger an alert for each anomaly so that a monitoring system or a human can analyze it.
  6. Take remediation steps to address the detected data issues.
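
The first five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not a production implementation: the row format (a list of dicts), the helper names, the inferred "schema" (column name to type name), and the `max_shift` threshold are all hypothetical. Remediation (step 6) is left to the caller.

```python
import statistics

def infer_schema(rows):
    """Map each column name to the type name of its value in the first row."""
    return {name: type(value).__name__ for name, value in rows[0].items()}

def numeric_stats(rows):
    """Mean and standard deviation for every numeric column."""
    stats = {}
    for name, value in rows[0].items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            column = [row[name] for row in rows]
            stats[name] = (statistics.fmean(column), statistics.pstdev(column))
    return stats

def detect_anomalies(train_rows, serving_rows, max_shift=2.0):
    """Steps 1-4: compare serving schema and statistics with the training baseline."""
    anomalies = []
    ref_schema, ref_stats = infer_schema(train_rows), numeric_stats(train_rows)
    cur_schema, cur_stats = infer_schema(serving_rows), numeric_stats(serving_rows)
    # Schema check: every training column must exist with the same type.
    for name, dtype in ref_schema.items():
        if name not in cur_schema:
            anomalies.append(f"schema: column '{name}' missing in serving data")
        elif cur_schema[name] != dtype:
            anomalies.append(f"schema: column '{name}' is {cur_schema[name]}, expected {dtype}")
    # Statistics check: flag large shifts of the mean (threshold is illustrative).
    for name, (ref_mean, ref_std) in ref_stats.items():
        if name in cur_stats and ref_std > 0:
            shift = abs(cur_stats[name][0] - ref_mean) / ref_std
            if shift > max_shift:
                anomalies.append(f"stats: mean of '{name}' shifted by {shift:.1f} stdevs")
    return anomalies

# Hypothetical training and serving samples.
train = [{"age": 30, "country": "DE"}, {"age": 40, "country": "FR"}, {"age": 50, "country": "DE"}]
serving = [{"age": 70, "country": "DE"}, {"age": 80, "country": "FR"}, {"age": 90, "country": "FR"}]

# Step 5: raise an alert for each anomaly (here, simply printed).
for message in detect_anomalies(train, serving):
    print("ALERT:", message)
```

Keeping the schema check separate from the statistics check mirrors the two causes of skew noted earlier: a change in the data schema and a change in the data distribution.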

References

  1. Learning under Concept Drift: A Review https://arxiv.org/pdf/2004.05785.pdf