Data Drift and Skew
- Drift is a change in data over time: the statistical properties of the features differ between time periods or between successive data collections.
- Skew is the difference between two static versions of a dataset that come from different sources, such as the training set and the serving set. It shows up as a mismatch in the data schema or in the feature distributions.
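The two kinds of mismatch mentioned above can be checked separately. The following is a minimal sketch, assuming records are plain dictionaries; the helper names `schema_skew` and `linf_distance` and the sample rows are hypothetical, and the L-infinity distance is only one of several possible distribution measures.

```python
from collections import Counter


def schema_skew(train_rows, serving_rows):
    """Return features whose set of observed Python types differs between two datasets."""
    def infer(rows):
        schema = {}
        for row in rows:
            for name, value in row.items():
                schema.setdefault(name, set()).add(type(value).__name__)
        return schema

    train, serving = infer(train_rows), infer(serving_rows)
    return {
        name: (train.get(name), serving.get(name))
        for name in train.keys() | serving.keys()
        if train.get(name) != serving.get(name)
    }


def linf_distance(train_values, serving_values):
    """Largest absolute difference in relative frequency of any category.

    Assumes both inputs are non-empty.
    """
    p, q = Counter(train_values), Counter(serving_values)
    n_p, n_q = sum(p.values()), sum(q.values())
    return max(abs(p[k] / n_p - q[k] / n_q) for k in p.keys() | q.keys())


# Illustrative data only: "age" arrives as a string at serving time.
train_rows = [{"country": "US", "age": 31}, {"country": "DE", "age": 45}]
serving_rows = [{"country": "US", "age": "31"}, {"country": "US"}]

schema_diff = schema_skew(train_rows, serving_rows)
country_distance = linf_distance(
    [r["country"] for r in train_rows],
    [r["country"] for r in serving_rows],
)
```

Here `schema_diff` flags `age` because its type changed from `int` to `str`, and `country_distance` is 0.5 because the share of `US` rose from one half to all records. What counts as a significant distance is a per-feature judgment call.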
Model Decay and Data Issues
- Model decay is the decline in model performance over time, often caused by data drift and concept drift.
- Data drift is a change in the statistical properties of the features between training and serving.
- Concept drift is a change in the relationship between the input and output variables, so the mapping the model learned no longer holds.
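A toy example can make the distinction concrete. The sketch below is an illustration under invented assumptions: the one-feature model, the two labeling rules, and the input values are all hypothetical. It keeps the model frozen and changes either the inputs (data drift) or the labeling rule (concept drift).

```python
def model(x):
    """Frozen model: the decision rule learned at training time."""
    return 1 if x >= 5 else 0


def concept_v1(x):
    """True input-to-label relationship while the model was trained."""
    return 1 if 5 <= x < 10 else 0


def concept_v2(x):
    """Changed relationship: the positive range now starts at 7."""
    return 1 if 7 <= x < 10 else 0


def accuracy(xs, concept):
    """Fraction of inputs where the frozen model agrees with the true label."""
    return sum(model(x) == concept(x) for x in xs) / len(xs)


train_x = [1, 2, 3, 4, 6, 7, 8, 9]          # inputs seen during training
shifted_x = [6, 7, 8, 9, 10, 11, 12, 13]    # serving inputs after data drift

baseline_acc = accuracy(train_x, concept_v1)       # same inputs, same concept
data_drift_acc = accuracy(shifted_x, concept_v1)   # inputs moved, concept unchanged
concept_drift_acc = accuracy(train_x, concept_v2)  # inputs unchanged, concept moved
```

With these values the accuracy is 1.0 at baseline, 0.5 under data drift (the inputs move into a region the model never saw), and 0.875 under concept drift (the same inputs now carry different labels). Both are model decay, but only the first is visible from feature statistics alone; the second needs labels or another performance signal.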
Types of Drift
Detecting Data Issues
Drift Detection: General Framework
Data Skew Detection Workflow
A typical workflow for detecting data skew:
- Compute baseline statistics and reference schema from the training data.
- Generate descriptive statistics and schema from the serving data.
- Compare the serving statistics and schema with the training data.
- Look for significant differences, which indicate skew or drift.
- Trigger an alert on anomalies so that a monitoring system or a person can analyze them.
- Take remediation steps to address the detected data issues.
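The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production monitor: the function names `summarize` and `find_anomalies`, the thresholds, and the sample records are assumptions made for the example, and remediation is left out because it depends on the system.

```python
import statistics


def summarize(rows):
    """Per-feature schema (type names) and descriptive statistics for dict records."""
    features = {name for row in rows for name in row}
    summary = {}
    for name in sorted(features):
        values = [row[name] for row in rows if row.get(name) is not None]
        numeric = [v for v in values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        summary[name] = {
            "types": sorted({type(v).__name__ for v in values}),
            "missing": 1 - len(values) / len(rows),
            "mean": statistics.fmean(numeric) if numeric else None,
            "stdev": statistics.pstdev(numeric) if numeric else None,
        }
    return summary


def find_anomalies(baseline, serving, max_missing_increase=0.1, max_mean_shift=3.0):
    """Compare serving statistics with the baseline; thresholds are illustrative."""
    anomalies = []
    for name, base in baseline.items():
        cur = serving.get(name)
        if cur is None:
            anomalies.append(f"{name}: feature missing from serving data")
            continue
        if cur["types"] != base["types"]:
            anomalies.append(f"{name}: type changed from {base['types']} to {cur['types']}")
        if cur["missing"] - base["missing"] > max_missing_increase:
            anomalies.append(
                f"{name}: missing-value rate rose from "
                f"{base['missing']:.2f} to {cur['missing']:.2f}"
            )
        if base["mean"] is not None and cur["mean"] is not None and base["stdev"]:
            shift = abs(cur["mean"] - base["mean"]) / base["stdev"]
            if shift > max_mean_shift:
                anomalies.append(
                    f"{name}: mean shifted by {shift:.1f} baseline standard deviations"
                )
    for name in sorted(serving.keys() - baseline.keys()):
        anomalies.append(f"{name}: unexpected feature in serving data")
    return anomalies


# Illustrative records only.
training = [{"age": 30, "country": "US"}, {"age": 40, "country": "DE"},
            {"age": 50, "country": "US"}]
serving = [{"age": 300, "country": "US"}, {"age": 400, "country": None},
           {"age": 500, "country": "US", "device": "ios"}]

baseline_stats = summarize(training)   # baseline statistics and reference schema
serving_stats = summarize(serving)     # statistics and schema of the serving data
alerts = find_anomalies(baseline_stats, serving_stats)  # compare and flag
for alert in alerts:
    print("ALERT:", alert)             # stand-in for a real alerting channel
```

On the sample records this flags three anomalies: a large mean shift in `age`, a higher missing-value rate in `country`, and an unexpected `device` feature. Comparing the baseline with itself yields no alerts.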
References
- Learning under Concept Drift: A Review https://arxiv.org/pdf/2004.05785.pdf