Types of Missing Data
- Structural deficiencies in the data. For example, when null means “no” or 0.
- Random occurrences:
  - Missing Completely at Random (MCAR): the likelihood of a missing result is equal for all data points; the missing values are independent of the data.
  - Missing at Random (MAR): the likelihood of a missing result is not equal for all data points; the probability of a missing result depends on the observed data but not on the unobserved data.
- Specific causes, or Not Missing at Random (NMAR). For example, a patient may drop out of a study due to an adverse side effect of a treatment or due to death; for this patient, no measurements will be recorded after the time of drop-out.
Tip
When facing missing values:
- Understand the nature of the missing values.
- Visualize (see the sketch after this list):
  - Smaller data sets: a heatmap or co-occurrence plot.
  - Larger data sets: plot the first two scores from a PCA of the missing-data indicator matrix.
- Depending on the severity: deletion / encoding / imputation.
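A minimal sketch of both visualizations, assuming the data lives in a pandas DataFrame `df` (the synthetic data and column names here are illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
df = df.mask(rng.random(df.shape) < 0.1)  # inject ~10% missing values

# Missing-data indicator matrix: 1 where a value is missing, 0 otherwise.
indicator = df.isna().astype(int)

# Smaller data sets: heatmap of the indicator matrix.
plt.figure()
plt.imshow(indicator.values, aspect="auto", cmap="gray_r")
plt.xticks(range(df.shape[1]), df.columns)
plt.xlabel("feature")
plt.ylabel("sample")
plt.title("Missingness heatmap")

# Larger data sets: first two PCA scores of the indicator matrix.
scores = PCA(n_components=2).fit_transform(indicator.values)
plt.figure()
plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA of the missing-data indicator matrix")
plt.show()
```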
Deletion
Missing values can be handled by removing features/columns with a high degree of missingness, but when a feature is known to be useful/valuable for prediction, this is not desirable.
Missing values can also be handled by removing any sample/row that contains at least one feature with a missing value. In general, eliminating samples is not desirable because it can introduce bias, so be careful with this approach. A sketch of both strategies follows.
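A minimal pandas sketch of both deletion strategies; the 0.5 column-missingness threshold is an illustrative choice, not a fixed rule:

```python
import pandas as pd

def drop_missing(df: pd.DataFrame, max_col_missing: float = 0.5) -> pd.DataFrame:
    # Drop columns whose fraction of missing values exceeds the threshold...
    keep = df.columns[df.isna().mean() <= max_col_missing]
    reduced = df[keep]
    # ...then drop any remaining rows that still contain a missing value.
    return reduced.dropna(axis=0)
```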
Encoding
When a feature is categorical, missingness can be encoded as one of the categorical labels of that feature, such as “unknown” or “class is missing”. A guiding principle for deciding whether encoding missingness is a good idea is to think about how the results would be interpreted if that piece of information became important to the model.
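A minimal sketch of encoding missingness in a categorical column; the column name `payment_method` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"payment_method": ["card", None, "cash", None, "card"]})
# Treat missingness as its own category instead of dropping or imputing it.
df["payment_method"] = df["payment_method"].fillna("unknown")
print(df["payment_method"].value_counts())
```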
Imputation
1. Mean/median/mode imputation (see the sketch after this list):
   - Pros: Simple to implement and computationally efficient.
   - Cons: Can lead to biased estimates and underestimates the variability of the data.
   - When to use: When the missing values are assumed to be MCAR and do not exceed 5-10% of the total data.
2. Regression imputation (see the sketch after this list):
   - Pros: Takes into account the relationship between the variable with missing values and other variables in the dataset.
   - Cons: Requires a large sample size, can be computationally intensive, and may not be suitable for highly correlated variables.
   - When to use: When the missing values are not MCAR and there is a strong correlation between the variable with missing values and other variables in the dataset.
3. KNN imputation (see the sketch after this list):
   - Pros: Accounts for the similarity between observations.
   - Cons: Can be computationally intensive, may not be suitable for large datasets, and may not perform well when the number of missing values is high.
   - When to use: When the missing values are not MCAR and the data has a complex structure with multiple variables.
4. Hot-deck imputation (see the sketch after this list):
   - Pros: Accounts for the similarity between observations.
   - Cons: May not be suitable for large datasets and may not perform well when the number of missing values is high.
   - When to use: When the missing values are not MCAR and the data has a complex structure with multiple variables.
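A minimal sketch of mean/median/mode imputation with scikit-learn's SimpleImputer; the columns `age` and `city` are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan],
    "city": ["NY", "LA", np.nan, "NY", "LA"],
})

# Numeric column: replace missing values with the median.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])
# Categorical column: replace missing values with the mode (most frequent label).
df[["city"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])
print(df)
```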
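A minimal sketch of regression-based imputation using scikit-learn's IterativeImputer, which models each feature with missing values as a function of the other features; the data here is synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)  # correlated feature
X[rng.random(X.shape) < 0.1] = np.nan                              # inject missingness

X_imputed = IterativeImputer(random_state=0).fit_transform(X)
```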
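A minimal sketch of KNN imputation with scikit-learn's KNNImputer, which fills each missing value using the values of that feature in the k nearest neighbouring rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed)
```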
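A minimal sketch of random hot-deck imputation, where each missing value is replaced by an observed value drawn from a “donor” row in the same group; the grouping column `region` and value column `income` are illustrative, and real hot-deck schemes may use more elaborate donor matching:

```python
import numpy as np
import pandas as pd

def hot_deck(df: pd.DataFrame, group: str, col: str, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    out = df.copy()

    def fill(s: pd.Series) -> pd.Series:
        donors = s.dropna().to_numpy()
        if donors.size == 0:          # no donor in this group: leave as-is
            return s
        filled = s.copy()
        filled[s.isna()] = rng.choice(donors, size=s.isna().sum())
        return filled

    out[col] = out.groupby(group)[col].transform(fill)
    return out

df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south"],
    "income": [50.0, np.nan, 55.0, 40.0, np.nan],
})
print(hot_deck(df, group="region", col="income"))
```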