Greedy Feature Selection


Simple Filters

The most basic approach to feature selection is to screen the predictors to see if any have a relationship with the outcome prior to including them in a model.

Some popular techniques:

Categorical Feature

  • when the target is categorical, the relationship between feature and outcome forms a contingency table.
    • 3 or more levels for the feature: a chi-squared test or exact methods can be used
    • 2 levels for the feature: the odds ratio can be used
  • when the target is numeric
    • 2 levels for the feature: a basic t-test can be calculated
    • 3 or more levels for the feature: the traditional ANOVA F-statistic can be calculated
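
The statistics listed above for a categorical feature can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names and the toy data are assumptions made for the example, and only the raw statistics are computed (no p-values).

```python
from collections import Counter
from statistics import mean


def contingency_table(feature, target):
    """Count co-occurrences of feature levels (rows) and target classes (columns)."""
    rows = sorted(set(feature))
    cols = sorted(set(target))
    counts = Counter(zip(feature, target))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi_squared_statistic(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def odds_ratio(table):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]: (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def anova_f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data (made up for illustration): a 2-level feature and a 2-class target
feature = ["a", "a", "a", "b", "b", "b", "b", "a"]
target = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]
table = contingency_table(feature, target)  # rows: a, b; columns: no, yes
chi2 = chi_squared_statistic(table)
ratio = odds_ratio(table)

# Toy numeric target grouped by a 3-level feature
f_stat = anova_f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

With only two groups, the F-statistic equals the square of the pooled-variance t-statistic, so the 2-level and 3-or-more-level cases for a numeric target are closely related.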

Numeric Feature

  • when the target is categorical: the same tests apply as for a categorical feature with a numeric target, with the roles of feature and target reversed. Correlation-adjusted t-scores are a good alternative to simple ANOVA statistics.
  • when the target is numeric: simple correlation can be used. If the relationship is nonlinear, the MIC values (Reshef et al. 2011) or A statistics can be used instead.
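
As a rough sketch of the two numeric-feature cases, the snippet below computes a pooled-variance t-statistic (the feature values are grouped by target class, so the roles are reversed) and a Pearson correlation. The function names and data are illustrative assumptions. The quadratic example shows why simple correlation can miss a nonlinear relationship, which is the situation where measures such as MIC are suggested.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample t-statistic using the pooled (equal-variance) estimate."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Categorical target: split the numeric feature by class, then compare the groups
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = ["no", "no", "no", "yes", "yes", "yes"]
values_no = [f for f, t in zip(feature, target) if t == "no"]
values_yes = [f for f, t in zip(feature, target) if t == "yes"]
t_stat = pooled_t_statistic(values_no, values_yes)

# Numeric target: correlation detects the linear relationship but not the quadratic one
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
r_linear = pearson_correlation(x, [2 * v + 1 for v in x])
r_quadratic = pearson_correlation(x, [v ** 2 for v in x])
```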

Comparing different filters

These statistics are on different scales and are not directly comparable with each other, which makes it hard to rank features across filters, for example to select the “top N” features. In many cases, each statistic can be converted to a p-value so that there is a commonality across the screening statistics.
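
A small sketch of this idea, under stated assumptions: the feature names and most p-values below are hypothetical placeholders, and the only conversion shown is for a chi-squared statistic with 1 degree of freedom (a 2x2 table), where the upper-tail probability has a closed form through the complementary error function.

```python
from math import erfc, sqrt


def chi2_p_value_1df(statistic):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2.0))


def top_n_features(p_values, n):
    """Return the n feature names with the smallest p-values."""
    return sorted(p_values, key=p_values.get)[:n]


# Hypothetical screening results, all expressed on the common p-value scale
p_values = {
    "age": 0.003,                      # e.g. from a correlation test
    "region": chi2_p_value_1df(2.0),   # converted from a chi-squared statistic
    "plan": 0.04,                      # e.g. from an ANOVA F-test
    "color": 0.62,                     # e.g. from a t-test
}
selected = top_n_features(p_values, 2)
```

Once every filter reports a p-value, features screened with different statistics can be ranked in a single list.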

Note

$H_0$: there is no association between the feature and the target.

When a statistic is hard to convert to a p-value, a permutation method can be used:

  1. For a selected feature and its corresponding target, the feature values are randomly permuted.
  2. The statistic of interest is calculated on the permuted data.
  3. Steps 1 and 2 are repeated many times to generate a distribution of the statistic, which represents the distribution under no association ($H_0$).
  4. The statistic from the original data is then compared with this $H_0$ distribution to get a p-value.
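
The permutation steps above can be sketched as follows. This is a minimal version under a few assumptions: the statistic is the absolute Pearson correlation (any statistic where larger means stronger association would work), the function names and data are made up for illustration, and an add-one correction is used so the estimated p-value is never exactly zero.

```python
import random
from math import sqrt


def abs_correlation(x, y):
    """Absolute Pearson correlation, used here as the screening statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy / sqrt(sxx * syy))


def permutation_p_value(feature, target, statistic, n_permutations=1000, seed=0):
    """Estimate a p-value by comparing the observed statistic with permuted ones."""
    rng = random.Random(seed)
    observed = statistic(feature, target)
    permuted = list(feature)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(permuted)                       # step 1: permute the feature
        if statistic(permuted, target) >= observed:  # step 2: recompute the statistic
            at_least_as_extreme += 1
    # step 4: proportion of the null distribution at least as extreme as observed
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy data: the target depends strongly on the feature
feature = [float(v) for v in range(20)]
target = [2 * v + (-1) ** int(v) for v in feature]
p_signal = permutation_p_value(feature, target, abs_correlation)
```

The smallest p-value this procedure can report is 1 / (n_permutations + 1), so the number of permutations sets the resolution of the result.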

RFE

Not all models can be paired with recursive feature elimination (RFE). Because RFE requires the initial model to be fit on the full predictor set, some models cannot be used when the number of predictors exceeds the number of samples.
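
To make the constraint concrete, here is a minimal RFE sketch. It is an illustration under stated assumptions, not a definitive implementation: the model is unregularized least squares without an intercept, feature importance is taken as the absolute coefficient (which assumes centered features on comparable scales), and all function names are hypothetical. When there are more predictors than samples, the normal equations become singular and the very first fit fails.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-10:
            raise ValueError("singular system: the model cannot be fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_least_squares(columns, y):
    """Least squares coefficients (no intercept) via the normal equations."""
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve_linear_system(xtx, xty)


def recursive_feature_elimination(features, y, n_keep):
    """Refit on the remaining features and drop the weakest one until n_keep remain."""
    remaining = dict(features)
    while len(remaining) > n_keep:
        names = list(remaining)
        coefs = fit_least_squares([remaining[name] for name in names], y)
        weakest = min(zip(names, coefs), key=lambda pair: abs(pair[1]))[0]
        del remaining[weakest]
    return list(remaining)


# Toy data: three centered, mutually orthogonal features with 8 samples
features = {
    "strong": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "moderate": [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],
    "weak": [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
}
y = [3.0 * s + 0.5 * m + 0.1 * w
     for s, m, w in zip(features["strong"], features["moderate"], features["weak"])]
kept_two = recursive_feature_elimination(features, y, n_keep=2)
kept_one = recursive_feature_elimination(features, y, n_keep=1)
```

Calling the same function with, say, three features and only two samples raises the `ValueError` on the first iteration, which mirrors why such models cannot start the RFE procedure when predictors outnumber samples.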


References

  1. https://bookdown.org/max/FES/greedy-search.html