Feature Selection

December 11, 2022 2-minute read

Goals

Some models (such as SVMs and neural networks) are sensitive to irrelevant features (predictors). Superfluous features can degrade predictive performance in some situations.
Some models (such as logistic regression) are vulnerable to correlated features.
Removing features can reduce cost, and it makes scientific sense to include the smallest set of features that provides acceptable results.

Classes of Feature Selection Method

In general, feature selection methods can be divided into three classes: intrinsic, filter, and wrapper methods.

Info

title: Intrinsic methods have feature selection naturally incorporated into the modeling process.

Examples:

Tree- and rule-based models. These search for the best predictor and split point such that the outcomes are more homogeneous within each new partition.
Multivariate adaptive regression splines (MARS).
Regularization models. These penalize or shrink the predictor coefficients used by the model (in some cases, such as with L1 regularization, to exactly zero).

Cons: model dependent
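To make the L1 case concrete, here is a minimal sketch of lasso regression fitted by coordinate descent, using only NumPy. The function names, the synthetic data, and the penalty value `alpha=0.5` are all assumptions chosen for illustration, and the sketch assumes roughly centered data with no intercept. It is not the implementation of any particular library.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrinks value toward zero and returns exactly 0 when |value| <= threshold.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    # Minimizes (1 / 2n) * ||y - X b||^2 + alpha * ||b||_1 one coefficient at a time.
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ partial_residual / n_samples
            scale = X[:, j] @ X[:, j] / n_samples
            coef[j] = soft_threshold(rho, alpha) / scale
    return coef

# Hypothetical data: only columns 0 and 1 influence the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

coef = lasso_coordinate_descent(X, y, alpha=0.5)
selected = np.flatnonzero(coef)  # indices of predictors the model kept
print("coefficients:", np.round(coef, 3))
print("selected predictors:", selected)
```

The selection happens as a side effect of fitting: predictors whose coefficients are driven to zero simply drop out of the model, which is also why the result depends on the chosen model and penalty.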

Info

title: Filter methods conduct an initial supervised analysis of the features (predictors) to determine which are important, and provide only these to the model.

Filter methods are a greedy feature selection approach.

Pros: simple and fast; effective at capturing large trends (i.e., individual predictor-outcome relationships)

Cons:

Prone to over-selecting predictors.
In many cases, some measure of statistical significance is used to judge “importance”, which may be disconnected from actual predictive performance.
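A filter can be sketched in a few lines. The example below scores each predictor by its absolute Pearson correlation with the outcome and keeps those above a cutoff. The correlation score, the `threshold=0.3` cutoff, the function name, and the synthetic data are assumptions for illustration; in practice a statistical test or another relevance measure might be used instead.

```python
import numpy as np

def correlation_filter(X, y, threshold):
    # Scores every predictor independently of the others and of any model.
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    corr = (X_centered.T @ y_centered) / (
        np.linalg.norm(X_centered, axis=0) * np.linalg.norm(y_centered)
    )
    keep = np.flatnonzero(np.abs(corr) >= threshold)
    return keep, corr

# Hypothetical data: only columns 0 and 3 influence the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(scale=0.5, size=300)

selected, scores = correlation_filter(X, y, threshold=0.3)
print("absolute correlations:", np.round(np.abs(scores), 2))
print("selected predictors:", selected)
```

Because each predictor is scored in isolation, the score says nothing about how the retained predictors perform together in the final model, which is the disconnect noted above.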

Info

title: Wrapper methods are iterative search procedures that repeatedly supply predictor subsets to the model and use the resulting model performance to guide which subset to evaluate next.

Wrapper methods can be either greedy or global (non-greedy) feature selection approaches.

Pros:

They have the most potential to find the globally optimal subset of predictors, if one exists (especially the non-greedy variants).

Cons:

Computationally costly.
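As an illustration of the greedy variant, the sketch below implements forward selection around an ordinary least-squares model, using mean squared error on a held-out validation set as the guide. The helper names, the train/validation split, and the synthetic data are assumptions for illustration; a global wrapper would instead use a search such as a genetic algorithm or simulated annealing.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, columns):
    # Fits least squares (with intercept) on the given columns, scores on validation data.
    cols = np.asarray(columns, dtype=int)

    def design(X):
        return np.column_stack([np.ones(len(X)), X[:, cols]])

    beta, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    residual = y_val - design(X_val) @ beta
    return float(np.mean(residual ** 2))

def forward_selection(X_train, y_train, X_val, y_val):
    remaining = list(range(X_train.shape[1]))
    selected = []
    best_score = validation_mse(X_train, y_train, X_val, y_val, selected)
    while remaining:
        # Refit the model once per candidate: this is where the cost comes from.
        scores = {
            j: validation_mse(X_train, y_train, X_val, y_val, selected + [j])
            for j in remaining
        }
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best_score:
            break  # no candidate improves validation error
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected

# Hypothetical data: only columns 0 and 2 influence the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

selected = forward_selection(X_train, y_train, X_val, y_val)
print("selection order:", selected)
```

With p candidate predictors, this search can require on the order of p squared model fits. A noise predictor may also be added late in the search if it happens to lower the validation error by chance, so the validation data used for selection should not double as the final performance estimate.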

Recommended Approach

Tip

Start with one or more intrinsic methods and see what they yield.

If a non-linear intrinsic method performs well, proceed with a wrapper method using a non-linear model.
Similarly, if a linear intrinsic method performs well, proceed with a wrapper method using a linear model.

Effect of Irrelevant Features

The effect of irrelevant features on a model depends on the type of model.

Below is an example of the effect of additional irrelevant (noise) features on RMSE (the x-axis is the number of additional noise features).

Image source: https://bookdown.org/max/FES/feature-selection-simulation.html

The original outcome y is a nonlinear function of 20 predictors, where I(·) is the indicator function:

$$y = x_1 + \sin(x_2) + \log(|x_3|) + x_4^2 + x_5 x_6 + I(x_7 x_8 x_9 < 0) + I(x_{10} > 0) + x_{11} I(x_{11} > 0) + \sqrt{|x_{12}|} + \cos(x_{13}) + 2x_{14} + |x_{15}| + I(x_{16} < -1) + x_{17} I(x_{17} < -1) - 2x_{18} - x_{19} x_{20} + \epsilon$$
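The formula translates directly into code. The sketch below generates the outcome and appends extra noise columns that play no part in it. The predictor distribution (independent standard normal), the noise scale for ϵ, the sample size, and the number of noise columns are assumptions for illustration and may differ from the setup used in the referenced simulation.

```python
import numpy as np

def signal(X):
    # Noise-free part of the outcome; x[1]..x[20] mirror the formula's 1-based indices.
    x = [None] + [X[:, j] for j in range(20)]
    return (
        x[1] + np.sin(x[2]) + np.log(np.abs(x[3])) + x[4] ** 2 + x[5] * x[6]
        + (x[7] * x[8] * x[9] < 0) + (x[10] > 0) + x[11] * (x[11] > 0)
        + np.sqrt(np.abs(x[12])) + np.cos(x[13]) + 2 * x[14] + np.abs(x[15])
        + (x[16] < -1) + x[17] * (x[17] < -1) - 2 * x[18] - x[19] * x[20]
    )

rng = np.random.default_rng(0)
n_samples, n_noise = 500, 10  # assumed sizes

X_signal = rng.normal(size=(n_samples, 20))      # assumed predictor distribution
X_noise = rng.normal(size=(n_samples, n_noise))  # irrelevant features
y = signal(X_signal) + rng.normal(size=n_samples)  # epsilon assumed standard normal

# Models are then fitted on all columns, relevant and irrelevant alike.
X = np.hstack([X_signal, X_noise])
print(X.shape, y.shape)
```

Repeating the fit while increasing `n_noise` and recording the test-set RMSE for each model type yields the kind of comparison described above.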

References

https://bookdown.org/max/FES/selection.html