Regularization


Regularization adds a penalty term to the loss function to shrink the weights (this introduces some bias but reduces overfitting):

$$ \text{regularized loss} = \text{loss} + \text{penalty} $$

Examples:

L1 Regularization

Also called Lasso. It adds the absolute value of the weights (their L1 norm) as a penalty term to the loss function, i.e. it minimizes the sum of the absolute values of the coefficients/weights.

$$ penalty = \lambda \|w\|_1 = \lambda \sum_{j=1}^{M} |w_j| $$

pros:
  1. robust to outliers
  2. works best when the model contains many useless features: the built-in feature selection produces a sparse solution, so L1 is preferred when the number of features is high (see the Lasso sketch below)
cons:
  1. when a group of variables is highly correlated, L1 tends to select one variable from the group and ignore the others
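As a concrete illustration, here is a minimal sketch assuming scikit-learn is available (the data and the `alpha` value are made up for illustration), showing how Lasso drives the weights of uninformative features exactly to zero:

```python
# A minimal sketch, assuming scikit-learn; data and alpha are made up for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 10 features, but only 2 are informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1)                # alpha plays the role of lambda
lasso.fit(X, y)

# Most coefficients are driven exactly to zero -> sparse solution / feature selection
print(lasso.coef_)
```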

L2 Regularization

Also called Ridge. It adds the squared magnitude of the weights (their squared L2 norm) as a penalty term to the loss function, i.e. it minimizes the sum of the squared coefficients/weights.

$$ penalty = \lambda w^T w = \lambda \sum_{j=1}^{M} w_j^2 $$

pros:
  1. can deal with multicollinearity (highly correlated independent variables) by reducing the weights of the correlated predictors while keeping all predictors in the model (see the Ridge sketch below)
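A minimal sketch assuming scikit-learn (the correlated data and the `alpha` value are made up for illustration), showing how Ridge keeps both correlated predictors and simply shrinks their weights:

```python
# A minimal sketch, assuming scikit-learn; data and alpha are made up for illustration.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # x2 is almost identical to x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

# The weight is shared between the two correlated predictors instead of being
# assigned unstably to only one of them; neither predictor is dropped.
print(ridge.coef_)
```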

Elastic Net

A combination of the L1 and L2 penalties:

$$ penalty = \lambda_1 \|w\|_1 + \lambda_2 w^T w $$
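A minimal sketch assuming scikit-learn (data and hyperparameters are made up); note that `ElasticNet` parameterizes the two penalties through a single `alpha` and an `l1_ratio` split rather than two separate $\lambda$ values:

```python
# A minimal sketch, assuming scikit-learn; data and hyperparameters are made up.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha scales the overall penalty strength; l1_ratio splits it between
# the L1 part (lambda_1) and the L2 part (lambda_2).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```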


Note

The higher $\lambda$ is, the smaller the learned weights (slope) become, so the model grows less and less sensitive to the input features.
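A minimal sketch of this shrinkage effect, using scikit-learn's Ridge with made-up data and a made-up $\lambda$ grid:

```python
# A minimal sketch, assuming scikit-learn; data and the alpha grid are made up.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # The norm of the weights shrinks as alpha (lambda) grows.
    print(f"alpha={alpha:8.2f}  ||w|| = {np.linalg.norm(coef):.3f}")
```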

| L1 (Lasso) | L2 (Ridge) |
| --- | --- |
| penalizes the sum of absolute weights | penalizes the sum of squared weights |
| sparse solution | non-sparse solution |
| may have multiple solutions | has a unique solution |
| built-in feature selection | no feature selection |
| simple and interpretable | more accurate when the output depends on all input variables |
| robust to outliers | not robust to outliers |
| cannot learn complex patterns | can learn complex patterns |
| computationally inefficient in non-sparse settings | has an analytical (closed-form) solution |
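The last row refers to the Ridge closed-form estimate $w = (X^T X + \lambda I)^{-1} X^T y$; a minimal NumPy sketch with made-up data:

```python
# A minimal sketch of the Ridge closed-form solution w = (X^T X + lambda*I)^(-1) X^T y,
# assuming NumPy; the data and lambda value are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w)
```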
