Regularization adds a penalty to the loss function and shrinks the weights (introducing bias to reduce variance): $$ \text{Regularized loss} = \text{loss} + \text{penalty} $$ Examples:
L1 Regularization
Also called Lasso. Adds the absolute value of the coefficients' magnitude as a penalty term to the loss function (minimizes the sum of the absolute values of the coefficients/weights).
$$ \text{penalty} = \lambda \|w\|_1 = \lambda \sum_{j=1}^M |w_j| $$
pros:
- robust to outliers
- works best when the model contains many useless features (built-in feature selection); preferred when there is a high number of features, as it provides a sparse solution (see the sketch after this list)
cons:
- if there is a group of highly correlated variables, L1 tends to select one variable from the group and ignore the others
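A minimal sketch of the sparsity effect, assuming scikit-learn and a synthetic regression dataset (the dataset and parameter values are illustrative, not from the original notes):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha plays the role of lambda in the penalty
lasso.fit(X, y)

# Coefficients of the uninformative features are driven to (or very near) zero
# -> sparse solution / built-in feature selection
print(lasso.coef_)
```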
L2 Regularization
Also called Ridge. Adds the squared magnitude of the coefficients as a penalty term to the loss function (minimizes the sum of the squares of the coefficients/weights).
$$ \text{penalty} = \lambda w^T w = \lambda \sum_{j=1}^M w_j^2 $$
pros:
- can deal with multicollinearity (highly correlated independent variables) by reducing the weights of correlated predictors/features while keeping all predictors in the model (illustrated in the sketch below)
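A comparable sketch for Ridge under multicollinearity, again assuming scikit-learn and synthetic data of my own construction:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0)   # alpha plays the role of lambda
ridge.fit(X, y)

# The weight is spread across the two correlated predictors; neither is dropped
print(ridge.coef_)
```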
Elastic Net
A combination of L1 and L2: $$ \text{penalty} = \lambda_1 \|w\|_1 + \lambda_2 w^T w $$
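In scikit-learn the two penalties are exposed as one overall strength `alpha` plus a mixing ratio `l1_ratio` rather than separate $\lambda_1$ and $\lambda_2$; a minimal sketch on the same kind of synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alpha controls the overall penalty strength, l1_ratio the L1/L2 mix
# (l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```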
Comments:
Note: the higher the $\lambda$, the smaller the slope/weights become (the model gets less and less sensitive to the features).
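A quick illustration of this shrinkage, using Ridge from scikit-learn (where `alpha` is the code name for $\lambda$) on synthetic data I made up for the example:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)   # coefficients shrink toward zero as alpha (lambda) grows
```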
| L1 | L2 |
|---|---|
| penalizes the sum of absolute weights | penalizes the sum of squared weights |
| sparse solution | non-sparse solution |
| can have multiple solutions | has a unique solution |
| built-in feature selection | no feature selection |
| simple and interpretable | more accurate when the output depends on all input features |
| robust to outliers | not robust to outliers |
| can't learn complex patterns | can learn complex patterns |
| computationally inefficient in non-sparse settings | has an analytical (closed-form) solution |