Regularization adds a penalty to the loss function and shrinks the weights (introducing bias to reduce variance): $$ \text{Regularized loss} = \text{loss} + \text{penalty} $$ Examples:
L1 Regularization
Also called Lasso. Adds the absolute value of the coefficients' magnitude as a penalty term to the loss function (minimizes the sum of the absolute values of the coefficients/weights).
$$ \text{penalty} = \lambda \|w\|_1 = \lambda \sum_{j=1}^M |w_j| $$
pros:
- robust to outliers
- works best when the model contains many useless features (built-in feature selection); preferred when there is a high number of features, as it provides a sparse solution (see the sketch after this list)
cons:
- if there is a group of highly correlated variables, L1 tends to select one variable from the group and ignore the others
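A minimal sketch of the sparsity effect, assuming scikit-learn and a synthetic regression dataset (the dataset and parameter values are illustrative, not from the original notes):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha plays the role of lambda in the penalty
lasso.fit(X, y)

# Coefficients of the uninformative features are driven to (or very near) zero
# -> sparse solution / built-in feature selection
print(lasso.coef_)
```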
L2 Regularization
Also called Ridge. Adds the squared magnitude of the coefficients as a penalty term to the loss function (minimizes the sum of the squares of the coefficients/weights).
$$ \text{penalty} = \lambda w^T w = \lambda \sum_{j=1}^M w_j^2 $$
pros:
- can deal with multicollinearity (highly correlated independent variables) by reducing the weights of correlated predictors/features while keeping all predictors in the model (illustrated in the sketch below)
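A comparable sketch for Ridge under multicollinearity, again assuming scikit-learn and synthetic data of my own construction:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0)   # alpha plays the role of lambda
ridge.fit(X, y)

# The weight is spread across the two correlated predictors; neither is dropped
print(ridge.coef_)
```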
Elastic Net
A combination of L1 and L2: $$ \text{penalty} = \lambda_1 \|w\|_1 + \lambda_2 w^T w $$
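In scikit-learn the two penalties are exposed as one overall strength `alpha` plus a mixing ratio `l1_ratio` rather than separate $\lambda_1$ and $\lambda_2$; a minimal sketch on the same kind of synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alpha controls the overall penalty strength, l1_ratio the L1/L2 mix
# (l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```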
Comments:
Note: the higher the $\lambda$, the smaller the slope/weights become (the model gets less and less sensitive to the features).
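A quick illustration of this shrinkage, using Ridge from scikit-learn (where `alpha` is the code name for $\lambda$) on synthetic data I made up for the example:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)   # coefficients shrink toward zero as alpha (lambda) grows
```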
| L1 | L2 |
|---|---|
| penalizes the sum of absolute weights | penalizes the sum of squared weights |
| sparse solution | non-sparse solution |
| can have multiple solutions | has a unique solution |
| built-in feature selection | no feature selection |
| simple and interpretable | more accurate when the output depends on all input features |
| robust to outliers | not robust to outliers |
| can't learn complex patterns | can learn complex patterns |
| computationally inefficient in non-sparse settings | has an analytical (closed-form) solution |