Dimensionality Reduction


Not to be confused with feature selection.

Tip

Feature selection simply keeps or excludes given features without changing them. Dimensionality reduction transforms features into a lower-dimensional space.
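A minimal sketch of the difference, assuming scikit-learn is available (the dataset and estimators below are just illustrative choices): feature selection keeps a subset of the original columns unchanged, while dimensionality reduction builds new, transformed features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # 4 original features

# Feature selection: keep the 2 most informative original columns as-is
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Dimensionality reduction: project onto 2 new axes (linear combinations of all 4)
X_reduced = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_reduced.shape)  # (150, 2) (150, 2)
```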

Why is high dimensionality a problem?

  1. More dimensions → more features
    1. Redundant / irrelevant features
    2. More noise added than signal
    3. Hard to interpret and visualize
    4. Hard to store and process data
  2. Risk of overfitting our models
  3. Distances grow more and more alike
  4. No clear distinction between clustered objects
  5. Concentration phenomenon for Euclidean distance

With the same amount of data, there is a point where adding more features starts to reduce model performance (how much irrelevant features hurt depends on the type of model itself, see: feature selection).

Image source: https://community.deeplearning.ai/t/mlep-course-3-lecture-notes/54454
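A rough illustration of this point, assuming scikit-learn and NumPy (the sample size, classifier, and noise counts are arbitrary choices for the sketch): with a fixed number of samples, appending irrelevant noise features tends to degrade a distance-based model's cross-validated accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, n_informative=10,
                           n_redundant=0, random_state=0)

for n_noise in (0, 50, 200, 1000):
    noise = rng.normal(size=(X.shape[0], n_noise))   # irrelevant features
    X_aug = np.hstack([X, noise])
    score = cross_val_score(KNeighborsClassifier(), X_aug, y, cv=5).mean()
    print(f"{n_noise:5d} noise features -> accuracy {score:.2f}")
```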

Consider the effect on a distance function, for example the Euclidean distance:

$$d_{ij} = \sqrt{\sum_{k=1}^n (x_{ik}-x_{jk})^2}$$

  1. New dimensions add non-negative terms to the sum
  2. Distances increase with the number of dimensions
  3. The feature space becomes increasingly sparse (a small numerical check follows below)
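A small numerical check of the concentration phenomenon, assuming NumPy and SciPy (the point counts and dimensions are arbitrary): as the dimensionality grows, the average pairwise distance increases while the gap between the nearest and farthest pair shrinks relative to it.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))          # 500 uniform random points in d dimensions
    dist = pdist(X)                   # all unique pairwise Euclidean distances
    spread = (dist.max() - dist.min()) / dist.mean()
    print(f"d={d:5d}  mean distance={dist.mean():6.2f}  relative spread={spread:.2f}")
```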

Some Techniques

Some common techniques for dimensionality reduction (a brief sketch of one follows the list):

  1. Principal Component Analysis (PCA)
  2. Linear Discriminant Analysis (LDA)
  3. Partial Least Squares (PLS)
  4. Non-Negative Matrix Factorization (NMF)
  5. Independent Component Analysis (ICA)
  6. Latent Semantic Indexing/Analysis (LSI and LSA) (SVD)
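As one example from the list, here is a brief Latent Semantic Analysis sketch, assuming scikit-learn (the toy documents and component count are illustrative only): TF-IDF vectors are reduced with a truncated SVD, so each document becomes a low-dimensional dense vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "dimensionality reduction with PCA",
    "PCA projects data onto principal components",
    "latent semantic analysis uses SVD on term-document matrices",
    "SVD factorises the TF-IDF matrix",
]

# LSA = TF-IDF followed by truncated SVD on the term-document matrix
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
X_lsa = lsa.fit_transform(docs)   # each document becomes a 2-dimensional vector
print(X_lsa.shape)                # (4, 2)
```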

References

  1. https://www.coursera.org/learn/machine-learning-modeling-pipelines-in-production/home/week/2
  2. https://www.kaggle.com/general/59245