1. Singular Value Decomposition (SVD)
This is a form of matrix analysis that produces a low-dimensional representation of a high-dimensional matrix. SVD gives an approximate representation of any matrix and makes it easy to drop the less important parts of that representation, producing an approximation with any desired number of dimensions. When we want to represent a very large and complex matrix with something smaller, SVD can factorize an m x n matrix M of real or complex values into three component matrices, with the factorization taking the form USV*. The best way to reduce the dimensionality of the three matrices is to set the smallest of the singular values to zero. If we set a particular number of the smallest singular values to zero, we can also eliminate the corresponding columns of U and of V. Dropping the lowest singular values when reducing the number of dimensions can be shown to minimize the root-mean-square error between the original matrix M and its approximation. A useful rule of thumb is to retain enough singular values to make up 90% of the energy; that is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values. An approximation of the original matrix M can then be reconstructed from U, S, and V*. SVD is widely used in predictive analytics. Normally we want to remove a number of columns from the data, since a greater number of columns increases the time taken to build a model. Eliminating the least important data gives us a smaller representation that closely approximates the original matrix. If some columns are redundant in the information they provide, they contribute noise to the model and reduce predictive accuracy, and dimensionality reduction can be achieved by simply dropping them. The resulting transformed data set can be fed to machine learning algorithms to yield faster and more accurate models.
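As a minimal sketch (assuming NumPy and a randomly generated toy matrix), the snippet below factorizes a matrix, keeps just enough singular values to cover roughly 90% of the energy, and measures the RMSE of the resulting low-rank approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((100, 50))                      # toy m x n matrix (assumption)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1     # smallest k covering 90% of the energy

M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # rank-k approximation of M
rmse = np.sqrt(np.mean((M - M_approx) ** 2))
print(f"kept {k} of {len(s)} singular values, RMSE = {rmse:.4f}")
```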
2. Forward Feature Selection
Forward selection starts with one or a few features and builds a model. Features are then added one at a time, keeping an addition only if it improves the model, until the required level of accuracy is achieved. This is a rather slow approach and impractical when a large number of features is available.
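A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector; the linear regression estimator, the number of features to keep, and the diabetes toy dataset are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)                # 10 candidate features
selector = SequentialFeatureSelector(
    LinearRegression(),                              # base model (assumption)
    n_features_to_select=4,                          # stopping point (assumption)
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```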
3. Backward Feature Elimination
Backward elimination starts with all (or most) of the features to be used for the model and eliminates them one at a time to improve the model. The removed features are those that are not discriminative and only add noise to the model. Statistical measures such as the R-squared metric and significance tests can be used to decide which features to remove.
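One common variant drives backward elimination with p-values from an ordinary least squares fit. The sketch below assumes statsmodels is available; the toy data and the 5% significance threshold are assumptions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                          # 6 candidate features (toy data)
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=200)   # only features 0 and 2 matter

features = list(range(X.shape[1]))
while features:
    model = sm.OLS(y, sm.add_constant(X[:, features])).fit()
    pvals = model.pvalues[1:]               # skip the intercept's p-value
    worst = int(pvals.argmax())
    if pvals[worst] < 0.05:                 # every remaining feature is significant
        break
    features.pop(worst)                     # drop the least significant feature

print("retained feature indices:", features)
```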
4. Subset Selection
In this technique a subset of features is selected by trial: variables are added and removed so that the error term is reduced. An exhaustive search would require fitting 2^n models, where n is the number of features, which is too expensive, so a heuristic is used instead. There are three methodologies: forward selection, backward selection and floating search. Forward selection incrementally adds a variable to the model to reduce the error. Backward selection starts with all the variables and removes them stepwise to improve the model. Floating search uses a back-and-forth approach, adding and removing variables to form different combinations, as in the sketch below.
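A sketch of floating forward selection (SFFS) using the third-party mlxtend library, which would need to be installed separately; the estimator, number of features, and scoring choice are assumptions:

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
sffs = SFS(LinearRegression(),
           k_features=4,                        # stopping point (assumption)
           forward=True, floating=True,         # floating forward search (SFFS)
           scoring="neg_mean_squared_error", cv=5)
sffs.fit(X, y)
print("selected feature indices:", sffs.k_feature_idx_)
```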
5. Principal Component Analysis (PCA)
PCA is a projection technique that finds a projection of the data into a smaller dimension. The idea is to find the axis in the data with the highest variance and to map the data along that axis. For example, if the data along vector 1 shows a higher variance than along vector 2, vector 1 is preferred and chosen as the first principal component: the axis is rotated in the direction of highest variance, and the dimensionality is reduced from two (X1 and X2) to one (PC 1). PCA is useful in cases where the dimensions are highly correlated. For example, pixels in images are highly correlated with each other, so reducing the dimension brings a significant gain. However, if the features are not correlated with each other, the number of dimensions will be almost the same after PCA.
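A minimal sketch of PCA with scikit-learn on two highly correlated toy features, reducing them to a single principal component; the synthetic data is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)     # X2 is highly correlated with X1
X = np.column_stack([x1, x2])

pca = PCA(n_components=1)                      # keep only the first principal component
X_reduced = pca.fit_transform(X)
print("variance explained by PC 1:", round(pca.explained_variance_ratio_[0], 3))
```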
6. Partial Least Squares Regression (PLS)
Partial least squares regression (PLS regression) was developed from principal components regression (PCR). It works in a similar fashion: it finds a linear regression model by projecting the predicted variables and the predictor variables into a new space, instead of finding hyperplanes of maximum variance between the target and predictor variables. PCR creates components that explain the observed variability in the predictor variables without considering the target variable at all; PLS regression, on the other hand, does take the response variable into account and therefore often leads to models that fit the target variable with fewer components. Whether PLS regression offers a more parsimonious model than PCR, however, depends on the context.
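A minimal sketch of PLS regression with scikit-learn; the synthetic data and the choice of two components are assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                         # 10 predictor variables (toy data)
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

pls = PLSRegression(n_components=2)                    # components chosen using y as well as X
pls.fit(X, y)
print("R^2 with 2 PLS components:", round(pls.score(X, y), 3))
```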
7. Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is one of the most popular techniques used for topic modelling. Topic modelling is a process that automatically identifies the topics present in a text object. An LDA model discovers underlying topics in a collection of documents and infers word probabilities within topics. LDA treats each document as a probabilistic mixture of topics and each topic as a distribution over words. These topics are not strongly defined, as they are identified based on the likelihood of co-occurrences of the words contained in them. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Given a dataset of documents, LDA backtracks and tries to figure out what topics would have created those documents in the first place. The goal is to map all the documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics. A collection of documents is represented as a document-term matrix, which LDA converts into two lower-dimensional matrices: a document-topics matrix and a topic-terms matrix. LDA then uses sampling techniques to improve these matrices until a steady state is reached in which the document-topic and topic-term distributions are fairly good. As a result, it builds a topics-per-document model and a words-per-topic model, both modelled as Dirichlet distributions.
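A minimal sketch of LDA topic modelling with scikit-learn, building a document-term matrix and factoring it into document-topics and topic-terms matrices; the four-document toy corpus and the choice of two topics are assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [                                               # toy corpus (assumption)
    "the stock market and investment returns",
    "investment portfolios and market risk",
    "football players scored in the match",
    "the match ended after extra time",
]
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(docs)              # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)               # document-topics matrix
topic_terms = lda.components_                          # topic-terms matrix
print(doc_topics.round(2))                             # each row sums to ~1
```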
8. Regularized Discriminant Analysis (RDA)
Regularized discriminant analysis (RDA) is a generalization of linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). RDA differs from these in that it estimates the covariance in a new way, combining the per-class covariance of QDA with the pooled covariance of LDA through a tuning parameter. Since RDA is a regularization technique, it is particularly useful when there are many features that are potentially correlated.
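Scikit-learn does not ship a dedicated RDA estimator, so the sketch below implements the core idea directly in NumPy: blending each class's QDA covariance with the LDA pooled covariance through a tuning parameter lam (lam = 0 recovers QDA, lam = 1 recovers LDA). The toy data and the value of lam are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (50, 4)),    # class 0 (toy data)
               rng.normal(2, 1.5, (50, 4))])   # class 1 (toy data)
y = np.array([0] * 50 + [1] * 50)
lam = 0.5                                      # tuning parameter in [0, 1] (assumption)

classes = np.unique(y)
means = {k: X[y == k].mean(axis=0) for k in classes}
priors = {k: (y == k).mean() for k in classes}
class_covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}   # QDA covariances

# LDA-style pooled within-class covariance
pooled = sum((np.sum(y == k) - 1) * class_covs[k] for k in classes) / (len(X) - len(classes))

# The RDA blend: lam = 0 gives QDA, lam = 1 gives LDA
reg_covs = {k: (1 - lam) * class_covs[k] + lam * pooled for k in classes}

def predict(x):
    """Gaussian discriminant score using the regularized covariances."""
    scores = {}
    for k in classes:
        diff = x - means[k]
        scores[k] = (-0.5 * diff @ np.linalg.solve(reg_covs[k], diff)
                     - 0.5 * np.linalg.slogdet(reg_covs[k])[1]
                     + np.log(priors[k]))
    return max(scores, key=scores.get)

print(predict(np.zeros(4)))                    # expected to fall in class 0
```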
9. t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to lower dimensions that are easy to visualize. The algorithm computes the probability of similarity of points in the high-dimensional space and in the low-dimensional space, and then tries to make these two sets of similarities agree by optimizing a cost function: it minimizes the Kullback-Leibler divergence between a distribution that measures pairwise similarities of the high-dimensional points and a distribution that measures pairwise similarities of the corresponding low-dimensional points, using gradient descent. Using this technique, t-SNE can find patterns in the data by identifying clusters based on the similarity of data points across multiple features. Because it is not limited to linear projections, t-SNE stands apart from linear techniques such as PCA and is suitable for a wider range of datasets.
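A minimal sketch of t-SNE with scikit-learn on the 64-dimensional digits dataset; the perplexity value is an assumption:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                 # 64-dimensional digit images
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)                  # 2-D embedding for visualization
print(X_embedded.shape)                             # (1797, 2)
```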
10. Factor Analysis
Factor analysis is built on the premise that there are unobserved latent factors that give rise to the available data. Whereas in PCA we create new variables from the available ones, here we treat the observed data as having been created from underlying variables and try to recover them, thus reversing the direction of PCA. If a group of variables is highly correlated, there is an underlying factor that causes this, and it can be used as a representative variable. The other variables can be grouped similarly, and each group can be represented by such a representative variable. Factor analysis can also be used for knowledge extraction, to find the relevant and discriminative pieces of information.
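A minimal sketch of factor analysis with scikit-learn on data generated from two latent factors; the synthetic generating process is an assumption:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))                        # two unobserved factors
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(300, 6))   # six observed, correlated variables

fa = FactorAnalysis(n_components=2)
factor_scores = fa.fit_transform(X)                       # estimated factor scores
print(fa.components_.shape)                               # (2, 6) loading matrix
```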
11. Multidimensional Scaling (MDS)
Multidimensional Scaling (MDS) computes the pairwise distances between data points in the original dimensions of the data. The data points are then mapped into a lower-dimensional space, such as a Euclidean space, so that points with low pairwise distances in the higher dimension are also close in the lower dimension, and points that are far apart in the higher dimension remain far apart in the lower dimension. A pitfall of this algorithm can be seen with a geographic analogy: locations that are far apart in road distance because of mountains or rough terrain, but close in straight-line (bird-flight) distance, will be mapped far apart by MDS when road distances are used, because of the high pairwise distance values.
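A minimal sketch of metric MDS with scikit-learn; the iris dataset and the two-dimensional target space are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, _ = load_iris(return_X_y=True)
mds = MDS(n_components=2, random_state=0)   # preserves pairwise Euclidean distances
X_embedded = mds.fit_transform(X)
print(X_embedded.shape)                     # (150, 2)
```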
12. AutoEncoder
An autoencoder is a tool for dimensionality reduction: it has as many outputs as inputs and is forced to find the best representation of the inputs in its hidden layer. There are fewer units in the hidden layer than inputs, which is what yields the dimensionality reduction. Once training is complete, the network from the input layer to the hidden layer acts as an encoder that finds a lower-dimensional representation of the data, while the network from the hidden layer to the output layer acts as the decoder. Data can then be passed through the encoder alone to obtain a lower-dimensional representation for dimensionality reduction.
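A minimal sketch of an autoencoder in Keras with a three-unit bottleneck; the network sizes, training settings, and random toy data are assumptions:

```python
import numpy as np
from tensorflow.keras import layers, Model

rng = np.random.default_rng(0)
X = rng.random((1000, 20)).astype("float32")               # toy 20-dimensional data

inputs = layers.Input(shape=(20,))
encoded = layers.Dense(3, activation="relu")(inputs)        # bottleneck: 3 hidden units
decoded = layers.Dense(20, activation="linear")(encoded)    # reconstruct the 20 inputs

autoencoder = Model(inputs, decoded)        # trained to reproduce its own input
encoder = Model(inputs, encoded)            # the encoder half, reused after training

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

X_reduced = encoder.predict(X, verbose=0)   # shape (1000, 3)
print(X_reduced.shape)
```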
13. Independent Component Analysis (ICA)
ICA addresses the cocktail party problem. At a cocktail party, a person is able to separate the voice of any one speaker from the voices in the background. Computers are not as efficient as the human brain at separating signal from noise, but ICA can solve this problem when the data is not Gaussian. ICA assumes independence among the underlying sources in the data, that the mixing of signal and noise is linear, and that the source signals have non-Gaussian distributions.
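A minimal sketch of the cocktail-party setup with FastICA in scikit-learn, mixing two synthetic non-Gaussian sources and recovering them; the sources and the mixing matrix are assumptions:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # speaker 1: a non-Gaussian source
s2 = np.sign(np.sin(3 * t))                 # speaker 2: a non-Gaussian source
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.5, 1.0]])      # linear mixing at the "microphones"
X = S @ A.T                                 # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)          # recovered independent components
print(S_estimated.shape)                    # (2000, 2)
```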
14. Isomap
Isomap (Isometric Mapping) computes the geodesic distances between data points and maps those distances into a Euclidean space to create a lower-dimensional representation of the same data. Isomap captures global structure by first building a neighbourhood graph using Euclidean distances and then computing graph (shortest-path) distances between the nodes. Thus, it uses local information to find a global mapping.
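A minimal sketch of Isomap with scikit-learn on the S-curve dataset; the neighbourhood size is an assumption:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=1000, random_state=0)    # 3-D points on a curved surface
isomap = Isomap(n_neighbors=10, n_components=2)        # graph built from 10 nearest neighbours
X_unrolled = isomap.fit_transform(X)                   # geodesic structure preserved in 2-D
print(X_unrolled.shape)                                # (1000, 2)
```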
15. Local Linear Embedding (LLE)
LLE reduces the dimension of the data in a way that keeps the neighbourhood information (topology) intact: points that are neighbours in the high-dimensional space should remain neighbours in the lower-dimensional space. LLE assumes that the data lies on a smooth surface without abrupt holes and that it is well sampled (dense). It works by creating a neighbourhood graph of the dataset and computing a local weight matrix, which it then uses to reconstruct the data in the lower dimension; this local weight matrix is what allows it to maintain the topology of the data.
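A minimal sketch of LLE with scikit-learn on the swiss-roll dataset; the neighbourhood size is an assumption:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_embedded = lle.fit_transform(X)     # local weight structure preserved in 2-D
print(X_embedded.shape)               # (1000, 2)
```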
16. Locality-Sensitive Hashing (LSH)
This technique uses hash functions to determine the similarity of the data. A hash function produces a compact value for an input and is commonly used for indexing in databases. With locality-sensitive hash functions, two similar inputs produce similar (or identical) hash values; the technique uses this property to determine which data points are neighbours and which are far apart, producing a lower-dimensional version of the input data set.
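A minimal sketch of random-hyperplane LSH in NumPy, where the sign pattern of projections onto random hyperplanes serves as the hash and signatures that agree on most bits indicate likely neighbours under cosine similarity; the number of hyperplanes and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                     # toy high-dimensional data
n_planes = 16                                       # each point gets a 16-bit signature
planes = rng.normal(size=(n_planes, X.shape[1]))    # random hyperplanes through the origin

def signature(x):
    # The sign pattern of the projections onto the hyperplanes is the hash
    return ((planes @ x) > 0).astype(int)

sig_a, sig_b = signature(X[0]), signature(X[1])
# Points whose signatures agree on most bits are likely neighbours
print("matching bits:", int((sig_a == sig_b).sum()), "of", n_planes)
```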
17. Sammon Mapping
Sammon mapping creates a projection of the data such that the geometric relations between data points are preserved as far as possible. It builds the new, lower-dimensional dataset from the pairwise distances between points. Sammon mapping is frequently used in image recognition tasks.
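Neither NumPy nor scikit-learn provides Sammon mapping directly, so the sketch below minimizes the Sammon stress numerically with SciPy; the toy data, target dimension, and optimizer choice are assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((50, 5))                 # toy high-dimensional data
n, dim = X.shape[0], 2                  # project down to 2 dimensions

D_high = squareform(pdist(X))           # original pairwise distances
iu = np.triu_indices(n, k=1)
d = D_high[iu]
c = d.sum()                             # normalizing constant in the Sammon stress

def sammon_stress(y_flat, eps=1e-9):
    # Stress = (1/c) * sum over pairs of (d_ij - D_ij)^2 / d_ij
    Y = y_flat.reshape(n, dim)
    dd = pdist(Y)                       # same pair ordering as d
    return np.sum((d - dd) ** 2 / (d + eps)) / c

res = minimize(sammon_stress, rng.normal(size=n * dim), method="L-BFGS-B")
Y_embedded = res.x.reshape(n, dim)      # 2-D Sammon projection of X
print("final stress:", round(res.fun, 4))
```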
Use Cases of Dimensionality Reduction
Social Media Analysis
Anomaly detection in social media analysis can be used to identify unusual patterns in user behavior, such as sudden spikes in activity, irregular sentiment patterns, or unexpected increases in follower counts. This can help in identifying fake accounts, bots, or potential security threats.
Investment Analysis
In investment analysis, anomaly detection is useful for spotting irregularities in financial transactions, unusual market activities, or detecting potential fraud. It helps investors identify outliers in stock prices, trading volumes, or financial ratios, which may indicate market manipulation or other risks.
Cybersecurity Analysis
Anomaly detection is crucial in cybersecurity for identifying unusual patterns that may indicate potential security breaches. This includes detecting unusual login attempts, abnormal network traffic, or unauthorized access to sensitive data, which could signal cyber-attacks, malware infections, or insider threats.
SEO Analysis
In SEO analysis, anomaly detection can help identify unusual traffic patterns to websites, such as sudden drops or spikes in traffic, irregular click-through rates, or unexpected changes in search rankings. This can indicate issues with SEO strategies or potential manipulations like click fraud or negative SEO attacks.
Conclusion
These dimensionality reduction techniques and anomaly detection methods can be powerful tools in various domains such as social media, investment, cybersecurity, and SEO analysis. By implementing these methods, one can gain deeper insights into data, detect irregular patterns, and make more informed decisions.
Explore the different dimensionality reduction and feature selection techniques above.
Check use cases of Dimensionality Reduction in various fields.