Why Is Dimensionality Reduction Crucial in Machine Learning Models?
High dimensionality poses challenges to the predictive accuracy of machine learning models. In this article, I explore the importance of dimensionality reduction in building machine learning models and discuss why high dimensionality hurts the performance of machine learning and statistical models. I touch upon the following questions:
- What is high dimensionality?
- What are the difficulties caused by high dimensionality?
- What is the curse of dimensionality?
- What are the benefits of dimensionality reduction?
What is high dimensionality?
The number of features, or input variables, used to predict the target or output of a machine learning model is called its dimensionality. In practice, not all feature variables contribute equally to the output. Machine learning models such as regression and classification models are built on a training dataset, and the dimensionality of that dataset plays a crucial role in model performance. Many machine learning models learn a weight for each feature to predict the output for unseen data, so the choice of feature variables has a significant impact on accuracy. Therefore, before training a machine learning model, it is necessary to identify the relevant features that contribute to the output or target; this process is called dimensionality reduction. Dimensionality reduction is essential for the following reasons:
- To reduce complexity of model
A high number of features leads to a complex model, especially when the feature variables are highly correlated. Selecting the right subset of features helps overcome this challenge.
- To prevent overfitting
A dataset with high dimensionality may lead to overfitting, because the model captures noise along with the key features. The model then performs well during training, but its performance degrades when it is tested on unseen data.
- To achieve computational efficiency
A machine learning model with low dimensionality trains faster because each training step involves fewer computations.
The prime objective of dimensionality reduction is to find a low-dimensional representation of the dataset that preserves as much information as possible. In other words, dimensionality reduction is the means of lowering the number of random variables under consideration by obtaining a set of principal variables.
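As a minimal sketch of this objective, the following uses scikit-learn's PCA on a synthetic dataset (the dataset and its 5 underlying factors are assumptions for illustration): 50 correlated features are compressed to 5 components while preserving almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset (assumed): 200 samples, 50 features driven by 5 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))                     # 5 underlying factors
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# Reduce 50 features to 5 components that preserve most of the variance
pca = PCA(n_components=5)
X_low = pca.fit_transform(X)

print(X_low.shape)                                     # (200, 5)
print(pca.explained_variance_ratio_.sum())             # close to 1.0
```

Because the data truly lies near a 5-dimensional subspace, 5 components capture nearly all the information in the original 50 features.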
What are the difficulties caused by high dimensionality?
High dimensionality poses several challenges and adversely impacts model accuracy. Some notable points are:
- It may lead to high computational cost.
- It may cause overfitting during training: the model performs well on the training data, but its accuracy degrades when it is tested on new data samples.
- The higher the number of feature variables, the harder it is to visualize the training set.
- High dimensionality also increases the chance of high correlation among features.
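The overfitting point above can be demonstrated with a small sketch (the dataset sizes and seed are assumptions): with more features than training samples, a plain linear regression fits the training data almost perfectly yet fails on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 60, 50                                # few observations, many features
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)       # only the first feature matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

print(model.score(X_tr, y_tr))               # near-perfect fit on training data
print(model.score(X_te, y_te))               # much worse on unseen data
```

The gap between the two R² scores is exactly the training/testing gap described in the bullet above.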
What is the curse of dimensionality?
Dimensionality is the number of variables or features in the dataset used to train the machine learning model. When the number of features is large compared to the number of observations in the dataset, model performance degrades drastically; this phenomenon is called the curse of dimensionality.
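One geometric facet of the curse can be sketched numerically (the point counts and dimensions below are assumptions): as the dimension grows, the distances from a reference point to its nearest and farthest neighbours become nearly equal, so distance-based methods lose their discriminating power.

```python
import numpy as np

rng = np.random.default_rng(2)

# As dimensionality grows, pairwise distances concentrate: the nearest and
# farthest points from a reference point become almost equally far away.
ratios = {}
for d in (2, 100, 10000):
    points = rng.uniform(size=(200, d))
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    ratios[d] = dists.min() / dists.max()
    print(d, round(ratios[d], 3))            # ratio approaches 1 as d grows
```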
For example, if I build a linear regression model to predict house prices, I should consider only the noteworthy features, such as area, number of bedrooms, number of bathrooms, furnishing, age of the house, and locality. Selecting insignificant features such as the loan amount, the owner's income, or the number of flats in the society will adversely impact the accuracy of the regression model.
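This house-price example can be sketched with a simple univariate feature-selection filter (the column names, price formula, and data are all hypothetical, and `SelectKBest` with `f_regression` is one of several possible selection techniques): the informative columns are kept, while the irrelevant ones are dropped.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical house-price data with informative and irrelevant columns (assumed)
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "area": rng.uniform(500, 3000, n),
    "bedrooms": rng.integers(1, 6, n),
    "age": rng.uniform(0, 40, n),
    "owner_income": rng.uniform(2e4, 2e5, n),        # irrelevant to price
    "flats_in_society": rng.integers(10, 500, n),    # irrelevant to price
})
# Assumed price formula: depends only on area, bedrooms, and age (plus noise)
price = 200 * df["area"] + 5e4 * df["bedrooms"] - 5e3 * df["age"] \
        + 1e4 * rng.normal(size=n)

# Keep the 3 features with the strongest univariate relationship to price
selector = SelectKBest(f_regression, k=3).fit(df, price)
print(df.columns[selector.get_support()].tolist())
```

The filter recovers exactly the noteworthy features, discarding the owner's income and the number of flats in the society.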
What are the benefits of dimensionality reduction?
There are several benefits of dimensionality reduction. Some of the most important benefits are:
- It eliminates insignificant features from the dataset, which improves model performance, because irrelevant features and noise adversely affect the accuracy of a machine learning model.
- It lowers model training time and data storage requirements.
- It guards against the curse of dimensionality.
- It eliminates multicollinearity, which results in better model performance.
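The multicollinearity point can be sketched with a simple correlation filter (the 0.95 threshold and the toy columns are assumptions; variance inflation factors are a more formal alternative): for each pair of near-duplicate features, one is dropped.

```python
import numpy as np
import pandas as pd

# Toy data (assumed): "a_copy" nearly duplicates "a", so the pair is collinear
rng = np.random.default_rng(4)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=500),
    "b": rng.normal(size=500),
})

# Drop one feature from each pair whose absolute correlation exceeds 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(df.drop(columns=to_drop).columns.tolist())     # ['a', 'b']
```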
In this article, I explored what high dimensionality is and touched upon some of its important aspects: the difficulties it causes, the curse of dimensionality, and the benefits of dimensionality reduction. I hope this builds a fundamental understanding of high dimensionality.
On a wrapping-up note, feel free to share your comments and feedback. Your claps and comments will surely help me present content in a better way. See you next week.