Data preprocessing in Machine Learning

Data preprocessing is an important step in any machine learning problem. It aims to transform the raw input features into a feature space that is easier for a learning algorithm to work with. The most common preprocessing steps are standardization and whitening. Here X denotes the dataset, μ the population mean, and σ the population standard deviation.
Standardization - Standardization is the most popular form of preprocessing and commonly consists of mean subtraction followed by scaling by the standard deviation. Mean subtraction matters mainly because input data with a non-zero mean creates a loss surface that is steep in some directions and shallow in others, which slows down the convergence of gradient-based optimization techniques. Similarly, input data whose spread varies widely along different directions negatively affects the convergence rate.
    Mean subtraction can be formalized as

        X' = X − μ

    where X' denotes the zero-centered data.

Geometrically, mean subtraction centers the cloud of data around the origin along every dimension, as shown in the figure.
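As a rough illustration of mean subtraction, here is a minimal sketch in plain Python. It assumes the dataset is a list of rows with one feature per column. The function name center and the toy values are made up for this example and are not from the original post.

```python
from statistics import fmean

def center(rows):
    """Subtract the per-feature (column) mean from every row.

    rows: list of equal-length lists, one row per sample (an assumption
    made for this sketch).
    """
    columns = list(zip(*rows))                # transpose: one tuple per feature
    means = [fmean(col) for col in columns]   # mean of each feature
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return centered, means

# Toy dataset: 3 samples, 2 features (illustrative values only)
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_centered, mu = center(X)
print(mu)          # per-feature means
print(X_centered)  # every column now averages to zero
```

After this step each feature has zero mean, but the second feature still spreads ten times wider than the first.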


Standardization refers to altering the data dimensions so that they are of approximately the same scale. This is commonly achieved by dividing each dimension by its standard deviation once it has been zero-centered, as in

        X^s = (X − μ) / σ

where σ denotes the standard deviation and X^s the standardized data. Geometrically, dividing by the standard deviation rescales the data so that the spreads of the dimensions are comparable to each other.
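The full standardization step can be sketched the same way. This is only an illustration under a few assumptions: the data is a list of rows, the population standard deviation is used (statistics.pstdev), and a constant feature is mapped to 0.0 to avoid dividing by zero. The function name standardize and the toy values are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(rows):
    """Zero-center each feature, then divide by its population standard deviation."""
    columns = list(zip(*rows))                # transpose: one tuple per feature
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # population standard deviation
    return [
        # Assumption: a constant feature (std == 0) is set to 0.0 after centering
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy dataset: the second feature is 10x the scale of the first
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_s = standardize(X)
for row in X_s:
    print(row)  # both columns now share the same scale
```

In this toy dataset both features end up with mean 0 and standard deviation 1, so neither dominates purely because of its units.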

                                                                                       

