Problems and Issues in Supervised Learning

There are six issues to take into account when dealing with supervised learning, as follows:

- Heterogeneity of Data - Many algorithms, such as neural networks and support vector machines, require their feature vectors to be homogeneous, numeric, and normalized. Algorithms that employ distance metrics are especially sensitive to this, so if the data is heterogeneous, these methods should be a last resort. Decision trees, by contrast, handle heterogeneous data very easily.
- Redundancy of Data - If the data contains redundant information, i.e. highly correlated features, distance-based methods can become numerically unstable. In this case, some form of regularization can be applied to the model to prevent this situation.
- Dependent Features - If there are dependencies between the features, then algorithms that model complex interactions, such as neural networks and decision trees, fare better than other algorithms.
- Bias-Variance Tradeoff - The training data may be drawn as several different but equally good data sets. A learning algorithm is said to be biased for a particular input if, when trained on these data sets, it is systematically incorrect in predicting the correct output for that input. A learning algorithm has high variance for a particular input if it produces different outputs when trained on different data sets. There is thus a tradeoff between bias and variance, and many supervised learning methods provide a way to adjust it.
- Amount of Training Data and Function Complexity - The amount of data required during training depends on the complexity of the function to be learned from the training set. A simple function with low complexity can be learned from a small amount of data, whereas a high-complexity function requires the learning algorithm to have a large amount of data.
- The Dimensionality of the Input Space - If the input feature vectors have a high dimension, then learning can be difficult even if the true function depends on only a small number of features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias.
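The sensitivity of distance-based methods to heterogeneous, unnormalized features can be sketched in a few lines. The two samples and their feature meanings below are invented for illustration; standardizing each feature to zero mean and unit variance is one common normalization, not the only one:

```python
import numpy as np

# Two hypothetical samples on very different scales:
# feature 0 is an income in dollars, feature 1 an age in years.
a = np.array([50_000.0, 25.0])
b = np.array([52_000.0, 60.0])

# Raw Euclidean distance is dominated entirely by the income feature.
raw_dist = np.linalg.norm(a - b)

# Standardize each feature (zero mean, unit variance) over the data set
# before computing distances, so both features contribute comparably.
X = np.vstack([a, b])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_dist = np.linalg.norm(X_std[0] - X_std[1])

print(raw_dist)  # ~2000: the 35-year age gap is essentially invisible
print(std_dist)  # both features now carry equal weight
```

Decision trees sidestep this entirely: they split on one feature at a time, so the scale of each feature never interacts with the others.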
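The numerical instability caused by redundant features, and the regularization that cures it, can be illustrated with a small synthetic sketch. The data, the near-duplicate feature, and the ridge penalty `lam = 1.0` are all assumptions chosen for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic design matrix with a redundant (almost duplicated) feature.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)  # nearly identical to x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

# Plain least squares: the near-singular X^T X makes the individual
# coefficients unstable, even though their sum stays close to 1.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression adds lam * I to X^T X, restoring numerical stability
# and shrinking the coefficients toward zero.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(w_ols)    # often large, opposite-signed, and seed-dependent
print(w_ridge)  # both close to 0.5, splitting the shared signal evenly
```

The ridge solution splits the weight evenly between the two correlated copies, which is exactly the stabilizing effect the regularization is meant to provide.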
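The bias-variance tradeoff described above can be made concrete by training the same model family on many "different but equally good" data sets and watching how its prediction at a fixed point varies. Everything here is a synthetic sketch: the sine target, the noise level, and the polynomial degrees 1 and 12 are all illustrative choices, with polynomial degree standing in for model complexity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=20):
    """Draw one of several different but equally good training sets."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def predictions_at(x0, degree, trials=200):
    """Refit a degree-`degree` polynomial on fresh data sets and
    collect its prediction at the fixed input x0 each time."""
    preds = []
    for _ in range(trials):
        x, y = sample_dataset()
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    return np.array(preds)

x0 = 0.5
low = predictions_at(x0, degree=1)    # simple model: high bias, low variance
high = predictions_at(x0, degree=12)  # flexible model: low bias, high variance

# The spread of predictions across training sets is the variance at x0.
print(np.var(low), np.var(high))
```

Raising the degree lowers bias but inflates the spread of predictions across training sets, which is the tradeoff the supervised learner's complexity knob adjusts. The same experiment also hints at the dimensionality issue: adding irrelevant input dimensions, like adding polynomial terms, tends to raise variance.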