Monday, February 10, 2014

Machine learning in a few words

Machine learning is becoming a buzzword: everybody talks about it, and few seem to be interested in the math underneath (I keep finding statements like "I wanted to know more, but all the sources were too statistical/mathematical and I wanted more practical stuff").

Let me tell you something:
First: You can't really use Machine Learning if you don't know the statistical/mathematical basis
Second: You can't really use Machine Learning if you don't know the statistical/mathematical basis
Third: You can't really use Machine Learning if you don't know the statistical/mathematical basis

Machine Learning is just a fancy name for the statistical/mathematical tools lying underneath, whose objective is to extract something that we may loosely call knowledge (something we understand) from data (something chaotic that we do not understand), so that computers may take action based on the inferred knowledge. An example would be a robot arm/humanoid: instead of programming actions on direction/velocity/acceleration vectors based on an established model, we may put sensors on a subject's articulations and, from these data points, learn a regression model on the manifold of natural movements. Another example is in Business Intelligence: we may learn groups of customers (market segmentation) so that we can engage each group with specific policies or offers targeted at it.
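As a small, purely illustrative sketch of the market-segmentation example (the customer features, their values and the choice of three segments are all made up here), k-means clustering in Python would look like this:

    # Hedged sketch: segment customers by [annual spend, visits per month].
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features, one row per customer
    customers = np.array([
        [200.0, 1], [250.0, 2], [3000.0, 10],
        [2800.0, 12], [900.0, 5], [1100.0, 6],
    ])

    # Learn 3 groups; each group can then be targeted with its own offer
    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(model.labels_)           # segment assigned to each customer
    print(model.cluster_centers_)  # the "typical" customer of each segment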

Machine Learning is applied Statistics/Mathematics. It amounts to very little, and is quite impractical, without Optimization/Operations Research, both from the algorithmic point of view and from the practical/scalability one.

I've come to the conclusion that there exist two main approaches to ML, regardless of the specific technique we are dealing with and its target (i.e., supervised or unsupervised), plus one in the middle:
  • Functional approach (Mathematical)
  • Neural Network/Deep Learning approach (Middle way)
  • Probabilistic approach (Statistical)
In the functional approach, one uses the theory of Hilbert spaces (and therefore of differential equations and linear algebra). The goal is to find a set of transformations of the data that performs best on a score for the task (called a functional). These transformations come from a pre-defined set that is not related to the data in any way, and they are combinations of possibly orthogonal bases of a space of functions (transformations) defined on the domain of the original data. Examples of this are: Linear/Ridge/Sparse Regression (linear or identity transformation for regression), SVM (non-linear, via the kernel trick), PCA (SVD/eigendecomposition) and KPCA, matrix factorizations (for signal separation/clustering...), K-means, Projection Pursuit... The basic idea is:
I have my data set and a bunch of (linear or non-linear) transformations; find a solution by applying these transformations to my data so as to maximize a score functional that I like for my problem. If I want to predict something (classification/regression), then combine these transformations so that the combination best fits my target variable. If I want to examine the nature of the data (dimensionality reduction, matrix factorization, clustering), then use a combination of my transformations that loses as little information as possible (measured, again, with a functional).
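A minimal sketch of this functional view, with synthetic data made up for the example: ridge regression picks the identity transformation (plus a bias term) and minimizes the penalized squared-error functional ||Xw - y||^2 + lambda ||w||^2, which has a closed-form solution in linear algebra.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                  # synthetic data, 3 features
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    lam = 1.0                                      # regularization strength
    Xb = np.hstack([X, np.ones((100, 1))])         # add a bias column
    # Minimizer of the functional: w = (X^T X + lambda I)^(-1) X^T y
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(4), Xb.T @ y)
    print(w)                                       # recovers roughly [1.5, -2.0, 0.5, 0]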
In the probabilistic approach, the prior knowledge is an assumption about the prior probability and the likelihood, and one works towards obtaining a posterior probability (that the outcome is a given choice, given the data just seen). Examples are: Logistic Regression (simple non-linear transformation for classification), Naive Bayes (classification), SNE, Gaussian Processes... The general idea is:
I want to see independence at the end of the process, so I can assume multivariate Gaussian variables such that a linear or non-linear transformation gives me components that are as independent as possible (for unsupervised learning); or I assume a probability distribution at the output and a likelihood, and compute a model that best fits the likelihood of seeing the target variable given the data.
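A minimal sketch of this probabilistic view, again on made-up binary data: logistic regression fitted by maximizing the log-likelihood of the observed labels with plain gradient ascent (the learning rate and iteration count are arbitrary choices for the example).

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # synthetic binary target

    w = np.zeros(2)
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # posterior P(y = 1 | x, w)
        grad = X.T @ (y - p)               # gradient of the log-likelihood
        w += 0.01 * grad                   # climb towards the maximum likelihood
    print(w)                               # points roughly along [1, 0.5]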
Neural networks and deep learning are a different story. I consider them to be their own field, drawing tools from the two above. They are not functional, because they do not deal directly with functions (transformations) in the functional-analytic setting, and they are not probabilistic for obvious reasons, although they use any probabilistic or information-theoretic tool as needed. The fact that they connect the outputs of transformations to further inputs does relate this field to the first approach, since a network can be seen as a chain of transformations (each defined on the image of its predecessor), but the focus here is clearly on the algorithms, and that changes things.
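To make the "chain of transformations" reading concrete, here is a small, purely illustrative two-layer network trained by backpropagation on XOR, written directly in numpy (the layer sizes, learning rate and iteration count are arbitrary choices for the example).

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    # Parameters of the two transformations (weights and biases)
    W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 1.0
    for _ in range(10000):
        h = sigmoid(X @ W1 + b1)       # first transformation of the inputs
        out = sigmoid(h @ W2 + b2)     # second transformation, on the image of the first
        # Backpropagate the squared error through the chain of transformations
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0)

    print(out.round(2))   # typically close to [[0], [1], [1], [0]] after training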

Reinforcement learning is nowadays just a fancy name for techniques widely known, studied and used in Stochastic Processes, with HMMs (Hidden Markov Models) being the only exception. It is sometimes called Sequential Learning, but it is not widely considered Machine Learning, either scholarly or popularly.

So, as you can see, there is nothing really new here (at least nothing on the scale of the discovery of the fundamental theorem of calculus, or of quantum mechanics).

Regarding novelty, I am annoyed every time I read about ML techniques from the big data guys, who are normally programmers starting to do something more than data aggregation and querying. They make it sound as if there were something both new and exotic in what they are saying, yet it is mostly well-known techniques from statistics, as in this article, or in the beginning of Sean Owen's presentation at Big Data Spain.

Now, the practical side of things requires that ML scale to Big Data. That limits the applicability of matrix methods and of the non-linear transformations used to adapt linear methods, so let's see what the next breakthrough is.
