## Tuesday, November 5, 2013

### Why everything in the classification field is irrelevant except two things

Yes, everything in the binary classification problem in Machine Learning is irrelevant: Throw your Logistic Regression algorithm away, forget about SVM with universal kernels, to hell with neural networks. All you need is:

Multi-dimensional Gaussianization + Naive Bayes

Gaussianization is the process of transforming a random variable (RV) so that it becomes Gaussian. One-dimensional Gaussianization is trivial: pass the variable through its own CDF, which yields a uniform variable on (0, 1), and then apply the inverse CDF of the standard Gaussian. This requires knowing the RV's CDF, but estimating a one-dimensional distribution poses no problem unless we have too few examples, and in most cases a Gaussian Mixture Model does the job.

Multi-dimensional Gaussianization is more elaborate, and there are several procedures for it. Let's assume we have one. It is sufficient that the gaussianized version ends up with the identity covariance matrix (either directly as the output of the procedure or after an additional rotation and scaling). Because the gaussianized RV is jointly Gaussian, an identity covariance matrix means that its one-dimensional components are not merely uncorrelated but independent. This can be the input to a Naive Bayes classifier and complies with its assumption (independence of the variables), which automatically yields the best classifier according to the underlying probability distributions.
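The one-dimensional case can be sketched in a few lines of Python. Note that this is only an illustration under a swapped-in assumption: instead of fitting a Gaussian Mixture Model, it estimates the CDF from the ranks of the samples (the empirical CDF). The function name `gaussianize_1d` is made up for this example.

```python
import random
from statistics import NormalDist, mean, pstdev


def gaussianize_1d(samples):
    """Map samples to approximately standard-normal values.

    Assumption: the CDF is estimated from ranks (empirical CDF),
    not from a Gaussian Mixture Model.
    """
    n = len(samples)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Indices of the samples in increasing order of value.
    order = sorted(range(n), key=lambda i: samples[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the CDF estimate strictly inside (0, 1),
        # so the inverse Gaussian CDF is always finite.
        u = (rank + 0.5) / n
        out[i] = std_normal.inv_cdf(u)
    return out


random.seed(0)
x = [random.expovariate(1.0) for _ in range(2000)]  # clearly non-Gaussian
z = gaussianize_1d(x)
print(round(mean(z), 3), round(pstdev(z), 3))
```

The transform is monotonic, so it preserves the ordering of the samples; only the shape of the distribution changes.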

In Chen, Scott Shaobing, and Ramesh A. Gopinath. "Gaussianization." (2000), the authors present a method to gaussianize multi-dimensional RVs. It is based on expectation-maximization iterations that estimate the rotations and mixture model parameters which bring the transformed data closest to a Gaussian. At each iteration, the negentropy (the Kullback-Leibler divergence between the distribution of our currently transformed RV and a standard Gaussian) is lower than in the previous iteration. The negentropy splits into two parts: the dependence between the coordinates and the negentropy of the marginals. First, by finding a rotation we achieve less dependence, and then by marginal Gaussianization we zero out the marginal negentropy. This procedure converges weakly (in distribution), and we end up with a multivariate Gaussian. With the chain of estimated rotations and mixture model parameters we get the transformation we need for new (test) data. Therefore, classification is straightforward with Naive Bayes, and we know for certain that we fully meet its assumptions.
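The alternation of rotation and marginal Gaussianization can be sketched as follows. This is not the method of the paper: as a simplifying assumption it draws random rotation angles instead of estimating the rotation with expectation-maximization, it uses rank-based marginal Gaussianization instead of mixture models, and it only handles two dimensions. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

_STD_NORMAL = NormalDist()


def gaussianize_1d(values):
    """Rank-based marginal Gaussianization (empirical CDF, not a GMM)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return out


def rotate_2d(points, theta):
    """Rotate every (x, y) point by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def iterative_gaussianize_2d(points, n_iter=20, seed=0):
    """Alternate a rotation with marginal Gaussianization.

    Assumption: the rotation angle is random at each step, whereas
    the paper estimates the rotation that reduces dependence.
    """
    rng = random.Random(seed)
    thetas = []
    for _ in range(n_iter):
        theta = rng.uniform(0.0, math.pi)
        thetas.append(theta)
        rotated = rotate_2d(points, theta)
        xs = gaussianize_1d([p[0] for p in rotated])
        ys = gaussianize_1d([p[1] for p in rotated])
        points = list(zip(xs, ys))
    return points, thetas


# Toy data with a strong nonlinear dependence: y is roughly x squared.
rng = random.Random(1)
data = []
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, x * x + 0.1 * rng.gauss(0.0, 1.0)))

z, thetas = iterative_gaussianize_2d(data)
print(len(z), len(thetas))
```

The sketch keeps only the rotation angles. To transform new (test) data, each step's marginal maps would have to be stored as well, which is what the mixture model parameters provide in the paper's setting.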

I will be implementing Gaussianization in MyML.