Thursday, May 30, 2013

Python as a data analysis platform

Although I have been aware of scikit-learn (sklearn) for some time, I never got the chance to really use Python for data analysis; instead, a victim of my own inertia, I limited myself to R and, especially, Matlab.

I must say that, in the beginning, Python looked awkward to me: coming from the C world, it was inconceivable to use an invisible element (spaces or tabs) to define the blocks of a program, somewhat like Fortran, which I had always considered weird. Yet this, together with the absence of the omnipresent C-style end-of-line semicolon, turns out to be a major boost when programming in Python. Whatever is lost in raw computing performance is recovered in the speed at which the programmer writes software. This applies to software in general, such as the app server I am preparing, written in Python on Google App Engine: it simply runs smoothly, with no recompilations, clear syntax, and complex data processing done in one-line pieces of code.
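To make that concrete, here is a small illustration of my own (a sketch, not code from any of the projects mentioned): indentation alone delimits blocks, and a filter-and-transform that would take a whole loop in C fits in one line.

    # Blocks are delimited by indentation alone: no braces, no semicolons.
    def describe(values):
        for v in values:
            if v >= 0:
                print("%d is non-negative" % v)

    # A one-line data-processing example: squares of the even numbers below 10.
    squares_of_evens = [x * x for x in range(10) if x % 2 == 0]

    describe(squares_of_evens)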

Regarding data analysis, Python is a little more involved than Matlab, with its clear orientation towards numerical linear algebra (where everything is a matrix). Good comparisons and reasons supporting my view are:

It was precisely the last of those posts that spurred me to give it a try.

Now, moving on to machine learning specifics: sklearn covers the great majority of the work a machine learning practitioner will ever need (a short sketch follows the list).
Data preprocessors, including text vectorizers and TF-IDF preprocessors
SVM implementations
Stochastic Gradient Descent algorithms for fast regression and classification
Random Forest and other ensemble methods for robust regression and classification
Clustering algorithms
Data dimensionality reduction algorithms such as LLE, ISOMAP and spectral embeddings
Results presentation, including mean squared error for regression and precision/recall tables for classification. It even computes the area under the ROC curve.
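As a taste of the items above, here is a minimal sketch of my own (the four-document toy corpus is invented for illustration) that chains a TF-IDF vectorizer, a linear classifier trained by stochastic gradient descent, and a precision/recall report:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import classification_report

    # Toy corpus: four short reviews with positive (1) / negative (0) labels.
    docs = ["great product, works well",
            "terrible, broke after a day",
            "really happy with it",
            "awful quality, very disappointed"]
    labels = [1, 0, 1, 0]

    # TF-IDF text vectorization, then a linear model trained with
    # stochastic gradient descent (hinge loss, i.e. a linear SVM, by default).
    X = TfidfVectorizer().fit_transform(docs)
    clf = SGDClassifier().fit(X, labels)

    # Precision/recall table, here computed on the training documents themselves.
    print(classification_report(labels, clf.predict(X)))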

Add to this the clean, standardized, well-designed interface: every object has a .fit method that performs the task of learning from samples, and then either a .transform method if the learning is unsupervised (LLE, ISOMAP, ICA, PCA, the preprocessors, etc.) or a .predict method if it is supervised (SVM, SGD, ensembles...). This uniformity enables a pipelining mechanism that lets us build the whole pipeline, from data reading to results output.
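A minimal sketch of that uniformity, using the iris toy dataset bundled with sklearn (my own example, assuming nothing beyond the standard library): the scaler and the PCA step expose .fit/.transform, the final SVM exposes .fit/.predict, and Pipeline chains them.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Unsupervised steps (.fit/.transform) feed the supervised one (.fit/.predict).
    pipeline = Pipeline([
        ("scale", StandardScaler()),      # preprocessor
        ("reduce", PCA(n_components=2)),  # unsupervised dimensionality reduction
        ("classify", SVC()),              # supervised learner
    ])

    iris = load_iris()
    pipeline.fit(iris.data, iris.target)       # fits every step in order
    predictions = pipeline.predict(iris.data)  # transforms, then predicts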

One of the lead developers of the project, Andreas Müller, has a very insightful blog. Check it out at the following URL:
peekaboo-vision.blogspot.com.es

I have decided to be more active on Kaggle. For the moment I stand 13th on the leaderboard of the recently opened Amazon Employee Access competition. Competing against Alexander Larko or any of the other high-standing data scientists chills my blood.

Last but not least, a note on future work: it seems bent on using the GPU to perform all the linear algebra. Check out:
Gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html
Deep Belief Networks: http://deeplearning.net/tutorial/DBN.html
PyCUDA: http://documen.tician.de/pycuda/tutorial.html
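As a flavour of what that looks like, here is a minimal sketch along the lines of the PyCUDA tutorial linked above (it assumes a working CUDA installation and the pycuda package):

    import numpy as np
    import pycuda.autoinit              # importing this sets up a CUDA context
    import pycuda.gpuarray as gpuarray

    # Ship two matrices to the GPU and multiply them elementwise there.
    a = gpuarray.to_gpu(np.random.randn(4, 4).astype(np.float32))
    b = gpuarray.to_gpu(np.random.randn(4, 4).astype(np.float32))
    c = (a * b).get()                   # .get() copies the result back to the host
    print(c)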
