Tuesday, July 30, 2013

Kaggle survey

I am conducting a survey of kagglers' habits (that is, of data scientists, machine learning practitioners...).

Topics include: academic background, hardware used to analyze data, language of choice, views on machine learning, geographical origin, place of study...

If you are interested and you are a ML practitioner, please take the survey at
http://es.surveymonkey.com/s/SYYTCF2

So far, R has surpassed Python as the language of choice for data problems, most practitioners come from the US and, surprisingly, many of them (us) were originally software engineers.

Monday, July 29, 2013

Mathematical analysis of MapReduce

Everybody is talking about MapReduce, and many talk a lot about it even though they barely know what it is. I guess they talk about it because of all the Google hype.

To put it simply, MapReduce is
$$\left. F (f_y)  \right|_{y=k}$$

where the function $f$ is the map, $F$ is the reduce and $k$ is the key. In the special case where the reduce just adds the values, the above becomes
$$\left. \int f_y (x) dx \right|_{y=k}$$
where $x$ ranges over the values and $F$ is a linear functional (i.e., an element of the algebraic dual of the space where the $f_y$, for all keys $y$, live).

The prominent example of computing maximum temperatures from "Hadoop: The Definitive Guide" is the operation
$$\left. \| f_y \|_{\infty} \right|_{y=k}$$


It is "just" an abstraction of a basic operation found ubiquitously.
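To make the abstraction concrete, here is a minimal sketch of the max-temperature example in plain Python (toy data and names of my own invention; a real Hadoop job would distribute each of these steps):

from itertools import groupby
from operator import itemgetter

records = [("1949", 111), ("1950", 22), ("1949", 78), ("1950", 0)]

def mapper(record):
    # f_y: emit (key, value) pairs; here the year is the key
    year, temp = record
    yield (year, temp)

def reducer(key, values):
    # F: collapse all values sharing a key; here the sup norm (max)
    return (key, max(values))

# the "shuffle": sort and group the mapped pairs by key
pairs = sorted(pair for r in records for pair in mapper(r))
print([reducer(k, [v for _, v in group])
       for k, group in groupby(pairs, key=itemgetter(0))])
# [('1949', 111), ('1950', 22)]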

Thursday, July 25, 2013

What is Machine Learning?

Machine Learning is Statistics taken over by physicists along with some mathematicians, computer scientists, econometricians and rogue statisticians.

Machine Learning is the engineering of meaningful data representations, the science of designing and solving an optimization problem, and the art of rationalizing and visualizing the output.

Machine Learning is nothing without optimization.
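To take the most classical example, ordinary least squares is nothing but the optimization problem
$$\hat{\beta} = \arg\min_{\beta} \, \| X \beta - y \|_2^2$$
and fancier models mostly change the objective or add constraints and penalties.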

Friday, July 19, 2013

Powering up Python for Data Analysis

When working with machine learning algorithms we face a lot of data movement, but in many algorithms the critical part is heavy use of linear algebra operations and other mathematical computations.

Intel has a math library, the Math Kernel Library (MKL), that is optimized for the latest processors, including programmer-made optimizations for multiple core counts, wider vector units and more varied architectures. For routines such as highly vectorized and threaded linear algebra, fast Fourier transforms, and vector math and statistics, this yields a performance that compiler-automated optimization alone could not achieve. These functions are royalty-free, so including them statically in a program comes at no cost.

Christoph Gohlke and collaborators have an MKL license and have taken the effort to compile a series of Python modules against it; in particular, NumPy and SciPy are built with these powerful libraries. Add to this that he has already compiled the binaries for 64-bit Windows, which are very rare on the internet.

The following are two tests with a positive definite matrix. We compute its eigenvalues in R and in Python, using the symmetric eigenvalue solver in each case. The processor is an i5-3210M running on battery, not plugged into the socket (losing approx. half its performance). Note that this version of R is compiled against the standard ATLAS libraries.
B = read.csv("B.csv", header = FALSE)  # load the positive definite matrix
st = proc.time(); eigB = eigen(B, symmetric = TRUE); en = proc.time()
> en - st
   user  system elapsed
   0.58    0.00    0.58 
In Python:
from time import time
import numpy
B = numpy.loadtxt("B.csv", delimiter=",")  # load the same matrix
st = time(); E, U = numpy.linalg.eigh(B); en = time()  # eigenvalues E, eigenvectors U
>>> en - st
0.13400006294250488
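
By the way, to check which BLAS/LAPACK backend a given NumPy build is linked against, you can print its build configuration (the exact output varies between NumPy versions; MKL shows up for Gohlke's builds):

import numpy
numpy.show_config()  # lists the BLAS/LAPACK libraries found at build time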

A final remark: there exists an open-source alternative for high-performance CPU computing, the OpenBLAS library. Its performance is comparable to MKL's.

Link to the positive definite matrix used in the experiments here.
Link to Christoph Gohlke's page here.

Thursday, July 18, 2013

Orange Machine Learning (Python), the charm of Machine Learning

I asked about a good visualization tool on Kaggle, and D33B pointed me to Orange. Despite not being what I was asking for, checking the tool out revealed it to be awesome (M. Horbal felt that way too).

You will never win a Kaggle competition with Orange, but it will certainly help you deal with data and build models very, very quickly and intuitively. In fact, I used it to quickly build a prototype model on a company's data that was very nasty (loads of missing values, numbers with quotation marks...). I quickly built a linear regression and visualized some scatterplots and conditional distributions. All of it with this nice workflow


In the image above, the only data source is the File element. For the analysis pipeline, we first select the attributes that retain a sufficient number of values, since the independent variables have progressively fewer and fewer values; we are interested in keeping most of the values while still visualizing the relationships between the attributes. After selecting the attributes, we tell Orange to prune the data before injecting it into the several elements that follow. We want to see the conditional distributions in case we can get rid of non-informative attributes, study potential linear relationships, see their correlations via a distance map, and perform a linear regression (also ridge and lasso). On the other hand, we also want to study the regressors in depth, and for that we select only the regressor attributes in the pipeline below.
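For those who prefer scripting, roughly the same pipeline can be sketched with pandas and scikit-learn (the file and column names here are made up for illustration; Orange's own scripting interface is another option):

import pandas as pd
from sklearn.linear_model import LinearRegression

# read the nasty CSV, treating blanks and "?" as missing values
df = pd.read_csv("company.csv", na_values=["", "?"])

# keep the columns that retain at least 80% of their values,
# then drop the remaining incomplete rows
df = df.dropna(axis=1, thresh=int(0.8 * len(df))).dropna()

print(df.corr())  # pairwise linear relationships, assuming numeric columns

X = df.drop(columns=["target"])  # "target" is a hypothetical dependent variable
y = df["target"]
model = LinearRegression().fit(X, y)
print(model.coef_)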

Definitely a piece of software to have in your toolset.