Friday, July 19, 2013

Powering up Python for Data Analysis

When working with machine learning algorithms we move large amounts of data, but in many algorithms the dominant cost is the heavy use of linear algebra operations and other mathematical computations.

Intel's Math Kernel Library (MKL) is a math library optimized for the latest processors. It includes hand-written optimizations for multiple core counts, wider vector units and more varied architectures, which yield performance that compiler-automated optimization alone could not achieve. It covers routines such as highly vectorized and threaded linear algebra, fast Fourier transforms, vector math and statistics. These functions are royalty-free, so including them statically in a program comes at no cost.

Christoph Gohlke and collaborators have an MKL license and have taken the effort to build a series of Python modules against it. In particular, his NumPy and SciPy builds include these powerful libraries. On top of that, he provides precompiled binaries for 64-bit Windows, which are very rare on the internet.
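To see which BLAS/LAPACK libraries a given NumPy build is linked against, one option is to inspect the output of numpy.show_config(). The following is a minimal sketch: the helper name blas_report and the keyword search are illustrative assumptions, and the exact text printed varies between NumPy versions and builds.

```python
import contextlib
import io

import numpy


def blas_report():
    """Return the text printed by numpy.show_config() as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


report = blas_report()
lowered = report.lower()

# Rough keyword search (an assumption: the library name appears in the report).
for name in ("mkl", "openblas", "atlas"):
    if name in lowered:
        print("build configuration mentions:", name)

print(report)
```

If none of the names is reported, the build may be using a reference BLAS, or the report may simply describe the library differently.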

The following are two tests with a positive definite matrix. We compute its eigenvalues in R and in Python, using the symmetric eigenvalue solver in each case. The processor is an i5-3210M running on battery rather than plugged in to the socket (losing approx. half its performance). Note that this version of R is compiled against the standard ATLAS libraries.
In R:
B=read.csv("B.csv",header=F)
st=proc.time(); eigB=eigen(B,symmetric=T); en=proc.time()
> en-st
   user  system elapsed
   0.58    0.00    0.58 
In Python:
from time import time
import numpy
B=numpy.loadtxt("B.csv", delimiter=",")
st = time(); E, U = numpy.linalg.eigh(B); en = time()  # eigh returns (eigenvalues, eigenvectors)
>>> en-st
0.13400006294250488
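
For readers who do not have B.csv at hand, the Python test can be approximated with a generated matrix. This is only a sketch under stated assumptions: the size n = 300 and the construction A·Aᵀ + n·I are hypothetical stand-ins for the matrix used above, so the timings are not comparable with the figures reported in this post.

```python
from time import perf_counter

import numpy

n = 300  # hypothetical size, not the size of the original B.csv
rng = numpy.random.default_rng(0)

# A @ A.T is symmetric positive semidefinite; adding n * I makes it
# strictly positive definite.
A = rng.standard_normal((n, n))
B = A @ A.T + n * numpy.eye(n)

st = perf_counter()
E, U = numpy.linalg.eigh(B)  # E: eigenvalues (ascending), U: eigenvectors
en = perf_counter()

print("elapsed seconds:", en - st)
print("smallest eigenvalue:", E.min())
# Sanity check: B should equal U diag(E) U^T up to rounding error.
print("reconstruction ok:", numpy.allclose(U @ numpy.diag(E) @ U.T, B))
```

perf_counter is used here instead of time because it is intended for measuring short intervals.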

A final remark: there is an open-source alternative for high-performance CPU computing, the OpenBLAS library. Its performance is comparable to that of MKL.

Link to the positive definite matrix used in the experiments here.
Link to Christoph Gohlke's page here.
