Machinomics: 2013

Wednesday, December 25, 2013

Benchmark functions in Python

This is a function to benchmark functions in Python using decorators, so that you can use it non-intrusively with your current code, just adding the decorator operator @ to the definition of your function. Feel free to modify it with higher resolution time functions.
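A minimal sketch of such a decorator, assuming Python 3 and `time.perf_counter` as the clock. The names `benchmark` and `slow_sum` are illustrative only and are not taken from the original code:

```python
import functools
import time


def benchmark(func):
    """Print the wall-clock time taken by each call to func."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # high-resolution monotonic clock
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print("%s took %.6f s" % (func.__name__, elapsed))
    return wrapper


@benchmark
def slow_sum(n):
    # Hypothetical workload, used only to demonstrate the decorator.
    return sum(range(n))
```

Calling `slow_sum(1000)` returns the sum as usual and prints one timing line; removing the `@benchmark` line restores the undecorated function.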

Sunday, December 22, 2013

Clojure concurrency and some niceties

I stumbled upon the barber problem at some webpage I don't remember; it was a loose and open approach to it. I then imagined my own version and implemented that in Clojure (I'll put the code as a Gist, even though the style just does not match this blog's). Basically, I want to adjust the rate of customers entering the barber shop to a little faster than the barber can dispatch one customer.

To do that, I create two functions, one that increases the queue of customers (with the shop supporting up to 3 customers) and one that dispatches customers, decreasing the queue. I assume 4 hours for customers to be able to take a seat (the barber's will close after 4 hours and the barber will finish with the customers still waiting).

Notice here that the barber cuts the hair only if a random number meets a condition (here making him slower than his customers). I implement this with a watcher function that gets called whenever the reference changes. In order for it to be called every time a customer enters the shop, we need to issue an identity change, which does not change the value of the queue but fires the watcher. This is a very nice feature of Clojure.

The important functional element is this

(let [f (if (< @queue 3) inc identity)]
(dosync (alter queue f)) ))

Notice the conditional assignment to the variable in the let block. This removes the boilerplate code needed in the expression section within the let block.

Thursday, December 19, 2013

Why are so many people still using Internet Explorer?

As somebody who knows a little bit of HTML and JavaScript, I appreciate standards very much. Internet Explorer has broken every single one of them and continues to do so (though Microsoft has restrained itself lately). Luckily I develop for mobile platforms when I have an idea and have time.

Internet Explorer never follows standards; the interfaces it exposes are always Microsoft-y, cumbersome, intricate and impractical. Multimedia and interaction are assured with the latest developments of the triad HTML/JavaScript/CSS3, yet in Explorer everything always renders badly. I know programmers who have suffered making two versions of their webpage: one for Explorer and another for the rest. These are just bad guys.

And we have the following compelling reasons, that are beyond pure political reasons (of following standards):

You are safer by avoiding software that bad guys target. Mac users benefited from this for years. Windows users can lower their attack surface (be less vulnerable) by avoiding popular software. Internet Explorer is popular, so bad guys exploit known problems with the browser. No thanks.
Microsoft fixes bugs in Internet Explorer on a fixed schedule. But bugs are not discovered on a schedule, which means IE users remain vulnerable to known bugs until the next scheduled bug fix roll-out. Neither Firefox nor Chrome, my preferred browsers, are locked into a schedule.
In addition, Microsoft is just slow in fixing Internet Explorer bugs. The last release of IE patches included a fix to a bug that Microsoft had been told about six months ago. The topic of bugs in popular software brings Adobe's Flash Player to mind. Internet Explorer users with Flash enabled in their browser get notified of new versions of Flash using a very flawed system. And, when they are notified, they need to manually install the new version of Flash.
In this day and age, this is not acceptable; Flash is too popular and too buggy. Firefox fails here too. And speaking of Flash, it exists in Internet Explorer as an ActiveX control. The lack of security in ActiveX is what prompted me to jump on the Firefox bandwagon even prior to version 1.0.
ActiveX may be locked down a bit more than it used to be, but how many Internet Explorer users understand the security related prompts about running an ActiveX control, let alone the configuration options for ActiveX? To me, a browser that doesn't support ActiveX is safer. ActiveX was the first approach to extending browsers with extra features and functions. Now, both Firefox and Chrome have a huge number of available extensions. Internet Explorer has only a handful.
Buggy browser extensions/plugins are often targeted by bad guys. Both Firefox and Chrome do some checking for outdated extensions. Internet Explorer does none.

Tuesday, December 17, 2013

Damn, Clojure is fast!

Well I feel positive today, I struggled for the past two days to make my logistic regression in Clojure faster. I even made up to four different implementations of the logistic regression, with none of them giving satisfactory results. It all was even more disappointing when comparing to my MyML logistic regression implementation.

Well, I found out what the problem was: Actually I was making two fatal errors.

My input data in Clojure were lists instead of vectors
My input data in Python was 256 datapoints, instead of 1000 as in the Clojure version.

Both points stemmed from me being not so careful. In the first case, I knew already that one should use vectors instead of lists when going after performance, but I was assuming that my data was in vector form. Staring at the variable X, I wondered if it was a vector or a list and voilà, performance just got 5x better. Then I went to the Python interpreter, checked whether the number of iterations in the gradient descent object was the same as in Clojure, and then checked the data... well, my data was smaller (from a previous test with logistic regression). I re-generated my data and Python just lagged behind. In particular, I give you the figures (notice that I did not bother to put the wrong results I was getting because of my mistakes):

Python: 0.56 sec
Clojure: 0.19 sec (iterative implementation through loop-recur), 0.04 sec (concurrent implementation through agents)

Bear in mind that I am conducting the tests on my girlfriend's borrowed machine and that I installed Christoph Gohlke's Numpy distribution, which ships with Intel's MKL statically-linked libraries, so it should be pretty fast in terms of algebraic computations. Perhaps the lack of performance comes from Python's interpreter itself (read-interpret-execute...). This supports Clojure even more, since we are focusing on the infrastructure of both systems.

I will be putting everything in order, making my logistic regression more idiomatic and building some tests.

Wednesday, November 20, 2013

Google Highway 101 brainteaser

Some days ago I read about the famous Google application invitation in the form of an ad on Highway 101 passing through LA.

The ad shows a concealed URL that we need to guess by computing the first 10-digit prime that appears in the series of consecutive digits of the irrational number e. It is an oldie but goodie.

I thought I could solve this in Clojure. Here is my solution.

First we need to compute the number e with as many digits as necessary.
For that we can implement an unbounded spigot algorithm for the number e. I googled this and found a blog with superb material for this exercise, so I implemented the ideas in Clojure.

(defn digitse [N n] (if (= n N) (.setScale 2.5M 200)
                      (if (= n 0)
                        (+ 2.0M (* (.divide
                                   (.setScale 1.0M 200 BigDecimal/ROUND_HALF_UP) 2.0M BigDecimal/ROUND_HALF_UP)
                                   (digitse N (inc n) )) )
                      (+ 1.0M (* (.divide
                                   (.setScale 1.0M 200 BigDecimal/ROUND_HALF_UP) (+ n 2.0M) BigDecimal/ROUND_HALF_UP)
                                 (digitse N (inc n)))) ) ))

We then create a subsequence with sufficient decimal places. It turns out that 150 digits is enough. We also convert it to a string to make it easier to partition it into our 10-digit (now 10-character) strings.

(def e (digitse 150 0))

(def se (.subSequence (.toString e) 0 150 ))

We need to define a function that tells us whether a given number is prime. As we know, we need only test divisors up to its square root, but here a simple loop over the 10-digit numbers shows us that none of the square roots exceeds 89000, so we take the first 8500 prime numbers, starting at 2, from a prime number generator that I took from here.

(defn gen-primes "Generates an infinite, lazy sequence of prime numbers"
[]
(let [reinsert (fn [table x prime]
                   (update-in table [(+ prime x)] conj prime))]
    (defn primes-step [table d]
                 (if-let [factors (get table d)]
                   (recur (reduce #(reinsert %1 d %2) (dissoc table d) factors)
                          (inc d))
                   (lazy-seq (cons d (primes-step (assoc table (* d d) (list d))
                                                 (inc d))))))
    (primes-step {} 2)))

Please note that this generator is pretty fast, compared to the answers that I have read on the internet. Now we take the primes we need from this lazy sequence.

(def first-primes (take 8500 (gen-primes) ) )

And define a function that checks that none of these primes divides a given number with zero remainder, which means that our number is prime.

(defn prime-restricted? [n first-primes]
(= nil (some #(= 0 (rem n %)) first-primes) ))

Finally, we search recursively (with no stack overhead, thanks to recur) until either a prime is found or we exceed the 150 digits capacity (in which case we would just increase it, but we don't need to).

(defn find-first-prime-in-e [se init first-primes]
(if (<= (- (.length se) init) 10)
    nil
    (let [number (java.lang.Long/parseLong (.subSequence se init (+ init 10)) )]
      (println init number)
      (if (prime-restricted? number first-primes)
        number
        (recur se (inc init) first-primes)
        )))
)

The last output lines and the returned value of the previous function are

96 6642742746
97 6427427466
98 4274274663
99 2742746639
100 7427466391
7427466391

So, it turns out that the first 10-digit prime number is in place 100, which makes the URL

7427466391.com

Upon connection, you would get another quiz which, bearing in mind the quantity you just computed, is fairly straightforward.
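As an independent cross-check, here is a rough Python sketch of the same search. It is not a port of the Clojure code above: it assumes Python 3.8+ (for `math.isqrt`), computes e from the series of 1/k! with the `decimal` module instead of the spigot recursion, and uses plain trial division up to the square root instead of a precomputed prime list. All function names are illustrative.

```python
from decimal import Decimal, getcontext
from math import isqrt


def e_digits(count):
    """Return the first `count` digits of e (leading 2 included) as a string."""
    getcontext().prec = count + 20            # extra guard digits against rounding
    total, term, k = Decimal(0), Decimal(1), 0
    threshold = Decimal(10) ** -(count + 15)
    while term > threshold:                   # e = sum over k of 1/k!
        total += term
        k += 1
        term /= k
    return str(total).replace(".", "")[:count]


def is_prime(n):
    """Trial division by 2 and by odd numbers up to the integer square root."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def first_ten_digit_prime(digits):
    """Return (start index, number) for the first 10-digit prime window, or None."""
    for start in range(len(digits) - 9):
        window = digits[start:start + 10]
        # Windows with a leading zero are not 10-digit numbers, so skip them.
        if window[0] != "0" and is_prime(int(window)):
            return start, int(window)
    return None
```

Calling `first_ten_digit_prime(e_digits(150))` should give the same number as the Clojure run above. Note that the index is counted on the digit string without the decimal point, so it is shifted by one with respect to the Clojure output.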

Sunday, November 17, 2013

Two very simple Python functions to check prime numbers and list divisors

Here are two simple functions (with no error checking, so watch your inputs) to check whether a number is prime and to list the prime divisors of a number.

To check for prime numbers:

import numpy as np
def prime(x): return not any( ( x % np.arange(2, int(np.sqrt(x)) + 1) == 0 ).tolist() )  # candidate divisors: 2 .. floor(sqrt(x))

Remember that the fundamental theorem of arithmetic guarantees that every integer greater than 1 can be decomposed into prime numbers, and that it is only necessary to try divisors up to a number's square root to know whether it is prime, a fact known in ancient Greece among many facts about integer arithmetic (for instance, Euclid proved that there are infinitely many prime numbers in Proposition 20 of Book IX of his Elements).

def divisors(x):
    # prime divisors of x, searched from 2 up to half of x
    return np.array( [k for k in range( 2, x // 2 + 1 ) if prime(k) and not x % k] )

In this case we need to check the primes from 2 up to half of the number (we could optimize this by finding the minimum factor, since the range of checks we really need to do is (min_factor(x), x/min_factor(x))), because a number has no proper divisor larger than its half.

Thursday, November 7, 2013

Naive Bayes with Map Reduce

A fairly straightforward way of implementing the Naive Bayes classifier for discrete data is using Map Reduce. This is especially useful if you have a bunch of categorical or naturally discrete data that you can exploit, such as presence/absence, number of clicks, page/item visited or not, etc.

This can be achieved by first using the data attributes as the key, and the labels as the values on the mapper, in which we need to process the keys and values in this way:

emit the label as key
for each variable (attribute) emit its index (for example, column index) also as key

We only need to emit the category (attribute value) as the value

In the reducer, we need to scan each category and find out how many of the elements in the current key belong to that category, and divide by the total count over all its categories (which are our values). The result is $P(X_i=x_{i,0}|y=y_0)$, for which we emit a triplet as key:

emit the label as key
for each variable (attribute) emit its index (for example, column index) also as key
emit the category for this attribute of this example

As value we only need to emit the previous division.

To classify a new instance, we look into the dictionary entries corresponding to its attributes and return the Bayes quotient.
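The steps above can be sketched in plain Python, simulating the shuffle phase in memory. This is only an illustration under my own assumptions, not the MyML code: the function names are made up, there is no smoothing for unseen categories, and prediction picks the label with the largest product instead of returning a quotient.

```python
from collections import Counter, defaultdict


def mapper(row, label):
    """Emit ((label, attribute index), category) for every attribute of one example."""
    for index, category in enumerate(row):
        yield (label, index), category


def reducer(key, categories):
    """Turn the categories seen under one (label, index) key into P(X_i = x | y)."""
    label, index = key
    counts = Counter(categories)
    total = sum(counts.values())
    for category, count in counts.items():
        yield (label, index, category), count / total


def train(rows, labels):
    # Shuffle phase, simulated in memory: group the mapper values by key.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        for key, value in mapper(row, label):
            groups[key].append(value)
    # Reduce phase: one conditional probability per (label, index, category).
    table = {}
    for key, values in groups.items():
        table.update(reducer(key, values))
    priors = {y: c / len(labels) for y, c in Counter(labels).items()}
    return table, priors


def predict(row, table, priors):
    """Return the label with the largest prior times product of conditionals."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for index, category in enumerate(row):
            # Unseen (label, index, category) combinations count as probability 0.
            score *= table.get((label, index, category), 0.0)
        scores[label] = score
    return max(scores, key=scores.get)
```

For instance, training on the rows `(0, 0), (0, 1), (1, 0), (1, 1)` with labels `0, 0, 1, 1` makes the first attribute fully informative, so `predict((1, 0), table, priors)` picks label 1.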

I've just implemented this in MyML.

As an example:

import numpy as np

Xd=np.random.random((256,2))
X=1*(Xd<.5)
y=1*(Xd.sum(axis=1)<.5)

from myml.supervised import bayes

reload(bayes)
nb = bayes.NaiveBayes()
nb.fit(X, y)
nb.predict(X[0,:])
pred=nb.predict(X)

1.0*np.sum(1.0*(pred>.5).reshape((1,len(y)))[0]==y)/len(y)

0.89453125

print X.T
[[0 1 0 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 1 1
0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0
1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1
1 0 1 1 1 1 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0 1 0 0 1
1 1 0 0 1 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0
0 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1]
[0 0 1 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0
1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1
0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 0
1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0 0 0
0 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 1 1 0 1 0
1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1
1 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0]]

y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

Wednesday, November 6, 2013

True random numbers in R

We know that rnorm() computes its output from an internal seed it keeps as a state variable. The rest is deterministic computation and, therefore, it allows us to compute the output if we guess the seed.

According to the CRAN package description, the random package in R provides an interface to the true random number service provided by the random.org website. It operates by sampling atmospheric noise via radio tuned to an unused broadcasting frequency together with a skew correction algorithm due to John von Neumann (which I don't know what it means yet). So if you are ever interested in improving your random numbers to better perform your simulations, bear this package in mind.
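For reference, the classic von Neumann skew correction is simple to sketch: read a biased (but independent) bit stream in non-overlapping pairs, output the first bit of every pair whose two bits differ, and throw away the pairs whose bits are equal. The snippet below is a minimal Python illustration of that textbook procedure; whether random.org applies exactly this variant is an assumption on my part.

```python
def von_neumann_debias(bits):
    """Turn a biased but independent bit stream into unbiased bits."""
    out = []
    # Read the stream in non-overlapping pairs.
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            out.append(first)  # (0, 1) -> 0 and (1, 0) -> 1
        # (0, 0) and (1, 1) are discarded
    return out
```

The reason it works: if each bit is 1 with probability p, the pairs (0, 1) and (1, 0) both occur with probability p(1 - p), so the surviving bits are 0 or 1 with equal probability, at the cost of discarding part of the stream.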

Tuesday, November 5, 2013

Why everything in the classification field is irrelevant but two things

Yes, everything in the binary classification problem in Machine Learning is irrelevant: Throw your Logistic Regression algorithm away, forget about SVM with universal kernels, to hell with neural networks. All you need is:

Multi-dimensional Gaussianization + Naive Bayes

Gaussianization is the process of making a variable Gaussian. One-dimensional Gaussianization is trivial: just take the inverse Gaussian CDF and apply it to the random variable's own CDF evaluated at the variable. This requires knowing our RV's CDF, but estimating one-dimensional RVs poses no problem unless we have too few examples, and can be achieved with Gaussian Mixture Models in most cases. Multi-dimensional Gaussianization is more elaborate and there are several procedures. Let's assume we have a procedure to gaussianize a multi-dimensional RV. It is sufficient that the gaussianized version ends up with the identity covariance matrix (which can be directly the output of the procedure or can be obtained by just a rotation and scaling). Once we get a jointly Gaussian RV with identity covariance matrix, we know that the one-dimensional RVs in it are independent. This can be the input to a Naive Bayes classifier and complies with all its assumptions (independence of variables), which automatically yields the best classifier according to the underlying probability distributions.
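To make the one-dimensional case concrete, here is a small Python sketch. It swaps the Gaussian Mixture Model estimate of the CDF for the plain empirical CDF (ranks), which is an assumption made to keep the example short; it needs Python 3.8+ for `statistics.NormalDist`, and the function name is illustrative.

```python
from statistics import NormalDist


def gaussianize_1d(samples):
    """Map samples to approximately standard-normal values via their ranks."""
    n = len(samples)
    # Indices of the samples, sorted by sample value.
    order = sorted(range(n), key=lambda i: samples[i])
    standard_normal = NormalDist()           # mean 0, standard deviation 1
    result = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n                 # empirical CDF, kept strictly inside (0, 1)
        result[i] = standard_normal.inv_cdf(u)
    return result
```

Whatever the shape of the input distribution, the output values are spread like a standard Gaussian sample, and the ordering of the original samples is preserved.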

In Chen, Scott Shaobing, and Ramesh A. Gopinath. "Gaussianization." (2000), the authors show a method to gaussianize multi-dimensional RVs. It is based on expectation-maximization iterations, in which one estimates the best Gaussian distribution and then finds the parameters and rotations that best describe that distribution. At each iteration, the negentropy (the Kullback-Leibler divergence between our currently transformed RV's distribution and a standard Gaussian) is less than the previous iteration's. Firstly, by finding a rotation we achieve less dependence, and then by marginal gaussianization we zero out the marginal negentropy. This procedure converges weakly (in distribution) and we end up with a multivariate Gaussian. With the chain of estimated rotations and mixture model parameters we can get the transformation we need for new (test) data. Therefore, classification is straightforward with Naive Bayes, and we certainly know that we fully meet its assumptions.

I will be implementing Gaussianization in MyML.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.7803

Monday, October 21, 2013

Clojure structures, destructuring and higher-order functions to make our lives better

Clojure provides an easy and more productive way of writing programs and it is available to us if we want to stay away from the common imperative, sequential, start-to-end way of thinking.

During my time as a developer, both in academia (especially) and in production software, I've come across the problem of getting the indices of those elements in an array that meet a constraint. This is easy (although after some training, as usual) in Clojure:

(defn index-filter [pred coll]
(for [[i elem]
(map-indexed (fn [a b] [a b]) coll) :when (pred elem)]
i))

The way this works is: the for macro binds i and elem to each of the pairs in the sequence given by (map-indexed (fn [a b] [a b]) coll). The :when clause runs the predicate on each element and filters out the elements in which we are not interested. The for body then returns the index of each element that passed the condition.

We can separate the functionality into two functions, the first to write the indexing of the original elements as an independent function:

(def make-index (partial map-indexed (fn [a b] [a b]) ))

We use (partial) to make a function that still needs an input argument and associate it into the make-index symbol. Placing it into the general code:

(defn index-filter [pred coll]
(for [[i elem] (make-index coll) :when (pred elem)]
i))

The way you call this function is with a predicate and a collection. Now we have a very elegant solution that is valid for many data types.
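For comparison, and for readers more used to Python, a rough analogue of the same idea uses `enumerate` to pair indices with elements. This is only an illustration of the pattern (the name `index_filter` simply mirrors the Clojure function), and unlike the lazy Clojure sequence it builds the whole list eagerly.

```python
def index_filter(pred, coll):
    """Return the indices of the elements of coll that satisfy pred."""
    return [i for i, elem in enumerate(coll) if pred(elem)]
```

It is called with a predicate and any iterable, for example `index_filter(lambda x: x % 2 == 0, [1, 2, 4, 5, 6])` or `index_filter(str.isupper, "aBcD")`.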

Monday, October 14, 2013

MyML: Yet another Machine Learning library in Python

Yes, I know there are a number of (very) well developed and advanced ML libraries already out there, especially for Python. What is the point of starting another one?

Well, first of all, when one starts something, one usually does it for the sake of it. For learning. That is my prime reason. I want to sharpen my Python skills with fairly advanced topics, focusing the library on good design principles and not-so-mainstream state-of-the-art techniques, such as an implementation of

[1] K. Fukumizu, C. Leng - Gradient-based kernel method for feature
extraction and variable selection. NIPS 2012.

that I had already implemented in an ad hoc fashion.

Plus, one cannot help but implement classical techniques and focus on doing it well for once. Look at this UML chart of the Logistic Regression implementation: the Logistic Regression is just a broker of other classes; it just creates a DifferentiableObjective of subclass Logistic, so that any AbstractGradientDescent method can use this implementation to compute the objective function values and the gradients at the parameter space locations (see the diagram):

The diagram was created with https://www.draw.io/

Therefore, the same logistic regression can be estimated by classical gradient descent such as the current implementation, or one can implement online, stochastic or natural gradient descent variants (future work) and plug them into the factory, which then uses the user argument values to select the particular algorithm. The same applies to other methods, and one can implement a hinge loss or classical regression with quadratic loss and just plug in the gradient descent algorithm.
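The separation between objective and optimizer can be sketched in a few lines of plain Python. This is not the MyML code: the class names only loosely mirror the ones in the diagram, the factory is left out, and the step size and iteration count are arbitrary assumptions.

```python
import math


class Logistic:
    """Differentiable objective: mean logistic loss for labels in {0, 1}."""

    def __init__(self, X, y):
        self.X, self.y = X, y

    def _predict(self, w, row):
        z = sum(wi * xi for wi, xi in zip(w, row))
        return 1.0 / (1.0 + math.exp(-z))

    def value(self, w):
        eps = 1e-12  # avoids log(0)
        total = 0.0
        for row, target in zip(self.X, self.y):
            p = self._predict(w, row)
            total -= target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps)
        return total / len(self.y)

    def gradient(self, w):
        grad = [0.0] * len(w)
        for row, target in zip(self.X, self.y):
            error = self._predict(w, row) - target
            for j, xj in enumerate(row):
                grad[j] += error * xj / len(self.y)
        return grad


class GradientDescent:
    """Plain batch gradient descent; works with any object exposing gradient(w)."""

    def __init__(self, step=0.5, iterations=200):
        self.step, self.iterations = step, iterations

    def minimize(self, objective, w0):
        w = list(w0)
        for _ in range(self.iterations):
            g = objective.gradient(w)
            w = [wi - self.step * gi for wi, gi in zip(w, g)]
        return w


class LogisticRegression:
    """Broker: builds the objective and delegates the fit to the optimizer."""

    def __init__(self, optimizer=None):
        self.optimizer = optimizer or GradientDescent()
        self.w = None

    def fit(self, X, y):
        objective = Logistic(X, y)
        self.w = self.optimizer.minimize(objective, [0.0] * len(X[0]))
        return self

    def predict(self, row):
        z = sum(wi * xi for wi, xi in zip(self.w, row))
        return 1 if z > 0 else 0
```

Swapping in a stochastic variant would only require another class with the same `minimize(objective, w0)` method, and a hinge or quadratic loss only another objective with `value` and `gradient`; `LogisticRegression` itself would not change.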

Github: https://github.com/analyticbastard/myml

Sunday, October 6, 2013

Hadoop java.lang.ClassNotFoundException

Today I ran into some weird Hadoop error. It could not find my mapper class. It turns out that I had defined HADOOP_CLASSPATH as only my current directory (where my classes were) and it lacked the generic Mapper class (org.apache.hadoop.mapreduce.Mapper), but instead of reporting that this latter class was missing, Hadoop reported my own class, which was clearly accessible.

So this blog entry is for those who run into this problem too, because there is no help on Stack Overflow regarding this issue.

This is the message you get:

java.lang.RuntimeException: java.lang.ClassNotFoundException: ProcessMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

Sunday, September 29, 2013

Kaggle survey results

The following are the results of the Kaggle survey that I conducted, in which participants were asked several questions regarding Machine Learning/Data Analysis to extract their personal views on the subject and the tools they use. The online platform on which the survey ran does not offer much analytics beyond copying and pasting aggregated results per question, so here you go:

(the link http://es.surveymonkey.com/s/SYYTCF2 )

TOTAL PARTICIPANTS: 40

1. What is your background?

Biochemistry         0,0%    0
Chemistry         0,0%    0
Computer Engineering (Software Development)        30,0%    12
Computer Science (AI/Machine Learning)        12,5%    5
Econometrics         0,0%    0
Economics        5,0%    2
Engineering (Electrical)        5,0%    2
Engineering (Mechanical)         0,0%    0
Engineering (Other)         0,0%    0
Mathematics        15,0%    6
Medicine         0,0%    0
Physics        7,5%    3
Statistics        12,5%    5
Other (Science Applied)        7,5%    3
Other (Science Pure)         0,0%    0
Other        5,0%    2

2. What is your preferred language for data analysis tasks?

Bash/sed/awk/any shell         0,0%    0
C/C++        2,5%    1
Excel         0,0%    0
Java        5,0%    2
Maple         0,0%    0
Mathematica         0,0%    0
Matlab/Octave        5,0%    2
Perl         0,0%    0
Python        37,5%    15
R/S-Plus        35,0%    14
SAS        2,5%    1
SPSS         0,0%    0
Stata         0,0%    0
Weka        2,5%    1
Other        10,0%    4

3. Where do you live? (Select the option of your political mainland country: e.g., Canary Islands - Spain - Europe (South) )

America (North - Canada)        2,5%    1
America (North - US)        42,5%    17
America (North - Mexico)         0,0%    0
America (Central)         0,0%    0
America (South - Brazil)         0,0%    0
America (South - Argentina)         0,0%    0
America (South - Others)         0,0%    0
Africa (East)         0,0%    0
Africa (Equatorial)         0,0%    0
Africa (Mediterranean including Egypt)         0,0%    0
Africa (Sahara)         0,0%    0
Africa (South Africa)        2,5%    1
Africa (West)         0,0%    0
Asia (China)         0,0%    0
Asia (Japan)         0,0%    0
Asia (Korea)        2,5%    1
Asia (India)        5,0%    2
Asia (Middle East)        2,5%    1
Asia (Europe - Russia)        2,5%    1
Asia (Other)        2,5%    1
Europe (Central)        10,0%    4
Europe (East)        2,5%    1
Europe (Islands)         0,0%    0
Europe (North)        10,0%    4
Europe (South)        5,0%    2
Oceania        10,0%    4

4. Where do you originally come from?

America (North - Canada)         0,0%    0
America (North - US)        35,0%    14
America (North - Mexico)         0,0%    0
America (Central)         0,0%    0
America (South - Brazil)         0,0%    0
America (South - Argentina)         0,0%    0
America (South - Others)         0,0%    0
Africa (East)         0,0%    0
Africa (Equatorial)         0,0%    0
Africa (Mediterranean including Egypt)         0,0%    0
Africa (Sahara)         0,0%    0
Africa (South Africa)        2,5%    1
Africa (West)         0,0%    0
Asia (China)        5,0%    2
Asia (Japan)         0,0%    0
Asia (Korea)        2,5%    1
Asia (India)        7,5%    3
Asia (Middle East)         0,0%    0
Asia (Europe - Russia)        2,5%    1
Asia (Other)        2,5%    1
Europe (Central)        10,0%    4
Europe (East)        7,5%    3
Europe (Islands)         0,0%    0
Europe (North)        7,5%    3
Europe (South)        10,0%    4
Oceania        7,5%    3

5. Where did you study?

America (North - Canada)         0,0%    0
America (North - US)        42,5%    17
America (North - Mexico)         0,0%    0
America (Central)         0,0%    0
America (South - Brazil)         0,0%    0
America (South - Argentina)         0,0%    0
America (South - Others)         0,0%    0
Africa (East)         0,0%    0
Africa (Equatorial)         0,0%    0
Africa (Mediterranean including Egypt)         0,0%    0
Africa (Sahara)         0,0%    0
Africa (South Africa)        2,5%    1
Africa (West)         0,0%    0
Asia (China)         0,0%    0
Asia (Japan)         0,0%    0
Asia (Korea)        2,5%    1
Asia (India)        7,5%    3
Asia (Middle East)        2,5%    1
Asia (Europe - Russia)        2,5%    1
Asia (Other)        2,5%    1
Europe (Central)        7,5%    3
Europe (East)        2,5%    1
Europe (Islands)        2,5%    1
Europe (North)        10,0%    4
Europe (South)        7,5%    3
Oceania        7,5%    3

6. What are the hardware/software configurations you use? (Mark the hardware you perform your data computations on, not the one you have, i.e., do not mark GPU if you use it only for gaming and you don't perform data analysis on GPU.)

Apple Macintosh        20,0%    7
Cloud (Amazon)        5,7%    2
Cloud (Other)         0,0%    0
GPU (ATI)         0,0%    0
GPU (Nvidia)        14,3%    5
CPU (AMD/K10)         0,0%    0
CPU (AMD/Bulldozer)        2,9%    1
CPU (AMD/Bobcat)        2,9%    1
CPU (Intel/i3)        5,7%    2
CPU (Intel/i5)        37,1%    13
CPU (Intel/i7)        37,1%    13
CPU (Intel/Ivy Bridge)        8,6%    3
CPU (Intel/Sandy Bridge)        11,4%    4
CPU (Intel/Other)        8,6%    3
CPU (Other)        5,7%    2

7. What OS/browser(s) do you use?

Linux (Chrome)        22,9%    8
Linux (Chromium)        2,9%    1
Linux (Firefox)        17,1%    6
Linux (Opera)         0,0%    0
Linux (Other)         0,0%    0
OSX (Chrome)        20,0%    7
OSX (Chromium)         0,0%    0
OSX (Firefox)         0,0%    0
OSX (Other)         0,0%    0
OSX (Safari)        2,9%    1
Windows (Chrome)        54,3%    19
Windows (Chromium)         0,0%    0
Windows (Firefox)        17,1%    6
Windows (Other)        5,7%    2
Windows (Safari)         0,0%    0
Other OS (Chrome)         0,0%    0
Other OS (Chromium)        2,9%    1
Other OS (Firefox)         0,0%    0
Other OS (Other)         0,0%    0
Other OS (Safari)         0,0%    0

8. Have you used any Hadoop-related tools for any data analysis?

Cassandra         0,0%    0
Lucene         0,0%    0
Hadoop        77,8%    7
Mahout        22,2%    2
Hama         0,0%    0
HBase         0,0%    0
Hive        22,2%    2
Pig        44,4%    4

9. What is the Machine Learning technique that you generally find most useful for classification/regression?

Adaboost        3,2%    1
Bayesian Networks        3,2%    1
kNN         0,0%    0
Linear Regression (Lasso/ElasticNet)        3,2%    1
Linear Regression (OLS/Ridge/other regularized)        3,2%    1
Linear Regression (Other)         0,0%    0
Linear SVC/SVR         0,0%    0
Logistic Regression        6,5%    2
Naive Bayes         0,0%    0
Neural Networks        12,9%    4
Random Forests        67,7%    21
SVM/SVR (Non-linear kernel)         0,0%    0

10. According to you, Machine Learning is mostly?

Engineering/Algorithmics        14,3%    5
Engineering/Algorithmics and Optimization        34,3%    12
Mathematics        5,7%    2
Optimization        2,9%    1
Physics         0,0%    0
Programming        5,7%    2
Statistics and Probability Theory        37,1%    13

Wednesday, September 11, 2013

Load Kaggle datasets directly into Amazon EC2

Despite not having access to a suitable environment at home, I decided to enter a new Kaggle competition. The StumbleUpon Evergreen Classification Challenge seems to be easy to tackle since it is a classic binary classification problem with text features and numerical features.

I decided to do it on the cloud. For that purpose, one needs to load the data distributed by Kaggle into the Amazon EC2 instance. Kaggle will prevent any connection from there, since they require you to log in to access the data. No problem, it is the cookies which do the work, and we are going to use them from the EC2 instance, as they commented here

The first thing we need is a plugin to save the cookies into a text file. Use this for Firefox, and this for Chrome.

Then, we upload the file to the EC2 instance by some means. In my case I use Bittorrent Sync (a post will be coming later on). We tell wget to use the cookies with the option --load-cookies like this:

wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip

We get an output such as this, and we have successfully loaded the data:

ubuntu@ip-172-31-21-138:~/kaggle/evergreen$ wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
--2013-09-09 22:37:17-- http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
Resolving www.kaggle.com (www.kaggle.com)... 168.62.224.124
Connecting to www.kaggle.com (www.kaggle.com)|168.62.224.124|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D [following]
--2013-09-09 22:37:18-- https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D
Resolving kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)... 65.52.106.46
Connecting to kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)|65.52.106.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 164757969 (157M) [application/zip]
Saving to: ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’

100%[======================================>] 164,757,969 2.29MB/s in 95s

2013-09-09 22:38:53 (1.65 MB/s) - ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’ saved [164757969/164757969]

ubuntu@ip-172-31-21-138:~/kaggle/evergreen$
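If wget is not available on the instance, the same trick can be sketched with Python's standard library. This is a minimal sketch, assuming the exported cookies.txt is in the Netscape/Mozilla format that wget's --load-cookies reads; the function names load_cookie_jar and download are made up for illustration.

```python
import http.cookiejar
import urllib.request


def load_cookie_jar(cookie_path):
    """Load a Netscape/Mozilla-format cookies.txt file into a cookie jar."""
    jar = http.cookiejar.MozillaCookieJar()
    # ignore_discard=True also loads cookies that have no expiry date
    # (session cookies), which would otherwise be skipped.
    jar.load(cookie_path, ignore_discard=True)
    return jar


def download(url, cookie_path, destination, chunk_size=1 << 20):
    """Stream `url` to `destination`, sending the stored cookies along."""
    jar = load_cookie_jar(cookie_path)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(url) as response, open(destination, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

With the paths from this post, the call would be download("http://www.kaggle.com/c/stumbleupon/download/raw_content.zip", "/home/ubuntu/BTSync/cookies.txt", "raw_content.zip"), where the expanded home directory is an assumption.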

Monday, September 9, 2013

Stuff: Cook computers in the oven to resuscitate them, and cool Windows account hacking

I wanted to share a couple of experiences with computers that I have had in the period of one week.

A couple of weeks ago a number of friends and I went on vacation and rented a house in the middle of nowhere. We brought a Wifi router and our laptops and played a very old computer game that everybody still loves, Age of Mythology.

At the end of our stay, one of my laptops (the old one) stopped working, and my friends told me the last thing it did was paint strange lines along with the graphics, i.e., a graphics card issue. I must say that this comes at an already tumultuous time between me and technology, since my new laptop died some time ago and is in repair.

When we got home, I looked for information on the possible issue. And I came across this

The funny thing is: IT WORKS!

We removed the NVidia video card, put it in the oven as it says (8 min at 200 °C), put it back in and voila! Graphics back!

The next experience: I was given my sister's old laptop. It is old but still decent, and she only changed laptops because the disk crashed... I took it and recovered her files later on. It must be the two-year-long rest, but the disk is alive now. The only problem: we could not access the laptop because of her password.

Here's Youtube to the rescue again. This, however, seems to work only with Windows 7 and 8, not with Vista, the one on the laptop. Anyway, we took the disk out, put it in my external drive case, and accessed the files from there. The hack works like this:

The accessibility button on the Windows login screen (bottom left) calls a program called utilman.exe, located under C:\Windows\System32.
Accessing the drive, we rename it to utilman1.exe and copy cmd.exe to utilman.exe, so utilman.exe is now a command prompt.
When we put the disk back in and started Windows, pressing the accessibility button showed a privileged command prompt. From there, we issued the command net user myuser x, effectively resetting the password to a plain x.

This left us with a ready-to-use laptop (no critical tasks, just online gaming), just what we needed for the afternoon.

Addendum: I think it is worth recording which laptop was reflowed with this technique: it is an Acer 5920g and the NVidia card was an 8600M GT. Apparently, these NVidia processors (prior to 2009) suffer from the cycle of heating up under use and cooling down when idle: this makes internal circuits break, much like a stone exposed to the sun. Heating them to over 200 °C makes the connections sufficiently liquid to resolder themselves.

Sunday, September 8, 2013

Video: Wes McKinney about time data handling in Pandas

Interesting stuff. Haven't seen it yet, came across it while searching for Time Series Analysis in Python (but this has nothing to do with ARIMA models). They've worked out timezones and other time data handling operations pretty well.

Wednesday, August 28, 2013

Clojure for project development

In line with my last Clojure post, and with several comments I have come across on the internet, I am going to post a guide to building a piece of Clojure software that is runnable from the beginning (i.e., outside of the REPL).

First of all, you need a JVM, obviously. If you don't have one yet, install the latest version of the Java SDK.

Now you want to install Leiningen. Leiningen is a build and dependency management tool favored by the Clojure community. Download the Leiningen script here https://raw.github.com/technomancy/leiningen/stable/bin/lein. Now you need to place it in a directory within your PATH (for example, ~/bin) and set it to be executable with chmod 755 ~/bin/lein.

Once we have done that, we need to tell the script to download the Leiningen system. You can easily do that with lein self-install.

Now you can create a Clojure project, called hello:

lein new app hello

This uses the template app to create your new project. Now cd into the new directory, collect the dependencies and run the tests.

cd hello
lein deps
lein test

You'll see a single testcase which deliberately fails:

Testing hello.core-test
FAIL in (replace-me) (core_test.clj:6)
expected: false
actual: false
Ran 1 tests containing 1 assertions.
1 failures, 0 errors.

Great! Clojure is installed in this project and working! To get a feel for Clojure, let's try out some basic stuff by starting a script console:

lein repl

You'll see something like:

nREPL server started on port 59654 on host 127.0.0.1
REPL-y 0.3.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e

user=>

Type

(println "Hello World!")

and press return. You should get:

Hello World!
nil
user=>

Now let's define a function that does that:

(defn greet [] (println "Hello World!"))

The console will respond:

#'user/greet
user=>

Run the function:

(greet)

Hello World!
nil
user=>

Returning to the project, edit src/hello/core.clj (the basic source skeleton that Leiningen created for you above). Add our greet function to it and call it, so core.clj reads:

(ns hello.core)

(defn greet [] (println "Hello World!"))

(defn -main [] (greet))

The (ns hello.core) line declares the namespace (think Java package) in which the code lives. The -main function plays the role of the regular Java main method; the - prefix is how we tell Clojure so.
We can run this via Leiningen:

lein run -m hello.core

The -m argument specifies the namespace in which -main is defined.

Now let's modify our script so we can compile it and run it via the JVM. First, we update the namespace declaration to tell Clojure we want to generate a (Java) class file. Second, we build the output as a single string with str, which avoids the spaces println would insert between its arguments (so we need an explicit space after Hello). Third, we change our main method to accept an argument:

(ns hello.core (:gen-class))
(defn greet[who] (println (str "Hello " who "!")))
(defn -main[who] (greet who))

We also need to tell Leiningen about our main class. Edit project.clj and add a :main declaration so it looks like this:

(defproject hello "1.0.0-SNAPSHOT"
   :description "FIXME: write"
   :dependencies [[org.clojure/clojure "1.2.1"]]
   :main hello.core)

Don't worry about the rest of it, that's part of the Leiningen/Maven magic used to ensure the right libraries are available. Now tell Leiningen to compile your script and create a JAR that we can execute via Java:

lein uberjar

If you look in the current directory, you'll see hello-1.0.0-SNAPSHOT.jar and hello-1.0.0-SNAPSHOT-standalone.jar and it's the second one we'll use:

java -cp hello-1.0.0-SNAPSHOT-standalone.jar hello.core World

The trailing argument is needed because -main now takes exactly one parameter. You now have a functional project and you are ready to write some code for production software.

Sunday, August 25, 2013

The perils of the REPL

Functional [programming] people are proud of their new toy called the REPL almost as if interactive development was a new concept. I guess that coming from Java and the generalist software development languages makes you think it is (although generalist but interpreted languages such as Python have always had an interactive interpreter). People who have worked with scientific modelling software such as Matlab or R (myself included) are used to this way of developing: rapidly modelling an idea into a few lines that could be recalled and modified according to one's needs.

This, however, becomes dangerous when developing software. When making a software product, one is a step beyond bare modeling, in the sense that full working conditions are taken into account, one of them being program startup. I say this because I've read some books on Clojure and always found them using the REPL to describe the language, skipping the classical software building cycle: write a source file which includes a main function or entry point, compile it and execute the result. The REPL gives you the advantage of quick modeling, but that is very different from writing Clojure source files and integrating them into a whole system intended for production.

Although it is not that difficult, Clojure being a JVM system, almost none of them explain the entry point to the program. They stick to explaining language syntax and basic libraries on the REPL, forgetting about entry points and other production software issues such as multiple file integration. Some of them don't even include a section on Leiningen or Maven, and jump to using advanced features such as databases (Redis, MySQL, HBase) or web toolkits. Even those that come with a brief introduction don't put the reader in the context of making a deployable piece of software. Therefore, readers must resort to blogs to find that kind of information.

The REPL is useful. It is as useful as it is in the scientific/modeling world, but as software developers with a deployable product, programmers must deal with things beyond live testing that are more important to their business.

Friday, August 2, 2013

Installing Theano on Windows 64 bit (x86_64) with GPU capabilities

Since the Theano team works under Linux, those of us who bought a laptop with a fancy Windows version pre-installed, and decided to keep some compatibility with technology-reluctant friends and family (therefore accepting difficulties with everything else), are doomed to hack our way into getting Theano up and running.

In this post I assume you are going with Christoph Gohlke's packages (for the reasons, read a previous post).

Make sure you also have MS Visual C++ and the NVidia CUDA Toolkit. If it is not there already, add the directory of the Visual C++ compiler, cl.exe, to the Path. Mine was under C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin.

The first thing you need, after installing Theano, is the nose package, since Gohlke's build needs it at initialization time. Download it and install it from Gohlke's site along with Theano.

Next, put this .theanorc in your home directory, C:\Users\<yourname>:

[global]
device = gpu

[nvcc]
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin
# flags=-m32 # we have this hard coded for now

[blas]
ldflags =
# ldflags = -lopenblas # placeholder for openblas support

I am not very sure how to use OpenBLAS from here. I assumed that if all CPU operations are done via Numpy and SciPy, then their default BLAS routines are used and no direct call to a third BLAS implementation is made, but who knows! (Well, I looked into it a little and it seems Theano calls BLAS directly, so you may want to install OpenBLAS.)

OK, we have the NVidia compiler and tools, the MS compiler that nvcc needs, and the configuration. The last thing we need is a GNU C and C++ compiler that can create 64-bit Windows binaries. There is a project called MinGW-w64 that does that. I recommend downloading a private build from the user rubenvb, which does not come with an embedded Python environment as the more official build does. Put the bin directory of that installation (where GCC is located) in the Path (Control panel, etc). Theano needs this to compile the symbolic operations to object code and then to CUDA kernels if applicable, I presume.

If you run into errors of type "GCC: sorry, unimplemented: 64-bit mode not compiled in", then your MinGW is not x86_64 compliant. The NVidia compiler nvcc can also complain if it finds no cl.exe in the path.
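Before starting Theano, it can save time to check that the three compilers are actually visible. The following is a minimal sketch using only the standard library; the helper name missing_tools is made up for illustration, and the check only tells you whether an executable is on the Path, not whether that GCC build is 64-bit capable.

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Tools this post relies on: MSVC's cl, NVidia's nvcc and MinGW-w64's gcc.
not_found = missing_tools(["cl", "nvcc", "gcc"])
if not_found:
    print("Not on PATH: " + ", ".join(not_found))
else:
    print("cl, nvcc and gcc were all found on the PATH")
```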

By the way, all of this was to use deep learning techniques in Kaggle competitions, so the real goal was to install PyLearn2. It is not listed among Gohlke's libraries, but it is not low level: everything is based on Theano and maybe other numerical packages such as Numpy. Being a pure Python package, it can be cloned from Github:

git clone git://github.com/lisa-lab/pylearn2.git

And then perform

cd pylearn2
python setup.py install

There is an easier procedure that does not require you to perform the git operations manually: use pip.

pip install git+git://github.com/lisa-lab/pylearn2.git

pip lives in the Scripts directory of your Python installation, either because it came with Python or because you got Gohlke's installer.

This will also leave the module correctly accessible through Python.

Edit: Pylearn2's tutorial test is a little too complicated to be a "hello world" test, so I looked for another quick example to check that my installation was complete. A very nice one popped up in this link, which I reproduce here. But first I have to say that this made me realize that Gohlke's Theano is missing three files, something very, very strange since they are called from within Theano. In particular, the missing module is everything under theano.compat. In this case, just copy the contents of the compat directory in Theano's Github repository to a compat directory created in your local Theano installation under Python 2.7 (mine is C:\Python27\Lib\site-packages\theano).

After that, run the code in this link, which is a neural network solving the XOR problem. And we are done.

MinGW-w64: rubenvb build.
Python libraries and builds for Windows: Christoph Gohlke.
Link to a "truer" hello world Pylearn2 program: here.

Tuesday, July 30, 2013

Kaggle survey

I am conducting a survey of kagglers' habits (that is, data scientists, machine learning practitioners...).

It covers: academic background, hardware used to analyze data, language of choice, views of machine learning, geographical origin, place of study...

If you are interested and you are a ML practitioner, please take the survey at
http://es.surveymonkey.com/s/SYYTCF2

For the moment, R has surpassed Python as the language of choice for data problems, most of the practitioners come from the US and, surprisingly, many of them (us) are originally software engineers.

Monday, July 29, 2013

Mathematical analysis of MapReduce

Everybody is talking about MapReduce. They talk a lot about it even though they barely know what it is. I guess they talk about it because of all the Google hype.

To put it simply, MapReduce is
$$\left. F (f_y) \right|_{y=k}$$

where the function $f$ is the map, $F$ is the reduce and $k$ is the key. In the special case where the reduce just adds the values, the above becomes
$$\left. \int f_y (x) dx \right|_{y=k}$$
where $x$ are the values and $F$ is a linear functional (i.e., an element of the algebraic dual of the space where the $f_y$, for all keys $y$, live).

The prominent example of computing the maximum temperatures from "Hadoop: The definitive guide" is the operation
$$\left. \| f_y \|_{\infty} \right|_{y=k}$$

It is "just" an abstraction of a basic operation found ubiquitously.
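To make the notation concrete, here is a minimal single-machine sketch in plain Python, not Hadoop: the mapper plays the role of $f$, the reducer the role of $F$, and the grouping step is the restriction to $y=k$. The helper map_reduce and the readings are made up for illustration; using max as the reduce matches the norm notation above as long as the values are non-negative.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Group the (key, value) pairs emitted by `mapper`, then fold each group with `reducer`."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


def emit(reading):
    """Mapper: turn one (year, temperature) reading into a (key, value) pair."""
    year, temperature = reading
    yield year, temperature


# Hypothetical (year, temperature) readings, made up for this example.
readings = [(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)]

totals = map_reduce(readings, emit, lambda a, b: a + b)  # reduce adds the values
maxima = map_reduce(readings, emit, max)                 # reduce keeps the maximum

print(totals)
print(maxima)
```

Swapping the reducer is the only change between the two calls, which is the point of the abstraction.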

Thursday, July 25, 2013

What is Machine Learning?

Machine Learning is Statistics taken over by physicists along with some mathematicians, computer scientists, econometricians and rogue statisticians.

Machine Learning is the engineering of meaningful data representations, the science of designing and solving an optimization problem and the art of rationalizing and visualizing the output.

Machine Learning is nothing without optimization.

Friday, July 19, 2013

Powering up Python for Data Analysis

When working with Machine Learning algorithms we face large data movement, but in many algorithms the most important part is the heavy use of linear algebra operations and other mathematical computations.

Intel has a math library, the Math Kernel Library (MKL), that is optimized for its latest processors. It includes hand-made optimizations for multiple core counts, wider vector units and more varied architectures, which yield a performance that compiler-automated optimization alone could not achieve for routines such as highly vectorized and threaded linear algebra, fast Fourier transforms, and vector math and statistics. These functions are royalty-free, so including them statically in a program comes at no cost.

Christoph Gohlke and collaborators have an MKL license and have taken the effort to compile a series of Python modules against it. In particular, his Numpy and Scipy builds include these powerful libraries. On top of that, he provides binaries for 64-bit Windows, which are very rare on the internet.

The following are two tests with a positive definite matrix. We compute the eigenvalues in R and Python, using the symmetric eigenvalue solver in each case. The processor is an i5 3210M with the laptop not plugged into the wall socket (losing approx. half its performance). Note that this version of R is compiled against the standard Atlas libraries.

B=read.csv("B.csv",header=F)
st=proc.time(); eigB=eigen(B,symmetric=T); en=proc.time()
> en-st
user system elapsed
0.58 0.00 0.58

In Python:

from time import time
import numpy
B=numpy.loadtxt("B.csv", delimiter=",")
st = time(); U, E = numpy.linalg.eigh(B); en = time()
>>> en-st
0.13400006294250488

A final remark: there is an open-source alternative for high-performance CPU computing, the OpenBLAS libraries. Their performance is comparable to MKL's.

Link to the positive definite matrix used in the experiments here.
Link to Christoph Gohlke's page here.

Thursday, July 18, 2013

Orange Machine Learning (Python), the charm of Machine Learning

I asked about a good visualization tool on Kaggle, and D33B pointed me to Orange. Despite not being what I was asking for, checking the tool out revealed it to be awesome (M. Horbal felt that way too).

You will never win a Kaggle competition with Orange, but it will certainly help you deal with data and build models very, very quickly and intuitively. In fact, I used it to quickly build a prototype model for a company's data which was very nasty (loads of missing values, numbers with quotation marks...). I quickly built a linear regression and visualized some scatterplots and conditional distributions. All of it with this nice workflow

In the image above, the only data source is the file element. For the analysis pipeline, we first select the attributes that have a sufficient number of values in the independent variables, since these variables have progressively fewer and fewer values. We are interested in keeping most of the values while still visualizing the relationship between the attributes. After selecting the attributes, we tell Orange to prune the data before injecting it into the several elements that follow. We want to see the conditional distributions in case we can get rid of non-informative attributes, we want to study potential linear relationships, see their correlations via a distance map and perform a linear regression (also ridge and lasso). On the other hand, we also want to study the regressors in depth, and for that we select only the regressor attributes in the pipeline below.

Definitely a piece of software to have in your toolset.

Saturday, June 29, 2013

Clojure vectors vs. PersistentQueue

Clojure has several pre-built data structures that can be used to implement our ideas. Here you have a comparison of the well-known vector and the poorly-documented clojure.lang.PersistentQueue. Edit: I changed the code in quotes to a Github Gist since I like the way it looks much better. Sorry for the horrible contrast between the gists and the blogger template, I believe nothing will save me from modifying the colors.

There was a fatal flaw in the implementation. It is now corrected. Thanks to the commentators.

Friday, May 31, 2013

Protein classification as a text mining problem

Man we are active on Kaggle.

I am writing a paper that applies a non-linear kernel combination technique that I invented. The goal is to predict protein functions from protein interactions.

One of the methods that I want to compare against is the linear regression of the protein functional classes on the interactions of the given protein with the rest of the proteins. In this sense, the proteins are some kind of dual of themselves, since they are used to define themselves (via their interactions).

The input files are taken from CYGD-mips.gsf.de/proj/yeast and go like this:

YJR002W YKL143W
YLL044W YPL238C
YDR471W YJL148W
YLR003C YNL174W
YGR285C YLR372W
YLR243W YLR435W
YKR057W YPL211W
YLR185W YNL067W
YLR185W YPL142C
YDL051W YER049W
YGL076C YNL248C
YNL247W YPL273W
YDR449C YLR129W

So you get an idea of the rest. Each line contains an interaction between two proteins.

Assuming we have read all the files and annotated, for each protein, which proteins it interacts with, we have an array of strings, one string per protein, containing protein names as if in a text document.

>>> rels[:100]
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'YLR203C YER154W YML129C ', 'YER142C ', '', '', '', '', '', 'YBL099W YPL078C ', 'YLR393W YBL099W YPL078C YER154W ', 'YER154W YHL038C YDR375C ', '', '', 'YPL160W ', 'YBL099W YPL078C ', 'YHL004W YKL155C YER154W ', '', 'YBR037C YBR024W YER154W YDR231C ', '', 'YER154W ', '', '', '', '']
>>>
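The reading step is not shown in this post, so here is a minimal sketch of how such strings could be built from the pair lines. It assumes interactions are symmetric and only covers proteins that appear in at least one pair (the real rels array also has empty strings for proteins without interactions); the names build_documents and sample are made up for illustration.

```python
from collections import defaultdict


def build_documents(pair_lines):
    """Turn 'PROTEIN_A PROTEIN_B' lines into one space-separated 'document' per protein."""
    partners = defaultdict(list)
    for line in pair_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        a, b = fields
        # Assumption: interactions are symmetric, so each protein lists the other.
        partners[a].append(b)
        partners[b].append(a)
    names = sorted(partners)
    documents = [" ".join(partners[name]) for name in names]
    return names, documents


# Three of the interaction lines shown above.
sample = ["YLR185W YNL067W", "YLR185W YPL142C", "YDL051W YER049W"]
names, documents = build_documents(sample)
for name, document in zip(names, documents):
    print(name, "->", document)
```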

At this point we are ready to apply a classical Information Retrieval analysis. We vectorize each protein, so that we end up with a sparse 6336x6336 matrix of proteins with ones where there is an interaction. Then we apply Term Frequency-Inverse Document Frequency weighting to scale down the importance of highly interacting, frequent (and thus less informative) proteins. Lastly, we can use any classifier on the sparse features given by the combination of the vectorizer and TfIdf, together with the desired classes. In this case, we found that the Stochastic Gradient Descent classifier works very well.

# Note: in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection.
from sklearn.cross_validation import train_test_split
relsta, relste, yta, yte = train_test_split(rels, y[:,0], test_size=0.2)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Vectorize the protein "documents", reweight with TF-IDF, then classify with SGD.
text_clf = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("sgd", SGDClassifier())])
text_clf.fit(relsta, yta)
predicted = text_clf.predict(relste)

from sklearn.metrics import classification_report
print(classification_report(yte, predicted))

             precision    recall  f1-score   support

          0       0.85      0.91      0.88       972
          1       0.62      0.48      0.54       296

avg / total       0.80      0.81      0.80      1268

This analysis is very similar to what you can do to score 0.87 in the Amazon employee access challenge.

Thursday, May 30, 2013

Python as a data analysis platform

Despite the fact that I've been aware of Scikits Learn (sklearn) for some time, I never got the chance to really use Python for data analysis and, instead, I was a victim of my own inertia and limited myself to using R and especially Matlab.

I must say that, in the beginning, Python looked awkward: it was inconceivable to me to use an invisible element (spaces or tabs) as a structural construct of a program (defining blocks), in a way much similar to Fortran, which I always considered weird (coming from the C world). Yet this, together with the absence of the omnipresent C-syntax end-of-line semicolon, turned out to be a major productivity boost when programming in Python. Whatever is lost in computer performance is made up for by the speed the programmer gains when writing the software. This applies to general software too, such as the app server I am preparing, which is being written in Python on Google App Engine: it just runs smoothly, with no need for recompilation, clear syntax and one-line pieces of code for complex data processing.

Regarding data analysis, it is a little more complicated than Matlab's clear orientation towards numerical linear algebra (where everything is a Matrix). Good comparisons and reasons supporting my view are

It was precisely the last blog that spurred me to give it a try.

Now, going to Machine Learning specifics, sklearn has everything a machine learning practitioner needs for the majority of the work:

Data preprocessors, including text vectorizers and TF IDF preprocessors

SVM implementations

Stochastic Gradient Descent algorithms for fast regression and classification

Random Forest and other ensemble methods for robust regression and classification

Clustering algorithms

Data dimensionality reduction algorithms such as LLE, ISOMAP and spectral embeddings

Results presentation, including mean squared error for regression and precision/recall tables for classification. It even computes the area under the ROC curve.

Add to this the clean, standardized and well-designed interface: every object has a .fit method that performs the task of learning from samples, and then either a .transform method if the learning is unsupervised (such as LLE, ISOMAP, ICA, PCA, or the preprocessors, etc.) or a .predict method if the learning is supervised (SVM, SGD, ensemble...). It enables a pipelining mechanism that allows us to build the whole pipeline from data reading to results output.
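To show why a shared .fit/.transform/.predict convention makes pipelining almost free, here is a toy sketch in plain Python. It is not sklearn's implementation: the classes MeanCenterer, MidpointClassifier and ToyPipeline are made up for illustration and work on plain lists of numbers with labels 0 and 1.

```python
class MeanCenterer:
    """Toy unsupervised step: learns the mean in fit, subtracts it in transform."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class MidpointClassifier:
    """Toy supervised step: thresholds at the midpoint between the two class means.

    Assumes class 1 has the larger mean."""

    def fit(self, X, y):
        zeros = [x for x, label in zip(X, y) if label == 0]
        ones = [x for x, label in zip(X, y) if label == 1]
        self.threshold_ = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


class ToyPipeline:
    """Chains transformers and a final predictor behind a single fit/predict pair."""

    def __init__(self, steps):
        self.transformers = steps[:-1]
        self.predictor = steps[-1]

    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.predictor.fit(X, y)
        return self

    def predict(self, X):
        for step in self.transformers:
            X = step.transform(X)
        return self.predictor.predict(X)


model = ToyPipeline([MeanCenterer(), MidpointClassifier()])
model.fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
predictions = model.predict([0.0, 10.0])
print(predictions)
```

Because every step answers to the same method names, the pipeline never needs to know what each step does internally.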

One of the lead programmers of the project, Andreas Müller, has a very insightful blog. Check it out at the following URL:

peekaboo-vision.blogspot.com.es

I decided to be more active on Kaggle. For the moment I scored 13th on the Leaderboard of the Amazon employee access competition that recently opened. Competing against Alexander Larko or any of the other high-standing data scientists chills my blood.

Last but not least, future work seems to be bent on using the GPU to perform all the linear algebra. Check out
Gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html
Deep Belief Networks: http://deeplearning.net/tutorial/DBN.html
PyCUDA: http://documen.tician.de/pycuda/tutorial.html

Tuesday, April 30, 2013

Hadoop "Incompatible namespaceIDs" error

This is an error that I faced a while ago, and I wrote a blog entry to be published some day, so here it goes.

When setting up an installation, if not everything goes well, we might face a datanode/namenode desynchronization. You should immediately refer to the logs. There you can see the following message:

 ... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
        at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
        at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)

Go to your namenode's directory (dfs.name.dir) and get the namespaceID value from current/VERSION.
Go to your datanode's storage directory (dfs.data.dir) and edit current/VERSION.
Replace the datanode's namespaceID value with the namenode's value.
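The three steps above can be sketched as a small script. This is a minimal sketch, assuming the VERSION files are plain key=value text with a namespaceID= line; the function names are made up for illustration, and it is sensible to back up the datanode's VERSION file and run this with the datanode stopped.

```python
import re


def read_namespace_id(version_path):
    """Return the namespaceID recorded in a Hadoop current/VERSION file."""
    with open(version_path) as handle:
        for line in handle:
            match = re.match(r"namespaceID=(-?\d+)\s*$", line)
            if match:
                return match.group(1)
    raise ValueError("no namespaceID line in %s" % version_path)


def sync_namespace_id(namenode_version, datanode_version):
    """Overwrite the datanode's namespaceID with the namenode's value."""
    new_id = read_namespace_id(namenode_version)
    with open(datanode_version) as handle:
        lines = handle.readlines()
    with open(datanode_version, "w") as handle:
        for line in lines:
            if line.startswith("namespaceID="):
                line = "namespaceID=%s\n" % new_id
            handle.write(line)  # every other line is written back unchanged
    return new_id
```

A call would look like sync_namespace_id("<dfs.name.dir>/current/VERSION", "<dfs.data.dir>/current/VERSION"), with the two placeholders replaced by the directories configured in your installation.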

I was not able to repair the filesystem in any other way.

Monday, February 25, 2013

David MacKay's information theory & pattern recognition course

David MacKay is a well-known and very respected professor in the area of machine learning, information theory and neural networks. His work is astonishing, to say the least.

Surprisingly, he is also a concerned citizen and author of the book "Sustainable Energy - without the hot air", of which I have already spoken briefly. He has the healthy habit of releasing his books online...

Here you have a wonderful introductory course on information theory and pattern recognition:

http://videolectures.net/mackay_course_01/

Friday, February 22, 2013

Encapsulation in Python

I was reading the discussion on Stackoverflow about encapsulation in Python.

A user complained that he could still call a private method by using its mangled internal name.

>>> class MyClass:
...     def myPublicMethod(self):
...             print 'public method'
...     def __myPrivateMethod(self):
...             print 'this is private!!'
... 
>>> obj = MyClass()
>>> obj.myPublicMethod()
public method
>>> obj.__myPrivateMethod()
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: MyClass instance has no attribute '__myPrivateMethod'
>>> dir(obj)
['_MyClass__myPrivateMethod', '__doc__', '__module__', 'myPublicMethod']
>>> obj._MyClass__myPrivateMethod()
this is private!!

A good answer is, if this behavior is to be prevented, to check where the call was initiated

import re
import inspect

class MyClass :

    def __init__(self) :
        pass

    def private_function ( self ) :
        try :
            function_call = inspect.stack()[1][4][0].strip()

            # See if the function_call has "self." in the beginning
            matched = re.match( r'^self\.', function_call )
            if not matched :
                print 'This is Private Function, Go Away'
                return
        except :
            print 'This is Private Function, Go Away'
            return

        # This is the real Function, only accessible inside class #
        print 'Hey, Welcome in to function'

    def public_function ( self ) :
        # i can call private function from inside the class
        self.private_function()

### End ###
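A variation on the same idea, sketched here for Python 3, is to look at the caller's frame instead of matching its source text. This is only a sketch, not the Stackoverflow answer: the class Guarded is made up for illustration, inspect.currentframe is assumed to be available (it is on CPython), and the check is as easy to defeat as the original, since any caller can bind a local variable named self to the object.

```python
import inspect


class Guarded:
    def _private(self):
        # Look one frame up: the call is accepted only if the calling code has a
        # local variable named `self` that is this very object.
        frame = inspect.currentframe()
        caller = frame.f_back if frame is not None else None
        if caller is None or caller.f_locals.get("self") is not self:
            raise PermissionError("private method called from outside the instance")
        return "inside"

    def public(self):
        # Called from a method of the same object, so the check passes.
        return self._private()


obj = Guarded()
print(obj.public())
try:
    obj._private()
except PermissionError as error:
    print("rejected:", error)
```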

It's cool to use the internals of languages and compilers to access stuff that does not appear in the beginner's guide!

Saturday, February 9, 2013

Neil Lawrence's opening course

Neil Lawrence is Professor at the University of Sheffield. He has worked on unsupervised learning for a long time, and has developed algorithms applicable to dimensionality reduction such as the Gaussian Process Latent Variable model (GP-LVM), see the JMLR paper here.

He has a superb inaugural lecture in which he talks about Machine Learning. The link to the starting page to see that video is here. It opens some embedded and annoying player but it is worth dealing with it.

Neil Lawrence's Inaugural Lecture

Title: Life, The Universe and Machine Learning

Time: 17:15 Thursday 6th September 2012

Venue: St George's Church Lecture Theatre, University of Sheffield
Abstract
What is Machine Learning? Why is it useful for us? Machine learning algorithms are the engines that are driving forward an intelligent internet. They are allowing us to uncover the causes of cancer and helping us understand the way the universe is put together. They are suggesting who your friends are on facebook, enabling driverless cars and flagging potentially fraudulent transactions on your credit card. To put it simply, machine learning is about understanding data.
In this lecture I will try and give a sense of the challenges we face in machine learning, with a particular focus on those that have inspired my research. We will look at applications of data modelling from the early 19th century to the present, and see how they relate to modern machine learning. There will be a particular focus on dealing with uncertainty: something humans are good at, but an area where computers have typically struggled. We will emphasize the role of uncertainty in data modelling and hope to persuade the audience that correct handling of uncertainty may be one of the keys to intelligent systems.