Friday, May 31, 2013

Protein classification as a text mining problem

Man we are active on Kaggle.

I am writing a paper applying a non-linear kernel combination technique that I invented. The goal is to predict protein functions from protein interactions.

One of the methods I want to compare against is a linear regression of the protein functional classes on the interactions of a given protein with the rest of the proteins. In this sense, the proteins are a kind of dual of themselves, since they are used to define themselves (via their interactions).

The input files are taken from CYGD-mips.gsf.de/proj/yeast and look like this:

YJR002W YKL143W
YLL044W YPL238C
YDR471W YJL148W
YLR003C YNL174W
YGR285C YLR372W
YLR243W YLR435W
YKR057W YPL211W
YLR185W YNL067W
YLR185W YPL142C
YDL051W YER049W
YGL076C YNL248C
YNL247W YPL273W
YDR449C YLR129W

So you get the idea of the rest. Each line contains an interaction between two proteins.

Assuming we have read all the files and annotated, for each protein, which proteins it interacts with, we end up with an array of strings, one string per protein, containing the names of its interaction partners as if they were words in a text document.
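A minimal sketch of that preprocessing step, assuming a single whitespace-separated pairs file (the file name is made up; the actual CYGD data is spread over several tables, and in the real run rels is indexed over the full list of 6336 proteins, which is why the excerpt below contains empty strings):

from collections import defaultdict

# Hypothetical file of whitespace-separated interaction pairs, one per line,
# as in the excerpt above
neighbours = defaultdict(set)
with open("interactions.txt") as f:
    for line in f:
        a, b = line.split()
        neighbours[a].add(b)
        neighbours[b].add(a)  # interactions are symmetric

# One "document" per protein: the names of its interaction partners
proteins = sorted(neighbours)
rels = [" ".join(sorted(neighbours[p])) + " " for p in proteins]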


>>> rels[:100]
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'YLR203C YER154W YML129C ', 'YER142C ', '', '', '', '', '', 'YBL099W YPL078C ', 'YLR393W YBL099W YPL078C YER154W ', 'YER154W YHL038C YDR375C ', '', '', 'YPL160W ', 'YBL099W YPL078C ', 'YHL004W YKL155C YER154W ', '', 'YBR037C YBR024W YER154W YDR231C ', '', 'YER154W ', '', '', '', '']
>>>

At this point we are ready to apply a classical Information Retrieval analysis. We vectorize each protein, so that we end up with a sparse 6336x6336 protein-by-protein matrix with ones where there is an interaction. Then we apply Term Frequency-Inverse Document Frequency (TF-IDF) weighting to scale down the importance of highly interacting, frequent (and thus less informative) proteins. Lastly, we can use any classifier on the sparse features produced by the vectorizer and the TF-IDF transformer, together with the desired classes. In this case, we found that the Stochastic Gradient Descent classifier works very well.
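To make the intermediate representation concrete, here is a quick look at the vectorization step on its own (the exact dimensions depend on how many distinct protein names occur in rels); the pipeline below performs the same steps internally:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vect = CountVectorizer()
counts = vect.fit_transform(rels)   # sparse protein-by-protein interaction counts
print counts.shape                  # roughly (6336, 6336) on the full data set

tfidf = TfidfTransformer().fit_transform(counts)   # downweight promiscuous proteins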

from sklearn.cross_validation import train_test_split

# Hold out 20% of the proteins; y[:, 0] is the first functional class (binary)
relsta, relste, yta, yte = train_test_split(rels, y[:, 0], test_size=0.2)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Vectorize the interaction "documents", reweight with TF-IDF, classify with SGD
text_clf = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("sgd", SGDClassifier())])
text_clf.fit(relsta, yta)
predicted = text_clf.predict(relste)

from sklearn.metrics import classification_report
print classification_report(yte, predicted)

precision    recall  f1-score   support

          0       0.85      0.91      0.88       972
          1       0.62      0.48      0.54       296

avg / total       0.80      0.81      0.80      1268

This analysis is very similar to what you can do to score 0.87 in the Amazon employee access challenge.

Thursday, May 30, 2013

Python as a data analysis platform

Despite the fact that I've been aware of Scikits Learn (sklearn) for some time, I never got the chance to really use Python for data analysis; instead, I was a victim of my own inertia and limited myself to using R and especially Matlab.

I must say that, in the beginning, Python looked awkward: it was inconceivable to me to use an invisible element (spaces or tabs) as a structural construct of a program (defining blocks), much as in Fortran, which I always considered weird (coming from the C world). Yet this, along with the lack of the omnipresent C-style end-of-line semicolon, turns out to be a major boost when programming in Python. Whatever is lost in raw computational performance is made up for by the speed at which the programmer can write the software. This applies to general software as well, such as the app server that I am preparing, which is being written in Python on Google App Engine, and I have to say it just runs smoothly: no need for recompilations, clear syntax, and complex data processing in one-liners.

Regarding data analysis, it is a little more complicated than Matlab's clear orientation towards numerical linear algebra (where everything is a Matrix). Good comparisons and reasons supporting my view are

It was precisely the last of those posts that spurred me to give it a try.

Now, going to machine learning specifics, sklearn covers the majority of the work a machine learning practitioner will ever need:
Data preprocessors, including text vectorizers and TF IDF preprocessors
SVM implementations
Stochastic Gradient Descent algorithms for fast regression and classification
Random Forest and other ensemble methods for robust regression and classification
Clustering algorithms
Data dimensionality reduction algorithms such as LLE, ISOMAP and spectral embeddings
Results presentation, including mean squared error for regression and precision/recall tables for classification. It even computes the area under the ROC curve.
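As an illustration of that last point, a small sketch of the evaluation utilities on toy labels and scores (purely for demonstration):

from sklearn.metrics import classification_report, mean_squared_error, roc_curve, auc

y_true  = [0, 0, 1, 1]
y_pred  = [0, 1, 1, 1]
y_score = [0.1, 0.6, 0.8, 0.9]   # e.g. a classifier's decision_function output

print classification_report(y_true, y_pred)       # precision/recall/f1 table
print mean_squared_error([1.0, 2.0], [1.1, 1.9])  # regression error

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)                               # area under the ROC curve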

All this comes on top of a clean, standardized and well-designed interface: every object has a .fit method that performs the task of learning from samples, and then either a .transform method if the learning is unsupervised (LLE, ISOMAP, ICA, PCA, the preprocessors, etc.) or a .predict method if the learning is supervised (SVM, SGD, ensembles...). This enables a pipelining mechanism that allows us to build the whole pipeline from data reading to results output.
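A minimal sketch of that common interface on made-up data (the estimators and parameters here are just illustrative choices):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up data, just to exercise the interface
X = np.random.rand(100, 20)
y = (X[:, 0] > 0.5).astype(int)

# Unsupervised estimator: fit, then transform
pca = PCA(n_components=5)
X_red = pca.fit(X).transform(X)

# Supervised estimator: fit, then predict
clf = SGDClassifier().fit(X_red, y)
predicted = clf.predict(X_red)

# The shared interface is what lets the steps be chained into a Pipeline
pipe = Pipeline([("pca", PCA(n_components=5)), ("sgd", SGDClassifier())])
pipe.fit(X, y)
predicted = pipe.predict(X)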

One of the lead developers of the project, Andreas Müller, has a very insightful blog. Check it out at the following URL:
peekaboo-vision.blogspot.com.es

I decided to be more active on Kaggle. For the moment I am 13th on the leaderboard of the recently opened Amazon employee access competition. Competing against Alexander Larko or any of the other high-standing data scientists chills my blood.

Last but not least, future work seems bent on using the GPU to perform all the linear algebra. Check out:
Gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html
Deep Belief Networks: http://deeplearning.net/tutorial/DBN.html
PyCUDA: http://documen.tician.de/pycuda/tutorial.html