Friday, May 31, 2013

Protein classification as a text mining problem

Man we are active on Kaggle.

I am writing a paper that applies a non-linear kernel combination technique that I invented. The goal is to predict protein functions from protein interactions.

One of the methods that I want to compare against is the linear regression of the protein functional classes on the interactions of the given protein with the rest of the proteins. In this sense, the proteins are some kind of dual of themselves, since they are used to define themselves (via their interactions).

The input files are taken from CYGD (mips.gsf.de/proj/yeast) and look like this:

YJR002W YKL143W
YLL044W YPL238C
YDR471W YJL148W
YLR003C YNL174W
YGR285C YLR372W
YLR243W YLR435W
YKR057W YPL211W
YLR185W YNL067W
YLR185W YPL142C
YDL051W YER049W
YGL076C YNL248C
YNL247W YPL273W
YDR449C YLR129W

So you get an idea of the rest. Each line contains an interaction between two proteins.

Assuming we have read all the files and annotated, for each protein, which proteins it interacts with, we end up with an array of strings, one string per protein, containing protein names as if in a text document.


>>> rels[:100]
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'YLR203C YER154W YML129C ', 'YER142C ', '', '', '', '', '', 'YBL099W YPL078C ', 'YLR393W YBL099W YPL078C YER154W ', 'YER154W YHL038C YDR375C ', '', '', 'YPL160W ', 'YBL099W YPL078C ', 'YHL004W YKL155C YER154W ', '', 'YBR037C YBR024W YER154W YDR231C ', '', 'YER154W ', '', '', '', '']
>>>
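As a sketch of that preprocessing step (the pair list below is a made-up stand-in for the parsed CYGD files, not real data):

```python
from collections import defaultdict

# Made-up stand-in for the parsed interaction files: one tuple per line.
pairs = [("YJR002W", "YKL143W"),
         ("YLL044W", "YPL238C"),
         ("YJR002W", "YPL238C")]

partners = defaultdict(list)
for a, b in pairs:
    # Interactions are symmetric, so each protein lists the other as a partner.
    partners[a].append(b)
    partners[b].append(a)

# One "document" per protein: its partners as a space-separated string.
proteins = sorted(partners)
rels = [" ".join(partners[p]) for p in proteins]
# rels[0] is "YKL143W YPL238C", the partners of YJR002W
```

In the actual run the array is indexed over all 6336 proteins, which is why `rels` above contains empty strings for proteins with no recorded interactions.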

At this point we are ready to apply a classical Information Retrieval analysis. We vectorize each protein, so that we end up with a sparse 6336x6336 matrix with ones where there is an interaction. Then we apply Term Frequency-Inverse Document Frequency (TF-IDF) to scale down the importance of highly interacting, frequent (and thus less informative) proteins. Lastly, we can train any classifier on the sparse features produced by the combination of the vectorizer and TF-IDF to predict the desired functional classes. In this case, we found that the Stochastic Gradient Descent classifier works very well.
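To see the down-weighting in action on a toy example (the three interaction strings below are made up), note that a partner appearing in every document gets a lower TF-IDF weight than a rare one:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy "documents": made-up interaction strings for three proteins.
docs = ["YKL143W YPL238C", "YKL143W", "YKL143W YDR471W"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)          # 3x3 sparse matrix of raw counts
tfidf = TfidfTransformer().fit_transform(counts)

# In the first document both partners occur once, but YKL143W appears in
# every document while YPL238C appears in only one, so TF-IDF gives
# YPL238C the larger weight.
row0 = tfidf.toarray()[0]
```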

from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hold out 20% of the proteins; y[:, 0] is the first functional class.
relsta, relste, yta, yte = train_test_split(rels, y[:, 0], test_size=0.2)

text_clf = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("sgd", SGDClassifier())])
text_clf.fit(relsta, yta)
predicted = text_clf.predict(relste)

from sklearn.metrics import classification_report
print(classification_report(yte, predicted))

precision    recall  f1-score   support

          0       0.85      0.91      0.88       972
          1       0.62      0.48      0.54       296

avg / total       0.80      0.81      0.80      1268

This analysis is very similar to what you can do to score 0.87 in the Amazon Employee Access Challenge.
