Thursday, November 7, 2013

Naive Bayes with Map Reduce

A fairly straighforward way of implmenting the Naive Bayes classifier for discrete data is using Map Reduce. This is especially useful if you have a bunch of characteristic or naturally discrete data that you can exploit, such as presence/absence, amount of clicks, page/item visited or not, etc.

This can be achieved by first using the data attributes as the key, and the labels as the values on the mapper, in which we need to process the keys and values in this way:
• emit the label as key
• for each variable (attribute) emit its index (for example, column index) also as key
We only need to emit the category (attribute value) as the value

In the reducer, we need to scan each category and find out how many of the elements in the current key belong to to a category, and divide by the sum of all its categories (which are our values) all which constitutes $P(X_i=x_{i,0}|y=y_0)$, for which we emit a triplet
• emit the label as key
• for each variable (attribute) emit its index (for example, column index) also as key
• emit the category for this attribute of this example
As value we only need to emit the previous division.

To find out a new instance, we look into the dictionary entry corresponding to its attributes and return the bayes quotient.

I've just implemented this in MyML.

As an example:import numpy as np
Xd=np.random.random((256,2))
X=1*(Xd<.5)
y=1*(Xd.sum(axis=1)<.5)

from myml.supervised import bayes

nb = bayes.NaiveBayes()
nb.fit(X, y)
nb.predict(X[0,:])
pred=nb.predict(X)

1.0*np.sum(1.0*(pred>.5).reshape((1,len(y)))[0]==y)/len(y)

0.89453125
print X.T
[[0 1 0 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 1 1
0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0
1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1
1 0 1 1 1 1 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0 1 0 0 1
1 1 0 0 1 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0
0 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1]
[0 0 1 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0
1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1
0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 0
1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0 0 0
0 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 1 1 0 1 0
1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1
1 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0]]

y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0])