Machinomics: November 2013

Wednesday, November 20, 2013

Google Highway 101 brainteaser

Some days ago I read about the famous Google application invitation in the form of an ad on Highway 101 passing through LA.

The ad shows a conceiled URL that we need to guess by computing the first 10-digit prime that appears in the series of consecutive digits of the irrational e. It is an oldie but goodie.

I thought I could solve this in Clojure. Here is my solution.

First we need to compute the number e with as many digits as necessary.
For that we can implementan unbounded spigot algorithm for the number e. I googled about this and found out a blog with superb material for this exercise, so I implemented the ideas in Clojure.

(defn digitse [N n] (if (= n N) (.setScale 2.5M 200)
                      (if (= n 0)
                        (+ 2.0M (* (.divide
                                   (.setScale 1.0M 200 BigDecimal/ROUND_HALF_UP) 2.0M BigDecimal/ROUND_HALF_UP)
                                   (digitse N (inc n) )) )
                      (+ 1.0M (* (.divide
                                   (.setScale 1.0M 200 BigDecimal/ROUND_HALF_UP) (+ n 2.0M) BigDecimal/ROUND_HALF_UP)
                                 (digitse N (inc n)))) ) ))

We then create a subsequence with sufficient decimal places. It turs out that 150 digits is enough. We also convert it to string to improve the partition of our 10-digit (now characters) strings.

(def e (digitse 150 0))

(def se (.subSequence (.toString e) 0 150 ))

We need to define a function that tells us wheter a given number is a prime. As we know, we need only test up to its square root, but here a simple loop over the 10-digit numbers show us that none of the square roots exceeds 89000, so we take the first 8500 prime numbers, starting in 2, from a prime number generator, that I took from here.

(defn gen-primes "Generates an infinite, lazy sequence of prime numbers"
[]
(let [reinsert (fn [table x prime]
                   (update-in table [(+ prime x)] conj prime))]
    (defn primes-step [table d]
                 (if-let [factors (get table d)]
                   (recur (reduce #(reinsert %1 d %2) (dissoc table d) factors)
                          (inc d))
                   (lazy-seq (cons d (primes-step (assoc table (* d d) (list d))
                                                 (inc d))))))
    (primes-step {} 2)))

Please note that this generator is pretty fast, compared to the answers that I have read on the internet. Now we take the primes we need from this lazy sequence.

(def first-primes (take 8500 (gen-primes) ) )

And define a function to test that the condition that some remainder of a given number between any of the prime denominators is zero does not happen, which means that our number is prime.

(defn prime-restricted? [n first-primes]
(= nil (some #(= 0 (rem n %)) first-primes) ))

Finally, we recursivelly (with no stack overhead) until either a prime is found or we exceed the 150 digits capacity (in which case we would just increase it, but we don't need to).

(defn find-first-prime-in-e [se init first-primes]
(if (<= (- (.length se) init) 10)
    nil
    (let [number (java.lang.Long/parseLong (.subSequence se init (+ init 10)) )]
      (println init number)
      (if (prime-restricted? number first-primes)
        number
        (recur se (inc init) first-primes)
        )))
)

The last output lines and the returned value of the previous function are

96 6642742746
97 6427427466
98 4274274663
99 2742746639
100 7427466391
7427466391

So, it turns out that the first 10-digit prime number is in place 100, which makes the URL

7427466391.com

Upon connection, you would get another quizz, which having in mind the quantity you just computed is fairly straighforward.

Sunday, November 17, 2013

Two very simple Python functions to ckeck prime numbers and list divisors

Here are two simple functions (with no error checking, so watch your inputs) to check whether a number is prime and to list all divisors of a number.

To check for prime numbers:

def prime(x): return not any ( ( x % (np.array(range( int (np.sqrt(x)) ) ) + 2) == 0 ).tolist() )

Remember that the fundamental theorem of arithmetics guarantees that every integer can be decomposed in prime numbers, and that it is necessary to divide up only to a number's square root to know whether it is prime, a fact known in ancient Greece among many facts about integer arithmetic (for instance, Euclid proved that there are infinite prime numbers in Proposition 20 of his Elements)

def divisors(x):
return np.array( [k for k in range( 2, int(np.sqrt(x) ) ) if prime(k) and not x % k] )

In this case we need to check whether primes from 2 to half of it (we could optimize this by finding the minimum factor, and the range of checks we really need to do is (min_factor(x) , x/min_factor(x) ) ), since having a factor larger than its half would imply it is prime by the previous principle.

Thursday, November 7, 2013

Naive Bayes with Map Reduce

A fairly straighforward way of implmenting the Naive Bayes classifier for discrete data is using Map Reduce. This is especially useful if you have a bunch of characteristic or naturally discrete data that you can exploit, such as presence/absence, amount of clicks, page/item visited or not, etc.

This can be achieved by first using the data attributes as the key, and the labels as the values on the mapper, in which we need to process the keys and values in this way:

emit the label as key
for each variable (attribute) emit its index (for example, column index) also as key

We only need to emit the category (attribute value) as the value

In the reducer, we need to scan each category and find out how many of the elements in the current key belong to to a category, and divide by the sum of all its categories (which are our values) all which constitutes $P(X_i=x_{i,0}|y=y_0)$, for which we emit a triplet

emit the label as key
for each variable (attribute) emit its index (for example, column index) also as key
emit the category for this attribute of this example

As value we only need to emit the previous division.

To find out a new instance, we look into the dictionary entry corresponding to its attributes and return the bayes quotient.

I've just implemented this in MyML.

As an example:import numpy as np

Xd=np.random.random((256,2))
X=1*(Xd<.5)
y=1*(Xd.sum(axis=1)<.5)

from myml.supervised import bayes

reload(bayes)
nb = bayes.NaiveBayes()
nb.fit(X, y)
nb.predict(X[0,:])
pred=nb.predict(X)

1.0*np.sum(1.0*(pred>.5).reshape((1,len(y)))[0]==y)/len(y)

0.89453125

print X.T
[[0 1 0 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0
1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 1 1
0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0
1 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1
1 0 1 1 1 1 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0 1 0 0 1
1 1 0 0 1 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0
0 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1]
[0 0 1 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0
1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1
0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 0
1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0 0 0
0 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 1 1 0 1 0
1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1
1 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0]]

y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

Wednesday, November 6, 2013

True random numbers in R

We know that rnorm() computes the output from an internal seed it keeps as a state variable. The rest are computations and, therefore, it allows up to compute the output if we guess the seed.

According to the CRAN package description, the random package in R provides an interface to the true random number service provided by the random.org website. It operates by sampling atmospheric noise via radio tuned to an unused broadcasting frequency together with a skew correction algorithm due to John von Neumann (which I don't know what it means yet). So if you are ever interested in improving your random numbers to better perform your simulations, bear this package in mind.

Tuesday, November 5, 2013

Why everything in the classification field is irrelevant but two things

Yes, everything in the binary classification problem in Machine Learning is irrelevant: Throw your Logistic Regression algorithm away, forget about SVM with universal kernels, to hell with neural networks. All you need is:

Multi-dimensional Gaussianization + Naive Bayes

Gaussianization is the process of making a variable Gaussian. One-dimensional Gaussianization is trivial: just take the inverse Gaussian CDF and apply it to any random variable's CDF. This requires knowing our RV's CDF but one-dimensional RVs estimation offer no problem unless we have too few examples, and can be achieved by Gaussian Mixture Models for most cases. Multi-dimensional Gaussianization is more elaborate and there are several procedures. Let's assume we have a procedure to gaussianize a multi-dimensional RV. It is sufficient that the gaussianize version ends up with the identity covariance matrix (which can be directly the output by the procedure or can be done by just a rotation and scaling). Once we get a RV with identity covariance matrix, we know that the one-dimensional RVs in it are independent. This can be the input to a Naive Bayes Classifier and complies with all its assumptions (independence of variables), which automatically yields the best classifier according to the underlyting probability distributions.

In Chen, Scott Shaobing, and Ramesh A. Gopinath. "Gaussianization." (2000), the authors show a method to gaussianize multi-dimensional RVs. It is based on expectation-maximization iterations, in which one estimates the best gaussian distribution and then finds the parameters and rotations that best describe that distribution. At each iteration, the negentropy (the Kullback-Leibler divergence between our currently transformed RV's distribution and a standard Gaussian) is less than the previous interation's. Firstly, by finding a rotation we achieve less dependence, and then by marginal gaussianization we zero-out marginal neg-entropy. This procedure converges weakly (in distribution) and we end up with a multivariate Gaussian. With the chain of estimated rotations and mixture model parameters we can get the transformation we need for new (test) data. Therefore, classification is straighforward with Naive Bayes, and we certainly know that we fully meet its assumptions.

I will be implementing Gaussianization in MyML.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.7803

Machinomics