## Saturday, March 31, 2012

### Scientific computing at home with CUDA

I wanted to write about CUDA because I feel it is the future of non-desktop computation, by which I mean any processing that goes beyond moving data around and must apply scientific computation to extract information from raw data.

CUDA (Compute Unified Device Architecture) is a form of GPGPU (general-purpose computing on graphics processing units) that uses the graphics processors on NVIDIA graphics cards as computation units. One sends programs, called kernels, that perform those computations to the graphics device.

The importance of CUDA lies in its hardware architecture, aimed specifically at vector computations. Scientific and graphics software that makes extensive use of arithmetic operations will therefore benefit from CUDA parallelization: this includes anywhere you see matrix algebra, such as quadratic optimization (including SVMs), PCA, ICA, and CCA, and other discretized operations such as the fast Fourier transform or wavelet filter banks. On the other hand, software dominated by large memory transfers is largely impervious to arithmetic parallelization of any kind; this includes databases, web servers, and algorithmic or protocol-driven networking software.

It is interesting for data processing practitioners because of how it copes with large datasets. Modern CPUs contain a large amount of L2 cache, and the tree-structured cache may extend up to three levels. This design works because typical software has high data locality, i.e., a predominance of serial (and thus contiguously accessible) data. However, when sweeping quickly across a large dataset, the performance of this cache design decays very rapidly. The GPU memory interface is very different from that of the CPU: GPUs use massively parallel interfaces to connect to their memory. For example, the GTX 280 uses a 512-bit interface to its high-performance GDDR3 memory, roughly ten times faster than a typical CPU-to-RAM interface. On the other hand, GPUs lack the vast amount of memory that the main CPU system enjoys. Customized, large-memory parallel GPUs are commercially available, at a higher price than that of a high-performing home gaming system, but a great deal of highly parallelized software can nonetheless be run on commodity hardware.

There is a serious drawback from a software engineering point of view: unlike most programming languages, CUDA is coupled very closely to the hardware implementation. While CPU families basically do not change, in order to maintain backward compatibility, CUDA-enabled hardware changes significantly in basic architectural design from one generation to the next.

An important concept to be aware of is the thread hierarchy. CPUs are designed to run just a few threads; GPUs, on the other hand, are designed to process thousands of threads simultaneously. So, to take full advantage of your graphics card, the problem must lend itself to being broken down into small pieces. The important CUDA tree-like thread organizational structures to bear in mind are:

- Half-warp: A half-warp is a group of 16 consecutive threads. Half-warp threads are generally executed together and aligned; for instance, threads 0-15 will be in the same half-warp, threads 16-31 in the next, and so on.
- Warp: A warp of threads is a group of 32 consecutive threads. On NVIDIA computing devices, threads in the same warp may all be executed together in parallel. Therefore, it is a good idea to write your programs as if all threads within the same warp will execute together in parallel.
- Block: A block is a collection of threads. For technical reasons, blocks should have at least 192 threads to obtain maximum efficiency and full latency avoidance. Typically, blocks might contain 256, 512, or even 768 threads. Here’s the important thing you need to know: threads within the same block can synchronize with each other and quickly communicate with each other.
- Grid: A grid is a collection of blocks. Blocks cannot synchronize with each other, and therefore threads within one block cannot synchronize with threads in another block.

Regarding memory, CUDA uses the following kinds:

- Global memory: Global memory can be thought of as the physical memory on your graphics card. All threads can read from and write to global memory.
- Shared memory: A GPU consists of many processors, or multiprocessors. Each multiprocessor has a small amount of shared memory, with a usual size of 16KB. Shared memory is generally used as a very quick working space for threads within a block, and it is allocated on a block-by-block basis. For example, you may have three blocks running concurrently on the same multiprocessor; this means that the maximum amount of shared memory each block can reserve is 16KB/3. Threads within the same block can quickly and easily communicate with each other by writing to and reading from the shared memory. It’s worth mentioning that shared memory is at least 100 times faster than global memory, so it’s very advantageous if you can use it correctly.
- Texture memory: A GPU also has texture units and memory, which can be taken advantage of in some circumstances. Unlike global memory, texture memory is cached, and it is generally read-only. It can be exploited by using it to store chunks of raw data to process.

From the software development point of view, the program that runs within each thread is the kernel:

Kernels: Within a warp, a kernel is executed multiple times in SIMD (single instruction, multiple data) fashion, which means that the flow of execution in the processors is the same, executing the same instruction, but each processor applies that instruction to different data. We are interested in splitting the software into pieces such that each piece is executed multiple times on different data. For example, in the matrix multiplication A*B, a single kernel may have the task of multiplying one row of A by one column of B. When this kernel is instantiated in N x M threads, each instantiation (thread) performs the dot product of a specific row of A with a specific column of B.

At compilation and linkage time, the kernel, which is CUDA code, is stored as data in the program (as seen by the operating system); it is subsequently sent to the GPU via the corresponding DMA channel. Once on the GPU, the kernel is instantiated on each processor.

I made a quick example that can be compiled directly with Visual C++ with the help of the included project files (assuming one has the CUDA toolkit, driver, and Visual C++ environment set up). It can also be compiled anywhere one has a working C++ and NVCC environment.

I will write a follow-up article on the basic setup with Visual C++ 2008 Express, which requires some tweaks to get up and running.

## Tuesday, March 20, 2012

### Multiclass SVM with e1071

When dealing with multi-class classification using the R package e1071, which wraps LibSVM, one faces the problem of correctly predicting values, since the predict function does not seem to deal effectively with this case. In fact, testing the very example that comes with the svm help (?svm on the R command line), one sees the function failing (albeit with a correctly fitted model).

```r
library(e1071)
data(iris)
attach(iris)

model <- svm(Species ~ ., data = iris)

x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y, probability = TRUE)
```

This model is correctly fitted. However,

```r
pred <- predict(model, x)
```

does not show the correct values:

```
> pred
     1      2      3      4      5      6      7      8      9     10
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
...
   141    142    143    144    145    146    147    148    149    150
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
```
We see that every prediction is setosa, even though the dataset is equally divided between the three classes (setosa, versicolor, and virginica). Something went wrong with the direct prediction, but one way to overcome the problem is to use the predicted probabilities, which do seem to be well computed:

```r
pred <- predict(model, x, decision.values = TRUE, probability = TRUE)
```

Observe how they look:

```
> attr(pred, "probabilities")
         setosa  versicolor   virginica
1   0.980325653 0.011291686 0.008382661
2   0.972927977 0.018061300 0.009010723
3   0.979044327 0.011921880 0.009033793
...
48  0.977140826 0.013050710 0.009808464
49  0.977831001 0.013359834 0.008809164
50  0.980099521 0.011501036 0.008399444
51  0.017740468 0.954734399 0.027525133
52  0.010394167 0.973918376 0.015687457
...
97  0.009263806 0.986123276 0.004612918
98  0.008503724 0.988168405 0.003327871
99  0.025068812 0.965415124 0.009516064
100 0.007514580 0.987584706 0.004900714
101 0.012482541 0.002502134 0.985015325
...
149 0.013669944 0.017618659 0.968711397
150 0.010205071 0.140882630 0.848912299
```

Now we see that the most probable class is indeed the ground truth, and we can correctly classify with the following function:
```r
predsvm <- function(model, newdata) {
  prob <- attr(predict(model, newdata, probability = TRUE), "probabilities")

  ## index of the most probable class for each observation
  ## (apply/which.max preserves the row order of newdata)
  me <- apply(prob, 1, which.max)
  as.factor(colnames(prob)[me])
}
```

One might also program the following function, which deals with the way the support vector coefficients are stored in the model object, in model$coefs and model$rho:

```r
## Linear kernel
K <- function(i, j) crossprod(i, j)

predsvm <- function(object, newdata) {
  ## compute start index of each class's support vectors
  start <- c(1, cumsum(object$nSV) + 1)
  start <- start[-length(start)]

  ## compute kernel values between each support vector and newdata
  kernel <- sapply(1:object$tot.nSV,
                   function(x) K(object$SV[x, ], newdata))

  ## compute raw prediction for classifier (i,j)
  predone <- function(i, j) {
    ## ranges for class i and j:
    ri <- start[i] : (start[i] + object$nSV[i] - 1)
    rj <- start[j] : (start[j] + object$nSV[j] - 1)

    ## coefs for (i,j):
    coef1 <- object$coefs[ri, j - 1]
    coef2 <- object$coefs[rj, i]

    ## return raw values:
    crossprod(coef1, kernel[ri]) + crossprod(coef2, kernel[rj])
  }

  ## compute votes for all pairwise classifiers
  votes <- rep(0, object$nclasses)
  c <- 0  # rho counter
  for (i in 1:(object$nclasses - 1))
    for (j in (i + 1):object$nclasses)
      if (predone(i, j) > object$rho[c <- c + 1])
        votes[i] <- votes[i] + 1
      else
        votes[j] <- votes[j] + 1

  ## the predicted class is the one with the most votes
  object$levels[which.max(votes)]
}
```