## Saturday, March 31, 2012

### Scientific computing at home with CUDA

I wanted to write about CUDA because I feel it is the future of non-desktop computation, by which I mean any processing that goes beyond moving data around and must apply scientific computation to extract information from raw data.

CUDA (Compute Unified Device Architecture) is a form of GPGPU (general-purpose computing on graphics processing units) that uses the graphics processors on NVIDIA graphics cards as computation units. One sends programs, called kernels, that perform those computations to the graphics device.

The importance of CUDA lies in its hardware architecture, aimed specifically at vector computations. Scientific and graphics software that makes extensive use of arithmetic operations will therefore benefit from CUDA parallelization: this includes anywhere you see matrix algebra, such as quadratic optimization (including SVMs), PCA, ICA, and CCA, and other discretized operations such as the fast Fourier transform or wavelet filter banks. On the other hand, software dominated by large memory transfers is largely impervious to arithmetic parallelization of any kind; this includes databases, web servers, and algorithmic or protocol-driven networking software.

It is interesting for data processing practitioners because of how it copes with large datasets. Modern CPUs contain a large amount of L2 cache, and the tree-structured cache may extend up to three levels. This design works because typical software has high data locality, i.e., a predominance of serial (and thus contiguously accessible) data. However, when sweeping quickly across a large dataset, the performance of this cache design decays very rapidly. The GPU memory interface is very different from that of the CPU: GPUs use massively parallel interfaces to connect to their memory. For example, the GTX 280 uses a 512-bit interface to its high-performance GDDR3 memory, roughly ten times faster than a typical CPU-to-RAM interface. On the other hand, GPUs lack the vast amount of memory that the main CPU system enjoys. Customized, large-memory parallel GPUs are commercially available, at a higher price than that of a high-performing home gaming system, but a great deal of highly parallelized software can nonetheless be run on commodity hardware.

There is a serious drawback from a software engineering point of view: unlike most programming languages, CUDA is coupled very closely to the hardware implementation. While CPU families basically do not change, in order to maintain backward compatibility, CUDA-enabled hardware changes significantly in basic architectural design from one generation to the next.

An important concept to be aware of is the thread hierarchy. CPUs are designed to run just a few threads; GPUs, on the other hand, are designed to process thousands of threads simultaneously. So, to take full advantage of your graphics card, the problem must lend itself to being broken down into small pieces. The important CUDA tree-like thread organizational structures to bear in mind are:

- Half-warp: A half-warp is a group of 16 consecutive threads. Half-warp threads are generally executed together and aligned; for instance, threads 0-15 will be in the same half-warp, threads 16-31 in the next, and so on.
- Warp: A warp of threads is a group of 32 consecutive threads. On NVIDIA computing devices, threads in the same warp may all be executed together in parallel. Therefore, it is a good idea to write your programs as if all threads within the same warp will execute together in parallel.
- Block: A block is a collection of threads. For technical reasons, blocks should have at least 192 threads to obtain maximum efficiency and full latency avoidance. Typically, blocks might contain 256, 512, or even 768 threads. Here’s the important thing you need to know: threads within the same block can synchronize with each other and quickly communicate with each other.
- Grid: A grid is a collection of blocks. Blocks cannot synchronize with each other, and therefore threads within one block cannot synchronize with threads in another block.

Regarding memory, CUDA uses the following kinds:

- Global memory: Global memory can be thought of as the physical memory on your graphics card. All threads can read from and write to global memory.
- Shared memory: A GPU consists of many processors, or multiprocessors. Each multiprocessor has a small amount of shared memory, with a usual size of 16KB. Shared memory is generally used as a very quick working space for threads within a block, and it is allocated on a block-by-block basis. For example, you may have three blocks running concurrently on the same multiprocessor; this means that the maximum amount of shared memory each block can reserve is 16KB/3. Threads within the same block can quickly and easily communicate with each other by writing to and reading from the shared memory. It’s worth mentioning that shared memory is at least 100 times faster than global memory, so it’s very advantageous if you can use it correctly.
- Texture memory: A GPU also has texture units and memory, which can be taken advantage of in some circumstances. Unlike global memory, texture memory is cached, and it is generally read-only. It can be exploited by using it to store chunks of raw data to process.

From the software development point of view, the program that runs within each thread is the kernel:

Kernels: Within a warp, a kernel is executed multiple times in SIMD (single instruction, multiple data) fashion, which means that the flow of execution in the processors is the same, executing the same instruction, but each processor applies that instruction to different data. We are interested in splitting the software into pieces such that each piece is executed multiple times on different data. For example, in the matrix multiplication A*B, a single kernel may have the task of multiplying one row of A by one column of B. When this kernel is instantiated in N x M threads, each instantiation (thread) performs the dot product of a specific row of A with a specific column of B.

At compilation and linkage time, the kernel, which is CUDA code, is stored as data in the program (as seen by the operating system); it is subsequently sent to the GPU via the corresponding DMA channel. Once on the GPU, the kernel is instantiated on each processor.

I made a quick example that can be compiled directly with Visual C++ with the help of the included project files (assuming one has the CUDA toolkit, driver, and Visual C++ environment set up). It can also be compiled anywhere one has a working C++ and NVCC environment.

I will write a follow-up article on the basic setup with Visual C++ 2008 Express, which requires some tweaks to get up and running.

## Tuesday, March 20, 2012

### Multiclass SVM with e1071

When dealing with multi-class classification using the R package e1071, which wraps LibSVM, one faces the problem of correctly predicting values, since the predict function does not seem to deal effectively with this case. In fact, testing the very example that comes with the svm help (?svm on the R command line), one sees the function failing (albeit with a correctly fitted model).

```r
library(e1071)
data(iris)
attach(iris)

model <- svm(Species ~ ., data = iris)

x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y, probability = TRUE)
```

This model is correctly fitted. However,

```r
pred <- predict(model, x)
```

does not show the correct values:

```
> pred
     1      2      3      4      5      6      7      8      9     10
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
...
   141    142    143    144    145    146    147    148    149    150
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
```
We see that every prediction is setosa, even though the dataset is equally divided between the three classes (setosa, versicolor, and virginica). Something went wrong with the direct prediction, but one way to overcome the problem is to use the predicted probabilities, which do seem to be well computed:

```r
pred <- predict(model, x, decision.values = TRUE, probability = TRUE)
```

Observe how they look:

```
> attr(pred, "probabilities")
         setosa  versicolor   virginica
1   0.980325653 0.011291686 0.008382661
2   0.972927977 0.018061300 0.009010723
3   0.979044327 0.011921880 0.009033793
...
48  0.977140826 0.013050710 0.009808464
49  0.977831001 0.013359834 0.008809164
50  0.980099521 0.011501036 0.008399444
51  0.017740468 0.954734399 0.027525133
52  0.010394167 0.973918376 0.015687457
...
97  0.009263806 0.986123276 0.004612918
98  0.008503724 0.988168405 0.003327871
99  0.025068812 0.965415124 0.009516064
100 0.007514580 0.987584706 0.004900714
101 0.012482541 0.002502134 0.985015325
...
149 0.013669944 0.017618659 0.968711397
150 0.010205071 0.140882630 0.848912299
```

Now we see that the most probable class is indeed the ground truth, and we can correctly classify with the following function:
```r
predsvm <- function(model, newdata) {
  prob <- attr(predict(model, newdata, probability = TRUE), "probabilities")

  ## index of the most probable class for each observation
  ## (apply/which.max preserves the row order of newdata)
  me <- apply(prob, 1, which.max)
  as.factor(colnames(prob)[me])
}
```

One might also program the following function, which deals with the way the support vector coefficients are stored in the model object, in model$coefs and model$rho:

```r
## Linear kernel
K <- function(i, j) crossprod(i, j)

predsvm <- function(object, newdata) {
  ## compute start index of each class's support vectors
  start <- c(1, cumsum(object$nSV) + 1)
  start <- start[-length(start)]

  ## compute kernel values between each support vector and newdata
  kernel <- sapply(1:object$tot.nSV,
                   function(x) K(object$SV[x, ], newdata))

  ## compute raw prediction for classifier (i,j)
  predone <- function(i, j) {
    ## ranges for class i and j:
    ri <- start[i] : (start[i] + object$nSV[i] - 1)
    rj <- start[j] : (start[j] + object$nSV[j] - 1)

    ## coefs for (i,j):
    coef1 <- object$coefs[ri, j - 1]
    coef2 <- object$coefs[rj, i]

    ## return raw values:
    crossprod(coef1, kernel[ri]) + crossprod(coef2, kernel[rj])
  }

  ## compute votes for all pairwise classifiers
  votes <- rep(0, object$nclasses)
  c <- 0  # rho counter
  for (i in 1:(object$nclasses - 1))
    for (j in (i + 1):object$nclasses)
      if (predone(i, j) > object$rho[c <- c + 1])
        votes[i] <- votes[i] + 1
      else
        votes[j] <- votes[j] + 1

  ## the predicted class is the one with the most votes
  object$levels[which.max(votes)]
}
```