Saturday, March 31, 2012

Scientific computing at home with CUDA

I wanted to write about CUDA because I feel it is the future of non-desktop computation, by which I mean any process that goes beyond moving data around and has to apply scientific computation to obtain information from raw data.

CUDA (Compute Unified Device Architecture) is a form of GPGPU that uses the graphics processors on NVIDIA graphics cards as computation units. One can send programs (called kernels) that perform those computations to the graphics device.
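
To make this concrete, here is a minimal sketch of what a kernel and its launch from the host look like (it assumes a working CUDA toolkit; the kernel name and sizes are just illustrative):

// Minimal sketch of a CUDA kernel and its launch (illustrative names).
// Each thread scales one element of the array by a constant factor.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));          // allocate memory on the device
    // ... copy input data into d_data with cudaMemcpy ...
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  // launch the kernel on the GPU
    cudaDeviceSynchronize();                         // wait for the GPU to finish
    cudaFree(d_data);
    return 0;
}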

The importance of CUDA lies in its hardware architecture, which is aimed specifically at vector computations. Scientific and graphics software that makes extensive use of arithmetic operations will therefore benefit from CUDA parallelization; this includes anything built on matrix algebra, such as quadratic optimization (including SVMs), PCA, ICA and CCA, as well as other discretized operations such as the fast Fourier transform, wavelet filter banks and so on. On the other hand, software dominated by large memory transfers gains little from arithmetic parallelization of any kind; this includes databases, web servers, algorithmic/protocol-driven networking software, etc.

It is interesting for data processing practitioners because it can cope with large datasets. Modern CPUs contain a large amount of L2 cache, and the tree-structured cache may extend up to three levels. This design exists because typical software has high data locality, meaning a predominance of serial (and thus contiguously accessed) code and data. However, when sweeping quickly across a large dataset, the performance of this cache design decays very rapidly. The GPU memory interface is very different from that of the CPU: GPUs use massively parallel interfaces to connect to their memory. For example, the GTX 280 uses a 512-bit interface to its high-performance GDDR3 memory, approximately 10 times faster than a typical CPU-RAM interface. On the other hand, GPUs lack the vast amount of memory that the main CPU system enjoys. Specialized, large-memory parallel GPUs are commercially available, but at a higher price than that of a high-performing home gaming system. Still, a great deal can be achieved with highly parallelized software notwithstanding.

There is a serious drawback from a software engineering point of view: unlike most programming languages, CUDA is coupled very closely to the hardware implementation. While CPU families change little in order to maintain backwards compatibility, CUDA-enabled hardware changes significantly in basic architectural design from one generation to the next.

An important concept to be aware of is the thread hierarchy. CPUs are designed to run just a few threads; GPUs, on the other hand, are designed to process thousands of threads simultaneously. So, in order to take full advantage of your graphics card, the problem must lend itself to being broken down into small pieces. These are the tree-like thread organizational structures one has to bear in mind (a short indexing sketch follows the list):
Half-warp: A half-warp is a group of 16 consecutive threads. Half-warp threads are generally executed together and aligned; for instance, threads 0-15 form one half-warp, threads 16-31 the next, and so on.
Warp: A warp is a group of 32 consecutive threads. Threads in the same warp are scheduled and executed together in parallel, so it is a good idea to write your programs as if all threads within the same warp will execute in lockstep.
Block: A block is a collection of threads. For technical reasons, blocks should have at least 192 threads to obtain maximum efficiency and hide instruction latency. Typical blocks contain 256, 512, or even 768 threads. Here is the important thing you need to know: threads within the same block can synchronize with each other and communicate with each other quickly.
Grid: A grid is a collection of blocks. Blocks cannot synchronize with each other, and therefore threads within one block cannot synchronize with threads in another block.
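
As a quick illustration of this hierarchy, here is a small sketch (the kernel name and output encoding are made up) of how a thread locates itself using CUDA's built-in index variables:

// Sketch: how a thread finds its place in the grid/block/warp hierarchy.
__global__ void whereAmI(int *out, int n)
{
    int tid  = threadIdx.x;                            // index within the block
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // global index within the grid
    int warp = tid / 32;                               // warp number within the block
    int lane = tid % 32;                               // position within the warp

    if (gid < n)
        out[gid] = warp * 1000 + lane;                 // arbitrary encoding, for illustration only

    __syncthreads();  // barrier: synchronizes threads of the same block only, never across blocks
}

// A possible launch: a grid of 8 blocks with 256 threads each.
// whereAmI<<<8, 256>>>(d_out, 8 * 256);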

Regarding memory, CUDA uses the following kinds:

Global Memory: Global memory can be thought of as the physical memory on your graphics card. All threads can read and write to Global memory.
Shared Memory: A GPU consists of many processors, or multiprocessors. Each multiprocessor has a small amount of shared memory, usually 16 KB. Shared memory is generally used as a very fast working space for the threads within a block, and it is allocated on a block-by-block basis. For example, if three blocks are running concurrently on the same multiprocessor, the maximum amount of shared memory each block can reserve is 16 KB / 3. Threads within the same block can quickly and easily communicate with each other by writing to and reading from shared memory. It is worth mentioning that shared memory is at least 100 times faster than global memory, so it is very advantageous if you can use it correctly (see the sketch after this list).
Texture Memory: A GPU also has texture units and texture memory, which can be taken advantage of in some circumstances. Unlike global memory, texture memory is cached and is generally read-only. One can exploit it by using this memory to store chunks of raw data to be processed.
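
Here is the shared-memory sketch promised above: a per-block reduction in which threads cooperate through shared memory (the names are illustrative, and it assumes blocks of exactly 256 threads):

// Sketch: shared memory as a per-block scratchpad. Each block loads a tile of
// the input into shared memory, then reduces it to a single partial sum.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                        // one tile per block, in shared memory

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;    // cooperative load from global memory
    __syncthreads();                                   // make the tile visible to the whole block

    // Tree reduction inside the block: threads communicate through shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                     // one partial sum per block
}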

From the software development point of view, the programs that run within the threads are the kernels:

Kernels: Within a warp, a kernel is executed multiple times in a SIMD (single instruction, multiple data) fashion, which means that the flow of execution on the processors is the same (all execute the same instruction), but each processor applies that instruction to different data. We are therefore interested in splitting the software into pieces such that each piece is executed multiple times on different data. For example, in the matrix multiplication A*B, a single kernel may have the task of multiplying a row from A by a column from B. When this kernel is instantiated in N x M threads, each instantiation (thread) performs the dot product of a specific row of A and a specific column of B.
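
Continuing the A*B example, such a kernel might look like the following sketch (row-major storage, illustrative names), where each thread computes one element of the result C:

// Sketch: C = A * B, with A of size N x K and B of size K x M, stored row-major.
// Each thread computes one element of C, i.e. the dot product of one row of A
// with one column of B.
__global__ void matMulKernel(const float *A, const float *B, float *C,
                             int N, int K, int M)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row of A and of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of B and of C

    if (row < N && col < M) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * M + col];    // dot product
        C[row * M + col] = acc;
    }
}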

At compilation and link time, the kernel, which is CUDA code, is stored as data inside the program (as seen by the operating system); at run time it is sent to the GPU via the corresponding DMA channel. Once on the GPU, the kernel is instantiated on each processor.
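
The host-side counterpart of that sequence could look roughly like this sketch (it reuses the matMulKernel sketched above; sizes and names are illustrative):

#include <cuda_runtime.h>

// Illustrative host code: allocate device memory, copy the operands over,
// launch the kernel across an N x M grid of threads, and copy the result back.
void matMulOnDevice(const float *hA, const float *hB, float *hC,
                    int N, int K, int M)
{
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * K * sizeof(float));
    cudaMalloc(&dB, K * M * sizeof(float));
    cudaMalloc(&dC, N * M * sizeof(float));

    cudaMemcpy(dA, hA, N * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, K * M * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                // 256 threads per block
    dim3 grid((M + block.x - 1) / block.x,             // enough blocks to cover all of C
              (N + block.y - 1) / block.y);
    matMulKernel<<<grid, block>>>(dA, dB, dC, N, K, M);

    cudaMemcpy(hC, dC, N * M * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}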

I made a quick example that can be compiled directly with Visual C++ with the help of the included project files (assuming one has the CUDA toolkit, driver, and Visual C++ environment set up). It can also be compiled anywhere one has a working C++ and NVCC environment.

I will write a follow-up article on the basic setup with Visual C++ 2008 Express, which requires some tweaks to get up and running.
