01 Apr Parallel Programming, Really?
I am often asked, “Is it hard to write parallel programs?” The short answer is a resounding “yes.” My answer is sincere, and my intention is to scare away those who don’t like a challenge. As with many programming questions, the devil is in the details, and the challenges derive from the specific application area. In response, I often ask two questions of my own. First, “What is the application area in which you are interested?” And second, “What type of hardware do you expect to use?” The answers are an important step in determining the tools you may use to create your masterpiece.
In the past, most people assumed parallel programming meant writing programs for a cluster. In recent years that assumption has broadened to cover three possible hardware platforms: a multi-core system, a cluster, or a GP-GPU (or some combination of the three). As mentioned, the choice is very much tied to the application, and understanding the trade-offs of each approach is important.
For many users, HPC hardware is determined by available institutional resources. Within this hardware there are still some choices, however. First, programming for a cluster is always a possibility. The primary tool is the Message Passing Interface (MPI), which is a set of communication routines for C/C++ and Fortran (several free/open versions are available). The routines are part of a library that is used to communicate between cluster nodes or, more specifically, between processes on different (or the same) cluster nodes. MPI programs are called “message passing” programs and have several advantages. They can scale to a large number of processors (or processes), and they can run on both distributed memory machines (clusters) and shared memory machines (SMP). Indeed, the advent of multi-core has made highly parallel nodes (e.g. up to 48 cores per node and growing) attractive to the HPC audience. On the other hand, large MPI programs can be difficult to debug, and efficient execution may require an intimate understanding of your application’s computation and communication needs.
The next method is for a single SMP system (e.g. a 16 core node). In this case the computational power of a single node is “enough” for your application. There is no requirement that MPI be used on a single node, although it will work. In this scenario, one could use POSIX Threads or OpenMP. For scientific codes, OpenMP is often the best choice, as it requires only hints (pragmas in C, comment directives in Fortran) to be added to your existing programs. Virtually all compilers support OpenMP and can use the “hints” to parallelize your code into threads, which will run in parallel on an SMP architecture. Note that OpenMP programs do not work well across multiple cluster nodes, and thus the scalability is limited to the number of cores per node. OpenMP also preserves the original source code and allows easy experimentation with parallelization. Keep in mind, however, that some forms of parallelism may be hard to express with OpenMP. There are cases where OpenMP and MPI have been combined to produce hybrid programs that run as OpenMP threads on the nodes and use MPI routines to communicate between nodes.
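As a minimal sketch of the OpenMP “hints” described above, here is an ordinary C loop with a single pragma added. The function name `dot_product` and its arguments are illustrative assumptions, not taken from any particular code base.

```c
/* Dot product of two arrays of length n.
 *
 * The pragma asks an OpenMP-aware compiler to split the loop iterations
 * across threads. The reduction(+:sum) clause gives each thread its own
 * private partial sum and adds the partial sums together when the loop
 * ends. A compiler with OpenMP disabled ignores the pragma and runs the
 * same loop serially, so the original source is preserved. */
double dot_product(const double *x, const double *y, long n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }

    return sum;
}
```

With GCC, for example, the pragma takes effect when the file is compiled with `-fopenmp`. Without that flag the function still compiles and returns the same result, which is what makes this kind of experimentation cheap.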
The final situation is programming a GP-GPU. This programming model is very different from that suggested by both MPI and OpenMP because it represents a specific type of parallelism, namely Single Instruction Multiple Data (SIMD); that is, the same instruction is performed on many different pieces of data at the same time. Strictly speaking, SIMD applications can be coded in MPI and OpenMP as well, but those versions will not run on GP-GPU hardware. The most popular languages used to program GP-GPUs are CUDA (NVIDIA only) and OpenCL (NVIDIA and AMD/ATI). As with OpenMP, it is very difficult to write efficient SIMD programs that run across multiple cluster nodes, and the scalability is limited by the number of SIMD processors per node. Though CUDA is limited to NVIDIA and only runs on GP-GPUs, it is easier to learn than OpenCL, which can be used to program all the cores in a node (both host and GP-GPU). There are some tools from Portland Group and Pathscale that allow Fortran and C codes to be augmented (like OpenMP) for use on GP-GPUs.
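To make the SIMD idea concrete, here is a rough sketch in plain C. It is not CUDA or OpenCL code and will not run on a GP-GPU. It only separates the per-element operation from the loop that applies it, which is roughly the shape a GP-GPU program takes. The names `saxpy_element` and `saxpy` are hypothetical.

```c
/* The operation applied to a single data element i: y[i] = a*x[i] + y[i].
 * In a GP-GPU language, a function of this shape would be executed by
 * many hardware threads at once, each with a different value of i. */
static void saxpy_element(long i, float a, const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

/* Plain C stand-in for running the operation over all n elements.
 * Every iteration performs the same instructions on different data and
 * no iteration depends on another, which is what makes the problem a
 * good fit for SIMD hardware. Here the iterations simply run one after
 * the other on the host CPU. */
void saxpy(long n, float a, const float *x, float *y)
{
    for (long i = 0; i < n; i++) {
        saxpy_element(i, a, x, y);
    }
}
```

Problems whose inner loops do not look like this, for example loops where each iteration needs the result of the previous one, are usually much harder to map onto a GP-GPU.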
The best way to navigate the possible choices is to carefully evaluate the type of problem you want to solve. Consider current and future problem size (scalability), how quickly you need to have it working (ease of programming), how long you need to use it (portability), and how fast you need it to run (type of hardware). There is more, of course. Like all programming, there is “quick and dirty” and there is “right and robust.” Your choice.