30 December 2011
A blog about making HPC things (kind of) work
I have been working with "small HPC" since 2005. The idea is basically to build small personal HPC systems using low cost off-the-shelf parts. If that sounds similar to the original Beowulf idea, it is. However, Beowulf has grown into "big HPC" and such systems require a data center. My approach needs a wall plug and some space next to your desk. Performance of these systems is quite good. My most recent system design produced 200 GFLOPS using approximately $3500 in raw parts. If you do the math, that is $17.50 per GFLOP. (Note: These are general purpose CPU FLOPS, which are different from GPU FLOPS.)

One of the key design features is the use of single socket motherboards. This decision was needed to meet the power, heat, performance, noise, and space design envelope. The current system provides 16 cores spread over four motherboards. One might ask, why not use two dual socket motherboards and place 8 cores on each motherboard? Or, use 8-core processors and just use one dual socket motherboard in a standard workstation?

Unfortunately, "core math" is not so simple. While a motherboard may offer a large number of cores, the real question is how well they perform in parallel. Indeed, it is perhaps the most important question; otherwise, why put so many cores on the motherboard? I have covered this topic in a previous post, where core utilization on a 12-core system (dual socket with 6-core processors) ranged from 41% to 98% and the average utilization for all tests was 64%. Thus, on average you can expect to effectively use 7.7 cores out of 12. Contrast this with my 4-core single socket processor tests, where utilization ranged from 50% to 100% and the average was 74%. On average, one can expect to use 3 out of 4 cores. The variation is due to the memory bandwidth of each system. In general, more cores means more sharing of memory and more possible contention. It should be noted that on the rare single core system, utilization is always 100% for HPC applications.
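The arithmetic above is easy to check. The short Python sketch below uses only the figures quoted in this post (cost, GFLOPS, core counts, and average utilization). The function names are illustrative and not taken from any benchmark tool.

```python
# Figures come from the post; function names are illustrative assumptions.

def cost_per_gflops(parts_cost_usd, gflops):
    """Dollars of raw parts per GFLOPS of delivered performance."""
    return parts_cost_usd / gflops


def effective_cores(total_cores, avg_utilization):
    """Cores' worth of useful work at a given average utilization (0.0-1.0)."""
    return total_cores * avg_utilization


if __name__ == "__main__":
    print(f"${cost_per_gflops(3500, 200):.2f} per GFLOPS")
    print(f"12-core dual socket:  {effective_cores(12, 0.64):.1f} effective cores")
    print(f"4-core single socket: {effective_cores(4, 0.74):.1f} effective cores")
```

Treating average utilization as a simple multiplier on core count is a simplification. The tests mentioned above showed a wide range (41% to 98% on the 12-core system), so any single application may land well above or below these averages.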
Cache-friendly programs usually scale well on multi-core systems, while those that rely on heavy access to main memory have the most difficulty. In addition to better scaling, using a single socket node reduces the load on the interconnect. In the case of the low cost system mentioned above, reducing the number of messages that need to enter or leave a node allows a slower, less costly interconnect (i.e. Gigabit Ethernet) to be used. In the case of "small HPC," fewer cores per node is the better choice and can provide much more effective resource utilization. Of course, as processor core counts continue to increase, the point of diminishing returns does not seem far off.





