Big Battle on Little Systems: SMP vs Cluster

For those who don’t know, I have been working on the Limulus Project for quite a while. The goal of the project is to create and maintain an open specification and software stack for a personal workstation cluster. Ideally, a user should be able to build or purchase a small personal workstation cluster using the Limulus reference design, low-cost hardware, and open software. The idea started in 2005 when Jeff Layton and I asked, “How much computing power can you buy for $2500?”

Today that question is still open to interpretation. The initial 8-node cluster we built performed quite well and offered an outstanding price-to-performance ratio — wire racks and all. We managed to get 14.5 GFLOPS running HPL (remember, this was 2005 and our budget was $2500). I am in the process of building a new 4-node system using Intel’s new Sandy Bridge processors. It will have 16 cores total (four per node). BTW, I also developed a single-case design for this new cluster.

When I started building my cheapskate clusters, there were no multi-core processors and most cluster nodes were dual socket (two single-core processors). Today it is almost impossible to buy a single-core x86 processor that is not designed for low-power applications. It is possible, however, to buy a 16-core (or more) desk-side SMP system, i.e., a single motherboard with 16+ cores. This type of system has the advantage of a single OS image and shared-memory programming.

The question I wonder about is how well applications run on such a “core heavy” box. As my previous tests indicate, depending on the workload, you may not always “see” all the cores. In some cases, you may be surprised how little speed-up you achieve on these SMP systems. The culprit, of course, is memory contention.

What about my “personal cluster”? In my new Sandy Bridge system, each of the four nodes will have four cores sharing the local memory. If my applications are not network-limited (I use Gigabit Ethernet), then I should be able to get better aggregate memory bandwidth on parallel applications than on a typical SMP node, because each node brings its own memory rather than sharing one pool among all 16 cores. Of course, I’ll want to test this assumption. I’ll post results as I get them. Stay tuned.
