20 May What You Need to Know Can’t Hurt You
One of the issues facing the HPC market/community is the lack of good system administrators for clusters. Many believe this issue holds back the market, and I have to agree. I don’t think, however, that being a Linux cluster administrator is all that much different from being any other Linux systems administrator. There are some new and different things to learn, but most of what you need is common knowledge (i.e., there is no secret cluster admin cabal).
In the past I have taught classes on cluster administration and found that those with Linux/Unix experience often have little trouble handling the concepts and ideas. In addition, I am often asked what skills are needed to be a good cluster administrator. What follows is not a complete list for sure, but it is a start. The list begins with general topics and as you proceed toward the end the topics become more cluster specific.
- Basic Skills – There are some basic skills that any Linux/Unix administrator needs to have mastered. These include the use of a command-line text editor (sorry, no mice and menus allowed in the trenches). Bash scripting is essential. Perl is helpful, but bash is what drives a lot of things on a cluster. In addition, you need to understand basic Linux concepts such as booting, mounting file systems, kernel modules and tools, performance monitoring, and SMP concepts. A good understanding of x86 server hardware helps as well.
- RPM/YUM or Deb packaging – Depending on your distribution, understanding Linux package management is essential. You need to know not only how to install packages, but also how to query package contents, update packages, and work other package mojo. Being able to build RPMs is nice, but not necessary. I’ll have more on this at another time.
- Compilers – Most cluster experts (and administrators) have a good understanding of compilers and building code. Understanding that a long stream of error messages can be due to missing libraries (and is easily fixed) prevents the feeling of being overwhelmed that comes with trying to build that new software package. And it makes you look like a genius to your users.
- Networking – Networking is perhaps the toughest area in which to find good information. In many other market sectors, non-optimal network performance works quite well for just browsing the web or transferring files. Clusters need the fastest networks possible. High-end networks have become even more obscure with the use of “user space” or “zero-copy” protocols. The market is focused on either 10 GigE or InfiniBand solutions, and most drivers are part of the kernel. GigE is also still a real alternative in some cases.
- Cluster Provisioning – When a cluster node boots, it needs to come up in a predictable and manageable fashion. There are many packages out there that provide help with this task. Most cluster tools and provisioning packages use standard Linux/Unix concepts to achieve manageable systems.
- Schedulers – Resource scheduling has been around ever since people started sharing computers. The basic concept is to let multiple users share the cluster. While the issue of resource scheduling can get quite involved, the basic concepts are not too hard to grasp. One should also know that no matter how hard you work to optimize your scheduling system, there will still be complaints.
- Message Passing Interface (MPI) Libraries and OpenMP – MPI has been around since before clusters hit the big time, and there are numerous books and classes on the topic. MPI is basically a software library that allows processes to exchange data (on the same or different machines). It is supported on all popular (and even unpopular) networks. OpenMP is implemented by the compiler and uses source code directives to create threaded programs for a single SMP server.
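To make the scheduler item above a bit more concrete, here is a minimal sketch of the basic idea: a strict first-come, first-served queue handing out nodes on a small cluster. It is a toy and not how any particular scheduler package works. The names `Job` and `fifo_schedule`, the users, and the eight-node cluster are all made up for illustration, and time is counted in arbitrary units.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    user: str      # hypothetical user name
    nodes: int     # number of nodes requested
    runtime: int   # run time in arbitrary time units


def fifo_schedule(jobs, total_nodes):
    """Toy first-come, first-served scheduler (illustrative assumption only).

    Jobs start strictly in submission order. Returns a list of
    (user, start_time) pairs.
    """
    queue = deque(jobs)
    running = []        # min-heap of (finish_time, nodes_in_use)
    free = total_nodes
    now = 0
    starts = []
    while queue:
        job = queue[0]
        if job.nodes > total_nodes:
            raise ValueError(f"{job.user} asks for more nodes than the cluster has")
        if job.nodes <= free:
            # The job at the head of the queue fits, so start it now.
            queue.popleft()
            free -= job.nodes
            heapq.heappush(running, (now + job.runtime, job.nodes))
            starts.append((job.user, now))
        else:
            # The head of the queue does not fit: advance the clock to the
            # next job completion and reclaim its nodes.
            finish, nodes = heapq.heappop(running)
            now = max(now, finish)
            free += nodes
    return starts


if __name__ == "__main__":
    jobs = [Job("alice", 4, 10), Job("bob", 8, 5), Job("carol", 2, 3)]
    for user, start in fifo_schedule(jobs, total_nodes=8):
        print(f"{user} starts at t={start}")
```

In this example carol's small job waits behind bob's full-cluster job even though four nodes sit idle while alice runs, which is exactly the kind of thing that generates complaints. Production schedulers layer policies such as priorities and backfilling on top of this basic queue to reduce that kind of waiting.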
There you have it. If you work with HPC clusters, you bump into these issues most of the time. There is ample and freely available documentation (and software) on all of these topics. There are even cluster courses if you can find them. Of course, there are some cluster-only issues that deal with parallel computing, but a good grasp of the above creates a solid foundation, enough to get you on your way to becoming an HPC cluster maven.