About the Author

Douglas Eadline, PhD, is both a practitioner and a chronicler of the Linux Cluster HPC revolution. He has worked with parallel computers since 1988 and is a co-author of the original Beowulf How To document. Prior to starting and editing the popular http://clustermonkey.net web site in 2005, he served as Editor-in-chief for ClusterWorld Magazine. He is currently Senior HPC Editor for Linux Magazine and a consultant to the HPC industry. Doug holds a Ph.D. in Analytical Chemistry from Lehigh University and has been building, deploying, and using Linux HPC clusters since 1995.

A blog about making HPC things (kind of) work

Almost all clusters are shared computing resources. In order to facilitate shared use, clusters have some kind of resource scheduler that sits between the user and the actual hardware. Examples of freely available resource schedulers include Open Grid Scheduler (formerly Sun Grid Engine) and Torque/Maui. There are also commercial versions of these and other schedulers. The point is, virtually everyone in HPC uses a scheduler of some sort. Those lucky enough to own a personal HPC resource probably use one as well so they don't have to babysit the system while it is crunching away.

Once a resource scheduler supports more than one person, scheduling jobs while maximizing cluster utilization becomes a difficult problem. For the technically inclined, job scheduling is in general an NP-hard problem: no polynomial-time algorithm for computing an optimal schedule is known, so heuristics are employed to find a reasonable solution based on a scheduling policy. This is a fancy way of saying that job scheduling is not guaranteed to be fair. Your job, which is undoubtedly the most important, may not run when you think it should.
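To make the heuristic point concrete, here is a minimal Python sketch. It is a toy model and not how Open Grid Scheduler or Torque/Maui actually work: the function names, the job runtimes, and the two-node cluster are all made up for illustration. It compares a simple greedy "longest job first" heuristic against a brute-force search for the best possible schedule.

```python
import heapq
from itertools import product


def greedy_makespan(runtimes, nodes):
    """Place each job, in the order given, on the node that frees up first.

    Returns the time at which the last node finishes (the makespan).
    """
    loads = [0] * nodes
    heapq.heapify(loads)
    for runtime in runtimes:
        earliest = heapq.heappop(loads)       # least-loaded node so far
        heapq.heappush(loads, earliest + runtime)
    return max(loads)


def optimal_makespan(runtimes, nodes):
    """Brute force: try every job-to-node assignment.

    The number of assignments is nodes ** len(runtimes), so this is only
    usable for a handful of jobs.
    """
    best = sum(runtimes)
    for assignment in product(range(nodes), repeat=len(runtimes)):
        loads = [0] * nodes
        for runtime, node in zip(runtimes, assignment):
            loads[node] += runtime
        best = min(best, max(loads))
    return best


jobs = [3, 3, 2, 2, 2]  # hypothetical runtimes in hours

heuristic = greedy_makespan(sorted(jobs, reverse=True), nodes=2)
best = optimal_makespan(jobs, nodes=2)
print(f"longest-first heuristic: {heuristic} hours")
print(f"best possible schedule:  {best} hours")
```

In this toy case the heuristic finishes in 7 hours, while the best schedule (both 3-hour jobs on one node, the three 2-hour jobs on the other) finishes in 6. The brute-force search already examines 32 assignments for five jobs on two nodes, and that count doubles with every additional job, which is why real schedulers settle for "reasonable" rather than optimal.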

To cope with this situation, I have devised three simple rules that can help users (and administrators) work with a cluster job scheduler without throwing chairs.

  1. Users should understand that queuing will never be fair
  2. Users should commit to understanding how the job queue works (meaning more than "qsub" and emailing the admin "why is my job not running?")
  3. All users must agree on a policy; otherwise, the powers that be will make the policy

In my experience, without this type of framework, things can become unglued. Users get frustrated, no one seems happy, and the cluster may end up underutilized. There are plenty of places on the Internet to learn more about resource management systems. Just schedule some time and have at it.