14 Jan Caring about Sharing: HPC Work Queues
Almost all clusters are shared computing resources. In order to facilitate shared use, clusters have some kind of resource scheduler that sits between the user and the actual hardware. Examples of freely available resource schedulers include Open Grid Scheduler (formerly Sun Grid Engine) and Torque/Maui. There are also commercial versions of these and other schedulers. The point is, virtually everyone in HPC uses a scheduler of some sort. Those lucky enough to own a personal HPC resource probably use one as well so they don’t have to babysit the system while it is crunching away.
When a resource scheduler serves more than one person, scheduling jobs while keeping the cluster fully used becomes a difficult problem. For the technically inclined, job scheduling is an NP-hard problem: no polynomial-time algorithm for finding an optimal schedule is known, so heuristics are employed to find a reasonable solution based on a scheduling policy. This is a fancy way of saying that job scheduling is not guaranteed to be fair. Your job, which is undoubtedly the most important, may not run when you think it should.
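To make the heuristic idea concrete, here is a toy sketch in Python. It is not how Open Grid Scheduler, Torque/Maui, or any real scheduler works. It is a simple greedy "list scheduling" heuristic, and the job names, runtimes, and the `greedy_schedule` function are all made up for illustration. Each job goes to whichever node frees up first, and the only policy choice is whether to take jobs in submission order or longest first.

```python
import heapq


def greedy_schedule(job_hours, num_nodes, longest_first=True):
    """Greedy list scheduling: give each job to the node that frees up soonest.

    job_hours maps a (hypothetical) job name to its runtime in hours, in
    submission order. Returns (makespan, assignment), where assignment maps
    a node index to the list of jobs it runs, in order.
    """
    if longest_first:
        # Policy: longest jobs first (a common heuristic, not an optimal method).
        order = sorted(job_hours, key=job_hours.get, reverse=True)
    else:
        # Policy: plain submission order.
        order = list(job_hours)

    # Min-heap of (hour at which the node becomes free, node index).
    nodes = [(0, n) for n in range(num_nodes)]
    heapq.heapify(nodes)
    assignment = {n: [] for n in range(num_nodes)}

    for job in order:
        free_at, n = heapq.heappop(nodes)
        assignment[n].append(job)
        heapq.heappush(nodes, (free_at + job_hours[job], n))

    makespan = max(free_at for free_at, _ in nodes)
    return makespan, assignment


# Made-up jobs, listed in the order they were submitted.
jobs = {"alice": 2, "bob": 3, "carol": 2, "dave": 6}

fifo_makespan, fifo_plan = greedy_schedule(jobs, 2, longest_first=False)
lpt_makespan, lpt_plan = greedy_schedule(jobs, 2, longest_first=True)

print("submission order:", fifo_makespan, "hours", fifo_plan)
print("longest first:   ", lpt_makespan, "hours", lpt_plan)
```

With these made-up numbers, the longest-first policy finishes everything in 7 hours instead of 9, but only by letting dave, who submitted last, start immediately while alice waits. Better cluster use and "fair" are not the same thing.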
To cope with this situation I have devised three simple rules that can help users (and administrators) work with a cluster job scheduler without throwing chairs.
- Users should understand that queuing will never be fair
- Users should commit to understanding how the job queue works (meaning more than “qsub” and emailing the admin “why is my job not running?”)
- All users must agree on a policy; otherwise the powers that be will make the policy
In my experience, without this type of framework, things can come unglued: users get frustrated, no one seems happy, and the cluster may end up underutilized. There are plenty of places on the Internet to learn more about resource management systems. Just schedule some time and have at it.