About the Author

Douglas Eadline, PhD, is both a practitioner and a chronicler of the Linux Cluster HPC revolution. He has worked with parallel computers since 1988 and is a co-author of the original Beowulf How To document. Prior to starting and editing the popular http://clustermonkey.net web site in 2005, he served as Editor-in-chief for ClusterWorld Magazine. He is currently Senior HPC Editor for Linux Magazine and a consultant to the HPC industry. Doug holds a Ph.D. in Analytical Chemistry from Lehigh University and has been building, deploying, and using Linux HPC clusters since 1995.

A blog about making HPC things (kind of) work

Did you know that the typical HPC node has a bunch of programs running in addition to the user's application? Have you ever wondered what they are, what they are doing, and whether they are needed?

Most of these programs are service daemons, that is, programs that run in the background. For instance, the httpd web server daemon waits for input on port 80 and then tries to deliver a web page based on the request. Of course, there is no need for a web server to run on an HPC cluster node.

If you use a stock distribution on your nodes, then it is useful to see what services are actually running. This can be done by running chkconfig --list and noting what services are started on boot-up (if there is an "on" in one of the run levels, then the service will be started at that level). Also, check what services are enabled in /etc/xinetd.d/*. You can turn unwanted services off using chkconfig and the /etc/xinetd.d/* configuration files. You may also want to check /etc/rc.local to see if any other services are started when the nodes boot. Finally, it may be instructive to run top on a node when no user jobs are running and see how much system load and memory are being used by the services (hit "M" to sort processes by memory use). The result of this simple test is often surprising to cluster administrators.
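As a rough sketch of the chkconfig check described above, the following Python function picks out the services marked "on" at one run level. The function name enabled_services, the default run level of 3, and the sample text are assumptions made for illustration; the parsing also assumes the usual English-language chkconfig --list layout.

```python
def enabled_services(chkconfig_output, runlevel=3):
    """Return names of services marked "on" at the given run level.

    Expects text in the layout printed by `chkconfig --list`, e.g.
    "httpd   0:off 1:off 2:on 3:on 4:on 5:on 6:off".
    """
    wanted = f"{runlevel}:on"
    enabled = []
    for line in chkconfig_output.splitlines():
        fields = line.split()
        # Skip blank lines and rows that are not SysV service entries,
        # such as the "xinetd based services:" section.
        if len(fields) < 2 or ":" not in fields[1]:
            continue
        if wanted in fields[1:]:
            enabled.append(fields[0])
    return enabled


# Illustrative sample only; on a node you would capture the real output, e.g.
# subprocess.run(["chkconfig", "--list"], capture_output=True, text=True).stdout
sample = """\
httpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
sshd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
cups            0:off   1:off   2:off   3:off   4:off   5:off   6:off

xinetd based services:
        rsync:          off
"""
print(enabled_services(sample))  # ['httpd', 'sshd']
```

The xinetd section is deliberately skipped here, so services under /etc/xinetd.d/* still need a separate look.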

Before you start killing processes, here is a list of essential services that should be running on most nodes. There may be more, and before you decide to turn off a service, make sure the node does not require it for proper operation.

  • Remote login services: sshd (preferable) or rsh
  • Time synchronization: ntpd
  • Remote logging of system logs: rsyslogd
  • Remote monitoring: gmond (or similar)
  • Remote batch execution services: sge_execd, pbs_mom, slurmd, or other resource manager
  • Remote file locking: nfslock (note: normally NFS daemons do not need to run on the nodes)
  • Remote procedure calls: rpc* daemons
  • Hardware monitoring/control: lm_sensors, ipmi or similar

There can be other daemons running, but in general a node should run a minimal set of services. If you notice things like httpd, iptables, cups, or mysqld enabled, you can probably turn them off. There may also be vendor-specific daemons running, such as parallel file system daemons, hardware monitors, or management daemons (e.g. Bright Cluster Manager). Compared to the cluster login/management nodes, the compute nodes should be running a small subset of system services.
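One way to apply the list above is to compare what is enabled on a node against an allowlist and review whatever is left over. The sketch below assumes hypothetical names (ESSENTIAL, review_candidates) and illustrative service names; init script names vary between distributions and resource managers, so the set would need adjusting for a real cluster.

```python
# Assumed allowlist based on the essential-services list above. The exact
# init script names are illustrative and differ between distributions.
ESSENTIAL = {
    "sshd", "ntpd", "rsyslog", "gmond",
    "sge_execd", "pbs_mom", "slurmd",
    "nfslock", "rpcbind", "lm_sensors", "ipmi",
}


def review_candidates(enabled, essential=ESSENTIAL):
    """Return enabled services that are not on the allowlist, sorted by name.

    The result is a list to review by hand, not a list to disable blindly:
    vendor daemons (parallel file system, management agents) may be required.
    """
    return sorted(set(enabled) - set(essential))


print(review_candidates(["sshd", "httpd", "cups", "slurmd", "mysqld"]))
# ['cups', 'httpd', 'mysqld']
```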

There is also some overlap in services. Most of the resource scheduler daemons monitor the node resources and communicate this data back to the scheduling server. Much of this same information is transmitted by the ganglia monitoring package. Also, most of the resource reporting daemons have facilities for users to add metrics for the node that are not normally reported (e.g. a list of user names or temperatures).
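For ganglia, one common way to publish a user-defined metric is the gmetric command-line tool that ships with the package. The sketch below only builds the argument list; the helper name gmetric_command and the metric name cpu_temp are hypothetical, and it assumes gmetric is installed and gmond is configured on the node.

```python
def gmetric_command(name, value, metric_type="float", units=""):
    """Build the argument list for a ganglia gmetric call.

    Uses gmetric's --name, --value, --type, and --units options.
    """
    argv = [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
    ]
    if units:
        argv.append(f"--units={units}")
    return argv


# Hypothetical metric: a CPU temperature reading taken elsewhere on the node.
cmd = gmetric_command("cpu_temp", 54.5, units="Celsius")
print(cmd)
# ['gmetric', '--name=cpu_temp', '--value=54.5', '--type=float', '--units=Celsius']
```

On a node, the list could be passed to subprocess.run(cmd, check=True), typically from a cron job so the value is refreshed at a regular interval.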

Keeping service daemons to a minimum helps keep your compute nodes "lean and mean" so a maximum amount of node resources can be devoted to user applications. Sounds like a plan.