21 November 2011
A blog about making HPC things (kind of) work
Did you know that the typical HPC node has a bunch of programs running in addition to the user's application? Have you ever wondered what these are, what they are doing, or whether they are needed? Most of them are service daemons, programs that run in the background. For instance, the httpd web server daemon waits for input on port 80 and then tries to deliver a web page based on the request. Of course, there is no need for a web server to run on an HPC cluster node.

If you use a stock distribution on your nodes, it is worth seeing what services are actually running. Run chkconfig --list and note which services are started at boot (if there is an "on" in one of the run levels, the service will be started at that level). Also check which services are enabled in /etc/xinetd.d/*. You can turn unwanted services off using chkconfig and the /etc/xinetd.d/* configuration files. You may also want to check /etc/rc.local to see if any other services are started when the nodes boot.

Finally, it may be instructive to run top on a node when no jobs are running and see how much system load and memory the services use (hit "M" to sort processes by memory). The results of this simple test often surprise cluster administrators.

Before you start killing processes, here is a list of essential services that should be running on most nodes. There may be more, and before you decide to turn off a service, make sure the node does not require it for proper operation.

- Remote login services: sshd (preferable) or rsh
- Time synchronization: ntpd
- Remote logging of system logs: rsyslogd
- Remote monitoring: gmond (or similar)
- Remote batch execution services: sge_execd, pbs_mom, slurmd, or other resource manager
- Remote file locking: nfslock (note: normally NFS daemons do not need to run on the nodes)
- Remote procedure calls: rpc* daemons
- Hardware monitoring/control: lm_sensors, ipmi or similar
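
If you have more than a handful of nodes, comparing the chkconfig --list output against the list above by eye gets old fast. Here is a minimal Python sketch of how that comparison might be automated. It assumes the SysV-style output format (a service name followed by seven "level:on/off" columns). The function names, the ESSENTIAL set, and the SAMPLE text are all made up for illustration, and the init-script names on your distribution may not match the daemon names used here, so treat it as a starting point rather than a finished tool.

```python
import re

# Names taken from the list above. Init-script names can differ from
# daemon names on a given distribution (assumption), so edit to taste.
ESSENTIAL = {
    "sshd", "ntpd", "rsyslog", "rsyslogd", "gmond",
    "sge_execd", "pbs_mom", "slurmd", "nfslock", "lm_sensors", "ipmi",
}

def enabled_services(chkconfig_output):
    """Map each SysV service to the set of run levels where it is 'on'."""
    enabled = {}
    for line in chkconfig_output.splitlines():
        fields = line.split()
        # SysV lines look like: name 0:off 1:off 2:on 3:on 4:on 5:on 6:off
        if len(fields) < 2 or not re.fullmatch(r"\d:(on|off)", fields[1]):
            continue  # skip blank lines and the xinetd section
        levels = {int(f[0]) for f in fields[1:] if f.endswith(":on")}
        if levels:
            enabled[fields[0]] = levels
    return enabled

def review_candidates(chkconfig_output, essential=ESSENTIAL):
    """Return enabled services that are not on the essential list."""
    def is_essential(name):
        # The list above keeps all rpc* daemons.
        return name in essential or name.startswith("rpc")
    services = enabled_services(chkconfig_output)
    return sorted(name for name in services if not is_essential(name))

# Hypothetical sample in the assumed format, not output from a real node.
SAMPLE = """\
httpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
sshd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
cups            0:off   1:off   2:on    3:on    4:on    5:on    6:off
bluetooth       0:off   1:off   2:off   3:off   4:off   5:off   6:off
rpcbind         0:off   1:off   2:on    3:on    4:on    5:on    6:off
"""

print(review_candidates(SAMPLE))  # ['cups', 'httpd']
```

On a real node you would feed it the captured output of chkconfig --list instead of SAMPLE. The result is only a list of services to look at, not a list of services to kill: check each one against what the node actually needs before turning it off. Note that this sketch skips the xinetd section of the output, so /etc/xinetd.d/* still needs a separate look.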





