18 Mar It Still Does Not Work
I’m in the middle of solving another cluster problem. I won’t mention any names or vendors, although at times I think some healthy shaming may be in order. In any case, the problem started out simply enough. I was asked to make a small change to some queues on a Torque/Maui cluster. The change had to do with twenty new Nehalem nodes they had purchased last year.
The nodes were working fine up until this point, or more precisely they were working for what was asked of them. Splitting the nodes into two queues caused the nodes to be used in a different fashion and thus exacerbated a problem that had been lurking in the system.
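For readers unfamiliar with how such a queue split is done, here is a minimal sketch of the kind of `qmgr` session involved. The queue names (`short`, `long`), the node property (`nehalem`), and the walltime limits are all hypothetical, not details from the actual cluster:

```shell
# Hypothetical Torque queue split. The "nehalem" node property would be
# assigned to the twenty nodes in the server_priv/nodes file, and
# resources_default.neednodes ties each queue to nodes with that property.
qmgr -c "create queue short queue_type=execution"
qmgr -c "set queue short resources_default.neednodes = nehalem"
qmgr -c "set queue short resources_max.walltime = 01:00:00"
qmgr -c "set queue short enabled = true"
qmgr -c "set queue short started = true"

qmgr -c "create queue long queue_type=execution"
qmgr -c "set queue long resources_default.neednodes = nehalem"
qmgr -c "set queue long resources_max.walltime = 24:00:00"
qmgr -c "set queue long enabled = true"
qmgr -c "set queue long started = true"
```

With two queues drawing on the same nodes, the scheduler mixes jobs across the nodes in a new pattern, which is exactly the kind of change that can surface a latent network fault.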
I’ll save the details other than to say it was traced to a faulty GigE switch. Faulty may not be the best word, because I am not sure whether the switch is broken or has a firmware issue. It turns out the switch in question consists of two 48-port switches that are “stacked” to look like one switch. The slave switch was recently replaced due to spontaneous reboots. The current problem with the switch is the inability of some nodes to contact other nodes. This issue is repeatable and isolated to a few ports on the switch. The first logical (and easy) thing to do is to update the somewhat dated firmware and see if it helps.
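The “some nodes cannot contact other nodes” symptom can be pinned down with a brute-force pairwise check. Below is a rough sketch of such a check; the node list file, passwordless `ssh`, and Linux-style `ping -W` are assumptions on my part, not details from the actual cluster:

```shell
#!/bin/sh
# check_pairs: try to reach every node from every other node and report results.
# $1 = file with one hostname per line
# $2 = command used to run a command on a node (normally "ssh", but
#      parameterized so it can be stubbed out for testing)
check_pairs() {
    nodelist="$1"
    remote="${2:-ssh}"
    for src in $(cat "$nodelist"); do
        for dst in $(cat "$nodelist"); do
            [ "$src" = "$dst" ] && continue
            if $remote "$src" ping -c 1 -W 1 "$dst" >/dev/null 2>&1; then
                echo "$src -> $dst ok"
            else
                echo "$src -> $dst FAIL"
            fi
        done
    done
}

# Typical use on a real cluster (assumes passwordless ssh to each node):
#   check_pairs nodes.txt
```

A failure map that clusters around a handful of hosts points at their switch ports (or the stacking link) rather than at the nodes themselves, which is how this sort of problem gets isolated to a few ports.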
Here is the catch-22. The cluster vendor will only support the cluster with the old firmware because the new firmware has not been “qualified” yet. But when the vendor is told about the problem, they gladly provide a new switch with new firmware, which must then be downgraded to the old firmware so that the support contract can remain in place. And the person who is sent to install the new switch “does not do firmware.” The customer, who purchased a support contract, now has to downgrade the switch so it will “stack” with the existing switch.
The current problem might be solved with a firmware update, but the vendor does not seem interested in fixing the “whole” problem beyond sending incompatible parts. As I see it, shabby support requires that old firmware be used, which requires the end user to do their own support and troubleshooting, which means they essentially get no support. Make sense? Not to me.
I’m not interested in telling war stories or discussing how to fix cluster issues. There is a higher lesson here. What many vendors forget is that clusters are a “system,” not a pile of servers, switches, and cables. Many vendors treat them as individual parts and have no clue how to support the “whole system.” They proudly boast of selling clusters, but what they are selling are connected islands of hardware, each with individual support. It is rare that a vendor takes responsibility for the whole system.
To be fair, clusters can be complicated and custom systems (much like storage networks). There are vendors, usually not the large vendors, who understand this reality. They design, build, and support clusters as complete systems. These integrator-vendors usually take some responsibility for both hardware and software.
What surprises me the most about the above situation is that it is generally the rule and not the exception. I have been involved with clusters since the mid-1990s, and today’s situation reminds me of the early years, when everyone was still trying to figure out how to build these things. At least there was an excuse back then. The only piece of advice I can give is, when buying a “cluster,” ask the vendor for the name and phone number of the person (or group) who is going to help solve software and hardware issues that involve multiple components from multiple vendors. Most vendors hold up a mirror at this point. Keep talking to the ones who don’t.