* Posts by Geeko

2 publicly visible posts • joined 13 Jan 2009

HP hunts down 'rare' BladeSystem problem

Geeko

Having to lose two power supplies does not make a SPOF

To Anonymous Coward (posted Thursday 15th January 2009 13:13 GMT): SPOF means Single Point of Failure. Losing two power supplies is Multiple Points of Failure, which is far less probable. Even so, in the IBM BladeCenter, the two power supplies in each pair that must fail together to take the enclosure down are connected to separate power harnesses, with each harness meant to connect to a separate external power grid. The failure of any single power grid will therefore take out only one member of each of the two power supply pairs, leaving the enclosure running.
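
To make that concrete, here is a rough sketch in Python. The supply and grid labels (PS1 to PS4, grid_A, grid_B) are made up for illustration and are not IBM's naming; the only thing taken from the description above is that each pair is split across the two grids.

```python
# Hypothetical labels: four supplies in two pairs, each pair split across two grids.
GRID_OF = {"PS1": "grid_A", "PS2": "grid_B", "PS3": "grid_A", "PS4": "grid_B"}
PAIRS = [("PS1", "PS2"), ("PS3", "PS4")]

def pairs_down(failed_grid):
    """Return the supply pairs that lose BOTH members when one grid fails."""
    failed = {ps for ps, grid in GRID_OF.items() if grid == failed_grid}
    return [pair for pair in PAIRS if set(pair) <= failed]

for grid in ("grid_A", "grid_B"):
    # An empty list means no pair is completely lost, so the enclosure stays up.
    print(grid, "->", pairs_down(grid))
```

Whichever grid fails, no pair loses both members, which is the whole point of splitting each pair across the harnesses.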

HP and Dell have adopted the same design in splitting the six power supply inputs over two power grids. Unfortunately, instead of extending this redundancy to the DC side, the outputs of all the power supplies converge onto one DC bus on one midplane.

The IBM BladeCenter has two separate midplanes, each one with its own DC bus. That's why all IBM blade servers have two power connectors: they draw power from two separate DC buses. I don't know if the active components you speak of are hardware monitors or sit in the data and power paths. Whatever the case may be, there are two duplicate sets because there are two midplanes, so again, no SPOF.

Was the HP power supply recall a result of a bad batch of power supplies? That does happen to every vendor from time to time, so it is plausible that this is just bad luck. However, the recall affects all power supplies for the C7000 manufactured before 20 March 2008, that is, since the launch of the C7000 in 2006. By their own calculation, HP claim to have shipped more than a million blades. Considering e-class, p-class and c-class, c-class is by far the most successful and would account for 500,000 blades or more. Assuming 6 power supplies for every 16 blades, that's around 180,000 power supplies! That is not a bad batch, it's an expensive design flaw. A profit-making company would not make that kind of recall unless not doing it would cost even more. It makes you wonder about HP's definition of "extremely rare".
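
For anyone who wants to check the arithmetic, a quick sketch follows. The 500,000 figure is my own guess at the c-class share, not an HP number, and the function name is just for illustration.

```python
# Back-of-envelope estimate; every input here is an assumption from the post.
C_CLASS_BLADES = 500_000      # guessed c-class share of HP's shipped blades
BLADES_PER_ENCLOSURE = 16     # assumes every enclosure is fully populated
PSUS_PER_ENCLOSURE = 6        # assumes every enclosure has all six supplies

def estimate_power_supplies(blades, blades_per_enclosure, psus_per_enclosure):
    """Estimate the number of power supplies shipped for a given blade count."""
    enclosures = blades / blades_per_enclosure
    return enclosures * psus_per_enclosure

estimate = estimate_power_supplies(
    C_CLASS_BLADES, BLADES_PER_ENCLOSURE, PSUS_PER_ENCLOSURE
)
print(f"{estimate:,.0f}")  # prints 187,500
```

That is 31,250 enclosures at six supplies each, so the estimate lands in the same ballpark as the figure above. Partly filled enclosures would push the real count higher, since the same blades would be spread over more chassis.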

Unfortunately, the design flaw is not in the power supply (I would expect HP to be capable of making power supplies as good as IBM's) but in not having a redundant DC bus. Fixing this is a lot harder, because the midplane would need to be changed and a redundant power connector would have to be added to every blade. This is a whole new architecture which would be incompatible with existing blades, something HP would be loath to do given that e-class, p-class and c-class blades are already mutually incompatible.

So rather than fix the real problem, HP have elected to issue improved power supplies (probably with better DC fault isolation) to reduce the probability of failure. It's like issuing a recall on all cars to upgrade the suspension rather than fixing the potholes in the road that are causing the crashes in the first place. I can understand why they have done this, but it certainly convinces me that my VMware cluster is going to be deployed on rack mount servers rather than blades, or at least not on HP or Dell blades.

Geeko

"Single DC Power Bus" = Single Point of Failure

True, not all blades are created equal. Check your favourite vendor's blade architecture:

a) Does the blade have only one power connector?

b) Can the power supplies be configured to do N+1 redundancy?

If the answer to either is "Yes", you have blades that are dependent on a single DC power bus. A power supply fault on the DC side can kill the DC bus and take down everything powered by it.
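
The two questions boil down to a one-line test, sketched here in Python. The function and parameter names are my own. The reasoning behind (b), as I understand it, is that a single spare supply can only cover for any other supply if they all feed a common bus.

```python
def depends_on_single_dc_bus(single_power_connector: bool,
                             supports_n_plus_1: bool) -> bool:
    """Apply the two-question check: a 'Yes' to either implies a single DC bus."""
    return single_power_connector or supports_n_plus_1

# Hypothetical chassis A: one connector per blade, N+1 available.
print(depends_on_single_dc_bus(True, True))    # prints True
# Hypothetical chassis B: two connectors per blade, no N+1 mode.
print(depends_on_single_dc_bus(False, False))  # prints False
```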

This single DC power bus design is clearly present in the HP C7000. It is also in the Dell M1000e (see page 32 in http://www.dell.com/downloads/global/products/pedge/en/pedge_m1000e_white_paper.pdf).

Changing to better power supplies might reduce the risk but cannot eliminate it, because the SPOF is not in the power supplies; it's in the enclosure midplane's single DC bus. Good luck to all the people who have heeded the HP recall, but don't be surprised if the problem resurfaces as the power supplies age.

The only way to eliminate this SPOF is to duplicate the DC bus, put half the power supplies on one bus and half on the other, then connect each blade to both DC buses. This duplicate power midplane is not a new design; it has been around since November 2002 in another vendor's (IBM) blade chassis. Why HP and Dell have not copied it is a mystery.
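
Here is a toy model of the difference, assuming a DC-side fault kills the whole bus it sits on. The blade and bus names are invented, and this sketches the topology only, not any vendor's actual electrical design.

```python
def surviving_blades(blade_buses, failed_buses):
    """Return the blades still connected to at least one live DC bus."""
    return {blade for blade, buses in blade_buses.items()
            if buses - failed_buses}

# Hypothetical four-blade enclosures.
single_bus = {f"blade{i}": {"bus0"} for i in range(1, 5)}
dual_bus = {f"blade{i}": {"busA", "busB"} for i in range(1, 5)}

# One DC-side fault on the only bus takes out every blade.
print(len(surviving_blades(single_bus, {"bus0"})))        # prints 0
# The same fault on one of two buses leaves every blade running.
print(len(surviving_blades(dual_bus, {"busA"})))          # prints 4
# It takes faults on both buses (multiple points of failure) to lose them all.
print(len(surviving_blades(dual_bus, {"busA", "busB"})))  # prints 0
```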