Reply to post: Re: 15 9s

HGST has an entry-level 14PB archive box... is that enough for your, er, home collection?

Anonymous Coward

Re: 15 9s

You're right that human error is a great contributor to outages, but that's not what availability figures are about. They measure the machine's availability, so if you were to yank all the power cables out, it wouldn't count, as you could hardly blame the machine, but if the power supplies were all to fail simultaneously, it would.

There are two ways to talk about availability and the Xx9s numbers. The first is the design: storage systems are designed to meet availability targets. So to go for 5x9s you will need two controllers, for example. This is based on the probability that each of these controllers will fail. If you want to go above 5x9s you either need to increase the resilience of these controllers, which is difficult and expensive, especially if you're using off-the-shelf components as most people now are, or add a third controller.
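
To put rough numbers on that design argument, here is a back-of-the-envelope sketch in Python. It assumes controllers fail independently and that the box stays up as long as any one controller is up. The 99.7% single-controller figure is made up for illustration, not a vendor number.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def nines(unavailability):
    """Number of nines for a given fraction of time spent down."""
    return -math.log10(unavailability)

def redundant_unavailability(single_unavailability, controllers):
    """The system is down only if every controller is down at once.
    Assumes independent failures, which is optimistic."""
    return single_unavailability ** controllers

# Hypothetical controller that is down 0.3% of the time (99.7% available).
single = 0.003

for n in (1, 2, 3):
    u = redundant_unavailability(single, n)
    print(f"{n} controller(s): {nines(u):.2f} nines, "
          f"{u * MINUTES_PER_YEAR:.2f} minutes down per year")
```

With these made-up inputs, two controllers land just past five nines and a third adds roughly two and a half more. Correlated failures (shared firmware bugs, shared power) would pull the real figure down, so treat this as an upper bound.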

Similarly, there's the way you protect your data. RAID 5 won't cut it with large disk drives nowadays, so to get 5x9s you need RAID 6. Rebuild times are terrible, so various people have designed more granular dual-protection mechanisms to mitigate that. Those can generally give you 6x9s and above.

The second way to talk about it is to actually measure it. This is a bit more difficult for a new product as there aren't any boxes out in the field. For established products, one can build up a reasonably accurate availability picture, to an extent. Most outages go up the support chain as far as the people who do the measuring. They actually time the outage from when it begins until the last host is back up. Or at least that's what they're supposed to do.

There are some caveats: if the outage was not caused by a failure in the box, it doesn't count at all. If the system is back up but the customer is taking their time bringing up hosts, then a decision will be taken regarding whether to include that as part of the outage. More importantly, if an outage is caused, for example, by a heavily loaded system losing a controller, and the resulting latency increase causes many hosts to time out, it doesn't count as an outage, as the system is "working as designed".
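
As a rough illustration of how a measured figure falls out of those rules, here is a Python sketch. The `Outage` record, its field names and the fleet numbers are all invented for the example. The judgement call about slow host bring-up is not modelled; it would simply change the `minutes` recorded for an outage.

```python
from dataclasses import dataclass

@dataclass
class Outage:
    minutes: float             # first host down until last host back up
    caused_by_box: bool        # False for e.g. someone pulling the power cables
    working_as_designed: bool  # True for e.g. host timeouts on a degraded system

def counted_minutes(outages):
    """Sum only the outages that the rules above would count."""
    return sum(o.minutes for o in outages
               if o.caused_by_box and not o.working_as_designed)

def measured_availability(outages, systems, period_minutes):
    """Fraction of total fielded system-time that was not a counted outage."""
    total = systems * period_minutes
    return 1 - counted_minutes(outages) / total

# Made-up fleet: 1,000 systems observed for one 30-day month.
log = [
    Outage(minutes=90, caused_by_box=True, working_as_designed=False),
    Outage(minutes=240, caused_by_box=False, working_as_designed=False),
    Outage(minutes=45, caused_by_box=True, working_as_designed=True),
]
month = 30 * 24 * 60
print(f"{measured_availability(log, 1000, month):.7f}")
```

Only the first of the three outages ends up in the figure, which is why the published number can look a lot better than what customers actually experienced.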

The other thing to bear in mind is that the vendors do fiddle the numbers a bit. If you remember, about a decade or so ago a lot of hardware (not just storage) was hit by a timer bug in the Linux kernel (I forget the details). A lot of machines fell over, so availability was terrible for that day/month. Vendors will not have highlighted that, smoothing it out by taking a 6-month number, or discarding it completely as it wasn't their fault.
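
The smoothing effect is easy to see with a little arithmetic. This Python sketch takes a single hypothetical 4-hour outage (the duration is invented, not a figure from that incident) and reports availability over progressively longer windows, using 182 days as an approximation of six months.

```python
MINUTES_PER_DAY = 24 * 60

def availability(down_minutes, days):
    """Availability over a reporting window of the given length."""
    return 1 - down_minutes / (days * MINUTES_PER_DAY)

# Hypothetical: one 4-hour outage and nothing else in the window.
down = 4 * 60

for label, days in (("1 day", 1), ("30 days", 30), ("182 days", 182)):
    print(f"{label:>8}: {availability(down, days):.5%}")
```

The same outage looks steadily less damaging as the window grows, so always ask what period a quoted figure covers.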

The system discussed in the article is new, so HDS are making their claim based on the architecture they have used. Clearly they have no field measurements. It's a bold claim, and the details of the architecture will help people decide how realistic this is. Ultimately though, this is the best way to determine what your availability is going to be like: understand the architecture and the components within. Take the vendors' figures with a pinch of salt.
