KVMhost69 - some kind of head alignment problem, I guess.
A number of Heart Internet customers remain unable to get online for three days thanks to a hosting failure. On 28 January, the Nottingham-based web hosting biz explained in its status update page that one of its KVM hosts (KVMhost69) had suffered a disk failure, which had forced the file-system to mount in read-only mode. " …
It says a lot about their disaster recovery practices!
I thought the whole point of virtual machines is that they aren't restricted to any physical hardware. Why can't they just spin up a new machine and move the VMs over from the read-only array or recover from backups? Or just move the VMs over to other hosts with free capacity?
My previous employer had a power outage and lost our first production server and the backup server (both mainboard failures) when the power came back 2 hours later (complete industrial estate lost power). We managed to recover all VMs from the first server onto the second server within a couple of hours and jury-rig a backup process until 2 new servers and SANs could be installed a couple of days later.
The VMs were a little sluggish as the second production server was carrying its load and the load from the first server for a couple of days, but everything was back up and working.
The IT manager was on his own and managed to get the infrastructure back online in a few hours. Surely an ISP, whose job it is to provide IaaS, must have some sort of disaster recovery procedure in place to deal with such problems? I mean, that is their bread and butter after all; they are not the underfunded IT "department" of a manufacturing concern...
Edit: I posted this against the next thread... :-S
"they fail even quicker when a company uses cheaper consumer grade disks "
A lot of stats show that if anything the more expensive drives fail faster. It was certainly the case for all our scsi-UW drives.
There's at least one filesystem out there which works on the basis of "Disks are crap. Deal with it" - where loss of a drive or two isn't a big deal, vs systems with expensive raid systems and expensive disks that don't get adequate supervision and where loss of a drive is a performance-sapping event.
In any case, any outfit which doesn't have monitoring set up to send out a distress call when a RAID drive dies isn't fit for hosting other people's VMs.
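For what it's worth, the "distress call" part isn't hard. A minimal sketch, assuming Linux software RAID (md) and the usual /proc/mdstat layout; nobody here knows what Heart actually runs, and a hardware RAID controller would need its vendor's tooling instead. The function name and the sample text are made up for illustration.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (illustration only).
SAMPLE = """\
Personalities : [raid1] [raid10]
md0 : active raid10 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

md1 : active raid1 sde1[0] sdf1[1](F)
      976630464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array stanza starts with a line like "md0 : active raid10 ..."
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line ends with e.g. "[2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    # On a real host you would read /proc/mdstat instead of SAMPLE
    # and page someone rather than print.
    print(degraded_arrays(SAMPLE))
```

Run that from cron, wire the result into whatever pages the on-call person, and a degraded array stops being a surprise.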
how long had the array been running degraded?
Was also my first thought. So, either they have had an uncommon run of bad luck (which can be mitigated by doing things like not having all the drives in the array from the same production batch) or their monitoring and supply arrangements are nothing short of shocking.
I've added them to the reasonably-long list of "companies with which to not do business"..
Status page shows that two KVM hosts had issues. How many more are going to fail due to Heart's inability to maintain its own servers properly? Surely their data centre team received notifications of a degraded array? Assuming they actually have a data centre team and they haven't all been poached by a rival hosting company...
Obviously they learnt nothing from the last two major incidents they had in 2016 and 2017, or are they aiming to have an incident every year as some kind of twisted anniversary gift?
Every time there's been a buyout, Heart -> Host Europe -> GoDaddy, the level of service has degraded.
Same happens across the industry (and is the reason why I haven't been a customer of Demon Internet for a number of years..).
Also, GoDaddy is one of those acquisition canaries - when a company gets bought by them you know it's well on the way to dying and it's time to bail. Much like Capita..
"Every time there's been a buyout, Heart -> Host Europe -> GoDaddy the level of service has degraded"
When it comes to virtual hosting, redundancy is best attained by hosting with multiple providers in different locations.
Likewise with RAID cloud storage. Each element in a different cloud provider.
It doesn't give off the impression that their setup/kit/etc is ideal. Things happen, but you shouldn't really ever be in the position of having a disk fail in a RAID 10 to cause such an outage. If it was multiple disks, then either someone wasn't monitoring or they used crappy disks (no excuse for either if hosting is your business!). And that's before you even mention the fact that their estate doesn't seem to support moving virtual machines to other hardware as other commenters have mentioned.
The only bit I have empathy with is when a disk goes, the RAID controller does its best, but a filesystem makes itself read-only until you can run an fsck/check. But still, at worst that's a reboot and a few hours. And if your business is hosting, you should be able to withstand a piece of physical hardware breaking.
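Spotting the read-only remount is also easy to automate. A rough sketch, assuming a Linux host and the standard /proc/mounts field order (device, mount point, filesystem type, options); the function name, the sample mount table, and the choice of filesystem types to watch are my own assumptions.

```python
# Hypothetical sample in /proc/mounts format (illustration only).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/md0 /var/lib/libvirt ext4 ro,relatime,errors=remount-ro 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
"""


def read_only_mounts(mounts_text, fstypes=("ext4", "xfs")):
    """Return mount points of watched filesystem types currently mounted read-only."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        # Match the exact "ro" option, not substrings such as "errors=remount-ro".
        if fstype in fstypes and "ro" in options.split(","):
            hits.append(mountpoint)
    return hits


if __name__ == "__main__":
    # On a real host, read /proc/mounts instead of SAMPLE_MOUNTS.
    print(read_only_mounts(SAMPLE_MOUNTS))
```

Limiting the check to a list of filesystem types keeps things like deliberately read-only pseudo-filesystems from raising false alarms.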
Where I used to work, they had a massive push to move everything to Heart and get rid of all in-house "anything". No, they (manglement) didn't do any of the things I mentioned in internal discussions - like looking into their capabilities to see if what they promised was something they could provide. Their portal isn't the best I've used, especially for DNS which is (IMO) a right PITA to work with.
Anyway, now I've been made redundant, reading this gives me a right feeling of Schadenfreude. They (my ex-employer) got rid of everyone with a clue and have been busy making screwup after screwup.
On the other hand, I feel bad for any of the customers who have been affected by this. As a professional I really dislike seeing customers screwed up - and then usually lied to to avoid taking the blame.
But there is another hosting outfit that promises "we can move your entire Heart hosting setup to us - automatically". Apparently it was set up by the people who set up Heart before it got sold on.