So...
basically they tested it in production with no idea what would happen? Good to know they don't actually have any important data to lose.
Before turning on its new custom-built data center in Prineville, Oregon, Facebook simulated the facility inside one of its two existing data center regions. Known as Project Triforce, the simulation was designed to pinpoint places where engineers had unknowingly fashioned the company's back-end services under the assumption …
I know what you mean, but what would you suggest? LoadRunner? I have tested extreme ingestion rates for email archiving, and generating the kind of unique load that's needed takes some doing. We tried signing up for spam and got a decent volume per day. That didn't reflect the types of attachments, though, so we needed to use additional sending agents. There is nothing like real user data for understanding performance characteristics. Trying to model it is exceedingly difficult.
Stress tests are not easy to design. With Facebook's growth rate, they had a lot riding on getting it right. One of the most fascinating things about user activity is that if something isn't responsive, users actually increase the load: they click refresh or issue another request. You know what that does to a web server or the database behind it. The first transaction is not cancelled on the server side...
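To make the refresh effect concrete, here is a toy simulation in Python. It is a sketch under assumed conditions, not a model of Facebook's stack: a fixed-capacity FIFO server, users who all arrive at once, and a hypothetical "patience" threshold after which each user re-sends. Abandoned requests stay queued because they are not cancelled server-side, so they burn capacity and push everyone's latest request further back.

```python
from collections import deque


def simulate_refresh_storm(users, capacity_per_tick, patience_ticks, max_retries, ticks):
    """Toy discrete-time model (all parameters hypothetical).

    Every user sends one request at tick 0. The server completes up to
    capacity_per_tick queued requests per tick in FIFO order. A user whose
    latest request is still pending after patience_ticks clicks refresh:
    a new request is queued and the old one is NOT cancelled, so the server
    still spends capacity on it even though nobody reads the result.
    Returns (requests_submitted, requests_wasted, users_still_waiting).
    """
    queue = deque()                        # entries are (user, request_id)
    latest = {}                            # user -> request id their browser is waiting on
    sent_at = {}                           # user -> tick their latest request was sent
    retries = dict.fromkeys(range(users), 0)
    next_id = 0
    for u in range(users):
        queue.append((u, next_id))
        latest[u] = next_id
        sent_at[u] = 0
        next_id += 1
    wasted = 0
    for t in range(1, ticks + 1):
        # The server works through the backlog at a fixed rate.
        for _ in range(min(capacity_per_tick, len(queue))):
            u, rid = queue.popleft()
            if latest.get(u) == rid:
                del latest[u]              # the user finally gets a page
            else:
                wasted += 1                # response to an abandoned request
        # Impatient users refresh, adding load without removing any.
        for u in list(latest):
            if t - sent_at[u] >= patience_ticks and retries[u] < max_retries:
                queue.append((u, next_id))
                latest[u] = next_id
                sent_at[u] = t
                retries[u] += 1
                next_id += 1
    return next_id, wasted, len(latest)


if __name__ == "__main__":
    print(simulate_refresh_storm(100, 10, 1000, 5, 200))  # patient users: (100, 0, 0)
    print(simulate_refresh_storm(100, 10, 3, 5, 200))     # impatient users
```

With patience high enough that nobody refreshes, 100 requests clear in 10 ticks with nothing wasted; dropping patience to 3 ticks makes the same 100 users submit noticeably more requests, and every extra one is pure waste.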
I have also tested a very large email archive (100k mailboxes). What we did was run a pre-live trial with simulated data to check that everything was basically functioning OK. Once that was complete, we got volunteers from different areas of the business signed up by their bosses. Once we'd seen that work, we moved our users into the archive system in tranches of a few thousand at a time to make sure there were no adverse effects on their work or on the system.
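The tranche approach can be sketched roughly as below, assuming hypothetical `migrate` and `healthy` hooks in place of whatever the archiving product actually exposes; the pilot with volunteers would simply be the first, small tranche. The loop stops before the next batch as soon as a health check fails, so a problem only ever affects the users already moved.

```python
from typing import Callable, List, Sequence


def migrate_in_tranches(
    mailboxes: Sequence[str],
    tranche_size: int,
    migrate: Callable[[List[str]], None],
    healthy: Callable[[], bool],
) -> List[str]:
    """Move mailboxes one tranche at a time, halting at the first failed health check.

    migrate and healthy are hypothetical hooks, not a real archiving API.
    Returns the mailboxes migrated so far.
    """
    done: List[str] = []
    for start in range(0, len(mailboxes), tranche_size):
        tranche = list(mailboxes[start:start + tranche_size])
        migrate(tranche)
        done.extend(tranche)
        if not healthy():
            # Stop before the next tranche so any damage stays confined
            # to the users already moved.
            break
    return done
```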
I have a dev right now that is telling me we can just throw hardware at a design we are building. My old database and design experience tells me no amount of hardware makes up for sloppy code, it follows the 2nd Law of Thermodynamics. We are bringing in an architect and performance analyst.
It's never done is it?
I have this argument with devs and project managers at least every 6-9 months, and have had it for the last 16 years in database work. Yes, sometimes it is necessary, but as every DB tuning book has ever preached: start with the app code, get that almost perfect, then start picking on the DB and the hardware. Then when you're done, go round the loop again as many times as time allows.
I work in financial services IT, if we open another datacentre, move a system or systems, or add a fundamental component to our systems (such as when we Euro enabled) we test, test, test, test, every single component - in a dedicated test environment.
It must be really nice that Facebook users are worth so little to them that they effectively test in production. Where else would those users go, other than Facebook? (It's a serious question; I can't really think of anywhere else.)
I'm sure they are so proud of themselves, but let's see what they have achieved: They were not confident enough that their code base was developed properly for scalability, and so instead of building a simulation in a controlled test environment, they just tested in production.
They could have just switched on the third data center. What was the point of the simulation? If it failed, since it was handling real traffic, their production would have been impacted just the same, no?
Seems typical for Facebook, and perhaps for computer programmers fresh from college, with enough money and iron to do what they want without fear of consequences.
-dZ.
Another thing I see with CS guys straight from uni is that you have to school them in the way things are done in business. You see incredibly piss-poor treatment of students, who these days are paying customers to the tune of thousands of pounds, by their IT/IS services departments. My partner was at Oxford a few years back, and their webmail would regularly be taken down in the middle of the day for upgrade work. Having never worked in business, she didn't really understand that this isn't normal and shouldn't be put up with. Carry that over to a new graduate moving into business, and the uptime and customer-service ethos you really need to hit the ground running just hasn't been drummed into them at uni.
...at my first job out of university. Mind you, it was a bit of a departmental mindset! To this day my former colleagues and I can't quite figure out how we were allowed to get away with it.
Though in hindsight it did teach us all exactly how not to do things, which I'd like to think we all took note of! Also, fobbing off helpdesk/customer support bods is a lot easier than fobbing off rather irate traders...