n^2? Are you #*$&ing kidding me?
Having learned SRE by supporting Hangouts at Google, I know exactly why they have the front-end servers all talking to each other. It does not matter. IT. CANNOT. SCALE. Software engineers will completely rearchitect systems rather than implement n^2. If we cannot figure out how to do it ourselves, we ask for help. If that means bringing in a computer scientist, that's what we do. I'm not saying I've never delivered n^2 into prod. I'm saying I've never delivered it into prod where scale would ever be a concern.
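To put numbers on it (my own back-of-the-envelope sketch, not anything from the postmortem): with every front-end server holding a link to every other, the link count grows quadratically with fleet size, while one fetch per server from an authoritative store grows linearly.

```go
package main

import "fmt"

func main() {
	// Illustrative only: fleet-wide connection counts for a full mesh
	// (every front-end server talks to every other) versus each server
	// making a single fetch from an authoritative metadata store.
	for _, n := range []int{100, 1_000, 10_000} {
		mesh := n * (n - 1) // directed peer links: O(n^2)
		direct := n         // one fetch per server: O(n)
		fmt.Printf("servers=%6d  mesh links=%12d  direct fetches=%6d\n",
			n, mesh, direct)
	}
}
```

At 10,000 servers the mesh needs on the order of a hundred million links; the direct-fetch path needs 10,000 reads.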
The worst part about it is this: "To speed restart, in parallel with our investigation, we began adding a configuration to the front-end servers to obtain data directly from the authoritative metadata store rather than from front-end server neighbors during the bootstrap process." In other words, Amazon switched to a linear solution--one fetch per server from the authoritative store--in order to dig themselves out of the hole, but then immediately went back to the old way. Brilliant.
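As a sketch of why that configuration matters (hypothetical types and function names on my part; the postmortem only says the servers would obtain data directly from the authoritative metadata store during bootstrap): peer bootstrap costs each server work proportional to the fleet size, store bootstrap costs it one read.

```go
package main

import "fmt"

// ShardMap is a hypothetical stand-in for whatever a front-end server
// needs to learn before it can take traffic.
type ShardMap map[string]string

// bootstrapFromPeers models the original design: each server asks every
// other server for its view, so each server does O(n) work and a fleet-wide
// cold restart generates O(n^2) messages.
func bootstrapFromPeers(peers []string) ShardMap {
	m := ShardMap{}
	for _, p := range peers {
		// In the real system this is an RPC per peer; the loop is where
		// the per-server O(n) cost lives.
		m[p] = "learned-from-" + p
	}
	return m
}

// bootstrapFromStore models the mitigation Amazon described: one read of
// the authoritative metadata store per server, O(1) per server.
func bootstrapFromStore(fetch func() ShardMap) ShardMap {
	return fetch()
}

func main() {
	peers := []string{"fe-1", "fe-2", "fe-3"}
	fmt.Println("peer bootstrap entries:", len(bootstrapFromPeers(peers)))
	fmt.Println("store bootstrap entries:", len(bootstrapFromStore(func() ShardMap {
		return ShardMap{"shard-a": "fe-1", "shard-b": "fe-2"}
	})))
}
```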
Sure, cells will help. Sure, expanding the number of threads the OS supports will help. Now, where do you put the hard limit on the number of servers per cell--derived from the number of OS threads--so that no human can override or change it? No. Bad architect. No more colored markers. Fix the n^2 problem.
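If you do keep cells, one way to make that limit non-negotiable (a sketch under assumed numbers for thread cost and headroom, not Amazon's actual admission logic) is to derive the cap in code from the thread budget and refuse to grow the cell past it, rather than exposing it as a knob someone can bump.

```go
package main

import "fmt"

// Assumed numbers for illustration only.
const (
	osThreadBudget  = 30000 // threads the OS / ulimit will realistically allow
	threadsPerPeer  = 2     // threads each peer connection costs this server
	reservedThreads = 2000  // headroom for request handling, GC, etc.
)

// maxServersPerCell is computed, not configured: there is no flag or
// config file an operator can edit to raise it past what the OS supports.
func maxServersPerCell() int {
	return (osThreadBudget - reservedThreads) / threadsPerPeer
}

// admitToCell refuses to grow the cell past the derived cap.
func admitToCell(currentSize int) error {
	if currentSize+1 > maxServersPerCell() {
		return fmt.Errorf("cell full: %d servers is the hard cap", maxServersPerCell())
	}
	return nil
}

func main() {
	fmt.Println("hard cap:", maxServersPerCell())
	fmt.Println("admit at 13999:", admitToCell(13999))
	fmt.Println("admit at 14000:", admitToCell(14000))
}
```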
But, as is often the case, this jape is not over. Their systems, already straining to process new configuration information, were still trying to handle customer traffic--and overloading completely. Let me let you in on a secret: "My job is to keep the network up. It's merely out of my good graces that I allow customers on at all." If your servers are returning 100% 500s while keeping themselves and the network healthy, you can recover quickly. If they fall over, or the network does, that is BAD. Really, really bad. Design your systems to drop 100% of traffic before they fall over.
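A minimal sketch of that posture, assuming a simple in-flight-request cap as the overload signal (real systems would also shed on CPU, queue depth, and memory): turn traffic away with a fast error long before the process itself is at risk.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// maxInFlight is an assumed capacity number; in practice you would derive
// it from load testing.
const maxInFlight = 500

var inFlight atomic.Int64

// shed wraps a handler and returns an error response the moment the server
// is past its healthy capacity. Returning 100% errors is recoverable;
// letting the process fall over is not.
func shed(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if inFlight.Add(1) > maxInFlight {
			inFlight.Add(-1)
			http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
			return
		}
		defer inFlight.Add(-1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", shed(handler))
}
```

Whether the shed response is a 500 or a 503 matters far less than the fact that it is cheap, immediate, and leaves the process healthy enough to recover the moment load drops.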
Moving the critical services to dedicated servers is part of how you do that--that's a good move. But only part of it. As Google recently found out, you need a strict hierarchy for traffic. Configuration changes over everything. Critical logs next--but even here, you require rollups & squelching. Server fleet health traffic next. Then you get into general status & servicing the customer.
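One way to encode that hierarchy is a strict priority scheduler: a class is only served when every class above it is empty. The class names here are my own sketch, not Google's or Amazon's.

```go
package main

import "fmt"

// Priority classes in strict order. Lower value = served first.
type trafficClass int

const (
	ConfigChanges   trafficClass = iota // configuration pushes beat everything
	CriticalLogs                        // rolled up and squelched before sending
	FleetHealth                         // heartbeats, health checks
	CustomerTraffic                     // general status and customer requests
)

var names = map[trafficClass]string{
	ConfigChanges:   "config",
	CriticalLogs:    "critical-logs",
	FleetHealth:     "fleet-health",
	CustomerTraffic: "customer",
}

// nextToSend drains queues in strict priority order: customer traffic only
// moves when config, logs, and health are all clear.
func nextToSend(queues map[trafficClass][]string) (trafficClass, string, bool) {
	for _, c := range []trafficClass{ConfigChanges, CriticalLogs, FleetHealth, CustomerTraffic} {
		if q := queues[c]; len(q) > 0 {
			queues[c] = q[1:]
			return c, q[0], true
		}
	}
	return 0, "", false
}

func main() {
	queues := map[trafficClass][]string{
		CustomerTraffic: {"GET /status"},
		ConfigChanges:   {"push shard map v42"},
		FleetHealth:     {"heartbeat fe-7"},
	}
	for {
		c, msg, ok := nextToSend(queues)
		if !ok {
			break
		}
		fmt.Printf("[%s] %s\n", names[c], msg)
	}
}
```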
The other issue--and Amazon appears to be feeling its way in this direction--is that you MUST have a clear understanding of your dependencies, and systems in place to handle failures of those dependencies. Throwing capacity at a problem does not make it go away--it makes the eventual failure that much more spectacular.
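A generic way to make dependency failure a designed-for case rather than a surprise is a circuit breaker with an explicit fallback. This sketch uses my own naming and a stale-data fallback as the example; it is not anything from the Amazon postmortem.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// breaker is a minimal circuit breaker: after too many consecutive
// failures it stops calling the dependency for a cooldown period and
// uses the fallback instead, so the dependency's outage stays bounded.
type breaker struct {
	failures    int
	maxFailures int
	openUntil   time.Time
	cooldown    time.Duration
}

func (b *breaker) call(dep func() (string, error), fallback func() string) string {
	if time.Now().Before(b.openUntil) {
		return fallback() // dependency is known-bad; don't pile on
	}
	v, err := dep()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return fallback()
	}
	b.failures = 0
	return v
}

func main() {
	b := &breaker{maxFailures: 3, cooldown: 30 * time.Second}
	failing := func() (string, error) { return "", errors.New("metadata store timeout") }
	stale := func() string { return "serve last known-good shard map" }

	for i := 0; i < 5; i++ {
		fmt.Println(b.call(failing, stale))
	}
}
```

The point is not this particular pattern; it is that every dependency in the graph has a written-down answer to "what do we do when this is gone," and that answer is exercised before the outage, not invented during it.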