Signal president Meredith Whittaker says they had no choice but to use AWS, and that's a problem

Messaging service Signal may be unusual in its deployment of credible end-to-end encryption, but it shares a common availability vulnerability with many other internet services – dependence on Amazon Web Services (AWS). Signal, like many other internet services, failed briefly during the sizable AWS outage that occurred on …

  1. Mr.Nobody

    Really?

    I know of quite a few largish companies in the colo facilities that we use that do a whole lot without using any of the hyperscalers, including ourselves.

    We had a lot of people that focused on resume driven development over the last 10 years and put a whole bunch of critical services we use in AWS, but we have been slowly bringing them back to a colo running on our gear.

    If they are profitable enough to pay the AWS costs to run their service, they are profitable enough to do it themselves; they either lack the imagination or don't want to.

    1. Throatwarbler Mangrove Silver badge
      Thumb Down

      Re: Really?

      "If they are profitable enough to pay the AWS costs to run their service, they are profitable enough to do it themselves"

      This statement is simply wrong. As the Signal CEO notes, the organization runs a global operation, and it's almost certainly cheaper to piggyback on existing cloud infrastructure than to, for example, build out highly-available points of presence around the globe. Signal is also a non-profit organization, which presumably means they're not swimming in cash with which to build out and maintain that infrastructure. For many organizations, there can certainly be a point at which on-prem hosting becomes more affordable than paying a cloud provider, and your example of running your services "in a colo" is probably one of them, since you clearly don't need significantly distributed systems, but different organizations face different challenges.

      1. Mr.Nobody

        Re: Really?

        It looks like it's an LLC?

        https://en.wikipedia.org/wiki/Signal_Foundation#Signal_Messenger_LLC

        And again, colo space is relatively cheap for what one gets. WAN connections are cheap compared to what they used to be. Internet bandwidth is cheap compared to what it used to be.

        AWS has to do the same work as everyone else, but then they take a hunk in profit on top of it.

        For them to say they have no other choice is just not true.

        1. Throatwarbler Mangrove Silver badge
          Thumb Down

          Re: Really?

          From their Web site, Signal is a 501(c)(3) nonprofit.

          In any case, I suspect their CEO knows better what their challenges are than you do. Individual colo spaces, WAN connections, etc. may be inexpensive, but those costs, of course, mount as one needs more points of presence around the globe, and those costs, I suspect, are relatively small compared to the cost of deploying and supporting the operational technology needed to provide the actual service, which means network equipment, servers, storage, and the software needed to tie it all together. The value of global cloud providers is that they already have performed the heavy lifting in terms of deploying that stack; they've made the investment in the infrastructure, including software services, and are, in essence, renting it out to others. To correct your point, AWS has done the work that independent operators would have to do in order to provide similar capabilities.

          So, to the question about whether Signal has a choice, the question is really whether they could have reached global coverage in a cost-effective manner without running on previously-built cloud infrastructure. The CEO's answer seems to be "no," but she points out that it's a deal with the Devil, one made by many organizations, which puts outsized power in the hands of the hyperscalers, in particular Amazon.

          Again, because I suspect I know what the incoming comments will be, I'm not saying that public cloud is the right choice for every organization, but many companies and organizations find using cloud infrastructure highly beneficial for a variety of use cases.

    2. O'Reg Inalsin Silver badge

      Re: Really?

      We had a lot of people that focused on resume driven development ... - Does this mean employees, presumably management with such decision making authority, directed services to be ported to AWS because it looks good on their resumes?

      1. Mr.Nobody

        Re: Really?

        Developers that decided that a service had to be in AWS because they convinced the PHB it would be better and cheaper there and never go down and it would look good on their resumes.

        They built it with all sorts of SPOFs, documented nothing, got new jobs somewhere else and left operations people holding the bag with 5-digit monthly bills.

        They would sign up for third party services with their own email address and a company credit card. The card would be near the expiration date, and the third party would send an email to the person no longer at the company to say it needed to be updated. No one would answer, and the third party would cut off the service, bringing the whole service down.

    3. Anonymous Coward

      Re: Really?

      “ The result is that most companies, Signal included, can't afford to replicate AWS' global network of data centers and computing power. ”

      Smells of the same race to the bottom as importing and promoting the use of wholly disposable foreign agricultural workers for North American and European Farms, vineyards, fisheries, food processing etc over local workers or indigenous seasonal workers…. ‘Because we can’t afford’.

      1. Anonymous Coward

        Re: Really?

        Smells of the same race to the bottom as ...

        Yes, but then that's everything on the internet these days where most end users have become conditioned to "everything is free". I use Signal - despite pressure from family to join them on WhatsApp - but I have to confess I don't pay anything. Yes, that does make me a freeloader reliant on those that do pay (for whatever reason).

        And now I'm feeling guilty.

    4. Charlie Clark Silver badge

      Re: Really?

      You have a point – and both Dropbox and StackOverflow have been public about their move away from IaaS – and I think this was the case before Signal allowed bigger group calls, shortly followed by becoming massively more popular. Bandwidth then becomes an issue for which there is no quick and easy fix.

  2. Anonymous Coward

    So how exactly did Skype circa 2008 (pre-eBay, pre-MS) manage to have a global messaging system that performed on par with (or somewhat better than) Signal does today?

    1. Nuno

      I think that Skype was peer-to-peer; the comms did not go through a central server.

      1. Anonymous Coward

        As I understand it, Skype still relied on a central server for discovery and connection/session initiation, and message buffering.

        A P2P connection for the voice channel is what WebRTC gives us today (again, as I understand it).

        So the system seems much the same.

    2. Alumoi Silver badge

      Skype didn't have to store, analyse, share your data with every TLA, sell said data to anybody willing to pay... erm, sorry, suffer a security breach.

    3. doublelayer Silver badge

      Interesting you should ask. An article gives an interesting summary of how they used to do it, but a little blurb at the top gives us the important information:

      UPDATE: Unfortunately, this post is no longer accurate with regard to Skype’s infrastructure. After the massive Skype outage in December 2010, it was expected that Skype was exploring ways to make their system more stable and resilient. In early 2012, Skype (at that point now owned by Microsoft) was reported to have replaced much of the P2P supernode infrastructure with supernodes hosted in Microsoft data centers.

      So your answer is that they moved it to the cloud because their previous self-run infrastructure proved insufficient. Of course, they probably could have gotten a bunch of money and decided to build their own, and if they had built enough of that, they might have decided to rent out some of that and become a cloud provider in their own right.

      Global distributed systems take a lot of work that a lot of people choose not to think about. Me too, when I can get away with it, because a lot of it is boring. I have watched people think they've done it when they really haven't, though, and they tend not to like the results. If you host in a single facility, you're not distributed. If you think that a few colos on different continents does it, you're probably not as interrelated as global communication services are.

  3. Nate Amsden Silver badge

    Depends on their use case specifically

    and their requirements etc. Someone on LinkedIn mentioned to me last week something along the lines of "if you're going to build a CDN you have to use cloud", which is a line of BS; they thought pretty much all CDNs used public cloud (and I have no doubt many/most/all CDNs probably have some aspect of public cloud usage). CDNs are global, real-time mass comms platforms as well, yet all of the major ones and probably most/all of the minor ones use their own infrastructure for their edge. Not only for cost reasons but also (more important for them) routing/traffic control reasons (less important for an app like Signal).

    You can see pretty easily whether or not a CDN node (the most important part of a CDN) is using a public cloud or their own stuff by just looking at its IP: if the WHOIS info for the IP reveals a public cloud provider then that is clearly cloud; if it does not then most likely it is their own infrastructure.
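As a rough sketch of that check: the comment describes a WHOIS lookup on the node's IP. The offline variant below swaps the live WHOIS query for a match against a locally supplied list of provider prefixes, so it is not the same technique, only the same idea. The prefixes and provider names are hypothetical placeholders (RFC 5737 documentation ranges), not real cloud allocations, and `classify_ip` and `CLOUD_PREFIXES` are names made up for this illustration.

```python
import ipaddress

# Hypothetical prefix table. These are RFC 5737 documentation ranges
# used as stand-ins; real provider ranges would have to come from the
# providers' own published lists or from WHOIS data.
CLOUD_PREFIXES = {
    "example-cloud-a": ["192.0.2.0/24"],
    "example-cloud-b": ["198.51.100.0/24"],
}


def classify_ip(ip, prefixes=CLOUD_PREFIXES):
    """Return the provider whose prefix contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for provider, cidrs in prefixes.items():
        for cidr in cidrs:
            if addr in ipaddress.ip_network(cidr):
                return provider
    return None  # no match: possibly the operator's own address space


if __name__ == "__main__":
    for ip in ("192.0.2.10", "203.0.113.7"):
        print(ip, "->", classify_ip(ip) or "no known cloud prefix")
```

A miss only means the address is not in the supplied table, so the "most likely their own infrastructure" conclusion is only as good as the prefix list.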

    I know for example when Snapchat went public, here on el reg there was an article (https://www.theregister.com/2017/02/03/snap_files_for_ipo/), where Snapchat said they had committed to spending $400 MILLION PER YEAR to Google for their cloud stuff. Sorry it's going to be hard to convince me that they can't build their own global network for a lot less than $400M per year... Snapchat is in a similar model as Signal I think ... ? (never having used Snapchat though I do use Signal).

    To me, one of the best (on paper) use cases for public cloud is you have to go from say ~100 CPU cores to 5,000 CPU cores for max of 2 hours per day (averaged over a month, so say max of 60 hours per month). Building infrastructure for ~60 hours per month of usage probably doesn't make sense (though I haven't run the numbers specifically). Another really good use case for public cloud is one-off things; I think I have seen at least one article here on el reg about some group doing some kind of HPC test on cloud where they spun up a few thousand servers or something to do one test, then spun them down (never to be needed again). Obviously such situations are few and far between.
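A back-of-the-envelope sketch of the burst example, using only the figures in the comment (~100 cores baseline, 5,000 cores peak, at most ~60 hours per month). The 730-hour month and the function names are assumptions added for illustration, and the break-even ratio ignores everything except raw core-hours (staff, power, networking, discounts), so treat it as an order-of-magnitude hint rather than a cost model.

```python
# Figures taken from the comment: a ~100-core baseline that bursts to
# 5,000 cores for at most ~60 hours per month.
BASELINE_CORES = 100
PEAK_CORES = 5_000
BURST_HOURS_PER_MONTH = 60
# Assumption: an average month of 730 hours (8,760 hours / 12).
HOURS_PER_MONTH = 730


def burst_utilization(baseline, peak, burst_hours, month_hours=HOURS_PER_MONTH):
    """Fraction of the month the extra (peak - baseline) cores are busy."""
    extra_cores = peak - baseline
    used_core_hours = extra_cores * burst_hours
    owned_core_hours = extra_cores * month_hours
    return used_core_hours / owned_core_hours


def breakeven_price_ratio(burst_hours, month_hours=HOURS_PER_MONTH):
    """How many times more per core-hour rented capacity could cost
    before owning the burst capacity outright becomes cheaper."""
    return month_hours / burst_hours


if __name__ == "__main__":
    u = burst_utilization(BASELINE_CORES, PEAK_CORES, BURST_HOURS_PER_MONTH)
    ratio = breakeven_price_ratio(BURST_HOURS_PER_MONTH)
    print(f"burst capacity utilization: {u:.1%}")
    print(f"break-even price ratio: {ratio:.1f}x")
```

Under these assumptions the owned burst capacity would sit idle roughly 92% of the time, which is the intuition behind renting it instead.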

    (Again on LinkedIn) there was a clueless tech leader dude from State Farm who wrote a dumb post saying everyone should use cloud, at their scale they want their business not to be focused on computers etc (typical outsourcing BS). Anyway, I found it kind of ironic that more recently another person posted about how Geico (same industry as State Farm, insurance) spent a decade moving into public cloud spending $300M/year, only to find out (why did it take a decade to find out?) that it cost them 2.5X more, and now they have reversed course.

    But most anything with a real steady state load in 95%+ of use cases doesn't make sense to have on public cloud.

    THAT SAID - if you are happy with overpaying your public cloud provider, don't care about the costs and are just a happy customer, that is fine, continue to use them, just don't pretend that you are saving any money.

    IaaS is broken by design, something I first wrote about 15 years ago and posted a link here on El reg, here is the link again

    http://www.techopsguys.com/2010/10/06/amazon-ec2-not-your-fathers-enterprise-cloud/

    Some back story to that: at the time the CEO of the company I was at was/is the sister of the head of Amazon cloud (who is now the CEO of Amazon). I actually met with him and his chief scientist back in 2010 to complain about their bad service and he spent a bunch of time apologizing for it. But that's not the real story. The real story is even though I sent that link to my boss on that same day, he read it, and he thought it was a well thought out balanced post, someone over at Amazon got into a hissy fit and that came down on my employer (this was before noon on the same day as I posted it), who then gave me legal threats to take the post down (BS reasons). They threatened me again when I left the company (and triggered a mass exodus from the tech team, about a dozen came to the next company). I complied and hid the post for a few years; they eventually went out of business and I put it back up online about a decade ago.

    I've started to think I will refer to these people (like that State Farm person above), as members of "Cult of the Cloud". (for whatever reason I came up with that sort of named similarly as "Cult of the Dead Cow"), where they can be faced with so many different facts and figures and they are so brainwashed that they just can't believe their eyes/ears (similar to "MAGA" folks). Same sort of thing applies to so many folks pushing Kubernetes as well(and "IaC" to a lesser extent). All complicated coping means to try to tame "the cloud". Make it simpler, don't use it. (I happily admit there are use cases for all of these things they just don't apply to everyone(don't apply to most really), and many of these folks think these things should apply to everyone).

    The post is still valid today, as the flawed design of IaaS remains unchanged.

    I moved my last org out of AWS in early 2012 with a 7 month ROI, and followed with a decade of flawless operation.

    1. O'Reg Inalsin Silver badge

      Re: Depends on their use case specifically

      To me, one of the best (on paper) use cases for public cloud is you have to go from say ~100 CPU cores to 5,000 CPU cores for max of 2 hours per day ...

      In the days of new internet startups, in particular social media, waiting for that lucky viral moment when usage would suddenly explode was key to getting an IPO, and missing such an opportunity was like missing the boat. That whole industry though doesn't produce anything worthwhile except crappy LLM training data, and the negative cultural impact of viral+shallow is horrifying.

      1. Nate Amsden Silver badge

        Re: Depends on their use case specifically

        Viral moments should be cached by CDN. I worked for 2 social media startups in 2006-2008 and 2010-2011 (both in Seattle). The latter one used AWS (when I wrote that blog post). Their bill at times was in excess of $500,000/mo (I have always suspected due to the relationships they likely did not pay the full dollar value on their bills, but I have no proof either way). Not because they had tons of users, but because things were in such a chaotic state and high turnover. They did have bursts of traffic but in the grand scheme of things it was not a lot of traffic. I had a plan with a 6 and a half month ROI for bringing stuff in house. I didn't like the company much so I spent WAY TOO MUCH time on that presentation and research and stuff (I enjoyed it). (The executive slideshow was only 15 pages including a few pages with mostly images; the full technical slide show covering every aspect of things was a full 170 pages.)

        Everyone in the company was on board from my manager, to the CTO, the CEO, the software developers, everyone. The board shot the plan down and wanted to re-evaluate in a year or so. I left within a week of that. My manager resigned the day after I left, and a bunch more left soon after. My hiring manager at THAT company hired me at the next company where I spent over 10 years(that manager left after 2-3 years).

        I know AWS' support is better now, but here is an example from the time. My (then) new manager had a decade of experience working at Amazon (we had many ex-Amazon employees including our CTO). Our CEO was the sister of the head of AWS. We were in the same city as AWS. My manager reached out and in a kind way said basically "everyone at my company hates your product, non stop problems. We must be doing something wrong, can you come on site and talk to us about what is going on? Knowing we spend a lot of $$ in your cloud and we have a lot of relationships with your leadership". Their answer? (something along the lines of) "Tough shit, that's not our model, you figure it out". Even my manager was floored at the response. At an earlier company I was at, Oracle flew people on site on one occasion to deal with problems we were having (for multiple days), and we were spending a FRACTION on Oracle DB of what my social media company was spending on AWS. My (then) manager later went to work for Oracle cloud for a few years till he retired (he tried to hire me several times); another person on my team at that social media company still works for Oracle cloud as a tech architect of some kind (very smart guy, I didn't know him well).

      2. Nate Amsden Silver badge

        Re: Depends on their use case specifically

        memory triggered ... I remember one time the tech leadership of that social media company were freaking out claiming someone was attacking our site, and our site was crashing. It was crashing, they were hitting some "special" API endpoint I don't recall the details other than it was something like not even 3 requests per second. It was a joke, what a terrible code base (made in part again due to high turnover, stress, death marches etc).

        I also recall a couple of years after I left I happened to be in Seattle again visiting folks, I got a call early in the morning on my cell phone. Someone was trying to get in touch with someone at the company but they could not find contact info. Website had nothing, and I guess they weren't trying very hard because they came to me, apparently my contact info was on their domain still even though I left the company a long time ago. It seemed kind of strange... then he eventually came clean saying "I don't want to alarm you, but I am calling from the FBI". Oh, wow, ok. I never learned as to the cause of them wanting to contact the company(it was legit as far as I know). This caller was in search of log events for something... I was able to contact the company and get him in touch with them. I sort of joked with the company saying "Hey your splunk instance is on the internet you can just give him a login to it". The app stack did support "user generated content" forums, and other things, so I imagine some users posted some illegal content of some kind and that triggered the response. Nobody told me what the end result was beyond they were successfully in contact with the FBI.

  4. Paul Crawford Silver badge

    Internet damage mitigation

    The Internet, as in TCP/IP style packet switching, continues perfectly well if AWS goes down. It is the higher levels of services that depend upon AWS computing that go TITSUP when there is a problem.

    1. Bluck Mutter

      Re: Internet damage mitigation

      This is what pees me off, even in the press that should know better... some endpoint service that leverages the Internet goes down BUT that doesn't mean the Internet went down and yet it's presented as such.

      It also pees me off when something some vapid personality does goes viral and the headline states "XYZ breaks the Internet" or "The Internet melts down after...."

      Bluck

      1. Martin M

        Re: Internet damage mitigation

        For the sake of your blood pressure I’d let this one go. I’d estimate the last time the majority of users defined the internet as a set of networks connected by TCP/IP was about 1994, about when they stopped capitalising the first letter. Now it’s just ‘the set of things I access that aren’t on my computer/device’. You’ll notice the article uses lower case ‘i’ all the way through.

        1. I could be a dog really Silver badge
          Facepalm

          Re: Internet damage mitigation

          Or as Wes Borg put it in his Internet Helpdesk sketch, "the internet" is the big blue "e" or the big green "N"

    2. Irongut Silver badge

      Re: Internet damage mitigation

      Yeah I keep reading about how the internet was broken but it was working perfectly fine here. I never even noticed AWS had a wobble.

  5. jaypyahoo

    I think Signal should just go the paid app way like Threema does. Maybe a cheaper but yearly subscription.

  6. Marty McFly Silver badge
    Alert

    Signal is right to be worried

    <<Set your politics aside for a moment & look at the facts, okay?>>

    Remember "Parler"? This was a social media network that was popular with conservatives.

    Now rewind your clock back to January 2021. Google suspended Parler from their Play Store on January 8, citing posts that incited violence. Apple removed Parler on January 9th, after warning Parler to improve moderation. And AWS terminated their hosting on January 10th, effectively killing them. On January 11th, Salesforce blocked the RNC from sending emails. Shopify & Stripe also took politically motivated actions against some of their customers.

    This DID happen, and regardless of our personal politics we cannot deny it happened. Did Big Tech take these actions of their own accord, or was there collusion with government? Politics go back and forth, and we are mere pawns in their game. Neither the Red team nor the Blue team will hesitate to use this control & leverage in the future if they find it serves their interests.

    We should NEVER forget that it happened. Centralized Cloud services represent a real threat to free & secure communication - which is exactly what Signal provides. With this demonstrated past behavior from Big Tech, it is a small step to see Signal get a ransom note the next time there is a crisis.... "Give us a back door or go off-line".

    We simply CANNOT trust Big Tech in the future because of what they have done in the past. Yeah, given what Signal represents for security & privacy, they are in a tough spot when the next social unrest arrives.

  7. Anonymous Coward

    Self-Serving Bulls**t????

    Quote: "... Meredith Whittaker ... explained, "The question isn't 'why does Signal use AWS?' It's to look at the infrastructural requirements of any global, real-time, mass comms platform and ask how it is that we got to a place where there's no realistic alternative to AWS and the other hyperscalers......"

    Confused old fool here -- but way back when, the email standard on the internet got along REALLY WELL with a server-to-server (store-and-forward) protocol!!!

    Isn't Signal a modern email messaging system? Is this diatribe by Meredith Whittaker just self-serving bulls**t?

    1. Peter Gathercole Silver badge

      Re: Self-Serving Bulls**t????

      You can't really equate Signal with an email service. As was said, Signal does voice and video calls as well as messages, needing relatively low latency for connections. It works by having broker systems that effectively keep track of all online Signal client systems and let them find each other's contact addresses, which are then used for a point-to-point connection. As a result, if the broker systems are down, Signal stops.

      Email, even in its earlier incarnations, relies on fixed servers, at known IP addresses (though abstracted through DNS), which provide store-and-forward type messaging. People didn't send mail to your PC or phone, they sent it to an email server, which you then contacted (if you were remote) to check your mail. Thus, the number of mail systems that needed to be tracked was low, and finding one for a mail domain was initially done through DNS.

      Even now, the only way you get push mail to a mobile device is because your device either has a notification channel for waiting mail, or the client running on your remote PC or phone checks on a regular basis.

      And if a mail wasn't sent because there was no communication between the sending and receiving server, it would be left in the outbound queue on the sending server until either it was able to send it, or it timed out (normally days).
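A minimal sketch of that store-and-forward behaviour: a message stays in the sender's outbound queue across failed attempts and is given up on only after a maximum age. This is a toy model, not how any particular mail server implements it; the class and method names, the 5-day limit and the `try_deliver` callback are all assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedMail:
    recipient: str
    body: str
    queued_at: float  # time the message entered the queue, in seconds


class OutboundQueue:
    """Toy outbound queue: retry until delivered or expired."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.pending = deque()
        self.bounced = []

    def submit(self, recipient, body, now):
        self.pending.append(QueuedMail(recipient, body, now))

    def run_once(self, now, try_deliver):
        """One queue pass. try_deliver(mail) returns True on success."""
        still_pending = deque()
        for mail in self.pending:
            if now - mail.queued_at > self.max_age:
                self.bounced.append(mail)    # timed out: give up
            elif not try_deliver(mail):
                still_pending.append(mail)   # keep for the next pass
        self.pending = still_pending


if __name__ == "__main__":
    DAY = 86_400
    q = OutboundQueue(max_age_seconds=5 * DAY)  # assumed 5-day limit
    q.submit("user@example.org", "hello", now=0)
    q.run_once(now=1 * DAY, try_deliver=lambda m: False)  # remote server down
    print("pending after failed pass:", len(q.pending))
    q.run_once(now=2 * DAY, try_deliver=lambda m: True)   # remote server back
    print("pending after successful pass:", len(q.pending))
```

The point of the model is that a receiving-side outage delays mail rather than losing it, which is a tolerance a low-latency call or chat service does not have.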

      But email was still vulnerable to DNS problems. You could mitigate this by having your DNS domain hosted on more than one domain server, and even have a fallback mail server if you needed it.

      Traditional DNS and mail can be made resilient. Unfortunately, traditional DNS, with relatively long TTLs for cached entries, does not suit a dynamic environment where things can move, be duplicated or just as easily be deleted at the drop of a hat, so they've had to kludge DNS with ultra short TTLs, so that stale cached entries do not cause that new server you've just spun up to be ignored because the systems that need to talk to it have not seen the new DNS entries.
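The TTL trade-off described above can be sketched with a toy caching resolver: a cached answer is reused until its TTL runs out, so a record that changes upstream stays invisible for up to one TTL. Everything here (`TtlCache`, `make_zone`, the name `svc.example`, the documentation-range addresses) is hypothetical and only models the caching behaviour, not the DNS protocol.

```python
class TtlCache:
    """Toy resolver cache: an answer is reused until its TTL expires."""

    def __init__(self, lookup):
        self.lookup = lookup   # callable: name -> (address, ttl_seconds)
        self.entries = {}      # name -> (address, expires_at)

    def resolve(self, name, now):
        cached = self.entries.get(name)
        if cached and now < cached[1]:
            return cached[0]   # possibly stale, but still within its TTL
        address, ttl = self.lookup(name)
        self.entries[name] = (address, now + ttl)
        return address


def make_zone(address, ttl):
    """Return a mutable record plus a lookup function reading from it."""
    record = {"address": address, "ttl": ttl}
    return record, (lambda name: (record["address"], record["ttl"]))


if __name__ == "__main__":
    for ttl in (3600, 30):
        record, lookup = make_zone("192.0.2.1", ttl)
        cache = TtlCache(lookup)
        cache.resolve("svc.example", now=0)    # prime the cache
        record["address"] = "192.0.2.2"        # service moves to a new server
        seen = cache.resolve("svc.example", now=60)
        print(f"TTL {ttl}s: client sees {seen} one minute after the move")
```

With the long TTL the client keeps using the old address for the rest of the hour; with the short one it picks up the new server on the next lookup, at the cost of many more upstream queries.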

      Google has been pushing to change DNS for years, and maybe it is time for another, quicker and more responsive naming service to appear on the Internet. But I would be sad to see the current system be completely replaced, as it's served us moderately well for over 40 years. But maybe it is time for it to fade into the background of just serving legacy services.

    2. doublelayer Silver badge

      Re: Self-Serving Bulls**t????

      No, it's not self-serving, because Signal is not email. Signal has the voice and video call system, in fact they started with the voice call part, but even when it is text messaging which is a lot of the traffic, it's in that category known as "instant messaging". That sounds like kind of an old term; wasn't IM a thing we used to say a lot more in the 1990s and early 2000s? Yes, it was, to distinguish it from very not instantaneous email. Signal wants message delivery latency to be very low, and it is, whereas email latency is still higher, and they want to do it without having user-run mailservers, which they do. That means different infrastructure requirements, and if you don't understand why, you don't understand what communication apps do.

      1. Anonymous Coward

        Re: Self-Serving Bulls**t????

        @doublelayer

        Quote: "...you don't understand what communication apps do...."

        Maybe so......but YOU don't understand the point the original OP was making!!!

        1. doublelayer Silver badge

          Re: Self-Serving Bulls**t????

          I don't? Do clarify. I clarified the problems with their post. If I'm misunderstanding something important, for example the part where they made an inaccurate comparison between Signal and email, then you could explain to me and anyone unfortunate enough to share my misconception, what the problem is.

          1. Anonymous Coward

            Re: Self-Serving Bulls**t????

            @doublelayer

            ....and today (October 29) AWS has more problems....and today, so does Microsoft Azure.

            So......perhaps you can explain (again) why this "cloud" thing is "inevitable"....as suggested by Meredith Whittaker....

            1. doublelayer Silver badge

              Re: Self-Serving Bulls**t????

              How does that make any point relevant to the bad email comparison? The cloud viability discussion is about the ability to scale a global service with fewer resources, which has been covered in plenty of threads you have likely already read, and even if I totally agreed with your latest argument, which I don't because nobody, including Whittaker, said cloud never failed*, it wouldn't do anything to improve the previous one which is still incorrect.

              * In fact, the fact that it does sometimes fail is why she made the statement in the first place and isn't pleased with her limited options. Pointing out that she correctly recognized that cloud isn't perfection incarnate isn't very helpful at finding an alternative, especially if you start assuming that something other than cloud is.

  8. Claptrap314 Silver badge

    Hard to "get" until you've done it.

    I supported Hangouts 2015-6. There were two SRE teams (one video, one text), 16 members each. About 2/3rds of my alarms ended with me calling another SRE team to confirm that they knew about their issue.

    We were in about 20 data centers. For each datacenter, you need a team of SEs.

    My recollection is that we (the text side) ran on more than 100000 servers.

    While our SREs were dedicated, the servers, let alone the datacenters, most definitely were not. Nor were those other SREs I called.

    The hyperscalers have a huge advantage in amortizing servers and datacenters. Also, well, they used to have a huge institutional memory.
