AWS builds a DNS backstop to allow changes when its notoriously flaky US East region wobbles

The cause of major internet outages is often the domain name system (DNS) and/or problems at Amazon Web Services’ US East region. The cloud giant has now made a change that will make its own role in such outages less painful. As explained in a Wednesday post, AWS customers told the cloud colossus “they need additional DNS …

  1. Number6

    I assume this means they've set the default TTL on DNS queries to 60 minutes. I've done that sort of thing (except down to 10 minutes) for scheduled changes to allow IP address changes to propagate faster when a change is made. 60 minutes is probably a reasonable compromise for unscheduled outages.
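    The TTL-lowering trick described above can be sketched as a simple cutover timeline. A hedged illustration only: the function and the values are hypothetical, not anything AWS has published.

```python
from datetime import datetime, timedelta

def plan_cutover(change_time, normal_ttl_s, lowered_ttl_s):
    """Plan a DNS record change. The TTL must be lowered at least one
    full normal-TTL period before the change, so that every cached copy
    of the record carrying the old TTL has expired by cutover time."""
    return {
        "lower_ttl_at": change_time - timedelta(seconds=normal_ttl_s),
        "change_record_at": change_time,
        # Once the new address has been stable for a while, raise the
        # TTL back so resolvers stop hammering the authoritative servers.
        "restore_ttl_at": change_time + timedelta(seconds=lowered_ttl_s),
    }

# Example: 60-minute normal TTL, lowered to 10 minutes for the change.
plan = plan_cutover(datetime(2025, 11, 26, 3, 0), 3600, 600)
```

    The point is that caches holding the old record at the old TTL must all have expired before the switch, so the TTL drop has to land at least one full normal-TTL period in advance.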

    1. theblackhand

      It's not the DNS record lifetime that caused the issue - it's the inability to create new DNS entries, which stalls provisioning and creates a feedback loop: failures generate even more provisioning requests, further overloading the provisioning service.

      I believe it means that if AWS experiences a major issue with its provisioning service, it will allow an alternative method of provisioning resources within 60 minutes (an SLA, rather than how quickly they could potentially enable the feature).

      My guess is that AWS have preallocated a limited emergency range of DNS/IP mappings that can be safely allocated in an emergency and don't require DB access. That relieves pressure on the DynamoDB solution, allowing key organisations to recover faster, and probably allowing AWS to recover faster as well.
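      If that guess is right, the fallback path might look something like this minimal sketch. All names and addresses here are invented; this is not AWS's actual design.

```python
# Invented names throughout: a sketch of the "emergency pool" guess
# above, not AWS internals. Normal lookups go through the dynamic store
# (the DynamoDB-backed path); if that store is down, we answer from a
# small preallocated static pool that needs no database access at all.
EMERGENCY_POOL = {
    "elb-emergency-1.example.internal": "10.255.0.1",
    "elb-emergency-2.example.internal": "10.255.0.2",
}

def resolve(name, dynamic_store):
    try:
        return dynamic_store.lookup(name)    # normal, DB-backed path
    except ConnectionError:
        # Degraded mode: only the preallocated mappings are available.
        return EMERGENCY_POOL.get(name)

class DownStore:
    """Simulates the backing store being unreachable."""
    def lookup(self, name):
        raise ConnectionError("provisioning backend unavailable")
```

      The emergency answers are deliberately static: because they need no database round-trip, handing them out adds no load to the very system that is failing.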

  2. Anonymous Coward
    Anonymous Coward

    At last!

    There have been regular, albeit infrequent, global problems due to us-east-1 dependencies. At least if they are now admitting it openly, it may mean a program to fix it and give regions true independence. If they do, I think it will give AWS an advantage, with critical-app clients being more inclined to use it.

    1. Anonymous Coward
      Anonymous Coward

      Re: At last!

      Methinks you are much too optimistic. IIRC, they wrote the entire thing themselves. This means that the more fundamental the problem, the more of their stack they're going to have to redesign and then rewrite.

      1. Anonymous Coward
        Anonymous Coward

        Re: At last!

        >” Methinks you are much too optimistic. IIRC, they wrote the entire thing themselves. This means that the more fundamental the problem, the more of their stack they're going to have to redesign and then rewrite.”

        Not sure how doing something in-house can be seen as a negative in this respect.

        My understanding was that the majority of their stuff was proprietary, in-house, and I'd have thought it weird and uncharacteristic for them to have got a 3rd party tool in to do this for them. A 3rd party they don't own or have design control over.

        I don't see constant outsourcing as a positive, more a negative, especially in crucial systems. I doubt a 3rd party could have written and implemented a fix in 6 weeks.

        1. Roland6 Silver badge

          Re: At last!

          >” My understanding was that the majority of their stuff was proprietary, in-house, and I'd have thought it weird and uncharacteristic for them to have got a 3rd party tool in to do this for them.”

          My understanding is the tools are all based on Open Source with in-house/proprietary enhancements.

          So in some respects their infrastructure is based on “ A 3rd party they don't own or have design control over.”

          >” I doubt a 3rd party could have written and implemented a fix in 6 weeks.”

          I doubt AWS have written a full solution; they have most probably modified the implementation to make better use of existing capabilities. Expect that, in the background, a group is working on a more permanent solution. However, as with everything to do with cloud, don’t expect anything to be fed back to the original Open Source projects.

          1. StewartWhite Silver badge
            Mushroom

            Re: At last!

            AWS haven't fixed the problem. They've just introduced a bodge that might make things somewhat better next time round although as with any bodge there's a reasonably high likelihood that it will make things worse when the next issue occurs.

            Ultimately AWS can't be bothered to properly resolve the issue as they've accumulated too much technical debt to make anything other than a complete redesign work but that would cost too much time and money according to the beancounters. After all, $10+ billion in profits doesn't go very far you know.

    2. anothercynic Silver badge

      Re: At last!

      US-East-1 is the default selection in AWS, so it's no wonder that it gets hammered more than anything else.

      Maybe AWS should check roughly where the user is logging in from and suggest alternatives... It could also default European users to eu-west-1 (Ireland) or similar. Nothing stops you from selecting another region, but the default should be a region closer to you than US-East-1.

      Just my opinion...
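      The suggestion above could be sketched as a simple geography-to-region default. The country-to-region table is illustrative only; AWS does no such defaulting today, though the region identifiers used are the real public ones.

```python
# Hypothetical sketch: pick a default AWS region near the user instead
# of always us-east-1. Mapping is illustrative, not exhaustive.
NEAREST_REGION = {
    "GB": "eu-west-2", "IE": "eu-west-1", "DE": "eu-central-1",
    "FR": "eu-west-3", "JP": "ap-northeast-1", "AU": "ap-southeast-2",
    "US": "us-east-1",
}

def suggest_default_region(country_code):
    # Fall back to us-east-1 only when there is no better guess.
    return NEAREST_REGION.get(country_code.upper(), "us-east-1")
```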

    3. Excused Boots Silver badge

      Re: At last!

      "There's been regular, albeit infrequent, global problems due to us-east-1 dependencies. At least if they are now admitting it openly it may mean a program to fix it and give regions true independence.”

      This does sound like a step in the right direction. But I do worry that as AWS (and Azure and Google Cloud) become more and more complex, and the people who originally designed and built them leave or are let go (because AI), they may not actually know what all the dependencies on us-east-1 are until it falls over.

      A little like everyone is fine until the eponymous maintainer in Nebraska decides that he or she can’t be bothered anymore!*

      * No I’m not providing a reference, you all know what it means.

    4. TeeCee Gold badge

      Re: At last!

      ...give regions true independence.

      Surely that's the exact opposite of the correct approach? If this really were cloud computing then a major outage anywhere should have no impact on the users, as other locations just pick up the workloads and continue transparently.

      We still seem to be a long way away from getting the originally promised benefits, of which availability and fault-tolerance due to having no single point of failure were front and centre waving the flags.

  3. xyz Silver badge

    oh noes...

    are they going to fit an on/off switch, so they can switch it off and on again...?

    1. cd Silver badge

      Re: oh noes...

      If they cram AI into that switch, I'm on board. /sincerity

    2. FirstTangoInParis Silver badge

      Re: oh noes...

      So long as the switch is big and red and has “Do Not Touch” written on it we’re all good.

      1. Anonymous Coward
        Anonymous Coward

        Re: oh noes...

        *eyes the growing queue of monkeys waiting to smash that button....*

      2. An_Old_Dog Silver badge

        Re: oh noes...

        ... with a bypassable Mollyguard on it.

    3. TeeCee Gold badge

      Re: oh noes...

      Ooooohhhh. Look 'n feel lawsuit incoming from Microsoft.

  4. MJB7

    Testing?

    I don't see how you can properly test this until us-east-1 throws a wobbly. It's a worthy effort, but I won't get really excited until it is seen to work.

  5. This post has been deleted by its author

    1. FirstTangoInParis Silver badge

      Re: Testing

      So someone needs to be brave enough to provoke the sleeping dragon by taking it down, or a replica of it.

      1. Excused Boots Silver badge

        Re: Testing

        'Take off and nuke it from orbit, it’s the only way to be sure'

  6. Claptrap314 Silver badge

    El Reg should be more sceptical

    Yes, us-east-1 is huge. But that's not why so much of the trouble is there. AWS runs all of its "global" systems out of us-east-1, so a failure in that region affects a lot more than necessary. In particular, since IAM is global, if us-east-1 has issues impacting it, you cannot spin up new servers anywhere. us-east-1 is also where experiments happen. (Or, at least, it was -- have they fixed that?)
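    One concrete mitigation for the global-endpoint dependency described above: AWS STS (the token service behind many sign-ins) offers per-region endpoints alongside the legacy global one, and preferring the regional endpoint trims one us-east-1 dependency. Note that IAM itself (creating users and policies) remains a global service; this sketch just builds the hostname, assuming the standard endpoint naming.

```python
# STS has a legacy "global" endpoint, historically served out of
# us-east-1, plus per-region endpoints (e.g. sts.eu-west-1.amazonaws.com).
# This helper only constructs the URL; it makes no network calls.
def sts_endpoint(region=None):
    if region is None:
        return "https://sts.amazonaws.com"        # legacy global endpoint
    return f"https://sts.{region}.amazonaws.com"  # regional endpoint
```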

    Again, Amazon has never had the right people in the room to build a resilient system.

  7. Greyeye

    During the last outage, IAM login was disrupted, which also affected AWS CLI SSO login.

    Having a DNS console is rather pointless if you cannot log in, isn't it?

  8. Claptrap314 Silver badge

    Yo, dawg!

    I heard you liked DNS...

    My prediction: major outage caused by this new facility within 18 months...

  9. Jamie Jones Silver badge
    Headmaster

    "Yet last year, AWS told The Register that the scale of US East is not less reliable than its other regions, but operates at such colossal scale that it stresses cloud services more sternly than its smaller installations."

    In other words, it is less reliable.

  10. disk iops

    damnable fancy-pants

    The problem here is 2-fold.

    1 - too damn big of a single zone. "We" (internal staff) were harping on about this back in 2012: that it would be the ruin of us.

    2 - too damn cute by half.

    The footprint of these services is massive, and DNS in particular is LOUSY at 'real time' data updates, in particular because every client's software is STUPID about caching answers and won't move off and try the next one. But they chose to abuse the shit out of DNS so they could steer traffic and "rapidly" enter and exit resources that were coming in or going out/dead. IT IS NOT A SIN IF THE ENDPOINT IS NOT THERE! Since we KNOW DNS clients are total shite, you're supposed to do all your *magic* where people (clients) can't see it. Furthermore, they could and should have rolled out a custom resolver across the fleet that was smarter than the typical dumb-ass implementation and did things like using A record priority tags.

    They tried to force DNS into "instant" updates and it bit them rightfully in the ASS. It obviously doesn't help when your coders are too friggin stupid to write software that is DEFENSIVE in what it accepts. Partially wrong (old) answers are preferable to no answer. The internal rate-limiting is also mostly a manual intervention. There is/was very little AGGRESSIVE rate-limit logic built in. It is all written with the assumption "it just works".

    Akamai did this correctly - the IPs are fixed as far as the 'front door' is concerned. All the machinations and adding/dead discovery happens on the BACKSIDE with KNOWN FIXED last-ditch endpoints in their CDN pods. So yes, user traffic may overrun the capacity of these fixed points, but that's relatively minor compared to service outage. Furthermore, internal cross-talk should have utter priority if not its own sub-division of resources that are impervious to end-user load.
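    The "defensive" client behaviour the rant is asking for, in its simplest form, is: resolve every address for a name and try each in turn, instead of caching a single answer and failing hard when that one endpoint is dead. A rough sketch; real code would want Happy Eyeballs-style parallel attempts and backoff.

```python
import socket

# Try every address a name resolves to, in order, instead of giving up
# when the first (possibly stale) answer points at a dead endpoint.
def connect_any(host, port, timeout=3.0):
    last_err = None
    for family, socktype, proto, canonname, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            # addr is (ip, port) for IPv4 and a 4-tuple for IPv6;
            # create_connection only needs the first two fields.
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as err:
            last_err = err   # dead endpoint: move on to the next answer
    raise last_err or OSError(f"no addresses for {host!r}")
```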
