Outage and Recovery: What Comes Next After AWS Disruption

Though many services were eventually restored, questions remain about the risks of concentrated reliance on cloud providers.

Joao-Pierre S. Ruth

Senior Writer

December 09, 2021

Credit: Marcos Alvarado via Alamy Stock

On Tuesday, which should have been AWS Innovation Day at re:Invent 2021, Amazon Web Services instead was contending with yet another region outage that affected vast segments of the internet. Analysts with Forrester and Gartner say that while the issue was significant, it is not a reason to backslide on cloud migration, nor would doing so be realistic.

According to updates from AWS, the cause of the outage was resolved for the most part after some seven hours. Recovery of services continued after that. Beyond questions about how it happened, concerns turn to what systemic breakdowns in the cloud of this scale mean in a world dominated by a small group of hyperscalers.

AWS indicated the latest outage stemmed from “an impairment of several network devices” that affected the company’s Northern Virginia, US-East-1 Region. The outage struck EC2, DynamoDB, Athena, and Chime as well as other AWS APIs and services. This caused issues and downtime for third parties such as Disney Plus and Netflix. It also affected Amazon’s own resources such as its package delivery management software and the Alexa virtual assistant.

If this seems a bit like déjà vu, it should. About one year ago, in late November 2020, the US-East-1 Region of AWS saw an outage that the company attributed to issues as more capacity was added to its front-end servers for its Kinesis data stream.

While the frequency of such cloud outages has not necessarily increased, the overall impact increases, says Sid Nag, vice president of cloud services and technologies research for Gartner. “This was one of the largest since AWS started conducting business.”

Mission-Critical Apps More Susceptible

Back when organizations mostly ran non-mission-critical applications on the cloud, outages could be taken in stride more readily. The migration to the cloud has meant more mission-critical apps are susceptible to such disruptions, Nag says. “The cloud is a multitenant model,” he says. “Many different organizations were affected, not just IT services.” For example, the latest outage also cut off customers of Amazon Prime Video and Ring home monitoring service. “We’re seeing a bigger impact because of reliance on the cloud,” Nag says.

Consolidation of the cloud landscape has put the responsibility of maintaining this resource on the shoulders of a shrinking set of providers. That concentration may be a point of concern. “When they get impacted, it’s almost like ‘too big to fail,’” Nag says. “That kind of thing worries me.”

In addition to wanting to see greater architecture resiliency across data centers, he says it may be time for major cloud providers to work hand in hand when outages occur and cover each other’s traffic during widespread outages. “They’re not doing that today,” Nag says.

There are competitive business reasons that keep that from happening, he says, but there may come a time when providers either do it on their own or under some form of regulation. “These cloud providers have gotten so big; they just can’t go down and have the whole world around them crash for 24 to 48 hours,” he says. “Not acceptable.”

If the major cloud providers do not adopt such a strategy, Nag says there could be a way for those providers to create ecosystems of smaller cloud providers as their backups. There also may be a way to use edge computing solutions to run distributed cloud as another alternative, he says.

Hyperscalers Have Different Risk Profile

Brent Ellis, senior analyst with Forrester, says hyperscalers have a different risk profile than other data centers, and that brings complications to their environments that can cascade. “You can have a localized problem spread very quickly,” he says.

Outages are not just a problem for AWS. Other hyperscalers, Microsoft Azure and Google Cloud, have seen their share of outages and issues that were dealt with, Ellis says. In some instances, an outage may occur because of a mistyped command. Human error should not be an issue though, he says, if greater automation is properly deployed. He still sees significant value in adopting cloud, but organizations should also think about how they might mitigate risks. Attempting to revert to on-prem data centers may be harder than expected. “Once you’ve started a wholesale migration, it’s hard to replicate that infrastructure,” Ellis says.

As systems and cloud infrastructure become more interconnected, he says outages may mean organizations will just have to wait for the matter to be resolved. “Not a whole lot you can do,” Ellis says. “There is a reason why everything is measured in nines.”

The consolidation of cloud resources consolidates the risk, he says, which can be of great concern in a country where a large amount of the economy is dependent on hyperscalers. “When one of those very large data centers goes down, it affects 10s of thousands of companies, if not more, at the same time,” Ellis says.
