Monday, April 25, 2011

Timeline for Amazon Web Services' Failure

Here, in one easy-to-read place, is the timeline for Thursday's Amazon Web Services FAIL in their Northern Virginia data center. In case you are interested, my phone started ringing at 3:15 AM CST, about 20 minutes after the (still undisclosed) error that started this whole sorry mess.

Thursday April 21

  • 1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.
  • 2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.
  • 2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.
  • 3:20 AM PDT Delayed EC2 instance launches and EBS API error rates are recovering. We're continuing to work towards full resolution.
  • 4:09 AM PDT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55 AM PDT, began recovering at 2:55 AM PDT.
  • 5:02 AM PDT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone.
  • 6:09 AM PDT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.
  • 6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.
  • 7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.
  • 8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to resolve the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
  • 10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.
  • 11:09 AM PDT A number of people have asked us for an ETA on when we'll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We will update the community as we have more information.
  • 12:30 PM PDT We have observed successful new launches of EBS backed instances for the past 15 minutes in all but one of the availability zones in the US-EAST-1 Region. The team is continuing to work to recover the unavailable EBS volumes as quickly as possible.
  • 1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
  • 6:18 PM PDT Earlier today we shared our high level ETA for a full recovery. At this point, all Availability Zones except one have been functioning normally for the past 5 hours. We have stabilized the remaining Availability Zone, but recovery is taking longer than we originally expected. We have been working hard to add the capacity that will enable us to safely re-mirror the stuck volumes. We expect to incrementally recover stuck volumes over the coming hours, but believe it will likely be several more hours until a significant number of volumes fully recover and customers are able to create new EBS-backed instances in the affected Availability Zone. We will be providing more information here as soon as we have it.
  • Here are a couple of things that customers can do in the short term to work around these problems. Customers having problems contacting EC2 instances or with instances stuck shutting down/stopping can launch a replacement instance without targeting a specific Availability Zone. If you have EBS volumes stuck detaching/attaching and have taken snapshots, you can create new volumes from snapshots in one of the other Availability Zones. Customers with instances and/or volumes that appear to be unavailable should not try to recover them by rebooting, stopping, or detaching, as these actions will not currently work on resources in the affected zone.
  • 10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It's taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
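Amazon's afternoon guidance above boils down to two customer-side workarounds: launch replacements without pinning an Availability Zone (so the scheduler avoids the impaired zone), and recreate stuck volumes from snapshots in a healthy zone. A minimal sketch of both, written as plain parameter builders; the dict shapes loosely follow the EC2 API but are illustrative assumptions, not a real SDK:

```python
# Sketch of the two suggested workarounds, as plain parameter builders.
# Field names loosely mirror the EC2 API; they are assumptions for
# illustration, not calls into an actual SDK.

def build_launch_request(ami_id, instance_type, availability_zone=None):
    """Build an EC2 RunInstances-style request.

    Per Amazon's guidance, omit the Availability Zone so placement
    can fall anywhere except the impaired zone.
    """
    params = {"ImageId": ami_id, "InstanceType": instance_type}
    if availability_zone is not None:
        params["Placement"] = {"AvailabilityZone": availability_zone}
    return params

def build_volume_from_snapshot(snapshot_id, target_zone):
    """Recreate a stuck volume from its snapshot in a healthy zone."""
    return {"SnapshotId": snapshot_id, "AvailabilityZone": target_zone}

# Untargeted launch: no Placement key, so EC2 picks a healthy zone.
launch = build_launch_request("ami-12345678", "m1.large")

# Volume recovery: restore the snapshot into a different zone.
volume = build_volume_from_snapshot("snap-87654321", "us-east-1b")
```

The point of the first helper is what it leaves out: with no `Placement` in the request, Amazon's updated scheduler (per the 1:48 PM note) would steer the instance away from the impaired zone automatically.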
Friday April 22
  • 2:41 AM PDT We continue to make progress in restoring volumes but don't yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.
  • 6:18 AM PDT We're starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we'll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we'll let folks know. As volumes are restored, they become available to running instances; however, they will not be able to be detached until we enable the API commands in the affected Availability Zone.
  • 8:49 AM PDT We continue to see progress in recovering volumes, and have heard many additional customers confirm that they're recovering. Our current estimate is that the majority of volumes will be recovered over the next 5 to 6 hours. As we mentioned in our last post, a smaller number of volumes will require a more time consuming process to recover, and we anticipate that those will take longer to recover. We will continue to keep everyone updated as we have additional information.
  • 2:15 PM PDT In our last post at 8:49am PDT, we said that we anticipated that the majority of volumes "will be recovered over the next 5 to 6 hours." These volumes were recovered by ~1:30pm PDT. We mentioned that a "smaller number of volumes will require a more time consuming process to recover, and we anticipate that those will take longer to recover." We're now starting to work on those. We're also now working to enable customers to be able to launch EBS backed instances and create, delete, attach and detach EBS volumes in the affected Availability Zone. Our current estimate is that this will take 3-4 hours until full access is restored. We will continue to keep everyone updated as we have additional information.
  • 6:27 PM PDT We're continuing to work on restoring the remaining affected volumes. The work we're doing to enable customers to be able to launch EBS backed instances and create, delete, attach and detach EBS volumes in the affected Availability Zone is taking considerably more time than we anticipated. The team is in the midst of troubleshooting a bottleneck in this process and we'll report back when we have more information to share on the timing of this functionality being fully restored.
  • 9:11 PM PDT We wanted to give a more detailed update on the state of our recovery. At this point, we have recovered a large number of the stuck volumes and are in the process of recovering the remainder. We have added significant storage capacity to the cluster, and storage capacity is no longer a bottleneck to recovery. Some portion of these volumes have lost the connection to their instance, and are waiting to be connected before normal operations can resume. In order to re-establish this connection, we need to allow the instances in the affected Availability Zone to access the EC2 control plane service. There are a large number of control plane requests being generated by the system as we re-introduce instances and volumes. The load on our control plane is higher than we anticipated. We are re-introducing these instances slowly in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions. We are currently investigating several avenues to unblock this bottleneck and significantly increase the rate at which we can restore control plane access to volumes and instances, and move toward a full recovery.
  • The team has been completely focused on restoring access to all customers, and as such has not yet been able to focus on performing a complete post mortem. Once our customers have been taken care of and are fully back up and running, we will post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesn't happen again. Once we have additional information on the progress that is being made, we will post additional updates.
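The 9:11 PM update describes rate-limiting in its simplest form: re-introduce instances in small batches so the control plane never sees more than a bounded number of reconnection requests at once. A toy model of that pacing, with invented batch sizes and volume IDs:

```python
import collections

# Toy model of the paced re-introduction described in the 9:11 PM
# update: drain the reconnection queue in bounded batches so peak
# control-plane load per tick never exceeds max_per_tick. The batch
# size and volume IDs are invented for illustration.

def reintroduce(stuck_volumes, max_per_tick):
    """Return per-tick batches, each no larger than max_per_tick."""
    queue = collections.deque(stuck_volumes)
    batches = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_per_tick, len(queue)))]
        batches.append(batch)
    return batches

volumes = [f"vol-{i:04d}" for i in range(10)]
batches = reintroduce(volumes, max_per_tick=3)
# 10 volumes at 3 per tick -> 4 ticks; the last batch holds the remainder.
```

The trade-off Amazon names explicitly is visible here: a smaller `max_per_tick` protects the control plane but stretches out total recovery time, which is why their updates keep revising the ETA upward.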
Saturday April 23
  • 1:55 AM PDT We are continuing to work on unblocking the bottleneck that is limiting the speed with which we can re-establish connections between volumes and their instances. We will continue to keep everyone updated as we have additional information.
  • 8:54 AM PDT We have made significant progress during the night in manually restoring the remaining stuck volumes, and are continuing to work through the remainder. Additionally we have removed some of the bottlenecks that were preventing us from allowing more instances to re-establish their connection with the stuck volumes, and the majority of those instances and volumes are now connected. We've encountered an additional issue that's preventing the remainder of the connections from being established, but are making progress. Once we resolve this bottleneck, we will work on restoring full access for customers to the control plane.
  • 11:54 AM PDT Quick update. We've tried a couple of ideas to remove the bottleneck in opening up the APIs, each time we've learned more but haven't yet solved the problem. We are making progress, but much more slowly than we'd hoped. Right now we're setting up more control plane components that should be capable of working through the backlog of attach/detach state changes for EBS volumes. These are coming online, and we've been seeing progress on the backlog, but it's still too early to tell how much this will accelerate the process for us.
  • For customers who are still waiting for restoration of the EBS control plane capability in the impacted AZ, or waiting for recovery of the remaining volumes, we understand that going hours at a time without information is difficult for you. We've been operating under the assumption that people prefer us to post only when we have new information. But enough people have told us that they prefer to hear from us hourly (even if we don't have meaningful new information) that we're going to change our cadence and try to update hourly from here on out.
  • 12:46 PM PDT We have completed setting up the additional control plane components and we are seeing good scaling of the system. We are now processing through the backlog of state changes and customer requests at a very quick rate. Barring any setbacks, we anticipate getting through the remainder of the backlog in the next hour. We will be in a brief hold after that, assessing whether we can proceed with reactivating the APIs.
  • 1:49 PM PDT We've reduced the backlog of outstanding requests significantly and are now holding to assess whether everything looks good to take the next steps toward opening up API access.
  • 2:48 PM PDT We have successfully completed processing the backlog of state changes between our control plane services and the degraded EBS cluster. We are now starting to slowly roll out changes that will re-enable the EBS APIs in the affected zone. Once that happens, requests to attach volumes and detach volumes will begin working. If that goes well, we will open up the zone to the remainder of the EBS APIs, including create volume and create snapshot. In parallel, we are continuing to manually recover the remainder of the stuck volumes. Once API functionality has been restored, we will post that update here.
  • 3:43 PM PDT The API re-enablement is going well. Attach volume and detach volume requests will now work for many of the volumes in the affected zone. We are continuing to work on enabling access to all APIs.
  • 4:42 PM PDT Attach and detach volume requests now work for all volumes that have been recovered. We are still manually recovering volumes in parallel; the APIs will still not work for any volume that has not been recovered yet. We are currently working to enable the ability for customers to create new volumes and snapshots in the affected zone.
  • 5:51 PM PDT We have now fully enabled the create snapshots API in addition to the attach and detach volume APIs. Currently all APIs are enabled except for create volume which we are actively working on. We continue to work on restoring the remaining stuck volumes.
  • 6:56 PM PDT The create volume API is now enabled. At this time all APIs are back up and working normally. We will be monitoring the service very closely, but at this time all EC2 and EBS APIs are operating normally in all Availability Zones. The majority of affected volumes have been recovered, and we are working hard to manually recover the remainder. Please note that if your volume has not yet been recovered, you will still not be able to write to your volume or successfully perform API calls on your volume until we have recovered it.
  • 8:39 PM PDT We continue to see stability in the service and are confident now that the service is operating normally for all API calls and all restored EBS volumes.
  • We mentioned before that the process to recover this remaining group of stuck volumes will take longer. We are being extra cautious in this recovery effort. We will continue to update you as we have additional information.
  • 9:58 PM PDT Progress on recovering the remaining stuck volumes is proceeding slower than we anticipated. We are currently looking for ways in which we can speed up the process, and will keep you updated.
  • 11:37 PM PDT The process of recovering the remaining stuck volumes continues to proceed slowly. We will continue to keep you updated as we make additional progress.
Sunday April 24
  • 1:38 AM PDT We are continuing to recover remaining stuck EBS volumes in the affected Availability Zone, and the pace of volume recovery is now steadily increasing. We will continue to keep you posted with regular updates.
  • 3:12 AM PDT The pace of recovery has begun to level out for the remaining group of stuck EBS volumes that require a more time-consuming recovery process. We continue to make progress and will provide additional updates on status as we work through the remaining volumes.
  • 5:05 AM PDT As detailed in previous updates, the vast majority of affected EBS volumes have been restored by this point, and we are working through a more time-consuming recovery process for remaining volumes. We have made steady progress on this front over the past few hours. If your volume is among those recently recovered, it should be accessible and usable without additional action.
  • 7:22 AM PDT No significant updates to report at this time. We continue to make steady progress on recovering remaining affected EBS volumes and making them accessible to customers.
  • 9:59 AM PDT We continue to make steady progress on recovering remaining affected EBS volumes and making them accessible to customers. If your volume is not currently responsive, we recommend trying to detach and reattach it. In many cases that may restore your access.
  • 11:36 AM PDT The number of volumes yet to be restored continues to dwindle. If your volume is not currently responsive and your instance was booted from EBS, you may need to stop and restart your instance in order to restore connectivity.
  • 2:06 PM PDT We continue to make steady progress on recovering the remaining affected EBS volumes. We are now working on reaching out directly to the small set of customers with one of the remaining volumes yet to be restored.
  • 7:35 PM PDT As we posted last night, EBS is now operating normally for all APIs and recovered EBS volumes. The vast majority of affected volumes have now been recovered. We're in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes.

    If you believe you are still having issues related to this event and we have not contacted you tonight, please contact us here. In the "Service" field, please select Amazon Elastic Compute Cloud. In the description field, please list the instance and volume IDs and describe the issue you're experiencing.

    We are digging deeply into the root causes of this event and will post a detailed post mortem.
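Sunday's 9:59 AM and 11:36 AM posts give customers a small decision tree for a still-unresponsive volume: detach and reattach it, unless the instance booted from EBS, in which case stop and start the instance. A hedged sketch of that logic; the state flags and action names are assumptions loosely modeled on the guidance, not an actual SDK:

```python
# Sketch of the customer-side recovery steps from the Sunday updates.
# The flags and action strings are illustrative assumptions, not an
# actual EC2 SDK interface.

def recovery_action(volume_responsive, instance_ebs_booted):
    """Suggest the next step for a volume that may still be stuck."""
    if volume_responsive:
        return "none"                      # volume already recovered
    if instance_ebs_booted:
        return "stop-and-start-instance"   # 11:36 AM guidance
    return "detach-and-reattach-volume"    # 9:59 AM guidance
```

Note the order matters: for an EBS-booted instance the root device is the stuck volume itself, so detaching it is not an option and the whole instance has to be cycled.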

