We’d like to apologize for the service disruption that occurred on Dec 22nd, 2021, between 7:33am EST and 12:07pm EST.
Jump Desktop is a critical tool many teams rely on to get their daily work done, which is why our cloud services are designed for redundancy and fault tolerance: they are built to withstand multiple availability zone (AZ) failures in AWS.
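For context, this is roughly what that design looks like at the load-balancing layer. The sketch below (Python with boto3; the names and subnet IDs are hypothetical placeholders, not our actual infrastructure code) attaches a single application load balancer to subnets in three different AZs, which is what normally lets traffic route around a failed zone.

```python
import boto3

# Minimal multi-AZ load balancer sketch. The subnet IDs are hypothetical
# placeholders; each subnet lives in a different availability zone, so the
# load balancer can keep serving traffic if any single AZ fails.
elbv2 = boto3.client("elbv2", region_name="us-east-1")

response = elbv2.create_load_balancer(
    Name="connect-lb",  # hypothetical name
    Subnets=["subnet-az1", "subnet-az2", "subnet-az3"],
    Scheme="internet-facing",
    Type="application",
)
print(response["LoadBalancers"][0]["DNSName"])
```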
Although Amazon’s public status page for the incident states the failure happened in only one AZ, our experience during the outage and our conversations with Amazon technical support suggest the failure was wider than what was reported on the public status page.
In our case, the AWS load balancers for the Jump Desktop Connect service in the entire region (rather than one AZ) stopped forwarding traffic, which caused the downtime.
Timeline
- On Dec 22nd, at approximately 7:33am EST, our monitoring system alerted us to a drop in traffic.
- While investigating the issue, engineers discovered some servers in the fleet were unresponsive. Jump’s cloud services are designed with server failures in mind: we run redundant servers in multiple availability zones to handle exactly this case.
- Our architecture relies on Amazon’s AWS load balancers to route traffic when servers fail or are replaced. In this case, engineers noted that the load balancers were not routing traffic even to healthy servers (see the diagnostic sketch after this timeline).
- Although Amazon’s status page did not show it, a conversation with AWS technical support revealed that load balancers across the entire region were having trouble, not just in one AZ. This was consistent with what our engineering team was seeing.
- During the outage, engineers tried multiple workarounds, including spinning up brand-new servers in healthy availability zones, failing over to backup load balancers, and creating brand-new load balancers.
- At approximately 12:07pm EST, AWS load balancers resumed routing traffic, which resolved the issue.
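The checks that narrowed the problem down to the load balancers themselves looked roughly like the sketch below (Python with boto3; the ARNs and resource names are hypothetical placeholders, not our production tooling): targets report healthy while the balancer forwards little or no traffic.

```python
import boto3
from datetime import datetime, timedelta, timezone

elbv2 = boto3.client("elbv2", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical placeholders for the target group and load balancer.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/connect/abc123"
LB_DIMENSION = "app/connect-lb/abc123"  # the "LoadBalancer" metric dimension

# 1. Are the backend servers healthy from the load balancer's point of view?
health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
states = [t["TargetHealth"]["State"] for t in health["TargetHealthDescriptions"]]
print("target states:", states)

# 2. Is the load balancer actually forwarding traffic? Sum RequestCount over
#    the last 15 minutes; near-zero traffic despite healthy targets points at
#    the load balancer itself rather than the servers behind it.
now = datetime.now(timezone.utc)
metrics = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": LB_DIMENSION}],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
print("requests:", [point["Sum"] for point in metrics["Datapoints"]])
```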
How we plan on moving forward
- We are working with AWS technical support to understand why Jump Desktop’s load balancers across the entire region were impacted during what was reported as a single-AZ outage.
- We will implement cross-region failover. Based on three AWS incidents over the last month, it is clear that single availability zone failures tend to spill over into other AZs and cause region-wide problems. Cross-region failover would have allowed our engineers to recover Jump’s Connect service much sooner (see the DNS failover sketch after this list).
- We will add a feature that lets administrators enable manual fluid connections locally from the machine. Some customers were unable to use manual fluid connections during the outage because Connect Settings prevented this option from being enabled locally. We will provide a secure way to override this.
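For readers curious what cross-region failover can look like at the DNS layer, here is a minimal sketch using Route 53 failover routing (Python with boto3). All zone IDs, DNS names, and health-check IDs are hypothetical placeholders; this illustrates the general technique, not our actual implementation.

```python
import boto3

# Route 53 failover routing: serve the PRIMARY record while its health check
# passes, and automatically switch to the SECONDARY record when it fails.
route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z_EXAMPLE_ZONE"          # placeholder hosted zone
PRIMARY_HEALTH_CHECK_ID = "hc-primary-id"  # placeholder health check

def upsert_failover_record(role, lb_zone_id, lb_dns_name, health_check_id=None):
    """UPSERT one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "connect.example.com",      # placeholder service hostname
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,                   # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": lb_zone_id,     # the load balancer's hosted zone
            "DNSName": lb_dns_name,         # the load balancer's DNS name
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# Primary region's load balancer, guarded by a health check...
upsert_failover_record("PRIMARY", "Z_LB_ZONE_USEAST1",
                       "primary-lb.us-east-1.elb.amazonaws.com",
                       PRIMARY_HEALTH_CHECK_ID)
# ...and a standby load balancer in a second region.
upsert_failover_record("SECONDARY", "Z_LB_ZONE_USWEST2",
                       "standby-lb.us-west-2.elb.amazonaws.com")
```

Because Route 53 is a global service, failing over this way does not depend on the affected region’s load balancers recovering.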