Post-Mortem: Service Incident with Jump Desktop Connect Service on July 25-26, 2023

Incident Summary

On July 25, 2023, starting at around 10am EST, customer tickets alerted us to an issue with the Jump Desktop Connect service. Affected users could not see or connect to all of their available machines via the client. We traced the issue to a network partition within our message passing service cluster that occurred during a server replacement process the previous day. The incident impacted a subset of our users over approximately 24 hours and was fully resolved as of 7:35am EST on July 26, 2023.

Detailed Timeline

July 24, 2023 - 3:29pm EST: The incident began. A fault developed on one of the machines that run our services, which automatically triggered a replacement process. While the new machine was initializing, several nodes of our message passing service were unable to communicate with the other nodes. This created a network partition in the cluster, which was the root cause of the incident.

July 25, 2023 - 1pm EST: The problem became apparent as the new machine took on more load. The first user complaints arrived at around 10am EST, but because load was transferred to the new server gradually, most users were not affected in the initial hours.

July 25, 2023 - 1:35pm EST: Our technical team intervened to repair the network partition. This intervention was only partially successful: 18% of machines remained partitioned.

July 26, 2023 - 7:35am EST: A second attempt to repair the network partition resolved the issue. All machines were confirmed to have joined the cluster correctly at this time.

Impact

The main symptom of this issue was users seeing an inconsistent view of the available machines on their client. Depending on which server they were connected to, they could only see and connect to certain machines while others appeared offline. Restarting the client would result in a different set of available and unavailable machines.
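The symptom described above can be illustrated with a minimal sketch. It assumes a simplified model in which each server only knows about machines registered with servers in its own partition. The server names, machine names, and the `visible_machines` function are hypothetical and do not reflect the actual Jump Desktop Connect implementation.

```python
from typing import Dict, Set

# Hypothetical data: which side of the partition each server ended up on,
# and which machines registered with each server.
PARTITION_OF_SERVER: Dict[str, str] = {
    "server-a": "p1",
    "server-b": "p1",
    "server-c": "p2",
}
MACHINES_ON_SERVER: Dict[str, Set[str]] = {
    "server-a": {"office-pc"},
    "server-b": {"laptop"},
    "server-c": {"home-mac"},
}


def visible_machines(client_server: str) -> Set[str]:
    """Return the machines that look online to a client on client_server."""
    partition = PARTITION_OF_SERVER[client_server]
    visible: Set[str] = set()
    for server, machines in MACHINES_ON_SERVER.items():
        # Only servers in the same partition can share presence information.
        if PARTITION_OF_SERVER[server] == partition:
            visible |= machines
    return visible
```

In this model, a client that reconnects and lands on a server in a different partition sees a different set of machines, which matches the behavior users reported after restarting the client.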

Steps We're Taking

In response to this incident, we are making the following changes to prevent similar incidents in the future:

1. Improve Monitoring: We are enhancing our monitoring systems to detect potential network partitions and other inter-node communication issues early in the process.

2. Message Passing Architecture: We are investigating an architectural redesign for the message passing service to minimize the chances of future network partitions.
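As an illustration of the kind of check the monitoring improvement could perform, the following is a minimal sketch of partition detection. It assumes each node can report the set of peers it currently reaches. The function names, the input format, and the node names in the usage note are hypothetical and are not taken from our production tooling.

```python
from typing import Dict, List, Set


def find_partitions(peers: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group nodes into connected components, largest first.

    `peers` maps each node to the set of nodes it reports it can reach.
    Assumption: a link counts as usable if either side reports it.
    """
    adjacency: Dict[str, Set[str]] = {node: set() for node in peers}
    for node, seen in peers.items():
        for other in seen:
            if other in adjacency:
                adjacency[node].add(other)
                adjacency[other].add(node)

    components: List[Set[str]] = []
    unvisited = set(adjacency)
    while unvisited:
        start = unvisited.pop()
        component = {start}
        stack = [start]
        # Depth-first walk over everything reachable from `start`.
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    component.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def partitioned_fraction(peers: Dict[str, Set[str]]) -> float:
    """Fraction of nodes that are outside the largest component."""
    components = find_partitions(peers)
    total = sum(len(component) for component in components)
    if total == 0:
        return 0.0
    return 1.0 - len(components[0]) / total
```

A healthy cluster yields a single component and a fraction of 0. For example, calling `find_partitions({"n1": {"n2"}, "n2": {"n1"}, "n3": set()})` returns two components, which a monitor could treat as a signal to alert.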

We sincerely apologize for any inconvenience caused by this incident and appreciate your understanding. Our team is committed to learning from this incident and making necessary improvements to our services and systems.
