Skype Explains Outage In-Depth
Six days after an extended outage left its network inaccessible to many users for nearly 24 hours, Skype CIO Lars Rabbe has published a post-mortem write-up of the situation.
Essentially, a server overload set off a chain of events that struck at the very core of the P2P network that keeps Skype running. As a result, the service was down for many users for up to 24 hours.
In his blog post, Rabbe describes the sequence of circumstances that led to the outage. Aside from the initial overload of a cluster of support servers, the main point of breakdown was the Skype for Windows client: instead of correctly processing the delayed responses from the overloaded servers, Skype for Windows version 5.0.0.152 would crash.
The latest version of Skype for Windows (5.0.0.156), the 4.0 versions of Skype for Windows, Skype for Mac, Skype for iPhone, Skype on your TV and Skype Connect/Skype Manager were not affected by this first wave of issues.
The problem, unfortunately, was that approximately 50% of all Skype users across the globe were using the 5.0.0.152 version of Skype for Windows. This was the first stable release of Skype 5, released in October. The updated version of Skype for Windows was released on December 14, but unless a user happened to manually check for the update or download the latest version, chances are he or she was running the crashtastic Windows client. Rabbe says that program crashes caused approximately 40% of clients running the buggy version of Skype for Windows to fail. In other words, about 20% of all Skype clients in use (40% of that 50%) failed because of this issue with the older version of the software.
This is where the perfect storm elements start to come together. Those failed clients represented 25 to 30% of the publicly available “supernodes.” In essence, a supernode is a connection point that can also help funnel traffic for other users. In a peer-to-peer VoIP network like Skype, a client must connect to a supernode in order to make a connection, send voice or video data or exchange instant messages. By default, every Skype client can be a supernode, depending on your firewall settings and bandwidth capacity. If your Skype client crashed and you were a supernode, the number of available connection points for other users just dropped.
Rabbe writes, “The failure of 25–30% of supernodes in the P2P network resulted in an increased load on the remaining supernodes. While we expect this kind of increase in the instance of a failure, a significant proportion of users were also restarting crashed Windows clients at this time. This massively increased the load as they reconnected to the peer-to-peer cloud.”
As luck would have it, all of this occurred just before the usual daily peak in usage. That meant that traffic to the remaining supernodes “was about 100 times what would normally be expected at that time of day.” To further complicate matters, this additional load triggered built-in protection mechanisms, because under ordinary circumstances such a load could indicate something beyond just a sudden drop in available supernodes. These triggers created what amounted to a positive feedback loop: overloaded supernodes shut themselves off, which in turn overloaded other supernodes, causing them to shut themselves off, and so on. This was the event that basically took down Skype for the majority of users, whether you were using Windows or not.
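To make that feedback loop concrete, here is a minimal toy sketch in Python. It is not Skype's actual protection logic, which Rabbe's post does not spell out. The function name, the capacity figures and the assumption that load is spread evenly across the surviving supernodes are all hypothetical, and the model ignores the extra load from restarting clients.

```python
def simulate_cascade(capacities, total_load):
    """Return how many supernodes are alive after each round of shutdowns.

    Hypothetical toy model: the total load is split evenly across the live
    nodes, and any node whose share exceeds its capacity shuts itself off.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        share = total_load / len(alive)
        survivors = [cap for cap in alive if cap >= share]
        if len(survivors) == len(alive):
            break  # every remaining node can carry its share; stable
        alive = survivors
        history.append(len(alive))
    return history


# Ten made-up supernodes with capacities 10, 12, ..., 28 and a load of 100.
healthy = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
print(simulate_cascade(healthy, 100))      # [10] -> stable, nobody shuts off

# Three nodes (30%) crash; the same load now falls on the other seven.
after_crash = healthy[:7]
print(simulate_cascade(after_crash, 100))  # [7, 4, 0] -> full collapse
```

In this made-up example the network is comfortable with all ten nodes, but losing three pushes the weakest survivors over their limit, and each round of shutdowns raises the share carried by whoever is left until nothing remains.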
Lessons Learned
This Skype outage and Rabbe’s detailed explanation are interesting in that they highlight what was, for all intents and purposes, a fluke. Had the Windows client not been prone to crashing, and had the outage not struck just before the daily usage peak and just ahead of a major holiday, the situation likely would have been very different.
The big takeaway, at least from our perspective, is that Skype needs to look at providing better auto-update mechanisms for its desktop clients. While it’s true that auto-updating can be considered user-hostile, for minor (relatively speaking) revisions like the latest Skype update, it would probably be better to push those updates to clients automatically and set that as the default. This is what Google does with its Google Chrome browser to great success. Skype wouldn’t even have to go as far as Google: it could still require users to approve an upgrade to a major version (provided the old version was still supported) and only auto-update smaller hot fixes.
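As a rough illustration of that split policy, here is a short Python sketch. It describes neither Skype's nor Google's actual updater. The function names and the "auto"/"ask"/"none" labels are hypothetical, and it assumes version strings are dot-separated integers like 5.0.0.152 with the first field as the major version.

```python
def parse_version(text):
    """Turn a string such as '5.0.0.152' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def update_action(installed, available):
    """Decide how a hypothetical updater should treat an available release.

    Returns "none" if the release is not newer, "ask" if it changes the
    major version (user approval required), and "auto" for smaller fixes.
    """
    current = parse_version(installed)
    candidate = parse_version(available)
    if candidate <= current:
        return "none"
    if candidate[0] != current[0]:
        return "ask"
    return "auto"


# The hot fix from the article would be applied silently under this policy.
print(update_action("5.0.0.152", "5.0.0.156"))  # auto
# A jump across major versions (the 4.x number here is made up) needs consent.
print(update_action("4.2.0.1", "5.0.0.156"))    # ask
```

Under a policy like this, users on 5.0.0.152 would have received 5.0.0.156 without having to check for it, while anyone still on a 4.x client would have been left alone until they agreed to the jump.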
This outage was also an interesting look at how the Skype ecosystem operates. Skype continues to be unique amongst VoIP providers in part because of its P2P roots. This system is an integral part of why Skype works so well, but under the right circumstances, it can also create its own unique set of problems.
This post was written by: Albertolida