Facility power maintenance on 8/16 and 8/20 will cause downtime
Aug 14 2019 12:01:10 AM PT
Our facilities provider in Chicago (Equinix) has notified us that they will replace half of their automatic static transfer switches on Friday, August 16, between 10pm CDT and 6am CDT the following morning, and the remaining transfer switches on Tuesday, August 20, during the same hours.
They have told us that these replacements will take down specific power feeds for the entire maintenance window. This will affect customers in the following ways:
- Because most of our network switches are single-fed, switches connected to the affected power feeds will lose connectivity on each night. We have asked the facility to mitigate this by moving those switches to different power feeds once power is lost, and then back again afterward. As a result, nearly all customers in Chicago will see at least two connectivity blips on one of the maintenance nights. Each interruption could last as long as an hour, depending on how quickly the facility moves the power cords in our cabinets; we will ask them to prewire the new cords to speed up this process. (Our core routers and core aggregation switch have redundant power and should not go offline.)
- Most of our VDS, standalone game server, and standalone voice server machines are single-fed, so we will need to power these machines down completely during the maintenance events. On each night, approximately 30 minutes before the 10pm start, we will run a script that shuts down the VDS-hosting machines that our records indicate will lose power that night. Our records may not be fully accurate, some shutdowns may not complete cleanly, and the facility could make mistakes (such as starting the maintenance early or taking down the wrong circuit). We therefore recommend that VDS customers back up important files before each maintenance window and limit heavy disk writes around the start of each window, since in-flight disk writes can cause file or disk corruption when power is lost abruptly.
- Newer dedicated customer machines (E3-1270v3 and better) have redundant power supplies plugged into two separate power feeds, so they shouldn't go offline. However, most of these machines are designed to throttle CPU performance when one of the two feeds is lost, so these customers will see higher CPU usage until power is restored each night.
Losing power to switches and servers is a big deal, and a multi-hour outage starting near peak usage hours makes this event even more serious. The facility's layers of UPS and generator equipment exist to provide continuous power and prevent exactly this type of outage. We are making sure the facility understands how serious this is, and that they take every possible step to carry out the maintenance correctly and help us keep downtime to a minimum.
We will update this event as we receive more information, for example if the start time or duration of the maintenance changes.