On November 25, 2020, Amazon Web Services' Kinesis service suffered an extended outage that affected the availability of Kickserv's server infrastructure (along with many other applications on the Web). During the outage, Kickserv engineers built and transitioned to a backup production web environment that did not depend on the affected services. Total Kickserv application downtime was approximately 2 hours 25 minutes.
(all times EST)
At 8:14 am, Amazon's Kinesis service began experiencing an outage. However, since Amazon's Service Health Dashboard also relies on Kinesis, most AWS tenants were unaware of the problem until later.
At 10:54 am, Kickserv's alerting systems began indicating an application outage. Engineers discovered that Elastic Container Service (ECS) was affected, but Elastic Compute Cloud (EC2) was not. Over the next two hours, engineers built a new, temporary production web application environment based on EC2 servers. All application data was preserved.
By 1:20 pm, main Kickserv functionality was once again available. Over the next two hours, engineers restored search indexing, calendar live updates, and QuickBooks™ Desktop sync.
By 6:58 am on November 26, ECS was once again functional, and all other affected AWS services that Kickserv relies on were restored by 2:24 pm that day. As of this writing, Kickserv is once again running on the regular ECS production stack.
AWS outages that affect an entire application are very rare; only two such events in the past seven years have caused Kickserv to be down for an hour or more. However, they do happen, and our reliance on a single production environment is a single point of failure that should be eliminated. During the next few weeks, Kickserv engineers will build a failover production environment in a separate AWS availability zone. The backup environment will maintain a replica of all application data (database records and user-uploaded files), and with some improvements to our DNS architecture, we expect to be able to switch over in 30 minutes or less. We expect to have the new stack in place, along with protocols and training, by the end of January.
We realize that our customers depend on Kickserv for the minute-by-minute smooth operation of their business, and regret the inconvenience this event caused. Thank you for bearing with us.
For more detail, see AWS's report: Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region