Infrastructure issues
Incident Report for Kickserv
Postmortem

Kickserv 11/25/20 Outage

Summary

On November 25, 2020, Amazon Web Services' Kinesis service suffered an extended outage, which impacted availability of Kickserv's server infrastructure (along with many other applications on the Web). During the outage, Kickserv engineers built and transitioned to a backup production web environment that did not depend on the affected services. Total Kickserv application downtime was approximately 2 hours 25 minutes.

Situation

(All times in this section are EST. The status updates further down are stamped in CST, one hour earlier.)

At 8:14 am, Amazon's Kinesis service began experiencing an outage. However, since Amazon's Service Health Dashboard also relies on Kinesis, most AWS tenants were unaware of the problem until later.

At 10:54 am, Kickserv's alerting systems began indicating an application outage. Engineers discovered that Elastic Container Service (ECS) was affected, but Elastic Compute Cloud (EC2) was not. Over roughly the next two and a half hours, engineers built a new, temporary production web application environment based on EC2 servers. All application data was preserved.

By 1:20 pm, main Kickserv functionality was once again available. Over the next two hours, engineers restored search indexing, calendar live updates, and QuickBooks™ Desktop sync.

By 6:58 am on November 26, ECS was once again functional, and all other AWS services that Kickserv relies on were restored by 2:24 pm that afternoon. As of this writing, Kickserv is once again running on the regular ECS production stack.
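For readers cross-checking the timeline against the status updates below, the downtime figure and the EST-to-CST conversion can be reproduced with a few lines of Python. This is only a sketch: it assumes the 10:54 am alert and the 1:20 pm restoration mark the start and end of the outage, and it uses fixed UTC offsets (valid for late November) rather than a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are an assumption that holds for late November (no daylight saving).
EST = timezone(timedelta(hours=-5), "EST")
CST = timezone(timedelta(hours=-6), "CST")

# Times taken from the Situation section above (EST).
alert_start = datetime(2020, 11, 25, 10, 54, tzinfo=EST)
service_restored = datetime(2020, 11, 25, 13, 20, tzinfo=EST)

# Length of the application outage in whole minutes.
downtime = service_restored - alert_start
downtime_minutes = int(downtime.total_seconds() // 60)

print(f"Downtime: {downtime_minutes // 60} h {downtime_minutes % 60} min")
print("Alert start (CST):", alert_start.astimezone(CST).strftime("%H:%M"))
print("Service restored (CST):", service_restored.astimezone(CST).strftime("%H:%M"))
```

The result, 2 h 26 min, matches the approximate figure given in the Summary. The restoration time converts to 12:20 CST, shortly before the "Monitoring" update posted at 12:24 CST.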

Remedy

AWS outages that affect an entire application are very rare; only two such events in the past seven years have caused Kickserv to be down for an hour or more. However, they do happen, and our reliance on a single AWS environment is a single point of failure that should be eliminated. During the next few weeks, Kickserv engineers will build a failover production environment in a separate AWS availability zone. The backup environment will maintain a replica of all application data (database records and user-uploaded files), and with some improvements to our DNS architecture, we expect to be able to switch over in 30 minutes or less. We expect to have the new stack in place, along with protocols and training, by the end of January 2021.
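To give a sense of what a switchover decision could look like, here is a minimal sketch in Python. It is not Kickserv's actual tooling: the class name `FailoverController`, the hostnames, and the threshold of three consecutive failed health checks are all hypothetical, and the real DNS update is left out entirely. The sketch only tracks which environment traffic should point to.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Hypothetical helper that decides which environment should receive traffic."""

    primary: str
    backup: str
    failure_threshold: int = 3  # assumed value, chosen for illustration
    consecutive_failures: int = 0
    active: str = ""

    def __post_init__(self) -> None:
        # Traffic starts on the primary environment unless told otherwise.
        if not self.active:
            self.active = self.primary

    def record_check(self, healthy: bool) -> str:
        """Record one health check of the primary and return the active target."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # In a real setup, this is where the DNS record would be repointed.
                self.active = self.backup
        return self.active

    def fail_back(self) -> None:
        """Return traffic to the primary; deliberately a manual step."""
        self.consecutive_failures = 0
        self.active = self.primary


controller = FailoverController(
    primary="primary.example.com",  # placeholder hostname
    backup="backup.example.com",    # placeholder hostname
)
for result in (False, False, False):
    target = controller.record_check(result)
print(target)
```

Requiring several consecutive failures avoids switching over on a single transient error. Failing back is kept manual in this sketch so that traffic does not bounce between environments while the primary is still recovering.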

We realize that our customers depend on Kickserv for the minute-by-minute smooth operation of their businesses, and we regret the inconvenience this event caused. Thank you for bearing with us.

For further reading

Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region

Posted Dec 02, 2020 - 14:05 CST

Resolved
This incident has been resolved.
Posted Nov 26, 2020 - 10:33 CST
Update
QuickBooks™ Desktop sync has been restored to service. Kickserv should be fully operational once again. We are still using a backup server infrastructure, so you may occasionally notice longer than usual page load times. Don't hesitate to reach out if you have questions or experience problems.
Posted Nov 25, 2020 - 16:07 CST
Update
Calendar views should now be operating normally. We are still working to restore QuickBooks™ Desktop sync.
Posted Nov 25, 2020 - 15:33 CST
Update
QuickBooks/Xero sync, search, and emails are once again operating normally on our backup infrastructure. We're still working on the calendar views. We'll post another update when those are restored.
Posted Nov 25, 2020 - 14:27 CST
Update
While most areas of Kickserv are up and running, we are still working on a few areas that are not yet functioning properly on our backup infrastructure:

- Calendar screens
- Live search results
- Adding notes and files

We want to stress again that no data has been lost. For instance, if an event is missing from your calendar, you should find it on the Schedule screen. We’re working on restoring full operation and we’ll let you know if we encounter any more difficulties.

Thanks for your continued patience.
Posted Nov 25, 2020 - 13:28 CST
Update
Display of the calendar pages continues to be affected by the AWS outage. While we're sorting that out, you can use the Schedule page or the Jobs page to keep track of today's events.
Posted Nov 25, 2020 - 12:49 CST
Monitoring
AWS is still having trouble, but in the meantime, we have restored service. Please let us know of any ongoing availability issues. Your data is safe: https://www.kickserv.com/security/
Posted Nov 25, 2020 - 12:24 CST
Update
We're continuing to work on our contingency plan for getting around the AWS outage. We will update this page the second we have service restored.
Posted Nov 25, 2020 - 11:58 CST
Identified
We are still working with the AWS team to learn when their systems will be back up. In the meantime, we are continuing to work through alternative solutions to get our service restored.
Posted Nov 25, 2020 - 11:13 CST
Investigating
We're currently experiencing infrastructure issues with Amazon Web Services. The issue seems to be affecting many parts of the Internet. Thanks for bearing with us while we investigate further.
Posted Nov 25, 2020 - 10:10 CST
This incident affected: Web Application (app.kickserv.com), Customer Center, QuickBooks Online Sync, QuickBooks Desktop Sync, Email Notifications, Mobile API, Amazon Web Services, iOS App, and Android App.