Brief unexpected downtime

Incident Report for Kickserv

Postmortem

At 5:55 pm US Central Time on January 25, Kickserv’s locations database (used for tracking technician movements in the field) began its automated daily cleanup operations. At nearly the same time, one or more mobile devices running the Kickserv application reconnected to the network and began uploading a large backlog of locations recorded throughout the day while the devices were offline. The combination of these two events (a large database insert operation and a maintenance process that momentarily locks data) degraded Kickserv database performance over the next several minutes. The database eventually deadlocked at 6:13 pm, making the main application unavailable.

Kickserv engineers were automatically alerted and switched over to the backup database, a process that took one minute (no data is lost during the switchover process). Degraded database performance continued for a few more minutes while engineers reset the web application to clear the database locks. Full availability was restored by 6:23 pm US Central Time.
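The timeline above can be checked with a little arithmetic. The following is a minimal sketch that uses only the three clock times stated in this report; the helper name `central` and the variable names are illustrative and not part of any Kickserv system.

```python
from datetime import datetime

def central(hhmm):
    """Parse a 24-hour US Central clock time on the incident date."""
    return datetime.strptime("2021-01-25 " + hhmm, "%Y-%m-%d %H:%M")

# Times taken from the report, all US Central.
maintenance_started = central("17:55")
database_deadlocked = central("18:13")
service_restored = central("18:23")

# Whole minutes between each pair of events.
degraded_minutes = int((database_deadlocked - maintenance_started).total_seconds() // 60)
unavailable_minutes = int((service_restored - database_deadlocked).total_seconds() // 60)

print(f"Maintenance start to deadlock: {degraded_minutes} min")
print(f"Deadlock to full availability: {unavailable_minutes} min")
```

The second figure is an upper bound on the outage: the switchover itself took about one minute, and the remainder was degraded performance rather than continuous unavailability.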

The downtime was caused by unanticipated database load from the combination of two things: a spike in location insert operations and a regularly scheduled maintenance procedure. The engineering team is pursuing two courses of action to keep these events from coinciding again:

The maintenance process has already been rescheduled for later in the evening to avoid close-of-business time on the US West Coast (but not so late as to affect start-of-business in Australia).
Location storage may be moved out of the main Kickserv database and into Amazon’s DynamoDB service (a lower-latency, high-performance serverless database service) so as to avoid placing excessive load on the rest of the system.
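When weighing a new maintenance time, it can help to see what a given US Central time looks like locally in each region mentioned above. The sketch below is only an illustration under stated assumptions: it uses fixed January UTC offsets rather than a full time zone database, takes Sydney as a stand-in for Australia, and uses a hypothetical candidate time of 9:00 pm Central, since this report does not state the actual new time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC offsets valid for late January 2021.
CENTRAL = timezone(timedelta(hours=-6), "CST")
REGIONS = {
    "US West Coast": timezone(timedelta(hours=-8), "PST"),
    "Sydney": timezone(timedelta(hours=11), "AEDT"),  # stand-in for Australia
}

def local_times(start):
    """Map each region name to the local date and time of `start`."""
    return {
        name: start.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z")
        for name, tz in REGIONS.items()
    }

# Maintenance start time from the report.
original = datetime(2021, 1, 25, 17, 55, tzinfo=CENTRAL)
# Hypothetical candidate; the real rescheduled time is not given in the report.
candidate = datetime(2021, 1, 25, 21, 0, tzinfo=CENTRAL)

for label, start in (("original", original), ("candidate", candidate)):
    for region, local in local_times(start).items():
        print(f"{label:9} {region:14} {local}")
```

A production check would use real time zone data (for example, Python's `zoneinfo`) so that daylight saving changes are handled throughout the year.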

Kickserv apologizes for any inconvenience caused by the downtime.

Posted Jan 26, 2021 - 14:40 CST

Resolved

At 6:13 pm US Central Time on Monday, January 25, the Kickserv application was briefly unavailable due to a spike in operations on our mobile location tracking database. The downtime was limited to brief periods between 6:13 and 6:23 pm US Central Time, by which time the database had completed its regular failover operation. We have identified the cause of the surge in database operations and will post details to this page shortly.

Posted Jan 25, 2021 - 18:13 CST