At 5:55 pm US Central Time on January 25, Kickserv's locations database (used for tracking technician movements in the field) began its automated daily cleanup operations. At nearly the same time, one or more mobile devices running the Kickserv application reestablished contact with the network and began uploading a large backlog of locations that had been stored throughout the day while the devices were offline. The combination of these two events (a large database insert operation and a maintenance process that momentarily locks data) degraded database performance over the next several minutes; at 6:13 pm the database became deadlocked and the main application became unavailable.
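The insert spike is easiest to picture from the client side. The sketch below is a hypothetical illustration of the behavior described above, not Kickserv's actual mobile code: location fixes are queued locally while the device is offline, and the entire backlog is flushed in a single request when connectivity returns, which the server turns into one large insert.

```python
from dataclasses import dataclass, asdict


@dataclass
class LocationFix:
    """A single GPS fix recorded on the device."""
    technician_id: int
    latitude: float
    longitude: float
    recorded_at: float  # Unix timestamp


class OfflineLocationQueue:
    """Hypothetical sketch: queues fixes while the device is offline,
    then flushes the whole backlog at once when connectivity returns."""

    def __init__(self, upload):
        self._upload = upload  # callable that sends a list of dicts to the server
        self._backlog = []

    def record(self, fix):
        self._backlog.append(fix)

    def on_reconnect(self):
        # A device offline all day can hold thousands of fixes; sending
        # them in one request concentrates the server's insert load into
        # a single moment.
        if self._backlog:
            self._upload([asdict(f) for f in self._backlog])
            self._backlog.clear()
```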
Kickserv engineers were automatically alerted and switched over to the backup database, a process that took one minute; no data was lost during the switchover. Degraded database performance continued for a few more minutes while engineers restarted the web application to clear the database locks. Full availability was restored by 6:23 pm US Central Time.
The downtime was caused by unanticipated database load from the combination of two things: a spike in location insert operations and a regularly scheduled maintenance procedure. The engineering team is pursuing two courses of action, one addressing each contributing factor: spreading out the upload of backlogged locations from reconnecting devices so they no longer arrive as a single large insert, and rescheduling the daily maintenance procedure so that it does not coincide with periods of heavy insert activity.
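A minimal sketch of the first remediation, assuming a single upload endpoint (the chunk size, delay, and function names are illustrative assumptions, not Kickserv's actual values): the backlog is flushed in small batches with a pause between them, so a device returning from a day offline spreads its insert load over time rather than delivering it all at once.

```python
import time

CHUNK_SIZE = 200        # fixes per request; illustrative value
CHUNK_DELAY_SECS = 2.0  # pause between requests; illustrative value


def flush_backlog_throttled(backlog, upload):
    """Upload queued location fixes in small, spaced-out batches so a
    reconnecting device cannot produce a sudden spike of insert
    operations on the server."""
    for start in range(0, len(backlog), CHUNK_SIZE):
        upload(backlog[start:start + CHUNK_SIZE])
        if start + CHUNK_SIZE < len(backlog):
            time.sleep(CHUNK_DELAY_SECS)
```

Throttling on the client keeps each individual request small, which also means a short maintenance lock only delays one small batch instead of stalling a single day-sized insert.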
Kickserv apologizes for any inconvenience caused by the downtime.