My name’s Josh and I’m the lead developer at FieldClock. I’d originally planned a happier topic for our first news post, but last week we had a disruption that caused some headaches and I’m going to address that instead.
Many large tech companies publish summaries after serious incidents. A typical summary describes the timeline of the event and (in varying degrees) the details of the causes and effects. FieldClock is still in a state of rapid growth but we respect the example that’s been set and we appreciate the trust that our customers have placed in us. When something goes wrong, you’ll hear about it from us.
This news feed won’t just be disclosures. In fact, we’re working our hardest to make sure that it rarely has any. The rest of the time we’ll be posting cool and exciting tidbits, like new features that are coming down the pipe. But we’re either winning or learning, and last week we learned the following…
Summary of Temporary Sync Issues
On July 6 at 3:06am, an update went live on our server that introduced a bug. This bug could cause the server to ignore some sync data when it was pushed from phones in a certain way.
To give you some background, the mobile app syncs data slightly differently when you are viewing the Jobs List than when you are viewing a specific Job’s details. If a user scanned some badges at a job, then “exited” to the Jobs List before the pending data could be sent, the server may have ignored some of that data. Make no mistake, this situation (lost data) is the absolute worst-case scenario for us.
Starting on Thursday afternoon, we received a few sporadic reports that clocked-out workers “weren’t staying clocked-out”, but we were unable to reproduce the issue and all of our tests were passing, so we couldn’t determine whether there was a real problem out there. (We often get reports of “my data didn’t sync” when it turns out the phones had no coverage or were in airplane mode.)
On Friday we had more reports of issues, and we tracked the cause (or so we thought) to the sync behavior in the iOS app. Based on the timing of the issue, it appeared the problem was simply something we hadn’t seen in the iOS app until it hit mid-harvest volume. We rushed an update to Apple and waited for them to approve it and push it to the App Store. At that point, we had reports from two customers that something was “off”, but we didn’t realize how widespread the issue was. We tweaked some settings on our server to minimize the potential impact, and we didn’t get any new reports.
Late on Monday, Apple approved our iOS update and we thought we were home free. But on Tuesday morning we got more intermittent reports of the same symptoms. By then we had received enough information from a variety of sources to pinpoint with certainty a line of code on the server, pushed the previous week, as the cause. We wrote new tests, verified the fix, and pushed an update that went live at 3:11pm on July 11.
The extent of the potential impact was not immediately visible. Most users worked entire harvest shifts with all of their data synchronizing properly, while others had large gaps in their datasets. We were able to push updates to the server that prompted many phones to re-sync old data, but some accounts still ended up with gaps.
If you have any uncertainty about your data from July 6–11, please contact us and we will help you analyze it to see whether you were impacted.
We learned a lot from this incident, and we’re using it to make FieldClock better and more resilient.
We need to communicate better with all of our customers when something might be going haywire. We should proactively let you know about potential issues, even ones that may not affect you. To that end, we’re going to take a much more aggressive stance on email notifications, launch a Status web page where you can keep an eye on things, and post updates on this news feed.