Post-Mortem Analysis of Monday's Reporting Issue

We had an issue with Monday night's run that affected most of our customers' reports. If you've checked your stats in the last couple of days, you might have noticed that you're no longer seeing reports for previous months.

It's all fixable, and we expect to have everybody's reports back to normal in a day or so, but I think it's important to let everybody know exactly what happened, and what we're doing to make sure it never happens again.

We pushed a new version of the software that runs our reports on Monday. That build had a bug that caused it to skip downloading the history files that Webalizer, our reporting package, needs to do its thing. Without those files in place, Webalizer assumed it was running reports for a single day on a site it had never seen before. Worse, when we pushed those reports out, we overwrote the existing, good, history files that were in place.
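To make the failure mode concrete, here is a minimal sketch of the kind of guard that would stop a run like this before anything gets published. It is not our actual code. The function name, the `has_prior_reports` flag, and the exception class are all made up for illustration, and the `webalizer.hist` default is an assumption about how the history file is named in a given setup.

```python
from pathlib import Path


class MissingHistoryError(RuntimeError):
    """Raised when a bucket with earlier reports has no local history file."""


def check_history(work_dir, has_prior_reports, history_name="webalizer.hist"):
    """Return the history file path, or None for a genuinely new bucket.

    All names here are hypothetical. The idea is only that a missing
    history file is treated as an error whenever we already know the
    bucket has reports from earlier months.
    """
    history = Path(work_dir) / history_name

    # A non-empty history file means the download step did its job.
    if history.is_file() and history.stat().st_size > 0:
        return history

    # No history, but the bucket is not new: stop before the report run,
    # so a fresh single-day history never overwrites the good one.
    if has_prior_reports:
        raise MissingHistoryError(
            f"history file missing for existing bucket: {history}"
        )

    # First run for this bucket: starting without history is expected.
    return None
```

With a check like this, a skipped download shows up as a failed job rather than as a report that looks like the site's first day.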

Fortunately, no old reports were affected (you can verify this by changing the date in the URL for one of your March detail reports), so this is all fixable by rebuilding reports for March and regenerating the summary pages that show previous months' activity.

Step One is complete. As soon as we had a good handle on what had happened, we spun up 100 Amazon EC2 machines and set them loose to completely rebuild March reports for every affected bucket and distribution. As of this morning, all March reports are back up to date.

Step Two will involve rebuilding historical monthly data for those buckets, which is relatively straightforward. We're currently testing the job that will do that, and plan to rebuild summary reports for everybody's buckets over the next few days.

Moving forward, it's clear that we need to look at our testing and deployment procedures to ensure that a bug like this has a harder time finding its way onto our production machines. In hindsight, the actual code issue was so small (a single line) and so testable that it's amazing it got past our unit and integration tests and went live. In actuality, though, we simply didn't have a unit test case that covered that particular piece of code, and the issue manifested in a way that exploited a minor difference between our integration environment and production. Still, a live run on one of our test buckets using an actual production worker machine would have caught this, and that was the one missing piece in our deploy process. That will be fixed.

Anyway, please accept my apology for Monday night's outage. We'll try to get everything back in order as quickly as possible. If you notice any lingering issues with any of your reports, please email me directly so that I can look into them.

Jason Kester
jason@s3stat.com



Jason Kester
Thursday, March 31, 2011




A quick follow-up to the above. As of this morning, everybody's summary reports should be in place. If you're still seeing any issues with your reports, please contact us by email to let us know.


Jason Kester
Monday, April 4, 2011


© 2024 Expat Software