The last couple of weeks have been thrilling, but not always in a good way. For those who are interested, here is the story.
On Thursday, February 14th, we released some sim and web updates. The maintenance was rather routine and took about an hour. Our changes had been in testing for a while and nothing was rushed. We were looking forward to what was, for us, a three-day weekend.
Shortly after deployment, we noticed that some server loads were higher than normal on our web and database nodes, and some race servers were timing out while trying to store their results in the database. While we are well positioned to deal with hardware issues, with redundancy at the disk, server, and network levels, issues in system software or in our own application can still cause failures or reduced throughput.
When everything is in an alarm state, it usually means there is an issue with the database. Sure enough, some queries were taking much longer than normal. Additionally, the queries reporting trouble hadn't been changed as part of the update. We performed the sanity check of verifying, yet again, that there was not an issue with the updates. We also scrutinized member usage -- were we getting requests from members that fell outside the norm, but not far enough outside to trigger an alarm? Everything seemed to check out.
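For readers curious how you tell which statements have slowed down: Oracle keeps cumulative timings for every cached statement in the v$sql view, so ranking by average elapsed time per execution is one way to spot the offenders. Below is a minimal sketch of that kind of check; the connection string, credentials, and row limit are placeholders, querying v$sql requires a user with access to the dynamic performance views, and this is not necessarily the exact tooling we used.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SlowQueryReport {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "monitor", "secret");
             Statement stmt = conn.createStatement();
             // v$sql tracks cumulative elapsed time (microseconds) and execution
             // counts per cached statement; rank by average time per execution.
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM ("
                   + " SELECT sql_id, executions,"
                   + "        ROUND(elapsed_time / GREATEST(executions, 1) / 1000) AS avg_ms,"
                   + "        SUBSTR(sql_text, 1, 80) AS snippet"
                   + " FROM v$sql"
                   + " ORDER BY elapsed_time / GREATEST(executions, 1) DESC"
                   + ") WHERE ROWNUM <= 20")) {
            while (rs.next()) {
                System.out.printf("%s runs=%d avg=%dms %s%n",
                        rs.getString("SQL_ID"), rs.getLong("EXECUTIONS"),
                        rs.getLong("AVG_MS"), rs.getString("SNIPPET"));
            }
        }
    }
}
```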
After some further investigation, we determined that at least one unchanged query had degraded performance characteristics. Given what had changed (in this case, NOT changed), the only explanation was that the database was choosing a different execution plan. When Oracle parses a query, it arrives at a query plan, or a plan for how it is going to get the data, based upon the data and system performance statistics that it has collected. Sometimes it comes up with a poor plan. We refresh our test database from production on a regular basis, so the data is really close. But even with the same code and data, Oracle will sometimes make different choices based upon the data values it sees in the first query that it parses. For some queries a table scan is the best approach; other times it should use an index. Oracle caches the plan and reuses it, and if the cached plan is wrong for most values of the data, then there is trouble. We took what we were seeing, made some changes to help Oracle perform better, and things seemed better.
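For those who haven't dug into this before, the plan Oracle settles on is something you can look at directly: EXPLAIN PLAN plus DBMS_XPLAN prints the plan Oracle would choose for a statement right now, which is one way to confirm that a full table scan has crept in where an index used to be used. Here is a minimal sketch, with a made-up table and predicate standing in for one of our real queries and placeholder connection details:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; race_results / session_id are made up.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "app", "secret");
             Statement stmt = conn.createStatement()) {

            // Ask Oracle to write the plan it would use for this statement
            // into PLAN_TABLE without actually running the query.
            stmt.execute("EXPLAIN PLAN FOR "
                    + "SELECT * FROM race_results WHERE session_id = 12345");

            // DBMS_XPLAN.DISPLAY formats what was just written to PLAN_TABLE.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```

EXPLAIN PLAN shows the plan Oracle would build from scratch; to see the plan actually cached for a statement that is already misbehaving, DBMS_XPLAN.DISPLAY_CURSOR with that statement's SQL_ID is the usual tool.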
Until they weren't. While the database load was better, and as a result the race servers were once again happy, the web tier was only sometimes happy. Loads would be normal and stable, and then, seemingly out of nowhere, they would increase, in some cases by a factor of nearly 50. Our website application runs in Java virtual machines (JVMs). When we've seen load spikes before, we've correlated them with garbage collection cycles, and we've made changes to avoid them. These issues weren't GC related. The website itself is broken into many apps running on different servers, and the issues were tied to one type of web app. It was hot for CPU. It couldn't get enough, and it was user CPU: it was running application or JVM code (versus kernel code). What to do?
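To make "hot for user CPU" a bit more concrete: the JVM can report, per thread, how much CPU time was spent in user mode versus handed off to the kernel, through the standard ThreadMXBean API. The sketch below shows that kind of check run from inside (or over JMX against) the affected JVM; we're not claiming this is the exact tooling we used.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU timing not supported on this JVM");
            return;
        }
        threads.setThreadCpuTimeEnabled(true);

        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long total = threads.getThreadCpuTime(id);  // user + system, nanoseconds
            long user = threads.getThreadUserTime(id);  // user mode only, nanoseconds
            if (info == null || total <= 0) {
                continue;                               // thread exited or no data
            }
            System.out.printf("%-40s user=%dms system=%dms%n",
                    info.getThreadName(),
                    user / 1_000_000,
                    (total - user) / 1_000_000);
        }
    }
}
```

Once the hot threads are identified, a few thread dumps taken while the CPU is pegged will usually show what they are busy doing.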
We went back to the logs. Did our monitoring miss something? Were there new errors? Were members doing something differently? Were there issues with the code that we deployed? Everything seemed to check out. We took time to tighten the ship. We increased our logging levels. We found a few things, but they seemed minor. We fixed them anyway. We backed out seemingly unrelated changes. The problems remained.
We were running out of options. We reviewed the list of what had changed. We entertained some of the dumbest questions, because sometimes they aren't so dumb. We were still having issues, so we had to keep working the problem. As part of the deployment we had applied some bug fixes to the WebSphere servers. We had initially ruled these out: the code had been running in our test environments, it had been deployed on all systems, not just the ones where we were seeing issues, and the issue was affecting only that one web app type while other apps running on the same servers weren't impacted. Well, the astute reader knows the conclusion. Once we backed out the WebSphere update, we no longer had the unexpected sustained load spikes.
This past weekend we successfully hosted the Daytona 500 World Tour events: two events which, if not the largest, were pretty close to the largest sim racing events in history. While successful, the weekend tested the limits of a few servers, and those servers are going to be upgraded. There is vertical scaling, where an individual server gets more capacity, and horizontal scaling, where we add more servers. We are doing both. If only it were always as easy as adding hardware.