Over the last three months, our servers have occasionally spiked without warning, consuming all of the CPU on a given machine. This happened anywhere from once every two weeks to twice a week. We scoured the code, the logs, and online resources trying to find the root cause, but the spikes were so rare that we could never track it down.
With this new season's release, we changed how the site determines who is online and what they are doing. We made this change so that, in the future, this information could be shared with servers in different cells at different service providers. The change worked as designed in our development, alpha, beta, and staging environments. It also worked on the production site, to a degree.
As we continued to look through the logs yesterday, trying to find the issue, we noticed a reference to the online status on one of the servers. We thought it was worth a shot, so we reverted the code to how it was before the rollout. When we reopened the site, things looked good and it seemed to be running fine. After an hour, the site spiked and crashed across the entire cluster of servers. We checked the logs again and saw, on one server and only one, another reference to the online status. We then turned off the online status tracking completely. Since then, the load has been absolutely fantastic across all of the servers. We have not had any spikes, and we believe this to be the root of the problem.
...
We are leaving the online status tracking off while we investigate a new implementation. The new implementation will allow us to easily turn tracking off and on, so that we can test it and get the site running again quickly if there are any issues. We are considering writing a few different implementations that we can switch between “on-the-fly” so that we can find the best solution with the least downtime.
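For the technically curious, here is a rough sketch of what "switching between implementations on the fly" could look like. It is only an illustration written in Python, not our actual code: the names `TrackerSwitch`, `DisabledTracker`, and `InMemoryTracker` are made up for this example, and the in-memory tracker simply stands in for whichever real implementations we end up testing.

```python
import threading
import time


class DisabledTracker:
    """The 'off' position: records nothing and reports nobody online."""

    def touch(self, member_id, activity):
        return None

    def online(self, window_seconds):
        return {}


class InMemoryTracker:
    """Hypothetical stand-in for a real implementation (single process only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._seen = {}  # member_id -> (last_seen_timestamp, activity)

    def touch(self, member_id, activity):
        with self._lock:
            self._seen[member_id] = (self._clock(), activity)

    def online(self, window_seconds):
        # A member counts as online if seen within the last window_seconds.
        cutoff = self._clock() - window_seconds
        with self._lock:
            return {m: a for m, (t, a) in self._seen.items() if t >= cutoff}


class TrackerSwitch:
    """Routes calls to the active implementation; can be changed at runtime."""

    def __init__(self, implementations, active):
        self._implementations = dict(implementations)
        self._lock = threading.Lock()
        self.use(active)

    def use(self, name):
        if name not in self._implementations:
            raise KeyError(f"unknown tracker implementation: {name}")
        with self._lock:
            self._active = self._implementations[name]

    def _current(self):
        with self._lock:
            return self._active

    def touch(self, member_id, activity):
        self._current().touch(member_id, activity)

    def online(self, window_seconds):
        return self._current().online(window_seconds)


if __name__ == "__main__":
    switch = TrackerSwitch(
        {"off": DisabledTracker(), "memory": InMemoryTracker()}, active="off"
    )
    switch.touch("alice", "browsing")  # ignored while tracking is off
    switch.use("memory")               # turn tracking on without a restart
    switch.touch("alice", "browsing")
    print(switch.online(300))
    switch.use("off")                  # turn it back off if load spikes
```

The point of the design is that the rest of the site only ever talks to the switch, so turning tracking off becomes a configuration change rather than a code rollback.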
The good news, from our perspective, is that this unfortunate incident has helped us find an elusive root cause. We are sorry for the downtime. We will continue to do our best to provide a quality service and experience for our members. Again, our apologies.