Update: I have confirmed that Twitter is using MySQL (unless the mentioned upgrades are a move to a different DB).
With all due respect, Michael Jackson may have given Twitter the final knockout punch. I’m sure you guys are sick of hearing about Twitter’s problems, and frankly I’m sick of writing about them. This new failure is worse than any failwhale; it is a failure of concurrency. Here is a brief timeline of what has happened so far:
- June 3 – Twitter realizes there is a half-hour lag on follow/unfollow that should be resolved WITHIN THE NEXT HOUR OR SO.
- June 18 – Twitter states it is making infrastructure upgrades to fix the follow/unfollow delay OVER THE NEXT 24 HOURS.
- June 23 – Twitter again states there is a lag on follow/unfollow and offers additional info on areas affected – device notification changes and favoriting, BUT WE ALREADY KNEW THIS.
- June 23 – Twitter announces additional upgrades for the 24th to fix the problem and says the problem will persist UNTIL LATER IN THE DAY TOMORROW.
- June 24 – Twitter says the upgrades were successful and that the catch up period will last FOR AT LEAST ANOTHER 6-12 HOURS.
- June 25, 11:26 am – A Twitter employee states “We’re still working on the fix and this is currently the top priority of the services team. It’s a pretty extensive code deployment so it is taking some time.”
- June 26 – I’m following fewer people today than I was yesterday and my 1,000 FOLLOWING/DAY LIMIT HAS BEEN HIT.
I can’t speak to Twitter’s infrastructure, but I’ve seen this problem before in one of my own companies. With MySQL replication across multiple servers and tons of activity going on, it is almost impossible for the slaves to catch up. In addition, each replicating server generates enormous log files to fall back on in the event replication fails. These log files can quickly fill up a server, especially if you don’t know what you’re doing and have MySQL in a tiny /var partition. Once the log files overrun the disk, replication stops until the situation is rectified. I suspect this is what the problem is, and with each addition of servers (in the above-mentioned upgrades), those log files get nastier and nastier. There is a fix for this replication problem, but it involves taking all systems offline, rsyncing from a master (if there is one), and clearing all logs.
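To put a rough number on the "tiny /var partition" scenario, here is a minimal sketch in Python of how you might estimate how long a partition can absorb growing replication logs. It is only an illustration of the arithmetic, not anything Twitter runs: the function names, the 2 GB/hour growth figure, and the 24-hour warning threshold are all assumptions I made up for the example, and it assumes the logs grow at a constant rate.

```python
import shutil


def hours_until_full(free_bytes: int, growth_bytes_per_hour: float) -> float:
    """Estimate hours until the disk fills, assuming constant log growth."""
    if growth_bytes_per_hour <= 0:
        # Logs are not growing, so the partition never fills on their account.
        return float("inf")
    return free_bytes / growth_bytes_per_hour


def check_log_partition(path: str, growth_bytes_per_hour: float,
                        warn_hours: float = 24.0) -> str:
    """Report whether the partition holding `path` is at risk of filling up.

    `warn_hours` is a hypothetical threshold chosen for illustration.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    remaining = hours_until_full(usage.free, growth_bytes_per_hour)
    if remaining < warn_hours:
        return f"WARN: {path} fills in about {remaining:.1f} hours"
    return f"OK: {path} has about {remaining:.1f} hours of headroom"


if __name__ == "__main__":
    # Hypothetical figure: logs growing by 2 GB per hour on the root partition.
    # On a real replica you would point this at wherever MySQL keeps its logs.
    print(check_log_partition("/", 2 * 1024**3))
```

Run something like this from cron on every replica and you at least get a warning before the disk fills and replication stops, instead of finding out afterwards.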
Now MJ steps into the picture and blows the infrastructure away. The search sidebar was removed and later re-added, but this just keeps the failwhale at bay and does nothing but compound the follow/unfollow delays. Now we’re at critical mass with this problem, as follow/unfollow basically does not work, or works inconsistently at best. This is going to turn people off in droves, as the system is not working as expected. The “I just don’t get it” of Twitter has just been amplified. The image on the above-right shows a single person who REALLY wants to follow me: each mail highlighted in blue says “This person has just followed you.” The sad thing is, after all this effort, they still aren’t following me.
With the failwhale, people got upset but figured, “There is so much cool stuff going on here that I can hang tight until the system is back.” It is sort of like the logic behind the beta invite. This is entirely different: Twitter isn’t acting as expected.
Also, I’m only talking about one aspect of Twitter in this post. There are the search problems, which still aren’t fixed despite the update provided in that link. There was the “all posts coming from the web” problem, which occurred over a weekend when, apparently, they take a holiday. This may not sound like a big deal, but it was for a lot of developers, and one business even had to SHUT DOWN until the problem was fixed. There are many, many other issues that I’m just not going to bring up.
I’m not giving up on Twitter, but you can find me on FriendFeed.
To Twitter’s credit, they have been fairly open on their status blog, and their employees are pretty active on the mailing lists.