I was going to spend some time in the evening (here in Sydney, Australia) doing some blogging when, to my surprise, I got an e-mail from my uptime monitor saying my site had been down for over 20 minutes. The downtime apparently started at 8:44pm.
When I migrated from my last host, I thought I was doing the right thing by going with one of the biggest hosting providers in the world, namely GoDaddy, and so far they haven’t been anywhere near as annoying as my former hosting company, which went down for 5 minutes every few hours. GoDaddy hasn’t been spotless: several times my databases went missing from cPanel, and my caching plugin would start screwing me over (probably because it had, at one stage, been interrupted while writing metadata due to resource exhaustion), but I had a way of dealing with that and keeping things alive. It seems even the big players are not immune to downtime.
At first, I didn’t believe my monitor, as I thought it might just have a path issue to the server. Not so: I tried accessing my site myself, and no dice. Cloudflare couldn’t reach the origin. I logged into cPanel, where everything looked fine and it claimed my account had no issues. I checked GoDaddy’s status page, and that bore some bad news.
Not too bad; at least they knew they had a problem, so I didn’t have to call them. I waited patiently … but an hour later, nothing had changed and no updates were forthcoming. Getting into cPanel was now throwing 500 errors as well, which wasn’t a great sign.
Even worse, the issue seemed to be a routing or connectivity problem. The cPanel servers were alive enough that I could eventually get a database backup and a full site backup running, but they were not reachable from the outside: they could not be pinged, and a traceroute showed the packets “getting lost” within GoDaddy Group’s internal network.
For that reason, I suspect the cause of this latest downtime was routing related, but we may never know unless they decide to tell us.
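The symptom above (servers alive internally but unreachable from outside) is easy to reproduce with a quick TCP probe. This is a minimal sketch of the kind of check my uptime monitor would be doing, not any GoDaddy tooling; `origin.example.com` is a stand-in for the real origin host.

```python
import socket


def tcp_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result for a hostname that still resolves points at a
    connectivity or routing problem rather than DNS; a traceroute then
    shows roughly where the packets go missing.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage: probe the origin directly, bypassing Cloudflare.
# tcp_reachable("origin.example.com")
```

During the outage this would have returned False for the origin even while cPanel’s own backup jobs were running happily inside GoDaddy’s network.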
There was a storm on Twitter, with many disgruntled users tweeting @GoDaddy and @GoDaddyHelp, myself included, but all in vain, as there wasn’t much communication at all. Only two tweets were issued by @GoDaddyHelp, mirroring what was on their status page, with no ETA and no cause identified. Many users complained of long phone queue waits, a general lack of helpfulness, and of being dumped to voicemail or being unable to start a live chat online.
Others, however, were quite creative – my favourites included:
— Randy Hilarski (@RandyHilarski) July 29, 2015
— Saulreal (@Saulreal) July 29, 2015
Of course, the guys at downdetector.com registered a spike in complaints … so we knew we were not alone.
I waited a little longer, patiently refreshing the page, until three hours had passed and I was getting a little worried. At one stage the status page was claiming no significant issues, yet my site was still down. What’s the deal?
After a few minutes, they reverted to the truth: they were still down. So much for their communication. They really could have done a lot more to keep people informed about what was happening, when a fix might arrive, and why it was happening in the first place. Even a regular change in wording might placate users who figure “it’s about 5am their time, so they’re probably just having a snooze.”
Anyway, it seems the service is back up again, after what is possibly the longest downtime in the history of my site. Pingdom Tools pegs today’s downtime at three hours and 32 minutes.
That being said, many users may have signed up with GoDaddy under the impression that they would get 99.9% uptime – but a close reading of their terms and conditions for hosting gives the following:
The key take-aways are that the uptime guarantee is determined solely by them and that, at best, you get a 5% credit on your hosting fee for that month. If you’re paying $8/month, you’ll get a credit of 40 lousy cents for the missed guarantee. Hardly worth calling them to request, given that the phone call itself costs about 25 cents.
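To put the numbers in perspective, here is the arithmetic; the $8/month figure is my own example, not a quoted GoDaddy price.

```python
# 5% credit on an example $8/month hosting fee
monthly_fee = 8.00
credit = monthly_fee * 0.05          # 40 cents

# A 99.9% uptime guarantee permits 0.1% downtime:
# in a 30-day month that is only about 43 minutes.
minutes_in_month = 30 * 24 * 60      # 43,200 minutes
allowed_downtime = minutes_in_month * 0.001

# Today's outage, per Pingdom Tools: 3 hours 32 minutes.
outage_minutes = 3 * 60 + 32         # 212 minutes

print(f"Credit: ${credit:.2f}")
print(f"Allowed downtime: {allowed_downtime:.1f} min; actual: {outage_minutes} min")
```

So this single outage burned through roughly five months’ worth of the 99.9% downtime allowance, and the compensation on offer is 40 cents.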
My suspicion of a routing (or connectivity) issue seemed confirmed when I looked at the post-downtime traceroute – notice how it now terminates correctly, with only two hops between the point where packets previously got lost and the actual shared hosting server.
On the whole, while it would be nice to have a back-up host, at this stage it’s not worth the effort and time to maintain for a hobbyist running a personal site like me. But I do feel for others running businesses and professional sites, where downtime like this is unacceptable. If one of the world’s leading hosts can have this happen to them, any other host in a similar price bracket will probably suffer the same from time to time. Let’s try to not make this too frequent, mmkay?
For the less fortunate, it seems that even as of 1:42am Sydney time, some issues still remained.
UPDATE: It’s now the following morning, and the site now says the following:
Interesting that a network hardware failure caused this – I wonder what sort of redundancy failure occurred. Perhaps the failure corrupted routing tables, or broadcast bad routing information to other routers and toppled them. They claimed “some” customers were “intermittently” unavailable – judging from Twitter, that was quite a few customers, continuously unavailable right until the end of the outage window. No point saving face now, but hey, at least they told us why.