Guest post from Jon Hyman, one of the co-founders of Appboy. From time to time, you’ll see technology-oriented blog posts as Appboy gears up for the public launch of our iOS SDK.
Appboy is going to cancel our Pro CloudFlare account and leave the service. CloudFlare has a great feature set, but their uptime track record has been awful.
I’ve been a big fan of CloudFlare since I first heard of it: I was in the audience at TechCrunch Disrupt NYC 2011 where CloudFlare presented. I was so impressed that I immediately pulled out my laptop and moved all my personal websites to CloudFlare. My first Tweet ever was about how cool CloudFlare is.
I put Appboy on CloudFlare as soon as we brought our first servers online. Since then, my professional experience with CloudFlare has been suboptimal. The first major interruption was in early November. SSL randomly stopped working, which broke server-client communication in our iOS SDK product. When I logged in to troubleshoot, I couldn’t find the SSL settings page. In a frenzy, I thought that my account had been accidentally downgraded from Pro and that SSL options were no longer available. I sent in a support ticket and received a response that it was a known issue and that I should disable the CloudFlare proxy in the meantime. The SSL options had been quietly removed as part of an upgrade; seemingly no one was told. I emailed every few hours asking for status reports but never got a response. It was a serious issue for us. Fortunately, in November we were only in limited testing on our production environment; had we been live, the outage would have caused us a massive amount of damage. After I submitted two tickets asking for someone to contact me, Michelle Zatlyn, a CloudFlare co-founder, gave me a call. I suggested things like proactive notifications about major maintenance, and was happy she listened, but I feel like nothing has changed since.
Over the last few weeks, it has seemed as if CloudFlare were being attacked constantly, taking our site down in the crossfire. I was home for the holidays having dinner when our monitors alerted on 502s and SSL problems. 502s hit again in January due to attacks in Newark. Over the past few months, dozens of 502 errors have tripped up my monitors, woken me up overnight, and broken our site for some of our customers. Numerous support tickets led to no progress. I ended up ignoring 502 errors in our functional monitoring scripts. We get over 100,000 unique visitors a month, so downtime is highly visible for us.
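As an illustration only (the post does not show the actual monitoring scripts), here is a minimal Python sketch of what ignoring 502s in a functional monitor might look like. The names `IGNORED_STATUSES`, `classify`, and `check` are hypothetical, and the choice to treat any 2xx or 3xx response as healthy is an assumption.

```python
import urllib.error
import urllib.request

# Hypothetical: status codes the monitor tolerates instead of alerting on.
IGNORED_STATUSES = frozenset({502})


def classify(status):
    """Map an HTTP status code to 'ok', 'ignored', or 'alert'."""
    if 200 <= status < 400:
        return "ok"
    if status in IGNORED_STATUSES:
        return "ignored"
    return "alert"


def check(url, timeout=10):
    """Fetch url and classify the result; network failures always alert."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; err.code is the status.
        return classify(err.code)
    except OSError:
        # Covers URLError (DNS failure, refused connection) and timeouts.
        return "alert"
```

A workaround like this quiets the pager, but it also hides any 502 that the origin servers themselves would cause, which is part of why it is a stopgap rather than a fix.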
The last two weeks have been exceptionally problematic. One of our customers emailed us that random links on our site were broken. The links made AJAX requests which were not returning. Sure enough, everyone in the office could reproduce it. I sent in a support ticket. The one-line response: “Thanks for writing in. This is a known issue that we’re trying to tackle this week. Sorry for the inconvenience!” That was it. No additional info. Was it just with AJAX? Should I turn off the CloudFlare proxy on other sites? Should I look to @cloudflaresys for updates? The worst part was that CloudFlare didn’t notify me about the known issue! It wasn’t on the status page, and I couldn’t find it on Twitter. I had to find out from one of our customers. Later, the support associate agreed that “[CloudFlare's] notification of what’s working and what’s not is a bit… lacking” and said that he’d notify me when he got an update. I have not received any further updates.
Last night was also really bad. CloudFlare released a new version of its DNS software and accidentally deleted its master database of domain records, which broke name resolution for all of Appboy’s servers. I couldn’t go to the main website, our client-server communication broke, my app servers couldn’t talk to the databases because they couldn’t resolve the hosts, etc. We were completely down due to a bad software release that was, again, completely unannounced.
Whenever there have been issues, the CloudFlare engineers have jumped to resolve them, and resolution time is usually fast. But it is unacceptable that 100% of my site downtime over the last 2 months has been caused by CloudFlare. Even if CloudFlare fixes the problems quickly, they’re breaking too many things too frequently.
Everyone here at Appboy thinks that CloudFlare is a great product. We want to use CloudFlare, but right now can’t take on the risk.
Do you have any suggestions for DNS service providers? Let us know what you use or recommend.