January 24, 2019
Update: 24 hours later Argo had another outage. This time we were prepared.
Our job is to ensure that potential disruptions like this do not become visible to our users. They depend on us staying up so that vital services like login and checkout keep working on their sites.
For this reason, we engineered our operations for high reliability from day one. This proved to be a wise choice: we maintained better than 99.99% uptime in 2018 despite rapid growth in the popularity of the hCaptcha service.
However, early this morning we suffered our first general service interruption of more than a few seconds. This was due to an outage at Cloudflare, one of our upstream infrastructure vendors.
Background: 13% of all internet traffic goes through Cloudflare. We use them for DDoS protection, load balancing, and smart routing via their Argo Tunnel service. [1] This allows us to reduce latency and to transparently survive major outages, such as the complete failure of one of our redundant Kubernetes clusters.
Unfortunately, even with a 100% uptime service-level agreement, Cloudflare can still go down. Around 6 a.m. Pacific time on January 24, 2019, their Argo Tunnels suffered a complete outage lasting several hours. [2]
Our on-call staff were paged within 60 seconds, and immediately worked to re-route inbound traffic. This limited our downtime to 12.5 minutes.
However, this is not good enough for a service like hCaptcha: our systems should transparently handle complete outages even from providers with strong SLAs and a good historical reliability record.
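To put the 12.5 minutes in context, here is a rough back-of-the-envelope calculation. It assumes the 99.99% figure is treated as a yearly target over a 365-day year, which is our reading for illustration rather than a formal commitment stated in this post; the function name is hypothetical.

```python
# Rough downtime-budget arithmetic (illustrative assumptions only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_budget_minutes(uptime_target: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in the period for a given uptime target."""
    return (1.0 - uptime_target) * period_minutes


budget = downtime_budget_minutes(0.9999)  # assumed yearly target of 99.99%
outage_minutes = 12.5                     # downtime reported for this incident
share = outage_minutes / budget

print(f"Yearly budget at 99.99%: {budget:.2f} minutes")
print(f"This outage used about {share:.1%} of that budget")
```

Under these assumptions, a single incident of this size consumes roughly a quarter of a year's allowance at 99.99%.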
Other parts of our infrastructure have multiple redundancies and safeguards to prevent downtime, but we had assessed the likelihood of a sustained Cloudflare outage as low enough that we put no automated remediation in place for it.
Resolution: To prevent a Cloudflare issue from affecting our users again, we are implementing a fully redundant traffic routing and DDoS prevention system. External health checks will automate rapid failover to another vendor if Cloudflare suffers intermittent or continuous downtime.
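The decision logic behind health-check-driven failover can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the class name, vendor labels, and thresholds are all hypothetical, and the sketch only decides which vendor should be active. Actually moving traffic (for example, by updating DNS) is out of scope here.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


@dataclass
class FailoverController:
    """Tracks health-check results for the primary vendor and picks the active one."""
    primary: str
    secondary: str
    failure_threshold: int = 3   # assumed: consecutive failures before failing over
    recovery_threshold: int = 5  # assumed: consecutive successes before failing back

    def __post_init__(self) -> None:
        self.active = self.primary
        self._failures = 0
        self._successes = 0

    def record(self, primary_healthy: bool) -> str:
        """Feed in one health-check result; return the vendor that should be active."""
        if primary_healthy:
            self._successes += 1
            self._failures = 0
        else:
            self._failures += 1
            self._successes = 0

        if self.active == self.primary and self._failures >= self.failure_threshold:
            self.active = self.secondary
        elif self.active == self.secondary and self._successes >= self.recovery_threshold:
            self.active = self.primary
        return self.active


def run_checks(controller: FailoverController,
               probe: Callable[[], bool],
               rounds: int) -> List[str]:
    """Run the probe a fixed number of times and collect the active vendor each round."""
    return [controller.record(probe()) for _ in range(rounds)]
```

In practice the probe would be something like `lambda: http_probe("https://example.com/health")`, run on a timer from outside both vendors' networks. Note that counting consecutive failures only catches continuous downtime; handling intermittent failures would likely need a failure rate over a sliding window instead.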
We plan to open source our implementation of this failover functionality, which is being built on top of our current multi-cluster Kubernetes operations. If you’re interested in getting early access or collaborating, send us an email at [email protected] and we’ll let you know before the repo goes public.
And to our users: we know you rely upon us, and will continue to improve operations to make sure you get the reliable service you expect. Thank you for your trust and support.
— Eli, Alex, and the hCaptcha infrastructure team