June 13, 2025
This week, much of the internet went offline for several hours due to simultaneous outages from Google and Cloudflare.
The timing was not a coincidence: Cloudflare's retrospective revealed that their core distributed datastore relied entirely on Google Cloud being up, and that its availability was additionally tied to a single region.
When Google went down, Cloudflare's datastore therefore went down globally. This took most of Cloudflare's newer services offline, because they have an internal mandate that new services should be built on their own cloud components rather than on dedicated internal services.
This was an extreme example of how far we've come from the original design intent of the Internet. Instead of many independent networks, we have just a few cloud providers hosting most services.
Availability becomes more challenging in this environment, but hCaptcha is designed to deliver better than 99.99% availability, and has done so in every month since day 1, despite running at enormous scale.
This requires us to use multiply redundant systems for everything in the critical path, and to fully analyze and automate common failure scenarios. Below we'll cover a bit of industry background, and some of the strategies we use to deliver high availability.
If you wonder why we ended up with extreme consolidation in these markets, it is really quite simple.
The economic model of the internet is an unusual hybrid of commensalism and commerce. This privileges earlier and larger players in the hosting and CDN markets in particular, and punishes new entrants with higher operating costs.
Companies like Cloudflare and Google have unmetered, unbilled peering arrangements with other large providers, while new market entrants must pay backbone transit providers for exactly the same service.
Overcoming this disadvantage is enormously expensive, with only a handful of newer companies able to create enough leverage for themselves to peer directly and reduce their costs.
ByteDance is a recent example. After investing (according to public figures) tens of billions of dollars in TikTok, they originated enough video traffic to become an attractive direct peer for other networks.
For public clouds, the concentration dynamic is even stronger. As a service provider, when your customers' servers are all in a few datacenters, your incentive is to colocate your service into those same datacenters in order to provide the lowest latency to them.
This problem is difficult to solve without regulation, and due to the enormous benefits incumbents gain from the existing system it is unlikely to change any time soon.
Over the years, best practices for security and availability have been codified into certifications like ISO 27001, SOC 2, etc. hCaptcha has many certifications, but we go far beyond what is required.
For example, a common requirement is to maintain a vendor list, and to do due diligence on vendor security or availability.
Many companies treat this as a checklist exercise, but having built web-scale systems for decades, we know the things cloud providers do are not magic.
We understand that different products come from different teams, have differing levels of maturity and adoption, and are not always the same in terms of availability or performance.
While hCaptcha does not host any services on Google Cloud, we do use Cloudflare as one of our CDN providers.
However, we did a lot of work to evaluate which parts of Cloudflare we could safely approve for use in production. For each product, we ask questions like:
1. When was it introduced?
2. How popular is it?
3. How often is it mentioned on their or other public availability status pages?
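To make that concrete, here is a minimal sketch of how an evaluation like this can be recorded per feature rather than per vendor. The field names and the example values below are illustrative placeholders, not our actual review data.

```ts
// feature-review.ts: an illustrative per-feature evaluation record.
// All field names and example values are placeholders, not real review data.
interface FeatureReview {
  vendor: string;
  feature: string;
  introducedYear: number;                    // question 1: when was it introduced?
  adoption: "niche" | "moderate" | "broad";  // question 2: how popular is it?
  publicIncidentsLast12Months: number;       // question 3: status page mentions
  approvedForProduction: boolean;
  notes?: string;
}

const exampleReview: FeatureReview = {
  vendor: "ExampleCloud",
  feature: "ExampleKV",
  introducedYear: 2022,
  adoption: "niche",
  publicIncidentsLast12Months: 4,
  approvedForProduction: false,
  notes: "Too new and implicated in too many incidents; revisit next year.",
};
```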
In this week's example, our use of Cloudflare as a CDN includes edge compute via Workers.
However, when we evaluated Workers KV we had concerns about its reliability and performance based on public information, and thus opted not to use it.
Similarly, if required we can completely turn off use of Workers and run identical code ourselves without Cloudflare.
This strategy gives us more options than being forced to immediately de-route Cloudflare entirely during most of their outages, and has let us handle many Cloudflare incidents gracefully and automatically with no user impact.
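To illustrate why that is possible, here is a minimal sketch (three small files shown in one block, none of it our actual code) of keeping Worker logic portable: the handler uses only the standard fetch API, the Workers entry point is a thin wrapper, and the same handler can be mounted behind a self-hosted Node server.

```ts
// handler.ts: plain fetch-API logic with no Workers-specific bindings.
export async function handleRequest(req: Request): Promise<Response> {
  const url = new URL(req.url);
  if (url.pathname === "/healthz") return new Response("ok");
  // ...real request handling goes here...
  return new Response("hello", { headers: { "content-type": "text/plain" } });
}

// worker.ts: the Cloudflare Workers entry point is just a thin wrapper.
import { handleRequest } from "./handler";
export default { fetch: handleRequest };

// server.ts: the same handler behind a self-hosted Node 18+ server, so Workers
// can be switched off without touching application logic. (Request bodies are
// omitted for brevity; a real adapter would also stream them through.)
import { createServer } from "node:http";
import { handleRequest } from "./handler";

createServer(async (nodeReq, nodeRes) => {
  const req = new Request(`http://internal${nodeReq.url}`, {
    method: nodeReq.method,
    headers: nodeReq.headers as Record<string, string>,
  });
  const res = await handleRequest(req);
  nodeRes.writeHead(res.status, Object.fromEntries(res.headers));
  nodeRes.end(Buffer.from(await res.arrayBuffer()));
}).listen(8080);
```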
We frequently observe brownouts, regional outages, or even full outages from cloud providers that are not reported on their status pages, are minimized, or are reported long after the incident starts.
In speaking with engineers and PMs at these public clouds, they have described a somewhat ad hoc and entirely manual process that determines what shows up publicly, and how long it takes to show up.
Combining inside-out and outside-in distributed tests is the safest way to identify these gaps.
Once you can see these issues yourself, you can build your own estimates of how much to trust any particular feature.
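A minimal outside-in probe can be as simple as the sketch below: run copies of it from several networks and regions, and compare what it sees against your internal metrics and the provider's status page. The endpoint list is hypothetical; in practice you would ship the results to your metrics pipeline rather than logging them.

```ts
// probe.ts: a minimal outside-in availability probe (endpoints are illustrative).
const endpoints = [
  "https://api.example.com/healthz",
  "https://assets.example-cdn.com/ping.txt",
];

async function probe(url: string) {
  const started = performance.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
    return { url, status: res.status, ms: Math.round(performance.now() - started) };
  } catch (err) {
    return { url, status: 0, ms: Math.round(performance.now() - started), error: String(err) };
  }
}

// Probe every 30 seconds and emit one JSON line per round.
setInterval(async () => {
  const results = await Promise.all(endpoints.map(probe));
  console.log(JSON.stringify({ ts: new Date().toISOString(), results }));
}, 30_000);
```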
Cloud providers today generally (with the exception of Google) do not fail entirely. Even when many parts of their systems fail, other parts often remain independent because they are older or built on non-overlapping components.
Regions or sets of services do fail regularly, but in many cases it is possible to design lower risk failover paths that simply turn off or switch to an alternative feature of the same cloud provider.
We previously wrote about an example of this in our blog post Surviving Cloudflare Argo Outages With Zero Downtime.
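As a sketch of that pattern (not the exact tooling from that post), disabling a single feature such as Argo Smart Routing can be a one-call operation against the provider's API. Double-check the current Cloudflare API documentation before relying on the exact endpoint shape, and treat the environment variables below as placeholders.

```ts
// argo-toggle.ts: turn one feature off instead of de-routing the whole provider.
// Endpoint shape follows Cloudflare's v4 API at the time of writing; verify
// against current docs. CF_ZONE_ID and CF_API_TOKEN come from your environment.
const ZONE_ID = process.env.CF_ZONE_ID!;
const API_TOKEN = process.env.CF_API_TOKEN!;

export async function setArgoSmartRouting(value: "on" | "off"): Promise<void> {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/argo/smart_routing`,
    {
      method: "PATCH",
      headers: {
        Authorization: `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ value }),
    },
  );
  if (!res.ok) throw new Error(`Failed to set Argo Smart Routing: ${res.status}`);
}

// An automated monitor that detects elevated routing errors could call
// setArgoSmartRouting("off") and fall back to standard routing, while the rest
// of the edge keeps serving traffic.
```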
Ideally, for each path you can first fail over to alternative services within the same cloud, and then fail over to an alternative provider if needed.
At hCaptcha's scale, this requires careful analysis. Public clouds are not as scalable as you might think, and it is surprisingly easy to cause outages for entire cloud regions by moving traffic too quickly.
This means that failing over within the same provider is often a safer option than completely de-routing them. That said, we also keep alternative providers, such as other CDNs, live and ready so that we can instantly de-route a provider entirely if required.
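The sketch below shows the general shape of a gradual, health-checked traffic shift; setWeights() and isHealthy() are hypothetical stand-ins for whatever load balancer, weighted DNS, or edge configuration API you actually use.

```ts
// shift-traffic.ts: an illustrative gradual traffic shift between two targets.
type Target = "primary" | "fallback";

// Stand-in: apply weights via your load balancer, weighted DNS, or edge config.
async function setWeights(weights: Record<Target, number>): Promise<void> {
  console.log("applying weights", weights);
}

// Stand-in: check error rates and latency from your own metrics, not the
// provider's status page.
async function isHealthy(target: Target): Promise<boolean> {
  return true;
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Move traffic in small steps, verifying the receiving side after each step,
// rather than cutting over 100% at once and overwhelming the destination.
export async function gradualFailover(steps = [10, 25, 50, 75, 100]): Promise<void> {
  for (const pct of steps) {
    await setWeights({ primary: 100 - pct, fallback: pct });
    await sleep(60_000); // let caches warm and autoscaling react
    if (!(await isHealthy("fallback"))) {
      await setWeights({ primary: 100, fallback: 0 }); // roll back
      throw new Error(`fallback unhealthy at ${pct}%, shift aborted`);
    }
  }
}
```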
Multi-cloud infrastructure has never been simpler to maintain thanks to the many abstractions now popular, whether Kubernetes or Pulumi.
However, if a single cloud serves all of your primary traffic, it is difficult to guarantee that your fallback will really work at full load.
By contrast, with active-active multi-cloud topologies, you can reliably and transparently test both gradual and sudden load shifts, building confidence that failover can be both rapid and automated.
Google's recent outage took down some companies that don't even run on Google Cloud.
How? Those companies relied on services like Google Container Registry to fetch system images, and did not put caching in place.
This turned a moderate-risk build-time dependency into a high-risk deploy-time and run-time dependency: services that restarted went offline, and new nodes could not come up, because container images could not be fetched.
This is easily avoidable by ensuring you separately understand which services are in your critical path for each of:
1. build time
2. deploy time
3. run time
There are probably more than you think. Any external dependencies should ideally be cached within your infrastructure, providing more control over how and when they are accessed and updated.
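One way to enforce that caching for container images is to resolve every image reference through an in-house pull-through cache first. The sketch below is illustrative only: the registry hostnames are placeholders, and the mirror's path layout will depend on how your cache is configured.

```ts
// image-source.ts: prefer an in-house registry mirror so deploys and node
// restarts do not depend on an external registry being up. Hostnames and the
// mirror's path layout are placeholders for your own setup.
const MIRROR = "registry.internal.example";
const MIRRORED_UPSTREAMS = ["gcr.io", "ghcr.io", "docker.io"];

// Check the mirror via the Docker Registry HTTP API v2 manifest endpoint.
async function mirrorHasImage(repo: string, tag: string): Promise<boolean> {
  const res = await fetch(`https://${MIRROR}/v2/${repo}/manifests/${tag}`, {
    method: "HEAD",
    signal: AbortSignal.timeout(3000),
  });
  return res.ok;
}

// Given "gcr.io/my-project/api:1.2.3", prefer the mirrored copy, and only fall
// back to the upstream registry (the run-time dependency that bit people in
// this outage) if the mirror does not have it.
export async function resolveImage(ref: string): Promise<string> {
  const [host, ...rest] = ref.split("/");
  const [repo, tag = "latest"] = rest.join("/").split(":");
  if (MIRRORED_UPSTREAMS.includes(host) && (await mirrorHasImage(`${host}/${repo}`, tag))) {
    return `${MIRROR}/${host}/${repo}:${tag}`;
  }
  return ref;
}
```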
As always, we appreciate your support, and hope you enjoyed this post.
PS: If you want to work on some of the most interesting problems in online security, distributed systems, and privacy-preserving machine learning, we're hiring.