About three weeks ago there was a week-long span when I experienced more service outages than usual, from multiple providers. I’ll try to list all of the events:
- Facebook, Instagram and WhatsApp had an outage on July 3rd (covered by Mashable).
- Slack had a seemingly bigger outage, although I experienced less of it than some people reported (Slack outage).
- Cloudflare’s Web Application Firewall (WAF) had an issue that resulted in 502 errors. As it turned out, a seemingly innocent regular expression was resource hungry (Cloudflare WAF outage blog). Regular expressions are something to watch out for: a bad one took down Stackoverflow as well, back in 2016 (Stackoverflow regex down 2016).
- Cloudflare’s situation is especially hard because they mitigate DDoS (Distributed Denial of Service) attacks on a daily basis; protecting against those attacks is their core service, so much of their code needs to scale really well. Given that, I’m not surprised by this Cloudflare outage map.
- During and shortly after the Cloudflare issues I also received reports that certain sub-services of Azure had specific problems. When we look at status maps like this, we need to keep in mind that they are compound maps. The one I cite is this entry:
July 03, 2019 07:04 UTC, WARN, about 1 hour, Information Diagnostic logs, Autoscale, Classic Alerts (v2).
- Linode is very good at communicating scheduled maintenance, and most of the time it doesn’t result in any outages. However, the biggest box I have (16 CPU, 768MB+ SSD) experiences problems sometimes. We had a surprise restart where the production site was down in the middle of the day for 1.5-2 hours: the VM host had physical problems, and the migration to another host took that long. We also had a surprise restart two weeks before that one, but that was overnight.
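The regex problem behind the Cloudflare WAF outage was catastrophic backtracking. Here is a minimal sketch of the failure mode in Python; the nested-quantifier pattern below is purely illustrative (it is not Cloudflare’s actual expression), but it shows how a failed match can explode exponentially:

```python
import re
import time

# Nested quantifiers like (a+)+ are the classic catastrophic-backtracking
# shape (illustrative only; not Cloudflare's actual pattern).
evil = re.compile(r'(a+)+$')
safe = re.compile(r'a+$')  # matches the same strings, no nested quantifier

subject = 'a' * 22 + '!'   # the trailing '!' guarantees a failed match

start = time.time()
result = evil.match(subject)   # engine tries every split of the a's
slow = time.time() - start

start = time.time()
safe.match(subject)            # fails after a single linear scan
fast = time.time() - start

print(result)       # None, but only after exploring ~2^21 backtrack paths
print(slow > fast)  # the nested pattern is dramatically slower
```

Adding a few more `a` characters roughly doubles the runtime each time, which is why such a pattern can pin CPUs fleet-wide the moment it ships.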
The bottom line is: service outages happen. The best case is an architecture redundant enough to survive them. Sometimes that is not possible, though.
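One simple form of that redundancy is client-side failover. A minimal sketch, assuming a list of interchangeable backends (the function and backend names here are hypothetical):

```python
# A minimal client-side failover sketch: try each interchangeable
# backend in order until one call succeeds.
def call_with_failover(backends, request):
    """backends: list of callables that may raise during an outage."""
    last_error = None
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:   # in real code, catch specific errors
            last_error = exc       # remember the failure, try the next
    raise last_error               # every backend was down

# Usage: the first backend simulates an outage, the second succeeds.
def down(req):
    raise ConnectionError("502 Bad Gateway")

def up(req):
    return f"ok: {req}"

print(call_with_failover([down, up], "ping"))  # → ok: ping
```

This only helps when the failing backends are actually independent; as the incidents above show, a shared dependency like Cloudflare can take out several “redundant” paths at once.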