Yesterday CircleCI had probably the worst outage I’ve ever seen from them. (I still thoroughly recommend them; at the very least, you can answer the question “Do we have to do anything to fix this?” with “No, someone else’s problem”.)
From the looks of things I think this happened:
- CloudFlare noticed a DNS issue at Dyn DNS. (CloudFlare post)
- Dyn DNS made a mistake while updating critical records. (Dyn DNS post)
- It looks like this caused a failure in GitHub, in particular in delivering webhooks.
- This in turn backed up event delivery; once it was fixed, a torrent of build requests landed on CircleCI simultaneously.
- CircleCI promptly had a meltdown, recovered and limped on. (CircleCI status post)
- This also seems to have triggered a DB failure, which took a day to fix. (CircleCI status post)
Note that this is all conjecture on my part :)
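To make the backlog step a bit more concrete, here is a toy simulation of what I think happened. It is only a sketch: the `simulate` function and all the numbers are made up for illustration, and it says nothing about how GitHub or CircleCI actually queue events. Events keep arriving at a steady rate while delivery is down, then the whole backlog is delivered in one go when delivery recovers.

```python
from collections import deque


def simulate(events_per_minute, outage_minutes, total_minutes):
    """Return how many events get delivered in each minute.

    Hypothetical model: events are produced at a constant rate, nothing is
    delivered during the outage, and the full backlog is flushed at once
    as soon as delivery works again.
    """
    backlog = deque()
    delivered = []
    for minute in range(total_minutes):
        # Events keep being produced whether or not delivery is working.
        backlog.extend([minute] * events_per_minute)
        if minute < outage_minutes:
            # Delivery is down: everything just queues up.
            delivered.append(0)
        else:
            # Delivery is back: the whole backlog goes out in one burst.
            delivered.append(len(backlog))
            backlog.clear()
    return delivered


deliveries = simulate(events_per_minute=10, outage_minutes=5, total_minutes=8)
print(deliveries)  # [0, 0, 0, 0, 0, 60, 10, 10]
```

With these made-up numbers, the receiver normally sees 10 events a minute, then gets hit with 60 in the first minute after recovery. Scale that up to a long outage and a lot of repositories, and you get the torrent.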
I’m not sure what the moral of the story is here. With all these services you’re trading off functionality, development time and reliability. Sometimes that trade-off can bite you when you inadvertently become reliant on the reliability of a random DNS provider :)
CircleCI have posted a post mortem.