Tuesday, 5 March, 2019 UTC


Summary

This is Teacup, our adopted wombat and latest on-call engineer.
Just like everyone, Teacup has to take responsibility for what we ship to production. It won’t be smooth sailing on-boarding her — the late night alerts, the fact she doesn’t even own a computer, or just how difficult it is to acknowledge PagerDuty alerts with paws.
Okay, okay, maybe I’ve stretched the truth a teensy bit.
Not everyone is on-call — just most software engineers and some managers.
We also don’t force an adopted wombat to answer alerts, although Teacup is real, utterly adorable, and would rock on-call if she wanted to.
Phew!
So who is on-call and what does that look like at npm?
Je Suis Wombats
When I say ‘most engineers’ are on-call, I mean:
  • Site Reliability Engineers (SREs, ops/infrastructure)
  • Platform Engineers (APIs and backend services)
  • Web Engineers (frontend + backend)
  • Engineering Managers
In the past we have also been joined by members of the security team. I’ll cover how other teams are involved in the on-call process later.
Hi, I’m Wes
I’m an SRE wombat, which means I spend my workdays improving and maintaining the reliability of the npm registry.
That might sound like I’m permanently on-call, but really it’s more about preventing incidents and supporting infrastructure changes, such as changes required for product updates, new features, security updates, and whatnot.
Wombat’s Day On-Call
If you reside in a time zone between the Pacific and Atlantic oceans, then an AM shift is just that: a morning.
Yes, there is also a slightly more disruptive PM shift, although for the aforementioned time zones it only runs to around 9-11pm, and a contracted third party takes over low-priority and easy-to-triage pages overnight.
However, some wombats, like myself, are in other time zones, like the UK, where the ‘AM shift’ actually runs 2pm to 10pm.
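To make the time zone arithmetic concrete, here is a minimal sketch that converts the UK hours above into US Pacific time. Pacific is picked purely as an example zone, and the fixed offsets (GMT at UTC+0, PST at UTC-8) are my assumption for early March 2019, before either region switched to daylight saving time. The helper name `shift_in_zone` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed winter offsets; real schedules should use proper tz data.
UK = timezone(timedelta(hours=0), "GMT")
PACIFIC = timezone(timedelta(hours=-8), "PST")


def shift_in_zone(start, end, target):
    """Convert a (start, end) pair of timezone-aware datetimes to another zone."""
    return start.astimezone(target), end.astimezone(target)


# The 'AM shift' as experienced in the UK: 2pm to 10pm.
uk_start = datetime(2019, 3, 5, 14, 0, tzinfo=UK)
uk_end = datetime(2019, 3, 5, 22, 0, tzinfo=UK)

pt_start, pt_end = shift_in_zone(uk_start, uk_end, PACIFIC)
print(pt_start.strftime("%H:%M %Z"), "to", pt_end.strftime("%H:%M %Z"))
# prints "06:00 PST to 14:00 PST"
```

The same eight-hour block that ends at 10pm in the UK is over by early afternoon on the US west coast.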
If you’re on primary that means you get notified by automated alerts first if they trigger.
On receiving an alert, you acknowledge it (so it doesn’t fall through to the secondary wombat on-call after a while), and get to a computer if you’re not at one already.
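The acknowledge-or-fall-through rule can be sketched roughly like this. It is a toy model, not PagerDuty’s API or our actual configuration: the `Alert` class, the `who_is_paged` function, and the 15-minute timeout are all hypothetical, since ‘after a while’ is as precise as this post gets.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value standing in for "after a while".
ESCALATION_TIMEOUT_MIN = 15


@dataclass
class Alert:
    name: str
    triggered_at_min: int                       # minutes since a reference point
    acknowledged_at_min: Optional[int] = None   # None until someone acknowledges


def who_is_paged(alert: Alert, now_min: int) -> str:
    """Return which on-call wombat the alert currently sits with."""
    if alert.acknowledged_at_min is not None:
        # Acknowledged alerts stay with the primary and never fall through.
        return "primary"
    waited = now_min - alert.triggered_at_min
    # Unacknowledged alerts fall through to the secondary after the timeout.
    return "secondary" if waited >= ESCALATION_TIMEOUT_MIN else "primary"
```

For example, `who_is_paged(Alert("registry-5xx", 0), 20)` returns `"secondary"`, while the same alert acknowledged at minute 3 stays with `"primary"`.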
Next:
  • Check Slack for chatter about the issue or potentially active work that could have caused the alert to trigger.
  • State in Slack that you’re looking into it.
  • Check the PagerDuty web UI for extra details.
  • We may also check our Nagios web UI or Amazon CloudWatch for extra details.
  • This is the point at which we likely start to investigate the issue more closely: checking metrics on Grafana, looking at logs, ssh'ing into VMs…
  • We keep a small set of playbooks and FAQs with common issues and how to resolve them which are always worth a quick search.
  • Throughout the investigation, updates are posted in Slack, which helps with later incident reviews and provides other wombats with context so they can help if needed.
  • If the incident has a high impact on users and no immediate fix, we have an incident response process to follow:
  1. A specific Slack channel is used.
  2. The primary on-call wombat takes point (aka incident manager).
  3. Everyone involved joins a Zoom video call to communicate with as much ‘bandwidth’ as possible.
  4. The marketing team follows their own IR process designed to ensure updates to users are easy to understand, compassionate about user frustrations, and timely.
  5. The secondary on-call wombat takes point publishing any required status updates to our status page.
  • Incidents aren’t marked as fixed until all user-visible effects are fixed/stable. Internally, we create tickets in LeanKit for expedited maintenance work to ensure the same incidents don’t recur.
  • We revisit all major incidents in a formal, blame-free retrospective process as soon afterwards as possible, while some minor incidents are discussed in our regular infrastructure chats.
None of this is set in stone, and one of the reasons I have found npm’s on-call culture to be one of the nicest I’ve been a part of is precisely because everyone is prepared to adapt and go with the flow in an emergency (and out).
Wombats have a high level of empathy, compassion, and respect for each other, so when someone needs to skip being on-call because some life event was just sprung on them, or swap for a holiday, we’re there in the same way as when things are on fire.
Be excellent to each other, and make sure Teacup takes the following day off after her out-of-hours call, 'key?