My manager, a teammate, and I own the on-call rotation for the platform behind my company's recently launched flagship product. A rotation of 2-3 engineers is hectic and overwhelming. My manager and I raised this issue and finally got acknowledgment from the rest of the organization that more engineers need to be added to form a healthy on-call rotation. Is an 8-10 engineer on-call rotation a healthy rotation?
2-3 engineers in an oncall is unhealthy IMO. The rule should be that the knowledge/context on the team should still survive if 2 people are not available (e.g. one person on vacation, and one person who quits). So having 2 people total in one rotation is definitely not good.
Of course, this depends on the nature of the oncall. Do you have a log of what issues the oncall deals with, and how much time it requires? If it's a low-stakes oncall, e.g. just updating documentation, it may be fine. But for something that directly impacts production, 8-10 people is much better.
Yes, 2-3 people is going to be overwhelming. I think you need at least 4 to make it sane (once a month) with a manager serving as a backstop. Will Larson endorses the notion of 8 people (https://lethain.com/sizing-engineering-teams/). I would also take a hard look at the seniority of your team to make sure you have engineers who are independent enough to deal with ambiguous questions and debug issues that cut across multiple domains/systems. Runbooks and documentation, not to mention shadowing, are important here too: they make sure there is enough training and expertise that issues don't escalate to other engineers who are trying to get project work done.
I have compiled a support runbook that documents each issue/alert, so that the on-call team understands the severity of the business impact. Issues can range from low priority to business critical.
However, the support runbook is not complete as the ultimate source of truth, since production support behaves more like triage than debugging.
As a result, the nature of an on-call shift can change from acknowledging alerts and following the steps in the support runbook to working with business owners. There are also a few times when the issue is caused by another team or a 3rd-party vendor and cannot be solved by the on-call person.
I am interested to learn how other people view a healthy on-call rotation.
It seems like there are several factors to form a healthy on-call.
I see that there are a few online posts here:
The rotation of 2-3 engineers is hectic and overwhelming
2-3 engineers is way too few for a full, healthy oncall rotation. At this point, you should just do the "traditional" way where when an issue comes in, it just goes to whoever owns the code behind the issue.
I am surprised that a rotation of 2-3 engineers is hectic though: The amount of surface area owned across just 2-3 people should be low. From this, 1 of 2 things is probably true:
Is 8-10 engineer on-call rotation a healthy rotation?
Generally yes. I think 7-12 is the sweet spot for oncall size from my experience.
Here are some other resources on oncall, which may be helpful:
I am considering what other actions I should take to establish a more sustainable and healthy on-call rotation.
If you aren't able to get the bodies, I just don't think you should have a formal oncall rotation. The ideal scenario here is to have a TPM who's decently good at routing fires and bugs to the proper owner.
In the meantime, focus on improving the system. Oncalls aren't inherently hectic - They become overwhelming when the system quality is poor. If you're getting a lot of issues, I would try to figure out the common root causes and fix them. Maybe even dedicate an entire sprint towards "Better Engineering" to make the system more reliable and break less.
Over the past three months, my manager (Director) has repeatedly told me that we should have at least six full-time equivalent (FTE) engineers to handle on-call support. Unfortunately, he has been unable to deliver on this promise due to push-back from other teams. Currently, my director is still part of the on-call rotation, and he told me that he would like to be relieved from on-call duties. I am considering what other actions I should take to establish a more sustainable and healthy on-call rotation.
My manager told me that his ideal goal is for each engineer to be on call only once every 3 months, but the math currently does not add up. Any thoughts?
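To make "the math" concrete, here is a rough back-of-the-envelope sketch. It assumes one person is on call at a time, that shifts last one week unless stated otherwise, and that "every 3 months" means roughly 13 weeks between the start of one shift and the start of that person's next one. None of these numbers come from the thread beyond the team sizes mentioned, and the function names are just illustrative.

```python
import math


def weeks_between_shifts(engineers: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one shift to the start of the same
    person's next shift, with one person on call at a time."""
    if engineers < 1 or shift_weeks < 1:
        raise ValueError("engineers and shift_weeks must be at least 1")
    return engineers * shift_weeks


def engineers_needed(target_weeks: int, shift_weeks: int = 1) -> int:
    """Smallest rotation size so that each person is on call at most
    once every `target_weeks` weeks (assumed shift length: shift_weeks)."""
    if target_weeks < 1 or shift_weeks < 1:
        raise ValueError("target_weeks and shift_weeks must be at least 1")
    return math.ceil(target_weeks / shift_weeks)


if __name__ == "__main__":
    # Team sizes mentioned in the thread, assuming one-week shifts.
    for n in (3, 4, 6, 8, 10):
        print(f"{n} engineers -> on call every {weeks_between_shifts(n)} weeks")

    # Assumption: "every 3 months" is about 13 weeks.
    print("needed for ~13 weeks, 1-week shifts:", engineers_needed(13))
    print("needed for ~13 weeks, 2-week shifts:", engineers_needed(13, 2))
```

Under these assumptions, six engineers puts each person on call about every six weeks, so a once-per-quarter cadence would need either roughly twice that headcount or longer shifts, which come with their own fatigue cost.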
Hi Team,
Note: My director is also part of the on-call rotation. He says it's expected to get alerts at odd hours due to vendor outages, and he would re-route the fire back to vendor management over the weekend.
Props to your director for putting themselves on the line and being on the oncall rotation! That's pretty strange, but it's a nice gesture for sure.
At the end of the day, you don't need to fix every bug. It's great to fix most bugs to uphold system quality, but sometimes this isn't possible and the org has to prioritize something else. Every org is different: As a software engineer, it's important to be open to every possible outcome.
So if you're clear to ignore 80%+ of bugs, then you don't need a super refined oncall system. You could probably just do the basic router system that I cover in the beginning of my Instagram oncall revamp case study and spend 90% of your time on feature work like your manager says.
Leaving a bunch of bugs open sucks for sure, but learning to effectively ignore many tasks and keep them out of your mental space is a really important skill. This is especially important for senior+ engineers, which I talk about in-depth here: "How does one effectively handle pressure especially when the stakes are high?"