Seeking Input on Forming a Healthy On-Call Rotation

Tech Lead at Taro Community
2 months ago

My new manager, my old manager, and the broader team have been managing the on-call rotation for the platform of my company's flagship product, which we launched two years ago. Initially, the rotation included just 3 engineers, but after discussions with my directors and acknowledgment from the rest of the organization, we increased it to 8 engineers to form a healthier on-call rotation.

Despite having 8 engineers, I've noticed that many team members, including our principal and staff engineers, are still not familiar with the on-call procedures. I have compiled a support run-book log documenting the steps for handling each issue/alert, so the on-call team understands the severity and business impact of different issues. The issues can range from low priority to business-critical.

However, the support run-book is not entirely reliable as the ultimate source of truth, because supporting our production system works more like triage than systematic debugging.

Additionally, the nature of the on-call work can vary from simply acknowledging alerts and following documented steps to collaborating with business owners. Sometimes, issues are caused by other teams or third-party vendors, making them unsolvable by the on-call engineer alone. I have noticed that production issues happen almost daily, and they affect the company's revenue and the customer-facing experience.

I am interested in learning more about how others view a healthy on-call rotation.

What are the key factors to consider when forming a healthy on-call rotation?

Discussion

(2 comments)
  • 1
    Tech Lead @ Robinhood, Meta, Course Hero
    2 months ago

    The thing that immediately jumps out to me is that the Staff/Principal engineers still aren't being proper champions of oncall (and they are literally the most important people who need to do this). There are 2 possible scenarios here:

    1. The runbook you made can be greatly improved - Writing a good oncall runbook that is both easy to understand (i.e. not esoteric) and thorough is incredibly hard. I would get more feedback on it (tell them to be as brutally honest as possible) and see what level ups you can make.
    2. They just don't really care about oncall - This is unfortunately a much harder problem to solve as the root cause is cultural. It is also a very common pain point for oncalls as many engineers see oncall as this evil annoying monster they have to put up with. The solution here is to get strong buy-in from your managers and craft a powerful message together to all engineers about why they need to care about oncall (and then enforce a contract by doing things like guaranteeing SLAs). Incentives help a lot too (make sure that managers tell their reports that they'll get rewarded in performance review for strong oncall performance).

    I cover all this and more in my oncall case study here: [Case Study] Revamping Oncall For 20 Instagram Engineers - Senior to Staff Project

    Lastly, if your oncall is getting hammered with production issues almost daily, that is a huge problem (no wonder people are zoning out of oncall). One of the best ways to elevate the oncall is to set the example yourself:

    • Compile all the root causes of production issues over the past 3 months into a spreadsheet and find the most common ones. Detail the impact of each problem (you mentioned these hurt revenue and UX; prove that!)
    • After that, lead a workstream to stamp them out (you will probably need to do a lot of the fixes yourself)
    • Get buy-in from your manager so you can lay off your feature work for a while and just become a doctor patching up all these production issue root causes
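
As a rough illustration of the spreadsheet step above, here is a minimal Python sketch that tallies incidents by root cause and ranks the most common ones. The `Incident` fields (`root_cause`, `revenue_impact_usd`, `customer_facing`) and the `rank_root_causes` helper are hypothetical names chosen for this example, and the sample rows are made-up placeholders; a real version would read from whatever export your alerting or ticketing tool provides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    """One row of the hypothetical incident spreadsheet."""
    root_cause: str            # label assigned during review (assumed field)
    revenue_impact_usd: float  # estimated revenue impact (assumed field)
    customer_facing: bool      # whether customers saw the problem (assumed field)


def rank_root_causes(incidents):
    """Group incidents by root cause, most frequent first.

    Ties on frequency are broken by total estimated revenue impact.
    Returns a list of (root_cause, totals_dict) pairs.
    """
    totals = defaultdict(
        lambda: {"count": 0, "revenue_impact_usd": 0.0, "customer_facing": 0}
    )
    for incident in incidents:
        row = totals[incident.root_cause]
        row["count"] += 1
        row["revenue_impact_usd"] += incident.revenue_impact_usd
        row["customer_facing"] += int(incident.customer_facing)
    return sorted(
        totals.items(),
        key=lambda item: (-item[1]["count"], -item[1]["revenue_impact_usd"]),
    )


if __name__ == "__main__":
    # Made-up placeholder rows, not real data.
    sample = [
        Incident("third-party vendor outage", 1200.0, True),
        Incident("bad config deploy", 300.0, False),
        Incident("third-party vendor outage", 800.0, True),
    ]
    for cause, row in rank_root_causes(sample):
        print(cause, row)
```

Sorting by frequency first matches the advice to find the most common root causes; if revenue matters more to your directors, swapping the two sort keys is a reasonable alternative.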
  • 1
    Tech Lead [OP]
    Taro Community
    2 months ago

    Thank you for your input. I totally agree.

    I rewatched your on-call case study video, [Case Study] Revamping Oncall For 20 Instagram Engineers - Senior to Staff Project.

    I plan to keep working with my manager/director on this on-call rotation.