My new manager, my old manager, and the broader team have been managing the on-call rotation for the platform of my company's flagship product, which we launched two years ago. Initially, the rotation included just 3 engineers, but after discussions with my directors and acknowledgment from the rest of the organization, we increased it to 8 engineers to form a healthier on-call rotation.
Despite having 8 engineers, I've noticed that many team members, including our principal and staff engineers, are still not familiar with the on-call procedures. I have compiled a support run-book log documenting the steps for handling each issue/alert, so the on-call team understands the severity and business impact of different issues. The issues can range from low priority to business-critical.
However, the support run-book is not entirely reliable as the ultimate source of truth, because our production support works more like triage than systematic debugging.
Additionally, the nature of the on-call work can vary from simply acknowledging alerts and following documented steps to collaborating with business owners. Sometimes, issues are caused by other teams or third-party vendors, making them unsolvable by the on-call engineer alone. I have noticed that production issues happen almost daily, and they affect the company's revenue and the customer-facing experience.
I am interested in learning more about how others view a healthy on-call rotation.
What are the key factors to consider when forming a healthy on-call rotation?
The thing that immediately jumps out to me is that the Staff/Principal engineers still aren't being proper champions of oncall (and they are literally the most important people who need to do this). There are 2 possible scenarios here:
I cover all this and more in my oncall case study here: [Case Study] Revamping Oncall For 20 Instagram Engineers - Senior to Staff Project
Lastly, if your oncall is getting hammered with production issues almost daily, that is a huge problem (no wonder people are zoning out of oncall). One of the best ways to elevate the oncall is to set the example yourself:
Thank you for your input. I totally agree with it.
I rewatched your video on-call case study, [Case Study] Revamping Oncall For 20 Instagram Engineers - Senior to Staff Project.
I plan to continue working with my manager and director on this on-call rotation.