My new manager, my old manager, and the broader team have been managing the on-call rotation for the platform behind my company's flagship product, which we launched two years ago. Initially, the rotation included just 3 engineers, but after discussions with my directors and buy-in from the rest of the organization, we expanded it to 8 engineers to make the rotation healthier.
Despite having 8 engineers, I've noticed that many team members, including our principal and staff engineers, are still not familiar with the on-call procedures. I have compiled a support run-book log documenting the steps for handling each issue/alert, so the on-call team understands the severity and business impact of different issues. The issues can range from low priority to business-critical.
However, the support run-book cannot serve as the single source of truth, because our production support work is closer to triage than to systematic debugging: the documented steps do not always cover what actually needs to be done.
Additionally, the nature of on-call work varies, from simply acknowledging alerts and following documented steps to collaborating with business owners. Sometimes, issues are caused by other teams or third-party vendors, so the on-call engineer cannot resolve them alone. Production issues occur almost daily, and they affect the company's revenue and the customer-facing experience.
I am interested in learning more about how others view a healthy on-call rotation.
What are the key factors to consider when forming a healthy on-call rotation?