
How can I improve the unstructured, chaotic oncall rotation I'm on?

Mid-Level Software Engineer [E4] at DoorDash, 3 years ago

My team is kind of crazy: my manager has 30 direct reports, and that craziness shows up in my oncall as well. Here's how my oncall currently works:

  • It's made up of 3 engineers: 2 other back-end engineers and myself. This means I'm oncall 1/3 of the time.
  • Around 15% of the time is spent actually resolving issues. For the remaining 85%, we focus on operational excellence (improving oncall documentation, adding tests, etc.).
  • There is no agreed-upon SLA time, so we just push ourselves to respond to all issues ASAP.
  • We aren't fully equipped to solve a lot of the issues our rotation gets. The oncall often has to rope in another team or oncall to fix the issue.

Discussion

(1 comment)
    Robinhood, Meta, Course Hero, PayPal
    3 years ago
    • Merge your oncall into another oncall. Your oncall is too small on its own, and the fact that you spend so much of your time triaging issues means that the impact of your oncall is very hazy. Oncalls should primarily be a direct shield for the system they own, not a router that then finds the true shield.
    • Change your oncall so that you are only responsible for fixing issues. Losing 1/3 of your time to oncall work makes it really hard to ship on your team's core goals (i.e. the product work). With such a small oncall, there's only so much truly impactful operational excellence you can do.
    • Define an SLA time. This is one of the most important aspects of any oncall. Without one, people will assume the expected response time for every issue is ASAP, which makes the oncall a work-life balance black hole.
    • Copy another oncall within the company. Oncall is one of the first things any legit tech company has to figure out. DoorDash is an amazing company with great engineering talent. I'm sure at least one org has a well-oiled oncall machine that you can borrow from.
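
To make the "define an SLA time" point concrete, here is a rough sketch of what a written-down SLA could look like. The severity names (`sev1` to `sev3`) and the response windows are made-up placeholders, not anything DoorDash actually uses; the real numbers should come from whatever your team and its stakeholders agree on.

```python
from datetime import datetime, timedelta

# Hypothetical severity tiers and response targets (assumptions for illustration).
# Replace these with the values your team actually agrees on.
RESPONSE_SLA = {
    "sev1": timedelta(minutes=15),  # e.g. customer-facing outage
    "sev2": timedelta(hours=4),     # e.g. degraded but working
    "sev3": timedelta(days=2),      # e.g. minor bug or question
}


def respond_by(severity: str, opened_at: datetime) -> datetime:
    """Return the latest time the oncall is expected to respond to an issue."""
    return opened_at + RESPONSE_SLA[severity]


def is_breached(severity: str, opened_at: datetime, now: datetime) -> bool:
    """Return True if the response deadline for the issue has already passed."""
    return now > respond_by(severity, opened_at)
```

Even a tiny table like this gives the oncall permission to leave a low-severity issue until the next working day instead of treating everything as ASAP.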