How do I factor oncall into my team selection?

Entry-Level Software Engineer [P3] at Atlassian, 2 years ago

As I figure out which team to join, oncall comes to mind. I've heard that oncall can be pretty intense and stressful here, especially on infra teams. The company has said that they're making efforts to fix this problem, but I'm unsure what to expect there.

How can I figure out whether a team has a healthy oncall rotation? I don't want to join a team just to be burned out by a crazy oncall.

Discussion

(4 comments)
  • 9
    Robinhood, Meta, Course Hero, PayPal
    2 years ago
    • In a vacuum, aim for an SLA > 8 hours so you can sleep, at least 8 people on the rotation so you're oncall around once every 2 months, and <50% of your oncall time spent fire-fighting.
    • The Android oncall I was on at Instagram (which I also designed!) was an example of a good oncall (it actually came in 1st in the engineering survey of my org rating how healthy eng pillars were). It met all the above attributes.
    • Some of the back-end oncalls in my org at Instagram, at least earlier on in their lifespan, were what I would consider a bad oncall (for the individual). Ads is high-pri as it's revenue generating, back-end SLAs are short as you can fix/break everything instantly, and some of these oncalls had only 5-6 engineers, so you're oncall fairly often. I would read their oncall summaries and they would sometimes have to debug like 25 issues in a single oncall shift, which is crazy.
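
The rules of thumb above can be turned into a quick sanity check. The sketch below is only an illustration: the `Rotation` class, its field names, and the one-week shift length are assumptions, not something stated in the thread. With week-long shifts, 8 people works out to one shift every 8 weeks, which roughly matches "once every 2 months".

```python
from dataclasses import dataclass


@dataclass
class Rotation:
    """Hypothetical description of a team's oncall rotation."""
    engineers: int              # people sharing the rotation
    sla_hours: float            # how quickly you must respond to an issue
    firefighting_share: float   # fraction of oncall time spent on fires (0 to 1)
    shift_weeks: int = 1        # assumed shift length; not stated in the thread


def weeks_between_shifts(r: Rotation) -> int:
    """Weeks from the start of one of your shifts to the start of your next."""
    return r.engineers * r.shift_weeks


def concerns(r: Rotation) -> list[str]:
    """Return the rules of thumb from the comment that this rotation misses."""
    found = []
    if r.sla_hours <= 8:
        found.append("SLA is 8 hours or less, so issues can interrupt sleep")
    if r.engineers < 8:
        found.append("fewer than 8 people on the rotation")
    if r.firefighting_share >= 0.5:
        found.append("half or more of oncall time is spent fire-fighting")
    return found


healthy = Rotation(engineers=8, sla_hours=24, firefighting_share=0.3)
rough = Rotation(engineers=5, sla_hours=1, firefighting_share=0.6)

print(weeks_between_shifts(healthy), concerns(healthy))
print(weeks_between_shifts(rough), concerns(rough))
```

The inputs map directly onto questions you can ask a team: how many people are on the rotation, how fast you have to respond, and how much of a typical shift goes to fires.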
  • 8
    Robinhood, Meta, Course Hero, PayPal
    2 years ago
    • Kudos on considering this angle: Most earlier-in-career engineers I've worked with don't even realize that this is a thing!
    • The 2 main aspects of an oncall:
      • Severity: How rough is the oncall when you're on it?
      • Frequency: How often are you oncall?
    • You want both of the above to be as low as possible (obviously).
    • For severity, understand SLA (i.e. how fast you need to respond to issues) and issue volume (how many fires hit the oncall on average).
    • Questions to ask:
      • How many people are on the oncall rotation?
      • How quickly do I need to respond to issues?
      • How many issues does the average oncall person face in their rotation?
    • Overall oncall attributes:
      • Mobile oncalls are generally more relaxed as you can't fix everything instantly like you can for web/back-end.
      • Flagship products will have more stressful oncalls.

  • 7
    Meta, Pinterest, Kosei
    2 years ago

    One thing to evaluate is the type of incidents the oncall has historically faced. (It's best to ask a senior eng on the team to walk you through them.)

    Some oncalls are difficult because the team is basically middleware -- you get the alert, and your job is to find the correct team to actually fix the issue. These are not fun teams to be on, and it's hard to make these oncalls better.

    However, some oncalls can be improved relatively easily, through better documentation, or building a simple tool around timing or correlation. If that's the case, joining one of these teams could actually be an opportunity! Improving the oncall is a great way to ramp up, and generally the bar is lower to "ship something" compared to a production feature.

  • 0
    Entry-Level Software Engineer [P3] [OP]
    Atlassian
    2 years ago

    Appreciate the advice! What are some examples of good and bad on-call schedules?