We fix a lot of random/low frequency crashes to meet an overall crash rate SLA. How should I think about impact here?

Question

My team spends a good amount of time fixing random, recurring low-priority crashes. The goal is to keep the overall crash rate under our agreed SLA. Crash rate naturally goes up over time, so we need to keep investing time in plugging these crashes to hold the overall metric below the SLA.

Is there a way to quantify this in numbers? Maybe we can draw a connection between these fixes and a better user experience that ultimately affects the app's user ratings in the Play Store?

Alex Chiou · Accepted Answer

**Immediate reaction**: What's this SLA, and is it necessary? The SLA helps in that this work has the clear impact of satisfying it, but that impact is sort of conjured up. Flipping this dynamic could itself be high impact: reduce or remove the SLA. As a lead engineer, I recommend that you seriously take the time to understand how much engineering effort goes into these low-pri crashes and whether it's worth it.

Something else to consider is whether the fixes are being done properly. When you have a goal like an overall SLA percentage, engineers can be incentivized to just plug up the crash, which hides real issues that come back later. The classic Android example is null-checking an NPE without understanding its root cause.

When it comes to small issues, your best shot at understanding impact is bundling them all together in one experiment. Since these are crashes, and a crash ends the session, look at time spent in particular, and consider using a holdout group to observe the impact long-term.
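To make the holdout idea concrete, here is a minimal sketch of how the comparison could be computed. It assumes you can export average time spent per user for two groups: users who received the bundled crash fixes and a holdout that did not. The function names, the numbers, and the choice of a permutation test are all illustrative assumptions, not something taken from a specific analytics tool.

```python
import random
from statistics import mean


def holdout_lift(treatment, holdout):
    """Return (absolute, relative) difference in mean time spent.

    `treatment` got the bundled crash fixes; `holdout` did not.
    """
    t_mean, h_mean = mean(treatment), mean(holdout)
    diff = t_mean - h_mean
    return diff, diff / h_mean


def permutation_p_value(treatment, holdout, rounds=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Shuffles the group labels many times and counts how often a random
    split produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    observed = mean(treatment) - mean(holdout)
    pooled = list(treatment) + list(holdout)
    n = len(treatment)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if abs(diff) >= abs(observed):
            hits += 1
    return hits / rounds


if __name__ == "__main__":
    # Made-up minutes per user per day, for illustration only.
    with_fixes = [31.0, 28.5, 35.2, 30.1, 29.8, 33.4]
    held_out = [27.9, 26.4, 30.0, 28.8, 25.1, 29.3]

    absolute, relative = holdout_lift(with_fixes, held_out)
    p_value = permutation_p_value(with_fixes, held_out)
    print(f"lift: {absolute:.2f} min ({relative:.1%}), p = {p_value:.3f}")
```

In practice the groups would hold thousands of users rather than six, and keeping the holdout in place for several releases is what lets the long-term effect show up.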

Discussion