Unable to reproduce a production bug, what will be the repercussions?

Question

For the past week I have been investigating a production bug that last occurred on 20th Sept 2023, and earlier in 2021, but I have been unable to reproduce it in the test environment. Occasionally a transaction erroneously overwrites some database values, and I need to reproduce a similar transaction to see where in the code the value is being overwritten. I have taken the following steps:

Got the request URL from logs for the buggy transaction, to see which UI page is responsible for it.
Asked the user who performed that transaction whether he remembers what he changed on that UI page. It was a POST request and only the URL is logged, not the body, and he only faintly remembers what he changed. The database history tells me exactly which columns were updated by that transaction, but those columns are not visible on the UI; they were updated indirectly as part of the page save. I still have not figured out in what scenario saving that UI page updates those columns.
Tried performing the same steps in the test environment, but was unable to reproduce the bug. Tried several scenarios, changing different fields on the UI, since the user doesn't remember exactly what he changed.
Will next be debugging the code on my local system to get an understanding of the flow.

I'm worried - what if I'm not able to reproduce it? I often think that others will doubt my capabilities. Can I be put on PIP for this?

Alex Chiou · Accepted Answer

I'm worried - what if I'm not able to reproduce it?

This happens all the time, especially at Big Tech. There are many options here (in order of what I recommend):

Bring in outside help - Bugs are weird, and sometimes you just have a massive blind spot. The solution might be to bring in a fresh pair of eyes who can see an angle you completely missed.
Double down, add more logging - If you have enough information, you should theoretically be able to mimic exactly what the triggering user went through to make the bug happen.
Blind fix - You understand the symptom but not the root cause, so you write some hacky code that patches up the symptom. In general, you should lean against this but there may be no other option. It usually doesn't make sense to spend weeks, if not months, trying to figure out an insanely complicated bug, especially if it's fairly small.
Deprioritize the bug - This is a last resort. If you are legitimately trying super, super hard to repro a bug and you can't do it, chances are that it doesn't affect that many users. But again, it's chance, and a "small" bug at Big Tech can easily affect 1 million+ users. Make sure to have sufficient logging to understand the full blast radius of a bug before you fully give up on it.
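
To make the "add more logging" option concrete for this case, where only the URL of the POST request was logged: below is a minimal sketch of request-body logging, assuming a Python WSGI application with form-encoded or JSON bodies (the original post doesn't say what stack is in use). The names `BodyLoggingMiddleware`, `SENSITIVE_FIELDS`, and `MAX_BODY_BYTES` are hypothetical, and the list of fields to mask would need to match your own data.

```python
import io
import json
import logging
from urllib.parse import parse_qsl

logger = logging.getLogger("request_audit")

# Hypothetical field names to mask; adjust to whatever your forms actually send.
SENSITIVE_FIELDS = {"password", "token", "card_number"}
# Only parse and log the first part of very large bodies.
MAX_BODY_BYTES = 10_000


def redact(fields):
    """Replace the values of sensitive top-level fields with a placeholder."""
    return {k: ("***" if k.lower() in SENSITIVE_FIELDS else v)
            for k, v in fields.items()}


def parse_body(raw, content_type):
    """Best-effort parse of a JSON or form-encoded body into a dict."""
    text = raw.decode("utf-8", errors="replace")
    if content_type.startswith("application/json"):
        try:
            data = json.loads(text)
        except ValueError:
            return {"_unparsed": text}
        return data if isinstance(data, dict) else {"_value": data}
    return dict(parse_qsl(text, keep_blank_values=True))


class BodyLoggingMiddleware:
    """WSGI middleware that logs the redacted body of write requests."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") in ("POST", "PUT", "PATCH"):
            try:
                length = int(environ.get("CONTENT_LENGTH") or 0)
            except ValueError:
                length = 0
            raw = environ["wsgi.input"].read(length) if length else b""
            # Put the bytes back so the wrapped application can still read them.
            environ["wsgi.input"] = io.BytesIO(raw)
            fields = parse_body(raw[:MAX_BODY_BYTES],
                                environ.get("CONTENT_TYPE", ""))
            logger.info("%s %s body=%s",
                        environ["REQUEST_METHOD"],
                        environ.get("PATH_INFO", ""),
                        json.dumps(redact(fields), default=str))
        return self.app(environ, start_response)
```

With something like this in place, the next time the bug fires you can line up the database history for the overwritten columns against the exact fields the user submitted, rather than relying on what they remember. Only top-level fields are masked here, so check with whoever owns privacy and compliance before logging bodies in production.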

Can I be put on PIP for this?

I can't speak for how petty (or evil) every tech company may be, but I would be incredibly surprised if you were put on a PIP for this. PIPs generally come after a prolonged period of underperformance, and this is just a single bug. Also, it seems like you're trying very hard to figure out the issue. You aren't being lazy with it like I've seen other mediocre engineers be (some of whom got PIP-ed).

I actually just made a playlist with our best debugging resources. I hope it helps: [Taro Top 10] Debugging
