Unable to reproduce a production bug, what will be the repercussions?

Senior Software Engineer at Taro Community
a year ago

For a week I have been investigating a production bug that last occurred on 20th Sept 2023, and earlier in 2021, but I have been unable to reproduce it in the test environment. Occasionally a transaction erroneously overwrites some database values, and I need to reproduce a similar transaction to see where in the code the values are being overwritten. I took the following steps:

  1. Got the request URL from logs for the buggy transaction, to see which UI page is responsible for it.
  2. Spoke to the user who performed that transaction to ask whether he remembers what he changed on that UI page. It was a POST request and only the URL is logged, not the body. He only faintly remembers what he changed. The database history tells me exactly which columns were updated in that transaction, but those columns are not visible on the UI; they were updated indirectly as part of the page save. I still have not figured out in which scenario saving that UI page updates those columns.
  3. Tried performing the same steps in the test environment, but was unable to reproduce the bug. I tried different scenarios, changing different fields on the UI, since the user doesn't remember exactly what he changed.
  4. Next, I will debug the code on my local system to understand the flow.

I'm worried - what if I'm not able to reproduce it? I often think that others will doubt my capabilities. Can I be put on PIP for this?

Discussion

(2 comments)
  • 5
    Tech Lead @ Robinhood, Meta, Course Hero
    a year ago

    I'm worried - what if I'm not able to reproduce it?

    This happens all the time, especially at Big Tech. There are many options here (in the order I recommend them):

    1. Bring in outside help - Bugs are weird, and sometimes you just have a massive blind spot. The solution might be to bring in a fresh pair of eyes who can see an angle you completely missed.
    2. Double down, add more logging - If you have enough information, you should theoretically be able to mimic exactly what the triggering user went through to make the bug happen.
    3. Blind fix - You understand the symptom but not the root cause, so you write some hacky code that patches up the symptom. In general, you should lean against this but there may be no other option. It usually doesn't make sense to spend weeks, if not months, trying to figure out an insanely complicated bug, especially if it's fairly small.
    4. Deprioritize the bug - This is a last resort. If you are legitimately trying super, super hard to repro a bug and you can't do it, chances are that it doesn't affect that many users. But again, it's chance, and a "small" bug at Big Tech can easily affect 1 million+ users. Make sure to have sufficient logging to understand the full blast radius of a bug before you fully give up on it.
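As a rough illustration of the "add more logging" option: the original post mentions that only the URL of the POST request is logged, not the body. A minimal sketch of recording the body for state-changing requests might look like the following. Everything here is hypothetical (the `request_audit` logger name, the `SENSITIVE_KEYS` set, and the function names are assumptions, not part of any real codebase), and it assumes JSON request bodies. How you hook `log_request` into your web framework depends on the framework.

```python
import json
import logging

logger = logging.getLogger("request_audit")  # hypothetical logger name

# Hypothetical field names; replace with whatever your forms treat as sensitive.
SENSITIVE_KEYS = {"password", "token", "ssn"}


def redact(value):
    """Recursively replace the values of sensitive keys with a placeholder."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def audit_record(method, url, raw_body):
    """Build one JSON log line describing a request, with secrets redacted."""
    try:
        body = redact(json.loads(raw_body))
    except (TypeError, ValueError):
        # Body was missing or not JSON; record that fact instead of failing.
        body = "[unparseable body]"
    return json.dumps({"method": method, "url": url, "body": body}, sort_keys=True)


def log_request(method, url, raw_body):
    """Log only state-changing requests, since those can alter database values."""
    if method in {"POST", "PUT", "PATCH", "DELETE"}:
        logger.info(audit_record(method, url, raw_body))
```

With something like this in place, the next occurrence of the bug would leave behind the exact fields the user submitted, rather than relying on what they remember.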

    Can I be put on PIP for this?

    I can't speak for how petty (or evil) every tech company may be, but I would be incredibly surprised if you were put on a PIP for this. PIPs generally come after a prolonged period of underperformance, and this is just a single bug. Also, it seems like you're trying very hard to figure out the issue. You aren't being lazy with it like I've seen other mediocre engineers be (some of whom got PIP-ed).

    I actually just made a playlist with our best debugging resources. I hope it helps: [Taro Top 10] Debugging

  • 4
    Tech Leadership Coach • Former Head of Engineering
    a year ago

    First, I'd ensure everything is the same in the test and prod environments. Parity at the infrastructure and environment level ensures that your efforts at the software level don't go to waste.
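One way to make the parity check concrete is to dump the settings of each environment and compare them mechanically. The sketch below assumes you can export each environment's configuration (feature flags, dependency versions, database settings, and so on) into a flat dictionary; the `diff_settings` name and the example keys are hypothetical.

```python
def diff_settings(prod, test):
    """Return {key: (prod_value, test_value)} for every key whose values differ."""
    missing = object()  # sentinel so that a stored None is not confused with "absent"
    diffs = {}
    for key in sorted(set(prod) | set(test)):
        prod_value = prod.get(key, missing)
        test_value = test.get(key, missing)
        if prod_value != test_value:
            diffs[key] = (
                "<absent>" if prod_value is missing else prod_value,
                "<absent>" if test_value is missing else test_value,
            )
    return diffs


if __name__ == "__main__":
    # Hypothetical example values, purely for illustration.
    prod = {"db_version": "14.2", "feature_x": True, "tz": "UTC"}
    test = {"db_version": "14.2", "feature_x": False}
    for key, (prod_value, test_value) in diff_settings(prod, test).items():
        print(f"{key}: prod={prod_value!r} test={test_value!r}")
```

Any key that shows up in the output is a candidate explanation for why the bug reproduces in one environment but not the other.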

    Document all the paths you've already explored to reproduce the bug. Create a document capturing 1) what you tried 2) what you expected 3) what happened. This will make it much easier to rally for help from others.

    It sounds like you've given this a fair shot, and as long as the document capturing what you've tried so far reflects that, no reasonable person will fault you for it. Like Alex mentioned, focus the discussion on "what else could I have missed" rather than worrying about the perception of others.

    It's a signal of maturity to not throw endless hours at it. A couple things are likely to happen.

    • If they struggle to come up with another idea to try, you'll know that you've covered a reasonable enough surface area to re-evaluate whether the task is worth throwing more effort at. Either deprioritize it or come to an agreement on what's needed to bring it to closure (e.g. who else can help, more time, etc.)
    • In the off-chance someone points out something obvious that you missed, that's OK. Acknowledge it, thank them, and update them after you've executed on the idea. Add it to your playbook, so you'll catch it next time.

    Final thoughts: don't worry about the PIP. We'd all be on one if they were handed out every time someone couldn't fix a bug.

    Interested to hear how this pans out. DM me if you'd like to chat further.