How can I root cause bugs faster?

Mid-Level Software Engineer [Senior Associate] at Capital One, 2 years ago

I get a lot of JIRA tickets for bugs where it's not clear what the fix should be. How can I find the problem area and relevant code faster with these issues?

Discussion

(2 comments)
  • 68
    Meta, Pinterest, Kosei
    2 years ago
    • My tactic for quickly root causing: start by grepping the codebase for anything relevant to the problem (search for the relevant string or module name), and then run git blame on the matching code to see who last changed it and why.
    • If the regression appeared recently, look at recent code or config changes and try reverting them to see if that fixes the issue. This strategy lets you fix issues quickly even if you don't fully understand what's going on :)
    • It's important to clarify whether the gap here is around the desired behavior or around identifying the cause/fix. If it's the former, I'd push back aggressively on the person/team who filed the bug and ask for more info, or send the task to the PM to get input on the desired behavior. It's also fine for you (as the engineer) to have input on the correct fix, but it should be documented and clarified what is a bug, a feature, or desired behavior. (You'll be surprised how often there is ambiguity around that.)
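
When there are many recent changes, the "look at recent changes" tactic above can be made systematic with a binary search over the change history, which is the idea behind `git bisect`. This is a swapped-in technique, not something the comment prescribes. The sketch below is a minimal illustration under stated assumptions: `first_bad_change`, `is_broken`, and the change names are hypothetical, the changes are ordered oldest to newest, and the bug stays present once it has been introduced.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str], is_broken: Callable[[str], bool]) -> str:
    """Return the earliest change at which the bug is present.

    Assumes `changes` is ordered oldest to newest, the newest change
    reproduces the bug, and the bug never disappears once introduced.
    """
    if not changes or not is_broken(changes[-1]):
        raise ValueError("the newest change must reproduce the bug")
    lo, hi = 0, len(changes) - 1  # invariant: changes[hi] is broken
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid      # bug already present here, so look earlier
        else:
            lo = mid + 1  # still healthy here, so the culprit is later
    return changes[lo]


# Hypothetical history in which the bug was introduced by "c4".
history = ["c1", "c2", "c3", "c4", "c5", "c6"]
culprit = first_bad_change(history, lambda change: history.index(change) >= 3)
print(culprit)
```

In practice `is_broken` would check out or deploy the given change and run a reproduction of the bug. Each call halves the remaining range, so even a long list of changes needs only a handful of checks.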
  • 51
    Tech Lead @ Robinhood, Meta, Course Hero
    2 years ago

    Here's my process for debugging:

    1. Figure out the overall flow (i.e. series of steps) for whatever your issue is
    2. Go through each of the steps and use log statements/debugger to figure out if it's working
    3. After finding the broken step, narrow it down to the relevant module/class that is broken.
    4. Step through the code carefully to find the broken method and finally, the broken line of code.
    5. If you're struggling with #3 or #4, run git blame on the code (or use your org knowledge of component ownership) to find the proper person to ping for help. If that person can't help you, find someone else who can and traverse the "help chain".

    The step I see junior and mid-level engineers struggle with the most is #1, followed by not having the confidence to do #5. The failure mode with #1 is that they try to take in everything at once and don't have a structure for stepping through the issue pragmatically. Here's a really scuffed example from me, using a data pipeline.

    Bug: Client event triggered by user behavior (e.g. clicking on a button) isn't properly being logged into the SQL table.

    The steps could be something like:

    1. User behavior happens - Maybe the event just doesn't fire?
    2. Event is sent to some back-end endpoint - Maybe the endpoint doesn't receive anything and the sending protocol is broken.
    3. Endpoint parses the payload into some local data model - If this breaks, maybe the sent payload is malformed?
    4. Back-end logic inserts model into table - If this fails, the issue could be a database connection issue, malformed local data model, or maybe the entire table is just broken and missing a column or something.
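
The four steps above can be sketched as an ordered list of checks that stops at the first failure, which tells you where to start narrowing down. This is only a minimal sketch: `first_broken_stage`, the `payload` contents, and the lambda checks are hypothetical stand-ins for whatever evidence you gather from log statements or a debugger at each step.

```python
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[], bool]]


def first_broken_stage(stages: List[Stage]) -> Optional[str]:
    """Run each check in flow order and return the name of the first that fails."""
    for name, check in stages:
        try:
            ok = check()
        except Exception as exc:  # a crashing check also counts as a failure
            print(f"[FAIL] {name}: raised {exc!r}")
            return name
        print(f"[{'ok' if ok else 'FAIL'}] {name}")
        if not ok:
            return name
    return None


# Hypothetical evidence for one button click: what the endpoint received.
payload = {"event": "button_click", "user_id": None}

stages: List[Stage] = [
    ("1. event fires on the client", lambda: True),
    ("2. endpoint receives the request", lambda: payload is not None),
    ("3. payload parses into the data model", lambda: payload["user_id"] is not None),
    ("4. row is inserted into the SQL table", lambda: False),
]

broken = first_broken_stage(stages)
```

Stopping at the first failing stage matters because later stages depend on earlier ones: here step 4 would also fail, but only as a consequence of the malformed payload caught at step 3.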

    Another thing I recommend while debugging is to write everything down as you're stepping through the issue and learning bits and pieces along the way, similar to what a detective would put together investigating a crime scene. The paper trail helps you zoom out and remember theories you have already considered and debunked, and it helps you claim full credit for fixing the bug at the end as you can show how complex the journey was.

    For a detailed case study of how to find bugs and get proper recognition for fixing them, check out this video where I break down a nasty bug I found while at Instagram: [Case Study] Solving A Multi-Million $$ Instagram Bug
