
Debugging Q&A and Videos

About Debugging

Unable to reproduce a production bug, what will be the repercussions?

Senior Software Engineer at Taro Community

For the past week I have been investigating a production bug that last occurred on 20th Sept 2023, and before that in 2021, but I am unable to reproduce it in the test environment. Occasionally a transaction erroneously overwrites some database values, and I need to reproduce a similar transaction to see where in the code the values are being overwritten. I have taken the following steps:

  1. Got the request URL for the buggy transaction from the logs, to see which UI page is responsible for it.
  2. Asked the user who performed that transaction whether he remembers what he changed on that UI page. It was a POST request, and only the URL is logged, not the body. He only faintly remembers what he changed. The database history tells me exactly which columns were updated by that transaction, but those columns are not visible on the UI; they were updated indirectly as part of the page save. I still have not figured out in which scenario saving the given UI page updates those columns.
  3. Tried performing the same steps in the test environment, but was unable to reproduce the bug. I tried different scenarios by changing different fields on the UI, since the user doesn't remember exactly what he changed.
  4. Next, I will debug the code on my local system to understand the flow.
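
Step 2 stalls because only the URL is logged for POST requests, not the body. Below is a minimal sketch of one way to capture the body for future occurrences, assuming the bodies are JSON. The function names, the `request_audit` logger, and the `SENSITIVE_FIELDS` list are all hypothetical and not taken from the actual system; a real version would hook into whatever web framework the application uses.

```python
import json
import logging

logger = logging.getLogger("request_audit")  # hypothetical logger name

# Hypothetical list of field names that must never reach the logs.
SENSITIVE_FIELDS = {"password", "token", "card_number"}


def redact(value):
    """Recursively replace sensitive values in parsed JSON with a placeholder."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_FIELDS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def audit_record(method, url, raw_body):
    """Build a loggable record for a write request; return None for reads."""
    if method.upper() not in {"POST", "PUT", "PATCH", "DELETE"}:
        return None
    try:
        body = redact(json.loads(raw_body))
    except (TypeError, ValueError):
        # Missing or non-JSON body (e.g. form data): record only its length.
        body = {"unparsed_body_length": len(raw_body or b"")}
    return {"method": method.upper(), "url": url, "body": body}


def log_request(method, url, raw_body):
    """Log the redacted record so a buggy transaction can be replayed later."""
    record = audit_record(method, url, raw_body)
    if record is not None:
        logger.info(json.dumps(record, sort_keys=True))
    return record
```

With something like this in place, the next occurrence of the bug would leave a redacted copy of the submitted fields in the logs, which could then be replayed in the test environment instead of relying on the user's memory.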

I'm worried - what if I'm not able to reproduce it? I often think that others will doubt my capabilities. Can I be put on PIP for this?

Posted a year ago
146 Views
2 Comments

What do you do when you're faced with a problem that you can't solve?

Software Engineer at Taro Community

So, I'm the only frontend developer on a mobile application. My boss is a backend (BE) engineer, so if I ask for help he just tells me, "I'm sorry, but I have my own things; you need to figure this out." I've expressed concerns when I wasn't happy with that answer, but he doubled down, saying that I knew more than he did about my problem and so he couldn't help me.

The issue is that the things I have problems with are specific to frontend (FE): maybe I'm trying to work out some data flow and just want to bounce ideas off a coworker. Or I have to integrate some FE piece with the BE, and since we're a 3-person (engineer-wise) startup, we don't have documentation or really anything besides Slack messages to explain things.

This has forced me to just white-knuckle my way through problems. For the past year and a half I've been able to do this; however, I'm now facing more difficult problems, such as live-streaming and bridging native modules (I work with RN).

More recently, I got stuck on a problem where I seriously contemplated quitting the company because I couldn't figure it out. There is a ton of pressure because we have a daily stand-up and I can only say, "I'm still working on X due to Y" for so long. And so I thought, what happens when breaking it down, trying to solve a simpler problem, posting online, talking to teammates, and reading docs just don't work? I seriously thought everything was spiraling out of control.

I honestly don't know if there is an answer to this problem. But I was truly feeling hopeless, just blindly trying to solve an issue by googling, asking ChatGPT, and hoping for the best each time I hit compile.

Posted 10 months ago
77 Views
3 Comments

Seeking input on Forming a Healthy On-Call Rotation

Tech Lead at Taro Community

My new manager, my old manager, and the broader team have been managing the on-call rotation for the platform of my company's flagship product, which we launched two years ago. Initially, the rotation included just 3 engineers, but after discussions with my directors and acknowledgment from the rest of the organization, we increased it to 8 engineers to form a healthier on-call rotation.

Despite having 8 engineers, I've noticed that many team members, including our principal and staff engineers, are still not familiar with the on-call procedures. I have compiled a support run-book log documenting the steps for handling each issue/alert, so the on-call team understands the severity and business impact of different issues. The issues can range from low priority to business-critical.

However, the support run-book documentation is not entirely reliable as the ultimate source of truth because our production system support behaves more like triage than a debug system.

Additionally, the nature of on-call work can vary from simply acknowledging alerts and following documented steps to collaborating with business owners. Sometimes, issues are caused by other teams or third-party vendors, making them unsolvable by the on-call engineer alone. I have noticed that production issues happen almost daily, and they affect the company's revenue and the customer-facing experience.

I am interested in learning more about how others view a healthy on-call rotation.

What are the key factors to consider when forming a healthy on-call rotation?

Posted 5 months ago
41 Views
2 Comments