In my 3rd week at Big Tech, I was over-eager to get my first PR merged (into a particular legacy Airflow repo) so I could complete my first ticket and show some progress. I made probably the most classic rookie mistake of not properly testing my code in staging, and my code ended up causing a sev that took down Prod for about 30 minutes. Because this was a legacy Airflow repo, it wasn't the end of the world: only internal workers were affected (a small subset of the company), and it happened at night. Still, this was a pretty bad look for me with my manager, and I've been working hard since then to make a better impression.
At my company, every sev has a process for writing up a Root Cause Analysis (RCA) doc where you describe the issue, the 5 Whys for why it happened, the timeline of how it happened, who it affected, and a few other details. There's technically an SLA of 2 weeks for each RCA, but looking at other RCA docs, I see a lot of them were never actually filled out.
From my perspective, the reason for the sev was simple: I didn't adequately test in staging. The oncall guy who helped me navigate the issue encouraged me to not personalize it as much and to think in terms of the process, e.g. testing in staging should have been required or canary testing in prod should have caught and rolled back my code.
I have filled out the RCA doc on Confluence and can publish it but am hesitant to do so because I'm concerned about reminding people that I caused the sev.
I have 2 concrete questions:
Thank you for reading this!
If you've already written everything up and publishing is just a click of the "Publish" button away, I feel like you should just hit the button (but don't draw any more attention to it).
If there's meaningful work on top of pressing a button, I would just drop it and move on. It seems like the negative impact of this SEV was pretty minimal.
Zooming out, hiding the fact that you did something bad isn't a great motivation. The factors here are more about your general productivity and spending your time well. It seems like this could just be a giant distraction if sharing out the full RCA is a meaningful amount of work.
One thing I should clarify is that I did test locally, but only with 1 DAG, and the issue only appeared at scale when many DAGs were run. If I had simply built and deployed the code to staging, I would have seen staging become unavailable because multiple DAGs would have run my change. Because this was my first PR, I didn't really know staging was a thing, or perhaps I should say I was too set on getting my PR deployed to think about it. I just want to clarify that I take ownership of my mistake, while adding a little context.