What is an effective way to understand a new repository so that I can make the required changes in it?

Software Engineer at Microsoft
2 years ago

Every time my manager asks me to make changes in a repository that is totally new to me, I get scared.

I prefer to do research by myself first, but I get lost reading the new repo file by file and don't gain any clarity.

So I ask the repository owner for documentation. Mostly they don't maintain any, and even when they do, it is either out of date or consists of detailed feature-by-feature documentation that is usually not relevant to my requirement.

Then I call the POC of that repo, but I can't figure out the right questions to ask in the first call. Over time, I ping him with questions whenever I hit hurdles while working on the requirements.

Sometimes, I use a debugger or add logs to understand the flow of the code.

The above processes take a lot of my time.

What is your suggestion for getting clarity in a new repo so that I can complete my requirements in less time?

Discussion

(3 comments)
  • 223
    OpenAI Engineer (ChatGPT Team), Ex-Microsoft, Ex-Meta, Ex-CEO of tech nonprofit
    2 years ago

    This is a super-common experience, so I'm glad you brought it up. I worked on six different teams at Microsoft, and then on many teams at Meta (where the challenge was not only the codebase, but the fact that I had never done JavaScript or PHP).

    My main advice on this feeling is to:

    1. get more comfortable with working under ambiguous circumstances, and
    2. focus on the zen of learning how to understand a given codebase

    Obligatory Ambiguity

    When I joined Windows in 2000, it was already perhaps 50M lines of code. In such a world, it's impossible to understand even 5% of the codebase. At the time, even understanding just the DLLs would have been too much to ask. Most large corporations you join will have legacy codebases that are large (perhaps not as large as Windows, but large).

    So here's the deal: most engineers in such codebases are operating with very limited knowledge of the actual codebase. What becomes important then is to be comfortable working with limited knowledge, and to know which things are most important to know.

    Comfort at working with limited knowledge depends on a few things. The existence of thorough tests makes things easier. The factoring of code to have loose coupling / strong cohesion also helps. The discoverability of all code (e.g. monorepo) is huge. But personality is also part of this. Some people can't go on vacation unless every day is already booked and all activities are accounted for. Others arrive at some location and just free-range. The key mindset, in a large codebase, is to embrace that you'll never know even a significant portion of it. The skill becomes how to operate under such circumstances.

    How to Understand

    • Learn which questions to ask. Diving into new codebases is like sight-reading sheet music -- the more you do it, the more your mind develops mental models of where the key questions and pitfalls are. Seek opportunities to dive into new code, unlike people who avoid going into dark corners because they're scared to. The latter sort of person stays scared for life.
    • Fix lots of bugs. Bug-fixing hones your discovery skills: what code calls what, which tools make it easier, etc. Bug fixing also allows you to interact with far more of a codebase than if you write new code in it. It's the fastest way to cover a lot of ground.
    • Don't look for documentation. Unless you're working on a miraculous team (which btw wastes a ton of time keeping documentation updated instead of writing code that impacts customers), you should assume documentation is outdated or non-existent. Code is like Shakira's hips: it never lies.
    • Meta-learn. Learn how to learn. You'll develop strategies... like if you're confused about a component, look through git history to figure out which changes last touched it. Then look at those commits, which'll tell you exactly which other components interact with the current component. Using source code history is just one meta-learning strategy; there are many others. The key is to practice (i.e. dive into new code a lot), and to be observant about which behaviors of yours lead to faster/better outcomes.
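
    The git-history strategy in the last bullet can be partly automated. Below is a minimal Python sketch, not a definitive tool: it assumes you have a git checkout, and the `@@` marker, the function names, and the sample paths are all made up for illustration. It counts which other files changed in the same commits as the file you are confused about.

```python
import subprocess
from collections import Counter

COMMIT_MARKER = "@@"  # arbitrary prefix chosen here so commit lines are easy to spot


def parse_commits(log_text):
    """Turn marker-delimited `git log --name-only` output into a list of file lists."""
    commits = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(COMMIT_MARKER):
            commits.append([])        # a new commit starts here
        elif line and commits:
            commits[-1].append(line)  # a file changed in the current commit
    return commits


def co_changed_files(log_text, target):
    """Count how often other files changed in the same commits as `target`."""
    counts = Counter()
    for files in parse_commits(log_text):
        if target in files:
            counts.update(f for f in files if f != target)
    return counts


def git_log_for(path, repo=".", max_commits=50):
    """Return the log of commits touching `path` (requires git and a checkout).

    --full-diff asks git to list every file in those commits, not only `path`.
    """
    cmd = ["git", "-C", repo, "log", f"--max-count={max_commits}", "--full-diff",
           "--name-only", f"--pretty=format:{COMMIT_MARKER}%h", "--", path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Hypothetical sample output, so the sketch runs without a repository.
    sample = (
        "@@a1b2c3\nsrc/billing/invoice.py\nsrc/billing/tax.py\n\n"
        "@@d4e5f6\nsrc/billing/invoice.py\nsrc/billing/tax.py\ntests/test_invoice.py\n\n"
        "@@0718ab\ndocs/README.md\n"
    )
    for name, n in co_changed_files(sample, "src/billing/invoice.py").most_common():
        print(n, name)
```

    In a real checkout you would pass `git_log_for("path/to/file")` instead of the sample text. Files that keep showing up next to your target are good candidates for "components that interact with it".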
  • 107
    Tech Lead @ Robinhood, Meta, Course Hero
    2 years ago

    Great question! There are a lot of great details here that I really appreciate you sharing, so I'll go through them one-by-one.

    Every time my manager asks me to make changes in a repository that is totally new to me, I get scared.

    Flip the mentality - Don't be scared, be excited! Fear is one of the biggest obstacles holding back software engineers. When you are afraid, that fear infects everything you do: In particular, it prevents you from being bold.

    Now here's the thing: Learning a crazy, new codebase is all about being bold. You need to be bold making seemingly stupid changes to break the code in a super obvious way to maximize your learning. You need to be bold asking your teammates for help, maybe even pair programming with them.

    Every time you pick up a new codebase, you are both learning the tactics behind it and building up your "meta" learning muscle as Philip described. That should make you excited!

    I prefer to do research by myself first, but I get lost reading the new repo file by file and don't gain any clarity.

    Reading the code, especially file-by-file, is one of the easiest ways to throw your time into a black hole. I recommend these other tactics:

    • Read the blames - Blame the overall module and see which files have the most recent changes. From there, you can pick the files that matter the most. We cover this tactic more in this video here: Learning A New Codebase? Here's How To Figure Out What Matters
    • Ask your colleagues for a high-level overview - This is a sort of "leveled up" version of the previous tactic that can even go alongside it. Find a core POC for this other repo (you can use recent blame volume to figure this out) and put a meeting on their calendar to discuss:
      • What the most important classes are in the repo
      • How different components talk to each other and the overall end-to-end flow
      • Your goals within this repo
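
    As a rough illustration of the "read the blames" tactic, here is a small Python sketch. Note that it approximates blame volume with `git log` over a recent time window rather than running `git blame` itself, and that the `@@` marker, the function names, and the three-month window are assumptions made for this example. It ranks files by how often they changed recently and counts commits per author, which hints at who the POC might be.

```python
import subprocess
from collections import Counter

MARK = "@@"  # arbitrary prefix so author lines can be told apart from file paths


def summarize(log_text):
    """Return (changes per file, commits per author) from marker-delimited log output."""
    files, authors = Counter(), Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(MARK):
            authors[line[len(MARK):]] += 1  # one author line per commit
        elif line:
            files[line] += 1                # one line per file changed in that commit
    return files, authors


def recent_log(path=".", repo=".", since="3 months ago"):
    """Return recent history under `path` (requires git and a checkout)."""
    cmd = ["git", "-C", repo, "log", f"--since={since}", "--name-only",
           f"--pretty=format:{MARK}%an", "--", path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Hypothetical sample output, so the sketch runs without a repository.
    sample = "@@Ana\napi/orders.py\napi/models.py\n\n@@Ben\napi/orders.py\n\n@@Ana\napi/orders.py\n"
    files, authors = summarize(sample)
    print("Hot files:", files.most_common(3))
    print("Likely POCs:", authors.most_common(3))
```

    Treat the result as a starting point for the conversation, not as proof of ownership: the most frequent committer may have moved teams.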

    So I ask the repository owner for documentation. Mostly they don't maintain any, and even when they do, it is either out of date or consists of detailed feature-by-feature documentation that is usually not relevant to my requirement.

    This is another classic trap I've seen engineers fall into, hehe. The documentation will never be good enough. This is going to be especially true in a top-shelf, massive company like Microsoft. It's simply too hard to keep the documentation up-to-date, and engineers generally aren't rewarded enough in their performance review to do so.

    Instead of reading the documentation, you should fall back onto more "active" tactics like the ones I described:

    • Changing the code and breaking stuff
    • Talking to people
    • Tactically going through the blames to find hotspots

    Then I call the POC of that repo, but I can't figure out the right questions to ask in the first call.

    Honestly, a good first question is something like: "Hey, I'm completely new to your codebase, and I need to make the following changes to accomplish [MY_TASK]. I really want to make sure I uphold the integrity of your system and do it properly, so can you give me a high-level beginner's overview of how it all works and best practices?"

    Just show that you're extremely motivated to be a good citizen within their ecosystem. And after you receive their help, make sure to give them deep thanks.

    Over time, I ping him with questions whenever I hit hurdles while working on the requirements.

    I could see this getting frustrating for this other person, so I highly recommend batching questions together as much as you can. To help with this, you should break down your task into bite-sized chunks through decomposition. From there, you can proactively think through each piece and come up with a bunch of questions at once. Here's a good discussion around decomposition: "How do I make software less overwhelming?"

    Sometimes, I use a debugger or add logs to understand the flow of the code.

    Both are great! The debugger feels more "refined", but I actually like logs a lot more. From my personal experience, the debugger can hang in really large codebases like those of Meta and Microsoft. If this is the case for you, there's nothing wrong with adding a bunch of logs. When it comes to logging, make sure to overlog to minimize the number of builds you need to do.
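
    One cheap way to "overlog" without hand-writing a log line in every function is a tracing decorator. The following is a minimal Python sketch under stated assumptions: the `traced` decorator and `apply_discount` function are hypothetical names invented for this example, and your codebase's language and logging setup will likely differ.

```python
import functools
import logging

logger = logging.getLogger("trace")


def traced(func):
    """Log arguments, return value, and exceptions for every call to `func`."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        logger.debug("-> %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("!! %s raised", func.__name__)
            raise  # re-raise so behavior is unchanged apart from the log line
        logger.debug("<- %s returned %r", func.__name__, result)
        return result
    return wrapper


@traced
def apply_discount(price, percent):  # hypothetical function standing in for real code
    return round(price * (1 - percent / 100), 2)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
    apply_discount(200.0, 15)
```

    Decorating every function along the path you care about gives you the full call flow from a single run, which is the point of overlogging when builds are slow.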

    Lastly, I highly recommend this other discussion around learning new codebases and becoming more independent within them: "How can I become more independent and better at unblocking myself with tricky technical issues?"

  • 29
    Senior Software Engineer [5A] at Uber
    a year ago

    Old question, but I'm going to give you a structured checklist because a lot of the answers are too vague and disorderly; they are not necessarily systematic or actionable. The problem you are talking about is more colloquially "decision-making under uncertainty with imperfect information". You are asked to perform a task in a codebase that you cannot possibly understand completely. This is actually fairly common, especially when you get to the senior level.

    I will be drawing on my own experiences navigating the Bazel codebase, since it's an open source repo. I have a real example of debugging the Bazel codebase, making a change, and upstreaming the change to Google. You can probably steal 70% of the workflow that I'm doing and massage it to what you need to do.

    Here's the YouTube video I made about this: Debugging Under Uncertainty - Case Study: Bazel

    Documentation

    A lot of documentation will be high-level descriptions of what the code should functionally do. However, most documentation will be out of date or poorly written, or will not make sense until you actually work with the code. There is often a lot of context on how each portion of the codebase works that is implicitly assumed by the authors. Other times, the codebase will evolve and the documentation will NOT be updated.

    However, that's not to say the documentation is useless. It should give you a rough idea of what levers you can pull and how information flows through the system. Documentation should be treated as the "spirit" of the codebase unless it is meticulously kept up to date.

    Bad information is often worse than no information.

    And yes, the git history is a form of documentation. Hopefully the committers have left meaningful comments and explanations.

    Create a mental model of roughly how the codebase works around your target area.

    Start with an end-to-end story of how data flows through the system and codebase. Which packages/classes handle what? What are the expected inputs/outputs? How should it roughly function? This is pretty easy if you can search for strings and set a breakpoint in the codebase.
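
    To make the "search for strings" step concrete, here is a minimal Python sketch of a string search that reports where a known message appears. The function name, the skipped directory names, and the default `.py` extension are assumptions for illustration; in practice your editor's search or a tool like grep does the same job.

```python
import os

SKIP_DIRS = {".git", "node_modules"}  # assumed names of directories not worth searching


def find_string(root, needle, extensions=(".py",)):
    """Yield (path, line_number, line) for each line under `root` containing `needle`."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.strip()


if __name__ == "__main__":
    # "Payment declined" is a hypothetical error message you saw in the UI or logs.
    for path, number, line in find_string(".", "Payment declined"):
        print(f"{path}:{number}: {line}")
```

    Each hit is a candidate entry point. In Python, placing the built-in `breakpoint()` call just above such a line drops you into the debugger there, so you can walk up the call stack and see how the data arrived.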

    Through testing and navigation and stepping through each code flow, especially with a debugger, you begin to understand the discrepancies between documentation and what the code does. You may even need to re-read the documentation and start asking around for why a document says x but you see y. Every discrepancy or misunderstanding is a learning moment.

    Experiment, experiment.

    Now that you have a rough mental model of how the code works, prove that you are right. Plug in some inputs and see if your outputs are what you expect. What will break the code? Why? What are the shortcomings? How would you troubleshoot x errors?

    If you don't have an idea of how to debug issues in the area of the codebase as efficiently as possible beyond "just debug", you don't have a good understanding of what you're working with. Experiments are designed to get you there.
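
    One lightweight way to run such experiments is to write your predictions down as data and let a small harness tell you where you were wrong. This is a sketch, not a testing framework: `run_experiments` and `normalize_username` are hypothetical names, and the predictions are deliberately naive to show a surprise.

```python
def run_experiments(func, cases):
    """Compare predicted outputs with what `func` actually does.

    `cases` is a list of (args, predicted) pairs. Returns the surprises as
    (args, predicted, observed) tuples; an exception is recorded by class name.
    """
    surprises = []
    for args, predicted in cases:
        try:
            observed = func(*args)
        except Exception as exc:
            observed = type(exc).__name__
        if observed != predicted:
            surprises.append((args, predicted, observed))
    return surprises


def normalize_username(raw):  # hypothetical stand-in for the code under study
    return raw.strip().lower()


# Predictions written down BEFORE running anything.
cases = [
    (("  Alice ",), "alice"),
    (("BOB",), "bob"),
    ((None,), ""),  # guess: None is treated as an empty name
]

if __name__ == "__main__":
    for args, predicted, observed in run_experiments(normalize_username, cases):
        print(f"{args}: predicted {predicted!r}, observed {observed!r}")
```

    Here the third guess is wrong: the function raises `AttributeError` on `None` instead of returning an empty string. Each surprise like this marks a gap in your mental model, and the passing cases can later be turned into regression tests.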

    There is what you know. Then there is what you can prove.

    Prototype

    Now that you have an understanding of the area you need to modify, prove that your idea works. Do a hacky, least-effort, fail-fast prototype of your idea. This will prove two things. First, that your mental model is indeed correct. Second, how easy it is to make the change you need. When writing code for unknown and complex codebases, your changes should be as simple as possible. And I mean VERY simple (maybe even 3-4 line changes per diff in some cases when it's time to productionize it).

    In complex adaptive systems like codebases, there is a very real risk of cascading failures.

    Productionize

    You have tested your idea, you understand how the area of your code works. Now it's time to design + write the code properly. Think about all your use cases and, given your knowledge of how the code works, try to come up with some code structure. Think of how it will scale, the future of the codebase given its past evolution, what you need to defend your code from, etc.


    Hope this was useful. I know this was long and pedantic. But I cannot stress enough how important it is to make simple, proven, and bulletproof changes to a complex codebase. It will take you a lot longer than you think to go through this, but you can always justify it by saying that you are being cautious and guarded. But do be pragmatic and loop in your coworkers throughout the process.

    The risk of failures from unknowns and unknowables, especially those caused by emergent properties of complex adaptive systems, cannot be overstated. Black swans are far, far more common than you think, and just because you have test cases does not mean you've necessarily thought through all the points of failure. You don't know what you don't know.