
How Can Things REALLY Break?

The tactics are covered in the system design doc, so this summary will focus on the "meta" context behind this part and additional commentary:

  • You should always think about "How can we cause a SEV?" when you're planning out system design for any substantial project or task.
    • Context: SEV = Severity, which is the term Meta, Robinhood, and many other companies use to describe a really bad product event.
    • This is one of the purest and most critical manifestations of proactive thinking, which is one of the trademark attributes of a competent senior+ engineer and tech lead. A SEV can seriously set back your organization, or even your entire company, if you have to react to it after the fact instead of catching it early.
  • This "anti-SEV" thinking is especially relevant for Big Tech, where the stakes are much higher due to regulatory constraints, increased media presence, and a massive scale of users.
    • It's also far harder to do this well at Big Tech due to a greater range of ways things can break.
    • However, if you can do this well, you will add a tremendous amount of value to your team. This is why pretty much all senior and staff engineers at FAANG and FAANG-equivalent companies have really developed this skill.
  • Here are the ways a SEV can happen from the video (these are all vectors Alex dealt with at Meta as well):
    • Users not playing nice - You should never assume that users will enter proper inputs, or that they won't try to harm other users. Most major tech products have some sort of multiplayer component, so it's important to be vigilant here.
    • Bad auth context - This means that users are able to carry out actions on behalf of other users. This can happen either intentionally (i.e. the acting user is malicious) or accidentally (e.g. the user has multiple accounts on your service and switching between them cleanly is broken).
    • Incorrect backporting - "Backporting" is the process of making things work on older builds of your software. This is very relevant for mobile engineers, as there will always be a good number of users on older builds, especially if your product is used by people in regions with slower or less reliable internet.
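
To make the three vectors above a bit more concrete, here is a minimal Python sketch of a "send message" handler that guards against each one. Everything in it is hypothetical and for illustration only: the names (Session, RequestRejected, send_message), the 500-character limit, and the build number 120 are assumptions, not something taken from the video or the system design doc.

```python
from dataclasses import dataclass

MAX_MESSAGE_LENGTH = 500         # hypothetical limit, chosen for illustration
MIN_BUILD_WITH_REACTIONS = 120   # hypothetical first build that understands "reactions"


@dataclass(frozen=True)
class Session:
    user_id: str    # authenticated user, taken from the session, not the request body
    app_build: int  # build number reported by the client app


class RequestRejected(Exception):
    """Raised when a request fails one of the guards below."""


def validate_message(text) -> str:
    # Users not playing nice: never trust raw input.
    if not isinstance(text, str):
        raise RequestRejected("message must be text")
    cleaned = text.strip()
    if not cleaned:
        raise RequestRejected("message is empty")
    if len(cleaned) > MAX_MESSAGE_LENGTH:
        raise RequestRejected("message is too long")
    return cleaned


def authorize_sender(session: Session, claimed_sender_id: str) -> None:
    # Bad auth context: the acting user must be the authenticated user.
    if session.user_id != claimed_sender_id:
        raise RequestRejected("sender does not match authenticated user")


def build_response(session: Session, text: str) -> dict:
    # Incorrect backporting: only include fields the client's build can parse.
    response = {"text": text}
    if session.app_build >= MIN_BUILD_WITH_REACTIONS:
        response["reactions"] = []
    return response


def send_message(session: Session, claimed_sender_id: str, text) -> dict:
    authorize_sender(session, claimed_sender_id)
    cleaned = validate_message(text)
    return build_response(session, cleaned)
```

The point of the sketch is the ordering: the auth check runs before anything else touches the request, the input is validated before it is used, and the response shape is decided by the client's build rather than by whatever the newest build expects.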

You can find the full system design doc here - Feel free to use it as a template for when you're going through the system design process at your job!