
How to Improve Broken/Flaky End to End Tests for my team

Entry-Level Software Engineer [E3] at Meta · 8 days ago

Hi everyone, I recently started as an Engineer at Meta, and my team has a lot of frequently broken, flaky tests. Almost every single week, a test breaks. The End to End Test Wiki is not enough for this. How can I improve the testing system? Are there resources you recommend? What suggestions do you have for me?

Thank you!


Discussion

(3 comments)
  • Engineer @ Robinhood · 8 days ago

    Can you try fixing 1 test? Set a hard stop 1-2 weeks in the future, nudge whoever can help fix it, and then document what you tried. This will complement what Rahul is suggesting: what you're looking to define is yield (value/effort). If the yield is low (the value is very low or the effort is too high), that will explain why no one fixes the tests. In that case, I'd just delete them and measure whether that decreases build times.
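A concrete place to start when fixing that one test: flaky end-to-end tests are very often caused by a fixed sleep racing against variable latency. A minimal sketch (in Python, using a hypothetical `wait_until` helper rather than any specific framework's API) of swapping a sleep for a polling wait:

```python
import time

def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Replaces patterns like `time.sleep(2); assert thing_happened()`,
    which fail whenever the system is slower than the hardcoded sleep.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Before (flaky):   time.sleep(2); assert order_status() == "SHIPPED"
# After (robust):   wait_until(lambda: order_status() == "SHIPPED")
```

The test now tolerates slow runs up to the timeout instead of failing on any run slower than the sleep, and it finishes early on fast runs.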

  • Tech Lead/Manager at Meta, Pinterest, Kosei · 8 days ago

    IIRC, Meta had a system to automatically disable flaky tests, no? Why is this not kicking in here? I'm sure this depends on the codebase you're dealing with.

    The first question to ask is: "why does this matter?" When a test breaks, how much time/energy goes into fixing it? Do other people on the team view this as a problem?

    This sounds like a reasonable thing to spend time on, but it could also be a rabbit hole that doesn't actually yield anything fruitful for you. The worst outcome is if you spend a bunch of time on this and no one cares.

    My recommendation is to write a very thorough Workplace post documenting:

    • The problem: what's happening and how long it's been going on
    • Research you've done about why this is happening
    • The negative impact stemming from the problem
    • A request for feedback or suggestions on next steps (along with a few proposed ideas)

    At a minimum, you will learn a lot from making this post. Tag relevant people, and you may get valuable feedback to decide if you want to invest further in fixing it.

    I talk more about my strategy around comms here: [Case Study] Effective Communication: Leading A Multi-Org Re-architecture At Meta
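For context on what an auto-disable system like the one mentioned above might look like, here is a minimal sketch (hypothetical, in Python; not Meta's actual system) that tracks each test's recent pass/fail history and quarantines tests whose failure rate crosses a threshold:

```python
from collections import defaultdict, deque

class FlakyQuarantine:
    """Track recent results per test; auto-skip tests that fail too often.

    A test is quarantined once it has `window` recorded runs and more
    than `threshold` of them failed. Quarantined tests stay skipped
    until someone fixes or deletes them.
    """

    def __init__(self, window=20, threshold=0.3):
        self.window = window
        self.threshold = threshold
        # Each test keeps only its last `window` results.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, test_name, passed):
        self.history[test_name].append(passed)

    def is_quarantined(self, test_name):
        runs = self.history[test_name]
        if len(runs) < self.window:
            return False  # not enough data to judge flakiness yet
        failure_rate = runs.count(False) / len(runs)
        return failure_rate > self.threshold
```

In practice the CI system would persist this history and report quarantined tests to their owners, which is exactly the signal you'd want for the "why does this matter?" post above.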

  • Tech Lead @ Robinhood, Meta, Course Hero · 7 days ago

    Every Big Tech company is filled to the brim with flaky tests. With flaky tests, you have 3 options:

    1. Fix them
    2. Ignore them and keep suffering
    3. Delete them

    At a company like Meta, where you're always heads down with roadmap work, it's easy to do #2. I saw this all the time with E3s and E4s. However, this goes against the spirit of Meta and of top engineers overall, as you aren't taking any action. It feels painful, but it's way better to do #1 or #3 as a "1 step backward, 2 steps forward" type thing.

    In general, I'm a fan of trying to save things, especially with code quality. As Jonathan mentioned, set aside some time to just fix ONE test. This is literally the perfect time of the year to do this Better Engineering work as you are in code freeze right now. From there, you can make an informed call on whether to do #1 (create a playbook and repeat) or #3.

    If you can pull this off (either getting buy-in to do #1 or #3), this will be a shining gem on your E3 packet. This is more like advanced E4 behavior.

    Side note: End-to-end tests suck and are tremendously overrated. We had a similar problem at Instagram. We made the call to effectively delete end-to-end tests and break them down into unit tests and snapshot tests.
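To illustrate the snapshot-test half of that split, here is a minimal sketch (a hypothetical `assert_matches_snapshot` helper, in Python; real projects typically use a library for this) that compares a component's serialized output against a stored golden file, recording the snapshot on the first run:

```python
import json
import pathlib

def assert_matches_snapshot(name, value, snapshot_dir="__snapshots__"):
    """Compare `value` against a stored golden snapshot.

    On the first run, the snapshot file doesn't exist yet, so we write
    it. On later runs, any difference from the golden copy fails the
    test. To intentionally change behavior, delete the snapshot and
    re-run to record a new golden copy.
    """
    path = pathlib.Path(snapshot_dir) / f"{name}.json"
    serialized = json.dumps(value, indent=2, sort_keys=True)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(serialized)  # first run records the golden copy
        return
    assert path.read_text() == serialized, f"snapshot mismatch for {name}"
```

Unlike an end-to-end test, this exercises one component's output deterministically, so there is no network, no shared environment, and far less room for flakiness.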

    Here's some additional nice reading material for you: "What do mobile testing strategies look like at top tech companies?"