
Code Quality Q&A and Videos


How do I convince my lead to take better care of the pipeline?

Senior Software Engineer at Taro Community

I am working on a project where we have a pipeline that runs automated tests, lint, and other type checks.

But we are merging PRs even if the pipeline fails 😅

In my case, every time I get a pipeline error, I fix it in my PRs, and some of my coworkers are starting to do the same, but we are still merging some PRs with the pipeline failing.

Our manager is a software engineer too, and he is the one who merges the PRs.

I tried to convince him to avoid merging PRs when the pipeline is failing. He is open to discussing the topic, but since other teams also need to merge things, he doesn't want to block them because of the pipeline.

More context:

  • It is a startup and we want to get the job done faster.
  • I have been working there for 2 months.
  • The pipeline used to fail all the time because of another step that was recently removed. I think the team got used to ignoring the pipeline because of that.

I believe we are trading a small amount of time today for 10x that time in the future (saving 1 minute now costs us 10 minutes later).

I think if we continue like this, it will all blow up in our faces.

I am getting tired of fixing the pipeline almost every day, and of checking my team's PRs as well.

I am not sure if I should keep pushing, or stop worrying about whether the pipeline passes, let things blow up, and then try to convince the team to make a passing pipeline a strict requirement for merging a PR.

What would you do in my case?

Posted 4 months ago
33 Views
3 Comments

Need advice on writing clean code for ETL pipelines

Mid Level SWE at Taro Community

A lot of my work depends on writing pandas/ETL pipelines in Python.

Pain point #1: I find that my functions (each logic block is a function) tend to be hardcoded. How do I avoid this?

Example:

def agg(df):
    # hardcoded column names and grouping
    df['A'] = df.groupby('C')['B'].transform(function)
    # 10 more lines of specific business logic
    return df

def remove_duplicates(df):
    # gets the set of existing IDs from the database and removes those rows from this ETL pipeline
    existing_ids = get_existing_ids_from_db()
    df = df[~df['id'].isin(existing_ids)]
    return df

def get_existing_ids_from_db():
    # unpack the 1-tuples returned by the query so isin() matches raw IDs
    return {row_id for (row_id,) in db.query(Data.id).all()}

As you can see, the issue is that a lot of these functions tend to be hardcoded. Thus it's hard to:

  1. Isolate and verify
  2. Extend
  3. Reuse

But when I make it less hardcoded, it ends up being more redundant and thus hard to maintain, because if I change the logic in one place I need to change it elsewhere (e.g., "remove all duplicates except this super admin" applies to pipelines A, B, and C but not D).
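For example, the less-hardcoded version I keep going back and forth on looks roughly like this (just a sketch; the keep_predicate idea and the 'role' column are made up for illustration):

def remove_duplicates(df, get_existing_ids, keep_predicate=None):
    # generic version: the ID source and the "always keep these rows" rule are injected
    existing_ids = get_existing_ids()
    is_duplicate = df['id'].isin(existing_ids)
    if keep_predicate is not None:
        # rows matching the predicate are kept even if their ID already exists
        is_duplicate &= ~keep_predicate(df)
    return df[~is_duplicate]

# pipelines A, B, C keep super admins; pipeline D removes everything
df_abc = remove_duplicates(df, get_existing_ids_from_db, keep_predicate=lambda d: d['role'] == 'super_admin')
df_d = remove_duplicates(df, get_existing_ids_from_db)

But now every caller has to know about keep_predicate and pass the right arguments, which is exactly the redundancy I am worried about.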

This is a simplified example, but it illustrates a larger point: my gut instinct is to write a pipeline, then isolate the logic into functions, and then I realize that my logic is all hardcoded when I want to reuse it with slightly different columns/logic.

Pain point #2: when I pass DataFrames into a function, the code becomes hard to maintain because the function now expects a very specific DataFrame with a specific set of columns. How should I be thinking about designing functions here?
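To make this concrete, the kind of thing I end up writing looks like this (a rough sketch; the column names and the validation approach are just made up for illustration):

def add_amount_usd(df, amount_col='amount'):
    # fail fast if the caller's DataFrame is missing the column this step assumes
    if amount_col not in df.columns:
        raise ValueError(f"add_amount_usd() expects a '{amount_col}' column")
    df = df.copy()
    df['amount_usd'] = df[amount_col] * 1.1  # placeholder business logic
    return df

Passing the column names in as parameters helps a little, but the function still has an implicit contract with whatever produced the DataFrame upstream, and I am not sure where that contract should live.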

Additional ask: any suggestions on resources for writing good code for ETL/data engineering?

Posted 2 months ago
22 Views
1 Comment