1

Need advice on writing clean code for ETL pipelines

Profile picture
Mid Level SWE at Taro Community3 hours ago

a lot of my work depends on writing pandas/ETL pipelines in Python

pain point #1: I find that my functions (each logic block is a function) tends to be hardcoded. How to avoid this?

Example:

def agg(df):

    df['A'] = df['B'].groupby(['C']).apply(function)
    # 10 more lines of specific business logic

    return df

def remove_duplicates(df):

     # gets a list of existing IDs from database and removes them from this ETL pipeline

     existing_rows = get_existing_ids_from_db()

     df = df[~df['id'].isin(existing_rows)]

     return df
def get_existing_ids_from_db():
    return set(db.query(Data.id).all())

As you can see the issue is that a lot of these functions tend to be hardcoded. Thus its hard to

  1. Isolate and verify
  2. Extend
  3. reuse

But when I make it less hardcoded, it ends up being more redundant and thus hard to maintain because if i change the logic in 1 place I need to change elsewhere (eg, remove all duplicates except this super admin) for pipelines A,B,C but not D

This is a simplified example, but illustrates a larger point that my gut instinct is to write a pipeline, then isolate logic into functions, then I realize damn now my logic is all hardcoded when I want to reuse this with slightly different cols/logic

pain point #2: when I pass in data frames into a function it makes the code hard to maintain as now bc this function expects a very specific dataframe with a specific set of columns. How should I be thinking about designing functions here

Additional ask, any suggestions on resources for good code for ETL/data engineering?

7
0

No responses to this Q&A lesson yet 😭