A lot of my work depends on writing pandas/ETL pipelines in Python.
Pain point #1: I find that my functions (each logic block is a function) tend to be hardcoded. How do I avoid this?
Example:
def agg(df):
    # compute a per-group aggregate of B grouped by C ('function' is a stand-in for real logic)
    df['A'] = df.groupby('C')['B'].transform(function)
    # 10 more lines of specific business logic
    return df
def remove_duplicates(df):
    # gets the set of existing IDs from the database and drops those rows from this ETL pipeline
    existing_ids = get_existing_ids_from_db()
    df = df[~df['id'].isin(existing_ids)]
    return df

def get_existing_ids_from_db():
    # query(Data.id) returns one-element rows, so unpack them into a flat set of IDs
    return {row.id for row in db.query(Data.id).all()}
As you can see, the issue is that a lot of these functions tend to be hardcoded, so it's hard to reuse them.
But when I make them less hardcoded, the code ends up more redundant and thus harder to maintain, because if I change the logic in one place I have to change it everywhere else (e.g., "remove all duplicates except this super admin" applies to pipelines A, B, and C but not D).
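For example, the "less hardcoded" version of the dedup step I end up writing looks something like this (keep_ids and SUPER_ADMIN_ID are made-up names just to illustrate what I mean):

def remove_duplicates(df, id_col='id', keep_ids=frozenset()):
    # keep_ids: IDs exempt from deduplication (e.g., the super admin)
    existing_ids = get_existing_ids_from_db() - set(keep_ids)
    return df[~df[id_col].isin(existing_ids)]

# pipelines A, B, C keep the super admin; pipeline D does not
df_a = remove_duplicates(df_a, keep_ids={SUPER_ADMIN_ID})
df_d = remove_duplicates(df_d)

And then every new exception becomes another parameter, and the call sites multiply.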
This is a simplified example, but it illustrates a larger point: my gut instinct is to write a pipeline, then isolate the logic into functions, and then realize that, damn, now my logic is all hardcoded when I want to reuse it with slightly different columns/logic.
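Concretely, a generic version of agg() plus a per-pipeline config is where I keep landing (all names here are made up to show the shape, and SUPER_ADMIN_ID is a placeholder):

from functools import partial

def add_group_agg(df, group_col, value_col, out_col, func):
    # generic agg(): every column name and the aggregation itself are parameters
    df[out_col] = df.groupby(group_col)[value_col].transform(func)
    return df

# each pipeline becomes an ordered list of configured steps
PIPELINE_A = [
    partial(add_group_agg, group_col='C', value_col='B', out_col='A', func='mean'),
    partial(remove_duplicates, keep_ids={SUPER_ADMIN_ID}),
]

def run_pipeline(df, steps):
    for step in steps:
        df = step(df)
    return df

Which works, but the per-pipeline configs then start duplicating each other.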
Pain point #2: when I pass dataframes into a function, the code becomes hard to maintain, because the function now expects a very specific dataframe with a specific set of columns. How should I be thinking about designing functions here?
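For instance, nothing in the signature of agg(df) says it needs columns B and C. The closest I've gotten to making that contract explicit is checking it at the function boundary (a rough sketch I wrote, not a library feature):

import functools

def requires_columns(*cols):
    # decorator that documents and enforces the columns a step expects
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            missing = set(cols) - set(df.columns)
            if missing:
                raise ValueError(f"{func.__name__} is missing columns: {sorted(missing)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

@requires_columns('B', 'C')
def agg(df):
    df['A'] = df.groupby('C')['B'].transform('mean')
    return df

At least the failure is loud and the expectation is written down, but I'm not sure this is the right design.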
Additional ask: any suggestions on resources for learning to write good code for ETL/data engineering?