I work a lot in Python on ETL pipelines. A lot of my workflow goes like this: I prototype a pipeline in a Jupyter notebook. The pipeline looks like this:
    data = loaddata()
    a = aggr1(data)
    b = aggr2(a)
    c = aggr3(b)
    savetoDB(c)
Now in notebooks, thanks to the REPL-like interface, I can pause between aggr3 and the DB save, check the data looks good, then continue. Sometimes I notice there's a bug in aggr3, so I update that function and continue from where I was. Each aggr step is an expensive operation, so I want to minimize the number of steps I re-run.
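Concretely, with each step in its own cell, the notebook flow looks roughly like this (the print is just a stand-in for whatever inspection fits the data):

    # cell 1 -- the expensive steps, run once
    data = loaddata()
    a = aggr1(data)
    b = aggr2(a)

    # cell 2 -- if aggr3 is buggy, fix it and rerun only this cell;
    # b is still alive in the kernel
    c = aggr3(b)
    print(c)  # eyeball the result before committing it

    # cell 3 -- run only once c looks right
    savetoDB(c)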
Now here is the issue: eventually I need to turn this into a production script, so my Python workflow looks like this:
    def load():
        # pull data from the source
        ...

    def transform(data):
        a = aggr1(data)
        b = aggr2(a)
        c = aggr3(b)
        return c

    def save(data):
        updatedb(data)

    def main():
        data = load()
        data = transform(data)
        save(data)
Note: I am explicitly keeping functions to a minimum in notebooks, because if there's a bug in aggr3, I need to run the entire transform function again. If I define a variable b inside transform, then once that function executes I can no longer access b outside its scope (let me know if I need to clarify more here).
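To spell that out, a tiny repro (aggr* standing in for the expensive steps):

    def transform(data):
        a = aggr1(data)
        b = aggr2(a)   # local to transform
        return aggr3(b)

    c = transform(data)
    print(b)  # NameError: name 'b' is not defined
    # the only way back to b is to pay for aggr1 and aggr2 again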
Now I sanity check this before pushing to main. There are a couple of things I need to verify, but verifying them means running the whole script again, which takes a lot of time and compute. How do I avoid this? I want to minimize the time spent running code that doesn't need to run.
TL;DR: in Python specifically, I can prototype fast in notebooks thanks to the REPL-like interface, but debugging scripts is tough because you can't move back and forth within a script. I am not talking about just adding breakpoints, since breakpoints don't let you go back.

Basically: I am having trouble isolating specific parts of production pipelines, where I need to be careful about how much code I run.

Note: things are obviously simplified here. In reality, transform might have 15-20 different steps.
If you're not able to break up aggr1, aggr2, aggr3, etc. into separate entities that can be run in clearly separate phases, can you just add a ton of well-labeled logging?
The key to debugging efficiently is breaking the entire code-flow down into clear steps. Ideally you can step through carefully and "rewind" to a given phase; if you can't, and you have to rerun the entire pipeline every time to verify a fix, add a ton of logging to capture the whole process and see which assumptions are broken.
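If you can split the phases, one way to approximate that "rewind" is to checkpoint each phase's output to disk and only recompute what changed. A minimal sketch, assuming the aggr* functions from the question; pickle and the cache directory are just illustrative choices (for DataFrames you'd likely use parquet instead):

    import pickle
    from pathlib import Path

    CACHE = Path("pipeline_cache")  # hypothetical location
    CACHE.mkdir(exist_ok=True)

    def checkpoint(name, fn, *args):
        # run fn(*args) once and cache the result; later calls load from disk
        path = CACHE / f"{name}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(*args)
        path.write_bytes(pickle.dumps(result))
        return result

    def main():
        data = checkpoint("load", load)
        a = checkpoint("aggr1", aggr1, data)
        b = checkpoint("aggr2", aggr2, a)
        c = checkpoint("aggr3", aggr3, b)
        save(c)

Then verifying a fix in aggr3 only costs aggr3: delete aggr3.pkl and rerun main(), and aggr1 and aggr2 load from disk instead of recomputing.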
That way, when you do try a fix and pay the large cost of a validation run, the fix has a (hopefully) very high chance of being correct, because it was well-educated by the logging insights.
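For the logging route, a sketch using the stdlib logging module (the timing and shape fields are just examples of assumptions worth capturing; getattr keeps it from assuming pandas):

    import logging
    import time

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("pipeline")

    def logged(name, fn, *args):
        # wrap one pipeline step with a label and timing for later forensics
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        # DataFrames have .shape; fall back gracefully for other objects
        log.info("%s done in %.1fs, output shape=%s",
                 name, elapsed, getattr(result, "shape", "n/a"))
        return result

    def transform(data):
        a = logged("aggr1", aggr1, data)
        b = logged("aggr2", aggr2, a)
        return logged("aggr3", aggr3, b)

One pass over the full pipeline then leaves a labeled trail of where the time went and what each step produced, so the next fix is educated rather than a guess.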