I work a lot in Python on ETL pipelines. A lot of my workflow goes like this: I prototype a pipeline in a Jupyter notebook. The pipeline looks like this:
    data = loaddata()
    a = aggr1(data)
    b = aggr2(a)
    c = aggr3(b)
    savetoDB(c)
Now in notebooks, thanks to the REPL-like interface, I can pause between aggr3 and the DB save, check the data looks good, then continue. Sometimes I notice there's a bug in aggr3, so I update that function and continue from where I was. Each aggr step is an expensive operation, so I want to minimize the number of steps I re-run.
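Concretely, with each step in its own cell, the notebook flow looks roughly like this (the print is just a stand-in for whatever inspection fits the data):

    # cell 1 -- the expensive steps, run once
    data = loaddata()
    a = aggr1(data)
    b = aggr2(a)

    # cell 2 -- if aggr3 is buggy, fix it and rerun only this cell;
    # b is still alive in the kernel
    c = aggr3(b)
    print(c)  # eyeball the result before committing it

    # cell 3 -- run only once c looks right
    savetoDB(c)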
Now here is the issue: eventually I need to turn this into a production script, so my Python workflow looks like this:
    def load():
        # pull data from the source
        ...

    def transform(data):
        a = aggr1(data)
        b = aggr2(a)
        c = aggr3(b)
        return c

    def save(data):
        updatedb(data)

    def main():
        data = load()
        data = transform(data)
        save(data)
Note: I am explicitly keeping functions to a minimum in notebooks, because if there's a bug in aggr3, I need to run the entire transform function again. If I define a variable b inside transform, then once that function executes I can no longer access b outside its scope (let me know if I need to clarify more here).
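To spell that out, a tiny repro (aggr* standing in for the expensive steps):

    def transform(data):
        a = aggr1(data)
        b = aggr2(a)   # local to transform
        return aggr3(b)

    c = transform(data)
    print(b)  # NameError: name 'b' is not defined
    # the only way back to b is to pay for aggr1 and aggr2 again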
Now I sanity check this before pushing to main. There are a couple of things I need to verify, but verifying them means running the whole script again, which takes a lot of time and compute. How do I avoid this? I want to minimize the time spent running code that doesn't need to run.
TL;DR: in Python specifically, I can prototype fast in notebooks thanks to the REPL-like interface, but debugging scripts is tough because you can't move back and forth within a script. I am not talking about just adding breakpoints, since breakpoints don't let you go back.

Basically: I am having trouble isolating specific parts of production pipelines, where I need to be careful about how much code I run.

Note: things are obviously simplified here. In reality, transform might have 15-20 different steps.
If you're not able to break up aggr1, aggr2, aggr3, etc. into separate entities that can be run in clearly separate phases, can you just add a ton of well-labeled logging?
The key to debugging efficiently is breaking the entire code-flow down into clear steps. Ideally you can step through carefully and "rewind" to a given phase; if you can't, and you have to rerun the entire pipeline every time to verify a fix, add a ton of logging to capture the whole process and see which assumptions are broken.
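If you can split the phases, one way to approximate that "rewind" is to checkpoint each phase's output to disk and only recompute what changed. A minimal sketch, assuming the aggr* functions from the question; pickle and the cache directory are just illustrative choices (for DataFrames you'd likely use parquet instead):

    import pickle
    from pathlib import Path

    CACHE = Path("pipeline_cache")  # hypothetical location
    CACHE.mkdir(exist_ok=True)

    def checkpoint(name, fn, *args):
        # run fn(*args) once and cache the result; later calls load from disk
        path = CACHE / f"{name}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(*args)
        path.write_bytes(pickle.dumps(result))
        return result

    def main():
        data = checkpoint("load", load)
        a = checkpoint("aggr1", aggr1, data)
        b = checkpoint("aggr2", aggr2, a)
        c = checkpoint("aggr3", aggr3, b)
        save(c)

Then verifying a fix in aggr3 only costs aggr3: delete aggr3.pkl and rerun main(), and aggr1 and aggr2 load from disk instead of recomputing.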
That way, when you do try a fix and pay the large cost of a validation run, the fix has a (hopefully) very high chance of being correct, because it was well-educated by the logging insights.
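For the logging route, a sketch using the stdlib logging module (the timing and shape fields are just examples of assumptions worth capturing; getattr keeps it from assuming pandas):

    import logging
    import time

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("pipeline")

    def logged(name, fn, *args):
        # wrap one pipeline step with a label and timing for later forensics
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        # DataFrames have .shape; fall back gracefully for other objects
        log.info("%s done in %.1fs, output shape=%s",
                 name, elapsed, getattr(result, "shape", "n/a"))
        return result

    def transform(data):
        a = logged("aggr1", aggr1, data)
        b = logged("aggr2", aggr2, a)
        return logged("aggr3", aggr3, b)

One pass over the full pipeline then leaves a labeled trail of where the time went and what each step produced, so the next fix is educated rather than a guess.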