At my company (early stage, so there's no formal code standard, and this is R&D code, not production code yet), the ML code for preprocessing the data is a long list of function calls: ETL and pandas code to manipulate, clean, and process the raw data. Then there's obviously a ton of other function calls for the ML itself: splitting the data, removing missing values, running it through ML models, and calling test methods.
This makes it hard to write clean code bc
Things I'm doing
There is very little code reuse to make stuff modular
Code doesn't have to be reused to be modular. My definition of modular is any code that's split up into reasonably sized, focused components. It doesn't even have to be OOP if your language doesn't support it. Just split things into separate files with good names.
In most SWE cases there are increasing layers of abstraction. But here there is no obvious way to abstract stuff away, since most of it is a long list of sequential processing code.
Increasing layers of abstraction isn't a good thing IMHO. It just feels like a good thing because it's generally the sign of a growing, successful company (otherwise, why would your codebase be so big?). But at the end of the day, nobody likes jumping through 8 layers of function calls to understand how something works. Don't feel pressured to add more layers of abstraction to your code. In fact, fight for the opposite.
All that being said, you can always add a layer of abstraction, even if the code is sequential. Here's a basic example:
// Kotlin-style sketch: each step in the sequence lives in its own named function.
fun makeHamburger(bun: Bun) {
    val patty = getGrilledPatty()
    addPatty(bun, patty)
    addCheese(bun)
    addTomato(bun)
    addLettuce(bun)
}
Making a hamburger is a sequential process, and I introduced 1 layer of abstraction by moving each individual step into its own method.
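The same idea carries over to a sequential ML script. Here's a minimal Python sketch of what that could look like. The step names (drop_missing, split_rows, fit_mean_model, mean_abs_error) are hypothetical, plain lists of dicts stand in for your pandas DataFrames to keep it dependency-free, and the "model" is a toy that just predicts the training mean:

```python
import random


def drop_missing(rows):
    """Keep only rows where no field is None."""
    return [r for r in rows if all(v is not None for v in r.values())]


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean_model(train, target="y"):
    """Toy stand-in for a real model: always predicts the training mean."""
    mean = sum(r[target] for r in train) / len(train)
    return lambda row: mean


def mean_abs_error(model, test, target="y"):
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(r) - r[target]) for r in test) / len(test)


def run_experiment(raw_rows):
    """The whole sequence, readable top to bottom like makeHamburger."""
    rows = drop_missing(raw_rows)
    train, test = split_rows(rows)
    model = fit_mean_model(train)
    return mean_abs_error(model, test)
```

run_experiment is still a straight sequence, but each step now has a name you can read, test, and swap out on its own.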
Anyways, as I mention in my "Code Quality Isn't Static" lesson for my code quality course, the level of thoughtfulness you put into your code depends on the stakes. For startups, the stakes are relatively low, as you're probably the only engineer working on this and the number of end users is in the thousands, maybe even just the hundreds. As long as the core functionality works and the code is relatively clean via basic clean code tactics (like the ones you're applying, nice job!), that's good enough.
A lot of code quality is playing things by ear, especially in startups.
One of the important things for ML workflows in particular is how you structure the entire pipeline. That isn't to say it won't still be a mess at some level, but it'll be better organized than most. I usually just go with my gut on what the structure should look like, check with the team to make sure it's ok, and then just put up the PR. It usually doesn't take too long if handled correctly.
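One way to make the pipeline's structure explicit, as a rough sketch rather than a prescription: list the stages in order and run them with a small loop. Everything here is hypothetical (the stage names, the run_pipeline helper, the shared state dict), and the hard-coded records are only placeholders for real data:

```python
def load(state):
    """Hypothetical stage: put placeholder raw records into the shared state."""
    state["rows"] = [
        {"x": 1.0, "y": 2.0},
        {"x": None, "y": 3.0},
        {"x": 3.0, "y": 6.0},
    ]
    return state


def clean(state):
    """Hypothetical stage: drop records with missing values."""
    state["rows"] = [r for r in state["rows"] if None not in r.values()]
    return state


def featurize(state):
    """Hypothetical stage: derive a new feature from existing fields."""
    for r in state["rows"]:
        r["ratio"] = r["y"] / r["x"]
    return state


# The pipeline's structure is this list: order and names are visible at a glance.
STAGES = [load, clean, featurize]


def run_pipeline(stages, state=None):
    """Run each stage in order, recording the names of the stages that ran."""
    state = {} if state is None else state
    state["completed"] = []
    for stage in stages:
        state = stage(state)
        state["completed"].append(stage.__name__)
    return state
```

Reordering or swapping a stage then means editing one list, which keeps the PR small and easy for the team to sanity-check.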