How can an ML engineer learn more about Infra?

Question

I hear a lot about engineers wanting to get into ML, but I kind of have the opposite problem :)

As an ML engineer, I've always been more curious about the infrastructure side of things. How does the infra to train/serve these large models work? What should I know about GPU programming? Is there anything I should know about how the underlying data tables are stored? How does distributed computing work?

I tried reading through the book "Designing Data-Intensive Applications" (DDIA), but it almost seemed too dense? Is that just me, or is that book aimed at a different audience?

In summary, I have two questions:

(1) What infrastructure-related stuff would the more senior ML engineers here recommend learning about?

(2) How would one go about actually learning those things? Is DDIA still the best resource or would you recommend something else?

Elliot Kang · Accepted Answer

I'm also curious about this question. Currently, I'm writing some Terraform/Chef to spin up resources to run distributed training jobs in the cloud.

We don't all have a 10000 GPU megacluster to run pretraining on 😿
