Lead Cluster Operations Support Engineer

Thoughtworks

A leading technology consultancy with 30+ years of experience delivering extraordinary impact through technology solutions.

Chicago, IL, USA

$125,330 - $208,880

DevOps

Staff Software Engineer

Hybrid

5,000+ Employees

8+ years of experience

AI · Enterprise SaaS

Description For Lead Cluster Operations Support Engineer

Thoughtworks is seeking a Lead Cluster Operations Support Engineer to join its team in a critical role managing large-scale GPU infrastructure. This position combines deep technical expertise in cloud infrastructure and Kubernetes with a focus on supporting machine learning operations. The role involves providing 24x7 white-glove support for clients using massive GPU clusters (6,000+ GPUs) for Managed Post Training operations.

The ideal candidate will be responsible for ensuring optimal utilization of GPU clusters and coordinating with global teams across four time zones. This position requires both technical excellence in infrastructure management and strong client-facing skills, as you'll be part of a high-value service delivery team.

Key technical aspects include working with Kubernetes at scale, managing GPU clusters, implementing infrastructure as code using tools like Terraform/Pulumi, and working with ML frameworks like NVIDIA NeMo. The role also involves contributing to automation and tooling improvements to enhance operational efficiency.

What makes this role unique is its combination of technical depth and service delivery focus. You'll be working at the intersection of high-performance computing and machine learning operations, while also having the opportunity to shape and improve service delivery processes. The position offers significant growth potential through Thoughtworks' learning and development programs.

Working in a hybrid model, you'll collaborate with diverse teams of Machine Learning Engineers and Infrastructure Engineers, while having the opportunity to influence technical direction and mentor others. This role is perfect for someone who combines strong technical capabilities with excellent communication skills and a desire to work in a client-facing environment.

The position offers competitive compensation ($125,330 - $208,880) and comprehensive benefits, reflecting the senior nature of the role and its importance to Thoughtworks' service delivery capabilities. Join a dynamic organization that values technical excellence, continuous learning, and making a positive impact through technology.

Last updated 6 hours ago

Responsibilities For Lead Cluster Operations Support Engineer

Shape and iterate on the white-glove model training support service on large GPU clusters
Work collaboratively with Machine Learning Engineers and Infrastructure Engineers
Contribute to accelerator development and automation
Assess model training readiness and data preparation
Provide model training support during rotating daytime shifts on weekends
Facilitate collaborative problem-solving within the team
Proactively identify and address challenges related to the white-glove service

Requirements For Lead Cluster Operations Support Engineer

Kubernetes

Python

Linux

Deep expertise in Kubernetes administration and debugging at scale
Extensive experience managing large clusters with thousands of nodes
Knowledge of running training workloads on thousands of GPUs
Experience with NVIDIA NeMo Framework
Proficiency with cloud platforms (GCP, AWS, Azure)
Experience with Terraform/Pulumi, Helm charts, and Linux
Strong stakeholder management skills
Ability to work in ambiguous situations
Coaching and mentoring capabilities

Benefits For Lead Cluster Operations Support Engineer

Medical Insurance

Dental Insurance

Vision Insurance

Learning & Development programs
Equal opportunity employer
Comprehensive benefits package
