Lead Cluster Operations Support Engineer

A leading technology consultancy with 30+ years of experience delivering technology solutions to complex business problems.
$125,330 - $208,880
DevOps
Staff Software Engineer
Hybrid
8+ years of experience
AI · Enterprise SaaS

Description For Lead Cluster Operations Support Engineer

Thoughtworks is seeking a Lead Cluster Operations Support Engineer to spearhead its GPU cluster operations for machine learning workloads. This role combines deep technical expertise in cloud infrastructure and Kubernetes with client-facing responsibilities, making it a unique opportunity for experienced DevOps professionals.

The position involves managing massive GPU clusters (6,000+ GPUs) and providing white-glove support for machine learning model training operations. You'll be working with cutting-edge technologies in the AI infrastructure space, including the NVIDIA NeMo Framework, various cloud platforms, and advanced orchestration tools.

The ideal candidate will bring extensive experience in large-scale cluster management, strong problem-solving abilities, and excellent communication skills. You'll be coordinating with teams across four time zones (US, Europe, India, and Australia), requiring both technical prowess and strategic thinking.

Key aspects of the role include developing automation solutions, optimizing infrastructure for ML workloads, and mentoring team members. The position offers significant growth opportunities through Thoughtworks' comprehensive learning and development programs.

This is a hybrid role that combines remote work with occasional travel to client locations. The compensation package is competitive, ranging from $125,330 to $208,880 USD, reflecting the senior nature of the position and its critical importance to the organization's AI infrastructure services.

Working at Thoughtworks means joining a dynamic, inclusive community with a 30+ year track record of delivering extraordinary impact. The company's commitment to continuous learning, technical excellence, and purposeful work makes it an ideal environment for professionals looking to make a significant impact in the technology consulting space.

Responsibilities For Lead Cluster Operations Support Engineer

  • Shape and iterate on a white-glove model training support service on large GPU clusters
  • Collaborate with Machine Learning Engineers and Infrastructure Engineers
  • Contribute to accelerator development and automation
  • Assess model training readiness and data preparation
  • Provide model training support during rotating daytime weekend shifts
  • Facilitate collaborative problem-solving and mentor other engineers
  • Manage and optimize large-scale GPU clusters (6,000+ GPUs)
  • Coordinate support across four time zones (US, Europe, India, and Australia)

Requirements For Lead Cluster Operations Support Engineer

Kubernetes
Python
Linux
  • Deep expertise in Kubernetes administration and debugging at scale
  • Extensive experience managing large clusters with thousands of nodes
  • Knowledge of running training workloads on thousands of GPUs
  • Proficiency with cloud platforms (GCP, AWS, Azure)
  • Experience with Terraform/Pulumi, Helm Charts, and Infrastructure-as-Code tools
  • Strong stakeholder management and client-facing skills
  • Ability to work in ambiguous situations and adapt to challenges
  • Experience with NVIDIA NeMo Framework and NIMs

Benefits For Lead Cluster Operations Support Engineer

  • Learning and Development Programs
  • Career Development Support
  • Equal Opportunity Employment
