Thoughtworks is seeking a Lead Cluster Operations Support Engineer to spearhead its GPU cluster operations for machine learning workloads. This role combines deep technical expertise in cloud infrastructure and Kubernetes with client-facing responsibilities, making it a unique opportunity for experienced DevOps professionals.
The position involves managing massive GPU clusters (6,000+ GPUs) and providing white-glove support for machine learning model training operations. You'll be working with cutting-edge technologies in the AI infrastructure space, including the NVIDIA NeMo Framework, various cloud platforms, and advanced orchestration tools.
The ideal candidate will bring extensive experience in large-scale cluster management, strong problem-solving abilities, and excellent communication skills. You'll be coordinating with teams across four regions (US, Europe, India, and Australia), work that requires both technical prowess and strategic thinking.
Key aspects of the role include developing automation solutions, optimizing infrastructure for ML workloads, and mentoring team members. The position offers significant growth opportunities through Thoughtworks' comprehensive learning and development programs.
This is a hybrid role that combines remote work with occasional travel to client locations. The compensation package is competitive, ranging from $125,330 to $208,880 USD, reflecting the senior nature of the position and its critical importance to the organization's AI infrastructure services.
Working at Thoughtworks means joining a dynamic, inclusive community with a 30+ year track record of delivering extraordinary impact. The company's commitment to continuous learning, technical excellence, and purposeful work makes it an ideal environment for professionals looking to make their mark in the technology consulting space.