
DevOps Engineer - Supercomputing

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.
$180,000 - $370,000
DevOps
Senior Software Engineer
Hybrid
5+ years of experience
This job posting may no longer be active.

Description For DevOps Engineer - Supercomputing

xAI is seeking a DevOps Engineer specializing in Supercomputing to join its team in the Bay Area. The role involves operating some of the world's largest GPU supercomputing clusters, both for AI training and for serving production models. The ideal candidate will have experience with Kubernetes, Pulumi, Rust, Go, and Flux/ArgoCD.

The company operates with a flat organizational structure, encouraging all employees to be hands-on and contribute directly to the mission. Strong communication skills are essential, as is the ability to work across multiple areas of the company.

Key responsibilities include implementing Infrastructure as Code (IaC) best practices, enhancing deployment pipelines, ensuring robust and secure service delivery, working with both on-premises clusters and cloud providers, and helping with security best practices for internal researchers and live external traffic.

Ideal experience includes writing scalable and highly available containerized applications in Rust, and managing compute fleets with tools like Pulumi, Terraform, or Ansible.

The interview process consists of an initial interview followed by four technical interviews: a coding assessment, systems design, hands-on problem-solving, and a project deep-dive presentation.

xAI offers a competitive salary range of $180,000 - $370,000 USD annually. The company values engineering excellence, curiosity, and a strong work ethic. This is an excellent opportunity for a skilled DevOps engineer looking to work on cutting-edge AI systems and contribute to understanding the universe.

Last updated a year ago

Responsibilities For DevOps Engineer - Supercomputing

  • Operate some of the world's largest GPU supercomputing clusters for both AI training and serving production models
  • Implement IaC best practices, enhance deployment pipelines, and ensure robust, secure service delivery across our production environments
  • Work with both on-premises clusters and cloud providers
  • Help with security best practices for internal researchers and live external traffic

Requirements For DevOps Engineer - Supercomputing

Kubernetes, Go, Rust
  • Writing scalable and highly available containerized applications in Rust
  • Managing compute fleets with Pulumi, Terraform, Ansible, or other stateful automation libraries
