Site Reliability Engineer, AI Infrastructure

Tesla is an automotive and technology company leading in electric vehicles and AI development.
$133,440 - $355,920
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Automotive · Robotics

Description For Site Reliability Engineer, AI Infrastructure

Tesla's Supercomputing/AI infrastructure team develops and maintains critical infrastructure for machine learning operations, supporting projects such as Full Self-Driving (FSD), Tesla Bot, and the Dojo supercomputer. As a Site Reliability Engineer, you'll be instrumental in maintaining and enhancing the platform that powers Tesla's AI initiatives. The role combines high-performance computing expertise with infrastructure automation, focusing on GPU and Dojo platforms.

The position offers an exciting opportunity to work with cutting-edge technology in autonomous driving and robotics. You'll be responsible for managing AI infrastructure, optimizing performance, and ensuring the reliability of systems that enable neural network training at scale. The role requires strong technical skills in Python, Golang, and Linux systems, along with experience in modern DevOps practices and tools.

Tesla offers a comprehensive benefits package including competitive salary, equity opportunities, and extensive health coverage. The company's mission to accelerate the world's transition to sustainable energy makes this an impactful role where your work will directly contribute to advancing autonomous driving technology and robotics development.

Working at Tesla means joining a team of innovative professionals pushing the boundaries of technology in the automotive and AI fields. The role provides opportunities for growth and learning while working with some of the most advanced computing systems in the industry. If you're passionate about infrastructure automation and system reliability, and want to help revolutionize transportation and robotics, this position offers an opportunity to make a significant impact.

Responsibilities For Site Reliability Engineer, AI Infrastructure

  • Support the AI/ML cluster infrastructure on both GPU and Dojo platforms
  • Improve monitoring & self-healing pipelines and security posture
  • Optimize server, storage and network performance
  • Develop new tools in Python, Golang or Bash/Shell
  • Use Infrastructure as Code best practices
  • Participate in a 24x7 on-call rotation

Requirements For Site Reliability Engineer, AI Infrastructure

Python · Go · Linux · Kubernetes
  • Proficiency in Python, Golang and/or Bash
  • Proficiency with Linux fundamentals and performance optimizations
  • Experience with configuration management software (Ansible, etc.)
  • Experience with container orchestration technologies such as Kubernetes
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills
  • 3+ years of additional equivalent experience or evidence of exceptional ability

Benefits For Site Reliability Engineer, AI Infrastructure

Medical Insurance · Dental Insurance · Vision Insurance · 401k · Mental Health Assistance · Parental Leave · Commuter Benefits
  • Aetna PPO and HSA plans with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental and vision plans with $0 paycheck contribution
  • Company Paid HSA Contribution
  • Healthcare and Dependent Care Flexible Spending Accounts
  • 401(k) with employer match
  • Employee Stock Purchase Plans
  • Company-paid basic life, AD&D, short-term and long-term disability insurance
  • Employee Assistance Program
  • Sick and Vacation time
  • Back-up childcare and parenting support resources
  • Commuter benefits
  • Employee discounts and perks program
