Site Reliability Engineer, Machine Learning Operations, Infrastructure

Tesla

Electric vehicle and clean energy company pioneering sustainable transportation and energy solutions.

Austin, TX, USA

Site Reliability

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Automotive

Description For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Tesla is seeking a Senior Site Reliability Engineer to join its Machine Learning Operations and Infrastructure team. This role combines DevOps, MLOps, and cloud infrastructure expertise to support Tesla's engineering initiatives across AWS, Azure, and GCP. The position focuses on maintaining and improving the ML platform, ensuring robust deployment processes, and optimizing infrastructure for AI workloads.

As an SRE, you'll be responsible for developing and automating deployment workflows, implementing monitoring systems, and creating self-healing processes. The role requires expertise in Kubernetes, machine learning operations, and modern DevOps practices. You'll work with cutting-edge technologies and frameworks while collaborating with cross-functional teams of data scientists and engineers.

The ideal candidate brings strong technical expertise in Python, Golang, and React, combined with deep knowledge of ML infrastructure and cloud platforms. You'll be instrumental in building and maintaining scalable solutions for ML model training, deployment, and monitoring. The position offers exposure to innovative projects in the automotive and AI domains, alongside Tesla's comprehensive benefits package.

This role presents an exciting opportunity to work at the intersection of site reliability engineering and machine learning, contributing to Tesla's mission of accelerating the world's transition to sustainable energy. You'll be part of a team that values technical excellence, innovation, and collaborative problem-solving, while enjoying competitive compensation and extensive benefits.

Last updated 2 months ago

Responsibilities For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Mature the Machine Learning Operations platform and implement scalable workflows for the ML lifecycle
Maintain Kubernetes-based infrastructure for model training, deployment, and monitoring
Develop solutions for workload orchestration using Flyte and Ray
Implement and optimize CI/CD pipelines for machine learning applications
Set up model monitoring systems
Collaborate with engineers on training and inference workflows
Develop Infrastructure-as-Code solutions
Design self-service portals using React
Participate in 24x7 on-call rotation

Requirements For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Python · React · Kubernetes · Linux

Strong hands-on experience with Kubernetes, Kubeflow, MLflow, Flyte, Ray
Proven experience with React for building interactive web applications
Expertise in NVIDIA Multi-Instance GPU (MIG), GPU time-slicing, and scaling AI workloads
Proficiency in Python, Golang, and Bash
Experience with Model Deployment and Serving tools
Proficiency with Linux fundamentals
Experience with configuration management software
Strong analytical and problem-solving abilities
Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or equivalent experience

Benefits For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Medical Insurance · Dental Insurance · Vision Insurance · 401k · Mental Health Assistance · Parental Leave · Commuter Benefits

Aetna PPO and HSA plans with $0 payroll deduction
Family-building, fertility, adoption and surrogacy benefits
Dental and vision plans with $0 paycheck contribution
Company-paid HSA contribution
Healthcare and Dependent Care Flexible Spending Accounts
401(k) with employer match
Employee Stock Purchase Plans
Company-paid basic life, AD&D, short-term and long-term disability insurance
Employee Assistance Program
Sick and Vacation time
Back-up childcare and parenting support resources
Weight Loss and Tobacco Cessation Programs
Tesla Babies program
Commuter benefits
Employee discounts and perks program

