Site Reliability Engineer, Machine Learning Operations, Infrastructure

Electric vehicle and clean energy company pioneering sustainable transportation and energy solutions.
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Automotive

Description For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Tesla is seeking a Senior Site Reliability Engineer to join their Machine Learning Operations and Infrastructure team. This role combines DevOps, MLOps, and cloud infrastructure expertise to support Tesla's engineering initiatives across AWS, Azure, and GCP platforms. The position focuses on maintaining and improving the ML platform, ensuring robust deployment processes, and optimizing infrastructure for AI workloads.

As an SRE, you'll be responsible for developing and automating deployment workflows, implementing monitoring systems, and creating self-healing processes. The role requires expertise in Kubernetes, machine learning operations, and modern DevOps practices. You'll work with cutting-edge technologies and frameworks while collaborating with cross-functional teams of data scientists and engineers.

The ideal candidate brings strong technical expertise in Python, Golang, and React, combined with deep knowledge of ML infrastructure and cloud platforms. You'll be instrumental in building and maintaining scalable solutions for ML model training, deployment, and monitoring. The position offers exposure to innovative projects in the automotive and AI domains, alongside Tesla's comprehensive benefits package.

This role presents an exciting opportunity to work at the intersection of site reliability engineering and machine learning, contributing to Tesla's mission of accelerating the world's transition to sustainable energy. You'll be part of a team that values technical excellence, innovation, and collaborative problem-solving, while enjoying competitive compensation and extensive benefits.

Last updated 2 months ago

Responsibilities For Site Reliability Engineer, Machine Learning Operations, Infrastructure

  • Mature the Machine Learning Operations platform and implement scalable workflows for the ML lifecycle
  • Maintain Kubernetes-based infrastructure for model training, deployment, and monitoring
  • Develop solutions for workload orchestration using Flyte and Ray
  • Implement and optimize CI/CD pipelines for machine learning applications
  • Set up model monitoring systems
  • Collaborate with engineers on training and inference workflows
  • Develop Infrastructure-as-Code solutions
  • Design self-service portals using React
  • Participate in 24x7 on-call rotation

Requirements For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Python · React · Kubernetes · Linux
  • Strong hands-on experience with Kubernetes, Kubeflow, MLflow, Flyte, Ray
  • Proven experience with React for building interactive web applications
  • Expertise in GPU sharing (NVIDIA Multi-Instance GPU (MIG), time-slicing) and scaling AI workloads
  • Proficiency in Python, Golang, and Bash
  • Experience with model deployment and serving tools
  • Proficiency with Linux fundamentals
  • Experience with configuration management software
  • Strong analytical and problem-solving abilities
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or equivalent experience

Benefits For Site Reliability Engineer, Machine Learning Operations, Infrastructure

Medical Insurance · Dental Insurance · Vision Insurance · 401k · Mental Health Assistance · Parental Leave · Commuter Benefits
  • Aetna PPO and HSA plans with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental and vision plans with $0 paycheck contribution
  • Company-paid HSA contribution
  • Healthcare and Dependent Care Flexible Spending Accounts
  • 401(k) with employer match
  • Employee Stock Purchase Plans
  • Company-paid basic life, AD&D, short-term and long-term disability insurance
  • Employee Assistance Program
  • Sick and vacation time
  • Back-up childcare and parenting support resources
  • Weight Loss and Tobacco Cessation Programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program
