Site Reliability Engineer (SRE)

xAI is an AI company working on large-scale, highly reliable distributed systems and AI infrastructure.
Site Reliability
Senior Software Engineer
Hybrid

Description For Site Reliability Engineer (SRE)

xAI is seeking an experienced Site Reliability Engineer (SRE) to join its London team. This role focuses on improving observability, building dashboards and alerts, managing on-call rotations, and enhancing deployment processes. The ideal candidate is an expert in a language such as Rust, C++, or Go, and has deep knowledge of monitoring technologies, deployment tools, and Kubernetes. The position offers a dynamic startup environment and work on large-scale distributed systems, including the Grok production stack. Benefits include competitive compensation, equity, and health insurance. The role requires working from the London office, with occasional late meetings and business trips to California. Join xAI to tackle complex technical challenges and contribute to cutting-edge AI infrastructure.

Last updated 6 months ago

Responsibilities For Site Reliability Engineer (SRE)

  • Improving observability by adding/adjusting metrics
  • Building easily parsable dashboards
  • Building reliable alerts
  • Designing and overseeing on-call rotations
  • Improving the deployment process to increase reliability

Requirements For Site Reliability Engineer (SRE)

  • Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go (Rust or C++ preferred)
  • Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty
  • Expert knowledge of deployment technologies such as Pulumi or Terraform
  • Expert knowledge of Kubernetes

Benefits For Site Reliability Engineer (SRE)

  • Competitive cash-based compensation
  • xAI equity
  • Private health and dental insurance
  • Unlimited time off subject to prior approval
