Site Reliability Engineer

Together AI is a research-driven artificial intelligence company focused on open and transparent AI systems. It aims to lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models.
$160,000 - $230,000
Site Reliability
Senior Software Engineer
Hybrid
11 - 50 Employees
7+ years of experience

Description For Site Reliability Engineer

As a Site Reliability Engineer (SRE) at Together AI, you will be responsible for maintaining all user-facing services and production systems. This role combines pragmatic operations with software engineering, applying sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.

You will specialize in systems (operating systems, storage subsystems, networking) while implementing best practices for availability, reliability, and scalability. Your varied interests in algorithms and distributed systems will be valuable in this role.

Key responsibilities include:

  • Participating in an on-call (PagerDuty) rotation to respond to incidents impacting availability
  • Building and running infrastructure using Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users
  • Developing monitoring systems to ensure the highest quality service for customers
  • Designing and implementing operational processes such as deployments and upgrades
  • Debugging production issues across all services and stack levels
  • Identifying improvements for product architecture from reliability, performance, and availability perspectives
  • Planning the growth of Together AI's infrastructure

Together AI is at the forefront of AI research and development, contributing to open-source research, models, and datasets. The company has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. This role offers an opportunity to join a passionate team of researchers and engineers in building the next generation of AI infrastructure.

The position offers competitive compensation, including a base salary range of $160,000 - $230,000, startup equity, health insurance, and other benefits. Together AI is an Equal Opportunity Employer, providing equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

If you're passionate about AI infrastructure and have the skills to keep complex systems running smoothly at scale, this role at Together AI could be an excellent opportunity to make a significant impact in the field of artificial intelligence.

Last updated 8 months ago

Responsibilities For Site Reliability Engineer

  • Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability
  • Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
  • Build monitoring systems to ensure the highest quality service for customers
  • Design and implement operational processes (such as deployments and upgrades)
  • Debug production issues across all services and levels of the stack
  • Identify improvements to the product architecture from reliability, performance, and availability perspectives
  • Plan the growth of Together AI's infrastructure

Requirements For Site Reliability Engineer

Kubernetes
Linux
  • 7+ years of professional SRE or related experience
  • Bachelor's degree in Computer Science or a related field, or equivalent work experience
  • Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
  • Proficiency in programming/scripting languages
  • Direct experience in monitoring and observability practices
  • Advanced knowledge of cloud services
  • Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Benefits For Site Reliability Engineer

  • Startup equity
  • Health insurance

Jobs Related To Together AI Site Reliability Engineer

Software Engineer (Site Reliability), Retail Engineering

Senior Site Reliability Engineer position at Apple, focusing on maintaining and improving retail systems integration with major US carriers for iPhone activations.

Senior Site Reliability Engineer

Senior Site Reliability Engineer role at Halo Studios focusing on cloud infrastructure, automation, and system reliability.

Site Reliability/DevOps Engineer III

Senior Site Reliability Engineer role at LivePerson, a leading enterprise conversation platform, managing cloud infrastructure and ensuring 24/7 system reliability.

Senior Site Reliability Engineer

Senior Site Reliability Engineer role at Apple, focusing on managing and scaling media platform services for App Store, Apple TV, Apple Music, and more.

Site Reliability Engineer

Senior Site Reliability Engineer position at Alarm.com (EBS) in Krakow, focusing on infrastructure, incident response, and system optimization for IoT security platform.