Site Reliability Engineer

Together AI

Together AI is a research-driven artificial intelligence company focused on open and transparent AI systems, aiming to lower the cost of modern AI systems through co-designing software, hardware, algorithms, and models.

San Francisco Bay Area, CA, USA

$160,000 - $230,000

Site Reliability

Senior Software Engineer

Hybrid

11 - 50 Employees

7+ years of experience

Description For Site Reliability Engineer

As a Site Reliability Engineer (SRE) at Together AI, you will be responsible for maintaining all user-facing services and production systems. This role combines pragmatic operations with software engineering, applying sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.

You will specialize in systems (operating systems, storage subsystems, networking) while implementing best practices for availability, reliability, and scalability. Your varied interests in algorithms and distributed systems will be valuable in this role.

Key responsibilities include:

Participating in an on-call (PagerDuty) rotation to respond to incidents impacting availability
Building and running infrastructure using Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users
Developing monitoring systems to ensure the highest quality service for customers
Designing and implementing operational processes such as deployments and upgrades
Debugging production issues across all services and stack levels
Identifying improvements for product architecture from reliability, performance, and availability perspectives
Planning the growth of Together AI's infrastructure

Together AI is at the forefront of AI research and development, contributing to open-source research, models, and datasets. The company has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. This role offers an opportunity to join a passionate team of researchers and engineers in building the next generation of AI infrastructure.

The position offers competitive compensation, including a base salary range of $160,000 - $230,000, startup equity, health insurance, and other benefits. Together AI is an Equal Opportunity Employer, providing equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

If you're passionate about AI infrastructure and have the skills to keep complex systems running smoothly at scale, this role at Together AI could be an excellent opportunity to make a significant impact in the field of artificial intelligence.

Last updated 8 months ago

Responsibilities For Site Reliability Engineer

Participate in an on-call (PagerDuty) rotation to respond to incidents that impact availability
Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
Build monitoring systems to ensure the highest quality service for customers
Design and implement operational processes (such as deployments and upgrades)
Debug production issues across all services and levels of the stack
Identify improvements to the product architecture from reliability, performance, and availability perspectives
Plan the growth of Together AI's infrastructure

Requirements For Site Reliability Engineer

Kubernetes

Linux

7+ years of professional SRE or related experience
Bachelor's degree in Computer Science or a related field, or equivalent work experience
Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
Proficiency in programming/scripting languages
Direct experience in monitoring and observability practices
Advanced knowledge of cloud services
Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Benefits For Site Reliability Engineer

Medical Insurance

Startup equity
Health insurance
