Baseten is seeking a Site Reliability Engineer to build and maintain robust systems that keep its infrastructure scalable, reliable, and efficient. As an SRE, you'll automate deployments, monitor systems, optimize performance, and manage incidents. You'll collaborate closely with users, learning from the challenges they face in operationalizing ML and using those lessons to improve the Baseten platform.
Key responsibilities include:
- Building and maintaining scalable infrastructure
- Working extensively with Kubernetes
- Automating CI/CD pipelines
- Establishing standards for reliability and performance
- Learning about ML infrastructure (prior experience not required)
The ideal candidate should:
- Own products and projects end-to-end
- Navigate ambiguity comfortably
- Focus on customer problems and create simple, elegant solutions
- Exercise good judgment on technical tradeoffs
- Demonstrate pride, ownership, and accountability
Tech stack:
- Backend: Go, Python, Postgres
- Platform: Kubernetes, Go, Postgres, Redis, Kafka
- Infrastructure: GitOps, Flux, Terraform, AWS/GCP
Baseten offers:
- Competitive compensation package (unlimited PTO, 401(k), covered healthcare premiums)
- Opportunity to grow in a rapidly expanding startup
- Inclusive and supportive work culture
- Exposure to various ML startups
Baseten is committed to fostering a diverse and inclusive workplace, providing equal employment opportunities to all employees and applicants.