Site Reliability Engineer

Enterprise-grade, AI-focused GPU-as-a-service provider with a decentralized cloud computing infrastructure.
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
Site Reliability
Staff Software Engineer
Hybrid
AI · Enterprise SaaS

Description For Site Reliability Engineer

Aethir is the only enterprise-grade, AI-focused GPU-as-a-service provider in the market. Its decentralized cloud computing infrastructure connects GPU providers (containers) with enterprise clients who need powerful GPUs for professional AI/ML workloads. With a network of more than 40,000 high-end GPUs, including 3,000 NVIDIA H100s, Aethir delivers enterprise-grade GPU computing at scale.

We are seeking a Site Reliability Engineer (SRE) for our new headquarters in Kuala Lumpur, Malaysia. This role is crucial to monitoring, troubleshooting, and optimizing our production system so that it stays fast and stable for our AI and gaming customers worldwide.

Key responsibilities include:

  • Monitoring, reviewing, and responding to system faults
  • Continuously reviewing system architecture and performance
  • Coordinating with the business team to resolve operations issues
  • Promptly responding to and resolving production failures
  • Organizing teams for collaborative problem-solving
  • Conducting case studies and implementing optimizations
  • Maintaining comprehensive system documentation
  • Identifying and implementing process improvements

Requirements:

  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance development
  • Strong understanding of system architecture and troubleshooting
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes, CI/CD, and Docker
  • Expertise in AWS or Python
  • Experience in building operations infrastructure platforms

We offer benefits such as a hypergrowth startup environment, fantastic career progression opportunities, and a collaborative, innovative work environment. Join us in shaping the future of decentralized computing!

Last updated 3 months ago

Responsibilities For Site Reliability Engineer

  • Monitor, review, and respond to system faults
  • Review system architecture and performance
  • Coordinate with business team on operations issues
  • Respond to and resolve production failures
  • Organize teams for collaborative problem-solving
  • Conduct case studies and implement optimizations
  • Maintain system documentation
  • Identify and implement process improvements

Requirements For Site Reliability Engineer

Kubernetes · Python
  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance development
  • Strong understanding of system architecture and troubleshooting
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes, CI/CD, and Docker
  • Expertise in AWS or Python
  • Ability to work in a fast-paced startup environment

Benefits For Site Reliability Engineer

  • Hypergrowth Startup Environment
  • Fantastic Career Progression Opportunities
  • Work within a Global and Local Team
  • Collaborative and innovative work environment
