Site Reliability Engineer

Enterprise-grade AI-focused GPU-as-a-service provider with a decentralized cloud computing infrastructure.
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
Site Reliability
Staff Software Engineer
Hybrid
AI · Enterprise SaaS

Description For Site Reliability Engineer

Aethir is the only enterprise-grade, AI-focused GPU-as-a-service provider in the market. Its decentralized cloud computing infrastructure connects GPU providers (containers) with enterprise clients who need powerful GPUs for professional AI/ML workloads. With a network of over 40,000 high-end GPUs, including 3,000 NVIDIA H100s, Aethir provides enterprise-grade GPU computing at scale.

We are seeking a Site Reliability Engineer (SRE) for our new headquarters in Kuala Lumpur, Malaysia. The role is central to monitoring, troubleshooting, and optimizing our production system to ensure high performance and stability for our AI and gaming customers worldwide.

Key responsibilities include:

  • Monitoring, reviewing, and responding to system faults
  • Continuously reviewing system architecture and performance
  • Coordinating with the business team to resolve operations issues
  • Promptly responding to and resolving production failures
  • Organizing teams for collaborative problem-solving
  • Conducting case studies and implementing optimizations
  • Maintaining comprehensive system documentation
  • Identifying and implementing process improvements

Requirements:

  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance (O&M) development
  • Strong understanding of system architecture and troubleshooting
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes, CI/CD, and Docker
  • Expertise in AWS or Python
  • Experience in building operations infrastructure platforms

We offer benefits such as a hypergrowth startup environment, fantastic career progression opportunities, and a collaborative, innovative work environment. Join us in shaping the future of decentralized computing!

Last updated 6 months ago

Responsibilities For Site Reliability Engineer

  • Monitor, review, and respond to system faults
  • Review system architecture and performance
  • Coordinate with business team on operations issues
  • Respond to and resolve production failures
  • Organize teams for collaborative problem-solving
  • Conduct case studies and implement optimizations
  • Maintain system documentation
  • Identify and implement process improvements

Requirements For Site Reliability Engineer

  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance (O&M) development
  • Strong understanding of system architecture and troubleshooting
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes, CI/CD, and Docker
  • Expertise in AWS or Python
  • Ability to work in a fast-paced startup environment

Benefits For Site Reliability Engineer

  • Hypergrowth startup environment
  • Fantastic career progression opportunities
  • Work within a global and local team
  • Collaborative and innovative work environment
