Google is seeking a Software Engineer III to join its XBorg team, which is part of the Borg Control Plane infrastructure. This role focuses on developing and maintaining a novel orchestration layer responsible for scheduling throughput-oriented workloads, particularly Machine Learning training and inference, across Google's vast cluster infrastructure.
The position is part of the ML, Systems, & Cloud AI (MSCA) organization, which is responsible for the hardware, software, machine learning, and systems infrastructure that powers all Google services and Google Cloud. XBorg has introduced innovative concepts such as weighted fair queuing, seamless opportunistic access to unused resources, and spatial and platform flexibility, leading to improved resource efficiency for ML workloads across major Alphabet products.
As a Software Engineer III, you'll work on critical projects with opportunities for growth and team mobility. The role requires strong software development skills and experience with data structures and algorithms; a background in Machine Learning infrastructure is preferred. You'll contribute to technologies that impact billions of users, in areas ranging from information retrieval to distributed computing, large-scale system design, and artificial intelligence.
The ideal candidate should have at least 2 years of software development experience and be comfortable with complex technical challenges. You'll be involved in code reviews, design discussions, documentation, and problem-solving at scale. This role offers the opportunity to work on cutting-edge technology while being part of a team that shapes the future of hyperscale computing and AI infrastructure.
Working at Google means joining a company that prioritizes security, efficiency, and reliability across all operations. You'll be part of an organization that powers crucial products like Google Cloud's Vertex AI and contributes to bringing Gemini models to enterprise customers. The role offers exposure to diverse technical challenges and the chance to shape global-scale systems used by billions of people daily.