Salesforce is seeking a Principal Software Engineer in Site/Product Reliability Engineering to lead reliability efforts for its AgentForce platform. The role involves working in US operations and includes weekend shifts aligned with India hours. It focuses on maintaining and scaling the availability and performance of the AgentForce platform, with particular emphasis on production support for the generative and predictive AI platform.
The role demands expertise in multi-system debugging and triage across Salesforce platforms, including Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, as well as in integrations with LLM providers such as OpenAI, Azure OpenAI, and AWS Bedrock. The successful candidate will lead production triage processes, implement automated solutions, and maintain comprehensive documentation of production incidents.
Key responsibilities include establishing reliability processes, collaborating with lead engineers, investigating alerts and customer-reported issues, and ensuring scalable services. The role requires strong infrastructure and scaling management skills, particularly in handling Large Language Models and associated services. The position offers an opportunity to work with cutting-edge AI technology while maintaining high availability and reliability standards.
Candidates should possess 8+ years of experience in production support, strong knowledge of cloud services, and expertise in implementing reliability processes across full-stack, end-to-end ML platforms. The role offers competitive compensation, with salaries ranging from $211,500 to $334,600, depending on location.
This is an excellent opportunity for a seasoned professional looking to make a significant impact in AI platform reliability while working with a company at the forefront of customer relationship management technology. The role combines technical leadership with hands-on problem-solving in a dynamic, innovative environment.