Site Reliability Engineering Manager

Riyadh, Saudi Arabia

Job Details

Gender: Not Mentioned

Education: Bachelor of Science

Category: Information Technology

Industry: Automotive

Job Description

Roles & Responsibilities

SRE Leadership & Reliability Ownership
Own the availability, performance, and reliability of cloud services deployed and operated in KSA.
Define, implement, and track SRE best practices, including SLIs, SLOs, SLAs, and error budgets.
Lead the architecture and governance of highly available and disaster-resilient systems, ensuring DR strategies are tested and maintained.
Drive capacity planning, auto-scaling, and performance tuning across Kubernetes-based platforms.
Own monitoring, observability, and alerting using Prometheus, Grafana, and logging platforms.
Lead incident response, impact assessment, and root-cause analysis for complex production issues.
Team Management, Mentorship & Growth
Manage a team of SRE engineers, providing technical direction, career coaching, and performance feedback.
Review and approve infrastructure code, deployment configurations, automation scripts, and SRE tooling.
Foster a culture of ownership, learning, blameless postmortems, and continuous improvement.
Lead hiring, onboarding, and skill development initiatives for the SRE function.
Ensure fair, sustainable, and well-documented on-call rotations.
Cloud Platforms & Automation
Oversee production environments on Oracle Cloud Infrastructure (OCI) and AWS.
Govern Infrastructure-as-Code practices using Terraform and configuration management tools.
Lead CI/CD strategy and implementation using ArgoCD, Jenkins, Maven, Docker, and GitLab.
Ensure secure and reliable deployment of microservices and data pipelines on Kubernetes using Helm.
Platform Services & Data Systems
Collaborate closely with Product Owners, Engineering Managers, Security, and Architecture teams.
Oversee the reliability and scaling of platform services such as Kafka, Spark, Trino, Airflow, MQTT, and microservices ecosystems.
Ensure stable operation of NoSQL and RDBMS systems, including Elasticsearch, MongoDB, PostgreSQL, and MySQL.
Support distributed data processing and messaging systems, addressing performance and scalability challenges.

Requirements and Skills

B.S. or M.S. degree in Computer Science, Engineering, or a related field.
8+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
2-4 years of experience managing or leading SRE/DevOps engineers.
Strong hands-on experience with OCI and AWS cloud platforms.
Solid expertise in Kubernetes, Terraform, CI/CD pipelines, and cloud-native architectures.
Proficiency in Python, Go, Bash/Shell, or similar languages.
Strong experience with incident management, observability, and performance optimization.
Fluent in English, with experience collaborating across regions and time zones.
Experience scaling SRE practices across multiple teams or services.
Familiarity with compliance, security, and regulated cloud environments.

Desired Candidate Profile

Description
The Cloud team at Lucid Motors is seeking a Senior Site Reliability Engineering (SRE) Manager to lead the reliability, scalability, and
operational excellence of Lucid Motors' cloud infrastructure and production services. This role combines hands-on technical leadership with people
management, ensuring systems are highly available while developing and empowering a team of SRE engineers.
