Job Description
Roles & Responsibilities
- SRE Leadership & Reliability Ownership
- Own the availability, performance, and reliability of cloud services deployed and operated in KSA.
- Define, implement, and track SRE best practices, including SLIs, SLOs, SLAs, and error budgets.
- Lead the architecture and governance of highly available and disaster-resilient systems, ensuring DR strategies are tested and maintained.
- Drive capacity planning, auto-scaling, and performance tuning across Kubernetes-based platforms.
- Own monitoring, observability, and alerting using Prometheus, Grafana, and logging platforms.
- Lead incident response, impact assessment, and root-cause analysis for complex production issues.
- Team Management, Mentorship & Growth
- Manage a team of SRE engineers, providing technical direction, career coaching, and performance feedback.
- Review and approve infrastructure code, deployment configurations, automation scripts, and SRE tooling.
- Foster a culture of ownership, learning, blameless postmortems, and continuous improvement.
- Lead hiring, onboarding, and skill development initiatives for the SRE function.
- Ensure fair, sustainable, and well-documented on-call rotations.
- Cloud Platforms & Automation
- Oversee production environments on Oracle Cloud Infrastructure (OCI) and AWS.
- Govern Infrastructure-as-Code practices using Terraform and configuration management tools.
- Lead CI/CD strategy and implementation using ArgoCD, Jenkins, Maven, Docker, and GitLab.
- Ensure secure and reliable deployment of microservices and data pipelines on Kubernetes using Helm.
- Platform Services & Data Systems
- Collaborate closely with Product Owners, Engineering Managers, Security, and Architecture teams.
- Oversee the reliability and scaling of platform services such as Kafka, Spark, Trino, Airflow, MQTT, and microservices ecosystems.
- Ensure stable operation of NoSQL and RDBMS systems, including Elasticsearch, MongoDB, PostgreSQL, and MySQL.
- Support distributed data processing and messaging systems, addressing performance and scalability challenges.
Requirements and Skills
- B.S. or M.S. degree in Computer Science, Engineering, or a related field.
- 8+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- 2-4 years of experience managing or leading SRE/DevOps engineers.
- Strong hands-on experience with OCI and AWS cloud platforms.
- Solid expertise in Kubernetes, Terraform, CI/CD pipelines, and cloud-native architectures.
- Proficiency in Python, Go, Bash/Shell, or similar languages.
- Strong experience with incident management, observability, and performance optimization.
- Fluent in English, with experience collaborating across regions and time zones.
- Experience scaling SRE practices across multiple teams or services.
- Familiarity with compliance, security, and regulated cloud environments.
Desired Candidate Profile
Description
The Cloud team at Lucid Motors is seeking a Senior Site Reliability Engineering (SRE) Manager to lead the reliability, scalability, and
operational excellence of Lucid Motors' cloud infrastructure and production services. This role combines hands-on technical leadership with people
management: keeping systems highly available while developing and empowering a team of SRE engineers.