Riyadh, Saudi Arabia

Job Details

Job Description

Roles & Responsibilities

SRE Leadership & Reliability Ownership
  • Own the availability, performance, and reliability of cloud services deployed and operated in KSA.
  • Define, implement, and track SRE best practices, including SLIs, SLOs, SLAs, and error budgets.
  • Lead the architecture and governance of highly available and disaster-resilient systems, ensuring DR strategies are tested and maintained.
  • Drive capacity planning, auto-scaling, and performance tuning across Kubernetes-based platforms.
  • Own monitoring, observability, and alerting using Prometheus, Grafana, and logging platforms.
  • Lead incident response, impact assessment, and root-cause analysis for complex production issues.

Team Management, Mentorship & Growth
  • Manage a team of SRE engineers, providing technical direction, career coaching, and performance feedback.
  • Review and approve infrastructure code, deployment configurations, automation scripts, and SRE tooling.
  • Foster a culture of ownership, learning, blameless postmortems, and continuous improvement.
  • Lead hiring, onboarding, and skill development initiatives for the SRE function.
  • Ensure fair, sustainable, and well-documented on-call rotations.

Cloud Platforms & Automation
  • Oversee production environments on Oracle Cloud Infrastructure (OCI) and AWS.
  • Govern Infrastructure-as-Code practices using Terraform and configuration management tools.
  • Lead CI/CD strategy and implementation using ArgoCD, Jenkins, Maven, Docker, and GitLab.
  • Ensure secure and reliable deployment of microservices and data pipelines on Kubernetes using Helm.

Platform Services & Data Systems
  • Collaborate closely with Product Owners, Engineering Managers, Security, and Architecture teams.
  • Oversee the reliability and scaling of platform services such as Kafka, Spark, Trino, Airflow, MQTT, and microservices ecosystems.
  • Ensure stable operation of NoSQL and relational database systems, including Elasticsearch, MongoDB, PostgreSQL, and MySQL.
  • Support distributed data processing and messaging systems, addressing performance and scalability challenges.

Requirements and Skills

  • B.S. or M.S. degree in Computer Science, Engineering, or a related field.
  • 8+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
  • 2 to 4 years of experience managing or leading SRE/DevOps engineers.
  • Strong hands-on experience with OCI and AWS cloud platforms.
  • Solid expertise in Kubernetes, Terraform, CI/CD pipelines, and cloud-native architectures.
  • Proficiency in Python, Go, Bash/Shell, or similar languages.
  • Strong experience with incident management, observability, and performance optimization.
  • Fluent in English, with experience collaborating across regions and time zones.
  • Experience scaling SRE practices across multiple teams or services.
  • Familiarity with compliance, security, and regulated cloud environments.

Desired Candidate Profile

Description
The Cloud team at Lucid Motors is seeking a Senior Site Reliability Engineering (SRE) Manager to lead the reliability, scalability, and
operational excellence of Lucid Motors' cloud infrastructure and production services. This role combines hands-on technical leadership with people
management, keeping systems highly available while developing and empowering a team of SRE engineers.
