Job Details

Job description

Position Summary


The AI/ML Support Automation Analyst will be a key member of the KSL AI Support Team, focusing on MLOps infrastructure, container orchestration, and workflow automation at supercomputing scale. Working under the AI/ML Support Team Lead, this role is responsible for developing and maintaining secure, OCI-compliant container images, robust CI/CD pipelines, and cloud-native MLOps workflows that enable researchers to efficiently deploy and manage AI/ML workloads. The Analyst will bridge the gap between cutting-edge Kubernetes-based infrastructure and the diverse needs of the research community, contributing to governance, technical enablement, and community development initiatives.


Major Responsibilities


1. MLOps and Container Development


• Provide timely and useful user support via telephone, walk-in, email, and ticketing system submissions for all types of inquiries.


• Maintain high customer service standards when responding to user issues and questions.


• Develop and maintain secure, OCI-compliant, and HPC-ready AI/ML and data science software container images


• Design and implement robust MLOps workflows and pipelines at supercomputing scale


• Develop and maintain CI/CD pipelines for reproducible infrastructure and workflow deployment


• Design and deploy APIs for AI/ML services and inference endpoints


• Implement and manage Kubernetes-based orchestration, including CNI, CSI, and service mesh configurations and optimization


• Deploy and maintain container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry)


2. Governance and Compliance Support


• Assist in computational readiness reviews for AI research projects


• Assist in AI model and artifact control reviews to ensure compliance with institutional standards


• Provide consultation to users on efficient resource usage for AI/ML and MLOps workflows


• Ensure container images and workflows comply with security policies and best practices


• Support the implementation of usage monitoring and reporting systems


3. Performance and Benchmarking


• Perform performance debugging and tuning of MLOps and cloud-native workflows


• Develop and maintain AI/ML and MLOps workload benchmarks for procuring new systems


• Create and maintain regression testing workloads for existing clusters


• Deploy and maintain observability and resource monitoring stacks using Prometheus, Grafana, NVIDIA DCGM, and Grafana Loki


• Contribute to technology evaluation and benchmarking exercises for future infrastructure investments


4. Training and Documentation


• Create comprehensive training content for users on MLOps platforms, Kubernetes, and containerization


• Develop and maintain high-quality user documentation for automation tools and workflows


• Support the delivery of workshops on CI/CD, container orchestration, and MLOps best practices


• Contribute to knowledge transfer initiatives within the KAUST research community


• Provide one-on-one consultation to researchers on efficient use of automation infrastructure


Personal Requirements


Competencies


• Experience


• Demonstrated experience developing robust and complex MLOps pipelines


• Hands-on experience with API design and deployment


• Experience developing robust and portable CI/CD pipelines for reproducible infrastructure and workflow deployment


• Experience supporting researchers or working in academic/research computing settings preferred


• Technical Skills - Essential


• Kubernetes: Strong expertise in Kubernetes, Container Network Interface (CNI), Container Storage Interface (CSI), and service mesh


• MLOps: Experience developing and maintaining MLOps pipelines and workflows


• CI/CD: Proficiency in building CI/CD pipelines for infrastructure and application deployment


• Containerization: Experience building secure, OCI-compliant container images


• API Development: Experience in API design, development, and deployment


• Programming: Proficiency in Python; experience with Go and Bash scripting


• Linux: Strong Linux/Unix systems administration skills


• Technical Skills - Desired


• Experience with ArgoCD, Airflow, Dask, and Spark for workflow orchestration


• Experience with Kubeflow, KServe, and Seldon for ML serving and pipelines


• Experience deploying and maintaining observability stacks (Prometheus, Grafana, NVIDIA DCGM, Grafana Loki)


• Knowledge of Model Context Protocol (MCP) and agentic frameworks


• Experience deploying inference services at scale


• Experience deploying and maintaining container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry, Artifact Hub)


• Experience with GitOps practices and Infrastructure as Code (Terraform, Ansible)


• Experience with HPC schedulers (Slurm) and HPC-cloud integration


• Soft Skills


• Strong problem-solving and analytical abilities


• Excellent written and verbal communication skills in English


• Customer service mindset with patience for supporting diverse skill levels


• Ability to work independently and as part of a collaborative team


• Strong documentation and knowledge-sharing practices


• Cultural sensitivity for working in an international environment


Preferred Qualifications


• Experience in national laboratories or major research computing facilities


• Experience with GPU scheduling and resource management in Kubernetes


• Background in DevOps or Site Reliability Engineering (SRE)


• Contributions to open-source cloud-native or MLOps projects


• Publications or presentations on MLOps, Kubernetes, or automation topics


• Knowledge of Saudi Arabia's Vision 2030 and national AI initiatives


• Additional certifications: AWS/Azure/GCP, Terraform, NVIDIA DLI


Qualifications


• Bachelor's or Master's degree in Computer Science, Data Science, Computational Science, Artificial Intelligence, or a related field


• Certifications such as CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), CKS (Certified Kubernetes Security Specialist), or CNPE (Certified Cloud Native Platform Engineer) are highly valued


Experience


• Minimum of 2 years of relevant experience


Preferred candidate

Years of experience

No experience required

Degree

Bachelor's degree / higher diploma
