Eram Talent is looking for a skilled AI Infrastructure Engineer to join our innovative team.
The ideal candidate will be responsible for designing, building, and maintaining scalable and robust infrastructure solutions that support AI and machine learning workloads.
This role involves working closely with data scientists, machine learning engineers, and software developers to optimize infrastructure performance and facilitate efficient AI model development and deployment.
Key Responsibilities:
Design, implement, and manage high-performance computing environments tailored for AI and machine learning applications.
Deploy and maintain GPU-accelerated clusters, cloud-based AI platforms, and parallel processing systems.
Collaborate with data scientists and ML engineers to understand infrastructure requirements for various AI projects.
Optimize resource allocation and scalability of AI infrastructure to support large datasets and complex models.
Automate infrastructure provisioning and deployment using Infrastructure as Code (IaC) tools.
Ensure security, compliance, and reliability of AI infrastructure.
Monitor system performance and troubleshoot issues to minimize downtime and maximize productivity.
Stay updated on emerging technologies and best practices in AI infrastructure and propose continuous improvements.
Requirements:
Bachelor’s or higher degree in Computer Science, Engineering, or a related technical field.
3+ years of experience in infrastructure engineering, preferably with a focus on AI, machine learning, or high-performance computing environments.
Cloud skills: GCP/OpenShift, Kubernetes (k8s), Docker containers/images.
AI skills: model training, testing/evaluation, and deployment.
MLOps/LLMOps.
LLM and GenAI core skills: how LLMs work under the hood and the inference mechanics of LLMs/GenAI.
Inference scaling, distributed computing, inference benchmarking, and inference planning to meet SLAs/SLOs.
GPUs and how to work with them, handling distributed workloads, and autoscaling.
NVIDIA NIM, Hugging Face.
NVIDIA SuperPODs (HPC, Slurm, k8s).
Monitoring and dashboards for LLM/ML workloads and applications.
AI application architecture know-how, including end-to-end flows.
DevOps: CI/CD, Argo CD, Git, Jenkins, etc.
Languages: Python, SQL.