We are seeking a Senior AI Platform Resident Engineer to lead L2/L3 operations, reliability, and production readiness for enterprise AI platform components deployed across virtual machines and OpenShift environments. This role is highly operational and hands-on, focused on the stability, observability, scalability, and security of AI runtime services, including model inference, vector databases, messaging, and conversational platforms. You will play a key role in closing operational gaps, defining runbooks, and ensuring reliable service delivery in a restricted, on-premises environment.
Key Responsibilities
AI Platform & Vector Systems
- Operate and support LLM inference services (e.g., vLLM) across VMs and OpenShift
- Support Qdrant (vector search), Kafka, and Rasa in production environments
- Implement performance tuning, scaling strategies, security hardening, and observability
- Develop L2 operational runbooks and define clear L3/vendor escalation paths

Messaging & Caching
- Manage Kafka and Redis clusters with high availability
- Perform tuning, capacity planning, backup/restore, and failure recovery
- Monitor throughput, latency, and resource utilization

Platform Operations (VMs & OpenShift)
- Deploy, manage, and harden services on VM-based platforms and OpenShift clusters
- Apply RBAC, TLS, audit logging, resource quotas, autoscaling, and health checks
- Support CI/CD rollouts and standardize deployment and release processes

Reliability & Observability
- Build and maintain metrics, logs, alerts, dashboards, and SLO/SLA monitoring
- Lead incident response, root cause analysis (RCA), and post-incident reviews
- Execute disaster recovery (DR) testing and resilience validation

Knowledge Transfer & Operational Readiness
- Identify L2 capability gaps and deliver structured operational training
- Define SLOs, RPO/RTO, escalation workflows, and production readiness checklists
- Improve documentation and operational maturity across teams

Scope Clarification
- PostgreSQL and MongoDB are out of L2 scope and are handled by other teams
Qualifications
- 7+ years operating distributed systems in production environments
- 3+ years of hands-on experience with OpenShift and/or Kubernetes
- Strong expertise in Linux, networking, observability, and security hardening
- Experience supporting Kafka, Redis, Qdrant, Rasa, or LLM inference frameworks
- Proven experience in L2/L3 support, incident management, and escalation handling
Only candidates with the required skills and experience should apply.