Job Purpose
The SRE Consultant – Observability & APM is responsible for designing, implementing, and optimizing large-scale observability and application performance monitoring (APM) platforms to ensure the reliability, performance, scalability, and availability of mission-critical enterprise systems. The role applies Site Reliability Engineering (SRE) principles across the logging, monitoring, APM, and observability domains. The consultant acts as a subject matter expert for platforms such as Splunk, Instana, and AppDynamics, and drives automation, performance engineering, and operational excellence across hybrid and cloud-native environments.
Key Accountabilities
- Architect, deploy, and operate enterprise-grade observability and APM platforms, including Splunk, Instana, and/or AppDynamics, across on-premises, cloud, and hybrid environments.
- Apply SRE principles by defining and managing SLIs, SLOs, and error budgets to ensure platform reliability and service performance.
- Lead performance analysis, troubleshooting, and root cause analysis (RCA) for complex application and platform-level issues.
- Design and maintain dashboards, alerts, health rules, and analytics use cases to provide end-to-end system visibility.
- Perform capacity planning, performance tuning, and scalability assessments for observability and APM platforms.
- Drive automation initiatives using scripting and Infrastructure as Code (IaC) to improve reliability, consistency, and operational efficiency.
- Integrate observability platforms with ITSM, CI/CD pipelines, SIEM, and incident management tools.
- Provide technical leadership, guidance, and mentorship to SRE, DevOps, and operations teams.
- Advise engineering and leadership teams on observability best practices and platform strategy.
- Maintain platform documentation, standards, and operational runbooks.
Minimum Qualifications
Bachelor’s degree in Computer Science, Information Technology, or a related field.
Minimum Experience
6+ years of experience in SRE, IT Operations, DevOps, or application performance/observability roles.
Job-Specific Skills
- Strong foundation in Site Reliability Engineering (SRE), observability, and modern application architectures.
- Proven hands-on experience with at least one of the following platforms in large-scale enterprise environments: Splunk, Instana, or AppDynamics.
- Deep hands-on expertise in observability, logging, and APM platforms (Splunk, Instana, AppDynamics).
- Strong understanding of APM, metrics, logs, traces, and performance engineering concepts.
- Proficiency in SRE practices, including reliability measurement, automation, and incident management.
- Experience with cloud platforms (AWS, Azure, GCP) and container orchestration technologies (Kubernetes/OpenShift).
- Strong automation and scripting skills (e.g., Python, Bash, PowerShell).
- Experience with Infrastructure as Code tools (e.g., Terraform, Ansible, Puppet) is highly desirable.
- Solid knowledge of Linux/Unix and Windows operating systems, networking, and system performance.
- Ability to communicate complex technical concepts clearly to both technical and non-technical stakeholders.
- Strong analytical, troubleshooting, and problem-solving skills.
- Relevant platform or cloud certifications (e.g., Splunk Architect, Instana, AppDynamics, Cloud/SRE certifications) are a plus.