Sr. SRE

Full time

Responsibilities

Collaborate with U.S.-based counterparts to define and monitor service SLOs, SLAs, and key performance indicators.
Lead root cause analysis, blameless postmortems, and reliability improvements across environments.
Review application code (primarily Java/Spring) to assist in identifying defects and systemic performance issues.
Automate deployment pipelines, recovery workflows, and runbook processes to minimize manual intervention.
Build and manage dashboards, alerts, and health checks using tools like Dynatrace, Azure Monitor, Prometheus, and Grafana.
Contribute to architectural decisions with a focus on performance and operability.
Guide and mentor offshore team members in incident response and production readiness.
Participate in 24x7 support rotations aligned with EST coverage expectations.

Qualifications & Experience

8-10 years of experience in SRE, DevOps, or platform engineering, ideally supporting U.S. enterprise systems.
Strong hands-on experience with Java/Spring Boot applications, with the ability to assist in code-level troubleshooting.

Must Have Skills:

Cloud & Infrastructure
Kubernetes (AKS): container orchestration and management
Docker: containerization
Terraform: Infrastructure as Code
Ansible: configuration management and provisioning

CI/CD & SCM
Jenkins / ArgoCD: pipeline design and maintenance
GitHub / Bitbucket / Azure Repos: source code management

Observability & Monitoring
Dynatrace: APM and infrastructure monitoring
Prometheus & Grafana: metrics and dashboards
Splunk / Elasticsearch: log aggregation and analysis

Reliability & Operations
Incident management and on-call support
Root cause analysis (RCA) and postmortem practices
SLI / SLO / SLA definition and tracking
Performance tuning and capacity planning

Scripting
Shell, Python, or PowerShell: automation and tooling

Good to Have Skills:

Service Mesh: Istio / Linkerd for traffic management and observability
GitOps: ArgoCD
Chaos Engineering: tools such as Chaos Monkey and LitmusChaos
DevSecOps: security scanning in pipelines (Snyk, SonarQube)
Distributed Tracing: Jaeger / OpenTelemetry
Cloud Certifications: Azure associate or professional level
ITSM Tools: PagerDuty, Opsgenie, ServiceNow for alert routing