Site Reliability Engineering (SRE)
Achieving quantifiable and reliable outcomes through Site Reliability Engineering
Sify’s Site Reliability Engineering (SRE) creates measurable, incremental, and sustained value by modernizing and automating the enterprise application landscape.
In the past, quality engineering focused on shift-left testing, especially requirements review, functional and non-functional testing, and automation. With SRE, the focus extends to production, where shortcomings are addressed so that SRE targets are met.
Sify Digital brings value to operations by leveraging a well-defined set of practices, principles, and culture built on SRE and DevOps foundations with a strong emphasis on reliability engineering capabilities. The offering helps enterprises accelerate business transformation and maximize value by delivering reliable services in tune with fast-changing customer expectations.
Highlights
Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Define monitoring and assess monitoring tools for service observability
Eliminate toil
Faster identification of production issues
Improved, more efficient monitoring alerts
Maximized visibility and control around business processes and KPIs
Convert IT operations from a cost center to a value center
Define a DevOps toolchain for increased release agility by automating deployments and the rollback process
Facilitate a resilient environment for applications and platforms by maximizing automation and data-driven, quick Root Cause Analysis (RCA)
Define SRE runbooks tailored to specific contexts
SRE Key Principles
Sify’s SRE solution is focused on four delivery pillars:
Incident Management
The main goal of this pillar is to reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and to increase mean time between failures (MTBF), until each reaches its target value. The key focus is defining SLOs and SLIs at each service layer and tracking KPIs such as availability, latency, and system throughput to enhance customer experience.
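As a rough illustration of how these three metrics can be derived from incident records, here is a minimal sketch. The `Incident` type, its field names, and the sample timestamps are hypothetical. MTTR is assumed to run from the start of the failure to its resolution, and MTBF is assumed to be operating time divided by the number of failures. Teams often use slightly different definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the failure began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_timedelta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents, window_start, window_end):
    """Return (MTTD, MTTR, MTBF) for incidents inside an observation window."""
    mttd = mean_timedelta([i.detected - i.started for i in incidents])
    mttr = mean_timedelta([i.resolved - i.started for i in incidents])
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    # Assumption: MTBF is operating (up) time divided by the number of failures.
    mtbf = ((window_end - window_start) - downtime) / len(incidents)
    return mttd, mttr, mtbf


# Hypothetical sample data: two incidents in a 30-day window.
incidents = [
    Incident(datetime(2024, 1, 3, 10, 0),
             datetime(2024, 1, 3, 10, 10),
             datetime(2024, 1, 3, 11, 0)),
    Incident(datetime(2024, 1, 17, 22, 0),
             datetime(2024, 1, 17, 22, 20),
             datetime(2024, 1, 18, 0, 0)),
]
mttd, mttr, mtbf = incident_metrics(
    incidents, datetime(2024, 1, 1), datetime(2024, 1, 31)
)
print("MTTD:", mttd)
print("MTTR:", mttr)
print("MTBF:", mtbf)
```

Tracking these values over successive windows shows whether detection and resolution are getting faster and whether failures are becoming less frequent.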
Problem Management
This pillar deals with root cause analysis, prevention, and self-healing mechanisms in the digital ecosystem. SRE dashboards and data-driven insights provide information about overall service health, which helps track service availability over a given period during production monitoring.
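One common way to express availability over a period is as the fraction of successful requests, compared against an SLO target to see how much error budget remains. The sketch below assumes a request-based SLI. The function names and the sample numbers are illustrative and are not taken from any Sify tooling.

```python
def availability(good_events, total_events):
    """Availability SLI: fraction of events in the period that succeeded."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


# Hypothetical period: 1,000,000 requests, 600 failures, 99.9% SLO.
sli = availability(999_400, 1_000_000)
budget = error_budget_remaining(999_400, 1_000_000, slo_target=0.999)
print(f"availability: {sli:.4%}, error budget remaining: {budget:.1%}")
```

A dashboard built on figures like these makes it clear when a service is close to breaching its objective, which is a natural trigger for root cause analysis.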
Business Continuity
Business continuity and disaster recovery are the core areas of focus for this pillar. Chaos engineering, a disciplined approach to identifying vulnerabilities in production systems, is used to check a system’s reliability, stability, and ability to survive unstable and unexpected conditions.
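The idea behind a chaos experiment can be sketched in a few lines: inject faults into a dependency at a controlled rate and measure whether the caller still succeeds. This is a toy, in-process simulation under stated assumptions, not a production chaos tool. All names here (`with_chaos`, `call_with_retries`, `run_experiment`) are hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times; return True if any attempt succeeds."""
    for _ in range(attempts):
        try:
            func()
            return True
        except InjectedFault:
            continue
    return False


def run_experiment(failure_rate, attempts, trials=1000, seed=0):
    """Return the fraction of trials that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(lambda: "ok", failure_rate, rng)
    successes = sum(call_with_retries(flaky, attempts) for _ in range(trials))
    return successes / trials


# Compare resilience with and without retries at a 30% injected failure rate.
print("no retries:   ", run_experiment(failure_rate=0.3, attempts=1))
print("three retries:", run_experiment(failure_rate=0.3, attempts=3))
```

Real chaos experiments target infrastructure such as hosts, networks, and dependencies rather than a single function, but the structure is the same: a hypothesis about steady state, a controlled fault, and a measurement of the outcome.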
Environment Management
This pillar helps drive end-to-end deliverables to ensure a stable, efficient, observable, and resilient technology environment. SRE will also deep-dive into current production incidents, understand current design and architectural issues, develop innovative technical tooling to improve production stability, and enable faster recovery. The goal of this pillar is to use all channels to communicate clearly about system health and incident resolution.