Site Reliability Engineering - Computer Measurement Group

Site Reliability Engineering

Site Reliability Engineering (SRE) is a ‘SRE-ious’ movement that breaks down the traditional barriers between Development and Operations teams and aims to end their long-standing conflicts. SRE essentially creates a hybrid role that balances developing new features against running production systems reliably.

This 3-part training course explores SRE from its origins, through its specific terminology, to its current evolution as a standard for running production systems. Each part either answers a series of questions or explores a set of inter-related topics, going in depth while also providing an overall picture.


Part 1 – Introduction

  • What is SRE? Is SRE different from DevOps? Are they competing standards?
  • Reliability vs Availability? How are they measured?
  • What are the essential tools of SRE?
  • What is toil? How to eliminate toil?
  • How does Automation fit into SRE?
  • Measuring vs Monitoring vs Alerting – How are they different?

Part 2 – Defining critical practices

  • How to keep the error budget in check?
  • Explore SRE’s observability of distributed systems:
    • Time Series Metrics
    • Distributed Logging
    • Distributed Tracing
  • Discuss facets of Incident Management:
    • Being On-Call
    • Effective Troubleshooting
    • Blameless Postmortems

Part 3 – Types of Implementations and CRE

  • Types of SRE Implementations
  • Customer Reliability Engineering (CRE) – A team to implement SRE for GCP customers

About your Instructor

Sandeep Madamanchi
Head of Cloud & Infrastructure at Variant

Sandeep Madamanchi is currently Head of Cloud & Infrastructure at Variant. Previously, Sandeep was with Manhattan Associates, where he worked in various capacities as a Business Analyst, Design Lead, Development Manager and Performance Engineering Lead. Sandeep is passionate about cloud technologies and is a certified GCP & AWS Architect. He is also passionate about sharing knowledge and has published articles on Docker, Kubernetes, GCP services and related topics with popular publications on Medium. You can follow Sandeep’s articles on Medium at https://medium.com/@sandeep.manchi and connect on LinkedIn at https://www.linkedin.com/in/sandeep-madamanchi-51aaa810
