Roundtable Discussion: Site Reliability Engineering

Anoush Najarian of MathWorks joined us as moderator for our Site Reliability Roundtable Discussion. According to Wikipedia, “Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create ultra-scalable and highly reliable software systems.” Site Reliability Engineering is a practice that started at Google when a team of 8 software engineers was tasked with making Google’s website run more smoothly, efficiently, and reliably. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.”

During our Roundtable discussion, we got to hear from several Site Reliability Engineers on the evolution of SRE and the tools and strategies they employ to make it work.

What are some of the key elements of site reliability engineering (SRE)?

[Jaishankar Padmanabhan, Wayfair]

Infrastructure engineering automation: building tools and frameworks that make engineering faster and more efficient, while keeping up with the reliability goals of our systems and making sure that SLAs are met. The SREs are embedded in the agile product engineering teams and work very closely with the engineers building the infrastructure, guiding them on reliability and making sure that automation is taken care of from an infrastructure standpoint. That means infrastructure-as-code, monitoring-as-code, and anything else you can automate to make engineering more efficient.

[Pete Lindsey, MathWorks]

There are two things that we see as our function. First, we see ourselves as the champions of the non-functional quality attributes. We expect our SRE teams to teach other people how to build non-functional qualities such as performance, availability, and security into their systems right from the start. That is probably the technical perspective.

We also make our systems and processes observable. That’s the other key part. We call it our next-generation monitoring. We want the health of the system to be evident and obvious to people, and also the health of our processes. Therefore, we measure our incident management process and our problem management process, because those measurements are key to knowing whether we are improving them, hitting our targets, and preventing issues. I think those are the two key parts of the SRE responsibility.

[Jaishankar Padmanabhan, Wayfair]

An advantage is knowing your Service Level Indicators (SLIs) and Service Level Objectives (SLOs), because they give you some room. You’re not going to target 100% as your SLO all of the time, but you will be able to track and know how much room you have left in case something is slipping. This enables you to distribute new features to those teams that have good reliability and can handle more risk.
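The "room" described here is commonly called an error budget in SRE practice. The following is a minimal sketch of how it could be tracked, not a description of Wayfair's tooling. The function names, the request-based SLI, and the 25% risk threshold are all assumptions made for illustration.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the SLO as a fraction (for example 0.999 for 99.9%).
    The SLI here is assumed to be a simple request success ratio.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # Failures the SLO tolerates over this measurement window.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room at all.
        return 1.0 if failed_requests == 0 else 0.0
    return 1.0 - failed_requests / allowed_failures


def can_take_more_risk(remaining: float, threshold: float = 0.25) -> bool:
    """Hypothetical policy: allow riskier releases while enough budget remains."""
    return remaining >= threshold


# Example: a 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"budget remaining: {remaining:.0%}, ship: {can_take_more_risk(remaining)}")
```

A negative result means the budget is overspent, which a team might treat as a signal to pause feature work in favor of reliability work.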

[Khushali Desai, Walmart]

In addition to this, we are monitoring production issues and our mean-time-to-failure (MTTF) and mean-time-to-repair (MTTR). So in addition to tracking production, we are tracking our own resiliency.
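As a rough illustration of these two metrics, the sketch below computes them from a list of incident records. The record shape (start and resolution timestamps) and the sample data are assumptions for the example, not Walmart's actual data model. MTTF is taken here as the mean uptime between one incident's resolution and the next incident's start.

```python
from datetime import datetime, timedelta

# Each incident is (started_at, resolved_at); the shape is hypothetical.
Incident = tuple[datetime, datetime]


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average time from the start of an incident to its resolution."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_failure(incidents: list[Incident]) -> timedelta:
    """Average uptime between a resolution and the next incident's start."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)
    uptimes = [
        next_start - prev_resolved
        for (_, prev_resolved), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return sum(uptimes, timedelta()) / len(uptimes)


# Made-up sample data for illustration only.
sample = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    (datetime(2024, 1, 3, 0, 0), datetime(2024, 1, 3, 3, 0)),
    (datetime(2024, 1, 6, 0, 0), datetime(2024, 1, 6, 2, 0)),
]
print("MTTR:", mean_time_to_repair(sample))
print("MTTF:", mean_time_to_failure(sample))
```

Tracking both over time shows whether failures are becoming rarer (rising MTTF) and whether recovery is getting faster (falling MTTR).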
