In spring 2021, several factors converged to create an amazing problem at Campspot – more traffic!! – and therefore more load on the system. Pieces of our application and infrastructure that held together at the end of the 2020 camping season began to tremble, and we experienced some unpleasant customer-facing incidents. In response, we doubled down on the tools and processes we already had at our disposal – increased alerting, beefed-up on-call rotations, more runbooks, more dashboards, more high-urgency Slack channels. We put spotlights on so many areas of the system that it became hard to see where the issues were. Recognizing the chaos, we unified multiple teams around a single observability tool and implemented Service Level Objectives for our key external services. The result: fewer alerts, faster troubleshooting, and clear indicators of when we need to spend more time on performance versus features. Come hear how we cleared out the alert pollution so we could see the constellations we were actually searching for all along.
Kristin Smith (she/her) serves as a DevOps Services Team Lead for a distributed team of cloud and data engineers at Campspot. She transitioned into the technical industry seven years ago, bringing with her a background in history and archival sciences. Along the way she has worked in technical organizations ranging from three people to over 700, in both the private and public spheres. Her professional interests include infrastructure provisioning, monitoring and traceability in distributed systems, and writing documentation that people actually read.