As performance engineers, we understand the importance of testing software during and after development to identify performance bottlenecks. Due to various constraints, such as a scaled-down test environment, limited data volume, or code integration limitations, it is not always possible to catch every bug in test. This is why anomaly detection in production takes on even greater significance. Customers can be impacted if performance bottlenecks are not identified and resolved in a timely manner, and the scale at which this detection must happen is itself daunting: a few servers in test versus thousands of servers in production, with time of the essence. Anomaly detection at scale is one of the biggest challenges a performance engineer faces.
One of the most widely used techniques for identifying performance bugs is to examine time series data for the various metrics that could pinpoint a potential problem. This approach does not scale well in production, even when the time series data can be consolidated into a few charts. Given how time-consuming this kind of analysis can be, this presentation illustrates how applying simple statistics and basic linear regression principles can improve a performance engineer's productivity tenfold or more. Automated anomaly detection in production, using simple data science techniques and eliminating the reliance on time series data, reduces not only the time it takes to identify an issue but also the time it takes to get customers out of an outage.
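The abstract does not spell out the exact method, but one common way to combine basic linear regression with simple statistics for anomaly detection is to fit a trend line to a metric and flag points whose residuals are outliers. The sketch below is illustrative only: the function name, the robust-z threshold, and the median-based scaling are assumptions, not the presenter's implementation.

```python
import numpy as np

def flag_anomalies(values, z_thresh=3.0):
    # Illustrative sketch only: fit a linear trend to the metric samples,
    # then flag points whose residual is a robust-z outlier.
    x = np.arange(len(values), dtype=float)
    y = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # least-squares trend line
    residuals = y - (slope * x + intercept)
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))     # median absolute deviation
    if mad == 0:
        return []                                # no spread, nothing to flag
    robust_z = 0.6745 * (residuals - med) / mad  # MAD-based z-score
    return [i for i, z in enumerate(robust_z) if abs(z) > z_thresh]

# Example: steady latency readings with one spike at index 5
latencies = [10, 11, 10, 12, 11, 50, 11, 12, 10, 11]
print(flag_anomalies(latencies))  # -> [5]
```

A median-based residual test is used here rather than the residuals' standard deviation because a single large spike inflates the standard deviation enough to mask itself; the median absolute deviation is insensitive to that. A check like this needs no time series chart, so it can run unattended across thousands of servers.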