We hosted an Observability Roundtable where we discussed monitoring vs observability, legacy systems, metrics and tagging, and how AI ties into all of it. Joining in on the conversation were technologists and developers representing the software, FinTech, government, talent, and consulting industries.
Q1. There are a lot of differing opinions about observability and monitoring and their hierarchy out there. How do you define and differentiate the two?
Austin Parker, Principal Developer Advocate, Lightstep: To me, the biggest difference is that monitoring is about the tools. It’s about the ways that you observe signals from your applications and infrastructure, and all the components that make your software work. Meanwhile, observability is more about a cultural change. It impacts how you think about performance and how you’re able to talk about performance across teams. There are a lot of things that blend into this, like site reliability engineering, service level indicators, and error budgets.
Anoush Najarian, Software Engineering Manager, MATLAB Performance Team, MathWorks: The startup Honeycomb has created the Observability Manifesto. It says that observability is the power to ask new questions of your system without having to ship new code or gather new data in order to ask those questions. Monitoring is about “known unknowns” and actionable alerts, whereas observability is about “unknown unknowns” and empowering you to ask arbitrary new questions and explore where the cookie crumbs will take you.
Bishak Adukkath, Principal Performance Engineering Consultant, Fidelity Investments: At Fidelity, we’re mostly working with “white box monitoring” as they call it. We go into detail about what is happening in the tech stack, which includes monitoring different transactions and getting alerts based on certain thresholds. It’s not just from an application performance or transaction perspective. It’s an overall visualization and it gives you information about the infrastructure and server logs. It’s a holistic, bird’s eye view of our tech stack.
Q2. If what you are trying to achieve is not only being able to see where things fail but also to know why they are failing, then observability needs to be baked into your systems. For a legacy system that was not built on this premise, how can you get to a point of observability on those systems?
Austin Parker: I agree that a huge part of it is this holistic approach, but it needs to be an accessible approach as well. It can’t be something only for the oracles of your tech stack. That’s one of the trade-offs with observability: you are getting a lot more data out of your application and infrastructure. It’s going to be a big change and undertaking to have a way to process all that data.
And with vintage applications, how do you measure the performance of a mainframe? There are a lot of technical ways but, by and large, those systems weren’t designed with these sorts of concepts in mind. Sure, you could add a proxy to your traffic around the legacy system and then emit telemetry data from that proxy. But that’s still adding complexity and adding more moving parts to a system.
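The proxy idea mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of any panelist's setup: `legacy_balance_lookup`, `with_telemetry`, and the in-memory `TELEMETRY` list are hypothetical names, and a real deployment would sit on the network path and ship records to a telemetry backend.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory sink; a real setup would export these records.
TELEMETRY: List[Dict[str, Any]] = []


def legacy_balance_lookup(account_id: str) -> int:
    """Stand-in for a call into a legacy system we cannot modify (hypothetical)."""
    if not account_id:
        raise ValueError("account_id is required")
    return 100


def with_telemetry(name: str, call: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a legacy call so that every invocation emits one telemetry record."""

    def proxy(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        status = "ok"
        try:
            return call(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # the proxy observes failures but does not swallow them
        finally:
            TELEMETRY.append({
                "operation": name,
                "status": status,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })

    return proxy


lookup = with_telemetry("balance_lookup", legacy_balance_lookup)
result = lookup("acct-42")
```

The legacy function itself is untouched; the cost, as noted above, is one more moving part between callers and the system.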
Anoush Najarian: I agree with you. It’s like communicating. It differs by the audience that we’re communicating with. So if we’re going to design a tool we want to ask who is going to be using it and how successful they are going to be at using it. In some of these cases, perhaps we need to start simply before we start building something so robust.
Luiz Eduardo Gazola de Souza, Technical Consultant, OCATETO: Observability is more about all the information that you have to complete an investigation for any problem that you have. For monitoring, you want data to go to a dashboard or send alerts to you. But, for observability, it is more data, everything else that you could dig into to learn more about the problem.
Q3. What data or metrics is observability allowing us to capture? What metrics are most important now and in the future?
Austin Parker: I think one key is that all telemetry is effectively convertible into all other forms of telemetry. I just finished writing a book on distributed tracing and a lot of people see tracing and think that it is this highly specialized thing. But distributed traces are just like a logging statement with some context that is being consumed and processed in a different way. You can take distributed traces and turn them into metrics. In Lightstep, we look at trace data and sample trace data and turn it into golden signals like latency and throughput. Before Lightstep was a word on the wind, people had been doing this forever. They would run a process over a log file and turn it into a time series of request counts, or request counts bucketed by latency or by error rate or whatever.
Time is the most important resource you have. The conversation should be less about the kinds of data that you are able to get with your observability practice and more about what it enables you to do more quickly. Some important measures, for instance, are: Is it taking you less time to detect an incident because of your observability practice? Are you able to improve baseline performance? Over time, can you reduce the average latency for a service or a given request?
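The log-to-metrics conversion described above (a process that reads a log file and produces request counts over time, bucketed by latency or error rate) might look roughly like the sketch below. The log line layout, the bucket boundaries, and all names are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical access-log lines: "<ISO timestamp> <path> <status> <latency_ms>"
SAMPLE_LOG = [
    "2020-05-01T12:00:03 /orders 200 42",
    "2020-05-01T12:00:41 /orders 200 180",
    "2020-05-01T12:00:59 /orders 500 950",
    "2020-05-01T12:01:10 /orders 200 75",
]

# Upper bounds (ms) of the latency buckets; anything slower lands in "inf".
BUCKETS_MS = (100, 500)


def bucket_for(latency_ms: float) -> str:
    """Return the label of the first bucket whose bound covers the latency."""
    for bound in BUCKETS_MS:
        if latency_ms <= bound:
            return f"le_{bound}ms"
    return "inf"


def logs_to_metrics(lines: Iterable[str]) -> Tuple[Counter, Counter, Counter]:
    """Turn raw log lines into per-minute request, error, and latency counts."""
    requests: Counter = Counter()  # minute -> request count
    errors: Counter = Counter()    # minute -> count of 5xx responses
    latency: Counter = Counter()   # (minute, bucket) -> request count
    for line in lines:
        timestamp, _path, status, latency_ms = line.split()
        minute = timestamp[:16]  # truncate to "YYYY-MM-DDTHH:MM"
        requests[minute] += 1
        if status.startswith("5"):
            errors[minute] += 1
        latency[(minute, bucket_for(float(latency_ms)))] += 1
    return requests, errors, latency


requests, errors, latency = logs_to_metrics(SAMPLE_LOG)
```

Each counter is a small time series keyed by minute, which is the same shape of data a metrics system would store.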
Bishak Adukkath: I’ll just take an example of what is happening at Fidelity. We started tagging all of our logs to be Splunk friendly. We have custom tags on every object and every request that goes into our application. It makes it easy for you to plot any type of business transaction or application transaction in one of the many monitoring tools we use.
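As a rough illustration of tagging every log event, the sketch below appends custom tags to each log line as key="value" pairs, a layout that log indexing tools can generally split into searchable fields. It is not Fidelity's implementation; the `TagFormatter` class and the tag names (`business_txn`, `request_id`, `duration_ms`) are hypothetical.

```python
import io
import logging


class TagFormatter(logging.Formatter):
    """Append tags passed via extra={"tags": {...}} as key="value" pairs."""

    def format(self, record: logging.LogRecord) -> str:
        base = super().format(record)
        tags = getattr(record, "tags", {})
        pairs = " ".join(f'{key}="{value}"' for key, value in sorted(tags.items()))
        return f"{base} {pairs}".rstrip()


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(TagFormatter("%(levelname)s %(message)s"))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info(
    "order accepted",
    extra={"tags": {"business_txn": "place_order", "request_id": "r-1001", "duration_ms": 12}},
)
line = stream.getvalue().strip()
```

With consistent tag names across services, a dashboard can group events by a tag such as `business_txn` instead of parsing free-form messages.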
Bruce Speyer, Technologist, Texas Department of Information Resources: We basically measure everything across all the technology towers and business operations across the state and we’re starting to see some real benefit to that. We standardized on ServiceNow two years ago and we have a lot of Big Data sources and a lot of API service buses. We also have event coordination centers that work with that data.
The system has financial data, digital governance data, and strategic and technology planning data. It’s multi-tenant, and every customer has a view of their own segmented data. There’s some automation, depending on how much you want, or not. Depending on who you are, you see different dashboards. This storage and insight into operational and financial data recently allowed us to expand our capacity quickly. And, while it’s not as good as we want now, it keeps getting better and better and we have thousands of users.
One big key for it is using it for business intelligence too. For instance, can we measure how well we are servicing unemployment benefits? We’re working hard to push this system up to the business intelligence level and that’s what the management wants – observability into the business decisions. So we put everything in there and it’s all connected. ServiceNow has turned out to be a really good tool for that because it covers almost any operation of the business.
Austin Parker: Ultimately, you have to be measuring things from the perspective of how they impact you. Not only the people who are clicking on a button on a website to submit unemployment forms but also the analysts and decision-makers – people on other teams. I think a big part of observability is bringing the focus on performance and really tying it to user experience.
Bishak Adukkath: Related to mapping for the business perspective, when you have these custom tags applied, your log index can result in a dashboard that is meaningful from a business perspective. When I worked for a large software company, they wanted to know the performance and how it impacts the business from a management perspective. Those log events can be easily mapped in Splunk or any other log indexing tool in a way that makes business sense.
Austin Parker: I think this is interesting. It represents a challenge that I’ve thought a lot about. When you have multiple teams in a large enterprise, your challenge is to standardize all that data across the organization, have it cross-referenced, and then also have it accessible to people that aren’t accustomed to working in these systems. I think that’s a really interesting challenge and it sounds like Bruce has done some stuff around this so I’d be interested to see
Bruce Speyer: We have spent a lot of time establishing common terminology and common definitions of terms. The catalyst for much of this is our expansion to Google Cloud. We have had to translate what we do in AWS, Azure, and our own data centers into the language of the Google Cloud Platform. It has also forced us to map roles and responsibilities and to document the expected performance for each service. We’ve gotten a lot of value within the organization by having this shared terminology and there is more visibility into IT. It didn’t happen overnight; we’ve been working on this for 7-8 years.
Q4. Both the engineer’s perspective and the business perspective have come together in this conversation. Related to observability, has there been a process or tool that has helped you do something differently in your business?
Bishak Adukkath: So, we are a low-latency trading platform and most of our systems have to respond within milliseconds. Most of our trading compliance checks are based on the performance of the system. Our custom dashboards are used by our internal application team, and we have also created dashboards that make sense for the business people. Our dashboards are being monitored by C-level people who are using them to make business decisions. It depends on the application, but many can be charted this way, and, for us, the performance of our system makes sense for business people to make certain decisions on the fly. I’ve seen it perform in a way that we can place more orders for more compliance checks. Also, our lower latency allows partners to better forecast trades for each week.
Austin Parker: The nice thing about having an observability product is that you get to use it against yourself. First, we have a bunch of engineers working on deploying code, building new releases, and trying to rapidly iterate on a product. That gives us some good feedback and we can build the thing that we want to use. I think that means we’ve optimized pretty well for a lot of developer workflows. We have a very safe-to-plummet culture and people can release whenever they want and then we get almost immediate feedback. At the same time, we designed this from the ground up. If something breaks in production, it emits all the telemetry we would need, and we’re easily able to find these weird, black swan events. We’re able to project, for instance, when a customer’s API limit might be up. And that is obviously helpful for driving business conversations.
Q5. Are advances in AI and machine learning creating self-healing systems? Will we get there? Are you game for it? Or is it too scary to think about just setting it and forgetting it, going home for the weekend, and not having anyone on call? What are your thoughts?
Anoush Najarian: There is an anomaly detection model that was actually pretty powerful at detecting anomalies and some AI that can make sense of things, but it is not integrated into the business enough to just let it loose. I don’t think we’re there on a technical level and on a “faith” level. And it has to come in that order. Maybe in five years, we’ll be there.
Austin Parker: I’m a skeptic. Today, you can achieve what looks like AI simply by doing a simple regression analysis against the last 30 minutes of data and then comparing it to a known out. Of course, this is being packaged and marketed already. A true computer learning system that is able to do things that a human couldn’t do isn’t in the cards for at least 15 years.
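To make the "simple regression" point concrete, here is a minimal sketch, assuming one latency sample per minute over a 30-minute window. It fits a least-squares line to the window, extrapolates one step ahead, and flags the new observation if it falls outside a tolerance band. The function names, the synthetic data, and the three-standard-deviation tolerance are all assumptions; this is not any vendor's algorithm.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_line(ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept for x = 0..n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    x_mean = (n - 1) / 2
    y_mean = mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def is_anomalous(window: List[float], observed: float, tolerance: float = 3.0) -> bool:
    """Flag `observed` if it strays too far from the trend line's next value."""
    slope, intercept = fit_line(window)
    predicted = slope * len(window) + intercept
    residuals = [y - (slope * x + intercept) for x, y in enumerate(window)]
    spread = max(pstdev(residuals), 1e-9)  # avoid a zero-width band
    return abs(observed - predicted) > tolerance * spread


# Synthetic window: 30 one-minute latency samples drifting slowly upward.
window = [100 + 0.5 * i + (1 if i % 2 == 0 else -1) for i in range(30)]
normal = is_anomalous(window, 116.0)   # close to the extrapolated trend
spike = is_anomalous(window, 160.0)    # far above the extrapolated trend
```

Nothing here learns anything beyond the last window, which is the skeptic's point: a trend line plus a threshold can look like intelligence on a dashboard.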
Bishak Adukkath: There are places where we already employ programming that looks like AI/ML but they are not truly self-healing systems. To be frank, I don’t even understand exactly what a “complete self-healing system” means. System performance is dependent on system behavior which depends on how you built it. Today, I don’t think a system is going to be able to fix all the problems you created.
Bruce Speyer: The closest we are getting, and one thing that is getting a lot of attention at the federal and state government level, is RPA (robotic process automation). They are not willing to go “all the way” but it’s a low-bar incremental approach for organizations to move towards ML/AI.