On December 14, 2017, Per Bauer from TeamQuest led a webinar on How to Do Capacity Management in the Cloud. In case you missed the webinar, click here to see it now! Our Managing Director, Amanda Hendley, then held a Q&A with Per to dive further into the topic:
The major reason for trying to maintain cloud portability is to avoid dependencies and a lock-in scenario. But being provider agnostic also means that you will have to implement many primitive services yourself, rather than using the foundation that the cloud service provider offers as part of their package. Many times that turns out to be more expensive than the effects of the lock-in. Always evaluate the business case for maintaining portability before you make a decision.
2. One of the guiding principles for Capacity Management has been Proactivity – so should we stop those efforts now?
No, not stop. If you can be effective with proactive measures, you should use them. But since cloud services are closely associated with self-service provisioning models and those are “sacred”, you will lose most of the ability to proactively verify or approve provisioning. To make up for the lack of proper upfront sizing, you will have to focus more on reactive right-sizing and cleanup efforts. You will also benefit from doing such activities more frequently than before, as the potential reward is higher.
3. Should workloads that haven’t been refactored for cloud always stay on-prem?
Many of the business cases and ROI calculations that are published by cloud providers assume that the workloads have been refactored to be “cloud native”. If they haven’t, the business case will look different and the ROI will likely be less attractive.
With that realization in mind, you need to evaluate the individual workloads. Not all your legacy applications, designed to be hosted in your own datacenter, are identical. Some may offer a certain level of horizontal scaling and tiering and can probably be migrated as they are to start with. Others are monoliths that will be difficult to move over without a complete redesign.
4. Traditional capacity management has looked at capacity on a component level. How is cloud different?
Public cloud doesn’t offer visibility into the components that implement it. You get visibility into the relative utilization of the virtual resources assigned to an instance at a given time, but that’s the closest you get to component data. The smallest building block for capacity management becomes the resource consumption of microservices. Microservices can then be aggregated to describe the resource usage and charge for applications and business services.
5. Many of our clients think capacity management roles are dead due to the onset of cloud environments. How do we convince such clients?
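The aggregation described above can be illustrated with a small sketch. This is not taken from the webinar: the microservice names, the figures, and the mapping from microservice to application and business service are all hypothetical, and a real setup would pull consumption and charges from the provider's monitoring and billing exports.

```python
from collections import defaultdict

# Hypothetical per-microservice consumption for one billing period.
MICROSERVICE_USAGE = {
    "cart-api": {"cpu_hours": 120.0, "charge": 36.0},
    "checkout-api": {"cpu_hours": 80.0, "charge": 24.0},
    "search-api": {"cpu_hours": 200.0, "charge": 60.0},
}

# Hypothetical mapping: microservice -> (application, business service).
SERVICE_MAP = {
    "cart-api": ("webshop", "online-sales"),
    "checkout-api": ("webshop", "online-sales"),
    "search-api": ("catalog", "online-sales"),
}


def roll_up(usage, service_map, level):
    """Sum every metric per application (level=0) or business service (level=1)."""
    totals = defaultdict(lambda: defaultdict(float))
    for name, metrics in usage.items():
        group = service_map[name][level]
        for metric, value in metrics.items():
            totals[group][metric] += value
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {group: dict(metrics) for group, metrics in totals.items()}


if __name__ == "__main__":
    print(roll_up(MICROSERVICE_USAGE, SERVICE_MAP, level=0))  # per application
    print(roll_up(MICROSERVICE_USAGE, SERVICE_MAP, level=1))  # per business service
```

The same roll-up works for any additive metric, so memory-hours or storage could be added to the per-microservice records without changing the function.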
A discussion around the primary drivers for doing capacity management in the first place (safeguarding business critical services, managing seasonal demand, planning for high growth, cost optimization etc.) should be a good start. Ask the client to articulate how moving to the cloud specifically will address any of those challenges.
6. What are the key metrics to measure to determine cloud efficiency?
Relative utilization metrics of resources like CPU and memory are generally good indicators of whether the instance in use is right-sized. Some public cloud services offer a concept of “credits” where units of unused resources are available for subsequent bursts. Monitoring the rate at which credits are accrued and spent would also be a good indicator of efficiency.
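As a rough illustration of using relative utilization as a right-sizing indicator, the sketch below labels an instance from its peak CPU and memory utilization. The 20% and 80% thresholds are assumptions chosen for the example, not recommendations from the webinar, and a production check would typically use percentiles over a longer window.

```python
def classify_instance(cpu_samples, mem_samples, low=20.0, high=80.0):
    """Label an instance from its peak relative utilization (in percent).

    The busiest resource decides the label: an instance is only called
    oversized when both CPU and memory stay below the `low` threshold.
    """
    peak = max(max(cpu_samples), max(mem_samples))
    if peak < low:
        return "oversized"
    if peak > high:
        return "undersized"
    return "right-sized"


if __name__ == "__main__":
    # Hypothetical utilization samples (percent) for one instance
    print(classify_instance([5, 10, 12], [15, 18, 11]))
```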
Over time, the cost associated with hosting an application or shared service component in the public cloud is the ultimate measure of efficiency. Adopting a standard of properly tagging each resource allocation will allow you to determine and track the cost of different services.
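A minimal sketch of how consistent tagging enables cost tracking per service is shown below. The line items and the `service` tag key are hypothetical. Real billing exports differ per provider, so treat the record layout as an assumption.

```python
from collections import defaultdict

# Hypothetical billing line items; the "service" tag key is an assumed convention.
LINE_ITEMS = [
    {"resource": "vm-1", "cost": 40.0, "tags": {"service": "webshop"}},
    {"resource": "vm-2", "cost": 25.0, "tags": {"service": "catalog"}},
    {"resource": "disk-7", "cost": 5.0, "tags": {"service": "webshop"}},
    {"resource": "vm-9", "cost": 12.5, "tags": {}},  # resource missing its tag
]


def cost_by_tag(items, tag_key="service"):
    """Sum cost per tag value; untagged resources are grouped separately."""
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get(tag_key, "UNTAGGED")] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    for service, cost in sorted(cost_by_tag(LINE_ITEMS).items()):
        print(f"{service}: {cost:.2f}")
```

Keeping an explicit `UNTAGGED` bucket makes gaps in the tagging standard visible, since any cost that lands there cannot be attributed to a service.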
7. From your experience, what is the number one driver for cloud adoption in companies?
Based on my experience, the biggest driver for cloud adoption is cost. A majority of companies move to the cloud with the expectation that they will be able to provide the required IT support cheaper than before.
8. What are the main activities for optimizing capacity in the public cloud?
To optimize capacity in the public cloud, you should focus on three things:
Work on the deployment model to optimize the amount of capacity that is allocated each time you provision a new instance.
Regularly identify under-utilized or dormant resources to reclaim or decommission.
Build a historical trail of resource allocation and consumption for your services. This allows you to understand seasonality patterns across a business cycle – enabling you to further fine tune your provisioning.
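The third activity can be sketched as a simple seasonal index computed from a historical trail of consumption. The monthly figures below are hypothetical, and dividing each period by the overall average is only one of several ways to express seasonality.

```python
from statistics import mean


def seasonal_index(usage_by_period):
    """Return each period's usage relative to the overall average (1.0 = average)."""
    average = mean(usage_by_period.values())
    return {
        period: round(value / average, 2)
        for period, value in usage_by_period.items()
    }


# Hypothetical monthly average vCPU consumption for one service
HISTORY = {"Oct": 100.0, "Nov": 150.0, "Dec": 200.0, "Jan": 50.0}

if __name__ == "__main__":
    print(seasonal_index(HISTORY))
    # {'Oct': 0.8, 'Nov': 1.2, 'Dec': 1.6, 'Jan': 0.4}
```

An index well above 1.0 for a period suggests provisioning more for that part of the business cycle, while an index well below 1.0 points to a candidate window for scaling down.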
9. What metrics do I need to collect for the public cloud instances?
Utilization metrics for relative comparisons and as efficiency indicators. Metrics related to charges for tracking against budgets.
10. How do you ensure the cloud provider is not giving you less capacity than you are paying for, for example by using fractional CPUs and charging for full CPUs? What metrics can be used to keep the cloud provider honest?
I think it will be difficult to prove anything like that and I’ve never heard of real-life examples where this has been the case. All resource metrics available for public cloud come from APIs controlled by the cloud provider. If there are systematic inaccuracies, the APIs will likely not expose them. We will simply have to trust our cloud providers to provide the agreed capacity.
11. How do you draw a line between a performance anomaly consuming capacity for the wrong reasons and a genuine capacity need due to business growth within the cloud? Is there monitoring inherent to the cloud that helps distinguish the two for better capacity management?
The best way to spot performance problems related to software releases is to do time-based comparisons of the workloads to spot any anomalies or differences. A simple overlay chart is many times enough. You need to combine this with an analysis of customer behaviors, where you focus on understanding the seasonal shifts and how business cycles impact the workload profiles. This allows you to normalize the data and make sure that you are making relevant comparisons.
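One way to make such a comparison relevant is to normalize resource consumption by business volume before comparing two periods. The sketch below assumes CPU seconds and transaction counts are available per interval. The 15% tolerance and the sample figures are illustrative assumptions.

```python
from statistics import mean


def cpu_per_transaction(cpu_seconds, transactions):
    """Normalize consumption by business volume, interval by interval."""
    return [cpu / tx for cpu, tx in zip(cpu_seconds, transactions)]


def compare_periods(baseline, current, tolerance=0.15):
    """Compare average CPU per transaction between two periods.

    Each period is a dict with equally long 'cpu_seconds' and
    'transactions' lists. Returns (relative_change, verdict).
    """
    base = mean(cpu_per_transaction(baseline["cpu_seconds"], baseline["transactions"]))
    cur = mean(cpu_per_transaction(current["cpu_seconds"], current["transactions"]))
    change = (cur - base) / base
    if change > tolerance:
        return change, "efficiency regression"
    return change, "consistent with volume"


if __name__ == "__main__":
    before = {"cpu_seconds": [100.0, 200.0], "transactions": [10, 20]}
    # Three times the load at the same cost per transaction: business growth
    growth = {"cpu_seconds": [300.0, 600.0], "transactions": [30, 60]}
    # Same load at a higher cost per transaction: possible release problem
    regression = {"cpu_seconds": [150.0, 300.0], "transactions": [10, 20]}
    print(compare_periods(before, growth))
    print(compare_periods(before, regression))
```

If total consumption rises while consumption per transaction stays flat, the increase follows business volume. A jump in consumption per transaction after a release is the pattern worth investigating.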
12. User-perceived performance and cost (in tension) are usually the most important business concerns. Can I really predict the relationship between performance (response time) and usage of a specific type of image?
It’s hard to say a categorical “yes” to that without knowing the specifics of the application in focus. But most of the time you can predict the scale out effect of application growth fairly accurately, assuming you have empirical data about the behavior of the application under stress to draw from. Cost will most of the time be a straight product of the scale out.
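As a rough sketch of that reasoning, the example below estimates the number of instances needed for a projected load and derives cost from it. It assumes linear scale-out, a per-instance throughput figure taken from stress testing, 20% headroom, and a flat hourly rate. All of these are illustrative assumptions, and applications with shared bottlenecks will not scale this cleanly.

```python
import math


def instances_needed(expected_rps, rps_per_instance, headroom=0.2):
    """Instances required to serve expected_rps, keeping some headroom.

    rps_per_instance is the throughput one instance sustained at an
    acceptable response time in stress tests (empirical input).
    """
    usable_rps = rps_per_instance * (1.0 - headroom)
    return math.ceil(expected_rps / usable_rps)


def monthly_cost(instances, hourly_rate, hours=730):
    """Cost as a straight product of the scale-out (assumed flat hourly rate)."""
    return instances * hourly_rate * hours


if __name__ == "__main__":
    count = instances_needed(expected_rps=900, rps_per_instance=250)
    print(count, monthly_cost(count, hourly_rate=0.10))
```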