Getting started with AWS Cost Optimisation - The Inventory
As a Solution Architect at an AWS consulting company, it's hard to go for a few weeks without being asked by a client “How do we reduce costs in AWS?”. Well, we first need an Inventory.
The familiar question
As a Solution Architect at an AWS consulting company, it's hard to go for a few weeks without being asked by a client “How do we reduce costs in AWS?”. The client often isn’t asking this because they think AWS is inherently expensive (for the most part it’s not). Generally it’s because they don’t understand what makes up their AWS running costs, or they don’t have the information they need to make informed decisions.
The biggest trap I see people fall into is going straight into action: “Let’s just buy some Reserved Instances”, “Let’s get some Savings Plans”, or worse, “Let’s re-architect it all!”
While these are all valid future steps, we need to take a step back first. To start, we need information: we need to understand our environment and architecture deeply, so that in the future we can answer questions like these:
- Can we delete those snapshots?
- Do we need those instances anymore?
- Can we turn these instances off at night?
- Why are we using this service?
- What are these files in this bucket for?
As someone who is usually an external party when helping companies with cost optimisation, I get a unique perspective on these questions. After all, I can tell you that deleting unrequired EBS snapshots will save you some money, but if you don’t know whether you still need that data, we aren’t going to get very far.
This is why when starting off with cost optimisation (and it's important to remember that cost optimisation never ends, but that’s for a different blog post) we first focus on having all the information we need to make decisions. I like to call this making sure you have an Inventory.
So, what do we need in our inventory, how do we get that information, and what tools can help us?
What do we need in our inventory?
Try and keep it simple. For every type of resource we have, I like to try and have an answer to the following:
- What do we call it? (e.g., Server Name)
- What does it do, or what is its purpose in life? (e.g., Web Server for the public website)
- What environment is it for? (e.g., Production or Development)
- Who owns it? (e.g., an individual or a team)
- Do we have any monitoring or logging in place to gauge its usage?
Feel free to add any additional context you think is relevant to your environment as you are compiling this data. You probably already have a lot of this information but it's always surprising to find a system that everyone thought was deleted a year ago.
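To make the list above concrete, here is a minimal sketch of one way to record an inventory entry and spot the questions that are still unanswered. The class name `InventoryEntry`, its field names and the sample resources are all hypothetical; a spreadsheet with the same columns works just as well.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InventoryEntry:
    """One inventory row covering the five questions listed above."""
    name: str                          # What do we call it?
    purpose: Optional[str] = None      # What does it do?
    environment: Optional[str] = None  # Production, Development, ...
    owner: Optional[str] = None        # An individual or a team
    monitoring: Optional[str] = None   # Where its usage can be gauged, if anywhere
    notes: dict = field(default_factory=dict)  # Any extra context you want to keep

    def unanswered(self) -> list:
        """Return the questions we still cannot answer for this resource."""
        questions = ("purpose", "environment", "owner", "monitoring")
        return [q for q in questions if not getattr(self, q)]


# Hypothetical sample data: one well-understood server, one mystery box.
entries = [
    InventoryEntry("web-01", "Web server for the public website",
                   "Production", "Web Team", "CloudWatch"),
    InventoryEntry("legacy-batch-02", environment="Development"),
]

for entry in entries:
    gaps = entry.unanswered()
    if gaps:
        print(f"{entry.name}: still need {', '.join(gaps)}")
```

The entries with the longest list of gaps are usually the ones worth asking around about first.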
Well, that is going to take me forever!
Now to split off on a tangent for a few minutes: I know I said to try and have these answers for everything. That is important in the long run, but if you need to move quickly and get some quick wins, focus on what could actually move the needle.
Have a look at your last AWS invoice. What was the most expensive service? Start there. This lets us focus on making the biggest impact possible with the cost optimisation. After all, a 10% saving on $5,000 ($500) is a lot better than a 10% saving on $50 ($5).
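If you would rather not read the invoice by eye, the ranking can be sketched in a few lines. The example below assumes you already have a response in the shape returned by Cost Explorer's GetCostAndUsage API when grouped by the SERVICE dimension (for example via boto3's `get_cost_and_usage`); the function name `rank_services` and the sample figures are made up for illustration.

```python
from collections import defaultdict


def rank_services(response: dict) -> list:
    """Sum the cost per service across all periods, most expensive first."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative sample only. These are not real billing figures.
sample = {
    "ResultsByTime": [
        {
            "Groups": [
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "50.00", "Unit": "USD"}}},
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "5000.00", "Unit": "USD"}}},
            ]
        }
    ]
}

for service, cost in rank_services(sample):
    print(f"{service}: ${cost:,.2f} (a 10% saving would be ${cost * 0.10:,.2f})")
```

Whatever ends up at the top of that list is where the inventory effort should go first.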
How do we get that information?
This can be simple or hard depending on your environment. If everything is well documented, you have tagging in place, or you just have that one guy who knows everything, then this process should be fairly easy.
If you don’t have documentation, tagging or that smart guy, however, this is where, for the most part, you unfortunately just need to put in some leg work.
What tools can help us?
There are a few tools that can help with this information discovery:
- AWS Application Discovery Service - You install an agent onto your servers and it monitors network connections and processes to map out what the server is doing.
- Monitoring data in Amazon CloudWatch to gauge what the resource is doing (Is it a load balancer with no connections? Why does this server have 0% CPU usage when everyone thinks it's super important?)
- AWS Config records configuration data and changes of resources in AWS. This can help give you a bit of a history of the life of a resource.
- Infrastructure as Code such as CloudFormation and Terraform can be a good source of information if your workloads have been built using it. You can usually look at the comments in the code, or see who made a change (or created things) based on your source control history.
- Steampipe - this is an interesting third-party tool that allows you to query AWS resources using SQL queries
- Even your AWS CloudTrail logs can be helpful to see who last touched a resource
- AWS Systems Manager – This can provide an inventory of software installed on instances as well as other configuration data
All of these tools will generally only answer the technical investigation questions (what a server or resource is doing). Knowing why something exists will come down to business knowledge and figuring out who you need to ask.
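As a small illustration of the CloudWatch point above, here is a sketch of a check for the "0% CPU but super important" server. It assumes you have already fetched CPUUtilization datapoints in the shape of the `Datapoints` list returned by CloudWatch GetMetricStatistics. The function name `looks_idle`, the 2% threshold and the sample numbers are assumptions for illustration, not recommended values.

```python
def looks_idle(datapoints: list, cpu_threshold: float = 2.0) -> bool:
    """Return True if every datapoint's average CPU is below the threshold.

    An empty list returns False: missing data is a reason to investigate,
    not evidence that the resource is unused.
    """
    if not datapoints:
        return False
    return all(point["Average"] < cpu_threshold for point in datapoints)


# Hypothetical daily CPUUtilization averages (percent) for two servers.
busy_server = [{"Average": 41.5}, {"Average": 38.2}, {"Average": 44.0}]
quiet_server = [{"Average": 0.3}, {"Average": 0.2}, {"Average": 0.4}]

print("busy_server idle?", looks_idle(busy_server))
print("quiet_server idle?", looks_idle(quiet_server))
```

A result of True only tells you the server is quiet. Whether it can be turned off is still a question for whoever owns it.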
What's next?
Hopefully, after going over things and finding that classic hiding-in-the-closet server, we now have the information to support our decisions.
We can now move on to the next stages of cost optimisation (and I generally like to attack things in this order too). I’ll look to go into further detail on these in future blog posts, but here is the list I like to follow:
- AWS Cost Management Tools – Get these up and running early so you have even more information like the AWS Cost and Usage Report
- Tagging – start assigning all of this information we have collected against the resource itself. These tags can be used in the future for reporting, cost allocation and even automation decisions.
- Hygiene – Time to break out the garbage bags and get rid of everything you don’t need anymore. Release those unused Elastic IPs, delete those unused load balancers, terminate those left-over development instances. Now that you have the information, you can make informed decisions on whether you still need that data, system or service.
- Right Size – Focus on only provisioning what you need. In the cloud, like your electricity bill at home, you pay for every little bit you use and don’t pay for what you don’t use, so it is important to ensure we are only using what we need. Some examples of what to look for: oversized EC2 instances, incorrect scaling policies, over-provisioned storage.
- Automation – Can you turn off/on or destroy/recreate infrastructure and services as required? This works really well once we have identified what is a production workload and what is a development workload. Just turning your development servers off at night can make a huge impact on running costs.
- Contracted Savings – These are things like Savings Plans and Reserved Instances. You really only want to start looking at these once you have a really good idea of the infrastructure you require. No one wants to be contracted to a spend for a thing they realised they didn’t need anymore.
- Re-architecture – This is generally the highest effort thing to tackle, but can also have the highest reward. Taking advantage of things like serverless architectures can lead to very low running costs.
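As a bridge from the inventory into the Tagging step above, here is a minimal sketch of checking a resource's tags against the inventory questions. It assumes tags in the `[{"Key": ..., "Value": ...}]` list shape that EC2 uses. The required tag keys and the function name `missing_tags` are hypothetical, so substitute whatever naming standard you settle on.

```python
# Hypothetical tag keys mirroring the inventory questions.
REQUIRED_TAGS = ("Name", "Purpose", "Environment", "Owner")


def missing_tags(tags: list, required=REQUIRED_TAGS) -> list:
    """Return the required tag keys that are absent or have an empty value."""
    present = {tag["Key"] for tag in tags if tag.get("Value")}
    return [key for key in required if key not in present]


# Hypothetical tags on a single instance.
instance_tags = [
    {"Key": "Name", "Value": "web-01"},
    {"Key": "Environment", "Value": "Production"},
    {"Key": "Owner", "Value": ""},  # Present but empty, so it counts as missing
]

print(missing_tags(instance_tags))
```

Running a check like this across your resources gives a quick measure of how much of the inventory has actually made it onto the resources themselves.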