Introduction:
Datastore space monitoring is usually neglected.
Admins tend to keep an eye on storage IOPS instead, as it is more dynamic and has direct performance implications for VMs.
Another reason for neglecting datastore space utilization is operational silos. I’ve seen it in many places: system admins depend on storage admins to monitor space utilization on their storage boxes, while storage admins depend on the NOC team for 24/7 monitoring.
Eventually, monitoring space utilization is forgotten.
This is actually the gap that I’m trying to cover with this “Datastore Monitoring Dashboard for NOC”.
The question is: what could really go wrong in the 12 hours when storage and system admins are out of the office and the NOC team is responsible for monitoring datastore space utilization?
What could fill up a couple of TB of datastore space quickly enough to cause an outage? Is that even possible?
The answer is that there are many possible causes. It’s rare, I have to admit (that’s why it’s usually neglected or forgotten by ops teams). However, the impact of a full datastore is high, and it will cause production outages.
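To put the "12 hours" question in numbers, here is a minimal sketch of a time-to-full estimate. It assumes a constant growth rate, which real snapshot or thin-disk growth rarely follows, and the function name and the sample figures are hypothetical, chosen only for illustration.

```python
def hours_until_full(capacity_gb: float, used_gb: float, growth_gb_per_hour: float) -> float:
    """Return the hours left before a datastore is full at a constant growth rate."""
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing, so it never fills up
    free_gb = max(capacity_gb - used_gb, 0.0)
    return free_gb / growth_gb_per_hour


# Hypothetical example: a 2 TB datastore with 1.5 TB used,
# growing by 50 GB per hour (for instance, from snapshot deltas).
print(hours_until_full(capacity_gb=2048, used_gb=1536, growth_gb_per_hour=50))
```

With these made-up numbers the datastore would fill in roughly ten hours, which is well inside a single overnight shift.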
True Story:
Would a datastore running out of space cause a network outage?
Yes. My customer had this incident on the datastore hosting their NSX Edges: the Edges were suspended because of a snapshot from their backup solution, which caused a company-wide network outage.
You can imagine now how critical it is to keep an eye on datastore space utilization.
Another thing that can fill a datastore suddenly is VMDK thin provisioning, but I’m not going to go into detail about that in this post.
The Dashboard:
After discussing the incident with my customer, the first thing I thought of was looking for an out-of-the-box dashboard for datastore monitoring. I searched the dashboards in Aria Ops using multiple keywords, such as Datastore, Capacity, and Network Operation.
I found a couple of dashboards for capacity and utilization but nothing specifically for monitoring. Even under the “Network Operations Center” folder, you can find three dashboards for ESXi but nothing for datastore capacity.
Hence, I agreed with my customer to create a dashboard with the following criteria:
1- Simple: straightforward, with color coding for the NOC team.
2- Dynamic: the NOC team can select red datastores and then get details about them.
3- Quick remedies: we also decided to include some clues for the NOC team on quick actions that may help avoid an outage.
How to import the Dashboard?
First of all, you will have to download the zip files for the dashboard and for the custom view used by this dashboard from the VMware by Broadcom Sample Exchange, link here.
Follow the steps in the link to import the custom view and the dashboard into your Aria Ops instance.
How to use the Dashboard?
The first widget in the dashboard is a “Space Utilization Heat Map”. The color index is straightforward, and the heat map is grouped by vCenter Server and refreshes every 300 seconds. All thresholds and refresh rates can be adjusted based on your enterprise policies.
Each green, yellow, or red box represents a datastore. Selecting one of the boxes in the heat map shows the details of that datastore in the next widget.
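The color coding and vCenter grouping of the heat map can be sketched roughly as follows. This is not how Aria Ops implements the widget. The 75% and 90% thresholds, the field names, and the sample inventory are all assumptions made for illustration, so substitute the values from your own policies.

```python
from collections import defaultdict

# Assumed thresholds (percent used); not taken from the dashboard itself.
YELLOW_PCT = 75.0
RED_PCT = 90.0


def color_for(used_pct: float, yellow: float = YELLOW_PCT, red: float = RED_PCT) -> str:
    """Map a space-utilization percentage to a heat map color."""
    if used_pct >= red:
        return "red"
    if used_pct >= yellow:
        return "yellow"
    return "green"


def heat_map(datastores: list) -> dict:
    """Group (datastore name, color) pairs by vCenter Server."""
    grouped = defaultdict(list)
    for ds in datastores:
        used_pct = 100.0 * ds["used_gb"] / ds["capacity_gb"]
        grouped[ds["vcenter"]].append((ds["name"], color_for(used_pct)))
    return dict(grouped)


# Hypothetical inventory, for illustration only.
sample = [
    {"vcenter": "vc-01", "name": "ds-edge", "capacity_gb": 2048, "used_gb": 1945},
    {"vcenter": "vc-01", "name": "ds-apps", "capacity_gb": 4096, "used_gb": 2048},
    {"vcenter": "vc-02", "name": "ds-db", "capacity_gb": 1024, "used_gb": 820},
]
print(heat_map(sample))
```

Keeping the thresholds as parameters mirrors the point above: they are a policy decision, not a fixed property of the dashboard.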
In the next widget you can find details about Datastore performance and space utilization.
Speaking of quick remedies, in this widget you can find two pieces of information that can help your NOC team take quick action until L2 support is contacted.
– Snapshots: a NOC team with the right permissions can delete snapshots if they are taking up a lot of space.
– Orphaned disks: if there is an unused disk on this datastore, it can be deleted or migrated to save space as well.
Which snapshots can you delete, and which VMs are consuming the most space? That’s what you can check in the next widget, which lists all VMs with their attached disks, disk sizes, provisioning type (thin or thick), etc.
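If the VM list is exported from that widget, the same triage can be sketched in a few lines. The record layout below (name, used_gb, snapshots, size_gb) is a hypothetical export format, not the actual schema of the custom view, and the 50 GB cutoff is an arbitrary example.

```python
def largest_snapshots(vms, min_size_gb=0.0):
    """Return (VM name, snapshot name, size in GB) rows, largest snapshot first."""
    rows = [
        (vm["name"], snap["name"], snap["size_gb"])
        for vm in vms
        for snap in vm.get("snapshots", [])
        if snap["size_gb"] >= min_size_gb
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def top_consumers(vms, count=3):
    """Return the names of the VMs using the most datastore space."""
    ranked = sorted(vms, key=lambda vm: vm["used_gb"], reverse=True)
    return [vm["name"] for vm in ranked[:count]]


# Hypothetical VM list for one datastore, for illustration only.
vms = [
    {"name": "edge-01", "used_gb": 120, "snapshots": [{"name": "backup-tmp", "size_gb": 85}]},
    {"name": "app-01", "used_gb": 600, "snapshots": []},
    {"name": "db-01", "used_gb": 900, "snapshots": [{"name": "pre-patch", "size_gb": 210}]},
]
print(largest_snapshots(vms, min_size_gb=50))
print(top_consumers(vms, count=2))
```

Sorting only tells you where the space is. Whether a given snapshot is safe to delete still depends on who created it and why.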
Finally, check out the Live! dashboards, which can complement the NOC story for IT, here.
Solutions Architect, Cloud and Datacenter.