How do you identify I/O hotspots and how do you address them?

EvanUnrue

Re: How do you identify I/O hotspots and how do you address them?

Personally, a combination of three things comes into play here. Firstly, proper planning of the LUN/RAID group layout, which considers the applications and workload types rather than just a pool of IO and capacity handed to a hypervisor (we're not there yet, unless you count some up-and-coming vendors such as Tintri and Nutanix, who are built from the ground up for virtualisation). Proactive avoidance of hotspots always reduces the degree to which you need to be reactive.
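To make that planning step concrete, here's a back-of-the-envelope sizing sketch in Python. The write penalties and per-disk IOPS figure are the usual rules of thumb (RAID 10 = 2, RAID 5 = 4, RAID 6 = 6; a 15k drive around 180 IOPS), not vendor specifications, and the workload numbers are made up for illustration:

```python
# Illustrative RAID group sizing: translate front-end IOPS into back-end spindles.
# Write penalties and per-disk IOPS are common rules of thumb, not vendor specs.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def spindles_needed(frontend_iops, read_fraction, raid_type, per_disk_iops):
    """Return (back-end IOPS, spindle count) for a given workload and RAID type."""
    reads = frontend_iops * read_fraction
    writes = frontend_iops * (1 - read_fraction)
    backend_iops = reads + writes * WRITE_PENALTY[raid_type]
    disks = -(-backend_iops // per_disk_iops)  # ceiling division
    return backend_iops, int(disks)

# e.g. a 5000 IOPS workload at 70% read on RAID 5 with 15k drives (~180 IOPS each)
backend, disks = spindles_needed(5000, 0.70, "RAID5", 180)
print(f"~{backend:.0f} back-end IOPS -> {disks} spindles")
```

The point of the exercise is that the write penalty dominates: the same 5000 IOPS workload needs roughly 9500 back-end IOPS on RAID 5 but far fewer on RAID 10, which is exactly the kind of decision you want to make at layout time rather than after the hotspot appears.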

Secondly, an array which has good API integration with your hypervisor. Being able to translate a suffering application to a specific LUN ID quickly is crucial to fast resolution for me. Many vendors give you good hypervisor visibility into the storage for mapping a VM back through the abstracted layers to RAID groups and LUN names, but being able to look at the storage GUI and identify which VMs reside on a given LUN is massively helpful too.
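If VMware is your hypervisor, a minimal sketch of that VM-to-LUN translation using pyVmomi (the vSphere Python SDK) might look like the following. The vCenter hostname, credentials and VM name are placeholders; this walks a VM's datastores and, for VMFS volumes, prints the NAA identifier of each backing device:

```python
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Placeholder connection details; verify certificates properly in production.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    if vm.name != "suffering-vm":  # placeholder VM name
        continue
    for ds in vm.datastore:
        print("Datastore:", ds.name)
        # For VMFS datastores, each extent names the backing SCSI device (the LUN)
        if isinstance(ds.info, vim.host.VmfsDatastoreInfo):
            for extent in ds.info.vmfs.extent:
                print("  Backing LUN:", extent.diskName)  # e.g. an naa.6006... ID
view.Destroy()
Disconnect(si)
```

That NAA identifier is the thing you can then look up on the array side to find the LUN, RAID group and, from there, its noisy neighbours.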

Thirdly, a good storage vendor will have the array intelligence to let you set threshold alerts for things such as queue depth, service time/disk latency and utilisation. Being able to respond to that alert and quickly identify the application VM causing the contention allows you to decide intelligently how to deal with the issue; when you don't have the luxury of just adding spindles, that's a major benefit (enter point number two).
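You can do a rough host-side version of the same thing with standard Linux tooling. This sketch samples `iostat` (from sysstat) once and flags any device breaching latency or utilisation thresholds; the threshold values are illustrative, not recommendations:

```python
import subprocess

# Illustrative thresholds; tune to your own latency and utilisation targets.
LATENCY_MS = 20.0
UTIL_PCT = 90.0

# One 5-second sample; -y skips the since-boot averages (requires sysstat).
out = subprocess.run(["iostat", "-dxy", "5", "1"], capture_output=True, text=True).stdout
lines = [l for l in out.splitlines() if l.strip()]
header = next(l for l in lines if l.startswith("Device"))
cols = header.split()
for line in lines[lines.index(header) + 1:]:
    parts = line.split()
    vals = dict(zip(cols, parts))
    # Newer sysstat reports r_await/w_await; older versions have a single await column
    latency = max(float(vals.get(k, 0)) for k in ("await", "r_await", "w_await"))
    util = float(vals.get("%util", 0))
    if latency > LATENCY_MS or util > UTIL_PCT:
        print(f"hotspot candidate: {parts[0]} latency={latency}ms util={util}%")
```

The array-side alerting is still what you want as the source of truth, because host-side numbers include everything in the path; but a quick sample like this tells you fast whether the host is even seeing the contention.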

There are also a number of third-party vendors with both hypervisor and array awareness which can help correlate between the two (such as SolarWinds since their Tek-Tools acquisition), or even VMware's own vCenter Operations Manager (providing VMware is your hypervisor of choice and your storage vendor has worked with VMware to allow themselves to be reported on in this way, which many have).

In terms of how you address these hotspots, there are a few methods. The simplest is migrating VM workloads to less-stressed LUNs/datastores/RAID groups. There's also storage-based QoS (which can get a little complicated unless you really know what your IO goals or thresholds are: do you want to rob Peter to pay Paul?), and better cache management (look for LUNs which are just filling cache and forcing it to flush with no benefit to response time, and disable caching for them so that LUNs which really benefit from cache can use it). Technologies which boost cache with either server-side PCIe SSD or SSD in the array can also level out hotspots, depending on the workload (and there are a number of fabric-based SSD caching appliances coming along too). Then we have the marvel of auto-tiering, where at sub-LUN level chunks or blocks of data are dynamically moved between higher- and lower-performing tiers of disk. In reality the larger arrays have this down and some of the mid-market arrays are catching up; also, the profile of your data, and how much of your capacity is driving the majority of the IOPS, is a massive factor in whether it will actually benefit you, along with the application workload profile itself.
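For the first of those options on VMware, the migration itself is a one-call Storage vMotion. A minimal pyVmomi sketch, assuming the `si` service instance from the earlier snippet and placeholder VM/datastore names:

```python
from pyVmomi import vim

def storage_vmotion(si, vm_name, target_ds_name):
    """Relocate a VM's disks to a less-stressed datastore (Storage vMotion)."""
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine, vim.Datastore], True)
    vm = next(o for o in view.view
              if isinstance(o, vim.VirtualMachine) and o.name == vm_name)
    ds = next(o for o in view.view
              if isinstance(o, vim.Datastore) and o.name == target_ds_name)
    view.Destroy()
    spec = vim.vm.RelocateSpec(datastore=ds)
    return vm.RelocateVM_Task(spec)  # returns a Task you can monitor

# e.g. task = storage_vmotion(si, "suffering-vm", "quieter-datastore")
```

Of course, this only moves the problem unless the target really is quieter, which is why the identification half of the question matters as much as the remediation.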

Anyway, just a few thoughts from my perspective.
