How do you identify I/O hotspots and how do you address them?
How can I/O hot spots best be identified and what technology is out there to help IT managers to manage them in a virtualised environment?
Personally, a combination of three things comes into play here. Firstly, proper planning of the LUN/RAID group layout, one which considers the applications and workload types rather than just handing a pool of IO and capacity to a hypervisor (we're not there yet, unless you count some upcoming vendors such as Tintri and Nutanix, who are built from the ground up for virtualisation). Proactive avoidance of hotspots always alleviates the degree to which you need to be reactive.
Secondly, an array which offers good API interaction with your hypervisor. Being able to translate a suffering application to a specific LUN ID quickly is crucial to fast resolution for me. Many vendors give you good hypervisor visibility into the storage for mapping a VM back through the abstracted layers to RAID groups and LUN names, but being able to look at the storage GUI and identify which VMs reside on a given LUN is also massively helpful.
Thirdly, a good storage vendor will have the array intelligence to let you set threshold alerts for things such as queue depth, service time/disk latency and utilisation. Being able to respond to that alert and quickly identify the application VM causing the contention lets you decide intelligently how to deal with the issue when you don't have the luxury of just adding spindles, and that is a major benefit (enter point number two).
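To make the threshold-alerting idea concrete, here is a minimal sketch in Python. The metric names and threshold values are purely illustrative assumptions, not any vendor's API:

```python
# Illustrative sketch: evaluating per-LUN metrics against alert thresholds.
# Metric names and limits below are made-up examples, not a vendor API.

THRESHOLDS = {
    "queue_depth": 32,        # outstanding I/Os per LUN
    "service_time_ms": 20.0,  # average disk latency
    "utilisation_pct": 80.0,  # percentage of time the disks are busy
}

def check_lun(lun_id, metrics):
    """Return a list of (metric, value, limit) breaches for one LUN."""
    return [
        (name, metrics[name], limit)
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    ]

alerts = check_lun("LUN_007", {
    "queue_depth": 48, "service_time_ms": 35.2, "utilisation_pct": 91.0,
})
for metric, value, limit in alerts:
    print(f"LUN_007: {metric}={value} exceeds {limit}")
```

An array's own alerting does this internally; the value of the approach is that a breach points you at a specific LUN, which (per point two) you can then map back to a VM.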
There are also a number of third-party vendors whose tools have both hypervisor and array awareness and can help correlate between the two (such as SolarWinds since their Tek-Tools acquisition), or even VMware's own vCenter Operations Manager (provided VMware is your hypervisor of choice and your storage vendor has worked with VMware to be reported on in this way, which many have).
In terms of how you address these hotspots, there are a few methods. They vary from simply migrating VM workloads to less stressed LUNs/datastores/RAID groups, to storage-based QoS (which can get a little complicated unless you really know what your IO goals or thresholds are; do you want to rob Peter to pay Paul?), to better cache management (look for LUNs which are just filling cache and forcing it to flush with no benefit to response time, and disable caching for them so that LUNs which really benefit from cache can use it). Technologies which boost cache with either server-side PCIe SSD or SSD in the array can also level out hotspots, depending on the workload (and a number of fabric-based SSD caching appliances are coming along too). Then we have the marvel of auto-tiering, where at sub-LUN level chunks or blocks of data are dynamically moved between higher- and lower-performing tiers of disk. In reality the larger arrays have this down and some of the mid-market arrays are catching up, but the profile of your data, and how much of your capacity is driving the majority of IOPS, is a massive factor in whether it will actually benefit you; it has to align with the application workload profile itself.
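The sub-LUN auto-tiering idea can be sketched as a toy model: count I/Os per fixed-size chunk, then periodically promote the hottest chunks to the fast tier. The chunk size and promotion budget here are invented numbers for illustration only:

```python
# Toy model of sub-LUN auto-tiering: attribute each I/O to a fixed-size
# chunk, then a batch pass picks the hottest chunks for the SSD tier.
from collections import Counter

CHUNK_SIZE = 1 << 20          # 1 MiB chunks (assumed)
FAST_TIER_CHUNKS = 2          # how many chunks the fast tier can hold

access_counts = Counter()

def record_io(offset_bytes):
    """Attribute one I/O to the chunk containing this offset."""
    access_counts[offset_bytes // CHUNK_SIZE] += 1

def promotion_pass():
    """Batch pass: return the chunk IDs that deserve the fast tier."""
    return [chunk for chunk, _ in access_counts.most_common(FAST_TIER_CHUNKS)]

# Simulate a workload skewed toward two hot regions.
for off in [0, 100, 200, 5 << 20, 5 << 20, 5 << 20, 9 << 20]:
    record_io(off)

print(promotion_pass())  # the two busiest chunks
```

This also shows why the batch/real-time debate later in the thread matters: `promotion_pass()` only ever reflects activity since the last pass, so hot data is moved after the fact.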
Anyway, just a few thoughts from my perspective.
X-IO has its Continuous Adaptive Data Placement (CADP) technology to identify sub-LUN hotspots on disk and move them into the SSD storage tier in its Hyper ISE and Hyper ISE 7 products. Hybrid flash array vendors can do tiering as well. IMHO, the smaller the chunk of data, the quicker it is to move across storage tiers and/or into cache, but the more data activity tracking there has to be.
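The chunk-size trade-off is easy to see with back-of-the-envelope arithmetic. Assuming a hypothetical 16 bytes of metadata per chunk counter (an invented figure purely to illustrate the scaling):

```python
# Back-of-the-envelope: smaller tiering chunks mean more tracking metadata.
# The 16 bytes per chunk counter is an assumed figure for illustration.

def tracking_overhead_bytes(capacity_bytes, chunk_bytes, entry_bytes=16):
    """Metadata needed to keep one activity counter per chunk."""
    return (capacity_bytes // chunk_bytes) * entry_bytes

TiB = 1 << 40
for chunk in (1 << 30, 1 << 20, 1 << 12):   # 1 GiB, 1 MiB, 4 KiB chunks
    mib = tracking_overhead_bytes(100 * TiB, chunk) / (1 << 20)
    print(f"chunk {chunk:>10} B -> {mib:,.1f} MiB of tracking metadata")
```

Shrinking the chunk from 1 GiB to 4 KiB multiplies the tracking state by a factor of 262,144 on the same capacity, which is exactly the tension described above.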
Compellent, with its move to a 64-bit OS from its 32-bit forebear, seemed positioned to track access levels on small chunks of data, but hasn't announced any increase in tiering speed or granularity.
I haven't come across an auto-tiering solution that boasts real-time data movement. Most are based on a batch process, which makes no sense at all, as it is yesterday's 'hot' data that is being moved to the most expensive storage tier. As Chris points out, small-block data movement provides the highest efficiency, but the movement needs to occur as close to real time as possible with minimal overhead to I/O performance. [Disclosure: I work for Dot Hill Systems]
"movement needs to occur in as close to real-time as possible"
While in an ideal world this may be true, stressing your storage further at the very moment it is already stressed is a poor solution, and is why VMware have not made much progress with Storage DRS.
Hot blocks don't need to be moved in real time, because a good SAN will copy hot reads immediately to the cache. Hot writes get written directly to the fast tier, so no data needs to move at all.
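The read/write asymmetry described here can be sketched in a few lines. All the structures below are simplified stand-ins, not how any real array works internally:

```python
# Sketch of the point above: a read miss copies the block into cache,
# while a write lands on the fast tier directly, so no background data
# movement is required for either path.

cache = {}                          # block -> data, populated on read
fast_tier = {}                      # block -> data, written synchronously
slow_tier = {"blk1": "cold-data"}   # pre-existing data on the slow tier

def read(block):
    if block in cache:
        return cache[block]               # cache hit
    data = fast_tier.get(block) or slow_tier.get(block)
    cache[block] = data                   # promote the hot read into cache
    return data

def write(block, data):
    fast_tier[block] = data               # hot write goes straight to the fast tier
    cache[block] = data                   # keep the cache coherent

write("blk2", "hot-data")
print(read("blk1"), read("blk2"))  # prints "cold-data hot-data"; blk1 is now cached
```

Note that neither path ever moves a block between tiers after the fact, which is the contrast being drawn with batch auto-tiering.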
Although they are not great at tiering, NetApp have nice explanations of this when discussing their new hybrid aggregates with flash drives. It's not tiering, but they explain the data movement nicely.
In response to the OP: it depends on your SAN, your workload and your monitoring software. If you can give details of these then we may be able to help more. With a solution like Compellent, the answer is that you don't need to. With a more traditional SAN the answer varies massively: some have in-built reporting which tells you where problems exist, and some have great monitoring showing disk-level usage (EqualLogic and NetApp are good examples of this, both showing all stats for all drives). With some you may need a command line to find the answers, but in those cases analysing the workload from each connected system is sometimes easier and more efficient.
Since you mentioned virtual, all of the hypervisors allow you to monitor disk I/O at the VM, LUN and host level, both as throughput (usually indicating a bandwidth constraint) and as IOPS (possibly a disk constraint).
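Most hypervisors and operating systems expose these figures as cumulative counters, so the usual recipe is to take two samples and derive rates. A generic sketch, with field names that are illustrative rather than any specific hypervisor's API:

```python
# Generic sketch: turning two samples of cumulative disk counters into
# IOPS and throughput. "ops" and "bytes" are assumed field names, not
# taken from any particular hypervisor's statistics API.

def rates(sample_a, sample_b, interval_s):
    """Derive per-second rates from two cumulative counter samples."""
    iops = (sample_b["ops"] - sample_a["ops"]) / interval_s
    mib_s = (sample_b["bytes"] - sample_a["bytes"]) / interval_s / (1 << 20)
    return iops, mib_s

a = {"ops": 10_000, "bytes": 500 * (1 << 20)}
b = {"ops": 16_000, "bytes": 740 * (1 << 20)}
iops, mib_s = rates(a, b, interval_s=60)
print(f"{iops:.0f} IOPS, {mib_s:.1f} MiB/s")  # prints "100 IOPS, 4.0 MiB/s"
```

High MiB/s with modest IOPS points at a bandwidth constraint; high IOPS with low MiB/s (small blocks) is the pattern that tends to expose a spindle-bound hotspot.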