Some topics are both endless and endlessly confusing. Over the years I’ve done the “IOPS talk” under the banner of a number of different employers while representing quite different solutions. One constant, though, is that the topic is complex. Another is that the devil is in the details, and it is an area where deliberately opaque marketing has lots of room to fudge things. One aspect of this topic that is particularly confusing is the relationship between IOPS and latency and why, somewhat counterintuitively, the two do not necessarily scale linearly. Since this specific area isn’t one where a ton has been written, I thought it was worth an entry. Before diving into it, it’s probably a good idea to level set on some terms:
1 – IOPS – Input/Output operations per second. Simply put, computers deal with moving data around. In and out of memory, in and out of storage, in and out of peripherals, etc. IOPS is a way to put a measurement around this movement of data. Notice that there is absolutely nothing inherent in the term that indicates volume, nor is there any distinction made about the ratio of inputs to outputs; that’s important. It’s a very simple metric really. The only thing implicit in IOPS is that there is some number of both “inputs” and “outputs” happening in one second.
2 – Latency – latency is the amount of time that passes between the request being made of the storage system for some data and that data actually being delivered. Inside that period sits the entire voodoo of a storage subsystem. When the question is asked “what contributes to storage subsystem latency?”, the real answer is “everything”. With magnetic disks, some things to keep in mind are rotational latency, seek time and lookup logic overhead. With solid state disks some things to keep in mind are memory chip response time, controller efficiency and overall storage health (extremely important when dealing with write latency on an SSD that has been in service for a while). These measurements are just for the individual disk elements though. The latency of an entire storage subsystem is also dependent on a myriad of other factors: cache quantity and efficiency, controller speed and configuration, physical and logical protocols, disk protection schemes, file system type and efficiency and on and on.
3 – Throughput – disk throughput is the actual data transfer rate expressed in megabytes per second (MB/s) or even in gigabytes per second (GB/s). Contributing factors here are pretty much everything covered above under latency. All of it is related since we are ultimately talking about how much data can be written to, or read from the storage system in one second.
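Since throughput is ultimately just IOPS multiplied by the I/O size, the relationship between these three definitions can be sketched in a few lines of Python. The numbers here are hypothetical, purely to show the arithmetic:

```python
# Hypothetical illustration: throughput (bytes/s) = IOPS * I/O size (bytes)
def throughput_mb_s(iops, io_size_bytes):
    """Sustained throughput in MB/s (using 1 MB = 1,000,000 bytes for round math)."""
    return iops * io_size_bytes / 1_000_000

# The same system doing 10,000 IOPS moves very different volumes of data
# depending on the I/O size of the workload.
print(throughput_mb_s(10_000, 4_000))   # small 4KB I/Os   -> 40.0 MB/s
print(throughput_mb_s(10_000, 64_000))  # larger 64KB I/Os -> 640.0 MB/s
```

This is why a quoted IOPS figure means little without the I/O size (and, as we’ll see below, the latency) alongside it.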
In addition to the above definitions, some “rule of thumb” metrics are useful to keep in mind in terms of the typical performance that one can expect from common disk types. I’ll also try to cover just a bit of how these devices actually work as a refresher:
1 – 7200 RPM SATA or NL-SAS Drives – 70-80 IOPS – traditional magnetic media hard disks are comprised of a stack of metal platters spinning at a high rate of speed. Disk heads capable of reading changes in magnetic polarity, and converting them to an electrical signal, fly over the surface of the platters as they spin, moving in towards the center or out towards the edge as needed based on the location of the data being requested. The surface of the disk is logically organized into tracks and sectors. Tracks can be thought of as concentric circles from the innermost to the outermost edge, like the grooves on a record, and sectors are blocks of data within the tracks. A sector is 512 bytes of contiguous data. Magnetic media hard disks are ranked and categorized by a few key metrics. The most important one is their rotational speed (as it relates to performance, anyhow – capacity is the most obvious metric but typically has only a tangential relation to performance). In our case here the rotational speed is 7200 revolutions per minute, or 120 revolutions per second. Another important characteristic of note is the seek time, which is how long it takes the disk heads to move in and out across the surface of the disk. Seek time is actually a multi-dimensional metric. The average seek time is a general measurement that typically expresses the time it takes for the head to seek across roughly one third of the disk surface. The detailed calculation would be the time it takes to seek to each track, cumulatively, divided by the number of tracks on the surface. The maximum seek time is the time it takes the head to move from the outermost track on the disk surface to the innermost track. The last characteristic worth noting is the disk protocol. The protocol defines the physical method by which the disk connects (cabling, signaling, encoding, etc.) and the logical structure of the data it transfers.
In our case we are talking about “Serial ATA (SATA)” and “near-line serial attached SCSI (NL-SAS)” disks. In other words, cheap commodity disks.
2 – 15,000 RPM FC or SAS Drives – 150-180 IOPS – everything said above applies here except these disks rotate faster (15,000 revolutions per minute) and typically connect using either the Fibre Channel (FC) or serial attached SCSI (SAS) protocols.
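Those rule-of-thumb IOPS figures for rotating disks fall out of simple arithmetic on seek time and rotational latency. A minimal sketch, assuming typical average seek times of roughly 9ms for a 7200 RPM commodity drive and 3.5ms for a 15K drive (the seek figures are assumptions, not from any specific datasheet):

```python
# Back-of-the-envelope IOPS estimate for a rotating disk: each random IO
# costs one seek plus, on average, half a revolution of rotational latency.
def est_iops(rpm, avg_seek_ms):
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution, on average
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000 / service_time_ms               # IOs that fit in one second

print(round(est_iops(7_200, 9.0)))   # ~76 IOPS, inside the 70-80 rule of thumb
print(round(est_iops(15_000, 3.5)))  # ~182 IOPS, around the top of the 150-180 range
```

The estimate ignores caching, command queueing and sequential access patterns, all of which can push real numbers higher, but it shows why the rules of thumb sit where they do.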
3 – SSD – 3500+ IOPS – solid state disks are a whole different ball game. None of the detail above applies. Instead of a set of spinning platters and flying heads, solid state disks use an array of memory chips and a memory controller. These memory chips are NAND flash chips, which basically means that their electrical charge state can be changed by passing electrical current through a chemical layer, and this state is held even when no power is being applied. So these memory chips hold their data after being powered off. The memory in a solid state disk is accessed a page at a time, with pages typically being 4KB of data. One interesting and important complexity is that the chips cannot be overwritten in place. They can only be read, written and erased. Once a page has been written to, in order to be used again for new data, it must be erased. Because the gate that keeps the state of the cells constant is chemical, SSD pages must be erased in larger groups called erase blocks. Keeping it simple, the erase block size is a result of the physical packaging of the SSD. For manufacturing efficiency, a certain number of memory cells share a common substrate and, as a result of the way the chemical process for erasing works, all of the cells sharing the substrate must be erased together. Keep in mind that this is all semiconductor internals that we are discussing. If you look at an SSD you will see a number of memory chips surface mounted to a board. These chips are very dense. Inside each NAND flash chip are multiple individual layers. These layers are made up of the individual memory cells, each representing either one bit (for Single-Level Cell – SLC) or more (for Multi-Level Cell – MLC). So to recap: the cells are arranged into pages, and the pages are arranged into separate substrate groups (physically) which form the erase blocks.
What all of this means in the real world is that write performance can vary dramatically when discussing SSD, to the point where the difference can be orders of magnitude. Write degradation is the term applied to the disk “filling up” over time (and subsequent writes requiring a read/erase/write cycle), and this leads to “write amplification,” where a write operation actually takes quite a few more operations than expected as data is moved around in order to accommodate the new write request. Manufacturers have tried to mitigate this phenomenon by getting better and smarter with garbage collection (proactive handling of blocks that have been written but flagged for deletion), and operating systems have as well (through the implementation of things like TRIM, which also seeks to optimize the disk during idle time).
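Write amplification is commonly expressed as the ratio of bytes physically written to the NAND versus bytes the host actually asked to write. A toy illustration, using made-up numbers (the 128-page erase block is a hypothetical size chosen purely for the example):

```python
# Illustrative write-amplification factor: how much NAND traffic a host
# write generates once garbage collection has to shuffle data around.
def write_amplification(host_bytes, nand_bytes):
    return nand_bytes / host_bytes

page = 4_000              # one page, in this article's round decimal numbers
erase_block = 128 * page  # hypothetical erase block of 128 pages

# Worst case: updating a single page forces the whole erase block to be
# read, erased and rewritten.
print(write_amplification(page, erase_block))  # 128.0

# Best case: a fresh drive writing to empty pages moves only what was asked.
print(write_amplification(page, page))         # 1.0
```

Real factors like over-provisioning, TRIM and smarter garbage collection keep the factor far below the worst case, but never quite at 1.0 for a drive that has been in service for a while.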
The above is probably a bit more of a primer than I wanted and yet in reality barely scratches the surface; such is the nature of storage. There is a ton of complexity in both the physical and logical layers as well as in the individual design, architecture and implementation of a storage subsystem. But tying back to the original point of this entry, what is the best way to think about the relationship between latency and IOPS? Consider the following:
- 1 second is 1000ms
- A hypothetical storage subsystem is quoted as delivering 10000 4KB read IOPS at 50ms latency (this is a rare fully transparent vendor!)
What does the above really tell us? Let’s break it down. When presented with 4KB I/O read requests, the storage system was able to deliver 10000 IOPS, but with a 50ms latency. How does that latency figure shape what the real world performance of this array would look like?
1000 / 50 = 20
With a 50ms latency factor, this storage system is basically able to respond to requests 20 times in 1 second
10000 * 4000 = 40000000
It was determined that the storage system was able to deliver 10000 4KB read IOPS sustained. Calling 4KB a 4000 byte IO to keep the math round, there are 10000 of them in a second, or 40MB of data in one second. If we take this hypothetical scenario one step further, we can interpolate:
40000000 / 20 = 2000000
2000000 / 4000 = 500
We first determined that at 50ms latency the storage system is delivering data 20 times in a second. We then determined that based on being able to deliver 10000 4KB IOPS it delivered 40MB of data across those 20 transfer cycles. Doing some quick calculations we see that in the 50ms it performed 500 IOs in order to deliver the 2MB of data that would be required to sustain 40MB in 1 second.
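The steps above can be collected into a few lines of Python, using the same round decimal numbers as the worked example:

```python
# The worked example, step by step.
latency_ms = 50
iops = 10_000
io_size = 4_000  # 4KB, rounded to 4000 bytes as above

cycles_per_sec = 1000 // latency_ms                  # 20 response cycles per second
bytes_per_sec = iops * io_size                       # 40,000,000 bytes = 40MB/s
bytes_per_cycle = bytes_per_sec // cycles_per_sec    # 2,000,000 bytes per 50ms window
ios_per_cycle = bytes_per_cycle // io_size           # 500 IOs per cycle

print(cycles_per_sec, bytes_per_sec, bytes_per_cycle, ios_per_cycle)
# 20 40000000 2000000 500
```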
It’s worth noting that this is a very synthetic scenario. We are ignoring lots of real factors like caching and buffering and are also ignoring any possibility of variability. We are also simplifying to reads only. For these purposes that’s fine in order to reinforce the concept, however. Introducing these elements wouldn’t alter the basic concept.
The calculations above are pretty basic and largely common sense. What is interesting to consider, however, is how altering the storage system’s characteristics, optimizing for either throughput or latency, would impact the result.
Let’s keep the scenario the same: an application issuing steady 4KB reads only. The storage architect this time has been instructed to increase throughput. There are lots of ways this could be achieved, including adding spindles. No requirement was made for improved latency. The new storage configuration was able to deliver double the IOPS, for a sustained total of 20000. Because latency is still 50ms, there are still 20 transfers in a second; each transfer now delivers 4MB, for a total of 80MB per second.
This is a great improvement, but let’s say the application really needs smaller amounts of data more quickly. If the application needs data every 10ms, there is going to be serious real world impact if latency is 50ms. Consider a storage system reconfigured for latency. Possible options here might include optimizing cache, using higher performing spindles, a lower overhead protection scheme, etc. After reconfiguration it is found that latency is now 10ms, but the sustained IOPS actually dropped:
1000 / 10 = 100
We now have 100 transfer cycles in the 1 second sample period at 10ms latency. With the new spindle configuration, however, only 90 4KB read IOs can be executed per cycle. This translates to 9000 IOPS, or 36MB per second: lower throughput, and actually fewer IOs, than even the original scenario, but delivered at much lower latency (10ms vs 50ms) and thus a much better match to the application profile. Of course in some cases it might be required to optimize for both: very high IO and very low latency. These configurations would require a fast controller, multiple high speed interfaces, large intelligent cache, lots of fast spindles (or SSD) and a low overhead protection scheme (RAID 10 rather than RAID 5 or 6).
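Putting the three hypothetical configurations side by side makes the trade-off concrete. A small helper, using the same round numbers as the scenarios above:

```python
# Profile of a configuration: (transfer cycles/sec, IOs per cycle, MB/s),
# given sustained IOPS and latency, for steady 4KB (4000 byte) reads.
def profile(iops, latency_ms, io_size=4_000):
    cycles = 1000 // latency_ms          # how many times per second it responds
    ios_per_cycle = iops // cycles       # how much work each response carries
    mb_s = iops * io_size / 1_000_000    # total throughput
    return cycles, ios_per_cycle, mb_s

print(profile(10_000, 50))  # original:             (20, 500, 40.0)
print(profile(20_000, 50))  # throughput-optimized: (20, 1000, 80.0)
print(profile(9_000, 10))   # latency-optimized:    (100, 90, 36.0)
```

The latency-optimized build moves the least data overall, yet responds five times as often, which is exactly what the 10ms application profile needs.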
In conclusion, the main takeaway is that while latency and IOPS are related, they are not related in a directly linear way. There are lots of levers a storage architect can pull, and some will result in a configuration that responds very quickly but doesn’t move a ton of data with each response, whereas other configurations can be built that are slower to respond but respond with a much larger data set.