Measuring the Performance of High Speed Storage
In developing the PacketRAID recorder, The Packet Company invested considerable time and effort in evaluating different storage configurations. Below, we outline our approach, one of the tools we developed, and the results we achieved.
Sequential vs Random
Most work on high-performance storage focuses on maximising the number of IOPS - random 4k reads and writes, typically in a 70:30 read:write split. This is a rough approximation of the load produced by either a large workgroup of users or a 'typical' database server, and many strategies have been developed to optimise for this workload, usually involving caching and/or tiered storage architectures.
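For reference, this conventional random-IOPS workload is usually exercised with a dedicated benchmarking tool such as fio rather than dd. The invocation below is an illustrative sketch only - the target file, size, queue depth and runtime are arbitrary choices, not part of our PacketRAID methodology:
fio --name=randrw70 --filename=/raid/testfile --size=10G --rw=randrw --rwmixread=70 \
    --bs=4k --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based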
Recording and replaying network traffic, however, does not fit this mould: the access patterns are purely sequential. Further, these workloads are very intolerant of delay and jitter, requiring sustained, even throughput across the entire capacity of the array. In fact, they differ from the 'typical' in almost every way, and this impacts every level of the design.
Read Testing
Sequential read performance can be tested trivially with multiple runs of dd copying a large file into /dev/null, varying the block size and other parameters to find the settings that give the best throughput. dd itself puts a minimal load on the system, as does the code behind /dev/null, ensuring that the throughput figures accurately reflect the capabilities of the storage and the associated system configuration.
dd if=/raid/hugefile of=/dev/null bs=64M iflag=direct
Note the use of iflag=direct: this bypasses the kernel's page cache and substantially increases throughput.
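A convenient way to explore the parameter space is to sweep the block size in a small shell loop; the sizes below are purely illustrative, and the best value depends on the array, controller and kernel:
for bs in 1M 4M 16M 64M 256M; do
    echo "bs=$bs"
    dd if=/raid/hugefile of=/dev/null bs=$bs iflag=direct 2>&1 | tail -n 1   # keep just dd's summary line
done
Because iflag=direct bypasses the page cache, each pass reads from the array itself rather than from memory, so repeated runs remain comparable.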
Write Testing
When testing write performance, it is most common to copy from /dev/zero to the storage array, since this device provides a convenient source of data, e.g.:
dd if=/dev/zero of=/raid/hugefile bs=64M oflag=direct
This is fine for arrays of hard drives, where /dev/zero is much faster than the disks themselves, but it becomes an issue as overall storage performance increases. With larger arrays of SSDs, as in our PacketRAID recorder, the time the kernel spends zeroing dd's buffer on every read from /dev/zero is significant, and it reduces dd's throughput markedly. This makes it difficult to be confident that the storage is optimally tuned, or to measure its speed accurately.
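A quick way to gauge how hard this limit bites on a particular machine is to read /dev/zero straight into /dev/null, taking the storage out of the picture entirely (the block size and count below are arbitrary; the run just needs to be long enough to give a stable figure):
dd if=/dev/zero of=/dev/null bs=64M count=160
If the rate reported here is close to, or below, the throughput the array should sustain, then /dev/zero rather than the storage is what dd is really measuring.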
Custom Device
To address this issue, we decided that the best approach would be to develop a new Linux device - /dev/phony - that worked in a similar way to /dev/null, but for reads. It turned out to be surprisingly simple, and we are grateful to Valerie Henson for permission to base our work on her example device driver code. Now, whenever a read request is received, it is immediately satisfied without any actual data being copied.
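The essence of the driver fits in a few lines. The sketch below is a minimal illustration of the idea, built around the kernel's misc-device interface; it is not The Packet Company's actual source (which is available on GitHub, as noted below). The read handler simply claims that the requested number of bytes has been delivered, without touching the user buffer:
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>

/* Claim the whole request was satisfied without copying any data,
 * so throughput is limited only by syscall and dd overhead. */
static ssize_t phony_read(struct file *file, char __user *buf,
                          size_t count, loff_t *ppos)
{
        return count;
}

static const struct file_operations phony_fops = {
        .owner  = THIS_MODULE,
        .read   = phony_read,
        .llseek = noop_llseek,
};

static struct miscdevice phony_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "phony",               /* udev creates /dev/phony */
        .fops  = &phony_fops,
};

static int __init phony_init(void)
{
        return misc_register(&phony_dev);
}

static void __exit phony_exit(void)
{
        misc_deregister(&phony_dev);
}

module_init(phony_init);
module_exit(phony_exit);

MODULE_LICENSE("GPL");
Since the handler never writes into the buffer dd hands it, the bytes that subsequently land on the array during a write test are simply whatever that buffer already contained - which, for throughput measurement, is exactly what we want.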
As expected, testing /dev/phony with dd and /dev/null yields spectacular results, since no data is actually being copied:
# dd if=/dev/phony of=/dev/null bs=2047M count=1000000
1000000+0 records in
1000000+0 records out
2146435072000000 bytes (2.1 PB) copied, 0.291501 s, 7.4 PB/s
This confirms our earlier assumption that dd itself is lightweight and does not impact performance when used for testing.
Performance Results
With a large RAID array of SSDs, we have consistently achieved 10 GB/s or better for both reading and writing, depending on the precise configuration, with almost perfectly linear scaling from one disk to many. The benefits of using /dev/phony start to show from around 3-4 GB/s upwards, where /dev/zero's overhead becomes clearly visible.
We ran a test on one of our Ivy Bridge development platforms to compare the effect of using /dev/zero vs /dev/phony as the data source:
# numactl --cpunodebind=1 dd if=/dev/zero of=/data/hugefile bs=1G oflag=direct
dd: writing `/data/hugefile': No space left on device
11422+0 records in
11421+0 records out
12263205371904 bytes (12 TB) copied, 2561.83 s, 4.8 GB/s
That's pretty respectable but, as we are about to see, it gives no real indication of the true speed of the storage:
# numactl --cpunodebind=1 dd if=/dev/phony of=/data/hugefile bs=1G oflag=direct
dd: writing `/data/hugefile': No space left on device
11422+0 records in
11421+0 records out
12263205371904 bytes (12 TB) copied, 1105.25 s, 11.1 GB/s
When the overhead imposed by /dev/zero is removed, the measured throughput more than doubles!
Download /dev/phony
The latest version of the code for /dev/phony is available on GitHub under the GPL. Please download and enjoy! We welcome comments and suggestions.