
Havana Design Summit: Benchmarking Swift

At the OpenStack Summit last week, we had conversations with many users about how to best benchmark Swift. In the developer summit sessions, benchmarking and performance were recurring topics with lots of great input from both developers and users. We also heard from HP, Intel, and Seagate about how they conduct benchmarking of Swift and what they learned in the process. This post provides an overview of why benchmarking is important for any Swift cluster, how to approach it, and some of the key takeaways from the summit in this area. It also provides an overview of SwiftStack Bench (ssbench), the Swift benchmarking tool we recently open sourced.

Benchmarking OpenStack Swift

Depending on your goal, you may want a Realistic Benchmark or a Targeted Benchmark. Both approaches require benchmarking tools that scale, so that load generation itself never becomes the bottleneck. Because of Swift’s fantastic horizontal scalability, that can be very challenging: benchmarking a large Swift cluster can mean generating tens of thousands of concurrent requests across many benchmarking servers to provide hundreds of gigabits per second of client throughput. Both approaches also benefit from fine-grained collection of total request latency, time-to-first-byte latency, and the Swift transaction ID for every request. But they do have different goals, and that should inform both load generation and results analysis.

Realistic Benchmarking asks, “What happens when the cluster sees a particular client load?” or “How many clients, ops-per-second, or how much throughput can my cluster really support?” You are more interested in simulating a production workload than in isolating a particular action. This kind of benchmarking can benefit from simulating a parametric mixed client workload (proportions of object sizes, operation types, etc.) or from replaying a workload based on some kind of capture or “trace” from another cluster.

With Targeted Benchmarking, you want to generate a very specific, controlled load on the cluster to identify problems and test potential improvements. Data collected during a synthetic workload will be less noisy than data from a more realistic, mixed workload, which makes this approach useful for testing the effectiveness of tweaks to networking, node hardware, tuning/configuration, and Swift code.
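To make the distinction concrete, here is a minimal sketch of a targeted benchmark written against python-swiftclient: a fixed number of worker threads each issue small-object PUTs and record per-request latency. The auth URL, credentials, container name, and sizing constants are all placeholders, and a real tool (like ssbench, below) handles far more than this.

```python
# A minimal sketch of a targeted benchmark: CONCURRENCY worker threads,
# each issuing small-object PUTs and recording per-request latency.
# AUTH_URL, USER, KEY, and the sizing constants are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from swiftclient.client import Connection

AUTH_URL = 'http://proxy.example.com/auth/v1.0'   # placeholder
USER, KEY = 'test:tester', 'testing'              # placeholders
CONCURRENCY, REQUESTS_PER_WORKER = 64, 500
PAYLOAD = b'x' * 4096                             # 4 kB "small" objects

def worker(worker_id):
    conn = Connection(authurl=AUTH_URL, user=USER, key=KEY)
    latencies = []
    for i in range(REQUESTS_PER_WORKER):
        start = time.time()
        conn.put_object('bench', 'w%d-obj%d' % (worker_id, i), PAYLOAD)
        latencies.append(time.time() - start)
    return latencies

if __name__ == '__main__':
    Connection(authurl=AUTH_URL, user=USER, key=KEY).put_container('bench')
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = [pool.submit(worker, w) for w in range(CONCURRENCY)]
        latencies = [lat for f in futures for lat in f.result()]
    print('requests: %d  avg latency: %.3fs'
          % (len(latencies), sum(latencies) / len(latencies)))
```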

SwiftStack Bench (ssbench)

At SwiftStack, our first customer benchmarking requirements called for realistic benchmarks, so we wrote a scalable benchmarking tool we named SwiftStack Bench (ssbench). At its heart, ssbench either manages the run of a mixed-workload benchmark “scenario” or generates a report from the results. The data collected for every request is quite rich and includes the start time, total duration, time-to-first-byte (for GETs), and the Swift transaction ID for the request. Because there are many different ways to slice and dice the results, reporting has a lot of room for improvement, but the rich, raw results are saved so that previously-run benchmarks may benefit from future reporting improvements.
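To give a sense of that per-request data, here is a rough sketch of how such a record could be captured for a single GET with python-swiftclient; the dictionary layout is illustrative only and is not ssbench’s actual internal format. Streaming the response body lets us observe time-to-first-byte, and Swift returns its transaction ID in the X-Trans-Id response header.

```python
# Rough sketch of capturing the kind of per-request record ssbench stores,
# here for a single GET. The dictionary layout is illustrative only.
import time
from swiftclient.client import Connection

def timed_get(conn, container, obj):
    start = time.time()
    # Streaming the body in chunks lets us observe time-to-first-byte.
    headers, body = conn.get_object(container, obj, resp_chunk_size=65536)
    first_byte = None
    total_bytes = 0
    for chunk in body:
        if first_byte is None:
            first_byte = time.time()
        total_bytes += len(chunk)
    end = time.time()
    return {
        'start': start,
        'total_latency': end - start,
        'first_byte_latency': (first_byte - start) if first_byte else None,
        'bytes': total_bytes,
        'trans_id': headers.get('x-trans-id'),  # Swift's transaction ID
    }
```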

You can also perform targeted benchmarking with ssbench by using a very simple scenario. For example, you could target small-object PUTs with a scenario containing only small files and only PUT operations, or target large-object GETs with only large files and only GET operations; a sketch of such a scenario file follows below. Sam was able to demonstrate the benefit of per-disk I/O thread-pooling in the object-server with a GET workload using ssbench. We will soon extend the available operation types in ssbench to cover metadata POST operations as well. For folks with metadata-intensive workloads, that will enable investigation of how Swift handles metadata when, for example, the XFS inode size is adjusted, along with other metadata optimizations.
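As an illustration, here is a sketch that writes out a PUT-only, small-object scenario file. The field names reflect the scenario format as I recall it (name, sizes, initial_files, operation_count, crud_profile, user_count), so consult the ssbench README for the authoritative schema; the specific sizes and counts are arbitrary examples.

```python
# Sketch: generate a PUT-only, small-object ssbench scenario file.
# Field names reflect the scenario format as I recall it; consult the
# ssbench README for the authoritative schema.
import json

scenario = {
    "name": "Small-object PUT-only scenario",
    "sizes": [
        {"name": "small", "size_min": 4096, "size_max": 65536},
    ],
    "initial_files": {"small": 100},
    "operation_count": 10000,
    # CRUD weights: all Creates (PUTs); no Reads, Updates, or Deletes.
    "crud_profile": [100, 0, 0, 0],
    "user_count": 50,
}

with open('small_put_only.scenario', 'w') as f:
    json.dump(scenario, f, indent=2)
```

You would then point ssbench-master at the resulting file (see ssbench-master --help for the exact run-scenario invocation).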

The ssbench project is open source, and we look forward to developing it cooperatively with the Swift community. To that end, I led a discussion session at this month’s Design Summit to gather requirements and suggestions for benchmarking Swift. We got a lot of great feedback, captured in the session Etherpad, from current ssbench users, other tool authors, and various Swift users.

The session generated many new feature requests for ssbench as well as other points/questions:

  • Perhaps use Tsung for load-generation?
  • Enable replaying a past load based on “trace” data from a live cluster.
  • Generate a parametric benchmark scenario from live cluster “trace” data to develop more accurate loads.
  • Peter Portante from Red Hat mentioned successfully using Performance Co-Pilot to monitor a cluster during benchmarking.
  • In his presentations, Mark Seger from HP demonstrated using collectl to monitor a cluster during benchmarking.

Here’s an example ssbench report. Note that I used small objects since I only had a single 12-core benchmark server, and that the cluster in question had a node down during the benchmark. I also cut out the “Worst latency TX ID” column so the report would fit better in this blog post.

Medium test scenario
Worker count:  10   Concurrency: 800  Ran 2013-04-21 18:06:16 UTC to 2013-04-21 18:06:39 UTC (22s)

% Ops    C   R   U   D       Size Range       Size Name
 77%   % 26  60   7   7        1 kB -  16 kB  tiny
 23%   % 26  60   7   7      100 kB - 200 kB  small
---------------------------------------------------------------------
         26  60   7   7      CRUD weighted average

TOTAL
       Count: 99997  Average requests per second: 4475.8
                            min       max      avg      std_dev  95%-ile                 
       First-byte latency:  0.007 -   1.509    0.053  (  0.036)    0.082  (all obj sizes)
       Last-byte  latency:  0.009 -   1.941    0.160  (  0.127)    0.400  (all obj sizes)
       First-byte latency:  0.007 -   1.509    0.051  (  0.034)    0.070  (    tiny objs)
       Last-byte  latency:  0.009 -   1.559    0.133  (  0.109)    0.312  (    tiny objs)
       First-byte latency:  0.010 -   1.494    0.061  (  0.041)    0.100  (   small objs)
       Last-byte  latency:  0.017 -   1.941    0.248  (  0.140)    0.494  (   small objs)

CREATE
       Count: 25889  Average requests per second: 1158.8
                            min       max      avg      std_dev  95%-ile                 
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (all obj sizes)
       Last-byte  latency:  0.097 -   1.941    0.286  (  0.128)    0.520  (all obj sizes)
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (    tiny objs)
       Last-byte  latency:  0.097 -   1.559    0.253  (  0.110)    0.442  (    tiny objs)
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (   small objs)
       Last-byte  latency:  0.146 -   1.941    0.397  (  0.121)    0.589  (   small objs)

READ
       Count: 60191  Average requests per second: 2722.9
                            min       max      avg      std_dev  95%-ile                 
       First-byte latency:  0.007 -   1.509    0.053  (  0.036)    0.082  (all obj sizes)
       Last-byte  latency:  0.009 -   1.613    0.096  (  0.071)    0.231  (all obj sizes)
       First-byte latency:  0.007 -   1.509    0.051  (  0.034)    0.070  (    tiny objs)
       Last-byte  latency:  0.009 -   1.521    0.070  (  0.039)    0.103  (    tiny objs)
       First-byte latency:  0.010 -   1.494    0.061  (  0.041)    0.100  (   small objs)
       Last-byte  latency:  0.017 -   1.613    0.183  (  0.082)    0.296  (   small objs)

UPDATE
       Count:  6915  Average requests per second: 310.5
                            min       max      avg      std_dev  95%-ile                 
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (all obj sizes)
       Last-byte  latency:  0.088 -   1.516    0.252  (  0.125)    0.483  (all obj sizes)
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (    tiny objs)
       Last-byte  latency:  0.088 -   1.516    0.218  (  0.102)    0.394  (    tiny objs)
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (   small objs)
       Last-byte  latency:  0.121 -   1.409    0.367  (  0.124)    0.568  (   small objs)

DELETE
       Count:  7002  Average requests per second: 316.4
                            min       max      avg      std_dev  95%-ile                 
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (all obj sizes)
       Last-byte  latency:  0.041 -   1.522    0.144  (  0.094)    0.275  (all obj sizes)
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (    tiny objs)
       Last-byte  latency:  0.041 -   1.522    0.145  (  0.093)    0.277  (    tiny objs)
       First-byte latency:  N/A   -   N/A      N/A    (  N/A  )    N/A    (   small objs)
       Last-byte  latency:  0.045 -   1.502    0.143  (  0.094)    0.271  (   small objs)

What Does Swift Look Like To a Drive?

Tim Feldman from Seagate shared some interesting results of targeted benchmarking, seen from the hard drives’ perspective. For the most part, the drives saw the expected load. The volume of reads was lower than predicted, but the operating system buffer cache is the likely culprit; that is to be expected if the volume of benchmark data isn’t large enough to cause buffer-cache thrashing.

Mr. Feldman pointed out that “runt writes” will be a problem for drives with native 4k sector sizes. A “runt write” occurs when fewer than all eight of the 512-byte logical pieces of a single 4k physical sector are written; the drive must then perform a read-modify-write operation instead of a simple write. When drives used in Swift clusters move to a 4k sector size (and this will happen soon), we’ll need to make sure the filesystem and OS operate correctly on 4k sectors rather than legacy 512-byte sectors.
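If you’re curious what your drives report today, the kernel exposes both sector sizes through sysfs; a drive with 512-byte logical and 4,096-byte physical sectors is exactly the kind that pays that read-modify-write penalty on runt or misaligned writes. The device name below is just an example.

```python
# Quick check of logical vs. physical sector sizes for a block device.
# A 512-byte-logical / 4096-byte-physical drive will read-modify-write
# whenever fewer than all eight 512-byte pieces of a physical sector
# are written. The device name is just an example.
def sector_sizes(device='sda'):
    base = '/sys/block/%s/queue/' % device
    with open(base + 'logical_block_size') as f:
        logical = int(f.read())
    with open(base + 'physical_block_size') as f:
        physical = int(f.read())
    return logical, physical

logical, physical = sector_sizes('sda')
warning = '  (watch for runt-write penalties)' if physical > logical else ''
print('logical=%d  physical=%d%s' % (logical, physical, warning))
```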

A relatively small number of disk sectors were accessed many more times than the majority. This would make sense for filesystem and/or container metadata, but it warrants further investigation and possibly optimization.

What Can We Learn From Some Benchmarking?

Jiangang Duan from Intel detailed his team’s results from targeted benchmarking with their open-source tool, COSBench. They found that on their storage nodes, when buffer-cache pressure caused filesystem metadata to be evicted from memory, read performance suffered by 83%: the average read operation dropped from 122 KB to 34 KB, and both read requests per second and throughput got worse. Mr. Duan then showed that setting the Linux kernel tunable vfs_cache_pressure to a very low value almost entirely mitigated this performance drop by keeping inode data cached under memory pressure.
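For reference, vfs_cache_pressure defaults to 100, and lower values make the kernel more reluctant to evict dentry and inode caches. Here is a quick sketch of checking and lowering it; the value 10 is only an example, not a number from the talk, and the change requires root.

```python
# Sketch: inspect and lower vm.vfs_cache_pressure (default 100; lower
# values make the kernel keep dentry/inode caches longer under memory
# pressure). Requires root; 10 is only an example value. Equivalent to
# `sysctl -w vm.vfs_cache_pressure=10`.
PATH = '/proc/sys/vm/vfs_cache_pressure'

with open(PATH) as f:
    print('current vfs_cache_pressure:', f.read().strip())

with open(PATH, 'w') as f:
    f.write('10')
```

To make the setting persistent across reboots, you would add the corresponding vm.vfs_cache_pressure line to /etc/sysctl.conf.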

Mr. Duan noted that their servers had four bonded 1-Gb/s NICs and seemed to utilize the bonding when transmitting data, but not when receiving it. He said this could use some further investigation and potential optimization.

Finally, a slow disk that isn’t actually dead can not only hurt the average latency of requests to that disk, but also worsen the latency of all requests to the node by up to 25 percent. I don’t want to steal his thunder, but Sam will be writing a brief blog post about his talk, which addressed this very problem.

The Power of Fine-Grained Benchmark Metrics

Mark Seger from HP drove home the point that fine-grained tracking of each benchmark client request’s results is critical. Like ssbench, his closed-source benchmarking tool suite, “getput”, tracks response latencies and Swift transaction IDs for each request.

Being able to report on latencies over time lets you spot odd things that happened briefly during a run; average numbers for the whole run can’t show you that. Generating a latency histogram shows you the distribution of latencies, letting you see a long tail if you have one.
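As a small illustration of why the raw data matters, here is a sketch that reduces a list of per-request latencies, like those ssbench or getput save, to a coarse histogram and a 95th-percentile figure; the bucket edges are arbitrary.

```python
# Sketch: reduce raw per-request latencies to a coarse histogram and a
# 95th percentile. `latencies` stands in for the raw data a tool like
# ssbench or getput saves; the bucket edges are arbitrary.
def summarize(latencies):
    latencies = sorted(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    edges = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, float('inf')]
    counts = [0] * len(edges)
    for lat in latencies:
        for i, edge in enumerate(edges):
            if lat <= edge:
                counts[i] += 1
                break
    for edge, count in zip(edges, counts):
        bar = '#' * (60 * count // len(latencies))
        print('<= %8s s  %7d  %s' % (edge, count, bar))
    print('95%%-ile: %.3f s' % p95)
```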

Mr. Seger noted that Swift’s scaling is excellent: with multiple clients, performance grows close to linearly. Benchmarking with small objects scales well, but with larger objects, CPU or bandwidth on the benchmark node becomes the bottleneck. This highlights my earlier point that your benchmarking tool needs to scale out so it doesn’t hit a bottleneck before your cluster does.

When comparing targeted benchmark results for GETs of 1k, 10k, and 100k objects, Mr. Seger found that the requests per second for the 10k objects were noticeably lower. Further investigation revealed that only object sizes between 7,888 and 22,469 bytes were affected. It turned out that Nagle’s algorithm was interfering because the maximum segment size (MSS) over the physical NIC between the client and Pound was much smaller than the MSS of the loopback device between Pound and the Swift proxy-server. This arbitrarily added latency to requests in a certain size range. Disabling Nagle’s algorithm with TCP_NODELAY on internal sockets within Swift may therefore be a good idea.
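For reference, disabling Nagle’s algorithm is a one-line setsockopt call on each connected socket; the sketch below shows the option itself, not where in Swift or its WSGI servers it would actually need to be set.

```python
# Disabling Nagle's algorithm on a TCP socket: small writes are sent
# immediately instead of waiting to be coalesced into larger segments.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(('proxy.example.com', 8080))   # placeholder address and port
```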

A particular 6-second PUT had two of three writes return in under one second, but the third object-server held up the response to the client. Mr. Seger suggested optimizing client latency by returning success to the client as soon as the PUT quorum is satisfied.
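To make the suggestion concrete, here is a sketch of the idea, emphatically not Swift’s actual proxy-server code: with three replicas the write quorum is two, so the client response could be sent as soon as two backend PUTs succeed while the straggler finishes in the background.

```python
# Sketch of "respond at quorum" for a replicated PUT. An illustration of
# the idea only, not Swift's actual proxy-server implementation.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def quorum_size(replicas):
    return replicas // 2 + 1          # 2 of 3 for the common case

def put_with_quorum(backend_puts):
    # backend_puts: one callable per object-server, each performing a PUT.
    needed = quorum_size(len(backend_puts))
    pool = ThreadPoolExecutor(max_workers=len(backend_puts))
    pending = {pool.submit(put) for put in backend_puts}
    successes = 0
    while pending and successes < needed:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        successes += sum(1 for f in done if f.exception() is None)
    pool.shutdown(wait=False)         # let any straggler finish on its own
    return successes >= needed        # respond to the client at quorum
```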

Conclusion

Benchmarking is an important part of any Swift deployment. With many tools to choose from and best practices only just emerging, it can be a daunting project. This post provided an overview of the available tools, best practices, and some lessons learned at the OpenStack Summit. If you have questions or would like to discuss benchmarking Swift, feel free to reach out to us here at SwiftStack.

Darrell Bishop

Architect, SwiftStack
OpenStack Swift Core Team

Categories

OpenStack, OpenStack Swift, PlanetOpenstack, SwiftStack
