[Unbound-users] Unbound multithread performance: an investigation into scaling of cache response qps
wouter at NLnetLabs.nl
Tue Mar 23 08:38:03 UTC 2010
Because we haven't measured multithread performance scaling of unbound
before, I decided to try it myself. Also I was bored waiting late at
night for an audio broadcast from the IETF :-) The study is below.
Unbound multithread performance: an investigation into scaling of cache
response qps
Using a Solaris 5.11 quad-core machine[*], with four CPUs at 200 MHz, I
have tested unbound cache performance in various configurations. In
this test setup the Solaris machine is blazing away on its four CPUs (no
hyperthreading), and two other hosts (BSD and Linux) at 3 GHz are running
perf and sending queries for cache responses for www.nlnetlabs.nl at a
high rate. We count the number of queries per second that get answered.
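To make the measurement concrete, here is a minimal sketch in Python of what "send queries and count answers per second" amounts to. It is not the perf tool used in the test; the function names build_query and measure_qps are made up for illustration, and the sender waits for each answer before sending the next query, so it will not come close to saturating a server. A real load generator keeps many queries in flight.

```python
import socket
import struct
import time


def build_query(name: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record, class IN, with RD set."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id & 0xFFFF, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def measure_qps(server: str, port: int, name: str, seconds: float = 1.0) -> float:
    """Send queries one at a time and return answered queries per second."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(0.5)
    answered = 0
    query_id = 0
    start = time.monotonic()
    try:
        while time.monotonic() - start < seconds:
            query_id = (query_id + 1) & 0xFFFF
            sock.sendto(build_query(name, query_id), (server, port))
            try:
                reply, _ = sock.recvfrom(4096)
            except socket.timeout:
                continue  # lost or late answer: not counted
            # Count the reply only if its id matches the query just sent.
            if len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == query_id:
                answered += 1
    finally:
        sock.close()
    return answered / (time.monotonic() - start)
```

Calling measure_qps with the address and port of the resolver under test and the name www.nlnetlabs.nl gives a rough single-client figure; the numbers in this post come from two much faster senders running in parallel.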
The various configurations are with the builtin mini-event (select(2)
based), and with libevent-1.4.12-stable (using evport). Also pthreads,
Solaris threads and non-threaded (fork(2)) operation are used. The
unbound config file contains some minimum statements to make it
accessible from the network - an access control list and interface
statements - and also num-threads, which is set to 1, 2, 3 and 4 in turn.
It was observed that the threads all seem to handle about an even load
in the tests. So real multi-threading is happening. In this test it is
very easy to saturate the machine using the two senders; otherwise
this test becomes a lot trickier.
Table: total qps for all threads together.
Configuration            1 core   2 cores   3 cores   4 cores
select and pthreads        8450     14100     16100     18600
select and solaristhr      8600     13800     15800     17500
select and no threads     10000     17800     19800     22800
evport and pthreads        8400     13600     15900     18100
evport and solaristhr      8500     14100     16000     18600
evport and no threads      9700     17300     19600     22300
The performance scales up fairly neatly as multi-threading goes. For
every configuration a slower-than-linear speedup is observed, which
suggests lock contention in the underlying operating system network
stack. There is only one network card, after all, and the CPUs have to
lock and synchronise with it. The Solaris threads are a little faster
than pthreads when combined with evport (a Solaris-specific socket
management system call). No threads is even faster (but of course
fragments the cache), by about 20%, and its advantage increases slightly
as the number of cores increases (from 15% to 23%). The evport call is
a little bit slower than select, but since it is not bound by the
1024-descriptor limit of select, it remains useful for high capacity
configurations.
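The scaling claims can be checked directly against the table. The following is a small Python sketch that recomputes speedup, per-core efficiency and the no-threads advantage; the qps values are copied from the table, while the helper names speedup, efficiency and advantage are just illustrative.

```python
# qps figures copied from the table in this post; columns are 1 to 4 cores.
QPS = {
    "select and pthreads":   [8450, 14100, 16100, 18600],
    "select and solaristhr": [8600, 13800, 15800, 17500],
    "select and no threads": [10000, 17800, 19800, 22800],
    "evport and pthreads":   [8400, 13600, 15900, 18100],
    "evport and solaristhr": [8500, 14100, 16000, 18600],
    "evport and no threads": [9700, 17300, 19600, 22300],
}


def speedup(row):
    """Throughput relative to the single-core figure of the same row."""
    return [q / row[0] for q in row]


def efficiency(row):
    """Speedup divided by core count; 1.0 would be linear scaling."""
    return [s / n for n, s in enumerate(speedup(row), start=1)]


def advantage(fast, slow):
    """Percentage by which `fast` exceeds `slow`, per core count."""
    return [100.0 * (f / s - 1.0) for f, s in zip(fast, slow)]


if __name__ == "__main__":
    for name, row in QPS.items():
        sp = ", ".join(f"{s:.2f}x" for s in speedup(row))
        eff = ", ".join(f"{e:.0%}" for e in efficiency(row))
        print(f"{name:22s} speedup {sp}   efficiency {eff}")
    adv = advantage(QPS["evport and no threads"], QPS["evport and pthreads"])
    print("no threads vs pthreads (evport):",
          ", ".join(f"{a:.0f}%" for a in adv))
```

For example, select with pthreads reaches about 2.2 times its single-core rate on four cores, which is roughly 55% of linear scaling.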
To increase performance further, it seems the place to work at is the
network driver or network stack.
[*] This machine has been donated by RIPE NCC and has mostly been used
for System/OS interoperation testing. It turned out to be a good machine
to expose certain race conditions that did not show up on regular
Intel/Linux or BSD systems. If you happen to have somewhat exotic
machinery around we would welcome your donation.