[Unbound-users] Unbound multithread performance: an investigation into scaling of cache response qps

Tue Mar 23 08:38:03 UTC 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Because we haven't measured multithread performance scaling of unbound
before, I decided to try it myself.  Also I was bored waiting late at
night for an audio broadcast from the IETF :-)  The study is below.

Unbound multithread performance: an investigation into scaling of cache
response qps

Using a Solaris 5.11 quad-core machine[*], with four CPUs at 200 MHz, I
have tested unbound cache performance in various configurations.  In
this test setup the Solaris machine is blazing away on its four CPUs (no
hyperthreading), and two other hosts (BSD and Linux) at 3 GHz are running
perf and sending queries for cache responses for www.nlnetlabs.nl at a
high rate.  We count the number of queries per second that get answered.

The various configurations are with the builtin mini-event (select(2)
based), and with libevent-1.4.12-stable (using evport).  Also pthreads,
solaris-threads and non-threaded (fork(2)) operation are used.  The
unbound config file contains a minimal set of statements to make it
accessible from the network - an access control list and interface
statements - plus num-threads, which is set to 1, 2, 3 and 4 in turn.

It was observed that the threads all seem to handle about an even load
in the tests.  So real multi-threading is happening.  In this test it is
very easy to saturate the machine using the two senders; without that,
the test would become a lot trickier.

Table, qps in total for all threads together.

Configuration ------- 1 core --- 2 cores --- 3 cores --- 4 cores
select and pthreads     8450      14100       16100       18600
select and solaristhr   8600      13800       15800       17500
select and no threads  10000      17800       19800       22800
evport and pthreads     8400      13600       15900       18100
evport and solaristhr   8500      14100       16000       18600
evport and no threads   9700      17300       19600       22300

The performance scales up fairly neatly as multi-threading goes.  For
every configuration a slower-than-linear speedup is observed, indicating
locks in the underlying operating system network stack.  There is only
one network card, after all, and the CPUs have to lock and synchronise
with it.  The solaris-threads are a little faster than pthreads when
combined with evport (a Solaris-specific socket management system call).
No threads is even faster (but of course fragments the cache), by about
20%, and its advantage over pthreads increases slightly as the number of
cores increases (from 15% to 23% with evport).  The evport call is a
little bit slower than select, but since it is not bound by the
1024-descriptor limit of select, it remains useful for high capacity
configurations.
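
For anyone who wants to check the arithmetic behind the paragraph above,
here is a small sketch that recomputes the speedup per configuration and
the no-threads advantage from the table.  The qps numbers are copied
from the table; the function names (speedup, advantage_percent, report)
are only illustrative and are not part of unbound or of the test setup.

```python
# qps figures copied from the table above: total qps for 1, 2, 3 and 4 cores.
QPS = {
    "select and pthreads":   [8450, 14100, 16100, 18600],
    "select and solaristhr": [8600, 13800, 15800, 17500],
    "select and no threads": [10000, 17800, 19800, 22800],
    "evport and pthreads":   [8400, 13600, 15900, 18100],
    "evport and solaristhr": [8500, 14100, 16000, 18600],
    "evport and no threads": [9700, 17300, 19600, 22300],
}


def speedup(row):
    """Throughput of each column relative to the 1-core figure of the row."""
    return [q / row[0] for q in row]


def advantage_percent(fast, slow):
    """Advantage of row `fast` over row `slow` per core count, in percent."""
    return [100.0 * (f / s - 1.0) for f, s in zip(fast, slow)]


def report():
    """Print the speedup table and the no-threads advantage under evport."""
    print(f"{'Configuration':<22}" + "".join(f"{n:>8}" for n in (1, 2, 3, 4)))
    for name, row in QPS.items():
        print(f"{name:<22}" + "".join(f"{x:8.2f}" for x in speedup(row)))
    adv = advantage_percent(QPS["evport and no threads"],
                            QPS["evport and pthreads"])
    print("no threads vs pthreads (evport), percent: "
          + ", ".join(f"{a:.0f}" for a in adv))


if __name__ == "__main__":
    report()
```

A linear speedup would show 4.00 in the last column; every row stays
well below that.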

To increase performance further, it seems the place to work at is the
network driver or network stack.

Best regards,
   Wouter

[*] This machine has been donated by RIPE NCC and has mostly been used
for System/OS interoperation testing. It turned out to be a good machine
to expose certain race conditions that did not show up on regular
Intel/Linux or BSD systems. If you happen to have somewhat exotic
machinery around we would welcome your donation.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAkuofeoACgkQkDLqNwOhpPiywgCfV9BJDaHYAUtgc/J7ueLCfJF4
d30AoJmpCXcLqc5rnTMWNHeyO3+LdG9w
=m8Qo
-----END PGP SIGNATURE-----