[Unbound-users] ~10% performance increase in worker_handle_request()

Sun Dec 29 19:41:36 UTC 2013

Hi,

I've been testing Unbound 1.4.21's performance with the trafgen tool
from the netsniff-ng toolkit, with a stream of UDP queries for the same
qname/qtype in order to test Unbound's 100% cache hit rate performance,
and using Linux's traffic control facility to vary the query load
offered to the Unbound server.  Then I use Linux's perf utility to
record performance data about the system under different query loads.

I read http://www.unbound.net/documentation/howto_optimise.html and
configured the following settings in my Unbound config file:

    num-threads: 4
    so-rcvbuf: 8m
    so-sndbuf: 8m
    msg-cache-slabs: 8
    rrset-cache-slabs: 8
    infra-cache-slabs: 8
    key-cache-slabs: 8

Under high loads but not under low loads, it appears the local_zone
lookup in worker_handle_request() is quite costly.  Here are the top 10
profiled functions, with the top 2 function's call chains expanded:

-   7.22%  unbound  unbound                [.] dname_lab_cmp
   - dname_lab_cmp
      - 78.04% local_zone_cmp
           rbtree_find_less_equal
           local_zones_lookup
           worker_handle_request.part.19.13072.2480
           comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
      - 19.46% compress_tree_lookup.15131
           packed_rrset_encode.15189
           reply_info_encode.3179
           reply_info_answer_encode
           worker_handle_request.part.19.13072.2480
           comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
      + 0.88% rbtree_find_less_equal
      + 0.53% packed_rrset_encode.15189
-   5.60%  unbound  libpthread-2.17.so     [.] pthread_spin_lock
   - pthread_spin_lock
      - 53.14% worker_handle_request.part.19.13072.2480
           comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
      - 46.86% comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
+   4.87%  unbound  [kernel.kallsyms]      [k] _raw_spin_lock
+   3.44%  unbound  [kernel.kallsyms]      [k] _raw_spin_lock_bh
+   3.35%  unbound  [kernel.kallsyms]      [k] __ip_select_ident
+   3.26%  unbound  [kernel.kallsyms]      [k] fib_table_lookup
+   3.12%  unbound  libpthread-2.17.so     [.] pthread_rwlock_rdlock
+   2.84%  unbound  [kernel.kallsyms]      [k] fget_light
+   2.80%  unbound  libpthread-2.17.so     [.] pthread_rwlock_unlock
+   2.78%  unbound  unbound                [.] packed_rrset_encode.15189

I believe the calls to dname_lab_cmp() (#1) and pthread_spin_lock() (#2)
can ultimately be accounted to the call to local_zones_lookup() in
worker_handle_request().

The benchmark run that the above data is from resulted in Unbound
utilizing 1.938 CPUs, according to the "perf stat" command.  With the
following patch, disabling local_zone lookups entirely, Unbound only
utilized 1.693 CPUs.

diff -Npru unbound-1.4.21.callgraph/daemon/worker.c unbound-1.4.21.callgraph2/daemon/worker.c

--- unbound-1.4.21.callgraph/daemon/worker.c	2013-09-19 06:51:33.000000000 -0400
+++ unbound-1.4.21.callgraph2/daemon/worker.c	2013-12-27 17:53:29.589149356 -0500
@@ -862,6 +862,7 @@ worker_handle_request(struct comm_point*
 		server_stats_insrcode(&worker->stats, c->buffer);
 		return 1;
 	}
+	/*
 	if(local_zones_answer(worker->daemon->local_zones, &qinfo, &edns, 
 		c->buffer, worker->scratchpad)) {
 		regional_free_all(worker->scratchpad);
@@ -872,6 +873,7 @@ worker_handle_request(struct comm_point*
 		server_stats_insrcode(&worker->stats, c->buffer);
 		return 1;
 	}
+	*/
 	if(!(LDNS_RD_WIRE(ldns_buffer_begin(c->buffer))) &&
 		acl != acl_allow_snoop ) {
 		ldns_buffer_set_limit(c->buffer, LDNS_HEADER_SIZE);

Moving the local_zone lookup to occur after the cache lookup as in the
following patch, Unbound only utilized 1.712 CPUs.

diff -Npru unbound-1.4.21.callgraph/daemon/worker.c unbound-1.4.21.callgraph3/daemon/worker.c
--- unbound-1.4.21.callgraph/daemon/worker.c	2013-09-19 06:51:33.000000000 -0400
+++ unbound-1.4.21.callgraph3/daemon/worker.c	2013-12-27 18:32:01.916016428 -0500
@@ -862,16 +862,6 @@ worker_handle_request(struct comm_point*
 		server_stats_insrcode(&worker->stats, c->buffer);
 		return 1;
 	}
-	if(local_zones_answer(worker->daemon->local_zones, &qinfo, &edns, 
-		c->buffer, worker->scratchpad)) {
-		regional_free_all(worker->scratchpad);
-		if(ldns_buffer_limit(c->buffer) == 0) {
-			comm_point_drop_reply(repinfo);
-			return 0;
-		}
-		server_stats_insrcode(&worker->stats, c->buffer);
-		return 1;
-	}
 	if(!(LDNS_RD_WIRE(ldns_buffer_begin(c->buffer))) &&
 		acl != acl_allow_snoop ) {
 		ldns_buffer_set_limit(c->buffer, LDNS_HEADER_SIZE);
@@ -924,6 +914,17 @@ worker_handle_request(struct comm_point*
 	ldns_buffer_rewind(c->buffer);
 	server_stats_querymiss(&worker->stats, worker);
 
+	if(local_zones_answer(worker->daemon->local_zones, &qinfo, &edns, 
+		c->buffer, worker->scratchpad)) {
+		regional_free_all(worker->scratchpad);
+		if(ldns_buffer_limit(c->buffer) == 0) {
+			comm_point_drop_reply(repinfo);
+			return 0;
+		}
+		server_stats_insrcode(&worker->stats, c->buffer);
+		return 1;
+	}
+
 	if(verbosity >= VERB_CLIENT) {
 		if(c->type == comm_udp)
 			log_addr(VERB_CLIENT, "udp request from",

Here are the top profiled functions with this patch applied:

+   5.44%  unbound  [kernel.kallsyms]      [k] _raw_spin_lock                                     ▒
+   4.02%  unbound  [kernel.kallsyms]      [k] _raw_spin_lock_bh                                  ▒
+   3.82%  unbound  [kernel.kallsyms]      [k] fget_light                                         ▒
+   3.65%  unbound  [kernel.kallsyms]      [k] __ip_select_ident                                  ◆
+   3.55%  unbound  [kernel.kallsyms]      [k] fib_table_lookup                                   ▒
+   3.36%  unbound  unbound                [.] packed_rrset_encode.15189                          ▒
+   3.15%  unbound  [kernel.kallsyms]      [k] sock_alloc_send_pskb                               ▒
+   3.05%  unbound  libpthread-2.17.so     [.] pthread_rwlock_rdlock                              ▒
+   2.76%  unbound  [kernel.kallsyms]      [k] inet_getpeer                                       ▒
+   2.72%  unbound  libpthread-2.17.so     [.] pthread_spin_lock                                  ▒
+   2.66%  unbound  libpthread-2.17.so     [.] pthread_rwlock_unlock

Note the dname_lab_cmp() and pthread_spin_lock() functions are no longer
on top.

I've attached a graph showing the performance of three different
benchmark runs:

 + unbound-callgraph-1: Unbound 1.4.21, unmodified.

 + unbound-callgraph-2: Unbound 1.4.21, with the first patch above.

 + unbound-callgraph-3: Unbound 1.4.21, with the second patch above.

Making the local_zone lookup occur after the cache lookup has little
effect under low load, but improves performance by ~10% under high load.
This has the side effect of making runtime updates to the local_zone
data non-instantaneous if there is a conflicting entry in the cache,
because the cache would be consulted first.  This side effect might be
fixable by having updates to the local_zone database do an implicit
flush, which adds some additional complexity, but it's probably worth
doing for up to a ~10% performance benefit under certain loads.

-- 
Robert Edmonds
edmonds at debian.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unbound-1.4.21.callgraph2.diff
Type: text/x-diff
Size: 816 bytes
Desc: not available
URL: <http://lists.nlnetlabs.nl/pipermail/unbound-users/attachments/20131229/7a67f55b/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unbound-1.4.21.callgraph3.diff
Type: text/x-diff
Size: 1415 bytes
Desc: not available
URL: <http://lists.nlnetlabs.nl/pipermail/unbound-users/attachments/20131229/7a67f55b/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: plot-callgraph.png
Type: image/png
Size: 452161 bytes
Desc: not available
URL: <http://lists.nlnetlabs.nl/pipermail/unbound-users/attachments/20131229/7a67f55b/attachment.png>