deadlock in unbound-control commands
Poprocks
robert.paprocki at dreamhost.com
Sun Aug 23 23:52:16 UTC 2015
Hello,
I have encountered an issue in which unbound appears to hang when issuing commands via unbound-control. I am running unbound 1.5.4 built on an Ubuntu 12.04 system. I've configured unbound with the same options available from the unbound package for Ubuntu precise:
configured for x86_64-unknown-linux-gnu on Wed Feb 8 05:48:44 UTC 2012 with options: '--prefix=/usr' '--sysconfdir=/etc' '--disable-rpath' '--with-pidfile=/var/run/unbound.pid' '--with-libevent' '--with-pythonmodule' '--with-pyunbound'
This issue is somewhat difficult to reproduce, as it doesn't manifest itself regularly (perhaps 1 in 100 unbound-controll calls), but I have captured what info I can. I see first that the unbound-control shell call hangs; strace of the unbound-control process shows it is blocked on a read() call, reading from the unbound daemon. Unbound itself is hanging at the following syscall:
root at server:~# strace -p 538
Process 538 attached - interrupt to quit
futex(0x2f8ad3c, FUTEX_WAIT_PRIVATE, 0, NULL
Below is a backtrace, showing what appears to be a deadlock in local_zones_add_RR:
(gdb) bt full
#0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals.
#1 0x000000000040eccd in local_zones_add_RR (zones=0x2026d70, rr=0x38c7dcec35c "privfoo.p-hex.com IN A 10.10.10.68")
at services/localzone.c:1352
lockret_err = <optimized out>
rr_name = 0x2a497d0 "\aprivfoo\005p-hex\003com"
rr_class = 1
len = 19
labs = 4
z = <optimized out>
r = <optimized out>
#2 0x0000000000429011 in execute_cmd.7956 (rc=0x2021930, ssl=0x2a45430, cmd=<optimized out>, worker=0x1ef3010) at daemon/remote.c:1146
arg = 0x38c7dcec35c "privfoo.p-hex.com IN A 10.10.10.68"
worker = 0x1ef3010
ssl = 0x2a45430
p = 0x38c7dcec351 "local_data privfoo.p-hex.com IN A 10.10.10.68"
#3 0x000000000046ea9b in handle_req.isra.19 (rc=0x2021930, ssl=0x2a45430) at daemon/remote.c:2555
r = <optimized out>
pre = "UBCT1 \000\000", <incomplete sequence \352>
magic = "UBCT1 "
buf = " local_data privfoo.p-hex.com IN A 10.10.10.68\000\000\f\000\000\000\000\000\000\000\240\306\316}\214\003\000\000\005\000\000\000\000\000\000\000@\252\f\260<\003\000\000pi\244\002\000\000\000\000\355s֯<\003\000\000\220\304\316}\214\003\000\000\002E߯<\003\000\000\200\257\244\002\000\000\000\000\250\222U\260<\003\000\000\326\301b\330֡Cˠ\306\316}\214\003\000\000\f\000\000\000\326QS\330\000\001\004\000\000\000\000\000\340\304\316}\214\003\000\000\244W\244\002\000\000\000\000\060\000\000\000\060\000\000\000\364pW\260<\003\000\000 \306\316}\214\003", '\000' <repeats 18 times>, "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060", '\000' <repeats 15 times>, "0\000\000\000\005\000\000\000\200"...
#4 0x000000000046ec09 in remote_control_callback (c=<optimized out>, arg=0x2a341e0, err=<optimized out>, rep=<optimized out>)
at daemon/remote.c:2624
s = 0x2a341e0
rc = 0x2021930
r = 1
#5 0x0000033cb02f89cc in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#6 0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:305
---Type <return> to continue, or q <return> to quit---
retval = <optimized out>
#7 0x0000000000467ac6 in worker_work (worker=<optimized out>) at daemon/worker.c:1311
No locals.
#8 daemon_fork (daemon=0x1e723f0) at daemon/daemon.c:566
No locals.
#9 0x0000000000411eaa in run_daemon (debug_mode=0, cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf")
at daemon/unbound.c:671
cfg = <optimized out>
daemon = <optimized out>
done_setup = 1
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:766
c = <optimized out>
cfgfile = 0x48bc49 "/etc/unbound/unbound.conf"
winopt = <optimized out>
cmdline_verbose = 0
debug_mode = 0
The local_zones struct 'zones' passed into local_zones_add_RR appears as such:
(gdb) print *(struct local_zones *)0x2026d70
$2 = {lock = {__data = {__lock = 0, __nr_readers = 1, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0,
__nr_writers_queued = 1, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0},
__size = "\000\000\000\000\001", '\000' <repeats 15 times>, "\001", '\000' <repeats 34 times>, __align = 4294967296}, ztree = {
root = 0x208e000, count = 103, cmp = 0x4489c0 <local_zone_cmp>}}
In another instance of this I saw this hang in daemon/remote.c:
(gdb) bt full
#0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals.
#1 0x0000000000428d00 in do_zone_remove (arg=<optimized out>, worker=0x15a6e90, ssl=0x21996e0) at daemon/remote.c:1131
lockret_err = <optimized out>
nm = <optimized out>
nmlabs = <optimized out>
nmlen = <optimized out>
z = 0x3cab0594990
#2 execute_cmd.7956 (rc=0x1551a20, ssl=0x21996e0, cmd=0x3cab0594990 " local_zone_remove p-hex.com", worker=0x15a6e90)
at daemon/remote.c:2462
p = 0x3cab0594991 "local_zone_remove p-hex.com"
#3 0x000000000046ea9b in handle_req.isra.19 (rc=0x1551a20, ssl=0x21996e0) at daemon/remote.c:2555
r = <optimized out>
pre = "UBCT1 \000\000\020p"
magic = "UBCT1 "
buf = " local_zone_remove p-hex.com\000\000\000\000\f\000\000\000\000\000\000\000\240\236\031\002\000\000\000\000\f\000\000\000\000\000\000\000\340LY\260\312\003\000\000\005\000\000\000\000\000\000\000 at j.\252\212\002\000\000@\326\031\002\000\000\000\000\355\063\370\251\212\002\000\000\320JY\260\312\003\000\000\002\005\001\252\212\002\000\000\240\211\030\002\000\000\000\000\250Rw\252\212\002\000\000\326\301b\330֡C\313\340LY\260\312\003\000\000\f\000\000\000\326QS\330\000\001\004\000\000\000\000\000 KY\260\312\003\000\000T\232\031\002\000\000\000\000\060\000\000\000\060\000\000\000\364\060y\252\212\002\000\000`LY\260\312\003", '\000' <repeats 18 times>, "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060", '\000' <repeats 15 times>, "0\000\000\000\005"...
#4 0x000000000046ec09 in remote_control_callback (c=<optimized out>, arg=0x21884c0, err=<optimized out>, rep=<optimized out>)
at daemon/remote.c:2624
s = 0x21884c0
rc = 0x1551a20
r = 1
#5 0x0000028aaa5149cc in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#6 0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:305
retval = <optimized out>
#7 0x0000000000467ac6 in worker_work (worker=<optimized out>) at daemon/worker.c:1311
No locals.
#8 daemon_fork (daemon=0x1526270) at daemon/daemon.c:566
No locals.
#9 0x0000000000411eaa in run_daemon (debug_mode=0, cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf")
at daemon/unbound.c:671
cfg = <optimized out>
daemon = <optimized out>
done_setup = 1
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:766
c = <optimized out>
cfgfile = 0x48bc49 "/etc/unbound/unbound.conf"
winopt = <optimized out>
cmdline_verbose = 0
debug_mode = 0
In both of these cases it appears to be hanging on the 'lock_rw_t lock' member of the local_zones struct. Below is the relevant portions of the config (excluding only access control):
root at server:~# sed -e '/#.*$/d' -e '/^$/d' /etc/unbound/unbound.conf
remote-control:
control-enable: yes
server:
verbosity: 4
statistics-interval: 0
statistics-cumulative: no
extended-statistics: no
num-threads: 1
interface: 10.242.128.232
port: 53
do-ip4: yes
do-udp: yes
do-tcp: yes
chroot: ""
directory: "/etc/unbound"
logfile: "/var/log/unbound.log"
use-syslog: "no"
pidfile: "/var/run/unbound.pid"
auto-trust-anchor-file: /var/lib/unbound/root.key
root at server:~# unbound -h
usage: unbound [options]
start unbound daemon DNS resolver.
-h this help
-c file config file to read instead of /etc/unbound/unbound.conf
file format is described in unbound.conf(5).
-d do not fork into the background.
-v verbose (more times to increase verbosity)
Version 1.5.4
linked libs: libevent 2.0.16-stable (it uses epoll), OpenSSL 1.0.1 14 Mar 2012
linked modules: dns64 validator iterator
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs at nlnetlabs.nl
I have found this problems occurs whether num-threads is set to 1 or 8 (number of cores in my system). There are roughly 15000 local-data entries in this machine, from half a dozen local-zone typetransparent entries. Unbound logging doesn't indicate any failure (the last entry is the one generated from the unbound-control call). This seems to occur most often when adding new local-data entries, but I have also seen this occur when calling 'local_zone_remove' (in this case it was also hanged on pthread_rwlock_wrlock). I have not been able to replicate this with the version of Unbound made available from the Ubuntu precise package (1.4.16). Please let me know what further information I can provide to assist in debugging this problem. Thank you!
--
Sincerely,
Robert Paprocki
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 835 bytes
Desc: Digital signature
URL: <http://lists.nlnetlabs.nl/pipermail/unbound-users/attachments/20150823/4c1117f4/attachment.bin>
More information about the Unbound-users
mailing list