Unbound 1.9.2 deadlock?
Patrik Lundin
patrik at sigterm.se
Mon Sep 23 09:02:12 UTC 2019
Hello,
Today I noticed a machine running Unbound 1.9.2 had gotten itself into
an odd state.
Running "unbound-control status" would just hang, and even after
removing all client queries from the machine one of the CPUs in the
server was still fully busy in 100% user time in top:
===
%Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
===
Keep in mind this is on a machine that except for some monitoring queries are
seeing no traffic at this point:
===
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5447 unbound 20 0 14.8g 14.2g 4320 S 100.0 22.7 3742:27 unbound
===
strace was fairly silent, which i guess makes sense since all time is
spent in user space according to top:
===
# strace -f -p 5447
strace: Process 5447 attached with 16 threads
[pid 5489] epoll_wait(130, <unfinished ...>
[pid 5488] epoll_wait(133, <unfinished ...>
[pid 5487] epoll_wait(132, <unfinished ...>
[pid 5486] epoll_wait(143, <unfinished ...>
[pid 5485] epoll_wait(126, <unfinished ...>
[pid 5484] epoll_wait(127, <unfinished ...>
[pid 5483] epoll_wait(134, <unfinished ...>
[pid 5482] epoll_wait(121, <unfinished ...>
[pid 5481] epoll_wait(120, <unfinished ...>
[pid 5480] epoll_wait(115, <unfinished ...>
[pid 5479] epoll_wait(114, <unfinished ...>
[pid 5477] epoll_wait(105, <unfinished ...>
[pid 5476] epoll_wait(102, <unfinished ...>
[pid 5475] epoll_wait(104, <unfinished ...>
[pid 5447] read(70,
===
I do find that read to be a bit suspicious because I am not seeing anything
equivalent on another machine that is not having any problems, though it is not
clear to me what it is communicating with:
===
# lsof -n -P -p 5447
[...]
unbound 5447 unbound 70u unix 0xffff9090b9832400 0t0 20234 socket
[...]
===
It appears being unbound talking to itself:
===
# ss -xp | grep 20234
u_str ESTAB 0 0 * 20235 * 20234 users:(("unbound",pid=5447,fd=71))
u_str ESTAB 0 0 * 20234 * 20235 users:(("unbound",pid=5447,fd=70))
===
Too me this feels like some race condition might have caused something to spin
uncontrollably in the process.
Looking at the process with GDB I can see the following for that read()'ing
thread (thread apply all bt):
===
Thread 1 (Thread 0x7fa25fada840 (LWP 5447)):
#0 0x00007fa25e87a6fd in read () from /lib64/libpthread.so.0
#1 0x0000560e3305f285 in tube_read_msg ()
#2 0x0000560e33014d81 in server_stats_obtain ()
#3 0x0000560e3300f794 in do_stats.isra.12 ()
#4 0x0000560e33011d18 in execute_cmd ()
#5 0x0000560e33013fb8 in handle_req.isra.18 ()
#6 0x0000560e330140d0 in remote_control_callback ()
#7 0x00007fa25f423a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#8 0x0000560e330b140c in comm_base_dispatch ()
#9 0x0000560e3300d8c1 in daemon_fork ()
#10 0x0000560e3300942f in main ()
===
All remaining threads are sitting in epoll_wait except for this one:
===
Thread 13 (Thread 0x7fa25ad46700 (LWP 5478)):
#0 0x00007fa25f4219e2 in event_queue_insert () from /lib64/libevent-2.0.so.5
#1 0x00007fa25f421c24 in event_active_nolock () from /lib64/libevent-2.0.so.5
#2 0x00007fa25f424426 in event_base_loop () from /lib64/libevent-2.0.so.5
#3 0x0000560e330b140c in comm_base_dispatch ()
#4 0x0000560e3300cf6f in thread_start ()
#5 0x00007fa25e873dd5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fa25e59d02d in clone () from /lib64/libc.so.6
===
Though if this is related I dont know at this point.
Anyone else seeing something like this? Any guesses what might be wrong?
==
Patrik Lundin
More information about the Unbound-users
mailing list