DoT with unbound

Sun Nov 24 23:04:34 UTC 2019

Hm, following up on the saga...

My unbound is listening as follows:

server:
  # Listen on port 853 on both logical and physical interfaces:
  interface: 158.38.0.1 at 853
  interface: 2001:700:0:ff00::1 at 853
  interface: 158.38.0.168 at 853
  interface: 2001:700:0:503::ca53 at 853

and ends up with typically the following FDs open:

# fstat -n -p 24864
USER     CMD          PID   FD  DEV     INUM  MODE  SZ|DV R/W
unbound  unbound    24864 root  18,4   46615746 40755    512 r 
unbound  unbound    24864   wd  18,4   46615746 40755    512 r 
unbound  unbound    24864    0  18,0     85874 20666  2,2  rw
unbound  unbound    24864    1  18,0     85874 20666  2,2  rw
unbound  unbound    24864    2  18,0     85874 20666  2,2  rw
unbound  unbound    24864    3* crypto 0xfffffe841bcbae70
unbound  unbound    24864    4* internet dgram udp 158.38.0.1:853
unbound  unbound    24864    5* internet stream tcp 158.38.0.1:853
unbound  unbound    24864    6* internet6 dgram udp [2001:700:0:ff00::1]:853
unbound  unbound    24864    7* internet6 stream tcp [2001:700:0:ff00::1]:853
unbound  unbound    24864    8* internet dgram udp 158.38.0.168:853
unbound  unbound    24864    9* internet stream tcp 158.38.0.168:853
unbound  unbound    24864   10* internet6 dgram udp [2001:700:0:503::ca53]:853
unbound  unbound    24864   11* internet6 stream tcp [2001:700:0:503::ca53]:853
unbound  unbound    24864   12* unix stream  <-> /var/run/unbound-ctl [creat]
unbound  unbound    24864   13* unix dgram  <-> /var/run/log [using]

The TCP/158.38.0.1:853 port is the "main" service point (receives the
most queries), fd 5.

I added some verbose logging to handle_select(), and it looks like
when unbound takes inordinately long time to return questions on
158.38.0.1:853 (or doesn't do so within even 2 minutes), I get logging
that fd 5 is *not* part of the read-set passed to select(!)

I would have thought that FDs 4-11 would *always* be included in the
read-set in the select loop, but that doesn't appear to be the case.

This also appears to be when there starts to pile up connections which
are established by the TCP stack but which are not being handled by
unbound (not yet picked up with accept()):

# netstat -f inet -n | grep 853
tcp      147      0  158.38.0.1.853         158.38.204.xxx.33336   CLOSE_WAIT
tcp      165      0  158.38.0.1.853         158.38.204.yyy.44656   ESTABLISHED
tcp      147      0  158.38.0.1.853         158.38.216.zzz.41239   ESTABLISHED
tcp      147      0  158.38.0.1.853         158.38.204.xx6.37978   CLOSE_WAIT
tcp      147      0  158.38.0.1.853         158.38.204.zz8.49063   ESTABLISHED
tcp      147      0  158.38.0.1.853         158.38.204.xx1.37366   ESTABLISHED
tcp      147      0  158.38.0.1.853         128.39.150.aa8.37272   ESTABLISHED
tcp      147      0  158.38.0.1.853         158.38.204.xx7.51850   ESTABLISHED
tcp      147      0  158.38.0.1.853         158.38.204.xx9.44854   CLOSE_WAIT
tcp      517      0  158.38.0.1.853         158.38.204.xx1.37362   ESTABLISHED
tcp      517      0  158.38.0.1.853         158.38.204.xx2.46394   ESTABLISHED
...

Looking at the code in util/, there doesn't seem to be any code
outside of mini_event.c which messes with the "reads" set, it's zeroed
by event_init(), used by handle_select(), set by event_add(), and
cleared by event_del().  Except, of course, one may operate with
several "contexts" created by event_init().

So I added some code to event_del() and event_add(), and I see fd 5
being both added and removed, and now last in the log I have 2
minutes+ where it's been deleted, so new connections are not picked up
during that entire duration, which is obviously bad.

What might be causing this?

Regards,

- Håvard