unbound dropping all queries with empty request list when used as a dns-over-tls forwarder

Mon Apr 20 10:16:01 UTC 2020

Hello,

I've tried setting up unbound for the first time (to replace existing
dns setup based around venerable djbdns), and seem to have stumbled
into following problem.

After as little as dozen minutes and as much as several hours, unbound
daemon stops responding to queries without any kind of overt errors
(e.g. warning/error-level logging) or indication that anything is abnormal.

Looking at e.g. "Unbound+request-list+demystified" article, I wasn't able to
find description of the state which would match this case, so don't seem to
understand what exactly is happening here.
( https://docs.menandmice.com/display/MM/Unbound+request-list+demystified )

One suspicion I get (without looking at the code or knowing much about it),
is that maybe unbound marks all forwarders as "bad", but that doesn't seem to
explain "debug: Too many queries" message in logs (see below).

Config file that I'm using:
  https://e.var.nz/unbound.bugreport-2020-04-20_141538.conf

It's (supposed to be) setup to forward DNS queries over TLS to
CloudFlare servers (1.1.1.1 and such), except for couple specific local
and stub zones, with latter forwarded to 127.0.0.1 where tinydns
(djbdns resolver) is listening.

It is setup in a systemd-nspawn container running latest Arch Linux
userspace and Linux Kernel 5.4.x, started from systemd unit, which I
think is shipped with unbound itself (given how it includes a lot of
security options that require good understanding of how daemon works).

"unbound -V" output:
  https://e.var.nz/unbound.bugreport-2020-04-20_141538.version-output.txt

As far as I can tell, bulk of the queries to unbound are coming from postfix
daemon that is checking various IPs and names for being valid for handling email
in various ways (e.g. PTR request to check if IP is dynamic and such).

At some point it stops responding to all DNS queries, as mentioned, but
otherwise works fine, i.e. it's not a crash or hang in the code afaict.

I've tried running special unbound on localhost port 54 and hammering it with
similar PTR queries like postfix does, but couldn't reproduce the issue.

As far as I can tell, for same PTR query where I can see a lot of THROWAWAY
reponses in the log, CloudFlare eventually resolves it (maybe after couple dozen
seconds), and unbound starts serving cached responses to it immediately,
and all other queries seem to work fine at/during that time too.

So basically don't know how to reproduce it outside of my real-world specific
setup, but from restart unbound there ~5-10 times, had it become unresponsive
within 30min-to-hours in all of them.

This does not happen when I switch mandatory TLS off in general request forwarding.

As a workaround, I've setup "stubby" DNS-over-TLS resolver, forced to only use
TLS, with same exact CloudFlare servers, just like unbound, and made unbound
query it over regular TCP/UDP on localhost.

This way (unbound + stubby), it's been running for days now without any issues,
i.e. problem only seem to manifest with mandatory-TLS upstreams.

When problem happens, here are various log/command outputs:

- strace.log from "strace -fp $(pgrep -x unbound)" command when
  querying it over UDP (using "dig ya.ru" in this example iirc):

  epoll_wait(10, [{EPOLLIN, {u32=5, u64=5}}], 32, -1) = 1
  gettimeofday({tv_sec=1587140870, tv_usec=99133}, NULL) = 0
  gettimeofday({tv_sec=1587140870, tv_usec=99357}, NULL) = 0
  recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(37572), sin_addr=inet_addr("10.8.0.1")},
    msg_namelen=128->16,
    msg_iov=[{iov_base="\361G\1 \0\1\0\0\0\0\0\1\2ya\2ru\0\0\1\0\1\0\0)\20\0\0\0\0\0"...,
    iov_len=65552}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP,
    cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("host0"),
    ipi_spec_dst=inet_addr("10.8.0.2"),
    ipi_addr=inet_addr("10.8.0.2")}}], msg_controllen=32, msg_flags=0}, 0) = 46
  recvmsg(5, {msg_namelen=128}, 0)        = -1 EAGAIN (Resource temporarily unavailable)
  epoll_wait(10,

  Note: there is also no output immediately before/after this for a
  reasonable time, i.e. unbound receives the request and discards it
  without trying to do anything with any sockets or files.

- "unbound-control dump_requestlist" - empty table, i.e. looks like all
  requests are immediately discarded for other reason.

- "unbound-control ratelimit_list" and "unbound-control
  ip_ratelimit_list" are empty.

- "unbound-control stats" output looks like this:
  https://e.var.nz/unbound.bugreport-2020-04-20_141538.unbound-control-stats.txt

  Has total.num.expired=0 total.num.queries_ip_ratelimited=0 total.requestlist.exceeded=34

  Don't think I've checked if new requests in this state bump "exceeded" limit,
  but it looks like they might.

- Enabling "unbound-control verbosity 9", get output like the following in the log:
  http://e.var.nz/unbound.bugreport-2020-04-20_141538.log

  Note that initially it is set to only get "info" level messages, then switched
  to show "debug" (after "info: control cmd: verbosity 9").

  Before unbound goes into this unresponsive state, there are lot of messages
  like this one:

    [1587135862] unbound[520:0] info: resolving 18.96.128.23.in-addr.arpa. PTR IN
    [1587135867] unbound[520:0] info: resolving 18.96.128.23.in-addr.arpa. PTR IN
    [1587135873] unbound[520:0] info: response for 18.96.128.23.in-addr.arpa. PTR IN
    [1587135873] unbound[520:0] info: reply from <.> 2606:4700:4700::1001#853
    [1587135873] unbound[520:0] info: query response was THROWAWAY
    [1587135873] unbound[520:0] info: response for 18.96.128.23.in-addr.arpa. PTR IN
    [1587135873] unbound[520:0] info: reply from <.> 2606:4700:4700::1111#853
    [1587135873] unbound[520:0] info: query response was THROWAWAY

  (i.e. ones with THROWAWAY response, for more-or-less same or similar PTR queries)

  Typical log output after stopping responding to queries seem to be:

    [1587136134] unbound[520:0] info: receive_udp on interface: 3 10.8.0.2 10.8.0.2
    [1587136134] unbound[520:0] debug: udp request from ip4 10.8.0.1 port 50658 (len 16)
    [1587136134] unbound[520:0] debug: Too many queries. dropping incoming query.
    [1587136134] unbound[520:0] debug: cache memory msg=69740 rrset=70720 infra=8996 val=66352 subnet=74504

Apologies if a bit long-winded, tried to include as much info up-front as possible.

Let me know if there's anything I can try or check, it should be very easy to
tweak backend and options in this setup, and failures should be relatively
low-impact, and noticed quickly enough.

Thanks in advance for any information,
as well as reading this far and working on this project.

Cheers!

-- 
Mike Kazantsev // fraggod.net