unbound timeouts to auth-zone records when server lost access to Internet

Thu Sep 23 09:28:48 UTC 2021

Hi Kirill,

The routine that you think is slow only performs a lookup of stored
data. It should return immediately and not take any significant amount
of time.

Perhaps you ran out of memory and the machine is using swap space? Then,
whenever this routine runs for a lookup and touches memory that has been
swapped out, the resulting page fault is slow. The mix of different
requests would then mainly influence which working set you end up with,
and so how often those page faults occur.
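
One way to check the swap hypothesis on the FreeBSD host is to sum the
used swap that swapinfo reports. The following is a minimal Python
sketch, assuming the usual "swapinfo -k" column layout (Device,
1K-blocks, Used, Avail, Capacity); the function names are illustrative
only and not part of unbound.

```python
import subprocess


def parse_swapinfo(text):
    """Sum the 'Used' column (KiB) of `swapinfo -k` output.

    Assumes the columns are: Device, 1K-blocks, Used, Avail, Capacity.
    """
    used = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the "Total" summary row
        # that is printed when more than one swap device is configured.
        if len(fields) < 3 or fields[0] in ("Device", "Total"):
            continue
        used += int(fields[2])
    return used


def swap_used_kib():
    """Run swapinfo on the local FreeBSD host and return used swap in KiB."""
    result = subprocess.run(
        ["swapinfo", "-k"], capture_output=True, text=True, check=True
    )
    return parse_swapinfo(result.stdout)
```

If swap_used_kib() stays at zero while the dnsperf test runs, swapping
is unlikely to be the cause.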

I cannot explain it otherwise. The lookup of a large timeout value
should not be taking any more time than a lookup of a low timeout value.

Best regards, Wouter

On 23/09/2021 11:09, Kirill via Unbound-users wrote:
> Hello! We have been tasked with ensuring that most of our devices
> keep working even after our offices lose access to the Internet. We
> set up some auth-zones:
> 
> auth-zone:
>     name: "printers.company.org"
>     master: "198.51.100.55"
>     master: "2001:db8::ffff"
>     fallback-enabled: "yes"
>     for-downstream: "no"
>     for-upstream: "yes"
>     zonefile: "/usr/local/etc/unbound/slave_zones/printers.company.org"
> 
> auth-zone:
>     name: "cctv.company.org"
>     master: "198.51.100.55"
>     master: "2001:db8::ffff"
>     fallback-enabled: "yes"
>     for-downstream: "no"
>     for-upstream: "yes"
>     zonefile: "/usr/local/etc/unbound/slave_zones/cctv.company.org"
> 
> auth-zone:
>     name: "company.org"
>     master: "198.51.100.55"
>     master: "2001:db8::ffff"
>     fallback-enabled: "yes"
>     for-downstream: "no"
>     for-upstream: "yes"
>     zonefile: "/usr/local/etc/unbound/slave_zones/company.org"
> 
> When we have Internet access, unbound works as intended and resolves
> records recursively.
> 
> When our office loses access, we expect to be able to resolve records
> saved in slave zones, but we experience different results:
> 
>   * When the number of devices is small, unbound serves requests for
>     domains in auth-zones, and everything works fine.
>   * When the number of devices is large, unbound stops serving all
>     requests.
> 
> 
> We tried to reproduce this problem in our lab with dnsperf. For our
> test setup we used the latest version of unbound (1.13.2) on FreeBSD 12
> (config attached):
> dnsperf -s dns-lab.company.org -f inet6 -Q 100000 -d data -l 300 -q 200
> 
> - When we use data where 50% of the domains are from the auth-zones and
> 50% are from elsewhere, unbound struggles but continues to serve our
> records.
> - When the data is composed of 10000 third-party domains and 300 of our
> domains, unbound loses its ability to serve any request, and every
> resolve attempt ends in a timeout.
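
To make the second test mix reproducible, the dnsperf input file could be
generated with a short script. This is only a sketch: dnsperf reads one
"name type" pair per line, and the host names below are hypothetical
placeholders (not taken from the original test data) that should be
replaced with names that actually exist in the zones and on the Internet.

```python
import random


def build_dnsperf_data(own_names, third_party_names, seed=0):
    """Return dnsperf input lines ("<name> <type>") in a reproducible order."""
    lines = [f"{name} A" for name in own_names + third_party_names]
    # A fixed seed gives the same shuffled order on every run.
    random.Random(seed).shuffle(lines)
    return lines


def write_data(path, lines):
    """Write the query lines to a file that can be passed to dnsperf -d."""
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")


# Hypothetical names: 300 in an auth-zone, 10000 outside of it.
own = [f"host{i}.printers.company.org" for i in range(300)]
third_party = [f"www{i}.example.com" for i in range(10000)]
lines = build_dnsperf_data(own, third_party)
```

Calling write_data("data", lines) produces a file that matches the
"-d data" argument in the dnsperf command above.
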
> 
> When we trace the process with dtrace, we find that unbound spends most
> of its time in processQueryTargets -> iter_filter_unsuitable (flamegraph
> attached as SVG).
> 
> This looks consistent with
> https://www.nlnetlabs.nl/documentation/unbound/info-timeout/ in that
> unbound tries to reach the root servers to resolve records, but
> ultimately cannot without Internet access.
> Also, "Many threads can have many packets outstanding to an IP address,
> all at the same time. The infra-cache data is shared between threads."
> Since all threads try to reach the same unreachable servers, and the
> number of requests from clients that lost Internet access grows
> manifold, unbound sometimes gets stuck in the umtxn state. Even when
> unbound is not stuck in umtxn (we tried forked operation in the lab), it
> cannot serve locally saved data from the auth-zones.
> 
> Can you confirm that this behavior is expected from unbound? How can we,
> by changing config or other means, provide our devices with working DNS
> when we lose Internet access?