<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<div>Hello! We need to ensure that most of our devices keep working
even after our offices lose Internet access. To that end we set up
several auth-zones:<br type="_moz">
</div>
<div><br>
</div>
<div>auth-zone:<br>
name: "printers.company.org"<br>
master: "198.51.100.55"<br>
master: "2001:db8::ffff"<br>
fallback-enabled: "yes"<br>
for-downstream: "no"<br>
for-upstream: "yes"<br>
zonefile:
"/usr/local/etc/unbound/slave_zones/printers.company.org"<br
type="_moz">
</div>
<div><br>
</div>
<div>auth-zone:<br>
name: "cctv.company.org"<br>
master: "198.51.100.55"<br>
master: "2001:db8::ffff"<br>
fallback-enabled: "yes"<br>
for-downstream: "no"<br>
for-upstream: "yes"<br>
zonefile: "/usr/local/etc/unbound/slave_zones/cctv.company.org"</div>
<div><br>
</div>
<div>auth-zone:<br>
name: "company.org"<br>
master: "198.51.100.55"<br>
master: "2001:db8::ffff"<br>
fallback-enabled: "yes"<br>
for-downstream: "no"<br>
for-upstream: "yes"<br>
zonefile: "/usr/local/etc/unbound/slave_zones/company.org"</div>
<br>
<div>When we have Internet access, unbound works as intended and
resolves records recursively.<br type="_moz">
</div>
<div><br>
</div>
<div>When an office loses Internet access, we expect to still be able
to resolve the records saved in the slave zones, but the results vary:</div>
<ul>
<li>When the number of devices is small, unbound serves requests
for domains in the auth-zones and everything works fine.</li>
<li>When the number of devices is large, unbound stops serving all
requests.</li>
</ul>
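<div>To tell these two outcomes apart from a client, a probe along
the following lines could be used. This is only a minimal sketch in
Python, not something taken from our setup: the function names
build_query and resolver_answers and the resolver address in the
usage note are hypothetical, and the server must be given as an IP
literal because name resolution itself may be broken.</div>
<pre>
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the reply to our query


def build_query(name, qtype=1, query_id=QUERY_ID):
    """Build a minimal DNS query packet (RFC 1035) for name, class IN."""
    # Header: ID, flags (only RD set), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A by default) and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def resolver_answers(server, name, timeout=2.0, port=53):
    """Return True if server sends a UDP reply with our ID before timeout."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    return reply[:2] == struct.pack("!H", QUERY_ID)
</pre>
<div>For example, resolver_answers("2001:db8::53", "company.org")
with a placeholder resolver address would return False when every
attempt ends in a timeout. The sketch only checks that some reply
arrives; it does not inspect the response code.</div>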
<div><br>
</div>
<div>We tried to reproduce this problem in our lab with dnsperf. For
the test bench we used the latest version of unbound (1.13.2) on
FreeBSD 12 (config attached):<br>
dnsperf -s dns-lab.company.org -f inet6 -Q 100000 -d data -l 300
-q 200</div>
<div><br>
</div>
<div>- When 50% of the domains in the data file are from the
auth-zones and 50% are from elsewhere, unbound struggles but
continues to serve our records.<br>
- When the data file consists of 10000 third-party domains and 300
of our domains, unbound loses its ability to serve any request,
and every resolution attempt ends in a timeout.<br type="_moz">
</div>
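<div><br>
</div>
<div>A data file with the second mix could be generated roughly as
follows. This is a hedged sketch in Python, not the script we
actually used: build_dnsperf_data is a hypothetical helper, and the
host names are placeholders that should be replaced with real
records from the auth-zones and a real list of third-party domains.
dnsperf reads one query per line, as a name followed by a record
type.</div>
<pre>
import random


def build_dnsperf_data(own_names, third_party_names, qtype="A", seed=0):
    """Return dnsperf data-file lines ("name type") in a shuffled order."""
    lines = [f"{name} {qtype}" for name in list(own_names) + list(third_party_names)]
    # A fixed seed keeps the query order reproducible between runs.
    random.Random(seed).shuffle(lines)
    return lines


# Placeholder names (assumptions): 300 of our own, 10000 third-party.
own = [f"host{i}.printers.company.org" for i in range(300)]
third = [f"www{i}.example.net" for i in range(10000)]

if __name__ == "__main__":
    with open("data", "w") as fh:
        fh.write("\n".join(build_dnsperf_data(own, third)) + "\n")
</pre>
<div>Running the script writes a file named data in the current
directory, matching the -d data argument of the dnsperf command
above.</div>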
<div><br>
</div>
<div>When we trace the process with dtrace, we find that unbound
spends most of its time in processQueryTargets -&gt;
iter_filter_unsuitable (flamegraph attached as SVG).<br type="_moz">
</div>
<div><br>
</div>
<div>This looks consistent with
<a class="moz-txt-link-freetext" href="https://www.nlnetlabs.nl/documentation/unbound/info-timeout/">https://www.nlnetlabs.nl/documentation/unbound/info-timeout/</a>:
unbound tries to reach the root servers to resolve records, but it
ultimately cannot without Internet access.<br>
Also, "Many threads can have many packets outstanding to an IP
address, all at the same time. The infra-cache data is shared
between threads." Since all threads try to reach the same
unreachable servers, and the number of requests from clients that
lost Internet access grows manifold, unbound sometimes locks up in
the umtxn state. Even when unbound is not stuck in umtxn (we tried
forked operation in the lab), it cannot serve locally saved data
from the auth-zones.<br type="_moz">
</div>
<div><br>
</div>
<div>Can you confirm that this behavior is expected from unbound?
How can we, by changing the config or by other means, provide our
devices with working DNS when we lose Internet access?</div>
</body>
</html>