Jostle logic seems to randomly stop working
Dmitri Kourennyi
dkour at mykolab.com
Sat May 19 01:59:34 UTC 2018
More investigation results:
I think the issue appears when unbound is trying to probe the master
servers for
the auth_zone section. The logs show unbound doing lookups on all the
auth_zone
domain names in my config, and I think unbound is answering them from
its own
cache. After the lookups, I get the following in the logs:
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end
13 recursion states (462 with reply, 13 detached), 13 waiting replies,
1444 recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 0RDd mod0
b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 1RDd mod0
c.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 2RDd mod0
f.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 3RDd mod0
g.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 4RDd mod0
k.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 5RDd mod0
xfr.cjr.dns.icann.org. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 6RDd mod0
xfr.lax.dns.icann.org. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 7RDd mod0
b.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 8RDd mod0
c.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 9RDd mod0
g.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 10RDd mod0
k.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 11RDd mod0
xfr.cjr.dns.icann.org. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 12RDd mod0
xfr.lax.dns.icann.org. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end
12 recursion states (462 with reply, 12 detached), 12 waiting replies,
1444 recursion replies sent, 1 replies dropped, 0 states jostled out
⋮
The above pattern repeats, with the last entry in the RDd list dropping
off, and
the `detached` and `waiting reply` counters going down by one each time.
Once
the list is emptied I get:
⋮
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1
recursion states (462 with reply, 1 detached), 1 waiting replies, 1444
recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
0RDd mod0 b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1
recursion states (462 with reply, 0 detached), 0 waiting replies, 1444
recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone .:
soa probe serial is 2018051800
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth_zone
unchanged, new lease, wait
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone .
timeout in 1800 seconds
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: close fd 15
From this point onward, queries are rejected. Half an hour later, the
auth-zone
lookup attempt repeats, and it all looks like above. 4 hours later, I
get a
bunch of lines like this:
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point
start listening 15
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: Reading tcp
query of length 17540
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point
stop listening 15
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point
start listening 15
⋮
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: Reading tcp
query of length 6523
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: comm point
stop listening 15
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: xfr
b.root-servers.net: last AXFR packet
Followed by a big list:
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr
at[0] . SOA
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr
at[1] . RRSIG
⋮
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr
at[3028] camera. DS
May 18 16:59:57 homebrew systemd-journald[199]: Suppressed 19535
messages from unbound.service
Which looks like it got new records for the auth_zone? I'm not sure. It
keeps
looping the auth_zone lookups every half hour still.
I added some logging to mesh_make_new_space(), which showed that no
jostle is
attempted because mesh->jostle_first is a nullptr. I don't know if
that's
normal.
Since it seems to coincide with the auth-zone lookups, I will try
disabling it
to see if the issue resolves itself.
More information about the Unbound-users
mailing list