[Unbound-users] Unbound stops resolving some hosts sporadically
john
lists at cloned.org.uk
Tue Mar 29 10:46:17 UTC 2011
Hi,
I'm running unbound 1.4.6-1 from Debian Squeeze on a couple of machines as
a resolver and have come across a strange problem where unbound will stop
resolving some hosts. It seems sporadic as to what works and what doesn't,
but Nagios generally reports that it timed out after 10s when trying to
resolve www.google.com. I sometimes notice it before Nagios does if I'm
using that resolver myself. I've tried sniffing traffic, but it's hard to
pinpoint queries as the servers are quite busy.
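For what it's worth, this is roughly how I check it by hand when it
happens (the resolver address is just an example); the query simply times
out:
# dig @127.0.0.1 www.google.com +time=10 +tries=1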
What I've noticed, though, is that sometimes, for example, I won't be able
to resolve yahoo.com, but after a minute or two it works again, and then
fails again a few minutes after that. I don't believe there is a network
problem, as it fails for some domains whilst working for others that sit
on the same name servers.
If I restart unbound it clears up the problem, but this is obviously less
than ideal. I'm tending to do this once a day now, sometimes more
frequently. I've moved one of the resolvers over to other software for
now, to avoid the obvious risk of both of them being broken at the same
time.
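(By restart I mean nothing fancier than the stock Debian init script:)
# /etc/init.d/unbound restart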
The server config on the machine still running unbound looks like this
(minus the interface, outgoing-interface and access-control lines):
server:
    verbosity: 1
    statistics-interval: 86400
    num-threads: 2
    outgoing-range: 256
    msg-cache-size: 128m
    num-queries-per-thread: 1024
    rrset-cache-size: 256m
    do-ip6: no
    chroot: ""
    root-hints: /etc/unbound/named.cache
    hide-identity: yes
    hide-version: yes
The other one I tweaked slightly, adding a socket receive buffer setting
in case it was exhausting the socket buffers, but it didn't make any
difference:
server:
    verbosity: 2
    statistics-interval: 86400
    num-threads: 2
    outgoing-range: 462
    so-rcvbuf: 4m
    msg-cache-size: 128m
    num-queries-per-thread: 1024
    rrset-cache-size: 256m
    do-ip6: no
    chroot: ""
    logfile: "/var/log/unbound.log"
    root-hints: /etc/unbound/named.cache
    hide-identity: yes
    hide-version: yes
# cat /proc/sys/net/core/rmem_max
4194304
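(A so-rcvbuf above the kernel default needs net.core.rmem_max raised to
match, or unbound needs enough privileges to force it, so I bumped the
limit roughly like this:)
# sysctl -w net.core.rmem_max=4194304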
The logs didn't really show much, and they produce too much data to trawl
through easily.
This morning I changed to running a single thread, and when I noticed the
problem again just now I dumped the stats, which were as follows:
# unbound-control status
version: 1.4.6
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 3095 seconds
unbound (pid 24730) is running...
# unbound-control stats_noreset
thread0.num.queries=87145
thread0.num.cachehits=38156
thread0.num.cachemiss=48989
thread0.num.prefetch=0
thread0.num.recursivereplies=41036
thread0.requestlist.avg=327.143
thread0.requestlist.max=1091
thread0.requestlist.overwritten=3188
thread0.requestlist.exceeded=18
thread0.requestlist.current.all=1091
thread0.requestlist.current.user=1024
thread0.recursion.time.avg=27.134193
thread0.recursion.time.median=0.00895803
total.num.queries=87145
total.num.cachehits=38156
total.num.cachemiss=48989
total.num.prefetch=0
total.num.recursivereplies=41036
total.requestlist.avg=327.143
total.requestlist.max=1091
total.requestlist.overwritten=3188
total.requestlist.exceeded=18
total.requestlist.current.all=1091
total.requestlist.current.user=1024
total.recursion.time.avg=27.134193
total.recursion.time.median=0.00895803
time.now=1301394872.957300
time.up=3099.708199
time.elapsed=3099.708199
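For context against the configs above: requestlist.current.user is pinned
at 1024, which is exactly the num-queries-per-thread limit, and
recursion.time.avg works out to roughly 27 seconds, so when this happens
the request list looks full of slow recursions. These are the two settings
that bound it, per thread (the values are just the ones from my config;
the comments are my understanding of what they do):
server:
    # maximum queries queued waiting on recursion; the pinned 1024 above
    num-queries-per-thread: 1024
    # upstream sockets available to work those recursions off
    outgoing-range: 256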
I've read the changelogs of newer versions but can't see anything that
looks like this problem. I'd prefer to avoid upgrading to the latest
source distribution on the off-chance it will fix it as that just seems
like clutching at straws.
Any ideas?
Cheers,
john