[Unbound-users] Requestlist filling ? automatic cleanup ?
mailinglists at iinet.com.au
Sun Mar 20 21:53:37 UTC 2011
Excellent explanations and fast reply as usual, many thanks.
> (W)hat version are you using? Recently the timeout code was changed to
> cope with this sort of situation (1.4.7):
Oops, sorry, I forgot to say: I am using the latest version, 1.4.8.
It's running on CentOS 5.5 (old 2.6.18 kernel, sadly). We built our own
packages, and they should have libevent. A while ago I started a thread
asking for an explicit way to confirm that we have and are using
libevent, and IIRC you told me that I was using it :)
modules: 2 [ validator iterator ]
uptime: 2773636 seconds
unbound (pid 3952) is running...
linked libs: libevent 1.4.13-stable (it uses epoll), ldns 1.6.8, OpenSSL
0.9.8e-fips-rhel5 01 Jul 2008
linked modules: validator iterator
configured for i386-redhat-linux-gnu on Wed Feb 16 10:26:27 EST 2011
with options: '--build=i386-koji-linux-gnu' '--host=i386-koji-linux-gnu'
'--target=i386-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr'
'--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
'--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
'--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--localstatedir=/var'
'--infodir=/usr/share/info' '--with-ldns=' '--with-libevent'
'--with-pthreads' '--with-ssl' '--disable-rpath' '--enable-debug'
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs at nlnetlabs.nl
>> [..] jostle-timeout is triggered when the server is very busy. What defines
>> 'busy' ?
> The requestlist is full.
OK. I think this should be clarified in the documentation. I can send
you a patch if that would save you time.
> Your requestlist is the default, so about 1000 and 300 does not fill it
> up. I would recommend a recompile with libevent because of your
> somewhat high load (then you can increase the requestlist and range to
> several thousand, and in recent versions the default increases by
> itself, http://www.unbound.net/documentation/howto_optimise.html )
I have read this document many times since I started using unbound (and
I will read it again ;). But which parameter defines the requestlist
size, or actually influences it?
>> [..]. Could that impact unbound's responsiveness?
> No, other queries take priority over these older queries.
> The requestlist is divided into two halves: run-to-completion, and
> fast-stuff. The run-to-completion half is just that. The fast-stuff
> half deletes older queries to make room for new queries (but not until
> the jostle-timeout has expired, otherwise everything that comes in
> could be deleted immediately under a DoS).
Thanks for the explanation. Is this written somewhere in the docs as well?
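The two-halves behaviour described above can be modelled roughly as follows. This is only a sketch of the explanation as quoted, not unbound's actual code: the class name `RequestList`, its method `add`, the returned status strings, and the use of millisecond integer timestamps are all assumptions made for illustration.

```python
from collections import deque


class RequestList:
    """Toy model of a requestlist split into two halves (hypothetical names)."""

    def __init__(self, capacity, jostle_timeout_ms):
        self.capacity = capacity
        # Assumption: half of the slots (rounded up) run to completion.
        self.max_forever = (capacity + 1) // 2
        self.jostle_timeout_ms = jostle_timeout_ms
        self.forever = []      # run-to-completion entries: (name, start_ms)
        self.jostle = deque()  # fast-stuff entries, oldest first

    def add(self, name, now_ms):
        """Try to admit a new query and report what happened to it."""
        if len(self.forever) < self.max_forever:
            self.forever.append((name, now_ms))
            return "run-to-completion"
        if len(self.forever) + len(self.jostle) < self.capacity:
            self.jostle.append((name, now_ms))
            return "fast"
        # The list is full: only an old enough fast-stuff entry may be evicted.
        if self.jostle:
            _, started_ms = self.jostle[0]
            if now_ms - started_ms > self.jostle_timeout_ms:
                self.jostle.popleft()
                self.jostle.append((name, now_ms))
                return "jostled"
        # Nothing is old enough to evict, so the new query is dropped.
        return "dropped"
```

With a capacity of 4 and a 200 ms timeout, a fifth query arriving 100 ms after the oldest fast-stuff entry is dropped, while one arriving 300 ms after it replaces that entry. This is the protection against a flood evicting everything immediately.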
>> Note: jostle-timeout is still set to the default (see my config below).
> Yes that should be OK. If you lower it, it will be more likely to drop
> the groupinfra stuff.
OK. I may have some questions about that, but I will read the docs first.
>> I am asking that because sometimes our unbounds have a random hiccup and
>> I am wondering if it could be due to this or not. The 'hiccup' is very
>> hard to debug because it's random (once a month or so) on servers doing
>> something like 500 to 1500 qps each so increasing the verbosity from 1
>> to 2 is not really possible :)
OK, so I think I will have to write a script that increases verbosity
when it seems that unbound can't resolve anymore, and hopefully I will
be able to catch this nasty issue (it could be network related).
> What seems to happen is groupinfra has a lot of servers. And they
> sometimes experience outages. When they experience an outage, unbound
> gets timeouts and tries to fetch the names, but also the other
> nameserver names (and there are a lot of them). Given user demand for
> groupinfra, unbound starts to explore all the nameservers for
> groupinfra, with timeouts and thus the entries fill up your requestlist.
> The dependency structure is like that log excerpt that you show.
> Because the thing has timeouts those entries are necessarily pretty old,
> and thus (the ones in the fast-stuff list) would be dropped to make room
> for new queries (if there was a lack of space, but there is no lack of
> space, so these queries are performed: there is interest and there is
> capacity to undertake actions to find the answers).
Yep, OK, I understand, but it is still weird to see unbound trying to
resolve something almost forever, for instance about 143000 secs, i.e.
nearly 40 hours :) But we have the resources, so maybe one day it will
work (I reckon this domain just never works ;).
252 AAAA IN uk-dc007.groupinfra.com. 142994.571268 iterator wants AAAA
IN au-dc012.groupinfra.com. AAAA IN br-dc003.groupinfra.com. AAAA IN
de-dc008.groupinfra.com. AAAA IN my-dc003.groupinfra.com. AAAA IN
nl-dc006.groupinfra.com. AAAA IN ph-dc001.groupinfra.com.
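To spot entries like this one automatically, a small parser can pull the age out of each requestlist line. The sketch below assumes the field order seen in the entry above (index, type, class, name, age in seconds, then the module state). The function names `parse_entry` and `old_entries` and the dictionary keys are hypothetical.

```python
def parse_entry(line):
    """Parse one requestlist line into a dict, or return None if it does not fit."""
    fields = line.split()
    if len(fields) < 6 or not fields[0].isdigit():
        return None
    try:
        age = float(fields[4])
    except ValueError:
        return None
    return {
        "index": int(fields[0]),
        "qtype": fields[1],
        "qclass": fields[2],
        "name": fields[3],
        "age_seconds": age,
        "state": " ".join(fields[5:]),
    }


def old_entries(lines, min_age_seconds):
    """Return the parsed entries that are at least min_age_seconds old."""
    entries = (parse_entry(line) for line in lines)
    return [e for e in entries if e and e["age_seconds"] >= min_age_seconds]
```

Dividing `age_seconds` by 3600 gives the age in hours, which makes very old entries easy to pick out of a long list.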