[Unbound-users] Unbound 1.4.1 cache problems
W.C.A. Wijngaards
wouter at NLnetLabs.nl
Sat Jan 9 09:24:20 UTC 2010
Hi Jakub,
I cannot know for sure, since the log seems to have been shortened by
syslog, but I think you may have the same problem that Haw Loeung
reported. I have since corresponded with him and I believe the issue
is resolved. A patch is in svn for Unbound.
The issue seems to be caused by the parent-child disagreement fix in
1.4.1, made worse when IPv6 is unavailable (either not supported by the
kernel or disabled with a do-ip6: no statement). Just turning IPv6 back
on (even if you have no IPv6 connectivity) may stop this bug.
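For reference, the workaround would look like this in unbound.conf (a
sketch; do-ip6: yes is the default, so simply removing an explicit
do-ip6: no line has the same effect):

```
server:
    # Re-enable IPv6 in Unbound even on hosts without IPv6
    # connectivity; this avoids the 1.4.1 code path that
    # triggers the SERVFAILs described below.
    do-ip6: yes
```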
Sorry for the inconvenience,
Wouter
On 01/08/2010 10:36 PM, Jakub Heichman wrote:
> Hello there,
>
> I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
> servers. The servers are doing about 3-7k requests per second.
> I have noticed that after some time of operation, servers
> intermittently stop responding to queries to certain domains from
> cache (return SERVFAIL). It particularly doesn't like webex.com, and
> sometimes vmware.com and dell.com.
>
> # unbound-host webex.com
> webex.com has address 64.68.120.99
> webex.com mail is handled by 50 mx3.webex.com.
> webex.com mail is handled by 10 mx1.webex2.iphmx.com.
> webex.com mail is handled by 10 mx2.webex2.iphmx.com.
> webex.com mail is handled by 50 mx2.webex.com.
>
> But the A record doesn't seem to get cached:
>
> # dig webex.com @localhost
>
> ; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
> ; (1 server found)
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;webex.com. IN A
>
> ;; Query time: 0 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Fri Jan 8 19:16:01 2010
> ;; MSG SIZE rcvd: 27
>
> # unbound-control dump_cache | grep ^webex.com
> webex.com. 4589 IN NS ns01.webex.com.
> webex.com. 4589 IN NS ns05.webex.com.
> webex.com. 4589 IN NS ns07.webex.com.
> webex.com. 4589 IN NS ns03.webex.com.
> webex.com. 4589 IN NS ns08.webex.com.
> webex.com. 4589 IN NS ns06.webex.com.
>
> It doesn't respond to NS queries either (no A records for
> nsxx.webex.com in the cache):
>
> # dig ns webex.com @localhost
>
> ; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
> ; (1 server found)
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;webex.com. IN NS
>
> ;; Query time: 0 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Fri Jan 8 20:53:18 2010
> ;; MSG SIZE rcvd: 27
>
>
>
> After flushing webex.com I can query the cache again:
>
> # unbound-control flush webex.com
> ok
> # dig +short webex.com @localhost
> 64.68.120.99
>
> Log with increased verbosity:
>
> Jan 8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
> query <webex.com. A IN>
> Jan 8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
> Jan 8 21:05:32 cns01j unbound: [25052:3] info:
> DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
> avail)
> Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns08.webex.com. A
> Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns05.webex.com. A
> Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns01.webex.com. A
> Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns06.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
> <webex.com. A IN>
> Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
> <webex.com. A IN>
> Jan 8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
> <webex.com. A IN>
> Jan 8 21:05:32 cns unbound: [25052:3] info:
> DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
> avail)
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns08.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns05.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns01.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns06.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
> Jan 8 21:05:32 cns unbound: [25052:3] info: validator operate: query
> <webex.com. A IN>
> Jan 8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
> states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
> replies sent, 0 replies dropped, 0 states jostled out
> Jan 8 21:05:32 cns unbound: [25052:3] info: average recursion
> processing time 4.172319 sec
> Jan 8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
> processing times
> Jan 8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
> median[50%]=0.00744727 [75%]=0.0524288
> Jan 8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.000000 0.000001 9
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.000256 0.000512 1
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.001024 0.002048 1
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.002048 0.004096 19
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.004096 0.008192 2
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.008192 0.016384 11
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.016384 0.032768 3
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.032768 0.065536 5
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.065536 0.131072 15
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.131072 0.262144 10
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.262144 0.524288 5
> Jan 8 21:05:32 cns unbound: [25052:3] info: 0.524288 1.000000 2
> Jan 8 21:05:32 cns unbound: [25052:3] info: 256.000000 512.000000 1
>
>
>
> I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
> I don't see this problem.
>
> Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running
>
> num-threads: 4
> outgoing-range: 10000
> msg-cache-size: 500m
> msg-cache-slabs: 1
> num-queries-per-thread: 10000
> jostle-timeout: 4000
> so-rcvbuf: 8m
> rrset-cache-size: 1024m
> rrset-cache-slabs: 1
> cache-max-ttl: 86400
>
> Any help would be greatly appreciated :-)
>
> Thanks,
>