[Unbound-users] Unbound 1.4.1 cache problems
Jakub Heichman
kuba.heichman at gmail.com
Fri Jan 8 21:36:09 UTC 2010
Hello there,
I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
servers. The servers are doing about 3-7k requests per second.
I have noticed that after some time of operation, servers
intermittently stop responding to queries to certain domains from
cache (return SERVFAIL). It particularly doesn't like webex.com,
sometimes vmware.com, dell.com.
# unbound-host webex.com
webex.com has address 64.68.120.99
webex.com mail is handled by 50 mx3.webex.com.
webex.com mail is handled by 10 mx1.webex2.iphmx.com.
webex.com mail is handled by 10 mx2.webex2.iphmx.com.
webex.com mail is handled by 50 mx2.webex.com.
But the A record doesnt seem to get cached:
# dig webex.com @localhost
; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;webex.com. IN A
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 8 19:16:01 2010
;; MSG SIZE rcvd: 27
# unbound-control dump_cache | grep ^webex.com
webex.com. 4589 IN NS ns01.webex.com.
webex.com. 4589 IN NS ns05.webex.com.
webex.com. 4589 IN NS ns07.webex.com.
webex.com. 4589 IN NS ns03.webex.com.
webex.com. 4589 IN NS ns08.webex.com.
webex.com. 4589 IN NS ns06.webex.com.
It doesnt respond to NS queries either (No A records for
nsxx.webex.com in cache)
# dig ns webex.com @localhost
; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;webex.com. IN NS
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 8 20:53:18 2010
;; MSG SIZE rcvd: 27
After flushing webex.com I can query the cache again:
# unbound-control flush webex.com
ok
# dig +short webex.com @localhost
64.68.120.99
Log with increased verbosity:
an 8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
query <webex.com. A IN>
Jan 8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
Jan 8 21:05:32 cns01j unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns08.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns05.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns01.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns06.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan 8 21:05:32 cns unbound: [25052:3] info: ns08.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns05.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns01.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns06.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: validator operate: query
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
replies sent, 0 replies dropped, 0 states jostled out
Jan 8 21:05:32 cns unbound: [25052:3] info: average recursion
processing time 4.172319 sec
Jan 8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
processing times
Jan 8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
median[50%]=0.00744727 [75%]=0.0524288
Jan 8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.000000 0.000001 9
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.000256 0.000512 1
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.001024 0.002048 1
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.002048 0.004096 19
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.004096 0.008192 2
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.008192 0.016384 11
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.016384 0.032768 3
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.032768 0.065536 5
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.065536 0.131072 15
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.131072 0.262144 10
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.262144 0.524288 5
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.524288 1.000000 2
Jan 8 21:05:32 cns unbound: [25052:3] info: 256.000000 512.000000 1
I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
I don't see this problem.
Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running
num-threads: 4
outgoing-range: 10000
msg-cache-size: 500m
msg-cache-slabs: 1
num-queries-per-thread: 10000
jostle-timeout: 4000
so-rcvbuf: 8m
rrset-cache-size: 1024m
rrset-cache-slabs: 1
cache-max-ttl: 86400
Any help would be greatly appreciated :-)
Thanks,
--
Jakub Heichman
More information about the Unbound-users
mailing list