[Dnssec-trigger] Why Does unbound Fail on So Many Requests?

Sun Apr 20 00:57:30 UTC 2014

On 4-19-14 20:16:58 Paul Wouters wrote:
> On Sat, 19 Apr 2014, Garry T. Williams wrote:
>
> >    unbound[773]: [773:1] info: validation failure t6021.network-dns-unbound-user.dnstalk.us.dlv.isc.org. DLV IN
> >    unbound[773]: [773:0] info: validation failure natenom.name.dlv.isc.org. DLV IN
> >    unbound[773]: [773:0] info: validation failure platform.twitter.com.dlv.isc.org. DLV IN
>
> >    garry at vfr$ dig +dnssec t6021.network-dns-unbound-user.dnstalk.us @127.0.0.1
> >
> >    ; <<>> DiG 9.9.4-P2-RedHat-9.9.4-12.P2.fc20 <<>> +dnssec t6021.network-dns-unbound-user.dnstalk.us @127.0.0.1
> >    ;; global options: +cmd
> >    ;; Got answer:
> >    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 56300
>
> That should not happen. I've seen at times that there are timing
> failures when it takes long to get to the hotspot. To test that, you
> can try to restart unbound but load it with the same forwarders
> after you have authenticated with the hotspot:

Thanks for the reply.  I should have mentioned that my first trial
with this stuff is on a desktop system at home.  No hotspot here.  I'm
just doing some browsing and my Web browser reports errors
occasionally for various domains.

While trying a broken domain name, I noticed that one of my ISP's
servers was not responding at all and dig timed out waiting for it.
The other two responded with A and RRSIG records.  My local unbound
gives back SERVFAIL after a shorter wait.

    garry at vfr$ dig +dnssec test.dnssec-or-not.com @65.68.49.50

    ; <<>> DiG 9.9.4-P2-RedHat-9.9.4-12.P2.fc20 <<>> +dnssec test.dnssec-or-not.com @65.68.49.50
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached
    garry at vfr$ dig +dnssec test.dnssec-or-not.com @127.0.0.1

    ; <<>> DiG 9.9.4-P2-RedHat-9.9.4-12.P2.fc20 <<>> +dnssec test.dnssec-or-not.com @127.0.0.1
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 37221
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags: do; udp: 4096
    ;; QUESTION SECTION:
    ;test.dnssec-or-not.com.                IN      A

    ;; Query time: 1075 msec
    ;; SERVER: 127.0.0.1#53(127.0.0.1)
    ;; WHEN: Sat Apr 19 20:28:42 EDT 2014
    ;; MSG SIZE  rcvd: 51

    garry at vfr$ fc -Dl
    ...
    10843  0:15  dig +dnssec test.dnssec-or-not.com @65.68.49.50
    10844  0:06  dig +dnssec test.dnssec-or-not.com @127.0.0.1
    garry at vfr$

Here, the BellSouth (now AT&T) server at 65.68.49.50 doesn't respond
within 15 seconds.  Unbound gives up and returns an error after six
seconds.  Queries to the other two BellSouth servers return normally
in under one second.

    garry at vfr$ dig +dnssec test.dnssec-or-not.com @205.152.37.23
    ...
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43577
    ...
    ;; Query time: 719 msec
    ;; SERVER: 205.152.37.23#53(205.152.37.23)
    ;; WHEN: Sat Apr 19 20:31:44 EDT 2014
    ;; MSG SIZE  rcvd: 910
    garry at vfr$ fc -Dl
    ...
    10846  0:01  dig +dnssec test.dnssec-or-not.com @205.152.37.23

My conclusion is that unbound fails to route around the one
unresponsive server in my ISP's network.
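
The per-server check above can be repeated with a short timeout
instead of waiting out dig's default.  What follows is only a rough
sketch in Python using the standard library; the function names, the
2-second timeout, and the fixed query ID are my own assumptions and
have nothing to do with unbound or dnssec-trigger internals.  Unlike
"dig +dnssec", it sends a plain query without EDNS or the DO bit, so
it only shows whether a server answers at all, not whether it returns
RRSIG records.

```python
import socket
import struct

# Hypothetical helper names; this is an illustration, not part of any tool.

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (RD flag set, no EDNS) for name."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, others 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels, ended by a zero byte.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Question: QNAME, QTYPE (1 = A by default), QCLASS (1 = IN).
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def udp_exchange(server, payload, timeout):
    """Send one UDP datagram to server port 53; return the reply or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (server, 53))
            reply, _ = sock.recvfrom(4096)
            return reply
        except OSError:
            # Covers timeouts as well as unreachable-network errors.
            return None


def probe(server, name, timeout=2.0, exchange=udp_exchange):
    """Return the response code name, or a marker if no usable reply came."""
    reply = exchange(server, build_query(name), timeout)
    if reply is None:
        return "NO REPLY"
    if len(reply) < 12:
        return "MALFORMED"
    # RCODE is the low four bits of the fourth header byte.
    rcode = reply[3] & 0x0F
    return RCODES.get(rcode, "RCODE%d" % rcode)
```

Calling probe() once for each forwarder address with the same name,
for example probe("65.68.49.50", "test.dnssec-or-not.com"), would
show which servers answer within the timeout and which do not.  The
exchange parameter exists only so the logic can be exercised without
a network.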

> I think dnssec-trigger/unbound should have a combination to make
> negative-ttl much much shorter on "enduser systems" to avoid these
> kind of timing errors.

Perhaps.

My observation of this /one case/ tells me that dnssec-trigger/unbound
needs to avoid forwarders that have become unresponsive, and must not
cache the non-response as an answer returned to clients.  But I don't
know how to accomplish that.

I don't remember seeing so many failures when I was running dnsmasq
instead of unbound.  Of course that may be because nothing was ever
logged by dnsmasq -- unbound is very noisy in my system journal.

Anyway, I think you have given me something to go on.  Thank you.

-- 
Garry T. Williams