Timeout semantics of Unbound differ radically from Bind 9

Sun Apr 10 20:28:23 UTC 2016

Thank you for your help Aaron.

Please correct me if I'm wrong, but my understanding is now:

1) 'named' responds with a timeout SERVFAIL after 10 seconds while
Unbound does not.  This difference is the trigger of the problem that
occurred with the relay.  I verified the 'named' behavior with

time dig +retry=0 +time=20 @127.0.0.1 <unresolvable_domain>
# SERVFAIL after 10 seconds

but have not yet rerun this command against Unbound.

2) The DNS lockup experienced on the Tor relay was in effect a DOS
where a Tor ab/user was requesting large numbers of unresolvable (due
to a null-route) GoDaddy domains.  This request flood was exhausting
the 64 in-flight eventdns slots and also triggering eventdns to mark
the single resolv.conf entry pointing to Unbound "down".

3) The reason it works with 'named' under the DOS described in (2) is
the 10-second SERVFAIL timeout response from 'named', which is absent
when Unbound is the resolver.  This is enough to tip the balance away
from a DOS state.

4) The proper way to prevent the DOS in the Unbound resolver scenario
is to tune eventdns with a resolv.conf line similar to

   options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100

Here timeout:5 is the usual value appropriate for a Tor daemon (Tor
clients shift to another relay and retry on DNS failures).  attempts:1
assumes that the resolver is a local Unbound instance, that Unbound
handles all timeout and retry processing, and that no UDP loss is
possible between the 'tor' process and the local Unbound, so it's best
to give up directly after five seconds.  max-inflight:4096 both
mitigates the DOS scenario experienced and maximizes DNS performance of
the exit relay.  max-timeouts:100 should prevent eventdns from marking
the dedicated local resolver as "down" unless it really is down.
Perhaps max-timeouts:1000000 is better in order to completely inhibit
the timed-out "down resolver" logic.
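
As a rough sanity check on the in-flight numbers in (4), here is a
back-of-the-envelope sketch in Python.  It assumes (my assumption, not
something taken from the eventdns source) that every unanswered query
holds exactly one in-flight slot for the full timeout, so that by
Little's law the slots fill once arrival rate times hold time reaches
max-inflight.  The function name is made up for illustration.

```python
def max_sustainable_rate(max_inflight: int, hold_seconds: float) -> float:
    """Arrival rate (queries/second) of never-answered lookups at which
    all in-flight slots become occupied.

    Assumption: each stalled query occupies one slot for hold_seconds
    and is then dropped (attempts:1), so slots in use = rate * hold.
    """
    return max_inflight / hold_seconds


if __name__ == "__main__":
    # 64 is the in-flight limit discussed in this thread; 4096 is the
    # proposed max-inflight value; 5.0 matches timeout:5.
    for slots in (64, 4096):
        rate = max_sustainable_rate(slots, 5.0)
        print(f"max-inflight:{slots} saturates at about {rate} stalled queries/s")
```

Under that assumption the 64-slot limit saturates at about 12.8
unresolvable requests per second, while 4096 slots hold out until
about 819.2 per second.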

5) It seems to me that 'named' does not retry standard resolve requests
at all and assumes that the client application (usually glibc
libresolv.so) will handle retries.  Therefore a reasonable resolv.conf
line for running a 'tor' daemon with 'named' would be

   options timeout:5 attempts:2 max-inflight:4096 max-timeouts:100
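
To compare the two option lines from (4) and (5), here is a small
Python sketch that splits each line into name:value pairs and computes
how long a single lookup could wait before giving up.  This is not the
eventdns parser; the helper names are hypothetical, and the wait
estimate assumes attempts are made back to back with each one waiting
the full timeout.

```python
UNBOUND_LINE = "options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100"
NAMED_LINE = "options timeout:5 attempts:2 max-inflight:4096 max-timeouts:100"


def parse_options(line: str) -> dict:
    """Split a resolv.conf 'options' line into {name: integer value}."""
    fields = line.split()
    if not fields or fields[0] != "options":
        raise ValueError("not an options line: %r" % line)
    opts = {}
    for field in fields[1:]:
        name, _, value = field.partition(":")
        opts[name] = int(value)
    return opts


def worst_case_wait(opts: dict) -> int:
    """Seconds before a lookup is abandoned, assuming every attempt
    waits the full timeout and attempts run one after another."""
    return opts["timeout"] * opts["attempts"]


if __name__ == "__main__":
    for label, line in (("unbound", UNBOUND_LINE), ("named", NAMED_LINE)):
        print(label, worst_case_wait(parse_options(line)), "seconds")
```

With these assumptions the Unbound line gives up after 5 seconds and
the 'named' line after 10 seconds.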

If you see anything mistaken in any of the above please let me know.

Thanks

On Sun, Apr 10, 2016 at 5:34 AM, Aaron Hopkins <lists at die.net> wrote:
> On Sun, 10 Apr 2016, Dhalgren Tor wrote:
>>
>> Interesting!  This explains why Tor relay DNS completely seizes up
>> when GoDaddy null-routes a relay running Unbound.
>
> It's worse than that, I think.  My read of this code suggests that if
> unbound fails to answer for any single query 3 times in a row, eventdns
> marks that copy of unbound as dead for at least 10 seconds and starts
> exponentially backing off use of it, up to an hour.
>
> This is a desirable characteristic if one of your 3 nameservers is broken;
> you'll stop sending it requests and your users won't keep waiting on
> responses that will never come.  Or if you are querying some recursive
> nameserver who doesn't want traffic from you and blackholes you, you'll stop
> throwing them a large volume of unwanted traffic.
>
> According to https://www.unbound.net/documentation/info_timeout.html,
> unbound should already be returning SERVFAIL immediately if it believes all
> servers are dead.  And SERVFAIL should also be returned after all servers are
> queried (and timed out) 5 times.  I suspect that can take more than 15
> seconds and I don't see a way to put an upper bound on that, though.
>
>> Now I have to look into whether that 64 in-flight limit might be a
>> performance constraint for fast exit relays.  Might want a tunable to
>> increase the limit.
>
> It looks like eventdns will respect an /etc/resolv.conf entry for
> "max-inflight: 1000" or similar. If you are limited by inflight requests,
> this could be an easy workaround.
>