Unbound randomly fails to resolve names

Mon Jul 27 15:48:04 UTC 2020

George,

Thanks for the detailed reply; I now understand better what is going on.

I hope others also found this thread of benefit.

I have now upgraded unbound to 1.11.0 and so far all is looking good. 

There are still some more things I want to test, but having the explanation below helps a lot.

Thanks

Ray
-----Original Message-----
From: George Thessalonikefs <george at nlnetlabs.nl> 
Sent: 27 July 2020 15:51
To: RayG <rgsub1 at btinternet.com>
Cc: unbound-users at lists.nlnetlabs.nl
Subject: Re: Unbound randomly fails to resolve names

Hi Ray,

On 23/07/2020 15:23, RayG wrote:
> Hi George,
> 
> OK thanks for the confirmation of the other issues I have seen.
> 
> With respect to the use-caps-for-id when you say "as long as the other side supports it" CloudFlare does support it because it works most of the time. 
> 
> You also said:
> 
> "When unbound asks for an.ExaMple.domAin.NeT and the record is not cached in the forwarder, the answer will contain the correct case.
> Afterwards, when the answer is cached, the wrong casing (always lowercase) will be used, and until the TTL expires I assume.
> This results in a mismatch between query and reply if use-caps-for-id is used."
> 
> So am I right in understanding that when the initial query goes to CloudFlare and CloudFlare has to go off and resolve the name itself (it's not in the cache), the correct case is preserved? Now CloudFlare has the name stored in its cache. A subsequent query for the same DNS name is then made with randomly different casing, a case-insensitive match is found in CloudFlare's cache, and the cache entry's casing is returned rather than the casing the client just requested. Then you get the error. This sounds like a bug to me and something CloudFlare should be aware of and fix? Is that what you were referring to when you said "I will try to reach the people involved"?
Indeed.
On second observation though, the Cloudflare resolver seems to reply
properly to 0x20 (use-caps-for-id) queries.
The QUESTION section is always consistent (copied from the query) and is
the one that matters for 0x20 support.
The other sections (i.e., ANSWER) always return lowercase from cache.
This is uncommon compared with other resolvers' behavior (but not wrong)
and it was what tricked me previously.
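
To illustrate the distinction, here is a minimal Python sketch (not
Unbound's actual code; the function and variable names are made up for
this example). It assumes the 0x20 check is an exact, case-sensitive
comparison of the name in the QUESTION section only:

```python
def question_matches(sent_qname: str, reply_question_qname: str) -> bool:
    """Assumed 0x20 check: the QUESTION name must come back byte-for-byte."""
    return sent_qname == reply_question_qname


# Hypothetical reply: QUESTION copied from the query, ANSWER owner name
# served in lowercase from the upstream cache.
sent = "an.ExaMple.domAin.NeT"
reply_question = "an.ExaMple.domAin.NeT"
reply_answer_owner = "an.example.domain.net"

# The 0x20 check passes because the QUESTION section preserved the case.
caps_ok = question_matches(sent, reply_question)

# The ANSWER owner differs in case only, which is still the same DNS name.
same_name = reply_answer_owner.lower() == sent.lower()
```

Under that assumption a lowercase ANSWER section alone does not cause a
0x20 mismatch.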

> 
> For the "tcp error" it would be nice to have a bit more information in the log as to what actually failed e.g. connection refused, timed out etc. It would help to show it was the other end not Unbound.
> 
> I was using just CloudFlare to make sure that when I submitted the logs etc. that I had a repeatable, consistent issue to look at. And yes since setting 'use-caps-for-id' to 'no' there have been no issues.
This is good to know. It verifies that the issue is indeed related to 0x20.

> 
> I have now set 'use-caps-for-id' to 'yes' and changed the list of forwarders to see what happens:
From your latest email we see that the issue keeps happening even for
other resolvers. What I gather from your log is that due to the tcp
errors (timeouts), unbound sees that there is no answer yet for the 0x20
query and starts the 0x20 fallback. This ultimately results in failure
(SERVFAIL) to the client.

Upon closer inspection I see the following is happening:
1. Unbound tries to resolve a domain and performs qname perturbation
   because of 0x20.
2. Due to tcp errors (timeouts) from the other side and because unbound
   is configured with 0x20, the fallback logic is triggered.
   "Capsforid: timeouts, starting fallback"
2b. (From now on unbound does not care about the proper case.)
3. Unbound eventually gets the first response, which it will try to
   match against other replies.
   "Capsforid: starting fallback"
4. Unbound asks other upstream servers and tries to match the reply to
   make sure that what it got on the previous step is the same as with
   other servers/replies.
   "Capsforid: reply is equal. go to next fallback"
5. Unbound will try to get at least x*3 identical replies to make sure
   that the reply is indeed consistent between different upstreams.
   x is the number of addresses for that delegation point (number
   of IPs in the forward section in your case).
6. In your case, because of the timeouts in between, this number
   cannot be reached before the limit of glue fetches per delegation
   point (16) is reached, which terminates the query.
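
The arithmetic behind steps 5 and 6 can be sketched in Python. This is a
simplified model, not Unbound's implementation: it assumes that every
attempt during fallback (reply or timeout) counts against the limit of
16, and the names used are hypothetical.

```python
GLUE_FETCH_LIMIT = 16  # limit per delegation point quoted above


def fallback_succeeds(num_forwarders: int, outcomes: list[str]) -> bool:
    """Return True if x*3 identical replies arrive within the limit.

    `outcomes` is the sequence of attempts, each "reply" or "timeout".
    Assumption: both kinds of attempt count against the limit.
    """
    needed = 3 * num_forwarders
    matched = 0
    for attempt, outcome in enumerate(outcomes, start=1):
        if attempt > GLUE_FETCH_LIMIT:
            break  # limit reached, the query is terminated
        if outcome == "reply":
            matched += 1
        if matched >= needed:
            return True
    return False


# Four forwarders need 12 identical replies.
no_timeouts = fallback_succeeds(4, ["reply"] * 12)
# Five timeouts leave room for only 11 replies within 16 attempts.
with_timeouts = fallback_succeeds(4, ["timeout"] * 5 + ["reply"] * 12)
```

In this model the more forward addresses are configured, the fewer
timeouts the fallback can absorb before the query fails.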

This is only happening randomly because in order for the 0x20 fallback
logic to kick in you need either:
- Wrong case in the reply (not your case), or
- No answer (timeouts) at least 3 times.

So if you want to avoid the random SERVFAILs I would suggest turning off
use-caps-for-id.

0x20 is meant to add random bits on top of the query ID by perturbing
the letter case of the qname. This is to make cache poisoning more
difficult and is more relevant to UDP.
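
As a rough illustration of the idea (a sketch only, not how Unbound
implements it; the function names are invented for this example), each
ASCII letter in the qname can carry one random bit in its case:

```python
import secrets


def perturb_qname(qname: str) -> str:
    """Randomly flip the case of each ASCII letter in the query name."""
    out = []
    for ch in qname:
        if ch.isascii() and ch.isalpha():
            out.append(ch.upper() if secrets.randbits(1) else ch.lower())
        else:
            out.append(ch)  # digits, dots and hyphens are left unchanged
    return "".join(out)


def extra_bits(qname: str) -> int:
    """Number of letters, i.e. the random bits the case pattern can carry."""
    return sum(1 for ch in qname if ch.isascii() and ch.isalpha())


perturbed = perturb_qname("an.example.domain.net")
bits = extra_bits("example.com")
```

An off-path attacker would have to guess the case pattern in addition
to the query ID and source port, which is why short names with few
letters gain less from this.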

In your case, where you send all queries to specific resolvers over TLS,
you can safely remove that.

I suppose you use DoT for privacy, so I would also use `forward-first:
no` (default) to make sure that unbound does not leak any queries to
other servers except the ones you have specified over DoT.

I hope that explains things,
-- George