Unbound randomly fails to resolve names

Mon Jul 27 14:51:24 UTC 2020

Hi Ray,

On 23/07/2020 15:23, RayG wrote:
> Hi George,
> 
> OK thanks for the confirmation of the other issues I have seen.
> 
> With respect to the use-caps-for-id when you say "as long as the other side supports it" CloudFlare does support it because it works most of the time. 
> 
> You also said:
> 
> "When unbound asks for an.ExaMple.domAin.NeT and the record is not cached in the forwarder, the answer will contain the correct case.
> Afterwards, when the answer is cached, the wrong casing (always lowercase) will be used, and until the TTL expires I assume.
> This results in a mismatch between query and reply if use-caps-for-id is used."
> 
> So am I right in understanding that the initial query goes to CloudFlare, and because CloudFlare has to go off and resolve the name itself (it's not in the cache), the correct case is preserved? Now CloudFlare has the name stored in its cache. A subsequent query for the same DNS name then arrives with randomly different casing, a case-insensitive match is found in CloudFlare's cache, and the cache entry's casing is returned rather than the casing the client just requested. Then you get the error. This sounds like a bug to me and something CloudFlare should be aware of and fix? Is that what you were referring to when you said "I will try to reach the people involved"?
Indeed.
On second observation though, the Cloudflare resolver seems to reply
properly to 0x20 (use-caps-for-id) queries.
The QUESTION section is always copied from the query, so its case is
consistent, and it is the section that matters for 0x20 support.
The other sections (e.g., ANSWER) always return lowercase from cache.
This is uncommon compared with other resolvers' behavior (but not wrong)
and it is what tricked me previously.
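
To illustrate the distinction, here is a minimal Python sketch. It assumes,
as described above, that only the QUESTION name has to come back with
exactly the case that was sent, while names in other sections are compared
case-insensitively. The function names and the reply layout are made up for
illustration and are not unbound's code.

```python
def question_matches(sent_qname: str, reply_qname: str) -> bool:
    """0x20 check: the QUESTION name must come back exactly as it was sent."""
    return sent_qname == reply_qname


def answer_owner_matches(sent_qname: str, owner_name: str) -> bool:
    """Owner names in other sections are compared case-insensitively."""
    return sent_qname.lower() == owner_name.lower()


sent = "an.ExaMple.domAin.NeT."
# Hypothetical cached reply: QUESTION copied from the query, ANSWER lowercased.
reply = {
    "question": "an.ExaMple.domAin.NeT.",
    "answer_owner": "an.example.domain.net.",
}

ok = question_matches(sent, reply["question"]) and answer_owner_matches(
    sent, reply["answer_owner"]
)
print(ok)
```

A reply like this passes the check even though the ANSWER section is all
lowercase; only a changed QUESTION name would count as a 0x20 mismatch.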

> 
> For the "tcp error" it would be nice to have a bit more information in the log as to what actually failed e.g. connection refused, timed out etc. It would help to show it was the other end not Unbound.
> 
> I was using just CloudFlare to make sure that when I submitted the logs etc. that I had a repeatable, consistent issue to look at. And yes since setting 'use-caps-for-id' to 'no' there have been no issues.
This is good to know. It verifies that the issue is indeed related to 0x20.

> 
> I have now set 'use-caps-for-id' to 'yes' and changed the list of forwarders to see what happens:
From your latest email we see that the issue keeps happening even for
other resolvers. What I gather from your log is that due to the tcp
errors (timeouts), unbound sees that there is no answer yet for the 0x20
query and starts the 0x20 fallback. This ultimately results in failure
(SERVFAIL) to the client.

Upon closer inspection I see the following is happening:
1. Unbound tries to resolve a domain and performs qname perturbation
   because of 0x20.
2. Due to tcp errors (timeouts) from the other side and because unbound
   is configured with 0x20, the fallback logic is triggered.
   "Capsforid: timeouts, starting fallback"
2b. (From now on unbound does not care about the proper case.)
3. Unbound eventually gets the first response, which it will try to
   match against other replies.
   "Capsforid: starting fallback"
4. Unbound asks other upstream servers and tries to match the reply to
   make sure that what it got on the previous step is the same as with
   other servers/replies.
   "Capsforid: reply is equal. go to next fallback"
5. Unbound will try to get at least x*3 identical replies to make sure
   that the reply is indeed consistent between different upstreams.
   x is the number of addresses for that delegation point (number
   of IPs in the forward section in your case).
6. In your case, because of the timeouts in between, this number
   cannot be reached before the limit of glue fetches per delegation
   point (16) is reached, which terminates the query.
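
The steps above can be sketched as a toy model in Python. This is only a
simplified reading of the description, not unbound's actual code: it
assumes every attempt (including a timeout) counts against the limit of 16,
and the function name, the outcome strings and the value x = 4 are all
made up for illustration.

```python
def fallback_outcome(outcomes, num_addrs, fetch_limit=16):
    """Toy model of the 0x20 fallback described in the steps above.

    outcomes:    iterable of "reply" (a reply identical to the first one)
                 or "timeout", one entry per attempt.
    num_addrs:   x, the number of addresses for the delegation point.
    fetch_limit: assumed budget of attempts before the query is terminated.
    """
    needed = 3 * num_addrs  # x*3 identical replies are required
    matched = 0
    for attempt, outcome in enumerate(outcomes, start=1):
        if attempt > fetch_limit:
            break  # budget exhausted before enough identical replies
        if outcome == "reply":
            matched += 1
            if matched >= needed:
                return "answer"
    return "SERVFAIL"


# With x = 4 (illustrative), 12 identical replies are needed within 16 attempts.
print(fallback_outcome(["reply"] * 12, num_addrs=4))
print(fallback_outcome(["timeout"] * 4 + ["reply"] * 20, num_addrs=4))
print(fallback_outcome(["timeout"] * 5 + ["reply"] * 20, num_addrs=4))
```

In this model a handful of timeouts is enough to use up the budget, which
is why the failure only shows up when the upstreams are slow to answer.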

This is only happening randomly because in order for the 0x20 fallback
logic to kick in you need either:
- Wrong case in the reply (not your case), or
- No answer (timeouts) at least 3 times.

So if you want to avoid the random SERVFAILs I would suggest turning off
use-caps-for-id.

0x20 is meant to add random bits on top of the query ID by perturbing
the letter case of the qname. This makes cache poisoning more
difficult and is more relevant to UDP.
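
As a rough illustration of the idea (a sketch only, not how unbound
implements it), the following Python snippet flips the case of each ASCII
letter at random and counts how many extra random bits that gives. The
function names are made up for this example.

```python
import random


def perturb_case(qname: str, rng: random.Random) -> str:
    """Randomly set each ASCII letter to upper or lower case (the 0x20 bit)."""
    return "".join(
        (c.upper() if rng.getrandbits(1) else c.lower())
        if c.isascii() and c.isalpha()
        else c
        for c in qname
    )


def extra_bits(qname: str) -> int:
    """Each ASCII letter in the name contributes one extra random bit."""
    return sum(1 for c in qname if c.isascii() and c.isalpha())


# A real resolver would use a cryptographically strong random source.
rng = random.Random()
name = "an.example.domain.net."
print(perturb_case(name, rng), extra_bits(name))
```

Digits, dots and hyphens carry no case, so short or mostly numeric names
gain little from this.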

In your case, where you send all queries to specific resolvers over TLS,
you can safely remove that.

I suppose you use DoT for privacy, so I would also use
`forward-first: no` (default) to make sure that unbound does not leak
any queries to other servers except the ones you have specified over DoT.

I hope that explains things,
-- George

> 
> forward-zone: # MyForwardZones.conf
>   name: "."
>   forward-tls-upstream: yes
>   forward-first: yes
>   # Cloudflare DNS
>   forward-addr: 1.1.1.1@853#cloudflare-dns.com
>   forward-addr: 1.0.0.1@853#cloudflare-dns.com
>   forward-addr: 2606:4700:4700::1111@853#cloudflare-dns.com
>   forward-addr: 2606:4700:4700::1001@853#cloudflare-dns.com
>   # Quad9
>   forward-addr: 2620:fe::fe@853#dns.quad9.net
>   forward-addr: 9.9.9.9@853#dns.quad9.net
>   #forward-addr: 2620:fe::9@853#dns.quad9.net
>   #forward-addr: 149.112.112.112@853#dns.quad9.net
>   # Google
>   forward-addr: 8.8.8.8@853#Dns.google
>   forward-addr: 8.8.4.4@853#Dns.google
>   forward-addr: 2001:4860:4860::8888@853#Dns.google
>   forward-addr: 2001:4860:4860::8844@853#Dns.google
>   # DNS Privacy
>   forward-addr: 94.130.110.185@853#ns1.dnsprivacy.at
>   forward-addr: 94.130.110.178@853#ns2.dnsprivacy.at
> 
> Thanks again for the information it has been useful to me and I suspect others.
> 
> Ray
> 
> -----Original Message-----
> From: George Thessalonikefs <george at nlnetlabs.nl> 
> Sent: 22 July 2020 17:30
> To: RayG <rgsub1 at btinternet.com>
> Cc: unbound-users at lists.nlnetlabs.nl
> Subject: Re: Unbound randomly fails to resolve names
> 
> Hi Ray,
> 
> On 21/07/2020 13:26, RayG wrote:
>> Hi George, Oliver, Andi,
>>
>> @George: Thanks for your reply.
>>
>> I have made the adjustment we will see how it goes.
>>
>> But as Oliver Psotta says at https://calomel.org/unbound_dns.html there are good reasons for having it enabled.
>>
>> Also on the page: https://www.grc.com/dns/dns.htm there is a "Spoofability" test which also suggests having mixed case is good.
> Having the option enabled is good as long as the other side supports it.
> This is not the case for you, at least for now.
> 
> If you want to keep it enabled you can enrich your forwarders configuration with other public DoT resolvers.
> You can find more information at
> https://dnsprivacy.org/wiki/display/DP/DNS+Privacy+Public+Resolvers#DNSPrivacyPublicResolvers-DNS-over-TLS(DoT).
> 
>>
>> There are also the TCP Errors e.g.:
>> 21/07/2020 11:15:01 C:\Program Files\Unbound\unbound.exe[16308:0] 
>> debug: tcp error for address 1.0.0.1 port 853
> Nothing wrong here, seems like a tcp error to that IP and port. Unbound couldn't make the connection (maybe network routing problems, unavailability from the other side or the local system) and it should go on to try the next available server.
> 
>> These are unexplained so far as are some of the other entries like:
>> 21/07/2020 10:41:40 C:\Program Files\Unbound\unbound.exe[16308:0] 
>> debug: request E.ROOT-SERVERS.NET. has exceeded the maximum number of 
>> glue fetches 65
>> 21/07/2020 10:41:40 C:\Program Files\Unbound\unbound.exe[16308:0] 
>> debug: return error response SERVFAIL
>> And
>> 21/07/2020 10:41:40 C:\Program Files\Unbound\unbound.exe[16308:0] 
>> debug: request has exceeded the maximum  number of nxdomain nameserver 
>> lookups with 13
>> 21/07/2020 10:41:40 C:\Program Files\Unbound\unbound.exe[16308:0] 
>> debug: return error response SERVFAIL
>>
>> All of which are still occurring, should they be happening?
> Both of the above are because resolution has exceeded a set of limits and the query is considered as hitting a dead end from unbound's point of view (there seem to be no available servers that can provide an answer).
> Unbound stops resolution and returns SERVFAIL to the client(s).
> 
> As you are forwarding to a limited set of resolvers (in contrast with reaching the different authoritative nameservers during normal resolution), those kinds of limits could be reached more easily if there are communication issues, as the upstream is the same and sole responsible server for all the delegation points.
> 
>>
>> Also I have been able to look back at some of my backup images these were all running the same way as currently and the event log messages like:
>>
>> Level	Date and Time	Source	Event ID	Task Category
>> Warning	16/07/2020 15:48:44	Microsoft-Windows-DNS-Client	1014 (1014)	Name resolution for the name enews.synology.com timed out after none of the configured DNS servers responded.
>>
>> Are present.
>>
>> These started occurring after the release of V1.9.4. The event log on that backup image (which I am able to run as a virtual machine) did not contain any of the above errors.
>> So V1.9.4 was OK
>>
>> Unfortunately between the above VM and the next one I can run there
>> was V1.9.5, and the download file for that was dated 19/11/2019; the V1.9.6 download date was 12/12/2019. I can say that the above type of errors started appearing just after V1.9.5 was downloaded. I normally install the new version on the same day that I download it. So something happened somewhere between V1.9.4 and V1.9.5 and has been the same ever since.
> I believe the difference in behavior is only a coincidence for unbound.
> 1.9.5 was a CVE release that was solving a security vulnerability in the ipsecmod module. It had nothing to do with upstream connections, tcp connections, or DoT and if unbound is not compiled with the ipsecmod, the code should be identical to the 1.9.4 version.
> 
> Best regards,
> -- George
>>
>> I hope that helps.
>>
>> Thanks for any further information/comments
>>
>> Ray
>> -----Original Message-----
>> From: George Thessalonikefs <george at nlnetlabs.nl>
>> Sent: 20 July 2020 15:09
>> To: unbound-users at lists.nlnetlabs.nl
>> Subject: Re: Unbound randomly fails to resolve names
>>
>> Hi Ray, Andi,
>>
>> I see from Ray's log that use-caps-for-id: is enabled.
>> I also see that the forwarding resolvers used seem to have an issue with
>> 0x20 replies (use-caps-for-id related).
>>
>> For example:
>> When unbound asks for an.ExaMple.domAin.NeT and the record is not cached in the forwarder, the answer will contain the correct case.
>> Afterwards, when the answer is cached, the wrong casing (always
>> lowercase) will be used, and until the TTL expires I assume. This results in a mismatch between query and reply if use-caps-for-id is used.
>>
>> Unbound's fallback may or may not help at that time. From your log I
>> see that the fallback does not help (returns SERVFAIL after some
>> further tries) and consecutive queries try without 0x20.
>>
>> I will try to reach the people involved but for now turning off use-caps-for-id should help.
>>
>> Let us know how it goes.
>>
>> Best regards,
>> -- George
>>
>>
>