Unbound strange stub_zone behavior?

Andrew Forgue andrew at forgue.io
Wed Aug 26 15:43:31 UTC 2020


Ok, I think we found the issue.

First, when an auth_zone is set to `for-downstream: no`, the cache is consulted first.  stub_zones seem to behave the same way as `for-downstream: no`.

Somehow, in our environment, someone is causing the NS records for the root zone (".") to be cached, which makes unbound answer with a referral to the root zone instead of answering from auth data.

I was able to reproduce this by triggering it with `dig -t NS "." @server`

More details:
	https://github.com/NLnetLabs/unbound/issues/292

-Andrew




> On Aug 20, 2020, at 9:43 AM, Andrew Forgue <andrew at forgue.io> wrote:
> 
> 
> 
>> On Jul 14, 2020, at 11:48 AM, Andrew Forgue via Unbound-users <unbound-users at lists.nlnetlabs.nl> wrote:
>> 
>> 
>> 
>>> On Jul 13, 2020, at 11:55 AM, Jan Komissar (jkomissa) <jkomissa at cisco.com> wrote:
>>> 
>>> Hi Andrew,
>>> 
>>> I believe that stub-zones will not work correctly for +norecurse (RD (recursion desired) flag unset) queries. Also, if your blah.example.com has delegations to subzones (even on the same server) and you use a non-standard port, you would need a stub-zone for each sub-zone.
>> 
>> After restarting unbound, non-recursive queries work fine for several days, until they don't (not sure why).  My understanding is that stub_zone presents as if it's local data, and the behavior you're describing would be more like the behavior of a forward zone.
>> 
>>> I would follow Eric's advice to use an auth-zone, either as primary or secondary server (depending on your authoritative requirements).
>> 
>> Yeah, thanks Eric & Jan.  I'll take a look at that; I'm not sure the "proxied" DNS server can do notifies, but it seems to be a good lead.
> 
> Just to bump this again -- here's the progress so far.  We've been able to reproduce this with auth_zones too.
> 
> With my limited knowledge of unbound code and gdb it *appears* that in answer_norec_from_cache:
> 
> daemon/worker.c:492 (or so):
> 
> answer_norec_from_cache(...) {
> ...
> 	dp = dns_cache_find_delegation(&worker->env, qinfo->qname,
> 	    qinfo->qname_len, qinfo->qtype, qinfo->qclass,
> 	    worker->scratchpad, &msg, timenow);
> 	if(!dp) { /* no delegation, need to reprime */
> 	    return 0;
> 	}
> 
> In the happy case, `dp` is NULL, meaning there's no delegation (so it hits `return 0`), and the correct answer is returned.
> 
> In the failure case, `dp` is a delegation point to what looks like the root zone:
> 
> --
> (gdb) p *dp
> $27 = {name = 0x7f66cc516d58 "", namelen = 1, namelabs = 1, nslist = 0x0, target_list = 0x0, usable_list = 0x0, result_list = 0x0, bogus = 0, has_parent_side_NS = 0 '\000', dp_type_mlc = 0 '\000', ssl_upstream = 0 '\000', auth_dp = 0 '\000', no_cache = 0}
> 
> -- 
> broken node:
> 
> dns1 :: ~ » sudo unbound-control lookup blah.example.com
> The following name servers are used for lookup of blah.example.com
> ;rrset 17491 13 1 5 2
> .       17491   IN      NS      c.root-servers.net.
> .       17491   IN      NS      d.root-servers.net.
> .       17491   IN      NS      j.root-servers.net.
> .       17491   IN      NS      e.root-servers.net.
> .       17491   IN      NS      l.root-servers.net.
> .       17491   IN      NS      h.root-servers.net.
> .       17491   IN      NS      f.root-servers.net.
> .       17491   IN      NS      k.root-servers.net.
> .       17491   IN      NS      m.root-servers.net.
> .       17491   IN      NS      i.root-servers.net.
> .       17491   IN      NS      a.root-servers.net.
> .       17491   IN      NS      g.root-servers.net.
> .       17491   IN      NS      b.root-servers.net.
> .       17491   IN      RRSIG   NS 8 0 518400 20200827170000 20200814160000 46594 . ZxJeYw7vVyjxZg8y7mtt5N3YtejDrho11npxtnjt7MMZm/MlbSErowznceyvXYhTkgF4dJOFGcrUkwFekcN86Zw0tN+cHYYb4lpV2o/pYtXIzo2w2OtA0WJURMB1pWcclhma9y648OiGUsEwImRXpCQS7Mgk+XKU05KFCg5yrFW+UC4faaQ1ZiisVnK9GF8CwsHCC82xT7HU/pAMFgF2vEovsomysMuDhBKE1QTP9MN/DqD6bitdqGmhQSC9GxxcRrNCCU8fSnW4UVIiOJ95kaEMDk0kdpTGowBcKx2WCbXN8oKGSYRpJjE+y77mc2mv3cBUBwK9jnqB86jXwZ7enA== ;{id = 46594}
> Delegation with 13 names, of which 13 can be examined to query further addresses.
> It provides 0 IP addresses.
> cache delegation was useless (no IP addresses)
> 
> Any other help finding out how dns_cache_find_delegation returns the root delegations instead of the auth_zone (in this case example.com is the auth zone, with a proper zone file on disk)?
> 
> -Andrew
> 
> 
>> 
>> -Andrew
>> 
>>> Regards,
>>> 
>>> Jan.
>>> 
>>> On 7/12/20, 12:00 PM, "Unbound-users on behalf of Eric Luehrsen via Unbound-users" <unbound-users-bounces at lists.nlnetlabs.nl on behalf of unbound-users at lists.nlnetlabs.nl> wrote:
>>> 
>>>   On 7/11/20 11:49 AM, Andrew Forgue via Unbound-users wrote:
>>>> I have an unbound server that acts as a recursive resolver for clients and also acts as a target for fully delegated DNS (i.e. unbound is the NS record). For the fully-delegated domain it is a simple stub zone with an upstream of localhost on a different port.  Let's call it "blah.example.com".
>>>> 
>>>> Occasionally, unbound (this has happened on versions 1.10.1 and 1.7.3) will start responding to non-recursive queries with the list of root servers instead of a response from the stub-zone.  Clients that set the `rd` flag seem to be fine and continue to be able to resolve records in the stub-zone; only recursion-desired queries receive correct records from unbound (via the stub server).  All records in seemingly all stub zones show this behavior simultaneously.
>>>> 
>>>> I don't know what triggers it, but a full restart of unbound is the only thing that fixes it.  I've tried flushing cache, flushing infra, and everything, nothing seems to matter. I've seen only 2 things that may point to the issue.
>>>> 
>>>> - With verbosity turned up to 10, there's an entry produced in strace (but not in the actual log - maybe a misconfig): "unbound[2213085:5] debug: answer from the cache failed"
>>>> 
>>>> - Stracing the "broken" unbound process shows a very tight loop of recvmsg() (of the request) and sendmsg() (with the root servers), with no syscalls in between.
>>>> 
>>>> Again, using dig with +recurse works all the time, even when unbound gets in this state.  So this seems like an unbound bug / cache corruption or something?
>>> 
>>>   If it is a bug, you may want to try a work around while waiting for a 
>>>   fix. You could try "auth-zone:" instead of "stub-zone:" or as a 
>>>   companion to "stub-zone:" You may need to give the authoritative server 
>>>   permission for a wholesale zone transfer to the Unbound instance. This 
>>>   may help avoid some undiscovered bug in piecemeal zone recursion.
>>>   - Eric
