Safeguarding forward zones from requestlist buildup

Fri Nov 22 19:43:30 UTC 2024

Hi Paul,

If you are "forwarding" to authoritative nameservers, then stub-zone is 
indeed the correct configuration, as it expects to send queries to 
authoritative nameservers and not to resolvers. That won't help with 
this issue though.
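
As a rough sketch only, reusing the zone name and address from your 
config below, the stub-zone equivalent would look something like this 
(I have not tested it against your setup):

    stub-zone:
        name: "int.domain.tld"
        # Address of the authoritative PowerDNS cluster
        stub-addr: "10.10.5.5"
        # stub-no-cache defaults to "no", so answers are cached
        # unless you set it to "yes" here.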

In a situation where Unbound is overwhelmed with client queries, using 
'forward-no-cache: yes' (or 'stub-no-cache: yes' if you use a 
stub-zone) does not help, since every one of those queries then needs 
to be resolved upstream.

When under client pressure, Unbound starts dropping slow queries.
Slow queries are ones that take longer than 'jostle-timeout'
(https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-jostle-timeout) 
to resolve.
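
For reference, the two settings involved would look like this in the 
server clause. The jostle-timeout value shown is the documented default 
and the request list size is the one from your config; neither is a 
tuning recommendation:

    server:
        # Queries that have been running longer than this many msec
        # count as slow and can be replaced by new ones when the
        # request list is full. 200 is the default.
        jostle-timeout: 200
        # Size of the request list per thread.
        num-queries-per-thread: 4096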

This way Unbound tries to combat DoS from slow queries or high query 
rates: the fast queries slowly fill up the cache, which eventually 
lowers the outgoing query rate and increases the share of answers 
served from cache. (Cache responses do not contribute to the growth of 
the request list.)

In your case where you don't cache the upstream information, Unbound 
cannot protect itself with cached answers because all the internal 
upstream queries need to be resolved.

I am guessing the queries to the configured upstreams are not slower 
than jostle-timeout, so not candidates to be dropped initially, but it 
doesn't help that each one of them needs to always be resolved.

I would first try to use 'stub-no-cache: no' (or 'forward-no-cache: no' 
if you stay with the forward-zone) and see if the situation gets better.
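
If you stay with the forward-zone for now, that would be something 
like the following (zone name and address copied from your config):

    forward-zone:
        name: "int.domain.tld"
        forward-addr: "10.10.5.5"
        # Let Unbound cache the internal answers again
        forward-no-cache: no

One way to compare before and after, assuming remote control is 
enabled on the nodes, is to watch the requestlist counters:

    unbound-control stats_noreset | grep requestlist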

It would be possible to introduce a new configuration option per 
forward/stub zone to give it some kind of priority, but I am unsure 
whether that would help in general or in this case in particular.

Best regards,
-- Yorgos

On 19/11/2024 03:01, Paul S. via Unbound-users wrote:
> Hey team,
> 
> We run 8 node unbound clusters as recursive resolvers. The setup 
> forwards (using forward-zone) internal queries to a separate PowerDNS 
> authoritative cluster.
> 
> Recently, we've had some connectivity issues to Cloudflare (who provides 
> a lot of external DNS services in our environment). When this has 
> happened, we've seen the requestlist balloon to around 1.5-2k entries as 
> queries repeatedly time out.
> 
> However, the problem is that this affects forward-zones as well. We lose 
> resolution for internal queries when these backup events happen.
> 
> We're looking for suggestions on how to safeguard these internal 
> forwards. We notice stub-zone may be the more appropriate stanza for our 
> use case, but are unsure if that'd bypass this requestlist queuing (?)
> 
> Any thoughts greatly welcome, thank you!
> 
> Our config is fairly simple:
> 
> server:
>      num-threads: 4
>      # Best performance is a "power of 2 close to the num-threads value"
>      msg-cache-slabs: 4
>      rrset-cache-slabs: 4
>      infra-cache-slabs: 4
>      key-cache-slabs: 4
> 
>      # Use 1.125GB of a 4GB node to start, but real usage may be 2.5x this so
>      # closer to 2.8G/4GB (~70%)
>      #
>      msg-cache-size: 384m
>      # Should be 2x the msg cache
>      rrset-cache-size: 768m
> 
>      # We have libevent! Use lots of ports.
>      outgoing-range: 8192
>      num-queries-per-thread: 4096
> 
>      # Use larger socket buffers for busy servers.
>      so-rcvbuf: 8m
>      so-sndbuf: 8m
> 
>      # Turn on port reuse
>      so-reuseport: yes
> 
>      # This is needed to forward queries for private PTR records to upstream DNS servers
>      unblock-lan-zones: yes
> 
>      forward-zone:
>          name: "int.domain.tld"
>          forward-addr: "10.10.5.5"
>          # No caching in unbound
>          forward-no-cache: "yes"
>