RFC 8767 recheck timer

Wed Jul 22 16:38:56 UTC 2020

Hi Andreas,

On 21/07/2020 18:49, Andreas Schwarz wrote:
> Hi George,
> 
> thank you for your quick reply. Could you please elaborate on these failure recheck measures in unbound?
> 
> I dug a bit through the unbound code, the iterator specifically, but my C is not as good as to pretend that I understood very much.
The most common measures against failing queries concern communication
with the upstream nameservers. When timeouts occur, unbound records them
in the infra cache, which it also uses for its server selection logic.
You can read more about that at
https://www.nlnetlabs.nl/documentation/unbound/info-timeout/

Unbound monitors the upstream servers' responsiveness, but it doesn't
monitor the "health" of individual queries. It caches negative answers
such as NODATA and NXDOMAIN, but not other errors such as REFUSED.

> 
> I played a bit around with a zone of mine, added it to the authoritative servers, removed it and observed unbound's behavior. I could not see anything, that would indicate failure recheck measures (at least not for REFUSED codes) in a way that I would interpret the RFC.
> 
> The amount of requests I performed against unbound was pretty much identical to the amount of outgoing requests (times 6 for 3 authoritative Servers with both, IPv4 and IPv6).
> 
> From the description in the RFC I would have expected unbound to stop querying the authoritative servers for some time and only serve the stale data. At least with serve-expired-client-timeout set to 0. With a non-zero value for this option, the behavior to always query totally makes sense.
REFUSED is a particular case, as also mentioned in the last paragraph of
section 6 of the RFC (https://tools.ietf.org/html/rfc8767#section-6).
For REFUSED, unbound does not do anything special; it is simply a
non-usable reply.
Unbound then tries to contact the next available nameserver. If all
nameservers return REFUSED (or no nameserver can give an answer), unbound
will return SERVFAIL to the client.

When the RFC behavior is enabled (serve-expired-client-timeout > 0), the
cache is going to be consulted for possible stale records instead of
returning SERVFAIL.
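
To make the REFUSED handling above concrete, here is a minimal Python
sketch of the decision flow. It is only a model of the description in
this mail, not unbound's iterator code: the names resolve, ask,
stale_answer and rfc_mode are hypothetical, and the sketch ignores the
case where stale data is served because the lookup is merely slow.

```python
REFUSED = "REFUSED"
SERVFAIL = "SERVFAIL"


def resolve(nameservers, ask, stale_answer=None, rfc_mode=False):
    """Model of the fallback order described above (hypothetical names).

    ask(ns) returns an answer, REFUSED, or None when ns gave no reply.
    rfc_mode stands in for serve-expired-client-timeout > 0.
    """
    for ns in nameservers:
        reply = ask(ns)
        # REFUSED and "no reply" are both treated as non-usable,
        # so we simply move on to the next nameserver.
        if reply is not None and reply != REFUSED:
            return reply
    # Every nameserver failed: consult stale data only in RFC mode.
    if rfc_mode and stale_answer is not None:
        return stale_answer
    return SERVFAIL
```

With every nameserver answering REFUSED, this model returns SERVFAIL
unless rfc_mode is set and a stale answer is available.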

I believe there may also be some confusion about unbound's serve-expired
behavior (note that I specifically used serve-expired instead of
serve-stale).

Unbound already had the serve-expired functionality before the first
draft of the RFC was written.

Unbound's initial serve-expired behavior was to always try to reply from
cache (even if a record is expired) and then try to fetch an updated
record. Combined with the prefetch behavior, which prefetches (updates) a
record when it is within a percentage of its TTL, it keeps the cache
"almost" up-to-date for popular queries.
The result is an increased cache-hit ratio.
The drawback is that you may serve long-stale data for records that have
a lower TTL than the query interval from the clients and are subject to
frequent changes (e.g., short-lived A/AAAA records in a dynamic hosted
environment).
The serve-expired-ttl option can help with that by limiting how long
after expiration a record may still be served.
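
As a rough illustration of how serve-expired-ttl bounds the staleness, the
Python sketch below models the check. It is a simplification under stated
assumptions, not unbound's implementation: the function name
may_serve_expired and its parameters are hypothetical, times are plain
seconds, and the exact handling of the boundary second is a guess. The
only facts taken from the unbound.conf manual are that the limit counts
seconds after expiration and that 0 disables the limit.

```python
def may_serve_expired(now, expires_at, serve_expired_ttl):
    """Return True if a cached record may still be answered from cache.

    now, expires_at: timestamps in seconds (hypothetical parameters).
    serve_expired_ttl: seconds after expiration during which the
    record may still be served; 0 means no limit.
    """
    if now < expires_at:
        return True  # record is still fresh
    if serve_expired_ttl == 0:
        return True  # limit disabled: expired data is always servable
    # Expired: servable only while inside the configured window.
    return now - expires_at <= serve_expired_ttl
```

For a record that expired at second 200, a limit of 3600 allows it to be
served at second 300 but not at second 5000.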

That initial serve-expired behavior is still desired by operators and
is still available (and the default) in unbound.

The new RFC behavior, which treats stale records as a last resort for
failing or slow queries, is also available and is enabled with
serve-expired-client-timeout > 0.

I hope that helps,
-- George

> 
> Thank you in advance.
> 
> Cheers
> Andreas
>  
> On Tuesday, July 21, 2020 14:56 CEST, George Thessalonikefs via Unbound-users <unbound-users at lists.nlnetlabs.nl> wrote: 
>  
>> Hi Andreas,
>>
>> This timer is not specifically created for the serve-stale functionality
>> because as is mentioned in the following paragraph of the RFC:
>>> Most recursive resolvers already have the query resolution timer and,
>>> effectively, some kind of failure recheck timer.
>>
>> This is also true in unbound, where failure recheck measures were
>> already in place, though not configurable (at least in the sense of a
>> timer).
>>
>> Best regards,
>> -- George
>>
>> On 21/07/2020 14:30, Andreas Schwarz via Unbound-users wrote:
>>> Hi,
>>>
>>> I am currently testing the RFC 8767 related options in unbound to serve stale records.
>>>
>>> RFC 8767 mentions in section 5
>>>
>>>  >  *  A failure recheck timer, which limits the frequency at which a
>>>  >    failed lookup will be attempted again.
>>>
>>> I could not find any option related to this functionality in the unbound manual. All the other portions of the RFC seem to be covered, but not this one. Did I miss something or is this not implemented?
>>>
>>> Cheers
>>> Andreas
>>>
>