serve-expired: "yes" and cache-min-ttl: 30 unsafe?

Tue Nov 13 15:56:33 UTC 2018

On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
> Dear Marc,
> 
> Thank you for your reply.
>
> On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
>> On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
>>> On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
>>>> I am puzzled by the behaviour of our multi-level DNS system which
>>>> answered many queries for names having shorter TTLs with SERVFAIL.
>>>
>>> I mean that SERVFAILs went up to 50% of replies, and current names
>>> with TTLs of around 300 failed to be fetched by the resolver, the last
>>> DNS servers in the chain.  What I mean is that adding these two
>>> configuration options (serve-expired: "yes" and cache-min-ttl: 30)
>>> caused an outage.  I am trying to understand why.
>>>
>>> Any ideas in understanding the mechanism would be very welcome.
>>
>> We use 1.6.8 with both those settings, and observed prolonged SERVFAIL 
>> periods.
>>
>> In our case, the upstream server became inaccessible for a period of 
>> time, but when contact resumed the SERVFAILs persisted.
> 
> This behaviour was quite catastrophic, and to me, unexpected.
> 
> Do you have any idea of the mechanism behind this failure?
> 
> Is there a way to deal better with zero TTL names?
> 
>> We reduced the infra-host-ttl value to compensate.

(Sorry for my slow response -- this slipped through the cracks.)

> Did that bring your system to a functioning condition?

Yes and no.  We reduced infra-host-ttl to 30 seconds, so the SERVFAILs 
now persist for at most 30 seconds after upstream access returns.  That 
is adequate for our purposes.
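
For reference, the change amounts to something like the following in 
unbound.conf.  This is only a sketch:  the 30-second value is what suits 
us, not a recommendation, and the rest of our server section is omitted.

```
server:
    # Time to live, in seconds, for entries in the infra host cache.
    # The default is 900.
    infra-host-ttl: 30
```

If you want to check whether a stale entry is involved, "unbound-control 
dump_infra" lists the current infra cache entries and "unbound-control 
flush_infra all" clears them.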

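So I think the mechanism is pretty clear:  unbound caches the upstream 
server's status (in our case, unreachable) for infra-host-ttl seconds, 
and keeps answering SERVFAIL until that entry expires.  I think it's 
good for unbound to cache that status for a period of time.  I'm just 
not convinced that 900 seconds is a reasonable default.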

(BTW, our case has nothing to do with zero TTL names:  The IP address 
configured as the zone's forward-addr became inaccessible.  No names 
involved.  That said, I do not know how unbound deals with 0-TTL names.)
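
To be concrete about what I mean by forward-addr, our setup looks 
roughly like the sketch below.  The zone name and the address (taken 
from the 192.0.2.0/24 documentation range) are placeholders, not our 
real values.

```
forward-zone:
    # Placeholder zone name, for illustration only.
    name: "example.com."
    # Forwarding to an IP address directly, so no name lookup is needed
    # to reach the upstream server.
    forward-addr: 192.0.2.53
```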

I do not think our case is a bug.  It also has nothing to do with 
serve-expired or cache-min-ttl.  But since we use those settings, I 
wanted to relate our experience with a confusing SERVFAIL situation.

In your multi-level system, are you 100% sure that all the forward-addr 
IPs are *always* accessible?  If they are, then you may be seeing 
SERVFAILs for a different reason.

		M.

>> (Why is infra-host-ttl's default 900 seconds?  That seems like a long 
>> time to wait to retry the upstream server.)
>>
>>         M.
>>>> By multilevel, I mean clients talk to one server, which forwards to
>>>> another, and for some clients, there is a third level of caching.
>>>>
>>>> So it was unwise to add:
>>>> serve-expired: "yes"
>>>> cache-min-ttl: 30
>>>>
>>>> to the server section of these DNS servers running unbound 1.6.8 on
>>>> up to date RHEL 7?  Please could anyone cast some light on why this
>>>> was so?  I will be spending some time examining the cause.
>>>>
>>>> If you need more information, please let me know.