serve-expired: "yes" and cache-min-ttl: 30 catastrophically unsafe?

Thu Nov 15 09:45:53 UTC 2018

Dear Marc and anyone else interested in why severe outages can be
caused by serve-expired: "yes" and cache-min-ttl: 30:

On 13/11/18 10:56 -0500, Marc Branchaud via Unbound-users wrote:
>On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
>> On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
>>> On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
>>>> On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
>>>>> I am puzzled by the behaviour of our multi-level DNS system which
>>>>> answered many queries for names having shorter TTLs with SERVFAIL.
>>>>
>>>> I mean that SERVFAILs went up to 50% of replies, and current names
>>>> with TTLs of around 300 failed to be fetched by the resolver, the last
>>>> DNS servers in the chain.  What I mean is that adding these two
>>>> configuration options (serve-expired: "yes" and cache-min-ttl: 30)
>>>> caused an outage.  I am trying to understand why.
>>>>
>>>> Any ideas in understanding the mechanism would be very welcome.
>>>
>>> We use 1.6.8 with both those settings, and observed prolonged SERVFAIL 
>>> periods.
>>>
>>> In our case, the upstream server became inaccessible for a period of 
>>> time, but when contact resumed the SERVFAILs persisted.
>> 
>> This behaviour was quite catastrophic, and to me, unexpected.

And career affecting.

>> Do you have any idea of the mechanism behind this failure?
>> 
>> Is there a way to deal better with zero TTL names?
>> 
>>> We reduced the infra-host-ttl value to compensate.
>
>(Sorry for my slow response -- this slipped through the cracks.)
>
>> Did that bring your system to a functioning condition?
>
>Yes & no.  We reduced infra-host-ttl to 30 seconds, which means that we 
>are only affected by this for (up to) 30 seconds after upstream access 
>returns.  That is adequate for our purposes.
>
>So I think the mechanism is pretty clear, and I think it's good for 
>unbound to cache the upstream server's status for a period of time.  I'm 
>just not convinced that 900 seconds is a reasonable default time.
>
>(BTW, our case has nothing to do with zero TTL names:  The IP address 
>configured as the zone's forward-addr became inaccessible.  No names 
>involved.  That said, I do not know how unbound deals with 0-TTL names.)
>
>I do not think our case is a bug.  It also has nothing to do with 
>serve-expired or cache-min-ttl.  But since we use those settings, I 
>wanted to relate our experience with a confusing SERVFAIL situation.

How busy are your systems?

>In your multi-level system, are you 100% sure that all the
>forward-addr IPs are *always* accessible?  If they are, then you may
>be seeing SERVFAILs for a different reason.

Absolutely; they are all in our local network.  And when I removed
those two configuration values, everything returned to normal
behaviour almost immediately.  Perhaps a distinguishing factor is that
some of these systems handle up to around 50,000 mixed queries per
second.
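
In unbound.conf terms, the change that restored service amounts to
something like the sketch below.  It is an illustration only, not a
copy of the production file.  The infra-host-ttl line is Marc's
setting for his separate forward-addr problem, shown purely for
comparison.

```conf
server:
    # The two options this thread is about, now commented out.
    # With them removed, unbound falls back to its defaults
    # (serve-expired: no, cache-min-ttl: 0).
    # serve-expired: "yes"
    # cache-min-ttl: 30

    # Marc's workaround: cache upstream host status for 30 seconds
    # instead of the default 900.
    infra-host-ttl: 30
```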

The result was so unexpected and so severe that I categorise our
situation as the result of a very serious bug.  Tomorrow there are
repercussions for me personally.

Defining the bug is complicated by the fact that, before this
happened, I had chosen to change jobs within the same company, so I
no longer have access to these systems to test the effects of those
configuration values.  I don't know if it was one, the other, or a
combination of both that caused the problem.  Perhaps no one but me
wants to find out.

>>> (Why is infra-host-ttl's default 900 seconds?  That seems like a long 
>>> time to wait to retry the upstream server.)
>>>
>>>         M.
>>>>> By multilevel, I mean clients talk to one server, which forwards to
>>>>> another, and for some clients, there is a third level of caching.
>>>>>
>>>>> So it was unwise to add:
>>>>> serve-expired: "yes"
>>>>> cache-min-ttl: 30
>>>>>
>>>>> to the server section of these DNS servers running unbound 1.6.8 on
>>>>> up to date RHEL 7? 

Hint: the answer is an unreserved "YES!".

>>>>> Please could anyone cast some light on why this
>>>>> was so?  I will be spending some time examining the cause.
>>>>>
>>>>> If you need more information, please let me know.
-- 
Nick Urbanik             http://nicku.org           nicku at nicku.org
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24