serve-expired: "yes" and cache-min-ttl: 30 catastrophically unsafe?

Marc Branchaud marcnarc at xiplink.com
Thu Nov 15 17:36:31 UTC 2018


On 2018-11-15 4:45 a.m., Nick Urbanik wrote:
> Dear Marc and anyone else interested in why severe outages can be
> caused by serve-expired: "yes" and cache-min-ttl: 30:
> 
> On 13/11/18 10:56 -0500, Marc Branchaud via Unbound-users wrote:
>> On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
>>> On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
>>>> On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
>>>>> On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
>>>>>> I am puzzled by the behaviour of our multi-level DNS system, which
>>>>>> answered many queries for names with short TTLs with SERVFAIL.
>>>>>
>>>>> I mean that SERVFAILs rose to 50% of replies, and current names
>>>>> with TTLs of around 300 seconds could not be fetched by the
>>>>> resolvers at the end of the chain.  In short, adding these two
>>>>> configuration options (serve-expired: "yes" and cache-min-ttl: 30)
>>>>> caused an outage.  I am trying to understand why.
>>>>>
>>>>> Any ideas in understanding the mechanism would be very welcome.
>>>>
>>>> We use 1.6.8 with both those settings, and observed prolonged 
>>>> SERVFAIL periods.
>>>>
>>>> In our case, the upstream server became inaccessible for a period of 
>>>> time, but when contact resumed the SERVFAILs persisted.
>>>
>>> This behaviour was quite catastrophic, and to me, unexpected.
> 
> And career-affecting.

Ugh, that's terrible.  I am sorry you're suffering such serious 
consequences.

>>> Do you have any idea of the mechanism behind this failure?
>>>
>>> Is there a way to deal better with zero-TTL names?
>>>
>>>> We reduced the infra-host-ttl value to compensate.
>>
>> (Sorry for my slow response -- this slipped through the cracks.)
>>
>>> Did that bring your system to a functioning condition?
>>
>> Yes & no.  We reduced infra-host-ttl to 30 seconds, which means that 
>> we are only affected by this for (up to) 30 seconds after upstream 
>> access returns.  That is adequate for our purposes.
>>
>> So I think the mechanism is pretty clear, and I think it's good for 
>> unbound to cache the upstream server's status for a period of time.  
>> I'm just not convinced that 900 seconds is a reasonable default time.
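
For anyone who finds this thread later: that workaround is a single
server-section option in unbound.conf.  The 30-second value is just
what worked for us, not a general recommendation:

  server:
      # TTL for entries in the infrastructure cache, which is where
      # unbound remembers an upstream host's status (RTT, timeouts,
      # reachability).  The default is 900 seconds.
      infra-host-ttl: 30
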
>>
>> (BTW, our case has nothing to do with zero-TTL names:  The IP address 
>> configured as the zone's forward-addr became inaccessible.  No names 
>> involved.  That said, I do not know how unbound deals with 0-TTL names.)
>>
>> I do not think our case is a bug.  It also has nothing to do with 
>> serve-expired or cache-min-ttl.  But since we use those settings, I 
>> wanted to relate our experience with a confusing SERVFAIL situation.
> 
> How busy are your systems?

Moderately busy, but nowhere near your level.  1-2 thousand qps, max.

In our case, the upstream servers (indeed, the whole Internet) are on the 
other side of a slow (~600ms RTT) and occasionally unreliable link.

>> In your multi-level system, are you 100% sure that all the
>> forward-addr IPs are *always* accessible?  If they are, then you may
>> be seeing SERVFAILs for a different reason.
> 
> Absolutely; they are all in our local network.  And when I removed
> those two configuration values, everything came back to normal good
> behaviour almost immediately.  Perhaps a distinguishing factor is that
> some of these systems handle up to 50,000 mixed
> queries per second.

It appears to me that your problem is different from what we saw.  Our 
setup is pretty modest.  Maybe you're using some feature that we're not, 
and the problem is coming from there.  Here's some more info about our 
setup, with a rough config sketch after the list:

  * Unbound version 1.6.8
  * IPv4 only
  * No DNSSEC
  * No TCP that I'm aware of (very little, if any)
  * Moderate query rate
  * Modest cache size (16MB rrset cache)
  * Forward-everything (zone name is ".")
  * Only 1 or 2 forward-addrs
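
In unbound.conf terms, that boils down to something like the sketch
below.  The forward address is a placeholder (192.0.2.1 is a
documentation address), and DNSSEC is absent simply because we
configure no trust anchor:

  server:
      do-ip6: no                # IPv4 only
      rrset-cache-size: 16m     # modest cache
      serve-expired: yes
      cache-min-ttl: 30
      infra-host-ttl: 30        # reduced from the 900-second default

  forward-zone:
      name: "."                 # forward everything
      forward-addr: 192.0.2.1   # placeholder for our upstream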

> The result was so unexpected and surprisingly severe that I categorise
> our situation as the result of a very serious bug.  Tomorrow there are
> repercussions for me personally.

Certainly severe.  However, we never saw any problem like this when we 
enabled serve-expired and cache-min-ttl.

> Defining the bug is complicated by the fact that before this
> happened, I had chosen to change jobs within the same company, so no
> longer have access to these systems to test the effects of those
> configuration values.  I don't know if it was one, the other, or a
> combination of both that caused the problem.  Perhaps no one but me
> wants to find out.

I wish I could help you more.  All I can say with any certainty is that 
your problem does not seem to be common.  Not much of a comfort, I know.

I hope your repercussions aren't too severe!

		M.


>>>> (Why is infra-host-ttl's default 900 seconds?  That seems like a 
>>>> long time to wait to retry the upstream server.)
>>>>
>>>>         M.
>>>>>> By multilevel, I mean clients talk to one server, which forwards to
>>>>>> another, and for some clients, there is a third level of caching.
>>>>>>
>>>>>> So it was unwise to add:
>>>>>> serve-expired: "yes"
>>>>>> cache-min-ttl: 30
>>>>>>
>>>>>> to the server section of these DNS servers running unbound 1.6.8 on
>>>>>> up-to-date RHEL 7?
> 
> Hint: the answer is an unreserved "YES!".
> 
>>>>>> Please could anyone cast some light on why this
>>>>>> was so?  I will be spending some time examining the cause.
>>>>>>
>>>>>> If you need more information, please let me know.


