serve-expired: "yes" and cache-min-ttl: 30 catastrophically unsafe?
nicku at nicku.org
Thu Nov 15 09:45:53 UTC 2018
Dear Marc and anyone else interested in why severe outages can be
caused by serve-expired: "yes" and cache-min-ttl: 30:
On 13/11/18 10:56 -0500, Marc Branchaud via Unbound-users wrote:
>On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
>> On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
>>> On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
>>>> On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
>>>>> I am puzzled by the behaviour of our multi-level DNS system which
>>>>> answered many queries for names having shorter TTLs with SERVFAIL.
>>>> I mean that SERVFAILs went up to 50% of replies, and current names
>>>> with TTLs of around 300 failed to be fetched by the resolver, the last
>>>> DNS servers in the chain. What I mean is that adding these two
>>>> configuration options (serve-expired: "yes" and cache-min-ttl: 30)
>>>> caused an outage. I am trying to understand why.
>>>> Any ideas in understanding the mechanism would be very welcome.
>>> We use 1.6.8 with both those settings, and observed prolonged SERVFAIL
>>> In our case, the upstream server became inaccessible for a period of
>>> time, but when contact resumed the SERVFAILs persisted.
>> This behaviour was quite catastrophic, and to me, unexpected.
And career affecting.
>> Do you have any idea of the mechanism behind this failure?
>> Is there a way to deal better with zero TTL names?
>>> We reduced the infra-host-ttl value to compensate.
>(Sorry for my slow response -- this slipped through the cracks.)
>> Did that bring your system to a functioning condition?
>Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we
>are only affected by this for (up to) 30 seconds after upstream access
>returns. That is adequate for our purposes.
>So I think the mechanism is pretty clear, and I think it's good for
>unbound to cache the upstream server's status for a period of time. I'm
>just not convinced that 900 seconds is a reasonable default time.
>(BTW, our case has nothing to do with zero TTL names: The IP address
>configured as the zone's forward-addr became inaccessible. No names
>involved. That said, I do not know how unbound deals with 0-TTL names.)
>I do not think our case is a bug. It also has nothing to do with
>serve-expired or cache-min-ttl. But since we use those settings, I
>wanted to relate our experience with a confusing SERVFAIL situation.
How busy are your systems?
>In your multi-level system, are you 100% sure that all the
>forward-addr IPs are *always* accessible? If they are, then you may
>be seeing SERVFAILs for a different reason.
Absolutely; they are all in our local network. And when I removed
those two configuration values, everything came back to normal good
behaviour almost immediately. Perhaps a distinguishing factor is that
some of these systems handle up to about 50,000 mixed queries per
second.
The result was so unexpected and so severe that I categorise our
situation as the result of a very serious bug. Tomorrow there are
repercussions for me personally.
Pinning down the bug is complicated by the fact that, before this
happened, I had chosen to change jobs within the same company, so I no
longer have access to these systems to test the effects of those
configuration values. I don't know if it was one, the other, or a
combination of both that caused the problem. Perhaps no one but me
wants to find out.
>>> (Why is infra-host-ttl's default 900 seconds? That seems like a long
>>> time to wait to retry the upstream server.)
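For reference, the workaround Marc describes would look roughly like
this in unbound.conf. This is only a sketch: the value 30 is the one he
reports using, and 900 is the default he questions.

```
server:
    # Lifetime, in seconds, of cached per-host infrastructure data
    # (round-trip timing, reachability) for upstream servers.
    # Default is 900; Marc reports lowering it to 30.
    infra-host-ttl: 30
```

This only bounds how long a stale view of an upstream server can
persist after it becomes reachable again. It is not a confirmed fix for
the outage described in this thread.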
>>>>> By multilevel, I mean clients talk to one server, which forwards to
>>>>> another, and for some clients, there is a third level of caching.
>>>>> So it was unwise to add:
>>>>> serve-expired: "yes"
>>>>> cache-min-ttl: 30
>>>>> to the server section of these DNS servers running unbound 1.6.8 on
>>>>> up to date RHEL 7?
Hint: the answer is an unreserved "YES!".
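For anyone wanting to reproduce this, one forwarding level of the kind
described above might be sketched as follows. The layout is an
assumption, not the actual production configuration, and the
forward-addr value is a placeholder documentation address rather than
one of the real servers.

```
server:
    # The two options added before the outage.
    # Their defaults are serve-expired: no and cache-min-ttl: 0.
    serve-expired: "yes"
    cache-min-ttl: 30

forward-zone:
    name: "."
    # Hypothetical address of the next server in the chain.
    forward-addr: 192.0.2.53
```

Deleting the two server options returns both to their defaults, which
is the change that restored normal behaviour here.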
>>>>> Please could anyone cast some light on why this
>>>>> was so? I will be spending some time examining the cause.
>>>>> If you need more information, please let me know.
Nick Urbanik http://nicku.org nicku at nicku.org
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24