serve-expired: "yes" and cache-min-ttl: 30 unsafe?
Marc Branchaud
marcnarc at xiplink.com
Tue Nov 13 15:56:33 UTC 2018
On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
> Dear Marc,
>
> Thank you for your reply.
>
> On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
>> On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
>>> On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
>>>> I am puzzled by the behaviour of our multi-level DNS system which
>>>> answered many queries for names having shorter TTLs with SERVFAIL.
>>>
>>> I mean that SERVFAILs went up to 50% of replies, and current names
>>> with TTLs of around 300 failed to be fetched by the resolver, the last
>>> DNS servers in the chain. What I mean is that adding these two
>>> configuration options (serve-expired: "yes" and cache-min-ttl: 30)
>>> caused an outage. I am trying to understand why.
>>>
>>> Any ideas in understanding the mechanism would be very welcome.
>>
>> We use 1.6.8 with both those settings, and observed prolonged SERVFAIL
>> periods.
>>
>> In our case, the upstream server became inaccessible for a period of
>> time, but when contact resumed the SERVFAILs persisted.
>
> This behaviour was quite catastrophic, and to me, unexpected.
>
> Do you have any idea of the mechanism behind this failure?
>
> Is there a way to deal better with zero TTL names?
>
>> We reduced the infra-host-ttl value to compensate.

(Sorry for my slow response -- this slipped through the cracks.)

> Did that bring your system to a functioning condition?

Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we
are only affected by this for (up to) 30 seconds after upstream access
returns. That is adequate for our purposes.

So I think the mechanism is pretty clear: unbound caches the upstream
server's (unreachable) status for infra-host-ttl seconds, so the
SERVFAILs continue until that cached entry expires. I think it's good
for unbound to cache the upstream server's status for a period of time.
I'm just not convinced that 900 seconds is a reasonable default time.
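
For illustration, a minimal unbound.conf sketch of that workaround might
look like the following. Only the infra-host-ttl value of 30 and the two
options from the subject line come from this thread; the forward-zone
name "." and the address 192.0.2.1 (a documentation-range placeholder)
are assumptions, not anyone's real setup.

```
# Sketch only: 192.0.2.1 is a placeholder forwarder address.
server:
    serve-expired: yes
    cache-min-ttl: 30
    # Lowered from the 900-second default so that an unreachable
    # forwarder's cached status expires sooner.
    infra-host-ttl: 30

forward-zone:
    # Assumed: forward everything to a single upstream.
    name: "."
    forward-addr: 192.0.2.1
```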

(BTW, our case has nothing to do with zero TTL names: The IP address
configured as the zone's forward-addr became inaccessible. No names
involved. That said, I do not know how unbound deals with 0-TTL names.)

I do not think our case is a bug. It also has nothing to do with
serve-expired or cache-min-ttl. But since we use those settings, I
wanted to relate our experience with a confusing SERVFAIL situation.

In your multi-level system, are you 100% sure that all the forward-addr
IPs are *always* accessible? If they are, then you may be seeing
SERVFAILs for a different reason.

M.

>> (Why is infra-host-ttl's default 900 seconds? That seems like a long
>> time to wait to retry the upstream server.)
>>
>> M.
>>>> By multilevel, I mean clients talk to one server, which forwards to
>>>> another, and for some clients, there is a third level of caching.
>>>>
>>>> So it was unwise to add:
>>>> serve-expired: "yes"
>>>> cache-min-ttl: 30
>>>>
>>>> to the server section of these DNS servers running unbound 1.6.8 on
>>>> up to date RHEL 7? Please could anyone cast some light on why this
>>>> was so? I will be spending some time examining the cause.
>>>>
>>>> If you need more information, please let me know.