[nsd-users] Timeout for TCP queries to NSD

Thu May 14 13:33:09 UTC 2020

Hi Anand,

On 14/05/2020 14:43, Anand Buddhdev via nsd-users wrote:
> On 14/05/2020 13:29, Wouter Wijngaards via nsd-users wrote:
> 
> Hi Wouter,
> 
>> Yes this applies to incoming queries and to outgoing queries.  120
>> seconds by default.
> 
> Thanks for the clarification. I think the default of 120s should be
> documented in the man page.

Yep, done that!  I could also adjust the default itself, but for now I
have documented it in the man page.

> 
> I'm still not clear on what the timeout applies to though. Is it to the
> time between individual DNS messages in a TCP connection? Or does it
> apply to any period of inactivity in the connection?

Between DNS messages.  But for AXFR the timeout is reset for every
individual DNS response message (fragment) of the AXFR response stream.
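
To illustrate, here is a small Python sketch of that rule.  It is only a
model, not the NSD code; the function name and the list of timestamps
are made up for the example.  The timer is restarted at every DNS
message, so a long AXFR is fine as long as no single gap exceeds the
timeout.

```python
from typing import Iterable


def connection_times_out(message_times: Iterable[float], timeout: float) -> bool:
    """Return True if any gap between consecutive DNS messages exceeds timeout.

    message_times: timestamps in seconds, starting with the time the
    connection was opened, followed by the time of each DNS message
    (a query, or one response message of an AXFR stream).
    Hypothetical model: the timer restarts at every message.
    """
    previous = None
    for t in message_times:
        if previous is not None and t - previous > timeout:
            return True
        previous = t
    return False


# An AXFR that runs for 300 seconds in total, with a message every 100s,
# stays within a 120s timeout; a single 130s gap does not.
slow_but_steady = connection_times_out([0, 100, 200, 300], 120)
one_long_gap = connection_times_out([0, 130], 120)
```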

> 
>> A much smaller value, of 200 msec, is used when the server is nearly
>> full on capacity, for incoming connections that are over the limit.
>> Also when the server has updated the existing connections get a smaller
>> 100 msec timeout to wait for them to complete their tcp query to NSD.
>>
>> That last feature since 4.2.1.  The tcp full shorter timeout is since
>> 4.1.11.
> 
> Now that you've explained it here, I recall that there was something
> about this in the release notes. However, the value of 200ms isn't
> documented. The release notes have:
> 
> "When tcp is more than half full, use short timeout for tcp session." So
> I'm guessing that "short timeout" here is 200ms. Also, it's not clear
> whether the timeout is dynamic. What I mean is: is it applied to all
> sessions (existing and new), or only to new ones? When the number of tcp

Only to the new ones that are above the limit.

> connections drops to less than half, is the timeout reset to 120s? And
> is it reset for all sessions, or just new ones?

The 120s timeout is then applied to new connections.
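
Put as a rough Python sketch (an illustration only, not the NSD code):
the timeout is picked once, when a connection is accepted, and existing
connections keep what they were given.  The function name and the
`limit` parameter are invented for the example, and I am not spelling
out here exactly where the limit sits.

```python
DEFAULT_TIMEOUT = 120.0     # seconds, the documented default
OVER_LIMIT_TIMEOUT = 0.2    # 200 msec, for new connections above the limit


def timeout_for_new_connection(open_connections: int, limit: int) -> float:
    """Pick the timeout for a newly accepted TCP connection.

    open_connections: connections already open when this one arrives.
    limit: hypothetical threshold above which the short timeout is used.
    """
    if open_connections > limit:
        return OVER_LIMIT_TIMEOUT
    return DEFAULT_TIMEOUT


# Timeouts are assigned at accept time and not revisited afterwards.
quiet = timeout_for_new_connection(open_connections=10, limit=50)
busy = timeout_for_new_connection(open_connections=60, limit=50)
quiet_again = timeout_for_new_connection(open_connections=20, limit=50)
```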

> 
> Dropping from the default 120s, to a mere 200ms when the number of TCP
> connections goes up, is quite dramatic. And I happen to think that 200ms
> is too low. A client that's getting an AXFR from such an NSD server is

Not really too low for a busy server that is responding to TCP queries.
It may be too low for you, but only in this slow AXFR situation.  I
doubt you are under that sort of space limitation, though; when a
server is, the short timeout is meant exactly to drop slow, sluggish
responders to make space for other, active users.

> quite likely to suffer disconnects. In fact, I have been observing
> exactly this behaviour on the servers we run. We have a use case where a
> user is doing AXFR of some largish zones, and when the client is a bit
> slow, NSD drops the connection. This causes the client to retry. This,
> IMHO, is rather wasteful.

You should increase the number of TCP buffers available, with tcp-count:
2000 in nsd.conf.  That should leave room for more simultaneous TCP
queries.
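
For example, something like this in the server: section of nsd.conf.
The tcp-timeout line only spells out the default and can be left out;
the value 2000 is a suggestion, tune it to your own load.

```
server:
    # Maximum number of concurrent TCP connections per server process.
    tcp-count: 2000
    # TCP timeout in seconds (120 is the default).
    tcp-timeout: 120
```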

I could also make that 200 msec configurable, but I doubt it would be
good to increase it too much; the main point of it is to force TCP
sessions to close if they are lagging the server, and for that it has
to be short.  Values other than 200 msec could also work, though.  I
doubt that would help your AXFR server, because I guess the transfer
might be lagging by a whole lot.  And if it is hit by a timer at all, I
think it is more likely the 100 msec timer after a reload, which is a
much more arbitrary choice.

> 
> The other feature of shortening the timeout to 100ms is also not so
> obvious. The release notes have:
> 
> "Fix #14, tcp connections have 1/10 to be active and have to work
> every second, and then they get time to complete during a reload,
> this is a process that lingers with the old version during a version
> update."
> 
> The 1/10 there is not very readable. I think that 100ms would be much
> clearer. And I also don't understand what you mean by "and have to work
> every second". Could you please explain that?

That was poorly worded; I meant that they have to work frequently and
consistently.  The value is up for discussion as well: if you think it
is too low, I could increase it to, e.g., 30 seconds, like BIND has for
other TCP timers.  But that would keep more old TCP connections around
after reloads, so I opted for a more defensive low value that keeps the
server from tying up resources on old TCP connections, but still allows
(fast-responding) servers to complete their queries.  What sort of
timeout are you thinking would help that sluggish python process?

> 
> In my opinion, such details should not be buried in the release notes
> document. The release notes are useful when comparing one version to
> another. All these features of how the server dynamically adjusts its
> behaviour should be in the operations manual or at least the nsd.conf
> man page.
> 
> Imagine a new user of NSD, who is trying to configure and tune the
> server, and sets "tcp-timeout" to some value, and still observes
> different behaviour when running the server. This leads to confusion.
> And it's not reasonable to expect the user to read the entire set of
> release notes trying to find such undocumented features.

That is true.  It is not configurable today, though.  I am not sure if
it should be; perhaps it can have a (new and different) sensible default.

Best regards, Wouter

> 
> Regards,
> Anand Buddhdev
> RIPE NCC