[nsd-users] Timeout for TCP queries to NSD
anandb at ripe.net
Thu May 14 14:49:00 UTC 2020
On 14/05/2020 15:33, Wouter Wijngaards via nsd-users wrote:
>> Thanks for the clarification. I think the default of 120s should be
>> documented in the man page.
> Yep, done that! Could also adjust the default itself, but I documented
> it now in the man page.
>> I'm still not clear on what the timeout applies to though. Is it to the
>> time between individual DNS messages in a TCP connection? Or does it
>> apply to any period of inactivity in the connection?
> Between DNS messages. But for AXFR the timeout is reset for every
> individual DNS response message (fragment) of the AXFR response stream.
Okay, got it. So if a TCP client sends 1 byte per TCP packet, every 3
seconds, and delivers a query of 30 bytes, it will take 90 seconds for
NSD to receive one query to respond to? And then this same client can
keep the connection open, wait for 115s, and then resume sending a
query, 1 byte at a time, to keep this TCP connection occupied for ages?
This is the classic Slow Loris attack, isn't it?
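A back-of-the-envelope sketch of this scenario (assuming, per the above, that the idle timer is reset whenever data arrives; a hypothetical helper, not NSD's actual code):

```python
# Sketch of the slow-writer ("Slow Loris") scenario described above.
# Assumption from this thread: NSD resets its idle timer whenever data
# arrives, so only the gap BETWEEN bytes is compared to the timeout.

def delivery_time(query_bytes, secs_between_bytes, idle_timeout):
    """Seconds to deliver a query one byte at a time, or None if any
    inter-byte gap reaches the idle timeout and the client is dropped."""
    if secs_between_bytes >= idle_timeout:
        return None
    return query_bytes * secs_between_bytes

# 30-byte query, one byte every 3 s, default 120 s timeout:
print(delivery_time(30, 3, 120))    # 90 -> NSD waits 90 s for one query
# The same client can even pause 115 s per byte and still stay connected:
print(delivery_time(30, 115, 120))  # 3450 -> nearly an hour per query
```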
> Only to the new ones, that are above the limit.
Okay, so the default "tcp-count" is 100. Does this mean that the first
50 TCP clients get a generous 120s timeout, and continue to enjoy that,
as long as they keep the TCP connection open and keep sending some data
as described above? Aren't these the "sluggish clients" that should be
penalised somehow and be ejected?
And then the next 50 clients, because they didn't get in there first,
are subject to a harsher 200ms timeout?
This seems like whoever gets in first gets the comfortable seats, and
the late-comers have to sit on hard chairs.
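As I understand it, the policy amounts to something like the following (a sketch of my reading of the thread, not NSD's actual code; the names are made up):

```python
TCP_COUNT = 100      # default "tcp-count" in nsd.conf
TCP_TIMEOUT = 120.0  # default "tcp-timeout", in seconds
BUSY_TIMEOUT = 0.2   # the hard-coded 200 ms timeout under load

def timeout_for_new_connection(open_connections):
    """Timeout assigned when a connection is accepted. Existing
    connections keep whatever timeout they were given, which is why
    early slow clients are never ejected by the busy timeout."""
    if open_connections >= TCP_COUNT // 2:
        return BUSY_TIMEOUT  # late-comers get the hard chairs
    return TCP_TIMEOUT       # early clients keep the comfortable seats

print(timeout_for_new_connection(10))  # 120.0
print(timeout_for_new_connection(60))  # 0.2
```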
>> Dropping from the default 120s, to a mere 200ms when the number of TCP
>> connections goes up, is quite dramatic. And I happen to think that 200ms
>> is too low. A client that's getting an AXFR from such an NSD server is
> Not really for a busy server that is responding to TCP queries. But it
> may be for you, but only then in this slow AXFR situation. But I doubt
> you are in that sort of space limitation, in which case it is exactly
> meant to drop slow, sluggish responders to make space for other, active,
> clients.
I don't understand this... if some slow, sluggish TCP clients get in
first, and occupy the first half of available TCP slots, then how do
they ever get ejected? And while they do this, newer legitimate TCP
clients are subject to a low timeout.
> You should increase the amount of tcp buffers available, with tcp-count:
> 2000 in nsd.conf. That should give a better amount of TCP queries.
Yes, sure, I can adjust the "tcp-count" setting. But I'm not so fond of
this dynamic adjustment of the timeout to 200ms when half the TCP slots
are filled. A lower "tcp-timeout" that applies equally to all clients
would be preferable.
> Could also make that 200 msec configurable, but I doubt it would be good
> to increase that, too much, the main point of it is to force TCP
> sessions to close if they are lagging the server. For that it has to be
I agree that slow TCP clients should be dropped quickly. This dynamic
lowering of the timeout should be removed. At this time, it's neither
configurable nor documented, so removing it would not affect anyone's
configuration.
If you insist on keeping this dynamic adjustment, then the option should
be given some kind of descriptive name, such as "tcp-timeout-when-busy",
and its documentation should explain exactly when it triggers and when
it is lifted.
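If the behaviour stays, the option I am proposing might look like this ("tcp-timeout-when-busy" is my suggested name, not an existing NSD setting):

```
server:
    tcp-count: 100
    tcp-timeout: 120
    # hypothetical option: timeout given to NEW connections once half
    # of tcp-count is in use; reverts when connections drop below that
    tcp-timeout-when-busy: 5
```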
> short. But other values than 200 msec could also exist, though. I
> doubt that that would help your AXFR server because I guess that might
> be lagging by a whole lot, even if it is hit by this timer, I think it
> is likely the 100 msec timer after a reload, which is much more
> arbitrary in choice.
That's also possible. Our NSD servers have many slave zones, and they
see frequent updates, so the server process reloads zones frequently.
I read issue #14, and I understand the reasoning for your fix to have a
lingering process that services an ongoing AXFR. But why not just use
the regular timeout for it?
> That is a misnomer, I meant work frequently and consistently. The value
> is up for grabs as well, if you think it is too low, I could increase it
> to, eg. 30 seconds, like BIND has for other TCP timers. But it would
> start to keep more old TCP connections around after reloads, so I opted
> for a more defensive low value that keeps the server from having
> resources for old TCP connections, but also allows (fast-responding-)
> servers to get service completion. What sort of timeout are you
> thinking would help that sluggish python process?
Our Knot DNS servers were also using a rather harsh 500ms timeout on TCP
I/O, and this was affecting the slower AXFR client. I changed that to
5s, and the AXFR client is now happy. I don't know if there is a magical
value that is suitable for all, but again, if this parameter were at
least configurable, then it would serve 2 purposes:
1. inform the operator that NSD is trying to do clever things; and
2. allow the operator to keep NSD from being too clever when this
cleverness causes operational problems :)
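For comparison, the Knot DNS change mentioned above would look roughly like this in knot.conf (assuming Knot's `tcp-io-timeout`, which takes milliseconds; check your Knot version's documentation for the exact option):

```
# knot.conf -- excerpt
server:
    # raise the TCP I/O timeout from the 500 ms default to 5 s so the
    # slower AXFR client is not disconnected mid-transfer
    tcp-io-timeout: 5000
```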
> That is true. It is not configurable today, though. Not sure if it
> should be, perhaps it can have a (new and different) sensible default.
See my explanations above. Hope they help you in thinking about a way
forward to improve the TCP tuning parameters.