[nsd-users] NSD4 goes unresponsive with lots of TCP connection!

Fri Apr 8 07:24:15 UTC 2016

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Kabindra,

It is odd that the processes are running but unresponsive. Can you
get a stack trace for them, e.g. with 'gcore', or directly with gdb
/usr/local/sbin/nsd <pid> and then bt, and send me the stack traces?
You can also take multiple stack traces of the same process by trying
again a short time later (to see what sort of loop they are in).
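
The repeated sampling described above could be scripted roughly as
follows. This is only a sketch: the helper names
(gdb_backtrace_command, collect_backtraces) and the sample count and
interval are made up for illustration, and it assumes gdb is installed
and that you are allowed to attach to the process.

```python
import subprocess
import time


def gdb_backtrace_command(pid, binary="/usr/local/sbin/nsd"):
    """Build a non-interactive gdb command that attaches and prints 'bt'."""
    # -batch makes gdb exit after running the -ex commands.
    return ["gdb", "-batch", "-ex", "bt", binary, str(pid)]


def collect_backtraces(pid, samples=3, interval=2.0, run=subprocess.run):
    """Take several backtraces of one process, a short time apart.

    'run' is injectable so the function can be exercised without gdb;
    by default it is subprocess.run.
    """
    traces = []
    for i in range(samples):
        result = run(
            gdb_backtrace_command(pid),
            capture_output=True,
            text=True,
            check=False,
        )
        traces.append(result.stdout)
        if i < samples - 1:
            time.sleep(interval)  # wait before sampling again
    return traces
```

Calling collect_backtraces(<pid>) for each stuck child and comparing
the traces should show whether the process sits in the same frames
every time.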

The backlog is the number of TCP connections waiting for accept().
The tcp-count is the number of TCP connections NSD services after it
has accept()ed them.  You can easily set tcp-count higher than the
backlog; some (older) systems have a backlog of only 16 or so, while
tcp-count is 1000 or more.  Perhaps a very high backlog is an issue
(some sort of regression in the network stack)?  But CentOS 6 and
FreeBSD 10 are very different, so it is more likely that this is caused
by NSD.  Those stack traces could prove useful (if they get very long,
send them to me off-list).
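
To illustrate the difference between the backlog and the number of
accepted connections, here is a small stand-alone sketch. It is not
NSD code; the function name hold_connections and the numbers are
invented for the example. It listens with a tiny backlog and still
holds more accepted connections than that backlog, because a
connection leaves the backlog queue as soon as accept() returns it.

```python
import socket


def hold_connections(backlog, wanted):
    """Listen with a small backlog, then accept and hold 'wanted' connections."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(backlog)         # backlog only limits not-yet-accepted connections
    addr = server.getsockname()

    clients, accepted = [], []
    try:
        for _ in range(wanted):
            # The client connects and waits in the backlog queue ...
            clients.append(socket.create_connection(addr, timeout=5))
            # ... until accept() takes it out of the queue again.
            conn, _peer = server.accept()
            accepted.append(conn)
        return len(accepted)
    finally:
        for sock in clients + accepted + [server]:
            sock.close()
```

With hold_connections(2, 10) the server ends up holding 10 open
connections although the backlog is only 2.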

Best regards, Wouter

On 08/04/16 08:08, Kabindra Shrestha wrote:
> Hi Wouter,
> 
> 
>> On Apr 6, 2016, at 2:49 PM, W.C.A. Wijngaards
>> <wouter at nlnetlabs.nl> wrote:
>> 
>> Hi Kabindra,
>> 
>> I have not heard of this before, how is TCP affecting NSD?
> After a couple thousand TCP queries, NSD goes unresponsive for
> both TCP and UDP.
> 
> [kabindra at 1 ~]$ dig @`hostname` -p 5350 ch txt hostname.bind
> 
> ; <<>> DiG 9.8.1 <<>> @<replaced> -p 5350 ch txt hostname.bind
> ; (2 servers found)
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
> 
> [kabindra at 1 ~]$ dig @`hostname` -p 5350 ch txt hostname.bind +tcp
> 
> ; <<>> DiG 9.8.1 <<>> @ <replaced> -p 5350 ch txt hostname.bind +tcp
> ; (2 servers found)
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
> 
> One thing we noticed: we have set server-count to 4, so NSD
> should have forked 4 child processes, right? When NSD goes
> unresponsive, we see a couple of <defunct> processes and more than 4
> child processes. Also, these NSD processes are using lots of CPU. I
> have left this box out of service for almost 2 days now after it went
> unresponsive, but you can see the CPU usage on the image below; it's
> not coming down.
> 
> 
> 
> 
>> 
> 
>> NSD has a fixed number of tcp connections, configured with
>> tcp-count: 100 in the nsd.conf file.  That should be what is
>> serviced.  You should increase that count to increase
>> responsiveness to TCP.
> Yes, that's what we changed earlier to increase responsiveness to
> TCP.
> 
>> 
>> UDP should be unaffected.
> That is not the case we are seeing.
> 
>> 
>> The backlog is for tcp connections waiting to be accepted.  256
>> is reasonably portable, reasonably large.  I don't see how that
>> value is your problem.
> That has been true so far and should be true for most users, but
> with the recent increase in TCP traffic, I doubt that's still the
> case. With RRL implemented, I believe TCP traffic is going to
> increase somewhat compared to what it used to be, right? So say
> I increase tcp-count to 1024 but my backlog is set to 256: will I
> still be able to get 1024 connections at a time, or will I be
> limited to 256 concurrent connections?
> 
>> Is your kernel and networking subsystem failing?
> 
> I don't think so; if that were the problem I would see problems with
> other services on that server as well, right?
> 
> 
>> 
>> The OS can return EMFILE or ENFILE to accept(); nsd then stops
>> accepting TCP connections to relieve buffer stress on the
>> OS.  But again, UDP should not have been impacted?
> Again, that's not the case we are seeing.
> 
>> 
>> Are you using so-reuseport: yes?
> Nope.
> 
> 
>> I have had reports that it disrupts connectivity (depending on
>> OS, particular version of the OS, and more recent versions of NSD
>> do not use reuseport on TCP anymore).
> 
> Sorry, forgot to mention earlier, we are on CentOS 6 and NSD 4.1.8.
> 
> 
> Thanks.
> 
>> 
>> Best regards, Wouter
>> 
>> On 05/04/16 18:28, Kabindra Shrestha wrote:
>>> Hi,
>>> 
>>> We are seeing a large number of TCP connections to our DNS
>>> servers (in the thousands), and NSD goes unresponsive after a
>>> certain time and doesn't recover; it stops responding to UDP as
>>> well. We tried increasing tcp-count but it doesn't help. I noticed
>>> the TCP backlog is hardcoded to 256 in the NSD build
>>> configuration, so even with customised TCP backlogs on the system
>>> it's still being throttled at around 256. Is there any way we can
>>> change this value without recompiling NSD?
>>> 
>>> 
>>> [kabindra at 05 nsd-4.1.8]$ grep BACKLOG *
>>> config.h.in:#undef TCP_BACKLOG
>>> configure:#define TCP_BACKLOG 256
>>> configure.ac:AC_DEFINE_UNQUOTED([TCP_BACKLOG], [256], [Define to the backlog to be used with listen.])
>>> 
>>> 
>>> We are using NSD4.1.8.
>>> 
>>> (On one of the servers that went unresponsive, we have seen
>>> the number of TCP connections approach 10k.)
>>> 
>>> # ss -s
>>> Total: 5591 (kernel 5640)
>>> TCP:   5067 (estab 4968, closed 4, orphaned 0, synrecv 0, timewait 3/0), ports 28
>>> 
>>> Transport Total     IP        IPv6
>>> *         5640      -         -
>>> RAW       0         0         0
>>> UDP       122       63        59
>>> TCP       5063      5017      46
>>> INET      5185      5080      105
>>> FRAG      0         0         0
>>> 
>>> 
>>> Thanks.
>>> 
>>> Regards, Kabindra Shrestha
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> nsd-users mailing list
>>> nsd-users at NLnetLabs.nl
>>> https://open.nlnetlabs.nl/mailman/listinfo/nsd-users
>>> 
>> 
> 
> Regards, Kabindra Shrestha
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJXB1yfAAoJEJ9vHC1+BF+NIrQP/RANBg13y8pJZPExaprvcle7
c/DAP9Ha+77I9221KgMhb8hz4ewDIChiAJynIKHeOsTfXOFGNGiaUKAWosUwdA5U
/U9cdcfp8P0fPZXuIHyRJx7YF6q+nYBd2vjrerPIcDKma6sHTsZvd1DIO+zb2CqN
y3Sm9d79DOmB1w+Cd/JHurExdMvOLr4BEosLZojKka9L1Lp/9d1aJF4rLEU9c6fA
B+c1I79FnwofCMo3atqz7cNwBPjwA+co4tsW4vdqhhD+1M1cJROEhwl+wIZlu1Jw
IEN9YCiXk5elo1AQzdYUMahSGhaQrWxebYTINkFakJno4JkOxTEUQFL5e5qeEo7d
PoUd9QqKh8cxT2cY0r7C118D6ps3tP9rXRoQzDWtrxlWe+pXVKUdHUAXOVNqESFv
ySpSYHCVDD4eeEL0TiUC67PJ5YNSo4UMo7coH9Emn4w6tDyQANaknw0yQy3ySWEb
IjquPIHVKNRVW8bS4Ptn8pxSJKzh7ZGL63AuiWZ/jpNKWUiFaSYI3CrF2TD5akox
6n6KwxQ9EuXlxR2/ImxgXo1AmM2svTZDqXV2xi15tT55hqbZMm1Rt+wHpsTHSQDf
ACjAh1lueJpGuzlOTlferTsSS74ak6X8TLpxBjeFJSiiA4XUc2F03eOq0thOv0pZ
B7VNvU1vL5UJVmGk/dPj
=QFRy
-----END PGP SIGNATURE-----