[nsd-users] make nsdc more reliable for restart

Mon Aug 13 07:33:19 UTC 2012

On Sun, Aug 12, 2012 at 01:41:58PM -0400, Paul Wouters wrote:
> On Fri, 10 Aug 2012, Stuart Henderson wrote:
> 
> >>>while running nsd as a secondary nameserver with 1000+
> >>>domains we discovered that the default nsdc(8) was
> >>>not able to reliably restart nsd.
> >>>The reason, I think, is that, by using the PID file, it sends
> >>>its signal to only 1 of the default 3 processes.
> >>>Afterwards it only checks against this 1 process while
> >>>the other 2 may still be running, causing trouble on
> >>>start up.
> >
> >I wondered whether there's a particular reason that only the
> >master is signalled, or is this purely due to lack of a portable
> >pkill-type program?
> >
> >>>The patch below fixes it for us (was tested in a lab
> >>>environment with 10.000 domains).
> >>
> >>The "pkill" command is not available on all systems. Linux distros ship
> >>with it these days, and MacOS X introduced it with Mountain Lion (10.8),
> >>but it may not be available on other systems. Therefore your patch is
> >>not portable.
> >
> >Some OS have "killall" that does the same as pkill, but other
> >OS have a different "killall" that behaves slightly differently ;)
> 
> The patch did not address my issue actually.
> 
> [root at nohats ~]# pidof nsd
> 4697 4696 4677
> [root at nohats ~]# ls /var/run/nsd
> [root at nohats ~]# nsdc stop
> nsd is not running
> 
> Somehow nsd gets signalled and deletes its pidfile, but won't write a new
> one. There are two ways my nsd gets signalled. One is via
> an hourly cron job running (if necessary) an nsdc patch and nsdc reload. When
> doing this manually, it works fine: the reload signals nsd and a
> new pidfile is created:
> 
> [root at nohats ~]# pidof nsd
> 1301 1300 1298
> [root at nohats ~]# cat /var/run/nsd/nsd.pid
> 1298
> [root at nohats ~]# kill -HUP 1298
> [root at nohats ~]# cat /var/run/nsd/nsd.pid
> 1304
> [root at nohats ~]# pidof nsd
> 1305 1304 1300
> [root at nohats ~]# kill -HUP 1304
> [root at nohats ~]# pidof nsd
> 1313 1312 1300
> [root at nohats ~]# cat /var/run/nsd/nsd.pid
> 1312
> 
> The second method is by opendnssec, configured to use:
> 
> /etc/opendnssec/conf.xml:		<NotifyCommand>sudo /sbin/service nsd restart</NotifyCommand>
> 
> [root at nohats ~]# su - ods
> -bash-4.1$ cat /var/run/nsd/nsd.pid
> 1312
> -bash-4.1$ sudo /sbin/service nsd restart
> Stopping nsd:                                              [  OK  ]
> Starting nsd:                                              [  OK  ]
> -bash-4.1$ cat /var/run/nsd/nsd.pid
> 1494
> -bash-4.1$ pidof nsd
> 1497 1496 1494
> 
> So it all looks fine, but after a while something happens and the
> pidfile is either wrong or gone, and then all of these fail. But
> even with the pkill patch applied to /usr/sbin/nsdc, this still
> happens.
> 
> Paul

The patch is far from ideal (what would happen if you have more than one
nsd instance running?). However, we have used it in production for roughly
a year and it has survived 300+ restarts.
Since the secondary domains only exist in memory, I don't see any harm
in killing all instances of nsd at once; running pkill on them without even
looking at the pidfile should be fine too.
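
To make the idea concrete, here is a rough sketch of the "kill everything
named nsd and wait" approach. It is not the patch itself: the function name
stop_all and the 10 second default timeout are made up for illustration, and
it assumes pkill and pgrep with the -x (exact name) option are available,
which, as noted earlier in the thread, is not the case on every system.

```shell
#!/bin/sh
# Hypothetical helper, not part of nsdc: signal every process whose name
# matches exactly, then wait until none of them is left.
# Usage: stop_all NAME [TIMEOUT_SECONDS]
stop_all() {
    name=$1
    timeout=${2:-10}    # assumed default, tune for the size of your zones

    # pkill exits non-zero when nothing matched (or when pkill is missing);
    # in both cases there is nothing left for us to wait on.
    pkill -TERM -x "$name" 2>/dev/null || return 0

    while [ "$timeout" -gt 0 ]; do
        # pgrep exits non-zero once no matching process remains.
        pgrep -x "$name" >/dev/null 2>&1 || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Something is still running after the timeout.
    return 1
}
```

Something like "stop_all nsd 30 && nsdc start" would then only start a new
nsd once all the old processes are gone. Note that this deliberately ignores
the pidfile, so it will also hit any second nsd instance on the same host.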

We run OpenBSD everywhere, so I sent this to sthen at openbsd, who then
suggested posting it here to get more feedback. That's why it's not portable :P

We run the default nsdc on our primary servers where the problem
does not exist.
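
For the symptom Paul describes (nsd processes running while the pidfile is
wrong or gone), a small check along these lines might help to catch the
moment it happens, e.g. from cron. This is only a sketch: pidfile_status is
a hypothetical name, and it assumes the check runs as root (or as the nsd
user), since kill -0 also fails when the caller lacks permission to signal
the process.

```shell
#!/bin/sh
# Hypothetical helper, not part of nsdc: classify the state of a pidfile.
# Prints one of: missing, stale, ok
pidfile_status() {
    pidfile=$1

    # No file, or an empty one.
    if [ ! -s "$pidfile" ]; then
        echo missing
        return 0
    fi

    # Read the first line; read still sets pid if the newline is absent.
    read -r pid < "$pidfile" || :

    # Anything that is not a plain number cannot be a live PID.
    case $pid in
        ''|*[!0-9]*) echo stale; return 0 ;;
    esac

    # kill -0 sends no signal, it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo ok
    else
        echo stale
    fi
}
```

Running "pidfile_status /var/run/nsd/nsd.pid" next to "pidof nsd" and logging
both should show whether the file disappears or merely points at a dead PID.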