[nsd-users] Edge case on nsdc?

Shane Kerr shane at ca.afilias.info
Wed Jul 9 22:51:10 UTC 2008


Matthijs,

On Jul 9, 2008, at 12:22 +0200, Matthijs Mekking wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Shane,
>
>> [1214740996] nsd[93921]: warning: nsd is already running as 93888, continuing
>> [1214740996] nsd[93922]: error: can't bind the socket: Address already in use
>> [1214741027] nsd[94418]: error: can't bind the socket: Address already in use
>> [1214741057] nsd[94932]: error: can't bind the socket: Address already in use
>
> This occurs when you call nsd manually (e.g. without nsdc, the NSD
> control script). Because NSD is already running, the new process can't
> bind the socket, and server initialization for that process fails.
> Because server initialization fails, it tries to remove the pidfile.
> Hence, later you will only see the socket bind error, and no longer
> the 'already running' warning (and therefore "nsdc running" will tell
> you it is not running).
>
> I changed nsd.c so that the pidfile is written only after server
> initialization succeeds.

Cool.

>> I think this is because we have a monitoring script that makes sure
>> NSD is running at all times and attempts to start it... even though
>> NSD is already running.
>
> What script do you use for monitoring NSD? nsdc can also be used for
> this: run "nsdc running" to check whether nsd is running, and if it
> returns 1 (not running), run "nsdc start".

We use nsdc for this. The script basically does:

while true; do
     if ! nsdc running; then
         nsdc start
     fi
     sleep 15
done

>> In the nsdc.sh script we see the following:
>>
>>
>> signal() {
>>        if [ -s ${pidfile} ]
>>        then
>>                kill -"$1" `cat ${pidfile}` && return 0
>>        else
>>                echo "nsd is not running"
>>        fi
>>        return 1
>> }
>>
>>
>> But it seems like NSD restarts itself regularly, getting a new process
>> ID when it does so. In this case, we have the possibility for the
>> following to happen:
>>
>> - nsdc.sh reads the contents of pidfile
>>
>> - NSD restarts, getting a new PID
>>
>> - nsdc.sh sends a signal to test NSD using the old PID, which fails,
>>   so nsdc claims NSD is not running
>>
>> Is this possible?
>
> As far as I know, when NSD restarts (because it received a dedicated
> signal), it takes care of updating the pidfile.

When you use "nsdc patch", you get an implicit "nsdc reload". We run  
this from a cron job.

nsdc reload issues a SIGHUP to NSD.

This eventually ends up in the server_main() function in server.c,  
which calls fork(), and therefore gets a new pid, which it then writes  
into the pidfile.

So, the scenario is:

Time 1: NSD, running as PID A, writes PID A into the pidfile
Time 2: nsdc reads PID A from the pidfile
Time 3: NSD gets a SIGHUP, forks a new process with PID B, and the old
        process exits
Time 4: nsdc sends a signal to PID A, which no longer exists
Time 5: nsdc returns "server not running" even though the server is
        running
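
The timeline above can be reproduced without NSD at all. This is only a
sketch: the sleep processes stand in for the old and new server, and the
temporary pidfile and the variable names are made up for illustration.

```shell
#!/bin/sh
# Simulated race. The "server" is just a sleep process and the pidfile
# is a temporary file; both are assumptions for illustration only.
pidfile=$(mktemp)

# Time 1: the "server", running as PID A, writes its PID into the pidfile
sleep 60 &
pid_a=$!
echo "$pid_a" > "$pidfile"

# Time 2: the control script reads PID A from the pidfile
seen_pid=$(cat "$pidfile")

# Time 3: the "server" reloads: process B takes over, then A exits
sleep 60 &
pid_b=$!
echo "$pid_b" > "$pidfile"
kill "$pid_a"
wait "$pid_a" 2>/dev/null || true

# Time 4: the control script probes the stale PID A (signal 0 = test only)
if kill -0 "$seen_pid" 2>/dev/null
then
    stale_result="running"
else
    stale_result="not running"
fi

# Time 5: the verdict from the stale PID disagrees with reality
if kill -0 "$pid_b" 2>/dev/null
then
    actual_result="running"
else
    actual_result="not running"
fi
echo "stale check: $stale_result; actual: $actual_result"

# clean up the stand-in server and the temporary pidfile
kill "$pid_b"
wait "$pid_b" 2>/dev/null || true
rm -f "$pidfile"
```

The window between Time 2 and Time 4 is tiny in practice, which would
explain why this shows up only occasionally.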

>> It is possible to work around this with a little more sophistication,
>> I think:
>>
>> signal() {
>>    while true
>>    do
>>        # if there is no PID file, NSD is not running
>>        if [ ! -s "${pidfile}" ]
>>        then
>>            return 1
>>        fi
>>
>>        # if we can send the signal to the PID, then NSD is running
>>        #   (or some other process with that PID, but we hope not...)
>>        PID=`cat "${pidfile}"`
>>        if kill -"$1" "$PID"
>>        then
>>            return 0
>>        fi
>>
>>        # double-check NSD did not restart between the time we read
>>        # the PID and the time we sent the signal
>>        CHECK_PID=`cat "${pidfile}"`
>>        if [ "$PID" = "$CHECK_PID" ]
>>        then
>>            echo "nsd is not running"
>>            return 1
>>        fi
>>    done
>> }
>
> Could you try the trunk release? I think it already fixes this issue.
> Make sure your control script first checks whether nsd is running
> (nsdc running) and, if not, starts it (nsdc start).

The fix you made makes sense, and should be included.

But I am reasonably sure there is nothing that the server can do to
fix this problem (mind you, I am a bit sleep-deprived right now, so no
promises). ;)

I think the script needs to work like I coded it here, where it checks
that the PID of the server did not change while it was checking.
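
For reference, a standalone variant of that idea might look like the
following. It is a sketch under assumptions, not the nsdc code: the
function name signal_with_recheck, passing the pidfile as an argument,
and the limit of five attempts are all my own inventions.

```shell
#!/bin/sh
# Hypothetical helper, not part of nsdc: send signal $1 to the PID stored
# in pidfile $2, and retry if the PID changed while we were checking.
signal_with_recheck() {
    sig=$1
    file=$2
    tries=0
    # the limit of 5 attempts is an arbitrary assumption
    while [ "$tries" -lt 5 ]
    do
        tries=$((tries + 1))

        # no (or empty) PID file: the server is not running
        [ -s "$file" ] || return 1

        pid=$(cat "$file")
        # signal delivered: the server (or whatever owns that PID) is running
        kill -"$sig" "$pid" 2>/dev/null && return 0

        # the signal failed; if the pidfile still names the same PID,
        # the server really is gone
        check_pid=$(cat "$file" 2>/dev/null)
        [ "$pid" = "$check_pid" ] && return 1

        # otherwise the server restarted in between, so try again
    done
    return 1
}
```

Calling it as signal_with_recheck 0 "$pidfile" only probes for the
process; a signal name such as HUP would deliver that signal instead.
The bounded loop is there so a server that keeps restarting cannot make
the script spin forever.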

--
Shane


