[nsd-users] NSD 4.3.9 stops updating zones

Sun Apr 3 12:29:29 UTC 2022

Hi NSD developers and users,

I've observed a situation with NSD that I think deserves some attention, 
and perhaps some kind of fix.

We have a server with 32GB of RAM. When we start NSD, it loads all the 
zones, and happily serves them. It uses close to 15GB of RAM. After a 
while, it gets a NOTIFY for a zone, and AXFRs the zone. It saves the XFR 
in /var/lib/nsd/nsd-xfr-5231. It then tries to apply the update, and 
this is when it all goes wrong. NSD's method of updating is to fork 
itself, have the child reload the changed zone(s), and take over from 
the parent... except that it can't fork because of memory shortage. 
While forking, NSD temporarily uses double the amount of RAM.
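
As a rough illustration of the memory arithmetic above, the following sketch compares a process's resident size with the MemAvailable figure from Linux's /proc/meminfo. It assumes the worst case, in which the forked child ends up with a full private copy of the parent's memory. Whether fork() actually fails also depends on the kernel's overcommit settings, so treat the result as a pessimistic estimate. The function names and sample numbers are made up for illustration and are not part of NSD.

```python
import re


def meminfo_kib(text, field):
    """Return the value (in KiB) of one field from /proc/meminfo-style text."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", text, re.MULTILINE)
    if match is None:
        raise KeyError(field)
    return int(match.group(1))


def reload_headroom_kib(meminfo_text, nsd_rss_kib):
    """Worst-case headroom if the child needs a full copy of the parent's memory.

    A negative result means a reload might not fit in available memory.
    """
    return meminfo_kib(meminfo_text, "MemAvailable") - nsd_rss_kib


# Illustrative numbers only: 32 GiB total, 12 GiB available, 15 GiB resident.
sample = "MemTotal:       33554432 kB\nMemAvailable:   12582912 kB\n"
print(reload_headroom_kib(sample, 15 * 1024 * 1024))  # prints -3145728
```

On a live system the text would come from reading /proc/meminfo, and the resident size from the VmRSS line in /proc/<pid>/status.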

The log shows this:

[2022-03-30 15:16:27.986] nsd[5299]: error: fork failed: Cannot allocate memory
[2022-03-30 15:16:28.355] nsd[45999]: error: handle_reload_cmd: reload closed cmd channel
[2022-03-30 15:16:28.355] nsd[45999]: warning: Reload process 5299 failed, continuing with old database
[2022-03-30 15:16:28.355] nsd[5231]: error: process 5299 exited with status 256
[2022-03-30 15:16:29.776] nsd[45999]: error: fork failed: Cannot allocate memory
[2022-03-30 15:16:30.149] nsd[46012]: error: handle_reload_cmd: reload closed cmd channel
[2022-03-30 15:16:30.149] nsd[46012]: warning: Reload process 45999 failed, continuing with old database
[2022-03-30 15:16:31.748] nsd[46013]: error: handle_reload_cmd: reload closed cmd channel
[2022-03-30 15:16:31.748] nsd[46013]: warning: Reload process 46012 failed, continuing with old database

After this, there are no more log entries about trying to reload the 
database.
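
Since the failure goes quiet after these entries, one way to notice it is to scan the log for them. The sketch below is a minimal example, assuming each log entry sits on a single line in the "[timestamp] nsd[pid]: level: message" layout shown above. The function name and the choice of messages to match are illustrative, not anything NSD provides.

```python
import re

# Layout taken from the log excerpt: [timestamp] nsd[pid]: level: message
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] nsd\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)


def reload_failures(lines):
    """Return (timestamp, pid, message) for entries that signal a failed reload."""
    hits = []
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # not in the expected layout, skip it
        msg = m.group("msg")
        if msg.startswith("fork failed") or re.match(r"Reload process \d+ failed", msg):
            hits.append((m.group("ts"), int(m.group("pid")), msg))
    return hits


sample = [
    "[2022-03-30 15:16:27.986] nsd[5299]: error: fork failed: Cannot allocate memory",
    "[2022-03-30 15:16:28.355] nsd[45999]: error: handle_reload_cmd: reload closed cmd channel",
    "[2022-03-30 15:16:28.355] nsd[45999]: warning: Reload process 5299 failed, continuing with old database",
]
for hit in reload_failures(sample):
    print(hit)
```

A monitoring job could run this over the tail of the log file and raise an alert when the list is not empty.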

PID 5231 is the xfrd process, and 5299 was the master that coordinates 
things. Now, the situation looks like this:

# systemctl status nsd
● nsd.service - NSD DNS Server
    Loaded: loaded (/usr/lib/systemd/system/nsd.service; enabled; vendor preset: disabled)
    Active: active (running) since Tue 2022-01-04 12:07:30 UTC; 2 months 28 days ago
  Main PID: 5231 (nsd: xfrd)
    CGroup: /system.slice/nsd.service
            ├─ 5231 /usr/sbin/nsd -d
            ├─46013 /usr/sbin/nsd -d
            ├─46016 /usr/sbin/nsd -d
            └─46024 /usr/sbin/nsd -d

So we are in a state where the xfrd process is running and keeps doing
zone transfers, which slowly accumulate in /var/lib/nsd/nsd-xfr-5231.
Eventually, this will fill up the disk. The child processes are still
running and serving queries, but the zones are now outdated, because
there is no master process to apply the transfers. Log file rotation
is also broken: when I run "nsd-control log_reopen", no new log file
is created, so the log file will grow without bound until it fills up
the disk. Essentially, NSD is crippled, and only a restart will get it
out of this broken state.
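
Until there is a fix, the growing backlog of unapplied transfers can at least be watched. The sketch below sums the file sizes under a directory and compares the total with a threshold. It assumes the nsd-xfr path mentioned above is a directory. The 1 GiB limit and the function names are arbitrary choices for illustration.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished between listing and stat
    return total


def backlog_exceeded(path, limit_bytes):
    """True when the transfer backlog under path is larger than limit_bytes."""
    return dir_size_bytes(path) > limit_bytes


# Path from the report above; the 1 GiB limit is an assumed value.
print(backlog_exceeded("/var/lib/nsd/nsd-xfr-5231", 1 << 30))
```

Run periodically from cron or a monitoring agent, a True result would be a hint that transfers are arriving but no longer being applied.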

The easiest way to prevent this is to add RAM to the server, but in my
opinion this is a waste of resources. It may also not be trivial to
do: it might be easy on a virtual server, but with a physical server,
one needs to buy RAM, shut down the server and add the memory modules.
In this area, I find NSD to be deficient. Other name servers handle
their memory differently, and make incremental use of memory as zones
are added.

A question for the developers is: is there any way to make NSD handle 
zone reloads more efficiently rather than doing this fork/reload?

Regards,
Anand