[nsd-users] questions how NSD slave handles XFR and how to improve XFR-patching performance
klaus.mailinglists at pernau.at
Thu Apr 4 21:07:40 UTC 2019
I tried to analyze how NSD, running as a slave, handles XFRs. Please correct me if I am wrong.
When NSD is running, I see 3 processes; let's call them P1, P2 and P3.
It seems that P3 is the "worker", P1 is a "supervisor" and P2 is the
"zone handler". Is this correct? What are the correct names for these
individual processes?
The incoming NOTIFY is received by P3 and signaled to P1. P1 performs
the XFR and saves it to disk.
Then P2 calls clone() and P4 is generated. I have no idea what P4 is used for.
P2 reads the xfr file (written to disk by P1) and applies the
difference to the in-memory zone.
P2 deletes the xfr file.
P2 calls clone() and P5 is generated.
P3 and P4 exit() and P5 is the new worker.
What do you call P4 and what is it used for?
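One possible reading of the sequence above is that each worker is a forked snapshot of the process holding the zone in memory, so a fresh worker has to be forked after every applied diff. The following is only a minimal sketch of that general fork-and-replace pattern, not NSD's actual code; the names `spawn_worker`, `ask`, `demo` and the `zone` dict are hypothetical and exist only for illustration.

```python
import os

def spawn_worker(zone):
    """Fork a worker that answers one query with the serial it sees."""
    query_r, query_w = os.pipe()
    reply_r, reply_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy-on-write snapshot of the parent's memory.
        try:
            os.close(query_w)
            os.close(reply_r)
            os.read(query_r, 1)  # block until the parent sends a query
            os.write(reply_w, str(zone["serial"]).encode())
        finally:
            os._exit(0)
    os.close(query_r)
    os.close(reply_w)
    return pid, query_w, reply_r

def ask(worker):
    """Send one query to a worker, return the serial it reports, reap it."""
    pid, query_w, reply_r = worker
    os.write(query_w, b"?")
    answer = int(os.read(reply_r, 64))
    os.close(query_w)
    os.close(reply_r)
    os.waitpid(pid, 0)
    return answer

def demo():
    zone = {"serial": 1}             # hypothetical in-memory zone
    old_worker = spawn_worker(zone)  # plays the role of P3
    zone["serial"] = 2               # parent applies a diff in memory
    new_worker = spawn_worker(zone)  # plays the role of P5
    return ask(old_worker), ask(new_worker)

if __name__ == "__main__":
    print(demo())
```

In this sketch the old worker keeps answering with the serial it was forked with, which would explain why the old worker has to exit and be replaced after each update. Whether P4 fits into this picture is exactly the open question.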
I observed the above scenario on an NSD slave with a huge TLD zone A (2.7 GB)
and on an NSD with a big TLD zone B (700 MB).
However, on another NSD which has both zones loaded (plus some
smaller zones, ~500 MB in total) it looks different. Zone B sends NOTIFYs
every 3 seconds. Hence, NSD is more or less permanently applying diffs
and forking. But it behaves differently than in the above scenario: there is
not only P4 and P5, but also P6.
# pstree -p | grep nsd
looks normal, but some seconds later:
and some seconds later, 2 processes exit and a new one is forked:
Is this a legal scenario? From the logs it seems to me that,
because the NOTIFYs arrive faster than NSD can apply the diffs and fork,
a second XFR is initiated while NSD is still waiting for the
previous XFR to finish (not sure, but I think the clone() is the
problem, as it takes 4-5 seconds).
So, how is NSD supposed to behave when it receives a NOTIFY while the last
XFR is not finished yet? Is it a feature or a bug to start the next XFR
although the previous one is not finished yet?
With strace I sometimes see an aborted clone(). Could this be a sign of
some problem? E.g.:
[pid 12457] 19:56:44.043771 <... clone resumed> child_stack=NULL,
child_tidptr=0x7f324b19ba10) = ? ERESTARTNOINTR (To be restarted) <5.683020>
Could it be that a clone() is aborted because there is a newer zone version, so
the XFR is applied and clone() is restarted? If this is the case, it
could happen that a clone() never finishes because a new XFR is always available.
Any ideas to improve the XFR-applying speed? I think the most
time-consuming part is the clone() (P2 has around 2.3 million page tables).
Are there any config/build options to improve this? E.g. huge pages? Using
some other forking function?
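To check whether fork time really scales with the amount of mapped memory on a given machine, a rough measurement like the following could help. It is only a sketch that is independent of NSD: the helper name `time_fork` is hypothetical, and the buffer is touched once per 4 KiB on the assumption that the system uses 4 KiB pages.

```python
import os
import time

PAGE = 4096  # assumed page size in bytes

def time_fork(resident_mib):
    """Return the seconds spent in fork() with resident_mib MiB touched."""
    buf = bytearray(resident_mib * 1024 * 1024)
    # Touch every page so it is really mapped and has a page-table entry.
    for offset in range(0, len(buf), PAGE):
        buf[offset] = 1
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately, only fork cost is of interest
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed

if __name__ == "__main__":
    for size in (16, 64, 256):
        print(size, "MiB:", time_fork(size), "s")
```

If the 2.3 million figure counts 4 KiB pages, that corresponds to roughly 8.8 GiB of mappings. Covered by 2 MiB huge pages, the same range would need only about 4,500 entries, which is why huge pages look worth testing.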