[nsd-users] nsd: remote.c:2307: daemon_remote_process_stats: Assertion `s->in_stats_list' failed.

Wouter Wijngaards wouter at nlnetlabs.nl
Fri Feb 15 07:22:23 UTC 2019


Hi,

Replying to myself: in case the cause is double presence in the list,
this fix should solve the assertion failure. It should remove the item
even if it is present multiple times in the list. If it makes the
failures go away, then the bug must be double presence in that list. I
have also committed it to the code repository; it should be harmless if
this is not the cause, and a workaround if it is.

Index: remote.c
===================================================================
--- remote.c    (revision 4970)
+++ remote.c    (working copy)
@@ -743,12 +743,22 @@
 static void
 stats_list_remove_elem(struct rc_state** list, struct rc_state* todel)
 {
-    while(*list) {
-        if( (*list) == todel) {
-            *list = (*list)->stats_next;
-            return;
+    struct rc_state* prev = NULL;
+    struct rc_state* n = *list;
+    while(n) {
+        /* delete this one? */
+        if(n == todel) {
+            if(prev) prev->stats_next = n->stats_next;
+            else    (*list) = n->stats_next;
+            /* go on and delete further elements */
+            /* prev = prev; */
+            n = n->stats_next;
+            continue;
         }
-        list = &(*list)->stats_next;
+
+        /* go to the next element */
+        prev = n;
+        n = n->stats_next;
     }
 }
 

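For anyone who wants to try the loop outside nsd, here is a minimal
standalone sketch of the same prev-pointer removal. The struct is a
hypothetical stand-in for rc_state with a single field, and
stats_list_count is a helper made up for this example. The sketch
assumes the stats list is chained through stats_next (as in the loop the
patch removes) and is NULL-terminated; it is an illustration, not the
committed code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rc_state; only the link is modelled. */
struct rc_state {
    struct rc_state* stats_next; /* next element in the stats list */
};

/* Unlink every element equal to todel from a NULL-terminated list. */
static void
stats_list_remove_elem(struct rc_state** list, struct rc_state* todel)
{
    struct rc_state* prev = NULL;
    struct rc_state* n;
    assert(list != NULL);
    n = *list;
    while(n) {
        if(n == todel) {
            /* bypass n; prev stays where it is */
            if(prev) prev->stats_next = n->stats_next;
            else *list = n->stats_next;
            n = n->stats_next;
            continue;
        }
        prev = n;
        n = n->stats_next;
    }
}

/* Hypothetical helper: how many times is elem reachable from list? */
static int
stats_list_count(struct rc_state* list, struct rc_state* elem)
{
    int count = 0;
    for(; list; list = list->stats_next)
        if(list == elem) count++;
    return count;
}
```

Removing the head, a middle element, or an element that is not in the
list at all leaves the remaining elements linked in their original
order.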
Best regards, Wouter

On 15/02/2019 08:12, Wouter Wijngaards wrote:
> Hi Ask,
>
> On 15/02/2019 06:39, Ask Bjørn Hansen wrote:
>> Hi everyone,
>>
>> Recently some of my nsd installations started crashing every so many days (or weeks?) with:
>
> This failure was reported to me before, and I could not find the
> cause then. I could not reproduce it, and code inspection shows it
> should not really be possible. By that I mean that the code paths
> that lead to that assertion all have the boolean turned on, and when
> the item is removed from the list the boolean is turned off. The only
> explanations I have left are random heap corruption, or that the list
> management somehow does not work right. This could happen if the item
> was in the list twice. But that should not be possible either, both
> because of this boolean and because there is only one command per
> nsd-control stream. So I am stymied by the issue; the code paths that
> would leave that linked list mismanaged all look impossible.
>
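The flag-and-list invariant described above can be sketched roughly as
follows. This is only an illustration under assumptions: the struct is a
cut-down stand-in for rc_state, and stats_list_add, stats_list_del and
stats_list_assert_flags are names invented for this example, not
functions taken from remote.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rc_state. */
struct rc_state {
    struct rc_state* stats_next; /* next element in the stats list */
    int in_stats_list;           /* set while the element is linked */
};

/* Link s at the head of the list, unless the flag says it is there. */
static void
stats_list_add(struct rc_state** list, struct rc_state* s)
{
    if(s->in_stats_list)
        return; /* the boolean guards against a second insert */
    s->stats_next = *list;
    *list = s;
    s->in_stats_list = 1;
}

/* Unlink s from the list and turn the flag off again. */
static void
stats_list_del(struct rc_state** list, struct rc_state* s)
{
    if(!s->in_stats_list)
        return;
    while(*list) {
        if(*list == s) {
            *list = s->stats_next;
            break;
        }
        list = &(*list)->stats_next;
    }
    s->stats_next = NULL;
    s->in_stats_list = 0;
}

/* The property the assertion relies on: every linked element is flagged. */
static void
stats_list_assert_flags(struct rc_state* list)
{
    for(; list; list = list->stats_next)
        assert(list->in_stats_list);
}
```

As long as every insert and removal goes through this pair of
functions, an element reachable from the list head always has the flag
set, which is why the assertion looks impossible to trigger from code
inspection alone.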
> Some runs in a memory checker like valgrind also show no issues. And
> that is what I would ask for: some way to reproduce, or runs in a
> memory checker like valgrind or gcc's libasan (that might catch the
> issue before it becomes an assertion failure, e.g. a deleted element
> still in use or something). If that is not possible, I guess I could
> try to make debug versions of the code for people who have the
> problem: some sort of lighter-weight debug code to get more
> information on this issue, with less slowdown than valgrind. But I am
> not sure what it would need to check for.
>
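One possible shape for such lighter-weight debug code, purely as a
sketch: walk the list once, confirm every element has its flag set, and
use Floyd's two-pointer cycle detection to notice a list that loops back
on itself. The struct, the enum and both function names are assumptions
made for this example; nothing here is taken from nsd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rc_state. */
struct rc_state {
    struct rc_state* stats_next; /* next element in the stats list */
    int in_stats_list;           /* set while the element is linked */
};

enum stats_list_status {
    STATS_LIST_OK = 0,
    STATS_LIST_FLAG_CLEAR, /* a linked element has in_stats_list == 0 */
    STATS_LIST_CYCLE       /* the list loops back on itself */
};

/* Check flags and look for a cycle; fast moves two steps per slow step. */
static enum stats_list_status
stats_list_check(struct rc_state* list)
{
    struct rc_state* slow = list;
    struct rc_state* fast = list;
    while(fast) {
        if(!fast->in_stats_list) return STATS_LIST_FLAG_CLEAR;
        fast = fast->stats_next;
        if(!fast) break;
        if(!fast->in_stats_list) return STATS_LIST_FLAG_CLEAR;
        fast = fast->stats_next;
        slow = slow->stats_next;
        if(fast && fast == slow) return STATS_LIST_CYCLE;
    }
    return STATS_LIST_OK;
}

/* Debug hook: abort early, close to the point of corruption. */
static void
stats_list_assert_ok(struct rc_state* list)
{
    assert(stats_list_check(list) == STATS_LIST_OK);
}
```

Calling a hook like stats_list_assert_ok after each insert and removal
costs one pass over a short list, far less than running under a memory
checker, and would fail nearer to where the list first goes wrong.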
> Best regards, Wouter
>
>
>> nsd: remote.c:2307: daemon_remote_process_stats: Assertion `s->in_stats_list' failed.
>>
>> I am using the 4.1.24-2.el7 build on CentOS 7.
>>
>> I haven’t been able to reproduce it. The servers are doing pretty light duty (<1000 qps) and rarely reloading configs.  “nsd-control stats_noreset” runs once or twice a minute (from a prometheus exporter).
>>
>> Is this a known problem in 4.1.24 or am I doing something goofy?
>>
>>
>> Ask
>>
>> _______________________________________________
>> nsd-users mailing list
>> nsd-users at NLnetLabs.nl
>> https://open.nlnetlabs.nl/mailman/listinfo/nsd-users


