Unbound memory resource consumption?
Havard Eidnes
he at uninett.no
Thu Mar 13 16:50:57 UTC 2025
Hi,
and thanks for the feedback, the general advice, and the pointer to
jemalloc. I may look into that a bit later.
However, in the meantime I have come to the conclusion that there
may be a correlation between, on the one hand, my enabling DoH and
DoT and using RFC 9462 to direct clients that probe for
_dns.resolver.arpa to the DoH and/or DoT endpoints, and, on the
other hand, what really does look like a massive memory leak in
unbound. If that is true, which malloc() you use should not make
much of a difference.
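For context, the RFC 9462 part amounts to serving SVCB records (per
RFC 9461) under resolver.arpa, roughly like the sketch below --
dns.example.net stands in for our real resolver name:

```
; Sketch only: dns.example.net is a placeholder, not our actual name.
_dns.resolver.arpa. 300 IN SVCB 1 dns.example.net. alpn=dot port=853
_dns.resolver.arpa. 300 IN SVCB 2 dns.example.net. alpn=h2 port=443 dohpath=/dns-query{?dns}
```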
To test this hypothesis, I turned off DoH and DoT (diff to config
attached below; it was only turned on about a month ago), stopped
serving resolver.arpa, and restarted unbound. Here are a few "top"
displays taken over the span of a few hours, the first shortly
after this config change:
load averages: 0.26, 0.20, 0.25; up 6+00:57:31 14:24:00
79 processes: 76 sleeping, 1 stopped, 2 on CPU
CPU states: 4.5% user, 0.0% nice, 2.2% system, 1.0% interrupt, 92.2% idle
Memory: 2702M Act, 7948K Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1574K In, 16
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 43 0 398M 268M CPU/2 6:55 30.22% 30.22% unbound
load averages: 0.13, 0.17, 0.21; up 6+01:49:28 15:15:57
79 processes: 77 sleeping, 1 stopped, 1 on CPU
CPU states: 2.8% user, 0.0% nice, 2.0% system, 0.6% interrupt, 94.5% idle
Memory: 2847M Act, 11M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1234K In, 13
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 85 0 544M 417M kqueue/2 18:13 38.23% 38.23% unbound
load averages: 0.22, 0.11, 0.10; up 6+03:55:58 17:22:27
90 processes: 87 sleeping, 1 stopped, 2 on CPU
CPU states: 1.2% user, 0.0% nice, 1.1% system, 0.2% interrupt, 97.3% idle
Memory: 3040M Act, 18M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 648K In, 700
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 43 0 738M 604M CPU/2 38:45 3.61% 3.61% unbound
If we compare this to what I experienced with these options turned
on and a number of DoH / DoT clients using those endpoints, quoting
from yesterday's e-mail:
load averages: 0.86, 0.94, 0.92; up 5+00:58:04 14:24:33
86 processes: 83 sleeping, 1 stopped, 2 on CPU
CPU states: 14.8% user, 0.0% nice, 1.3% system, 0.8% interrupt, 83.0% idle
Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 40 0 5408M 3033M CPU/2 183:17 78.47% 78.47% unbound
load averages: 0.52, 0.53, 0.52; up 5+02:22:23 15:48:52
85 processes: 82 sleeping, 1 stopped, 2 on CPU
CPU states: 11.4% user, 0.0% nice, 1.8% system, 1.0% interrupt, 85.7% idle
Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 84 0 6863M 3825M kqueue/0 236:12 39.55% 39.55% unbound
load averages: 0.19, 0.35, 0.41; up 5+04:50:24 18:16:53
85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
CPU states: 11.3% user, 0.0% nice, 1.2% system, 0.0% interrupt, 87.4% idle
Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 85 0 9358M 5118M RUN/1 319:53 29.30% 29.30% unbound
You'll notice the difference is quite stark.
Not only is the CPU time much lower (OK, crypto costs, I guess), but
also the trajectory of the virtual size is vastly different:
5408M -> 6863M (1:24h later) -> 9358M (3:52h after 0th measurement)
compared to
398M -> 544M (51m later) -> 738M (2:58h after 0th measurement)
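Put as rough per-hour growth rates (times taken from the top
headers above, so ~3.87 h and ~2.97 h between first and last
sample respectively):

```shell
# Back-of-the-envelope growth rate of unbound's virtual size,
# from the "top" samples quoted above.
awk 'BEGIN {
    printf "with DoH/DoT:    ~%.0f MB/h\n", (9358 - 5408) / 3.87
    printf "without DoH/DoT: ~%.0f MB/h\n", (738 - 398) / 2.97
}'
```

i.e. roughly an order of magnitude apart.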
And according to "unbound-control stats" the query rate is
comparable to what it was yesterday.
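(For narrowing this down further: unbound accounts for its own
cache and module memory in the stats output, so comparing the sum
of the mem.* counters against the RSS shown by top should reveal
whether the growth is in memory unbound knows about. Something
like, against the running daemon:

```
unbound-control stats_noreset | grep '^mem\.'
```

If the mem.* total stays flat while RSS climbs, the growth is
outside the accounted caches.)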
So I suspect there is a serious memory leak, possibly in unbound
itself, related to the code that handles DoH and/or DoT.
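In case the growth turns out to be buffered stream/HTTP data rather
than a leak proper, unbound does have knobs that cap such memory; a
sketch of the relevant unbound.conf options (the values shown are,
as I read the documentation, the defaults -- only a starting point,
not a recommendation):

```
server:
    # Cap memory held for TCP/TLS answers waiting to be written out
    stream-wait-size: 4m
    # Cap buffered DoH query / response data
    http-query-buffer-size: 4m
    http-response-buffer-size: 4m
```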
Diff to our unbound.conf compared to yesterday attached.
Regards,
- Håvard
-------------- next part --------------
rcsdiff -u unbound.conf
===================================================================
RCS file: RCS/unbound.conf,v
retrieving revision 1.9
diff -u -r1.9 unbound.conf
--- unbound.conf 2025/03/03 16:25:44 1.9
+++ unbound.conf 2025/03/13 12:53:24
@@ -12,27 +12,27 @@
# 853 = DNS-over-TLS
# 443 = DNS-over-HTTPS
interface: 158.38.0.2
- interface: 158.38.0.2@443
- interface: 158.38.0.2@853
+# interface: 158.38.0.2@443
+# interface: 158.38.0.2@853
interface: 2001:700:0:ff00::2
- interface: 2001:700:0:ff00::2@443
- interface: 2001:700:0:ff00::2@853
+# interface: 2001:700:0:ff00::2@443
+# interface: 2001:700:0:ff00::2@853
interface: 158.38.0.169
- interface: 158.38.0.169@443
- interface: 158.38.0.169@853
+# interface: 158.38.0.169@443
+# interface: 158.38.0.169@853
interface: 2001:700:0:503::c253
- interface: 2001:700:0:503::c253@443
- interface: 2001:700:0:503::c253@853
+# interface: 2001:700:0:503::c253@443
+# interface: 2001:700:0:503::c253@853
interface: 127.0.0.1
interface: ::1
# TLS key and certificate
- tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
- tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
- tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
+# tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
+# tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
+# tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
# Enable DNS-over-HTTPS (doh):
- https-port: 443
+# https-port: 443
# These need tuning away from defaults;
# the defaults not suitable for TCP-heavy workloads:
@@ -988,9 +988,9 @@
# for-upstream: yes
# zonefile: "example.org.zone"
- auth-zone:
- name: resolver.arpa
- zonefile: "pz/resolver.arpa"
+# auth-zone:
+# name: resolver.arpa
+# zonefile: "pz/resolver.arpa"
# Views
# Create named views. Name must be unique. Map views to requests using
More information about the Unbound-users mailing list