Unbound memory resource consumption?
Havard Eidnes
he at uninett.no
Thu Mar 13 16:50:57 UTC 2025
Hi,
and thanks for the feedback, the general advice, and the pointer to
jemalloc. I may look into that a bit later.
However, in the meantime I have come to the conclusion that there
may be a correlation between, on the one hand, my enabling DoH and
DoT and using RFC 9462 to direct clients that probe for
_dns.resolver.arpa to the DoH and/or DoT endpoints, and, on the
other hand, what really does look like a massive memory leak in
unbound. If that is true, which malloc() you use should not make
much of a difference.
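For context, the RFC 9462 part amounts to serving SVCB records (per
RFC 9461) under resolver.arpa, roughly like the sketch below --
dns.example.net stands in for our real resolver name:

```
; Sketch only: dns.example.net is a placeholder, not our actual name.
_dns.resolver.arpa. 300 IN SVCB 1 dns.example.net. alpn=dot port=853
_dns.resolver.arpa. 300 IN SVCB 2 dns.example.net. alpn=h2 port=443 dohpath=/dns-query{?dns}
```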
To test this hypothesis, I turned off DoH and DoT (diff to config
attached below; it was only turned on about a month ago), stopped
serving resolver.arpa, and restarted unbound. Here are a few "top"
displays taken over the span of a few hours, the first shortly
after this config change:
load averages: 0.26, 0.20, 0.25; up 6+00:57:31 14:24:00
79 processes: 76 sleeping, 1 stopped, 2 on CPU
CPU states: 4.5% user, 0.0% nice, 2.2% system, 1.0% interrupt, 92.2% idle
Memory: 2702M Act, 7948K Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1574K In, 16
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 43 0 398M 268M CPU/2 6:55 30.22% 30.22% unbound
load averages: 0.13, 0.17, 0.21; up 6+01:49:28 15:15:57
79 processes: 77 sleeping, 1 stopped, 1 on CPU
CPU states: 2.8% user, 0.0% nice, 2.0% system, 0.6% interrupt, 94.5% idle
Memory: 2847M Act, 11M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1234K In, 13
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 85 0 544M 417M kqueue/2 18:13 38.23% 38.23% unbound
load averages: 0.22, 0.11, 0.10; up 6+03:55:58 17:22:27
90 processes: 87 sleeping, 1 stopped, 2 on CPU
CPU states: 1.2% user, 0.0% nice, 1.1% system, 0.2% interrupt, 97.3% idle
Memory: 3040M Act, 18M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 648K In, 700
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 43 0 738M 604M CPU/2 38:45 3.61% 3.61% unbound
If we compare this to what I experienced with these options turned
on and a number of DoH / DoT clients using those endpoints, quoting
from yesterday's e-mail:
load averages: 0.86, 0.94, 0.92; up 5+00:58:04 14:24:33
86 processes: 83 sleeping, 1 stopped, 2 on CPU
CPU states: 14.8% user, 0.0% nice, 1.3% system, 0.8% interrupt, 83.0% idle
Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 40 0 5408M 3033M CPU/2 183:17 78.47% 78.47% unbound
load averages: 0.52, 0.53, 0.52; up 5+02:22:23 15:48:52
85 processes: 82 sleeping, 1 stopped, 2 on CPU
CPU states: 11.4% user, 0.0% nice, 1.8% system, 1.0% interrupt, 85.7% idle
Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 84 0 6863M 3825M kqueue/0 236:12 39.55% 39.55% unbound
load averages: 0.19, 0.35, 0.41; up 5+04:50:24 18:16:53
85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
CPU states: 11.3% user, 0.0% nice, 1.2% system, 0.0% interrupt, 87.4% idle
Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 85 0 9358M 5118M RUN/1 319:53 29.30% 29.30% unbound
You'll notice the difference is quite stark.
Not only is the CPU time much lower (OK, crypto costs, I guess), but
also the trajectory of the virtual size is vastly different:
5408M -> 6863M (1:24h later) -> 9358M (3:52h after 0th measurement)
compared to
398M -> 544M (51m later) -> 738M (2:58h after 0th measurement)
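Put as rough per-hour growth rates (times taken from the top
headers above, so ~3.87 h and ~2.97 h between first and last
sample respectively):

```shell
# Back-of-the-envelope growth rate of unbound's virtual size,
# from the "top" samples quoted above.
awk 'BEGIN {
    printf "with DoH/DoT:    ~%.0f MB/h\n", (9358 - 5408) / 3.87
    printf "without DoH/DoT: ~%.0f MB/h\n", (738 - 398) / 2.97
}'
```

i.e. roughly an order of magnitude apart.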
And according to "unbound-control stats" the query rate is
comparable to what it was yesterday.
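(For narrowing this down further: unbound accounts for its own
cache and module memory in the stats output, so comparing the sum
of the mem.* counters against the RSS shown by top should reveal
whether the growth is in memory unbound knows about. Something
like, against the running daemon:

```
unbound-control stats_noreset | grep '^mem\.'
```

If the mem.* total stays flat while RSS climbs, the growth is
outside the accounted caches.)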
So I suspect there is a serious memory leak, possibly in unbound
itself, related to the code that handles DoH and/or DoT.
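In case the growth turns out to be buffered stream/HTTP data rather
than a leak proper, unbound does have knobs that cap such memory; a
sketch of the relevant unbound.conf options (the values shown are,
as I read the documentation, the defaults -- only a starting point,
not a recommendation):

```
server:
    # Cap memory held for TCP/TLS answers waiting to be written out
    stream-wait-size: 4m
    # Cap buffered DoH query / response data
    http-query-buffer-size: 4m
    http-response-buffer-size: 4m
```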
Diff to our unbound.conf compared to yesterday attached.
Regards,
- Håvard
-------------- next part --------------
rcsdiff -u unbound.conf
===================================================================
RCS file: RCS/unbound.conf,v
retrieving revision 1.9
diff -u -r1.9 unbound.conf
--- unbound.conf 2025/03/03 16:25:44 1.9
+++ unbound.conf 2025/03/13 12:53:24
@@ -12,27 +12,27 @@
# 853 = DNS-over-TLS
# 443 = DNS-over-HTTPS
interface: 158.38.0.2
- interface: 158.38.0.2@443
- interface: 158.38.0.2@853
+# interface: 158.38.0.2@443
+# interface: 158.38.0.2@853
interface: 2001:700:0:ff00::2
- interface: 2001:700:0:ff00::2@443
- interface: 2001:700:0:ff00::2@853
+# interface: 2001:700:0:ff00::2@443
+# interface: 2001:700:0:ff00::2@853
interface: 158.38.0.169
- interface: 158.38.0.169@443
- interface: 158.38.0.169@853
+# interface: 158.38.0.169@443
+# interface: 158.38.0.169@853
interface: 2001:700:0:503::c253
- interface: 2001:700:0:503::c253@443
- interface: 2001:700:0:503::c253@853
+# interface: 2001:700:0:503::c253@443
+# interface: 2001:700:0:503::c253@853
interface: 127.0.0.1
interface: ::1
# TLS key and certificate
- tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
- tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
- tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
+# tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
+# tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
+# tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
# Enable DNS-over-HTTPS (doh):
- https-port: 443
+# https-port: 443
# These need tuning away from defaults;
# the defaults not suitable for TCP-heavy workloads:
@@ -988,9 +988,9 @@
# for-upstream: yes
# zonefile: "example.org.zone"
- auth-zone:
- name: resolver.arpa
- zonefile: "pz/resolver.arpa"
+# auth-zone:
+# name: resolver.arpa
+# zonefile: "pz/resolver.arpa"
# Views
# Create named views. Name must be unique. Map views to requests using
More information about the Unbound-users mailing list