Unbound memory resource consumption?

Yuri yvoinov at gmail.com
Thu Mar 13 17:09:50 UTC 2025


I suspect there may be a leak in the DoH library, not in unbound 
itself. I have had unbound running with long uptimes for years without 
any sign of a leak. However, I have seen a precedent for this: a leak 
in one of my setups that used a (third-party) DoH library.
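
A quick way to check which DoH library a particular unbound build is
linked against is the configure line in the version output, e.g.
(exact output varies from build to build):

   $ unbound -V
   Version 1.x.x
   Configure line: ... --with-libnghttp2 ...

If --with-libnghttp2 is there, DoH goes through nghttp2; that is the
third-party code I would check first.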

On 13.03.2025 21:50, Havard Eidnes via Unbound-users wrote:
> Hi,
>
> and thanks for the feedback, the general advice, and the pointer to
> jemalloc.  I may look into that a bit later.
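>
> (If I get to that, a minimal way to test would presumably be to
> preload jemalloc when starting unbound, something like
>
>    LD_PRELOAD=/usr/pkg/lib/libjemalloc.so /usr/pkg/sbin/unbound -c /usr/pkg/etc/unbound/unbound.conf
>
> where both paths are guesses here and would need adjusting to the
> local pkgsrc install.)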
>
>
> However, in the meantime I have come to the conclusion that there
> may be a correlation between me enabling DoH and DoT and using RFC
> 9462 to direct clients which probe for _dns.resolver.arpa to use the
> DoH and/or DoT endpoints on the one hand, and on the other hand what
> really does look like a massive memory leak in unbound.  If that is
> true, which malloc() you use should not make much of a difference.
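>
> (For context, the resolver.arpa zone in question is tiny: apart from
> SOA/NS boilerplate it is essentially a pair of SVCB records along the
> lines of
>
>    _dns.resolver.arpa. IN SVCB 1 doh.example.net. alpn=h2 port=443 dohpath=/dns-query{?dns}
>    _dns.resolver.arpa. IN SVCB 2 dot.example.net. alpn=dot port=853
>
> with our real resolver host name in place of the example.net
> placeholders; that is what the probing clients get back.)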
>
>
> To test this hypothesis, I turned off DoH and DoT (diff to config
> attached below, it was only turned on about last month), and also
> stopped serving resolver.arpa, and then restarted unbound.  Here are
> a few "top" displays taken over the span of a few hours.  First
> after this config change:
>
> load averages:  0.26,  0.20,  0.25;               up 6+00:57:31        14:24:00
> 79 processes: 76 sleeping, 1 stopped, 2 on CPU
> CPU states:  4.5% user,  0.0% nice,  2.2% system,  1.0% interrupt, 92.2% idle
> Memory: 2702M Act, 7948K Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
> Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1574K In, 16
>
>    PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
> 14982 unbound   43    0   398M  268M CPU/2       6:55 30.22% 30.22% unbound
>
>
> load averages:  0.13,  0.17,  0.21;               up 6+01:49:28        15:15:57
> 79 processes: 77 sleeping, 1 stopped, 1 on CPU
> CPU states:  2.8% user,  0.0% nice,  2.0% system,  0.6% interrupt, 94.5% idle
> Memory: 2847M Act, 11M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
> Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1234K In, 13
>
>    PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
> 14982 unbound   85    0   544M  417M kqueue/2   18:13 38.23% 38.23% unbound
>
>
> load averages:  0.22,  0.11,  0.10;               up 6+03:55:58        17:22:27
> 90 processes: 87 sleeping, 1 stopped, 2 on CPU
> CPU states:  1.2% user,  0.0% nice,  1.1% system,  0.2% interrupt, 97.3% idle
> Memory: 3040M Act, 18M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
> Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 648K In, 700
>
>    PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
> 14982 unbound   43    0   738M  604M CPU/2      38:45  3.61%  3.61% unbound
>
>
> If we compare this to what I experienced with these options turned
> on and a number of DoH / DoT clients using those endpoints, quoting
> from yesterday's e-mail:
>
> load averages:  0.86,  0.94,  0.92;               up 5+00:58:04        14:24:33
> 86 processes: 83 sleeping, 1 stopped, 2 on CPU
> CPU states: 14.8% user,  0.0% nice,  1.3% system,  0.8% interrupt, 83.0% idle
> Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
> Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out
>
>    PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
> 14678 unbound   40    0  5408M 3033M CPU/2     183:17 78.47% 78.47% unbound
>
>
> load averages:  0.52,  0.53,  0.52;               up 5+02:22:23        15:48:52
> 85 processes: 82 sleeping, 1 stopped, 2 on CPU
> CPU states: 11.4% user,  0.0% nice,  1.8% system,  1.0% interrupt, 85.7% idle
> Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
> Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19
>
>    PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
> 14678 unbound   84    0  6863M 3825M kqueue/0  236:12 39.55% 39.55% unbound
>
>
> load averages:  0.19,  0.35,  0.41;               up 5+04:50:24        18:16:53
> 85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
> CPU states: 11.3% user,  0.0% nice,  1.2% system,  0.0% interrupt, 87.4% idle
> Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
> Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G
>
>    PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
> 14678 unbound   85    0  9358M 5118M RUN/1     319:53 29.30% 29.30% unbound
>
>
> You'll notice the difference is quite stark.
>
> Not only is the CPU time much lower (OK, crypto costs, I guess), but
> also the trajectory of the virtual size is vastly different:
>
> 5408M -> 6863M (1:24h later) -> 9358M (3:52h after 0th measurement)
>
> compared to
>
> 398M -> 544M (51m later) -> 738M (2:58h after 0th measurement)
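>
> In other words, roughly (9358-5408)/3.87h =~ 1020 MB/hour of growth
> in the virtual size with DoH/DoT enabled, versus (738-398)/2.97h =~
> 115 MB/hour with it turned off.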
>
>
> And according to "unbound-control stats" the query rate is
> comparable to what it was yesterday.
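>
> (Specifically, I am comparing the query counters and the mem.* lines
> in the statistics output, along the lines of the following, actual
> numbers elided:
>
>    $ unbound-control stats_noreset | egrep '^(total\.num\.queries|mem\.)'
>    total.num.queries=...
>    mem.cache.rrset=...
>    mem.cache.message=...
>    ...
>
> using stats_noreset so the counters are not zeroed by the query.)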
>
>
> So I suspect there is a serious memory leak, possibly in unbound,
> related to the code which does DoH and/or DoT handling.
>
>
> The diff to our unbound.conf compared to yesterday is attached below.
>
>
> Regards,
>
> - Håvard
>
> rcsdiff -u unbound.conf
> ===================================================================
> RCS file: RCS/unbound.conf,v
> retrieving revision 1.9
> diff -u -r1.9 unbound.conf
> --- unbound.conf        2025/03/03 16:25:44     1.9
> +++ unbound.conf        2025/03/13 12:53:24
> @@ -12,27 +12,27 @@
>     # 853 = DNS-over-TLS
>     # 443 = DNS-over-HTTPS
>     interface: 158.38.0.2
> -  interface: 158.38.0.2 at 443
> -  interface: 158.38.0.2 at 853
> +#  interface: 158.38.0.2 at 443
> +#  interface: 158.38.0.2 at 853
>     interface: 2001:700:0:ff00::2
> -  interface: 2001:700:0:ff00::2 at 443
> -  interface: 2001:700:0:ff00::2 at 853
> +#  interface: 2001:700:0:ff00::2 at 443
> +#  interface: 2001:700:0:ff00::2 at 853
>     interface: 158.38.0.169
> -  interface: 158.38.0.169 at 443
> -  interface: 158.38.0.169 at 853
> +#  interface: 158.38.0.169 at 443
> +#  interface: 158.38.0.169 at 853
>     interface: 2001:700:0:503::c253
> -  interface: 2001:700:0:503::c253 at 443
> -  interface: 2001:700:0:503::c253 at 853
> +#  interface: 2001:700:0:503::c253 at 443
> +#  interface: 2001:700:0:503::c253 at 853
>     interface: 127.0.0.1
>     interface: ::1
>   
>     # TLS key and certificate
> -  tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
> -  tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
> -  tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
> +#  tls-service-key: "/usr/pkg/etc/unbound/dns-resolver2-key.pem"
> +#  tls-service-pem: "/usr/pkg/etc/unbound/dns-resolver2-cert.pem"
> +#  tls-cert-bundle: "/etc/openssl/certs/ca-certificates.crt"
>   
>     # Enable DNS-over-HTTPS (doh):
> -  https-port: 443
> +#  https-port: 443
>   
>     # These need tuning away from defaults;
>     # the defaults not suitable for TCP-heavy workloads:
> @@ -988,9 +988,9 @@
>   #      for-upstream: yes
>   #      zonefile: "example.org.zone"
>   
> -  auth-zone:
> -    name: resolver.arpa
> -    zonefile: "pz/resolver.arpa"
> +#  auth-zone:
> +#    name: resolver.arpa
> +#    zonefile: "pz/resolver.arpa"
>   
>   # Views
>   # Create named views. Name must be unique. Map views to requests using