<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" id="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=10.0,initial-scale=1.0" />
<style>
html { -webkit-text-size-adjust: 100%; -ms-text-size-adjust: 100%; } h1 { font-size: 1.3em; line-height: 1.2; margin: 0; } ul, ol { margin: 0; padding: 0; } ul li, ol li, li li { margin: 0 0 0 36px; } [dir=rtl] li { margin: 0 18px 0 0; } blockquote { border-color: #dfdee1; /* --color--border */ border-style: solid; border-width: 0 0 0 1px; margin: 0; padding: 0 0 0 1em; } [dir=rtl] blockquote, blockquote[dir=rtl] { border-width: 0 1px 0 0; padding: 0 1em 0 0; } pre { font-family: "SFMono-Regular", Consolas, "Liberation Mono", Menlo, Courier, monospace; /* --font-family--mono */ font-size: 0.9em; margin: 0; padding: 1rem; background-color: #f3f1ef; /* --color-bg--surface */ white-space: pre-wrap; word-wrap: break-word; overflow: visible; } .message-content { font-family: -apple-system, BlinkMacSystemFont, Aptos, Roboto, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; /* --font-family */ line-height: 1.4; } .attachment { display: inline-block; margin: 0; padding: 0; } .attachment__caption { padding: 0; text-align: center; } .attachment__caption a[href] { text-decoration: none; color: #333333; } .attachment--preview { width: 100%; text-align: center; margin: 0.625em 0; } .attachment--preview img { border: 1px solid #dfdee1; /* --color--border */ vertical-align: middle; width: auto; max-width: 100%; max-height: 640px; } .attachment--preview .attachment__caption { color: #716d7b; /* --color-txt--subtle */ font-size: 0.85em; margin-top: 0.625em; } .attachment--file { color: #282138; /* --color-txt */ line-height: 1; margin: 0 2px 2px 0; padding: 0.4em 1em; border: 1px solid #dfdee1; /* --color--border */ border-radius: 5px; } .permalink { color: inherit; } .txt--subtle { color: #716d7b; /* --color-txt--subtle */ } .txt--xx-small { font-size: 14px; } .flush { margin: 0; padding: 0; } .push--bottom { margin-bottom: 8px; } .border--top { border-top: 1px solid #ECE9E6; /* --color-border--solid */ } .btn { padding: 0.2em 
0.4em; font-weight: 500; text-decoration: none; border-radius: 3rem; white-space: nowrap; background: #5522FA; /* --color-tertiary */ border-color: #5522FA; color: #ffffff; } .btn--email { display: inline-block; text-align: center; font-weight: 500; font-size: 1em; text-decoration: none; border-radius: 2em; white-space: nowrap; background: #5522FA; /* --color-tertiary */ border-color: #5522FA; color: #ffffff; border-top: 0.3em solid #5522FA; border-left: 1em solid #5522FA; border-bottom: 0.3em solid #5522FA; border-right: 1em solid #5522FA; } .shaded { padding: 1em; border-radius: 4px; background-color: #f6f5f3; /* --color-bg--surface */ border: 1px solid #dfdee1; /* --color--border */ } .shaded--blue { background-color: rgba(80, 162, 255, 0.2); /* --rgb-blue 0.2 */ } .shaded--red { background-color: rgba(255, 120, 120, 0.2); /* --rgb-red 0.2 */ } .strikethrough { text-decoration: line-through; }
</style>
</head>
<body>
<div class="message-content">
<div class="trix-content">
<div>Hi Yorgos,<br><br>Apologies for not CCing the list on my last response (quoted at the end).<br><br>We have some further details about what's happening. In our environment, the heaviest users of DNS tend to be rspamd (since we run a public mail service at hey.com) and similar automated processes.<br><br>At the top of the hour (when we ingest the most email), the rspamd workload to look up RBLs is typically at its highest. On a per-node basis, we generally go up to about 250 queries/s.<br><br>When any RBL nameserver starts to have issues, we seem to start queueing a lot of queries into the requestlist (up to about 4k at a time per instance, going by the figures in my previous response).<br><br>The interesting thing is, Unbound seems quite conservative in serving SERVFAIL in these cases and appears to try for exactly 3 minutes before giving up and expunging these entries. We figured that out by looking at the requestlist count and exceeded metrics via unbound_exporter (Prometheus) and by logging SERVFAIL responses.<br><br>The delta is reliably 3 minutes from the start of the incident until it gives up. Here are some pool-level graphs showing this in detail for our two datacenters: <a href="https://cln.sh/kfzqNMZq">https://cln.sh/kfzqNMZq</a><br><br>So our first question is whether there are any knobs we can use to bring this time down significantly. We would be happy to give up after 15 seconds (or even less!) to prioritize stability elsewhere.<br><br>Generally, we're looking to safeguard Unbound from dropping unrelated queries when external nameservers handling large volumes of queries have issues.<br><br>Thank you all in advance!<br><br></div><blockquote>Thank you for all your suggestions. We ended up migrating to stub zones and enabling the cache, but we are still mostly seeing the same behavior.<br><br>Looking further into monitoring, we see the requestlist balloon to around 3900 entries when these incidents happen. 
Could that be the num-queries-per-thread limitation?<br><br>There is pretty much no CPU or memory usage ballooning during incident times, and thus no swapping.<br><br>The auth query response time is &lt; 1 ms; they live in adjacent racks. For now, we've segmented our heaviest DNS queriers into a dedicated pool of Unbound nodes so local resolution on the normal cluster isn't affected.<br><br>Thanks again!</blockquote><div><br></div><div>On December 6, 2024, "Paul S. via Unbound-users" &lt;unbound-users@lists.nlnetlabs.nl&gt; wrote:</div><blockquote>Hi Paul,<br><br>Coming back to this, I notice that you have configured<br> num-queries-per-thread: 4096<br>and you say the highest you see the request list reach is 2K.<br>So dropping would not be an issue.<br><br>Maybe your CPU can't handle all those recursion states and lowering that <br>number would also help?<br>Are you perhaps reaching memory limits, so that the system starts swapping?<br><br>Could you also provide some more information?<br>What is the query response time normally to your auth cluster?<br><br>The serve-expired options could be useful in upstream failure situations <br>but make sure to understand what the options are doing because you will <br>be serving expired answers.<br><br>Best regards,<br>-- Yorgos<br><br>On 22/11/2024 20:43, Yorgos Thessalonikefs via Unbound-users wrote:<br>> Hi Paul,<br>> <br>> If you are "forwarding" to authoritative nameservers indeed using stub-zone <br>> is the correct configuration as it expects to send queries to <br>> authoritative nameservers and not resolvers. 
That won't help with this <br>> issue though.<br>> <br>> In a situation where Unbound is overwhelmed with client queries, using <br>> 'forward-no-cache: yes' (or 'stub-no-cache: yes' if you use the stub-zone) <br>> does not help since all those queries need to be resolved.<br>> <br>> When under client pressure, Unbound would start dropping slow queries.<br>> Slow queries are ones that take longer than 'jostle-timeout'<br>> (<a href="https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-jostle-timeout">https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-jostle-timeout</a>) <br>> to resolve.<br>> <br>> This way Unbound tries to combat DoS from slow queries or high query <br>> rates by trying to slowly fill up the cache from fast queries that <br>> eventually will drop the outgoing query rate and increase cache <br>> responses. (Global cache responses do not contribute to the increase of <br>> the request list).<br>> <br>> In your case where you don't cache the upstream information, Unbound <br>> cannot protect itself with cached answers because all the internal <br>> upstream queries need to be resolved.<br>> <br>> I am guessing the queries to the configured upstreams are not slower <br>> than jostle-timeout, so not candidates to be dropped initially, but it <br>> doesn't help that each one of them needs to always be resolved.<br>> <br>> I would first try to use 'stub-no-cache: no' and see if the situation <br>> gets better.<br>> <br>> It would be possible to introduce a new configuration option per <br>> forward/stub zone to give some kind of priority but unsure if it would <br>> generally help or in this case in particular.<br>> <br>> Best regards,<br>> -- Yorgos<br>> <br>> <br>> On 19/11/2024 03:01, Paul S. via Unbound-users wrote:<br>>> Hey team,<br>>><br>>> We run 8-node Unbound clusters as recursive resolvers. 
The setup <br>>> forwards (using forward-zone) internal queries to a separate PowerDNS <br>>> authoritative cluster.<br>>><br>>> Recently, we've had some connectivity issues to Cloudflare (who <br>>> provides a lot of external DNS services in our environment). When this <br>>> has happened, we've seen the requestlist balloon to around 1.5-2k <br>>> entries as queries repeatedly time out.<br>>><br>>> However, the problem is that this affects forward-zones as well. We <br>>> lose resolution for internal queries when these backup events happen.<br>>><br>>> We're looking for suggestions on how to safeguard these internal <br>>> forwards. We notice stub-zone may be the more appropriate stanza for <br>>> our use case, but are unsure if that'd bypass this requestlist queuing <br>>> (?)<br>>><br>>> Any thoughts greatly welcome, thank you!<br>>><br>>> Our config is fairly simple:<br>>><br>>> server:<br>>> num-threads: 4<br>>> # Best performance is a "power of 2 close to the num-threads value"<br>>> msg-cache-slabs: 4<br>>> rrset-cache-slabs: 4<br>>> infra-cache-slabs: 4<br>>> key-cache-slabs: 4<br>>><br>>> # Use 1.125GB of a 4GB node to start, but real usage may be 2.5x <br>>> this so<br>>> # closer to 2.8G/4GB (~70%)<br>>> #<br>>> msg-cache-size: 384m<br>>> # Should be 2x the msg cache<br>>> rrset-cache-size: 768m<br>>><br>>> # We have libevent! Use lots of ports.<br>>> outgoing-range: 8192<br>>> num-queries-per-thread: 4096<br>>><br>>> # Use larger socket buffers for busy servers.<br>>> so-rcvbuf: 8m<br>>> so-sndbuf: 8m<br>>><br>>> # Turn on port reuse<br>>> so-reuseport: yes<br>>><br>>> # This is needed to forward queries for private PTR records to <br>>> upstream DNS servers<br>>> unblock-lan-zones: yes<br>>><br>>> forward-zone:<br>>> name: "int.domain.tld"<br>>> forward-addr: "10.10.5.5"<br>>> # No caching in unbound<br>>> forward-no-cache: "yes"<br>>><br>></blockquote>
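<div><br>For anyone wanting to reproduce the 3-minute measurement described at the top of this message, here is a minimal sketch of how the delta could be computed from requestlist samples. It assumes the samples have already been scraped into (timestamp, depth) pairs; the function name give_up_delay, the baseline of 100, and the sample values are all hypothetical and are not part of unbound_exporter or Unbound.</div>
<pre>
```python
from typing import Optional


def give_up_delay(samples: list[tuple[float, int]], baseline: int) -> Optional[float]:
    """Seconds from the first sample above `baseline` to the first later
    sample back at or below it. Returns None if the requestlist never
    rises above the baseline or never drains again."""
    start = None
    for ts, depth in samples:
        if start is None:
            # Still waiting for the incident to begin.
            if depth > baseline:
                start = ts
        elif baseline >= depth:
            # Requestlist has drained: the queued entries were expunged.
            return ts - start
    return None


# Hypothetical (timestamp_seconds, requestlist_depth) samples, 30 s apart.
samples = [
    (0, 40), (30, 3900), (60, 3950), (90, 3900),
    (120, 3920), (150, 3900), (180, 3910), (210, 35),
]

print(give_up_delay(samples, baseline=100))  # prints 180
```
</pre>
<div>The resolution of the result is bounded by the scrape interval, so a 30-second interval can only place the delta to within 30 seconds.</div>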
</div>
</div>
</body>
</html>