Spikes on recursive resolution average time in Unbound

Mon Dec 7 12:07:19 UTC 2020

> I use Unbound version 1.12 at OPNSense in resolve mode. At the
> bottom you can see my unbound.conf
>
> I have a grafana/influxdb collecting and showing Unbound statistics.
> 
> I live in Brazil so I know that the root DNS servers should be
> far from here.

Really?  I believe there exist RIPE Atlas measurements which
indicate that there are several root DNS name server instances
present in Brazil.

Please note that while the root name servers correspond to only
13 IPv4 addresses (and, I believe, also 13 IPv6 addresses), the
DNS root name service is actually heavily anycasted, with a
multitude of geographically distributed instances per letter /
IP address.

> I can see the recursive average time is high at the start of
> the service and it goes down as days go by. It goes down to
> close to 130ms. Suddenly, there is a high latency and it goes
> to 260ms or more on average.
>
> The latency for 8.8.8.8 that I use as a test ip on the gateway
> setup is pretty much stable at 6ms with 0 loss. Which makes me
> understand that it's not a WAN/Net problem.

I suspect what you are comparing is the difference between a
"cold cache" and a "hot cache".

With a "cold cache", to resolve a given query, your recursor will
in all probability have to do queries to multiple far-away name
servers to resolve the original query.  This will show as a
higher time to resolve the query.

On the other hand, if the answer is (still) present in your
cache, it can be served directly without incurring the cost of
the cache-filling queries to multiple far-away publishing name
servers.
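
As a rough illustration of the cold/hot difference, here is a toy
model in Python.  It is only a sketch: the per-hop round-trip times
are made-up numbers, TTL expiry is ignored, and the names `RTT_MS`
and `resolve` are mine, not anything from Unbound.

```python
# Hypothetical per-hop round-trip times in milliseconds (illustrative only).
RTT_MS = {"root": 30.0, "tld": 120.0, "auth": 150.0}


def resolve(name, cache):
    """Return the simulated time in ms spent resolving `name`.

    `cache` is a set holding final answers and the delegations
    learned along the way.  TTL expiry is deliberately ignored.
    """
    if ("answer", name) in cache:
        return 0.0  # hot cache: no upstream queries at all

    labels = name.rstrip(".").split(".")
    tld = labels[-1]
    zone = ".".join(labels[-2:])

    cost = 0.0
    if ("ns", tld) not in cache:
        cost += RTT_MS["root"]  # ask a root server for the TLD delegation
        cache.add(("ns", tld))
    if ("ns", zone) not in cache:
        cost += RTT_MS["tld"]   # ask a TLD server for the zone delegation
        cache.add(("ns", zone))
    cost += RTT_MS["auth"]      # ask the zone's own name server
    cache.add(("answer", name))
    return cost


cache = set()
cold = resolve("www.example.com", cache)      # nothing cached yet
sibling = resolve("mail.example.com", cache)  # delegations already cached
hot = resolve("www.example.com", cache)       # answer already cached
print(cold, sibling, hot)
```

With these made-up numbers the first lookup pays for all three
hops, the second one only for the last hop, and the repeated one
costs nothing.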

> What can cause this behavior in Unbound? Is it avoidable? How?
> Is there a way to see which query caused the spike in the time
> response? I had times of 9 seconds during last week, without a
> perceived outage in the WAN.

I agree, 9 seconds is excessive, but without any further data it
is difficult to tell what is causing that.  Sometimes domain
owners are careless with keeping all their publishing name
servers in working order (causing recursors to have to retry the
query to another publishing name server), and/or they are sloppy
with keeping their delegation records with the parent domain up
to date, and sometimes these sorts of problems can cause
prolonged lookup times for recursors.
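
To show how retries alone can add up to several seconds, here is
a small Python sketch.  The timeout and round-trip values are
assumptions chosen for illustration, and `lookup_time_ms` is a
hypothetical helper; real recursors, Unbound included, use more
elaborate server selection and timeout logic than this.

```python
def lookup_time_ms(servers, timeout_ms):
    """Simulate trying a zone's name servers one after another.

    `servers` lists, in the order tried, each server's round-trip
    time in ms, or None for a server that never answers.
    Returns (elapsed_ms, answered).
    """
    elapsed = 0.0
    for rtt in servers:
        if rtt is None or rtt > timeout_ms:
            elapsed += timeout_ms  # wait out the timeout, then move on
            continue
        return elapsed + rtt, True
    return elapsed, False          # every server failed


# Hypothetical zone: two dead name servers, one healthy one at 150 ms.
healthy = lookup_time_ms([150.0], timeout_ms=3000.0)
degraded = lookup_time_ms([None, None, 150.0], timeout_ms=3000.0)
print(healthy, degraded)
```

In this toy case the same zone goes from 150 ms to over 6 seconds
just because the recursor happened to try the two dead servers
first, with no problem at all on the local WAN link.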

Regards,

- Håvard