[Unbound-users] Release 0.7.2

Wed Jan 30 13:56:58 UTC 2008

On Wed, 30 Jan 2008 11:48:19 +0100, "W.C.A. Wijngaards" <wouter at nlnetlabs.nl> said:

> | unbound also correctly dealt with the situation that it had a trust
> | anchor defined for example.net (from the set of trust anchors
> | distributed by RIPE NCC
> | <https://www.ripe.net/projects/disi/keys/ripe-ncc-dnssec-keys-new.txt>),
> | but the corresponding DNSKEY is missing from the zone:
> |
> | [1201660228] unbound[16620:0] info: failed to prime trust anchor --
> | could not fetch DNSKEY rrset <example.net. DNSKEY IN>

> Ah, yes, in such a situation you can wait for example.net to fix their
> zone, and after 900 seconds (the default bogus TTL) unbound will pick
> that up, or instead change your config and kill -HUP unbound (this also
> clears the cache).

> | Operational experience: I was able to integrate unbound into our
> | anycast caching system without problems.  This allows me to run BIND
> | and unbound in parallel on different anycast instances just as I had
> | planned to do.  All of this is looking very good.
> |

> Oh this is really nice. It would be interesting to know of any noticeable
> differences between BIND and unbound. Apart from version.bind CH TXT, of
> course.

One thing I've noticed is a slight difference in the way BIND and
unbound deal with dead servers at delegation points.  There was one
domain for which all name servers were unreachable.  Both caches
eventually marked these servers as bad and returned "SERVFAIL" quickly
(for as long as they cache this type of information), but I had the
impression that BIND was timing out on the queries for a longer period
of time than unbound before marking the servers as bad.

> Your log file. Thank you for sharing the statistics.
> * There are many TCP connect errors. I assume now, that someone
> configured a zone as example.com NS 10.0.0.100, with a nameserver on
> their local subnet. Unbound cannot contact that nameserver, and tries
> (finally) to use TCP on it; which gives this error.

Possibly.

> I think the log file should not be cluttered with zone administration
> mistakes by others. I can demote this particular error to a higher
> verbosity level (2), or I can print the address that failed and then you
> (the operator) or a script can pick up those and block them
> (do-not-query-address: 10.0.0.0/8 in the config file).
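
The script idea above could look roughly like the following. This is only a sketch: the list of failed addresses is hypothetical input (no particular log format is assumed), and the choice to block whole RFC 1918 ranges rather than single addresses is an assumption made for illustration.

```python
import ipaddress

# Hypothetical input: addresses the resolver could not reach, however
# they were collected. No particular log format is assumed here.
failed_addresses = ["10.0.0.100", "192.0.2.53", "10.1.2.3", "172.16.0.9"]

# Private (RFC 1918) ranges that a public zone should not delegate to.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def do_not_query_lines(addresses):
    """Return one config line per private range that a failed address hit."""
    blocks = set()
    for text in addresses:
        addr = ipaddress.ip_address(text)
        for net in RFC1918:
            if addr in net:
                blocks.add(net)
    return [f"do-not-query-address: {net}" for net in sorted(blocks)]


for line in do_not_query_lines(failed_addresses):
    print(line)
```

Addresses outside the private ranges are left alone on purpose, since a failure there is more likely a temporary outage than a zone administration mistake.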

> I think I'll demote the error message, as it does not bother the
> resolver operator. What do you think? Would you like to have the
> addresses printed to the log anyway?

Yes, that would be useful and yes, you can move the message itself to
a higher verbosity level.  Speaking of logging: I think I prefer the
modular approach to logging used by BIND.  If you only have a
verbosity level to tune, it can be hard to get to certain information
without having to collect vast amounts of unwanted information as
well.  Verbosity 2 is already so verbose that I don't want to use it
on a regular basis.

What about classifying all messages and letting the user set the verbosity
for each class (which is essentially what BIND does)?  I don't mind
much if there is only a single log file, as long as the messages are
tagged with the class so I can pick out what I need with simple tools.
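
To illustrate the idea, here is a minimal sketch using Python's standard logging module, not anything unbound or BIND provides. The class names "validator" and "network" and the "resolver." prefix are made up for the example. Each class gets its own level, and everything goes to a single stream with the class name on every line.

```python
import io
import logging


def make_class_loggers(stream, levels):
    """Create one logger per message class, all writing to one stream."""
    handler = logging.StreamHandler(stream)
    # The logger name (the message class) is printed on every line.
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    loggers = {}
    for cls, level in levels.items():
        logger = logging.getLogger(f"resolver.{cls}")
        logger.handlers.clear()
        logger.addHandler(handler)
        logger.setLevel(level)      # per-class verbosity
        logger.propagate = False    # avoid duplicate output via the root logger
        loggers[cls] = logger
    return loggers


out = io.StringIO()
log = make_class_loggers(
    out, {"validator": logging.DEBUG, "network": logging.WARNING}
)
log["validator"].debug("priming trust anchor")
log["network"].info("tcp connect failed")      # below WARNING, suppressed
log["network"].warning("server marked bad")
print(out.getvalue(), end="")
```

With the class on every line, something as simple as grep on the class name is enough to pull out one category from the single log file.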

In general, what I would find extremely useful (and what I've been
missing so much with BIND) is a catalog of log messages, possibly
including additional information about what events can cause a message
to be generated.  If you're going to do this, please add unique tags for
every message, so you can change the message text in later releases
but still identify the message itself.

I have this Perl script that analyzes my BIND logs and I have to
change it with every release of BIND because they often change the
message texts.
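
A rough sketch of what unique tags would buy a log analyzer, written in Python rather than Perl. Everything specific here is hypothetical: the tag format (VAL-0001, NET-0007), the catalog entries, and the sample lines are invented for illustration and are not emitted by any real resolver.

```python
import re
from collections import Counter

# Hypothetical line layout: "[timestamp] program[pid:thread] level: TAG text"
LINE = re.compile(r"^\[(\d+)\] \S+ (\w+): ([A-Z]+-\d+) (.*)$")

# Hypothetical catalog mapping each stable tag to a description.
CATALOG = {
    "VAL-0001": "trust anchor could not be primed",
    "NET-0007": "TCP connect to an upstream server failed",
}


def count_by_tag(lines):
    """Count messages by tag, ignoring the free-form message text."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[match.group(3)] += 1
    return counts


sample = [
    "[1201660228] resolver[16620:0] info: VAL-0001 failed to prime trust anchor",
    "[1201660300] resolver[16620:0] error: NET-0007 tcp connect: refused",
    "[1201660301] resolver[16620:0] error: NET-0007 connect over TCP failed",
]

for tag, count in sorted(count_by_tag(sample).items()):
    print(tag, count, CATALOG.get(tag, "unknown message"))
```

The two NET-0007 lines have different wording but are counted together, which is the point: the script keys on the tag, so a reworded message in a later release does not break it.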

> * You have 93% cache hits. That is with the default 4+4 MB cache (4 MB
> for rrset data, 4 MB for message data), so unbound caps memory usage at
> about 20-30 MB total for the process, for 10 million queries. This is
> impressive. You could try to increase cache size to improve cache hits,
> but it doesn't seem worth the effort.

I think a large percentage of the queries is for names in switch.ch,
which can explain the good hit ratio.

> * The requestlist (this is the to-do list of pending recursive queries)
> stays nice and small as well. If the computer were unable to bear the
> load, this number would shoot up as requests come in faster than they
> could be handled.

> * For the histogram, onlookers please note that the average reply time
> printed a) does not include the cache responses (which are better
> measured in qps than seconds per query) and b) is skewed because of
> really large upper numbers caused by unbound retrying very hard for a
> couple of records (remote server down). In a newer unbound the median is
> printed as well, a nicer way to average recursion speed.
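
The skew described above is easy to reproduce with a few numbers. The recursion times below are made up for illustration and are not taken from the log under discussion; two very slow lookups stand in for the "remote server down" cases.

```python
import statistics

# Illustrative recursion times in seconds (invented, not measured).
# The last two values stand in for lookups that retried for a long time.
times = [0.000032, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 30.0, 60.0]

mean = statistics.mean(times)
median = statistics.median(times)

print(f"mean   = {mean:.3f} s")
print(f"median = {median:.3f} s")
```

Two outliers out of ten samples pull the mean up to several seconds, while the median stays in the range of the typical lookups.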

Ah, ok.

> * There is a significant bump on the lower end of your histogram, at 32
> microsec. I assume this is because a lot of recursion requests are due
> to a CNAME, for example where a CNAME is used to load-balance with DNS.
> Consequently, I need to pay attention to CNAME processing when I do
> optimization, good to know.

Cool.

-- 
Alex