[Unbound-users] Unbound periodically stops responding
Will Roberts
ironwill42 at gmail.com
Wed Apr 6 14:23:13 UTC 2011
On Wed, Apr 6, 2011 at 2:06 AM, W.C.A. Wijngaards <wouter at nlnetlabs.nl> wrote:
>
> > When this issue happens, I can't communicate with unbound via
> > unbound-control and it will never resolve anything. I can cleanly shut
> > it down and start a new instance and it will behave exactly the same.
> > The only solution I've found is to restart the VPS. I have another VPS
> > from the same provider which is setup almost identically and it has
> > never had this issue.
>
> So, it is somehow unique to that machine. Can you see in 'top' what
> unbound is doing? (is it using cpu, 100% in a busy loop?, it is not
> responding to unbound-control, so it must be completely hosed somehow)
>
Sorry, I meant to include that in my original email. It does not appear to
be in a busy loop; top shows 0% CPU usage for unbound.
> netstat -su may be interesting (packet counters for UDP).
>
Okay, I'll remember to take a look and see whether the packets are sitting
unread.
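
A rough sketch of one way to check for unread packets besides netstat -su,
assuming a Linux kernel that exposes /proc/net/udp with the usual column
layout (hex local address and port in the second column, tx_queue:rx_queue
in the fifth, a trailing drops column). The function name udp_queues is
made up for illustration; it is not part of unbound or any existing tool.

```python
def udp_queues(proc_text, port=53):
    """Return (local_addr_hex, rx_queue_bytes, drops) for UDP sockets on `port`.

    `proc_text` is the full text of /proc/net/udp. The column positions are
    an assumption about that file's layout, not something unbound defines.
    """
    rows = []
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        addr_hex, port_hex = fields[1].rsplit(":", 1)
        if int(port_hex, 16) != port:
            continue
        _tx_hex, rx_hex = fields[4].split(":")
        # Older kernels may lack the trailing drops column.
        drops = int(fields[12]) if len(fields) > 12 else None
        rows.append((addr_hex, int(rx_hex, 16), drops))
    return rows
```

Feeding it the contents of /proc/net/udp while unbound is hung would show
whether the port 53 socket has a growing receive queue. A non-zero rx_queue
that never shrinks would suggest packets are arriving but not being read.
On amd64 the address 0100007F corresponds to 127.0.0.1.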
>
> Another thing you can do is use 'gcore' to make a coredump of the
> 'failed' unbound process. (and then kill it and start a new unbound for
> your production). Then you can use 'gdb' and your compiled unbound
> executable to read the core image and produce a stack backtrace what it
> is doing.
>
I'm not familiar with "gcore". Can I just configure ulimit to allow core
dumps and then send the ABRT signal? I'll make sure I install the debug
libraries so I get something useful there. The weird thing is that
restarting unbound won't fix it. I really have to restart the machine (so
it's likely something else is really broken).
> Well it should respond to the unbound-control utility. If it does not
> this means it is somehow no longer processing the main loop, or that
> network traffic does not reach it.
>
Interesting, all the requests should be done over localhost. My resolv.conf
only contains the line "nameserver 127.0.0.1" and doing "dig @localhost
foo.com" also fails. I can check the routing table and do the obvious pings
and see if those at least work.
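
A minimal sketch of the same check as the dig test, done by hand over UDP
so it can be run in a loop or from a monitoring script. It assumes the
resolver listens on 127.0.0.1 port 53; build_query and probe are
hypothetical helper names, and the query is a plain A-record lookup with
recursion desired.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query packet for an A record (recursion desired)."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server="127.0.0.1", name="foo.com", port=53, timeout=3.0):
    """Return True if a reply with the matching query ID arrives in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and send/receive errors
            return False
    return reply[:2] == query[:2]
```

Calling probe() returns False when nothing answers within the timeout,
which is the symptom described above. It only checks that some reply came
back, not that the answer is correct.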
I did run strace last time this happened, but I wasn't really sure what to
look for; I was really just checking that it was doing something and not
just hanging. Next time I'll capture the output and try to take a better
look. If it matters, this is on an amd64 Debian GNU/Linux Squeeze (6.0)
system.
Thanks for the tips,
--Will