Maintained by: NLnet Labs

[Unbound-users] Unbound periodically stops responding

Will Roberts
Wed Apr 6 16:23:13 CEST 2011


On Wed, Apr 6, 2011 at 2:06 AM, W.C.A. Wijngaards <wouter at nlnetlabs.nl> wrote:

>
> > When this issue happens, I can't communicate with unbound via
> > unbound-control and it will never resolve anything. I can cleanly shut
> > it down and start a new instance and it will behave exactly the same.
> > The only solution I've found is to restart the VPS. I have another VPS
> > from the same provider which is setup almost identically and it has
> > never had this issue.
>
> So, it is somehow unique to that machine.  Can you see in 'top' what
> unbound is doing?  (is it using cpu, 100% in a busy loop?, it is not
> responding to unbound-control, so it must be completely hosed somehow)
>

Sorry, I meant to include that in my original email. It does not appear to be
in a busy loop; top shows 0% CPU usage for unbound.


> netstat -su may be interesting (packet counters for UDP).
>

Okay, I'll remember to take a look and see whether the packets are sitting
unread.
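
A rough sketch of that check, assuming a Linux system that exposes the UDP
counters in /proc/net/snmp (the Udp: header line followed by a Udp: value
line). The function names are made up for illustration and are not part of
unbound or netstat.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    # The file holds pairs of lines per protocol: one with the counter
    # names and one with the values. "UdpLite:" lines do not match "Udp:".
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp section found")
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (path is the usual Linux location)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


# Usage on the affected machine, run twice a few seconds apart:
#   counters = read_udp_counters()
#   print(counters.get("InErrors"), counters.get("RcvbufErrors"))
```

If InErrors or RcvbufErrors keep climbing between two readings while unbound
sits at 0% CPU, that would suggest packets are arriving but not being read.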


>
> Another thing you can do is use 'gcore' to make a coredump of the
> 'failed' unbound process.  (and then kill it and start a new unbound for
> your production).  Then you can use 'gdb' and your compiled unbound
> executable to read the core image and produce a stack backtrace what it
> is doing.
>

I'm not familiar with "gcore". Can I just configure ulimit to allow core
dumps and then send the ABRT signal? I'll make sure I install the debug
libraries so I get something useful there. The weird thing is that
restarting unbound won't fix it; I really have to restart the machine (so
it's likely something else is really broken).
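
For reference, the gcore route suggested above could be scripted roughly as
follows. This is only a sketch: it assumes gcore and gdb are installed and on
the PATH, and the output prefix /tmp/unbound-core and the helper names are
made up for illustration. gcore writes its image to "<prefix>.<pid>" and
leaves the process running afterwards.

```python
import subprocess


def gcore_command(pid, prefix="/tmp/unbound-core"):
    """Command line that dumps a core image of a running process."""
    return ["gcore", "-o", prefix, str(pid)]


def core_path(pid, prefix="/tmp/unbound-core"):
    """File that gcore creates for the given prefix and pid."""
    return "{}.{}".format(prefix, pid)


def backtrace_command(executable, core):
    """Command line that prints a backtrace of every thread and exits."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", executable, core]


def dump_and_backtrace(pid, executable):
    """Dump the hung process, then print its stack traces."""
    subprocess.run(gcore_command(pid), check=True)
    subprocess.run(backtrace_command(executable, core_path(pid)), check=True)


# Usage (pid and path are placeholders for the hung instance):
#   dump_and_backtrace(1234, "/usr/sbin/unbound")
```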

> Well it should respond to the unbound-control utility.  If it does not
> this means it is somehow no longer processing the main loop, or that
> network traffic does not reach it.
>

Interesting, all the requests should be done over localhost. My resolv.conf
only contains the line "nameserver 127.0.0.1" and doing "dig @localhost
foo.com" also fails. I can check the routing table and do the obvious pings
and see if those at least work.
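
One way to tell "nothing is listening" apart from "listening but not
answering" is a small UDP probe against 127.0.0.1. The sketch below builds a
plain A-record query by hand (RFC 1035 wire format); the function names, the
query ID and the 3-second timeout are arbitrary choices for illustration.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record in class IN."""
    # Header: ID, flags (only RD set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server="127.0.0.1", port=53, name="foo.com", timeout=3.0):
    """Send one query; return 'answered', 'timeout' or 'refused'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((server, port))  # connected UDP surfaces ICMP errors
        try:
            sock.send(build_query(name))
            sock.recv(4096)
            return "answered"
        except socket.timeout:
            return "timeout"
        except ConnectionRefusedError:
            return "refused"


# Usage:
#   print(probe())
```

A "refused" result would mean no process is bound to the port, while
"timeout" would mean the query was sent but no reply came back, which fits
either a stuck main loop or traffic not reaching the socket.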

I did run strace last time this happened, but I wasn't really sure what to
look for; I was really just checking that it was doing something and not
just hanging. Next time I'll capture the output and try and take a better
look. If it matters, this is on an amd64 Debian GNU/Linux Squeeze (6.0)
system.

Thanks for the tips,
--Will