[Unbound-users] Unbound stops resolving some hosts sporadically

Tue Mar 29 10:46:17 UTC 2011

Hi,
I'm running unbound 1.4.6-1 from Debian squeeze on a couple of machines as
a resolver and have come across a strange problem where unbound stops
resolving some hosts. What works and what doesn't seems sporadic, but
Nagios generally reports a timeout after 10s when trying to resolve
www.google.com. I sometimes notice it before Nagios does if I happen to be
using that resolver. I've tried sniffing traffic, but it's hard to pinpoint
individual queries as the servers are quite busy.

What I have noticed, though, is that sometimes I won't be able to resolve
yahoo.com, for example, but after a minute or two it works again, and then
it fails again a few minutes later. I don't believe there is a network
problem, as resolution fails for some domains whilst working for others
that are on the same name servers.

If I restart unbound it clears up the problem, but this is obviously less
than ideal. I'm tending to do this once a day now, sometimes more
frequently. I've moved one of the resolvers over to other software for now,
to avoid having both resolvers broken at the same time.

The server config on the one which is still being used looks like this
(minus the interface, outgoing-interface and access-control lines):

server:
     verbosity: 1
     statistics-interval: 86400
     num-threads: 2
     outgoing-range: 256
     msg-cache-size: 128m
     num-queries-per-thread: 1024
     rrset-cache-size: 256m
     do-ip6: no
     chroot: ""
     root-hints: /etc/unbound/named.cache
     hide-identity: yes
     hide-version: yes

On the other one I made some slight tweaks, including the socket receive
buffer, in case it was using up all its sockets, but they didn't make any
difference:

server:
     verbosity: 2
     statistics-interval: 86400
     num-threads: 2
     outgoing-range: 462
     so-rcvbuf: 4m
     msg-cache-size: 128m
     num-queries-per-thread: 1024
     rrset-cache-size: 256m
     do-ip6: no
     chroot: ""
     logfile: "/var/log/unbound.log"
     root-hints: /etc/unbound/named.cache
     hide-identity: yes
     hide-version: yes

# cat /proc/sys/net/core/rmem_max
4194304
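
As a sanity check on the buffer setting, here is a minimal Python sketch
comparing so-rcvbuf against the kernel limit shown above. The helper
parse_size is hypothetical (not part of unbound), and it assumes the k/m/g
suffixes in unbound.conf are powers of 1024.

```python
# Hypothetical helper: convert an unbound.conf-style size such as "4m"
# to bytes. Assumption: k/m/g suffixes are powers of 1024.
def parse_size(text):
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


so_rcvbuf = parse_size("4m")  # so-rcvbuf value from the config above
rmem_max = 4194304            # value read from /proc/sys/net/core/rmem_max

# The requested buffer only takes full effect if the kernel limit allows it.
buffer_fits = so_rcvbuf <= rmem_max
print(so_rcvbuf, buffer_fits)
```

Under that assumption the requested 4m is exactly the rmem_max value, so
the kernel limit should not be clipping the buffer.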

The logs didn't really show much, and they produce too much data to trawl
through easily.

This morning I changed to running a single thread, and when the problem
occurred just now I dumped the stats, which were as follows:

# unbound-control status
version: 1.4.6
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 3095 seconds
unbound (pid 24730) is running...

# unbound-control stats_noreset
thread0.num.queries=87145
thread0.num.cachehits=38156
thread0.num.cachemiss=48989
thread0.num.prefetch=0
thread0.num.recursivereplies=41036
thread0.requestlist.avg=327.143
thread0.requestlist.max=1091
thread0.requestlist.overwritten=3188
thread0.requestlist.exceeded=18
thread0.requestlist.current.all=1091
thread0.requestlist.current.user=1024
thread0.recursion.time.avg=27.134193
thread0.recursion.time.median=0.00895803
total.num.queries=87145
total.num.cachehits=38156
total.num.cachemiss=48989
total.num.prefetch=0
total.num.recursivereplies=41036
total.requestlist.avg=327.143
total.requestlist.max=1091
total.requestlist.overwritten=3188
total.requestlist.exceeded=18
total.requestlist.current.all=1091
total.requestlist.current.user=1024
total.recursion.time.avg=27.134193
total.recursion.time.median=0.00895803
time.now=1301394872.957300
time.up=3099.708199
time.elapsed=3099.708199
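
To make the numbers above easier to read, here is a minimal Python sketch
that derives a few ratios from the stats output. The function names and the
choice of ratios are illustrative only. It assumes that
requestlist.current.user counts client requests currently on the request
list and that the list holds num-queries-per-thread entries per thread,
which is a reading of the counter names rather than something confirmed
here.

```python
# Subset of the stats_noreset output above, copied verbatim.
STATS = """\
total.num.queries=87145
total.num.cachehits=38156
total.num.cachemiss=48989
total.num.recursivereplies=41036
total.requestlist.overwritten=3188
total.requestlist.exceeded=18
total.requestlist.current.user=1024
total.recursion.time.avg=27.134193
total.recursion.time.median=0.00895803
"""


def parse_stats(text):
    """Turn key=value lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            stats[key.strip()] = float(value)
    return stats


def summarize(stats, queries_per_thread, threads=1):
    """Derive a few ratios (hypothetical helper, not an unbound tool)."""
    capacity = queries_per_thread * threads  # assumed request list size
    return {
        "hit_ratio": stats["total.num.cachehits"] / stats["total.num.queries"],
        "list_full": stats["total.requestlist.current.user"] >= capacity,
        "squeezed_out": stats["total.requestlist.overwritten"]
        + stats["total.requestlist.exceeded"],
        "avg_to_median": stats["total.recursion.time.avg"]
        / stats["total.recursion.time.median"],
    }


# num-queries-per-thread is 1024 in the config; this run used one thread.
summary = summarize(parse_stats(STATS), queries_per_thread=1024, threads=1)
for name, value in summary.items():
    print(name, value)
```

With these figures the cache hit ratio is a little under 44%, the client
part of the request list sits at the configured 1024, and the average
recursion time is roughly 3000 times the median, so a small number of very
slow lookups seems to dominate the average.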

I've read the changelogs of newer versions but can't see anything that 
looks like this problem. I'd prefer to avoid upgrading to the latest 
source distribution on the off-chance it will fix it as that just seems 
like clutching at straws.

Any ideas?

Cheers,

john