Maintained by: NLnet Labs

[Unbound-users] Unbound stops resolving some hosts sporadically

W.C.A. Wijngaards
Tue Mar 29 13:38:23 CEST 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi John,

Your outgoing range = 256 and num queries per thread 1024.  Below in
your stats you show that you have enough traffic to fill the 1024 list
completely.  Having 1024 queries compete for only 256 sockets, makes
this problem.  They have to wait until a socket becomes free and
available before they can do work.

Recompile with libevent (if you have not already).

Then set outgoing-range to 1.5x the num-queries-per-thread (or 2x).

outgoing-range: 8192
num-queries-per-thread: 4096

I also increased the num-queries-per-thread because below stats show
that you need more.

Then there is enough capacity and queries do not have to wait for a socket.


On 03/29/2011 12:46 PM, john wrote:
>     outgoing-range: 256
>     msg-cache-size: 128m
>     num-queries-per-thread: 1024
>     rrset-cache-size: 256m

> The other I tweaked slightly with socket receive buffers in case it was
> using all sockets but it didn't make any difference:
> 
> server:
>     verbosity: 2
>     statistics-interval: 86400
>     num-threads: 2
>     outgoing-range: 462
>     so-rcvbuf: 4m
>     msg-cache-size: 128m
>     num-queries-per-thread: 1024
>     rrset-cache-size: 256m
>     do-ip6: no
>     chroot: ""
>     logfile: "/var/log/unbound.log"
>     root-hints: /etc/unbound/named.cache
>     hide-identity: yes
>     hide-version: yes
> 
> 
> # cat /proc/sys/net/core/rmem_max
> 4194304
> 
> Logs didn't really show much, and produce too much data to trawl through
> easily.
> 
> This morning I've changed to running one thread and when I receive a
> problem just now dumped stats which were as follows:
> 
> 
> # unbound-control status
> version: 1.4.6
> verbosity: 1
> threads: 1
> modules: 2 [ validator iterator ]
> uptime: 3095 seconds
> unbound (pid 24730) is running...
> 
> 
> # unbound-control stats_noreset
> thread0.num.queries=87145
> thread0.num.cachehits=38156
> thread0.num.cachemiss=48989
> thread0.num.prefetch=0
> thread0.num.recursivereplies=41036
> thread0.requestlist.avg=327.143
> thread0.requestlist.max=1091

The max hits your num-queries-per-thread (and exceeds it, because of
internal generated queries (for nameservers usually)).  Time to increase
the num-queries-per-thread.

> thread0.requestlist.overwritten=3188

And this is what happened as a result: queries are getting dropped.  It
tries to give priority to your www.google.com query, but you also
configure so little sockets, that those queries then had to wait as well.

> thread0.requestlist.exceeded=18

Life was bad, but not that bad, while it dropped 3188 queries on the
floor, those were 'old' queries that had waited for a while.  Here 18
were dropped because there was no way to make space (even considered
dropping other queries).  This number would go up if you were under very
high rate queries that come in 1000s per millisecond or so.

> thread0.requestlist.current.all=1091
> thread0.requestlist.current.user=1024

Here you see the requestlist is completely full, 1024 of 1024.  And 1091
including internal entries.  It would be good to have 1091 sockets in
this situation: outgoing-range 1091, but since we need to increase the
requestlist size anyway, that number needs to increase as well.

> thread0.recursion.time.avg=27.134193

Yes you had to wait 27 seconds on average.

> thread0.recursion.time.median=0.00895803

But, the median is very low, meaning the bulk was very fast anyway (once
a socket was available), but some queries took very long causing a high
average.

> total.num.queries=87145
> total.num.cachehits=38156
> total.num.cachemiss=48989

The cachehits are only 38/87 : 44% cache hits.  That is pretty low.
Maybe configure more memory for the cache.  This could also be because
you had just started and the cache has not warmed up yet.

> total.num.prefetch=0

No prefetch enabled.

> total.num.recursivereplies=41036
> total.requestlist.avg=327.143
> total.requestlist.max=1091
> total.requestlist.overwritten=3188
> total.requestlist.exceeded=18
> total.requestlist.current.all=1091
> total.requestlist.current.user=1024
> total.recursion.time.avg=27.134193
> total.recursion.time.median=0.00895803
> time.now=1301394872.957300
> time.up=3099.708199
> time.elapsed=3099.708199
> 
> 
> I've read the changelogs of newer versions but can't see anything that
> looks like this problem. I'd prefer to avoid upgrading to the latest
> source distribution on the off-chance it will fix it as that just seems
> like clutching at straws.
> 
> Any ideas?

Yes, you need more capacity, and the queries are waiting for resources.

Best regards,
   Wouter
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAk2RxK8ACgkQkDLqNwOhpPgMcwCbB+pKocllb+1UT6KwZ297mqen
qTYAnig7+WkuyRQn5mVIY/tXcsGjJ+dt
=Tn3R
-----END PGP SIGNATURE-----