[Unbound-users] Requestlist filling ? automatic cleanup ?

Sun Mar 20 21:53:37 UTC 2011

Hi Wouter,

Excellent explanations and fast reply as usual, many thanks.

> (W)hat version are you using?  Recently the timeout code was changed to
> cope with this sort of situation (1.4.7):
> http://www.unbound.net/documentation/info_timeout.html
Oops, sorry. I forgot to mention it, but I am using the latest version: 1.4.8.

It's running on CentOS 5.5 (old 2.6.18 kernel, sadly). We built our own
packages, and they should be linked against libevent. I started a thread
a while ago asking for an explicit way to be sure that we have, and are
using, libevent, and IIRC you told me that I was using it :)

unbound-libs-1.4.8-2.el5
unbound-1.4.8-2.el5
ldns-1.6.8-1.el5
libevent-1.4.13-1

unbound-control status
version: 1.4.8
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 2773636 seconds
unbound (pid 3952) is running...

Version 1.4.8
linked libs: libevent 1.4.13-stable (it uses epoll), ldns 1.6.8, OpenSSL 
0.9.8e-fips-rhel5 01 Jul 2008
linked modules: validator iterator
configured for i386-redhat-linux-gnu on Wed Feb 16 10:26:27 EST 2011 
with options: '--build=i386-koji-linux-gnu' '--host=i386-koji-linux-gnu' 
'--target=i386-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' 
'--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' 
'--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' 
'--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--localstatedir=/var' 
'--sharedstatedir=/usr/com' '--mandir=/usr/share/man' 
'--infodir=/usr/share/info' '--with-ldns=' '--with-libevent' 
'--with-pthreads' '--with-ssl' '--disable-rpath' '--enable-debug' 
'--disable-static' '--with-conf-file=/etc/unbound/unbound.conf' 
'--with-pidfile=/var/run/unbound/unbound.pid' '--disable-gost' 
'--enable-sha2'
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs at nlnetlabs.nl

>> [..] jostle-timeout is triggered when the server is very busy. What defines
>> 'busy' ?
> The requestlist is full.
Ok. I think this should be clarified in the documentation. I can send
you a patch if that would save you time.

> Your requestlist is the default, so about 1000 and 300 does not fill it
> up.  I would recommend a recompile with libevent because of your
> somewhat high load (then you can increase the requestlist and range to
> several thousand, and in recent versions the default increases by
> itself, http://www.unbound.net/documentation/howto_optimise.html )
I have read this document many times since I started using unbound (and
I will read it again ;). But which parameter defines the requestlist
size, or otherwise influences it?

>>[..]. Could that impact unbound reactivity ?
> No, other queries take priority over these older queries.
Ok.

> The requestlist is divided into two halves: run-to-completion, and
> fast-stuff.  The run-to-completion is that.  The fast stuff deletes
> older queries to make room for new queries (but not unless the
> jostle-timeout has expired, otherwise you could delete everything that
> comes in immediately under a DoS).
Thanks for the explanation. Is this also written somewhere in the docs?
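
To check that I have understood the two halves correctly, here is a
minimal Python sketch of the behaviour as described above. It is a toy
model only, not Unbound's actual code: the class and attribute names are
made up for illustration, the size of 1024 just stands in for "about
1000", and the 0.2 second timeout is an assumed value for the default.

```python
from collections import deque


class RequestList:
    """Toy model of a request list split into two halves.

    Hypothetical illustration, not Unbound's implementation.
    """

    def __init__(self, size=1024, jostle_timeout=0.2):
        # Half of the slots run to completion, the other half can be jostled.
        self.run_to_completion = size // 2
        self.jostle_capacity = size - self.run_to_completion
        self.jostle_timeout = jostle_timeout  # seconds (assumed value)
        self.forever = []      # (start_time, name), never evicted here
        self.jostle = deque()  # (start_time, name), oldest entry first

    def add(self, name, now):
        """Return True if the query got a slot, False if it was dropped."""
        if len(self.forever) < self.run_to_completion:
            self.forever.append((now, name))
            return True
        if len(self.jostle) < self.jostle_capacity:
            self.jostle.append((now, name))
            return True
        # The list is full: only evict the oldest jostle entry once it
        # has been around longer than the jostle timeout.
        oldest_start, _ = self.jostle[0]
        if now - oldest_start > self.jostle_timeout:
            self.jostle.popleft()
            self.jostle.append((now, name))
            return True
        return False
```

With a full list, a new query is refused while the oldest jostle entry
is still younger than the timeout, and replaces it afterwards. That is
why a burst of new queries cannot immediately push everything out.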

>> Note: jostle-timeout is still set to the default (see my config below).
> Yes that should be OK.  If you lower it, it will be more likely to drop
> the groupinfra stuff.
Ok. I may have some questions about that, but I will first read the
documentation on jostle-timeout.

>> I am asking that because sometimes our unbounds have a random hiccup and
>> I am wondering if it could be due to this or not. The 'hiccup' is very
>> hard to debug because it's random (once a month or so) on servers doing
>> something like 500 to 1500 qps each so increasing the verbosity from 1
>> to 2 is not really possible :)
Ok, so I think I will have to write a script that increases the
verbosity when it seems that unbound can't resolve anymore. Hopefully I
will then be able to catch this nasty issue (it could be network related).
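
Something along these lines is what I have in mind: a rough Python
sketch, untested against a real server. It relies on
`unbound-control verbosity <n>` to change the log level at runtime. The
probe name, the 127.0.0.1 address, the failure threshold and all the
function names are assumptions of mine, and it needs a newer Python than
what CentOS 5 ships.

```python
import subprocess
import time


def dig_probe(name="example.com.", server="127.0.0.1"):
    """Return True if the resolver at `server` gives any answer for `name`.

    The probe name and server address are placeholders.
    """
    try:
        result = subprocess.run(
            ["dig", "+short", "+time=2", "+tries=1", "@" + server, name],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and result.stdout.strip() != ""


def set_verbosity(level):
    """Change the running server's log verbosity via unbound-control."""
    subprocess.run(["unbound-control", "verbosity", str(level)], check=False)


def watch(probe=dig_probe, set_level=set_verbosity, failures_needed=3,
          interval=5.0, max_checks=None, sleep=time.sleep):
    """Raise verbosity to 2 after repeated failures, back to 1 on recovery.

    Returns True if verbosity is still raised when the loop ends.
    """
    failures = 0
    raised = False
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
            if raised:
                set_level(1)  # resolution works again, reduce logging
                raised = False
        else:
            failures += 1
            if failures >= failures_needed and not raised:
                set_level(2)  # looks like a hiccup, log more detail
                raised = True
        sleep(interval)
    return raised
```

Calling `watch()` with the defaults would loop forever and probe every 5
seconds. Requiring a few consecutive failures avoids flipping the
verbosity on a single lost packet.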

> What seems to happen is groupinfra has a lot of servers.  And they
> sometimes experience outages. When they experience an outage, unbound
> gets timeouts and tries to fetch the names, but also the other
> nameserver names (and there are a lot of them).  Given user demand for
> groupinfra, unbound starts to explore all the nameservers for
> groupinfra, with timeouts and thus the entries fill up your requestlist.
>   The dependency structure is like that log excerpt that you show.
> Because the thing has timeouts those entries are necessarily pretty old,
> and thus (the ones in the fast-stuff list) would be dropped to make room
> for new queries (if there was a lack of space, but there is no lack of
> space, so these queries are performed: there is interest and there is
> capacity to undertake actions to find the answers).
Yep, ok, I understand, but it is still weird to see unbound trying to
resolve something almost forever, for instance for 143000 seconds, i.e.
more than 39 hours :) But we have the resources, so maybe one day it
will work (I reckon this domain just never works ;).

252 AAAA IN uk-dc007.groupinfra.com. 142994.571268 iterator wants AAAA 
IN au-dc012.groupinfra.com. AAAA IN br-dc003.groupinfra.com. AAAA IN 
de-dc008.groupinfra.com. AAAA IN my-dc003.groupinfra.com. AAAA IN 
nl-dc006.groupinfra.com. AAAA IN ph-dc001.groupinfra.com.
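
To spot such long-lived entries automatically, a line like the one above
could be parsed roughly as follows. This is a sketch that assumes the
field layout seen in this one excerpt (index, type, class, name, age in
seconds, module, state, then type/class/name triples). The function name
and dictionary keys are made up, and other entry states may look
different.

```python
def parse_requestlist_line(line):
    """Parse one requestlist entry shaped like the excerpt above.

    Assumed layout: index, type, class, name, age in seconds, module,
    state, then zero or more (type, class, name) triples.
    """
    fields = line.split()
    age = float(fields[4])
    rest = fields[7:]
    # Group the remaining tokens into (type, class, name) triples.
    wants = [tuple(rest[i:i + 3]) for i in range(0, len(rest) - 2, 3)]
    return {
        "index": int(fields[0]),
        "qtype": fields[1],
        "qclass": fields[2],
        "qname": fields[3],
        "age_seconds": age,
        "age_hours": age / 3600.0,
        "module": fields[5] if len(fields) > 5 else "",
        "state": fields[6] if len(fields) > 6 else "",
        "wants": wants,
    }
```

Entries whose `age_hours` exceeds some threshold could then be logged
together with the names they are waiting for.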

-Thomas