[Unbound-users] Unbound 1.4.1 cache problems

Fri Jan 8 21:36:09 UTC 2010

Hello there,

I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
servers. The servers are doing about 3-7k requests per second.
I have noticed that after some time of operation, servers
intermittently stop responding to queries to certain domains from
cache (return SERVFAIL). It particularly doesn't like webex.com,
sometimes vmware.com, dell.com.

# unbound-host webex.com
webex.com has address 64.68.120.99
webex.com mail is handled by 50 mx3.webex.com.
webex.com mail is handled by 10 mx1.webex2.iphmx.com.
webex.com mail is handled by 10 mx2.webex2.iphmx.com.
webex.com mail is handled by 50 mx2.webex.com.

But the A record doesnt seem to get cached:

# dig webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com.			IN	A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan  8 19:16:01 2010
;; MSG SIZE  rcvd: 27

# unbound-control dump_cache | grep ^webex.com
webex.com.	4589	IN	NS	ns01.webex.com.
webex.com.	4589	IN	NS	ns05.webex.com.
webex.com.	4589	IN	NS	ns07.webex.com.
webex.com.	4589	IN	NS	ns03.webex.com.
webex.com.	4589	IN	NS	ns08.webex.com.
webex.com.	4589	IN	NS	ns06.webex.com.

It doesnt respond to NS queries either (No A records for
nsxx.webex.com in cache)

# dig ns webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com.			IN	NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan  8 20:53:18 2010
;; MSG SIZE  rcvd: 27

After flushing webex.com I can query the cache again:

# unbound-control flush webex.com
ok
# dig +short webex.com @localhost
64.68.120.99

Log with increased verbosity:

an  8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
query <webex.com. A IN>
Jan  8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
Jan  8 21:05:32 cns01j unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns08.webex.com.  A
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns05.webex.com.  A
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns01.webex.com.  A
Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns06.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns08.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns05.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns01.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns06.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
Jan  8 21:05:32 cns unbound: [25052:3] info: validator operate: query
<webex.com. A IN>
Jan  8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
replies sent, 0 replies dropped, 0 states jostled out
Jan  8 21:05:32 cns unbound: [25052:3] info: average recursion
processing time 4.172319 sec
Jan  8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
processing times
Jan  8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
median[50%]=0.00744727 [75%]=0.0524288
Jan  8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000000    0.000001 9
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000256    0.000512 1
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.001024    0.002048 1
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.002048    0.004096 19
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.004096    0.008192 2
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.008192    0.016384 11
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.016384    0.032768 3
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.032768    0.065536 5
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.065536    0.131072 15
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.131072    0.262144 10
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.262144    0.524288 5
Jan  8 21:05:32 cns unbound: [25052:3] info:    0.524288    1.000000 2
Jan  8 21:05:32 cns unbound: [25052:3] info:  256.000000  512.000000 1

I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
I don't see this problem.

Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running

num-threads: 4
outgoing-range: 10000
msg-cache-size: 500m
msg-cache-slabs: 1
num-queries-per-thread: 10000
jostle-timeout: 4000
so-rcvbuf: 8m
rrset-cache-size: 1024m
rrset-cache-slabs: 1
cache-max-ttl: 86400

Any help would be greatly appreciated :-)

Thanks,

-- 
Jakub Heichman