Maintained by: NLnet Labs

[Unbound-users] Unbound 1.4.1 cache problems

Jakub Heichman
Mon Jan 11 10:58:18 CET 2010


Thanks Wouter, the workaround with ip6 is working.

Best regards,
Jakub

2010/1/9 W.C.A. Wijngaards <wouter at nlnetlabs.nl>:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Jakub,
>
> I cannot know for sure since the log seems to be shortened by syslog,
> but I think you may have the same problem as Haw Loeung reported.  I
> have since corresponded with him and I think have resolved the issue.  A
> patch is in svn for unbound.
>
> The issue seems to be caused by the parent-child disagreement fix in
> 1.4.1, worsened by no IPv6 available (either not in the kernel or
> do-ip6:no statement) - just turning ipv6 back on (also if you have no
> ipv6 connectivity) may stop this bug.
>
> Sorry for the inconvenience,
>   Wouter
>
> On 01/08/2010 10:36 PM, Jakub Heichman wrote:
>> Hello there,
>>
>> I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
>> servers. The servers are doing about 3-7k requests per second.
>> I have noticed that after some time of operation, servers
>> intermittently stop responding to queries to certain domains from
>> cache (return SERVFAIL). It particularly doesn't like webex.com,
>> sometimes vmware.com, dell.com.
>>
>> # unbound-host webex.com
>> webex.com has address 64.68.120.99
>> webex.com mail is handled by 50 mx3.webex.com.
>> webex.com mail is handled by 10 mx1.webex2.iphmx.com.
>> webex.com mail is handled by 10 mx2.webex2.iphmx.com.
>> webex.com mail is handled by 50 mx2.webex.com.
>>
>> But the A record doesnt seem to get cached:
>>
>> # dig webex.com @localhost
>>
>> ; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
>> ; (1 server found)
>> ;; global options:  printcmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
>> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>>
>> ;; QUESTION SECTION:
>> ;webex.com.                   IN      A
>>
>> ;; Query time: 0 msec
>> ;; SERVER: 127.0.0.1#53(127.0.0.1)
>> ;; WHEN: Fri Jan  8 19:16:01 2010
>> ;; MSG SIZE  rcvd: 27
>>
>> # unbound-control dump_cache | grep ^webex.com
>> webex.com.    4589    IN      NS      ns01.webex.com.
>> webex.com.    4589    IN      NS      ns05.webex.com.
>> webex.com.    4589    IN      NS      ns07.webex.com.
>> webex.com.    4589    IN      NS      ns03.webex.com.
>> webex.com.    4589    IN      NS      ns08.webex.com.
>> webex.com.    4589    IN      NS      ns06.webex.com.
>>
>> It doesnt respond to NS queries either (No A records for
>> nsxx.webex.com in cache)
>>
>> # dig ns webex.com @localhost
>>
>> ; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
>> ; (1 server found)
>> ;; global options:  printcmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
>> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>>
>> ;; QUESTION SECTION:
>> ;webex.com.                   IN      NS
>>
>> ;; Query time: 0 msec
>> ;; SERVER: 127.0.0.1#53(127.0.0.1)
>> ;; WHEN: Fri Jan  8 20:53:18 2010
>> ;; MSG SIZE  rcvd: 27
>>
>>
>>
>> After flushing webex.com I can query the cache again:
>>
>> # unbound-control flush webex.com
>> ok
>> # dig +short webex.com @localhost
>> 64.68.120.99
>>
>> Log with increased verbosity:
>>
>> an  8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
>> query <webex.com. A IN>
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info:
>> DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
>> avail)
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns08.webex.com.  A
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns05.webex.com.  A
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns01.webex.com.  A
>> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns06.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
>> <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
>> <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
>> <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info:
>> DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
>> avail)
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns08.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns05.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns01.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns06.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
>> Jan  8 21:05:32 cns unbound: [25052:3] info: validator operate: query
>> <webex.com. A IN>
>> Jan  8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
>> states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
>> replies sent, 0 replies dropped, 0 states jostled out
>> Jan  8 21:05:32 cns unbound: [25052:3] info: average recursion
>> processing time 4.172319 sec
>> Jan  8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
>> processing times
>> Jan  8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
>> median[50%]=0.00744727 [75%]=0.0524288
>> Jan  8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000000    0.000001 9
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000256    0.000512 1
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.001024    0.002048 1
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.002048    0.004096 19
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.004096    0.008192 2
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.008192    0.016384 11
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.016384    0.032768 3
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.032768    0.065536 5
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.065536    0.131072 15
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.131072    0.262144 10
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.262144    0.524288 5
>> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.524288    1.000000 2
>> Jan  8 21:05:32 cns unbound: [25052:3] info:  256.000000  512.000000 1
>>
>>
>>
>> I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
>> I don't see this problem.
>>
>> Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running
>>
>> num-threads: 4
>> outgoing-range: 10000
>> msg-cache-size: 500m
>> msg-cache-slabs: 1
>> num-queries-per-thread: 10000
>> jostle-timeout: 4000
>> so-rcvbuf: 8m
>> rrset-cache-size: 1024m
>> rrset-cache-slabs: 1
>> cache-max-ttl: 86400
>>
>> Any help would be greatly appreciated :-)
>>
>> Thanks,
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (GNU/Linux)
> Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAktIS0QACgkQkDLqNwOhpPjQ2ACfeCESH6t5ezR94AnasM6DP16f
> EsIAoJbV90YIEyovcdz2VWTRmEk4/tMY
> =Zfvf
> -----END PGP SIGNATURE-----
>



-- 
Kuba Heichman