Maintained by: NLnet Labs

[Unbound-users] Problem to resolve domains from a certain registrar

Leo Bush
Thu Sep 8 18:18:07 CEST 2011


Dear all,

My problem is not gone yet. I analysed it as far as I could, following 
your keywords EDNS and packet loss or fragmentation, which looked 
promising. I disabled the iptables rule set, restarted unbound and let 
it run for a couple of days: same result. Sometimes the domains hosted 
on ns1.register.be and ns2.register.be were resolved; most of the time 
unbound returned a SERVFAIL.
Then I disabled DNSSEC and let it run as well. Again the same result.

I nevertheless noticed better results for domains like leonidas.be, 
which are hosted on ns1.register.be, ns2.register.be and 
ns3.register.be, than for domains like estates.lu, which is delegated 
only to ns1.register.be and ns2.register.be.

Then I read something in the unbound documentation about «Unbound 
Timeout Information» and the flush_infra and dump_infra commands. I 
think this time I am on a promising track, but I do not yet understand 
well enough how it all works and interacts. I did the following checks:

[resolv ~]# unbound-control lookup leonidas.be
The following name servers are used for lookup of leonidas.be.
;rrset 85727 2 0 2 0
leonidas.be.    85727   IN      NS      ns1.register.be.
leonidas.be.    85727   IN      NS      ns2.register.be.
;rrset 37 1 0 8 3
ns2.register.be.        37      IN      A       194.78.23.152
;rrset 37 1 0 8 3
ns1.register.be.        37      IN      A       80.169.63.207
Delegation with 2 names, of which 0 can be examined to query further 
addresses.
It provides 2 IP addresses.
80.169.63.207           rto 120000 msec, ttl 412, ping 0 var 94 rtt 376, 
EDNS 0 assumed.
194.78.23.152           rto 120000 msec, ttl 478, ping 0 var 94 rtt 376, 
EDNS 0 assumed.


[root@resolv ~]# unbound-control flush_infra 80.169.63.207 && 
unbound-control flush_infra 194.78.23.152; while [ 1 ] ; do date; 
unbound-control dump_infra | grep -E 
"80.169.63.207|194.78.23.152|91.121.5.186"; sleep 30; echo; done
ok
ok
Thu Sep  8 18:02:51 CEST 2011
91.121.5.186 ttl 717 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:03:21 CEST 2011
80.169.63.207 ttl 875 ping 2 var 60 rtt 242 rto 242 ednsknown 1 edns 0 
delay 0
194.78.23.152 ttl 889 ping 5 var 16 rtt 69 rto 69 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 687 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:03:52 CEST 2011
80.169.63.207 ttl 844 ping 2 var 60 rtt 242 rto 242 ednsknown 1 edns 0 
delay 0
194.78.23.152 ttl 858 ping 5 var 16 rtt 69 rto 69 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 656 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:04:23 CEST 2011
80.169.63.207 ttl 813 ping 2 var 60 rtt 242 rto 3872 ednsknown 1 edns 0 
delay 0
194.78.23.152 ttl 827 ping 5 var 16 rtt 69 rto 4416 ednsknown 1 edns 0 
delay 0
91.121.5.186 ttl 625 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:04:53 CEST 2011
80.169.63.207 ttl 783 ping 2 var 60 rtt 242 rto 15488 ednsknown 1 edns 0 
delay 9
194.78.23.152 ttl 797 ping 5 var 16 rtt 69 rto 17664 ednsknown 1 edns 0 
delay 13
91.121.5.186 ttl 595 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:05:24 CEST 2011
80.169.63.207 ttl 752 ping 2 var 60 rtt 242 rto 30976 ednsknown 1 edns 0 
delay 13
194.78.23.152 ttl 766 ping 5 var 16 rtt 69 rto 35328 ednsknown 1 edns 0 
delay 0
91.121.5.186 ttl 564 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:05:55 CEST 2011
80.169.63.207 ttl 721 ping 2 var 60 rtt 242 rto 61952 ednsknown 1 edns 0 
delay 54
194.78.23.152 ttl 735 ping 5 var 16 rtt 69 rto 35328 ednsknown 1 edns 0 
delay 8
91.121.5.186 ttl 533 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:06:25 CEST 2011
80.169.63.207 ttl 691 ping 2 var 60 rtt 242 rto 61952 ednsknown 1 edns 0 
delay 24
194.78.23.152 ttl 705 ping 5 var 16 rtt 69 rto 70656 ednsknown 1 edns 0 
delay 53
91.121.5.186 ttl 503 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:06:56 CEST 2011
80.169.63.207 ttl 660 ping 2 var 60 rtt 242 rto 120000 ednsknown 1 edns 
0 delay 0
194.78.23.152 ttl 674 ping 5 var 16 rtt 69 rto 70656 ednsknown 1 edns 0 
delay 22
91.121.5.186 ttl 472 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:07:26 CEST 2011
80.169.63.207 ttl 630 ping 2 var 60 rtt 242 rto 120000 ednsknown 1 edns 
0 delay 0
194.78.23.152 ttl 644 ping 5 var 16 rtt 69 rto 120000 ednsknown 1 edns 0 
delay 0
91.121.5.186 ttl 442 ping 4 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
^C

I noticed that the rto value increases very quickly towards 120000 for 
the two name servers ns1.register.be and ns2.register.be, and it stays 
there. Resolution works for a certain time; afterwards unbound returns 
SERVFAIL.
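The jumps I see (242, then 3872, 15488, 30976, 61952, 120000) look 
consistent with a per-timeout doubling of rto capped at 120000 ms. A 
minimal sketch of that assumption (the pure doubling and the cap 
behaviour are my guess from the trace, not taken from the unbound 
sources):

```shell
# Assumption: each timed-out query doubles the host's rto,
# capped at the 120000 ms ceiling seen in the dump above.
rto=242
for i in 1 2 3 4 5 6 7 8 9 10; do
    rto=$((rto * 2))
    [ "$rto" -gt 120000 ] && rto=120000
    echo "$rto"
done
```

Doubling from the measured rtt of 242 ms passes exactly through the 
observed values 3872, 15488, 30976 and 61952 before hitting the cap, 
which is why I suspect repeated timeouts rather than changing network 
latency.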

I ran the same test on a twin unbound name server (carrying no 
traffic) and could not observe the same behaviour.

unbound-control flush_infra 80.169.63.207 && unbound-control flush_infra 
194.78.23.152; while [ 1 ] ; do date; unbound-control dump_infra | grep 
-E "80.169.63.207|194.78.23.152|91.121.5.186"; sleep 30; echo; done
ok
ok
Thu Sep  8 17:55:56 CEST 2011

Thu Sep  8 17:56:26 CEST 2011
80.169.63.207 ttl 877 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 877 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 877 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep  8 17:56:57 CEST 2011
80.169.63.207 ttl 846 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 846 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 846 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep  8 17:57:27 CEST 2011
80.169.63.207 ttl 816 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 816 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 816 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep  8 17:57:57 CEST 2011
80.169.63.207 ttl 786 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 786 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 786 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep  8 17:58:27 CEST 2011
80.169.63.207 ttl 756 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 756 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 756 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Now I did the following test on the "buggy" unbound server:
[root@resolv ~]# unbound-control flush_infra 80.92.67.140 && 
unbound-control flush_infra 80.92.65.2; while [ 1 ] ; do date; 
unbound-control dump_infra | grep -E "80.92.67.140|80.92.65.2"; sleep 
30; echo; done
ok
ok
Thu Sep  8 17:58:38 CEST 2011

Thu Sep  8 17:59:08 CEST 2011
80.92.67.140 ttl 870 ping 21 var 9 rtt 57 rto 57 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 872 ping 9 var 8 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 17:59:39 CEST 2011
80.92.67.140 ttl 839 ping 20 var 8 rtt 52 rto 52 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 841 ping 11 var 10 rtt 51 rto 51 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:00:09 CEST 2011
80.92.67.140 ttl 808 ping 23 var 11 rtt 67 rto 67 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 810 ping 26 var 29 rtt 142 rto 142 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:00:41 CEST 2011
80.92.67.140 ttl 777 ping 21 var 9 rtt 57 rto 57 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 779 ping 8 var 8 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:01:11 CEST 2011
80.92.67.140 ttl 747 ping 18 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 749 ping 9 var 7 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:01:42 CEST 2011
80.92.67.140 ttl 716 ping 18 var 5 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 718 ping 8 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep  8 18:02:12 CEST 2011
80.92.67.140 ttl 685 ping 23 var 13 rtt 75 rto 75 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 687 ping 11 var 7 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
^C
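To spot the stuck state without eyeballing the whole dump, a quick awk 
filter over the dump_infra output works (field 11 is the rto column in 
the line layout above; the sample lines in the here-document are copied 
from my first trace):

```shell
# Flag hosts whose rto has reached the 120000 ms ceiling.
# $1 is the address, $11 the rto value in the dump_infra layout above.
awk '$11 == 120000 { print $1, "rto at cap" }' <<'EOF'
80.169.63.207 ttl 630 ping 2 var 60 rtt 242 rto 120000 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 644 ping 5 var 16 rtt 69 rto 120000 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 442 ping 4 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
EOF
```

On a live server I would pipe "unbound-control dump_infra" into the 
same awk program instead of the here-document.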

Does anybody have an explanation or a suggestion for this?

regards


Leo Bush


On 24/08/2011 13:47, Lst_hoe02 at kwsoft.de wrote:
 > Quoting Leo Bush <leo.bush at mylife.lu>:
 >
 >> Dear all,
 >>
 >> For a month now our company has been using unbound-1.4.8-1 on two 
 >> RH6 servers as caching, resolving servers with IPv6 and DNSSEC 
 >> enabled. These two servers handle all our DNS traffic, generated by 
 >> all our customers (2x 5 Mbps peak traffic). They run as standalone 
 >> servers, with no complicated network components (load balancers, 
 >> ...) around.
 >>
 >> At the beginning we had the option use-caps-for-id activated, but 
 >> since we got complaints from customers that certain domains were 
 >> reachable everywhere in the world except at our site, we preferred 
 >> to deactivate it.
 >>
 >> Currently we face the following rather strange problem:
 >> Under normal working conditions, 70-90% of the time our two 
 >> production servers cannot resolve domains registered at register.be 
 >> and hosted on the three authoritative name servers ns1.register.be, 
 >> ns2.register.be and ns3.register.be (examples: leonidas.be, 
 >> estates.lu). They return a SERVFAIL. register.be itself works all 
 >> the time. By chance it sometimes works correctly for a brief period 
 >> of time. Even though it was not easy due to the thousands of packets 
 >> passing through every second, I managed to trace the packets the 
 >> server sends to the authoritative servers, and it gets correct 
 >> answers back.
 >>
 >> I tried installing unbound 1.4.8 with the same configuration file 
 >> (see attachment) on a desktop machine and there was no issue. All 
 >> resolutions of domains at register.be were immediate and correct.
 >>
 >> As customers continued to complain, I was forced to take one server 
 >> out of production and replace it with BIND, which works correctly. 
 >> Now I have one server with unbound that has the problem and one 
 >> server with BIND that works fine in production. The formerly faulty 
 >> unbound server, now offloaded, currently responds correctly to all 
 >> tests (no restart done, no reboot done, just the IP address 
 >> switched).
 >>
 >> Does anybody have an idea how I can solve this problem? Shall I 
 >> provide more technical information? Do you have further tests to 
 >> suggest?
 >>
 >
 > This looks to me like an EDNS problem. At least part of the .be zone 
 > is DNSSEC signed, and the replies get bigger than 512 bytes, as with 
 > "dig x.dns.be A +dnssec". BIND has a feature to reduce the EDNS size 
 > in case of trouble; I am not sure whether Unbound does the same. What 
 > you should check:
 > - Do the troublesome domains/names resolve with unbound if you use 
 >   checking disabled (+cdflag)?
 > - Do you have any firewall device in front of your resolvers, maybe 
 >   some Cisco box inspecting DNS traffic?
 > - Have you disabled TCP in Unbound?
 >
 > For some hints on the problem have a look here:
 > https://www.dns-oarc.net/oarc/services/replysizetest
 >
 > Regards
 >
 > Andreas
 >
 >
 >
 > _______________________________________________
 > Unbound-users mailing list
 > Unbound-users at unbound.net
 > http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
 >