Maintained by: NLnet Labs

[Unbound-users] Resolving facebook and RTO of 120000

Leo Bush
Thu Mar 22 09:39:24 CET 2012


Dear all,

We are using Unbound 1.4.14 for DNS resolution. It handles 5 Mbit/s of 
traffic on average.
For several days we had a problem resolving m.facebook.com. Unbound 
persistently returned SERVFAIL, and mobile clients were stuck and 
unhappy.
When I looked at the infra cache, I saw the responsible nameservers with 
an RTO value of 120000, and I was unable to flush or clear those 
entries. Is there a way to do this that I have missed (flush no longer 
seems to work)? That is why I ran "unbound-control reload". Since then 
m.facebook.com has been working again (for 3 days now).

What astonishes me is that Unbound's algorithm did not get out of the 
RTO 120000 situation by itself. I had a similar problem a few weeks 
earlier with the domain www.voipbuster.com, for which the nameserver 
ns1.finarea.ch was listed with 120000 in the cache dump. The only way 
to get it working again was an unbound-control reload.

After two failures in less than a month, I got curious and looked for 
other 120000 entries in the infra cache dump:

[ ~]# date; unbound-control dump_infra | grep 120000 | sort; echo; sleep 120; date; unbound-control dump_infra | grep 120000 | sort
Thu Mar 22 09:21:04 CET 2012
128.242.103.18 fr.fm. expired rto 120000
128.242.103.32 fr.fm. expired rto 120000
199.7.59.78 ocsp.verisign.net. expired rto 120000
200.74.240.66 globalcccam.com. expired rto 120000
62.212.66.130 uploadhere.com. expired rto 120000
68.232.43.4 cedexis.net. expired rto 120000
69.171.239.10 star.facebook.com. expired rto 120000
69.171.255.10 star.facebook.com. expired rto 120000

Thu Mar 22 09:23:04 CET 2012
128.242.103.18 fr.fm. expired rto 120000
128.242.103.32 fr.fm. expired rto 120000
199.7.59.78 ocsp.verisign.net. expired rto 120000
200.74.240.66 globalcccam.com. expired rto 120000
62.212.66.130 uploadhere.com. expired rto 120000
68.232.43.4 cedexis.net. expired rto 120000
69.171.239.10 star.facebook.com. expired rto 120000
69.171.255.10 star.facebook.com. expired rto 120000
72.51.41.148 bl.csma.biz. ttl 642 ping 0 var 94 rtt 376 rto 120000 ednsknown 0 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0


As you can see, I ran the dump twice, 2 minutes apart, and for some 
entries the RTO did not change (star.facebook.com in particular; I 
repeated the dump several more times). When I query the authoritative 
server's IP directly for the record, I get an instant answer. So I do 
not think I have a network problem, since everything else works fine.
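
The before/after comparison described above could be scripted along 
these lines. This is only a sketch: it assumes dump_infra prints one 
entry per line in the format shown above, and the function names 
stuck_entries and persistent_stuck are made up for illustration.

```shell
# Print "IP zone" for every infra entry whose RTO sits at 120000.
# Reads `unbound-control dump_infra` output on stdin.
stuck_entries() {
    awk '{
        for (i = 3; i < NF; i++)
            if ($i == "rto" && $(i + 1) == "120000") { print $1, $2; break }
    }' | sort -u
}

# Print the entries that are stuck in BOTH of two saved dumps.
# Each dump contributes a line at most once, so a line seen twice
# in the combined list was present in both files.
persistent_stuck() {
    { stuck_entries < "$1"; stuck_entries < "$2"; } | sort | uniq -d
}

# Usage (hypothetical file names):
#   unbound-control dump_infra > /tmp/infra.1; sleep 120
#   unbound-control dump_infra > /tmp/infra.2
#   persistent_stuck /tmp/infra.1 /tmp/infra.2
```

Matching on the "rto 120000" field pair rather than a plain grep avoids 
false hits on lines that merely contain the digits 120000 elsewhere.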

[ ~]# dig  star.facebook.com

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16121
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;star.facebook.com.             IN      A

;; Query time: 165 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Mar 22 09:25:38 2012

[ ~]# dig @69.171.239.10 star.facebook.com. +norec

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47559
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;star.facebook.com.             IN      A

;; ANSWER SECTION:
star.facebook.com.      30      IN      A       66.220.158.72

;; Query time: 10 msec
;; SERVER: 69.171.239.10#53(69.171.239.10)
;; WHEN: Thu Mar 22 09:26:02 2012
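
The two dig runs above can be folded into one helper that compares the 
resolver's answer with the direct answer from an authoritative server. 
A minimal sketch, assuming the resolver listens on 127.0.0.1 as in the 
output above; dig_status and compare_resolution are illustrative names, 
and the 2-second timeout is an arbitrary choice.

```shell
# Extract the response code (NOERROR, SERVFAIL, ...) from dig output
# read on stdin. Prints nothing if no header line is present.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query a name once through the local resolver and once directly at an
# authoritative server (non-recursive), then print both response codes.
compare_resolution() {
    name=$1
    auth_ip=$2
    via_resolver=$(dig @127.0.0.1 "$name" A +time=2 +tries=1 | dig_status)
    direct=$(dig @"$auth_ip" "$name" A +norec +time=2 +tries=1 | dig_status)
    echo "$name resolver=${via_resolver:-none} direct@$auth_ip=${direct:-none}"
}

# Usage:
#   compare_resolution star.facebook.com. 69.171.239.10
```

A line such as "resolver=SERVFAIL direct@...=NOERROR" would match the 
situation described here: the server answers, but the resolver does not 
ask it.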


I also checked the authoritative IP for facebook in the dump. It shows 
plenty of healthy entries besides star.facebook.com.
[ ~]# unbound-control dump_infra | grep 69.171.239.10
69.171.239.10 touch.facebook.com. ttl 854 ping 3 var 46 rtt 187 rto 187 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 developers.facebook.com. ttl 886 ping 1 var 74 rtt 297 rto 297 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 orcart.facebook.com. ttl 890 ping 1 var 73 rtt 293 rto 293 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 check6.facebook.com. ttl 879 ping 1 var 74 rtt 297 rto 297 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 www.facebook.com. ttl 888 ping 1 var 73 rtt 293 rto 293 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 staging.channel.facebook.com. ttl 898 ping 1 var 74 rtt 297 rto 297 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 wild.facebook.com. ttl 890 ping 1 var 74 rtt 297 rto 297 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 chat.facebook.com. ttl 838 ping 6 var 17 rtt 74 rto 74 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 api-read.facebook.com. ttl 891 ping 1 var 74 rtt 297 rto 297 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 check4.facebook.com. ttl 879 ping 3 var 47 rtt 191 rto 191 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 apps.facebook.com. ttl 891 ping 1 var 74 rtt 297 rto 297 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 graph.facebook.com. ttl 894 ping 1 var 75 rtt 301 rto 301 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
69.171.239.10 star.facebook.com. expired rto 120000
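
To see at a glance which server IPs have a mix of healthy and stuck 
zone entries, the dump can be summarised per IP. A rough sketch, again 
assuming one dump_infra entry per line in the format above; 
infra_summary is an illustrative name.

```shell
# Read dump_infra output on stdin and print, per server IP, how many
# zone entries are stuck at rto 120000 and how many are not.
infra_summary() {
    awk '{
        stuck = 0
        for (i = 3; i < NF; i++)
            if ($i == "rto" && $(i + 1) == "120000") stuck = 1
        total[$1]++
        if (stuck) bad[$1]++
    } END {
        for (ip in total)
            printf "%s stuck=%d ok=%d\n", ip, bad[ip], total[ip] - bad[ip]
    }' | sort
}

# Usage:
#   unbound-control dump_infra | infra_summary
```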

Has anybody seen a similar phenomenon (IPs stuck at "expired rto 
120000")? Do you have any idea when and why this happens and how I 
should deal with it? Reloading Unbound penalizes all the other users.
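
One less disruptive thing to try than a full reload might be to flush 
only the affected infra entries. The unbound-control man page documents 
"flush_infra all|IP"; whether it is available and actually clears these 
entries in 1.4.14 is an assumption here and is untested. The sketch 
below only prints the commands so they can be reviewed first; 
flush_commands is an illustrative name.

```shell
# Read dump_infra output on stdin and print one flush_infra command
# per server IP that has an entry stuck at rto 120000.
# Nothing is executed here; pipe the output to `sh` to apply it.
flush_commands() {
    awk '{
        for (i = 3; i < NF; i++)
            if ($i == "rto" && $(i + 1) == "120000") { print $1; break }
    }' | sort -u | while read -r ip; do
        echo "unbound-control flush_infra $ip"
    done
}

# Usage:
#   unbound-control dump_infra | flush_commands         # review
#   unbound-control dump_infra | flush_commands | sh    # apply
```

Flushing by IP would presumably also drop the healthy entries recorded 
for that address, which should be cheap to relearn compared with losing 
the whole cache on a reload.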

Thanks for your thoughts.

kind regards

Leo Bush