Maintained by: NLnet Labs

[Unbound-users] Resolve failures when using forwarders that do recursion

Ilya Bakulin
Thu Nov 21 11:21:43 CET 2013


Hi list,
consider the following configuration:
1. Unbound is set up to use 2-3 DNS servers as forwarders
2. These servers perform recursive resolution
3. Most resolve requests are for a handful of domain names,
	but there are certainly requests for other domains too.

Now look at this tcpdump log:

15:59:41.599014 IP unbound.host.13453 > forwarder-1.com.domain: 48359+% [1au] MX? some.domain. (41)
15:59:42.246023 IP unbound.host.29253 > forwarder-2.com.domain: 35322+% [1au] MX? some.domain. (41)
15:59:42.277969 IP forwarder-1.com.domain > unbound.host.13453: 48359 2/0/2 MX mx1.some.domain. 5, MX mx2.some.domain. 10 (102)
15:59:42.278009 IP unbound.host > forwarder-1.com: ICMP unbound.host udp port 13453 unreachable, length 36
15:59:43.254411 IP unbound.host.20082 > forwarder-3.com.domain: 18647+% [1au] MX? some.domain. (41)
15:59:43.575827 IP forwarder-2.com.domain > unbound.host.29253: 35322 2/0/2 MX mx2.some.domain. 10, MX mx1.some.domain. 5 (102)
15:59:43.575933 IP unbound.host > forwarder-2.com: ICMP unbound.host udp port 29253 unreachable, length 36
15:59:43.943166 IP forwarder-3.com.domain > unbound.host.20082: 18647 2/0/2 MX mx1.some.domain. 5, MX mx2.some.domain. 10 (102)
15:59:43.943304 IP unbound.host > forwarder-3.com: ICMP unbound.host udp port 20082 unreachable, length 36

We see that Unbound tries to resolve "some.domain" using all three upstream forwarders, one after another.
Every upstream server has to perform a full recursive resolution because "some.domain" is not in its cache.

Now let's reorder the dump so that it is grouped by upstream server:

15:59:41.599014 IP unbound.host.13453 > forwarder-1.com.domain: 48359+% [1au] MX? some.domain. (41)
15:59:42.277969 IP forwarder-1.com.domain > unbound.host.13453: 48359 2/0/2 MX mx1.some.domain. 5, MX mx2.some.domain. 10 (102)
15:59:42.278009 IP unbound.host > forwarder-1.com: ICMP unbound.host udp port 13453 unreachable, length 36

So the answer arrived after >600ms; by then Unbound had already closed its socket -> the system answers with ICMP port unreachable.

15:59:42.246023 IP unbound.host.29253 > forwarder-2.com.domain: 35322+% [1au] MX? some.domain. (41)
15:59:43.575827 IP forwarder-2.com.domain > unbound.host.29253: 35322 2/0/2 MX mx2.some.domain. 10, MX mx1.some.domain. 5 (102)
15:59:43.575933 IP unbound.host > forwarder-2.com: ICMP unbound.host udp port 29253 unreachable, length 36

Here the answer arrived after >1s, with the same ICMP unreachable reaction.

15:59:43.254411 IP unbound.host.20082 > forwarder-3.com.domain: 18647+% [1au] MX? some.domain. (41)
15:59:43.943166 IP forwarder-3.com.domain > unbound.host.20082: 18647 2/0/2 MX mx1.some.domain. 5, MX mx2.some.domain. 10 (102)
15:59:43.943304 IP unbound.host > forwarder-3.com: ICMP unbound.host udp port 20082 unreachable, length 36

The upstream answered after ~690ms => ICMP unreachable again.
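
For anyone who wants to check the arithmetic, here is a small Python sketch that pairs each query with its reply by DNS query ID and prints the delay. The regular expression and the helper name `latencies_ms` are my own and only handle the tcpdump line format shown in this mail.

```python
import re
from datetime import datetime

# Excerpt of the dump above: the three queries, their replies, one ICMP line.
DUMP = """\
15:59:41.599014 IP unbound.host.13453 > forwarder-1.com.domain: 48359+% [1au] MX? some.domain. (41)
15:59:42.246023 IP unbound.host.29253 > forwarder-2.com.domain: 35322+% [1au] MX? some.domain. (41)
15:59:42.277969 IP forwarder-1.com.domain > unbound.host.13453: 48359 2/0/2 MX mx1.some.domain. 5, MX mx2.some.domain. 10 (102)
15:59:42.278009 IP unbound.host > forwarder-1.com: ICMP unbound.host udp port 13453 unreachable, length 36
15:59:43.254411 IP unbound.host.20082 > forwarder-3.com.domain: 18647+% [1au] MX? some.domain. (41)
15:59:43.575827 IP forwarder-2.com.domain > unbound.host.29253: 35322 2/0/2 MX mx2.some.domain. 10, MX mx1.some.domain. 5 (102)
15:59:43.943166 IP forwarder-3.com.domain > unbound.host.20082: 18647 2/0/2 MX mx1.some.domain. 5, MX mx2.some.domain. 10 (102)
"""

# timestamp, source, destination, DNS query ID (ICMP lines have no ID and do not match)
LINE = re.compile(r"^(\d\d:\d\d:\d\d\.\d{6}) IP (\S+) > (\S+): (\d+)")


def latencies_ms(dump):
    """Return {forwarder: delay in ms} by matching replies to queries on the query ID."""
    sent = {}    # query ID -> (send time, forwarder)
    delays = {}
    for line in dump.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        src, dst, qid = m.group(2), m.group(3), m.group(4)
        if src.startswith("unbound.host."):
            sent[qid] = (ts, dst)
        elif qid in sent:
            t0, forwarder = sent[qid]
            delays[forwarder] = (ts - t0).total_seconds() * 1000.0
    return delays


for forwarder, ms in sorted(latencies_ms(DUMP).items()):
    print(f"{forwarder}: {ms:.0f} ms")
```

For the dump above this gives roughly 679 ms, 1330 ms and 689 ms for forwarder-1, forwarder-2 and forwarder-3.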

So each of the upstream servers has done the hard work of recursive resolving,
yet Unbound hasn't accepted any of the answers and has returned SERVFAIL to the mail server.
The mail server hasn't sent the mail, and the sender is in disaster.

Unbound uses the algorithm described at [1] to set timeouts when
sending queries. This works well when Unbound is used as a recursive resolver,
because the Internet is a complex, wild network full of crappy, overloaded DNS servers
and one has to take changing conditions and failing servers into account.

But when Unbound uses forwarders that in turn have to deal with that
wild Internet outside, it doesn't forgive its forwarders when they
deliver an answer a bit late. It expects those servers to be
FAST, because most answers come from their cache and Unbound uses its
infra-cache to remember this.

If we turn the infra-cache off, Unbound will use its standard 376ms timeout and
the situation may get even worse.
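
To illustrate the effect, here is a rough Python sketch of a TCP-style smoothed RTT estimator, which is how I read [1]. It is not Unbound's actual code. Only the 376ms starting value comes from this mail; the class name, the 50ms floor, the 120s ceiling and the 5ms "cached answer" round trip are assumptions of mine.

```python
class RttEstimator:
    """TCP-style retransmission timeout estimator (illustrative, not Unbound's code)."""

    def __init__(self, initial_rto_ms=376.0, min_rto_ms=50.0, max_rto_ms=120000.0):
        self.srtt = 0.0                      # smoothed round-trip time
        self.rttvar = initial_rto_ms / 4.0   # smoothed deviation
        self.min_rto = min_rto_ms            # assumed floor
        self.max_rto = max_rto_ms            # assumed ceiling
        self.rto = self._calc_rto()

    def _calc_rto(self):
        rto = self.srtt + 4.0 * self.rttvar
        return max(self.min_rto, min(self.max_rto, rto))

    def on_reply(self, rtt_ms):
        """Fold a measured round trip into the estimate."""
        delta = rtt_ms - self.srtt
        self.srtt += delta / 8.0
        self.rttvar += (abs(delta) - self.rttvar) / 4.0
        self.rto = self._calc_rto()

    def on_timeout(self):
        """Exponential backoff after a lost or late answer."""
        self.rto = min(self.max_rto, self.rto * 2.0)


est = RttEstimator()
initial_rto = est.rto              # timeout for a server we know nothing about

for _ in range(200):               # 200 answers served from the forwarder's cache
    est.on_reply(5.0)              # assumed 5ms round trip
cached_rto = est.rto               # timeout after the estimator has "learned" the forwarder

late_answer_ms = 679.0             # forwarder-1's answer in the dump above
print(initial_rto, cached_rto, late_answer_ms > cached_rto)
```

Under these assumptions the timeout shrinks well below the ~680ms that a real recursive lookup took, so the late answer is thrown away even though it is perfectly valid.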

Would it make sense to add a new configuration parameter that sets
a custom timeout value for queries to forwarders? In the case of that poor unbound.host
we would set the timeout to ~1500ms or something like that.
It could be a per-server or a global value, and should be used only for requests
to "upstream" servers.
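
In Python-ish terms, the behaviour I have in mind is roughly the following. The function and the `forward_timeout_ms` parameter are hypothetical names for this proposal; no such Unbound option exists as far as this mail is concerned.

```python
from typing import Optional


def query_timeout_ms(estimated_rto_ms: float,
                     is_forwarder: bool,
                     forward_timeout_ms: Optional[float] = None) -> float:
    """Pick the timeout for one outgoing query.

    estimated_rto_ms   -- what the existing infra-cache based algorithm suggests
    is_forwarder       -- True if the target is a configured forwarder
    forward_timeout_ms -- hypothetical new setting; None means "not configured"
    """
    if is_forwarder and forward_timeout_ms is not None:
        return forward_timeout_ms   # custom value applies to forwarders only
    return estimated_rto_ms         # everything else behaves as before


print(query_timeout_ms(50.0, is_forwarder=True, forward_timeout_ms=1500.0))
print(query_timeout_ms(50.0, is_forwarder=False, forward_timeout_ms=1500.0))
```

The first call uses the configured 1500ms, the second keeps the estimated value because the target is not a forwarder.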

[1] http://www.unbound.net/documentation/info_timeout.html

--
Ilya