Timeout semantics of Unbound differ radically from Bind 9

Sun Apr 10 05:34:23 UTC 2016

On Sun, 10 Apr 2016, Dhalgren Tor wrote:

> However please note that Tor project has modified the eventdns code so
> it may not match exactly the behavior of the generic version.

The generic version seems to allow 3 retries with a 3-second timeout each
before it considers the nameserver dead, so Tor's version is more generous
with its timing.

> Interesting!  This explains why Tor relay DNS completely seizes up
> when GoDaddy null-routes a relay running Unbound.

It's worse than that, I think.  My read of this code suggests that if
unbound fails to answer for any single query 3 times in a row, eventdns
marks that copy of unbound as dead for at least 10 seconds and starts
exponentially backing off use of it, up to an hour.
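The dead-marking behaviour described above can be modelled roughly as follows. This is a minimal sketch, not the eventdns code: the constants (3 consecutive timeouts, 10 seconds initial backoff, a one-hour cap) come from the reading above, while the doubling schedule and the class and method names are hypothetical. Real eventdns also sends a probe query before re-admitting a dead server, which this sketch does not model; here the server simply becomes usable again once its backoff expires.

```python
from dataclasses import dataclass

# Hypothetical constants mirroring the behaviour described above; the
# real eventdns thresholds and backoff schedule may differ.
FAILURE_THRESHOLD = 3   # consecutive timeouts before a server is marked dead
INITIAL_BACKOFF = 10.0  # seconds a newly dead server is skipped
MAX_BACKOFF = 3600.0    # backoff cap: one hour


@dataclass
class NameserverState:
    consecutive_timeouts: int = 0
    dead_until: float = 0.0  # server is skipped while now < dead_until
    backoff: float = INITIAL_BACKOFF

    def usable(self, now: float) -> bool:
        return now >= self.dead_until

    def record_timeout(self, now: float) -> None:
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= FAILURE_THRESHOLD:
            # Mark the server dead, then double the next backoff up to the cap.
            self.dead_until = now + self.backoff
            self.backoff = min(self.backoff * 2, MAX_BACKOFF)
            self.consecutive_timeouts = 0

    def record_success(self) -> None:
        # In this sketch any answer clears the streak and resets the backoff.
        self.consecutive_timeouts = 0
        self.backoff = INITIAL_BACKOFF


if __name__ == "__main__":
    ns = NameserverState()
    for t in (0.0, 1.0, 2.0):  # three timeouts in a row
        ns.record_timeout(t)
    print(ns.usable(5.0), ns.usable(12.0))  # False True
```

With these assumed numbers, a server that keeps timing out is skipped for 10, 20, 40, ... seconds, reaching the one-hour cap after nine dead periods.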

This is a desirable characteristic if one of your 3 nameservers is broken;
you'll stop sending it requests and your users won't keep waiting on
responses that will never come.  Or, if you are querying a recursive
nameserver that doesn't want traffic from you and blackholes you, you'll stop
sending it a large volume of unwanted traffic.

According to https://www.unbound.net/documentation/info_timeout.html,
unbound should already be returning SERVFAIL immediately if it believes all
servers are dead.  SERVFAIL should also be returned after all servers have
been queried (and timed out) 5 times.  I suspect that can take more than 15
seconds, though, and I don't see a way to put an upper bound on it.
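To see how a slow SERVFAIL interacts with the client-side timeouts, here is a rough timeline model. It is a hypothetical sketch under stated assumptions, not a description of how eventdns actually schedules retransmissions: it takes the "3 retries at 3 seconds each" reading of the generic defaults, assumes each retry is sent as soon as the previous attempt times out, and uses 16 seconds as a stand-in for "more than 15 seconds". The function name is made up for illustration.

```python
def timeouts_before_answer(timeout_s: float, attempts: int, answer_after_s: float) -> int:
    """Count how many attempts expire before a resolver's answer arrives.

    Assumptions (hypothetical, for illustration): attempt i is sent as soon as
    attempt i-1 times out, every attempt uses the same per-attempt timeout,
    and the resolver's answer (e.g. a late SERVFAIL) arrives answer_after_s
    seconds after the first transmission and is accepted by whichever
    attempt is still outstanding.
    """
    expired = 0
    for i in range(attempts):
        deadline = (i + 1) * timeout_s  # when attempt i gives up
        if answer_after_s <= deadline:
            break  # the answer arrives while this attempt is still waiting
        expired += 1
    return expired


if __name__ == "__main__":
    # 3 attempts x 3 s against an answer that takes 16 s: every attempt
    # expires, which matches the 3-in-a-row threshold described earlier.
    print(timeouts_before_answer(3.0, 3, 16.0))  # 3
    # A prompt SERVFAIL after 2 s costs no timeouts at all.
    print(timeouts_before_answer(3.0, 3, 2.0))  # 0
```

Under these assumptions, any upstream answer slower than the client's whole budget (9 seconds here) turns a single unresolvable name into three consecutive timeouts against the local unbound.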

> Now I have to look into whether that 64 in-flight limit might be a
> performance constraint for fast exit relays.  Might want a tunable to
> increase the limit.

It looks like eventdns will respect an /etc/resolv.conf entry for
"max-inflight: 1000" or similar.  If you are limited by in-flight requests,
this could be an easy workaround.
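Why a cap of 64 outstanding queries can matter is easier to see with a toy model. The class below is hypothetical and does not reproduce eventdns's internals: queries beyond the cap wait in a FIFO queue until an earlier one completes. By Little's law, throughput is then bounded by roughly max_inflight divided by average lookup latency, so if lookups average 2 seconds (for example because some are waiting on timeouts), 64 slots sustain only about 32 queries per second.

```python
from __future__ import annotations

from collections import deque


class InflightLimiter:
    """Hypothetical model of a resolver client that caps outstanding queries.

    Queries beyond max_inflight wait in a FIFO queue until an earlier query
    completes, so under heavy load the cap turns into queueing delay.
    """

    def __init__(self, max_inflight: int = 64):
        self.max_inflight = max_inflight
        self.inflight: set[int] = set()   # ids of queries sent upstream
        self.waiting: deque[int] = deque()  # ids queued behind the cap

    def submit(self, query_id: int) -> bool:
        """Send the query now (True) or queue it behind the cap (False)."""
        if len(self.inflight) < self.max_inflight:
            self.inflight.add(query_id)
            return True
        self.waiting.append(query_id)
        return False

    def complete(self, query_id: int) -> int | None:
        """Finish a query and promote the next waiting one, if any."""
        self.inflight.discard(query_id)
        if self.waiting and len(self.inflight) < self.max_inflight:
            nxt = self.waiting.popleft()
            self.inflight.add(nxt)
            return nxt
        return None
```

For example, with `InflightLimiter(max_inflight=2)`, a third submitted query is queued and only goes out when one of the first two completes. Raising the cap (as with the resolv.conf option above) shortens that queue but does not make slow upstream answers any faster.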

                                     -- Aaron