Maintained by: NLnet Labs

[Unbound-users] No answers from unbound occasionally

Attila Nagy
Sat May 30 14:09:15 CEST 2009


W.C.A. Wijngaards wrote:
> Are those replies from authority servers? That arrive just after unbound
> times out and closes the socket?
I'm not sure I understand this correctly. What we do is the following:
- we have some loaded (a few thousand queries per second) recursive 
nameservers (behind a load balancer) on which the clients report 
occasional loss of answers
- we run a test program against the servers; it sends a configurable 
number of queries for the same name (so no, authority servers should not 
be involved: the queries should be answerable from the cache) and waits 
5 seconds for the answers
- we sometimes find timeouts on the client
- investigating this yields a capture (made on the nameservers), which 
has the client's request (so there is no packet loss involved between 
the client and the server), but no answer from the server
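
The test procedure described in the list above can be sketched roughly as follows. This is not the program actually used here, only a minimal illustration under assumptions: the names build_query and probe are hypothetical, the query is a plain A/IN question with RD=1, and replies are matched to queries by DNS ID alone.

```python
import socket
import struct
import time


def build_query(name, query_id, qtype=1):
    """Build a minimal DNS query: RD=1, one question, class IN."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, rest 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(server, name, count=100, timeout=5.0, port=53):
    """Send `count` queries for `name`; return the IDs left unanswered."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pending = set()
    try:
        for query_id in range(1, count + 1):
            sock.sendto(build_query(name, query_id), (server, port))
            pending.add(query_id)
        # Wait up to `timeout` seconds in total for the answers.
        deadline = time.monotonic() + timeout
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                data, _ = sock.recvfrom(4096)
            except socket.timeout:
                break
            if len(data) >= 12:
                (reply_id,) = struct.unpack("!H", data[:2])
                pending.discard(reply_id)
    finally:
        sock.close()
    return sorted(pending)
```

Any ID returned by probe() is a query that got no answer within the timeout; those IDs can then be looked up in the capture taken on the nameserver.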

The cache is quite big (the machines have 8 GiB of RAM) and the TTL is 
high, so all answers should come from it.

> Some sort of selective verbose logging is an idea on my TODO.
Great, it would ease debugging issues like this.
> Unbound will indeed not respond to particular queries.  These queries
> end up getting counted as 'cache hits', but really they were malformed.
>  Some malformed queries unbound does not reply to - such as queries with
> QR=1 flag, or shorter than 12 byte queries.  Since you are sending them
> yourself, it seems unlikely they are this malformed.
The queries should be the same. Or at least, I haven't seen any 
differences between the "good" and the "bad" ones in wireshark.
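
To rule out the two malformed cases named above (QR=1, or a packet shorter than the 12-byte DNS header), the captured queries could be checked with something like the sketch below. It covers only those two cases and is not Unbound's actual validation code; the function name silently_ignored is hypothetical.

```python
import struct

DNS_HEADER_LEN = 12   # fixed DNS header size in bytes
QR_BIT = 0x8000       # top bit of the flags word: 0 = query, 1 = response


def silently_ignored(packet):
    """Return True if the packet matches one of the two cases from the thread."""
    if len(packet) < DNS_HEADER_LEN:
        return True
    # The flags word is the second 16-bit field of the header.
    (flags,) = struct.unpack("!H", packet[2:4])
    return bool(flags & QR_BIT)
```

Running this over the UDP payloads of the "bad" queries from the capture would confirm whether any of them falls into either category.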
>> With tcpdump it seems that the machine gets the query, but there is no
>> answer from unbound.
>> Its statistics counters seem OK, there is no full queue, or drops
>> according to that.
> Is it a query to port 53?  Queries to other ports are not answered.
> Are there 'jostled' queries? They also create dropped replies by
> replacing an existing (old) one.
unbound-control currently tells this:
I have graphs from this (munin) and I haven't seen a single overwritten 
or exceeded value on any of our servers.
>>    72146984 dropped due to full socket buffers
> Could this explain the 1 in 100qps-for-3600s that are dropped?  Could
> they be dropped at the query-sender (seems unlikely)?
We monitor the traffic at three points: before the load balancer, after 
the load balancer, and on the DNS servers. The queries for which we got 
no answer appear in all three captures, so no, they do reach the server.
>> I've already tried to raise the related sysctls, without any effects.
> You tried to increase socket buffers already, I presume. Weird.
Yes, I've tried increasing net.inet.udp.recvspace, kern.ipc.maxsockbuf.
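
One way to see what receive buffer a UDP socket actually ends up with after such tuning is to ask the kernel through the standard SO_RCVBUF socket option. The helper below is only a sketch with a hypothetical name; it inspects a fresh socket of its own, not Unbound's socket, and how an oversized request is handled depends on the operating system (it may be clamped, or the call may fail with an OSError).

```python
import socket


def udp_receive_buffer(requested=None):
    """Return the receive buffer size (bytes) the kernel reports for a new UDP socket.

    If `requested` is given, first ask for that size via SO_RCVBUF.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        if requested is not None:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()
```

Calling udp_receive_buffer() before and after changing the sysctls shows whether the new default is picked up by newly created sockets.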

According to netstat -m, there is no mbuf shortage:
# netstat -m
5148/5097/10245 mbufs in use (current/cache/total)
4080/2470/6550/25600 mbuf clusters in use (current/cache/total/max)
4080/1424 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
9447K/6214K/15661K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

We have pf active, but its state table is also monitored and never 
reaches the maximum, and the UDP-related timeouts are high:
udp.first                    60s
udp.single                   30s
udp.multiple                 60s
