Maintained by: NLnet Labs

[Unbound-users] unbound occasionally does not answer

W.C.A. Wijngaards
Fri Sep 18 13:52:16 CEST 2009


Hi Attila,

Could this be because of a feature controlled in util/netevent.c:
#define NUM_UDP_PER_SELECT 100

This controls how unbound reads its UDP sockets.
By default, it reads up to 100 datagrams from every socket it polls, but in
your case that can be 8000 'not port 53' sockets.

If every 'not 53' socket has datagrams, it could take a long
time to process them all.  There are probably at most one (or two)
datagrams on an open port towards the authority servers (due
to port randomisation).  Also, a lot of replies can arrive
at the same time, and thus take a lot of processing.
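
To make the per-socket limit concrete, here is a rough Python sketch of
the kind of loop that NUM_UDP_PER_SELECT bounds.  It is not the C code
from util/netevent.c; the names drain and budget are made up for this
example, and the socket is assumed to be nonblocking.

```python
import socket


def drain(sock, budget):
    """Read at most `budget` datagrams from a nonblocking socket.

    Stops early when the socket has nothing more to deliver and
    returns the payloads that were read, in arrival order.
    """
    payloads = []
    for _ in range(budget):
        try:
            data, _addr = sock.recvfrom(65535)
        except BlockingIOError:
            break  # receive buffer is empty, move on to the next socket
        payloads.append(data)
    return payloads
```

With budget=1 each readable socket gets a single recvfrom per pass; with
a large budget a busy socket is emptied before the next one is looked at.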

If you set the value to 1: this is more portable.  Unbound then makes
its system calls in a more portable way, friendlier to broken
nonblocking IP stacks.

If you set the value to 10000: this causes unbound to do
more processing for port 53 when it has lots of queries
waiting.  It will probably cause unbound to empty the socket buffer that
is in use, and this may help with your dropped packets?

Otherwise, if this value helps at all, perhaps port 53 can be
checked more often than the authority ports are checked...
(But this is a little more tricky to implement).
I'll see if I can send you an experimental patch for that off-list.
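
To show what "checking port 53 more often" could mean, here is a toy
Python model.  It is only an illustration of the idea, not the
experimental patch: plain deques stand in for the sockets, and the
function name service_order and its parameters are hypothetical.

```python
from collections import deque


def service_order(listen_queue, outgoing_queues, budget):
    """Return the order in which queued datagrams would be handled
    when the listening queue is revisited after every outgoing queue."""
    handled = []

    def take(queue):
        # Handle at most `budget` datagrams from one queue.
        for _ in range(budget):
            if not queue:
                break
            handled.append(queue.popleft())

    take(listen_queue)
    for queue in outgoing_queues:
        take(queue)
        take(listen_queue)  # look at port 53 again before the next authority port
    return handled
```

In this model a client query never waits behind more than one authority
port's worth of replies, at the cost of extra checks on the listening
socket.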

Not sure about SO_RCVBUF: if the above works perhaps another
option is better (I do not want to add too many options).
Yeah 256MB is insane :-)
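
For reference, setting the receive buffer on the listening socket only
looks roughly like this in Python.  This is a sketch and not unbound
code: open_udp_listener and its parameters are names invented for the
example, and the OS may clamp, round up or refuse the requested size.

```python
import socket


def open_udp_listener(host, port, rcvbuf_bytes):
    """Bind a UDP socket and ask the OS for a larger receive buffer.

    Returns the socket and the buffer size the kernel reports
    afterwards, which may differ from the requested value.  On some
    systems setsockopt raises OSError if the request exceeds the limit.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind((host, port))
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective
```

The outgoing sockets are left at the system default, so thousands of
open ports do not each reserve the larger buffer.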

Threading may help since it uses the IP stack differently.
But I would like it to work without threading.

Using memcached as a cache may work with a python module.  The module gets
activated on an unbound cache miss, which is the perfect opportunity
to read memcached; otherwise, let unbound's iterator process the
query and store the answer in memcached afterwards.
If you use unbound's DNS query mechanism you can query
memcached nonblocking, but you have to use DNS format to
talk to memcached (it doesn't check the packet format after
the query, so you can dump arbitrary data there if you want).
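
The lookup order described above, as a small Python sketch.  It does not
use unbound's python module API: plain dicts stand in for both unbound's
own cache and memcached, resolve stands in for the iterator, and all the
names are made up for the example.

```python
def lookup(qname, local_cache, shared_cache, resolve):
    """Answer from the local cache, then the shared store, then by resolving."""
    if qname in local_cache:
        return local_cache[qname]
    answer = shared_cache.get(qname)  # local miss: try the shared store first
    if answer is None:
        answer = resolve(qname)       # stand-in for unbound's iterator
        shared_cache[qname] = answer  # store the answer for later lookups
    local_cache[qname] = answer
    return answer
```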

Best regards,
    Wouter

On 09/18/2009 01:12 PM, Attila Nagy wrote:
> As I can't turn on query logging (it effectively kills the server), I've
> traced it with ktrace and checked the missing responses there.
> What I can see is that the investigated packets (queries) for which there was
> no response don't appear in the trace (but tcpdump sees them), so it
> seems that unbound doesn't even get them.

> I have tried to increase the UDP buffer size of the OS, but because of
> the large number of outgoing ports configured, they took too much space,
> so I've modified unbound, so it sets its SO_RCVBUF only on its listen
> port (BTW, I think such a configuration option would be nice).
> The default (net.inet.udp.recvspace) is 42080, I've gradually increased
> that to 256MB (I would say, it's insanely large) in the hope that it
> helps unbound get the packets which arrive while it does *something*.
> I could get some improvement with this, there are even some periods
> where the success rate was 100%, but it's still not perfect:
> Here, "dropped due to no socket" grows constantly (5-20 packets per
> sec), but setting net.inet.udp.log_in_vain to 1 (which logs which
> packets were dropped) shows that these packets are the ones which come
> back from the authoritative DNS servers, not the ones coming in for port 53
> (obviously unbound closed the socket, due to a timeout).
> The other (and more interesting stuff) is "dropped due to full socket
> buffers".
> That counter originally grew by 0 to 130 drops per second (the exact value of
> course moves with the actual query rate, so at night it is low, while
> at daytime it's at its peak). With increasing SO_RCVBUF I
> could push it down to 0 to 2-3 drops per second.
> The counters are queried about every five minutes, so this is about a
> five minute average; as I've said, the drops come in batches, like this
> (there is a one second sleep between printouts):
> Should multi threading help with this (so the cache management could run in
> its own thread, so maybe it doesn't affect other threads' performance - I
> don't know about the details)?
> BTW, with threading I have two problems:
> - it slows down normal operation (at least on FreeBSD, so it means lower
> qps)
> - it increases the cache memory usage, which of course makes the hit
> ratio slightly worse
>
> So to summarize the above:
> - I think it would be good to have a receive socket size configuration
> option, which would make it possible to change SO_RCVBUF size only on the
> listening socket(s). The patch is trivial, but I'm sure you can do it
> better/faster :)
> - what could be done to make unbound not block for about a second and
> drop a lot of queries (is it possible with some cache tuning, or is
> changing code needed, or is it just an impossible thing with an
> unthreaded application?)
>
> ps: using memcached as unbound's cache store is a long-standing request on
> my part, I would be curious whether it's affected by the same problems,
> or not (it also has an asynchronous mono threaded run-mode)...
>
> Thanks and sorry for the long message.
>
> On 08/14/09 08:57, W.C.A. Wijngaards wrote:
>> Hi Felix,
>>
>> Can you tell me the names of the queries (off-list?).
>>
>> Perhaps they are misconfigured domains of some sort.
>> I believe I solved Attila's problems from June.
>>
>> If you dig for them yourself, do you get an answer?
>> And with dig +cdflag (to see if they are 'bogus').
>>
>> Best regards,
>>     Wouter
>>
>> On 08/13/2009 07:28 PM, Felix Schueren wrote:
>>
>>> Hello,
>>>
>>> and sorry for starting a new thread when the last one is from June - I
>>> subscribed only recently and the mail archive does not show mail ids.
>>> Anyway, I'm referring to the thread from Attila Nagy: "No answers from
>>> unbound occasionally" - we're seeing much the same symptoms (using
>>> 1.3.2), i.e. unbound sometimes not answering queries, and debugging is a
>>> pain - we're running a load-balanced setup with currently 3 unbound
>>> nodes, around 22k q/s, each unbound node doing ~6-8k qps. It usually
>>> works fast & fine, but some queries appear to get eaten - we're getting
>>> occasional dns resolution errors when using unbound as cache that we
>>> never got with our dnscache (djb) setup. I've debugged as far as seeing
>>> that the queries reach unbound (they get logged), but I don't know
>>> whether unbound answered and the packets got lost, or if unbound simply
>>> did not answer at all.
>>>
>>> Any ideas on debugging this? "Occasionally" means a couple 100 queries
>>> per day (out of roughly 1.1G total queries). My "exceeded" (jostle)
>>> counter is 0, average number of waiting requests is <10 per host.
>>>
>>> Kind regards,
>>>
>>> Felix
>>>