Maintained by: NLnet Labs

[Unbound-users] Unbound stops responding

W.C.A. Wijngaards
Mon Aug 26 10:35:01 CEST 2013


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Petter,

This is because of windows, it does not implement line buffering.
The posix: setvbuf(f, NULL, (int)_IOLBF, 0);
does not work, and it does not print whole lines to the file.  The
threads are intermixing the log output.

You should turn down the log level, by the way, also for performance
reasons.  To, perhaps, 1.  At this level, either the logfile is fine
with lots of threads, or you can disable logging to file and log to
'syslog' which is the windows application event log on windows.  And
that is likely to be threadsafe.

Another fix could be to open different files (logfile.0, logfile.1...)
for different threads on windows, but that would need code to support it.

Best regards,
   Wouter

On 08/20/2013 01:43 PM, Petter Lindgren wrote:
> I've now implemented the values you suggested and Unbound seems to
> work well with 10 threads except for the logging. After the config
> change; unbound started to write gibberish in the logfile. It does
> the same on all three servers:
> 
> My optimization options:
> 
> num-threads: 10
> 
> msg-cache-slabs: 8 rrset-cache-slabs: 8 infra-cache-slabs: 8 
> key-cache-slabs: 8
> 
> rrset-cache-size: 512m msg-cache-size: 256m key-cache-size: 64m
> 
> outgoing-range: 50 num-queries-per-thread: 25
> 
> 
> My logfile after the change:
> 
> 2013-08-20 12:31:23 unbound[1272:5] info: server stats for thread
> 5: 244009 queries, 199923 answers from cache, 44086 recursions, 0
> prefetch 2013-08-20 12:31:23 unbound[1272:5] info: server stats for
> thread 5: requestlist max 26 avg 2.80531 exceeded 1 jostled 0 
> 2013-08-20 12:31:23 unbound[1272:5] info: average recursion
> processing time 2.004875 sec 2013-08-20 12:31:23 unbound[1272:5]
> info: histogram of recursion processing times 2013-08-20 12:31:23
> unbound[1272:5] info: [25%]=8.67612e-007 median[50%]=0.0263988
> [75%]=0.101194 2013-08-20 12:31:23 unbound[1272:5] info:
> lower(secs) upper(secs) recursions 2013-08-20 12:31:23
> unbound[1272:5] info:    0.000000    0.000001 12758 2013-08-20
> 12:31:23 unbound[1272:5] info:    0.008192    0.016384 6306 
> 2013-08-20 12:31:23 unbound[1272:5] info:    0.016384    0.032768
> 5029 2013-08-20 12:31:23 unbound[1272:5] info:    0.032768
> 0.065536 7103 2013-08-20 12:31:23 unbound[1272:5] info:    0.065536
> 0.131072 3696 2013-08-20 12:31:23 unbound[1272:5] info:    0.131072
> 0.262144 5122 2013-08-20 12:31:23 unbound[1272:5] info:    0.262144
> 0.524288 2391 2013-08-20 12:31:23 unbound[1272:5] info:    0.524288
> 1.000000 519 2013-08-20 12:31:23 unbound[1272:5] info:    1.000000
> 2.000000 166 2013-08-20 12:31:23 unbound[1272:5] info:    2.000000
> 4.000000 73 2013-08-20 12:31:23 unbound[1272:5] info:    4.000000
> 8.000000 39 2013-08-20 12:31:23 unbound[1272:5] info:    8.000000
> 16.000000 50 2013-08-20 12:31:23 unbound[1272:5] info:   16.000000
> 32.000000 117 2013-08-20 12:31:23 unbound[1272:5] info:   32.000000
> 64.000000 185 2013-08-20 12:31:23 unbound[1272:5] info:   64.000000
> 128.000000 361 2013-08-20 12:31:23 unbound[1272:5] info:
> 128.000000  256.000000 341 2013-08-20 12:31:23 unbound[1272:5]
> info:  256.000000  512.000000 17 2013-08-20 12:31:23
> unbound[1272:5] info:  512.000000 1024.000000 3 2013-08-20 12:31:23
> unbound[1272:8] info: server stats for thread 8: 48 queries, 47
> answers from cache, 1 recursions, 0 prefetch 2013-08-20 12:31:23
> unbound[1272:8] info: server stats for thread 8: requestlist max 0
> avg 0 exceeded 0 jostled 0 2013-08-20 12:31:23 unbound[1272:8]
> info: average recursion processing time 0.249601 sec 2013-08-20
> 12:31:23 unbound[1272:8] info: histogram of recursion processing
> times 2013-08-20 12:31:23 unbound[1272:8] info: [25%]=0
> median[50%]=0 [75%]=0 2013-08-20 12:31:23 unbound[1272:8] info:
> lower(secs) upper(secs) recursions 2013-08-20 12:31:23
> unbound[1272:8] info:    0.131072    0.262144 1 20220113-0830-2-00
> 81-1220: 3311-08:22:-321:0 424 1
> 220u1:3nbouunnd[-12720b3o81:-2:462u ]u i0n ndf[no1b:122
> 7:32o1u::nsde2[49r]  12u7i2n:fnovb0eoru n:s ds[et1]ar2v7ert s2s
> :tfoa7tr]s  i fn ofroi tt: hsnherrfrveeeo:ard  6 :asde rs t9:0
> vqeua t0s e rfqourerr iseit setsa,th,  0s0r e afd oaanrs ns
> tw7weeh:r rsr0es  a ffdrq o0mr cu:a eomcr 0he cquaeirciee,sh es,,,
> 00  00  reaacurrescinonunssswwrese,rrss i fo r0o pfmr ncerfosmea
> tccah, c0ch hpeer,e,  00 f reetccuhr rsi2o020n1133s-,- 008
> -0p82-r200 e1 21:f23e:t311::2c2h44 u
> nuboneucnbduor[un1ds2io7n2[:1s267]2 2i0n:f9]o,:  0in1 3pfo:r
> e-ssefe0e8t-r2rc0vvehe rr 1s 2tsatt ast: sf ofro3 r tt1hh:rr2eea4d
> a6: requestlidu ns9b2:0t  r1oue3mq-anx0d[u1 8e207 s2-:27]0 t
> lai1v2gist  :0 3nmaefxxc 1ee:o0:d e24ad vsg e ur v0neb0  oer
> juoxnstcdasettle[dseed1d 2f o7r2  0t0: 0 hjr]o easdi t7lnefdo :0:
> r seerver sqtuaetstsl isfto rm axt h0r aevga 0d e x0ce:e derd e0q
> ujoesstlteld i0s t max 0 avg 0 exceeded 0 jostled 0 
> 202210310-1330-8--02808--22000   112:31:21242: 3:unbo13:1u2:4n2 4d
> [u1un2b7no2b:ou2un]n didn[f1o2:72[ :1s1e]2 7river 2n:sta3ts ffoo]:
> ri ns erver tfosth:atr esasd e r2v:e rf0 o sqtraut estrhir efeosar,
> t0d  ahn1s:r w0e erqsu efardir o3me s:c ,ach e, 000  a
> nqswreuceruse rfrorms ciaoinescsh,,  00 e,  p0 arreecfuerstiocnhs 
> nwse,r s0  fprroemf ectacchh e2,0 103 -r0e8c-u2r0s i1o2n:s3,1 :02
> 4p ruenfbetoc2hu0 nd1[31-20782-:220]  1i2n:f3o1:: 2s4e ruvnebro
> usntda[ts1 2f72o02r1 3:t1-] info0:h8r-e2 a0ds e r122v:e3:r
> 1rs:e2qt4ua eutsstlni sfort  tbhroemuad 1: axnrequ 0 aedvgstl
> [i0s1t m2ax  0e avg7 02x e:c3e]e dxceeidneedd f0 joo :st ls0 ejdo
> es0tl edr v0e r stats for thread 3: requestlist max 0 avg 0
> exceeded 0 jostled 0 2013-08-20 12:31:24 unbound[1272:4] info:
> server stats for thread 4: 0 queries, 0 answers from cache, 0
> recursions, 0 prefetch 2013-08-20 12:31:24 unbound[1272:4] info:
> server stats for thread 4: requestlist max 0 avg 0 exceeded 0
> jostled 0
> 
> Regards _________________________________________ /Petter Lindgren
> 
> -----Original Message----- From: unbound-users-bounces at unbound.net
> [mailto:unbound-users-bounces at unbound.net] On Behalf Of Petter
> Lindgren Sent: den 16 augusti 2013 16:23 To: W.C.A. Wijngaards;
> unbound-users at unbound.net Subject: Re: [Unbound-users] Unbound
> stops responding
> 
> Thanks for your quick replies!
> 
> I'm not sure why we had such a spike in traffic but I'm quite sure
> it's because the outage triggered all the Windows clients to
> perform multiple queries during the outage.
> 
> We do have stub definitions in our config so it should have worked
> if it didn't stop responding under such heavy load.
> 
> I wasn't aware of the 64 limit on Windows I will change the
> num-thread, outgoing-range, num-queries-per-thread as you
> suggested.
> 
> Do you think this will be enough or should we consider to switch to
> Linux or FreeBSD?
> 
> Regards,
> 
> Petter Lindgren
> 
> 
> Hi Petter,
> 
> Responding to your second question: how to keep up during such an
> outage.
> 
> Local queries should then not depend on the internet provider.  Put
> a stub definition in unbound.conf for your intranet name(s) and
> point it at the IP-addresses of your local authoritative DNS
> server(s).  In this way, local lookups for unbound do not need to
> cross over the internet and can be resolved normally even while the
> ISP is down.
> 
> Unbound can deal with the situation that a part of the queries
> resolves normally, but another part does not resolve (at all) and
> unbound is swamped with queries for the slow part.  It should be
> able to resolve the fast queries and slowly keep working on (a
> fraction of) the slow queries.
> 
> For a capacity increase you would need to use Linux somehow (or
> FreeBSD).  On windows, increase the num-threads (even above number
> of CPUs, because it gives another set of sockets for every
> thread). num-thread: 10, and every thread about 50 sockets
> outgoing-range, and a low num-queries-per-thread too, say 25, for
> 500 sockets instead of 64, and 250 num-queries in total.  Does that
> work?  It defines num-queries smaller than the outgoing-range.
> 
> Best regards, Wouter
> 
> On 08/16/2013 03:24 PM, W.C.A. Wijngaards wrote:
>> Hi Petter,
> 
>> On windows we currently have a 64 limit, not a 1024 limit.  This
>> is because of a windows limit for its waitformultipleevents
>> winsock function.  And we need to wait for 'locks' for
>> inter-thread communication.  Windows does not implement
>> unix-pipes that we use on Unix.
> 
>> So, the machine is massively not able to keep up with the work.
> 
>> When you have the issue, the number of cache hits is roughly the
>> same (this indicates it did respond to cache hits?).  But the
>> non-cache is 600.000 (20x the normal number).  This causes the
>> CPU spike.  And it cannot answer any of these questions because
>> it has no sockets, and the upstream is unresponsive anyway.
> 
>> I think the 20x traffic increase may be why it does not respond
>> to cache responses.  It should have responded to cache responses,
>> but if the network stack is swamped with packets, unbound does
>> not have a chance.
> 
>> Or is the machine easily capable of 20x the network load? (note
>> that UDP load may be different from other load).  Then we should
>> look at the performance under this denialofservice condition. 
>> Especially I am worried because your num-queries-per-thread is a
>> lot bigger than what you have sockets for, which does not happen 
>> (usually) for unices.
> 
>> Best regards, Wouter
> 
>> On 08/16/2013 02:38 PM, Petter Lindgren wrote:
>>> We've used Unbound for a few months for around 15 000 clients I
>>> our company.
> 
> 
> 
>>> We recently had an issue when our Internet provider had a big
>>> network error.
> 
>>> This meant that our three unbound resolvers couldn't perform 
>>> recursion.
> 
> 
> 
>>> But during the outage, they stopped responding to any queries
>>> and the CPU spiked.
> 
>>> This meant that we couldn't resolve local names.
> 
>>> After the outage was resolved, they started responding as usual
>>>  again.
> 
> 
> 
>>> Statistics before outage:
> 
>>> 2013-08-14 11:40:21 unbound[1180:0] info: server stats for
>>> thread 0: 137441 queries, 107082 answers from cache, 30359
>>> recursions, 0 prefetch
> 
>>> 2013-08-14 11:40:21 unbound[1180:0] info: server stats for
>>> thread 0: requestlist max 38 avg 4.54215 exceeded 0 jostled 0
> 
>>> 2013-08-14 11:40:21 unbound[1180:0] info: average recursion 
>>> processing time 0.120068 sec
> 
>>> 2013-08-14 11:40:21 unbound[1180:0] info: histogram of
>>> recursion processing times
> 
>>> 2013-08-14 11:40:21 unbound[1180:0] info: [25%]=0.00974991 
>>> median[50%]=0.0314827 [75%]=0.162277
> 
> 
> 
>>> Statistics during outage:
> 
>>> 2013-08-14 12:40:21 unbound[1180:0] info: server stats for
>>> thread 0: 717834 queries, 79368 answers from cache, 638466
>>> recursions, 0 prefetch
> 
>>> 2013-08-14 12:40:21 unbound[1180:0] info: server stats for
>>> thread 0: requestlist max 893 avg 649.479 exceeded 520726
>>> jostled 90153
> 
>>> 2013-08-14 12:40:21 unbound[1180:0] info: average recursion 
>>> processing time 16.060234 sec
> 
>>> 2013-08-14 12:40:21 unbound[1180:0] info: histogram of
>>> recursion processing times
> 
>>> 2013-08-14 12:40:21 unbound[1180:0] info: [25%]=0.0128234 
>>> median[50%]=0.046963 [75%]=0.241531
> 
> 
> 
> 
> 
>>> What can I do to make sure that Unbound still responds to a
>>> local query when there is an outage?
> 
> 
> 
>>> We run Unbound on Windows Server 2012 x64 (because we have more
>>>  competence using Windows)
> 
> 
> 
>>> Is it possible to overcome the 1024 limit for /outgoing-range/
>>> and /num-queries-per-thread/ when running Windows and would it
>>> have helped?
> 
> 
> 
>>> I would really appreciate comments and suggestions!
> 
> 
> 
>>> Thanks!
> 
>>> / /
> 
>>> /_________________________________________/
> 
>>> /Petter Lindgren/
> 
> 
> 
> 
> 
>>> _______________________________________________ Unbound-users
>>> mailing list Unbound-users at unbound.net 
>>> http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
> 
> 
>> _______________________________________________ Unbound-users
>> mailing list Unbound-users at unbound.net 
>> http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
> 
> 
> _______________________________________________ Unbound-users
> mailing list Unbound-users at unbound.net 
> http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
> 
> _______________________________________________ Unbound-users
> mailing list Unbound-users at unbound.net 
> http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSGxMxAAoJEJ9vHC1+BF+Na7EP/j4B87TcweYwYHZq4e+TTbaY
gDWurM65ya00YtC156wOSUYLf64gDkxksaHkP98F89EeBj/AKzZ9TdBmJwp6CANi
kTMT4fcMo3B0hi8AJhqBfSHRX/7Xrp+64B9O+zdnlrgsZunE0/limelU3OV6BVja
jpBFiLeqi9OvogHkoyEibgJRJIhKkQ74ibl0khrWAelovaVk7m1td+J/gCVuQmwJ
806j/guakchIQeBrKRTs1mdiM9YEXHhMQW779L+pUROk0TEcKCCi86M2DMGkG/pm
hJuroN5+/lAmPdMV8NKdSpujrcsXkuxdhRY/LtMO9eWyYJaOj5O0mgPp3RjB8zqr
/RcejA0/3Tjs/e8WZPr0zPL6s94vqjGYr7UD6JwsxaulFtsumgJl6xL+anetMXaD
Yi1V5VUss/E2NxdetwQOtUdQATe4Rf22ntkAHRKFkunL2E1A3J+BRCH0/9XOudmV
5zjgrOV+UWTRGm5PZLbBtuOVMkUqACfL5+v2V3ZRKVtio4j+JPUIyv0LHV99uIb7
VmUJZkT1gGtXPVDTMWDkMyI4X/DFytBCIQ7QxV2JX17NNgpjdASB435By4nOM1UY
iUKNXBtmBqAOHFEjlOrjZjb0d3n/xIio5zKZeOeLwMPeAtO2nUwZOk6XOwVClnuN
5Wx+WcQ5SiBDSDxKrZcx
=Xt6R
-----END PGP SIGNATURE-----