Maintained by: NLnet Labs

[Unbound-users] Unbound multithread performance: an investigation into scaling of cache response qps

W.C.A. Wijngaards
Wed Mar 24 11:08:59 CET 2010


Hi Aaron,

On 03/23/2010 04:05 PM, Aaron Hopkins wrote:
> On Tue, 23 Mar 2010, W.C.A. Wijngaards wrote:
>> The performance scales up fairly neatly as multi-threading goes.  For
>> every configuration a slower-than-linear speedup is observed, indicating
>> locks in the underlying operating system network stack.
> There was no lock contention within unbound?  I don't know how to measure
> this on Solaris, but did you?

Yes, it is visible.  The no-threads version of unbound has no lock code
in it at all (the locks are macroed away), and thus has no lock
contention.  It has a slightly better graph than the versions with locks
(maybe a 5% difference at 4 cores).  So there is contention in unbound.
In this example, with all queries for the same cache element, the
contention should be as high as it gets, I think.

>> There is only one network card, after all, and the CPUs have to lock and
>> synchronise with it.
> This should be true even with multiple processes, however.

Yes, this is what we see in the no-threads results.  Those use
processes.  But they still bind to the same port 53 socket.

> This may not be true for Solaris, but you might try having unbound listen
> on multiple ports and spread requests across them and see if it matters.

Yes, I have tried this.  I got 2 more test machines to send queries
from, and modified unbound to open num_threads UDP ports, where the
Nth worker listens on UDP port N.

A control check, with four perfs running towards unbound.
evport, forked, 4senders:  9619  15860  19010  21979
evport, forked, 2senders:  9700  17300  19600  22300
Similar, slightly slower.

The special version, where every process listens on its own UDP port and
each perf sends to one of those ports: evport, forked, process0 and
perf0 use port 30053, process1 and perf1 use port 30054, process2 and
perf2 use port 30055, process3 and perf3 use port 30056.
evport, forked, special:   10000  18783  23461  25797

This is faster, though still not linear.

In this test unbound has forked processes that do not use mutexes or
any other pthread locking.  They all have a copy of the same
file-descriptor table.  But the list of fds passed to evport is
different (same TCP, but different UDP) for every process.  There are
also some pipes in the background for interprocess communication, but
those are silent during the test.

> The last time I looked, recent-ish Linux 2.6 still had per-socket locking
> even in the face of multiple network cards.  This means that multiple
> threads or even multiple processes sharing a UDP socket can't really exceed
> one CPU's worth of raw sendto() performance sourced from the same socket.
> You can get much closer to linear scalability by binding to a different
> port or IP per CPU.

Not sure it is worth it.  Maybe some modifications can be made to the
UDP stack to make it more linear, but I do not know how.

Best regards,