Maintained by: NLnet Labs

Unbound: Howto Optimise

By W.C.A. Wijngaards, NLnet Labs, October 2008.

This howto is a guide to optimising unbound. Most users do not have to do this, but it can be useful for large resolver installations. The text below is the result of feedback from unbound users; if you have different experiences or recommendations, let me know.

Config setup

Set num-threads equal to the number of CPU cores on the system. E.g. for 4 CPUs with 2 cores each, use 8.

Set *-slabs to a power of 2 close to the num-threads value. Do this for msg-cache-slabs, rrset-cache-slabs, infra-cache-slabs and key-cache-slabs. This reduces lock contention.
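
The rule above can be sketched in a few lines of Python. This is only an illustration: the helper names nearest_power_of_two and slab_settings are hypothetical and not part of unbound, and sending a tie (such as 6, which is equally far from 4 and 8) to the lower power is an assumption, since the text only asks for a power of 2 "close to" num-threads.

```python
SLAB_OPTIONS = (
    "msg-cache-slabs",
    "rrset-cache-slabs",
    "infra-cache-slabs",
    "key-cache-slabs",
)


def nearest_power_of_two(n: int) -> int:
    """Return the power of 2 closest to n; a tie goes to the lower power."""
    if n < 1:
        raise ValueError("n must be at least 1")
    lower = 1 << (n.bit_length() - 1)  # largest power of 2 that is <= n
    upper = lower * 2                  # smallest power of 2 that is > n
    return lower if n - lower <= upper - n else upper


def slab_settings(num_threads: int) -> list:
    """Render the num-threads line and the four *-slabs lines."""
    slabs = nearest_power_of_two(num_threads)
    lines = [f"num-threads: {num_threads}"]
    lines += [f"{option}: {slabs}" for option in SLAB_OPTIONS]
    return lines


# 4 CPUs with 2 cores each, as in the example above.
for line in slab_settings(4 * 2):
    print(line)
```

For 8 threads every slab option comes out as 8; for 6 threads this sketch picks 4, though 8 would satisfy the rule equally well.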

Increase the memory size of the cache. Use roughly twice as much rrset cache memory as you use msg cache memory. For example, rrset-cache-size: 100m and msg-cache-size: 50m. Due to malloc overhead, the total memory usage is likely to rise to double (or 2.5x) the total cache memory that is entered into the config.
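
As a rough sizing aid, the arithmetic above can be written out as follows. The function cache_plan is a hypothetical helper, not an unbound tool, and the 2x to 2.5x factors are simply the estimate quoted above; actual usage depends on the malloc implementation.

```python
def cache_plan(msg_cache_mb: int) -> dict:
    """Derive rrset cache size and expected memory use from the msg cache size."""
    if msg_cache_mb < 1:
        raise ValueError("msg_cache_mb must be at least 1")
    rrset_cache_mb = 2 * msg_cache_mb            # rrset cache is twice the msg cache
    configured_mb = msg_cache_mb + rrset_cache_mb
    return {
        "msg-cache-size": f"{msg_cache_mb}m",
        "rrset-cache-size": f"{rrset_cache_mb}m",
        "configured_mb": configured_mb,
        "expected_low_mb": 2 * configured_mb,        # about double
        "expected_high_mb": configured_mb * 5 // 2,  # about 2.5x
    }


plan = cache_plan(50)
print("rrset-cache-size:", plan["rrset-cache-size"])
print("msg-cache-size:", plan["msg-cache-size"])
print("expect", plan["expected_low_mb"], "to", plan["expected_high_mb"], "MB in use")
```

With the example values (100m and 50m) the configured total is 150 MB, so the process may grow to roughly 300 to 375 MB.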

Set the outgoing-range to as large a value as possible; see the sections below on how to overcome the limit of 1024 file descriptors in total. This services more clients at a time. With 1 core, try 950. With 2 cores, try 450. With 4 cores, try 200. If you enter values larger than 1024, increase the num-queries-per-thread value as well.
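
The per-core values above follow from dividing the 1024 descriptor limit over the cores and leaving some headroom (1024/cores - 50). A minimal sketch of that calculation, assuming a hypothetical helper named suggested_outgoing_range; the values suggested in the text (950, 450, 200) are rounded down a little further than this formula gives.

```python
def suggested_outgoing_range(cores: int, fd_limit: int = 1024, headroom: int = 50) -> int:
    """Split the descriptor limit over the cores and keep some headroom per core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    value = fd_limit // cores - headroom
    if value < 1:
        raise ValueError("too many cores for this descriptor limit")
    return value


for cores in (1, 2, 4):
    print(cores, "core(s): outgoing-range at most", suggested_outgoing_range(cores))
```

This gives upper bounds of 974, 462 and 206 for 1, 2 and 4 cores; picking a slightly lower round number, as the text does, leaves extra room for other descriptors.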

Here is a short summary of the optimisation config:

# some optimisation options.
server:
        # use all CPUs
        num-threads: <number of cores>  
	
        # power of 2 close to num-threads  
        msg-cache-slabs: <same>
        rrset-cache-slabs: <same>
        infra-cache-slabs: <same>
        key-cache-slabs: <same>

        # more cache memory, rrset=msg*2
        rrset-cache-size: 100m
        msg-cache-size: 50m

        # more outgoing connections
        # depends on number of cores: 1024/cores - 50 
        outgoing-range: 950

The default setup works fine, but when a large number of users have to be served, the limits of the system are reached. The most pressing limit is the number of file descriptors: by default, at most 1024 can be used. To use more than 1024 file descriptors, use libevent or the forked operation method. These are described in the sections below.

Using Libevent

Libevent is a BSD-licensed, cross-platform wrapper around platform-specific event notification system calls. Unbound can use it to efficiently use more than 1024 file descriptors. Install libevent (and libevent-devel, if it exists) with your favorite package manager. Before compiling unbound, run ./configure --with-libevent.

Now you can give any number you like for outgoing-range. Also increase the num-queries-per-thread value.

        # with libevent
        outgoing-range: 4096
        num-queries-per-thread: 4096 

Users report that libevent-1.4.8-stable works well, and have confirmed this on Linux and FreeBSD with 4096 or 8192 as values. Some distributions package older versions (such as libevent-1.1), for which there are crash reports, so you may need to upgrade your libevent. Unbound can compile from the libevent build directory to make this easy: ./configure --with-libevent=/home/user/libevent-1.4.8-stable.

Note: if you experience crashes anyway, try the following. First, update libevent. If the problem persists, libevent can be made to use different system-call back-ends by setting environment variables. Unbound reports the back-end in use when verbosity is at level 4. By setting EVENT_NOKQUEUE, EVENT_NODEVPOLL, EVENT_NOPOLL, EVENT_NOSELECT, EVENT_NOEPOLL or EVENT_NOEVPORT to yes in the shell before you start unbound, some back-ends can be excluded from use. The poll(2) back-end is reliable, but slow.
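
If unbound is started from a script rather than an interactive shell, the same environment variables can be set programmatically. The sketch below only builds the environment; the helper env_excluding and the short back-end keys are hypothetical names chosen for this example, while the variable names are the ones listed above.

```python
import os

# Environment variables named in the text, keyed by a short back-end label.
BACKEND_VARS = {
    "kqueue": "EVENT_NOKQUEUE",
    "devpoll": "EVENT_NODEVPOLL",
    "poll": "EVENT_NOPOLL",
    "select": "EVENT_NOSELECT",
    "epoll": "EVENT_NOEPOLL",
    "evport": "EVENT_NOEVPORT",
}


def env_excluding(backends, base_env=None) -> dict:
    """Return a copy of the environment with the given back-ends excluded."""
    env = dict(os.environ if base_env is None else base_env)
    for name in backends:
        if name not in BACKEND_VARS:
            raise ValueError(f"unknown back-end: {name}")
        env[BACKEND_VARS[name]] = "yes"
    return env


# Exclude epoll and kqueue; an empty base keeps the printout short.
for key, value in sorted(env_excluding(["epoll", "kqueue"], base_env={}).items()):
    print(f"{key}={value}")
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching unbound. Which back-ends to exclude depends on the platform, so check what unbound reports at verbosity level 4 first.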

There is also libev, a libevent-compatible library that can be used instead of libevent. It claims to be faster than libevent.

Forked operation

Unbound has a unique mode where it can operate without threading. This can be useful if libevent fails on the platform, for extra performance, or for creating walls between the cores so that one cannot poison another.

To compile for forked operation, run ./configure --without-pthreads --without-solaris-threads before compiling; this disables threads and enables forked operation. Because no locking has to be done, the code speeds up (by about 10 to 20%).

In the config file, num-threads still specifies the number of cores you want to use (even though it uses processes and not threads). Note that the outgoing-range and cache memory values are all per thread, which here means per process. This means that much more memory is used, as every core uses its own cache. Because every core has its own cache, if one gets cache poisoned, the others are not affected.

# with forked operation
server:
        # use all CPUs
        num-threads: <number of cores>  
	
        msg-cache-slabs: 1
        rrset-cache-slabs: 1
        infra-cache-slabs: 1
        key-cache-slabs: 1

        # more cache memory, rrset=msg*2  
        # total usage is 150m*cores 
        rrset-cache-size: 100m
        msg-cache-size: 50m

        # does not depend on number of cores 
        outgoing-range: 950
        num-queries-per-thread: 950 

Because every process now uses at most 1024 file descriptors, the effective maximum is the number of cores * 1024. The config above uses 950 per process, which for 4 processes gives a respectable 3800 simultaneous recursions.
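
Since every value is per process in forked operation, the totals scale with the number of processes. A small sketch of that arithmetic, using a hypothetical helper forked_totals with the example values from the config above as defaults; the cache figure is the configured amount only and does not include the malloc overhead mentioned earlier.

```python
def forked_totals(processes: int, outgoing_range: int = 950,
                  rrset_cache_mb: int = 100, msg_cache_mb: int = 50) -> dict:
    """Total recursions and configured cache memory across all forked processes."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return {
        "simultaneous_recursions": processes * outgoing_range,
        "configured_cache_mb": processes * (rrset_cache_mb + msg_cache_mb),
    }


totals = forked_totals(4)
print("simultaneous recursions:", totals["simultaneous_recursions"])
print("configured cache memory (MB):", totals["configured_cache_mb"])
```

For 4 processes this gives 3800 simultaneous recursions and 600 MB of configured cache (150m per process).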

Using forked operation together with libevent is also possible. It may be useful to force the OS to service the file descriptors for different processes, instead of threads. This may have (radically) different performance if the underlying network stack uses (slow) lookup structures per-process.