Maintained by: NLnet Labs

deadlock in unbound-control commands

Poprocks
Mon Aug 24 01:52:16 CEST 2015


Hello,

I have encountered an issue in which unbound appears to hang when issuing commands via unbound-control. I am running unbound 1.5.4 built on an Ubuntu 12.04 system. I've configured unbound with the same options available from the unbound package for Ubuntu precise:

configured for x86_64-unknown-linux-gnu on Wed Feb  8 05:48:44 UTC 2012 with options: '--prefix=/usr' '--sysconfdir=/etc' '--disable-rpath' '--with-pidfile=/var/run/unbound.pid' '--with-libevent' '--with-pythonmodule' '--with-pyunbound'

This issue is somewhat difficult to reproduce, as it doesn't manifest itself regularly (perhaps 1 in 100 unbound-controll calls), but I have captured what info I can. I see first that the unbound-control shell call hangs; strace of the unbound-control process shows it is blocked on a read() call, reading from the unbound daemon. Unbound itself is hanging at the following syscall:


root at server:~# strace -p 538
Process 538 attached - interrupt to quit
futex(0x2f8ad3c, FUTEX_WAIT_PRIVATE, 0, NULL


Below is a backtrace, showing what appears to be a deadlock in local_zones_add_RR:


(gdb) bt full
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals.
#1  0x000000000040eccd in local_zones_add_RR (zones=0x2026d70, rr=0x38c7dcec35c "privfoo.p-hex.com IN A 10.10.10.68")
    at services/localzone.c:1352
        lockret_err = <optimized out>
        rr_name = 0x2a497d0 "\aprivfoo\005p-hex\003com"
        rr_class = 1
        len = 19
        labs = 4
        z = <optimized out>
        r = <optimized out>
#2  0x0000000000429011 in execute_cmd.7956 (rc=0x2021930, ssl=0x2a45430, cmd=<optimized out>, worker=0x1ef3010) at daemon/remote.c:1146
        arg = 0x38c7dcec35c "privfoo.p-hex.com IN A 10.10.10.68"
        worker = 0x1ef3010
        ssl = 0x2a45430
        p = 0x38c7dcec351 "local_data privfoo.p-hex.com IN A 10.10.10.68"
#3  0x000000000046ea9b in handle_req.isra.19 (rc=0x2021930, ssl=0x2a45430) at daemon/remote.c:2555
        r = <optimized out>
        pre = "UBCT1 \000\000", <incomplete sequence \352>
        magic = "UBCT1 "
        buf = " local_data privfoo.p-hex.com IN A 10.10.10.68\000\000\f\000\000\000\000\000\000\000\240\306\316}\214\003\000\000\005\000\000\000\000\000\000\000@\252\f\260<\003\000\000pi\244\002\000\000\000\000\355s֯<\003\000\000\220\304\316}\214\003\000\000\002E߯<\003\000\000\200\257\244\002\000\000\000\000\250\222U\260<\003\000\000\326\301b\330֡Cˠ\306\316}\214\003\000\000\f\000\000\000\326QS\330\000\001\004\000\000\000\000\000\340\304\316}\214\003\000\000\244W\244\002\000\000\000\000\060\000\000\000\060\000\000\000\364pW\260<\003\000\000 \306\316}\214\003", '\000' <repeats 18 times>, "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060", '\000' <repeats 15 times>, "0\000\000\000\005\000\000\000\200"...
#4  0x000000000046ec09 in remote_control_callback (c=<optimized out>, arg=0x2a341e0, err=<optimized out>, rep=<optimized out>)
    at daemon/remote.c:2624
        s = 0x2a341e0
        rc = 0x2021930
        r = 1
#5  0x0000033cb02f89cc in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#6  0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:305
---Type <return> to continue, or q <return> to quit---
        retval = <optimized out>
#7  0x0000000000467ac6 in worker_work (worker=<optimized out>) at daemon/worker.c:1311
No locals.
#8  daemon_fork (daemon=0x1e723f0) at daemon/daemon.c:566
No locals.
#9  0x0000000000411eaa in run_daemon (debug_mode=0, cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf")
    at daemon/unbound.c:671
        cfg = <optimized out>
        daemon = <optimized out>
        done_setup = 1
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:766
        c = <optimized out>
        cfgfile = 0x48bc49 "/etc/unbound/unbound.conf"
        winopt = <optimized out>
        cmdline_verbose = 0
        debug_mode = 0


The local_zones struct 'zones' passed into local_zones_add_RR appears as such:


(gdb) print *(struct local_zones *)0x2026d70
$2 = {lock = {__data = {__lock = 0, __nr_readers = 1, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, 
      __nr_writers_queued = 1, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, 
    __size = "\000\000\000\000\001", '\000' <repeats 15 times>, "\001", '\000' <repeats 34 times>, __align = 4294967296}, ztree = {
    root = 0x208e000, count = 103, cmp = 0x4489c0 <local_zone_cmp>}}


In another instance of this I saw this hang in daemon/remote.c:


(gdb) bt full
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals.
#1  0x0000000000428d00 in do_zone_remove (arg=<optimized out>, worker=0x15a6e90, ssl=0x21996e0) at daemon/remote.c:1131
        lockret_err = <optimized out>
        nm = <optimized out>
        nmlabs = <optimized out>
        nmlen = <optimized out>
        z = 0x3cab0594990
#2  execute_cmd.7956 (rc=0x1551a20, ssl=0x21996e0, cmd=0x3cab0594990 " local_zone_remove p-hex.com", worker=0x15a6e90)
    at daemon/remote.c:2462
        p = 0x3cab0594991 "local_zone_remove p-hex.com"
#3  0x000000000046ea9b in handle_req.isra.19 (rc=0x1551a20, ssl=0x21996e0) at daemon/remote.c:2555
        r = <optimized out>
        pre = "UBCT1 \000\000\020p"
        magic = "UBCT1 "
        buf = " local_zone_remove p-hex.com\000\000\000\000\f\000\000\000\000\000\000\000\240\236\031\002\000\000\000\000\f\000\000\000\000\000\000\000\340LY\260\312\003\000\000\005\000\000\000\000\000\000\000 at j.\252\212\002\000\000@\326\031\002\000\000\000\000\355\063\370\251\212\002\000\000\320JY\260\312\003\000\000\002\005\001\252\212\002\000\000\240\211\030\002\000\000\000\000\250Rw\252\212\002\000\000\326\301b\330֡C\313\340LY\260\312\003\000\000\f\000\000\000\326QS\330\000\001\004\000\000\000\000\000 KY\260\312\003\000\000T\232\031\002\000\000\000\000\060\000\000\000\060\000\000\000\364\060y\252\212\002\000\000`LY\260\312\003", '\000' <repeats 18 times>, "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060", '\000' <repeats 15 times>, "0\000\000\000\005"...
#4  0x000000000046ec09 in remote_control_callback (c=<optimized out>, arg=0x21884c0, err=<optimized out>, rep=<optimized out>)
    at daemon/remote.c:2624
        s = 0x21884c0
        rc = 0x1551a20
        r = 1
#5  0x0000028aaa5149cc in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#6  0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:305
        retval = <optimized out>
#7  0x0000000000467ac6 in worker_work (worker=<optimized out>) at daemon/worker.c:1311
No locals.
#8  daemon_fork (daemon=0x1526270) at daemon/daemon.c:566
No locals.
#9  0x0000000000411eaa in run_daemon (debug_mode=0, cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf")
    at daemon/unbound.c:671
        cfg = <optimized out>
        daemon = <optimized out>
        done_setup = 1
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:766
        c = <optimized out>
        cfgfile = 0x48bc49 "/etc/unbound/unbound.conf"
        winopt = <optimized out>
        cmdline_verbose = 0
        debug_mode = 0



In both of these cases it appears to be hanging on the 'lock_rw_t lock' member of the local_zones struct. Below is the relevant portions of the config (excluding only access control):


root at server:~# sed -e '/#.*$/d' -e '/^$/d' /etc/unbound/unbound.conf 
remote-control:
  control-enable: yes
server:
  verbosity: 4
  statistics-interval: 0
  statistics-cumulative: no
  extended-statistics: no
  num-threads: 1
  interface: 10.242.128.232
  port: 53
  do-ip4: yes
  do-udp: yes
  do-tcp: yes
  chroot: ""
  directory: "/etc/unbound"
  logfile: "/var/log/unbound.log"
  use-syslog: "no"
  pidfile: "/var/run/unbound.pid"
  auto-trust-anchor-file: /var/lib/unbound/root.key

root at server:~# unbound -h
usage:  unbound [options]
	start unbound daemon DNS resolver.
-h	this help
-c file	config file to read instead of /etc/unbound/unbound.conf
	file format is described in unbound.conf(5).
-d	do not fork into the background.
-v	verbose (more times to increase verbosity)
Version 1.5.4
linked libs: libevent 2.0.16-stable (it uses epoll), OpenSSL 1.0.1 14 Mar 2012
linked modules: dns64 validator iterator
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs at nlnetlabs.nl


I have found this problems occurs whether num-threads is set to 1 or 8 (number of cores in my system). There are roughly 15000 local-data entries in this machine, from half a dozen local-zone typetransparent entries. Unbound logging doesn't indicate any failure (the last entry is the one generated from the unbound-control call). This seems to occur most often when adding new local-data entries, but I have also seen this occur when calling 'local_zone_remove' (in this case it was also hanged on pthread_rwlock_wrlock). I have not been able to replicate this with the version of Unbound made available from the Ubuntu precise package (1.4.16). Please let me know what further information I can provide to assist in debugging this problem. Thank you!

-- 
Sincerely,
Robert Paprocki
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 835 bytes
Desc: Digital signature
URL: <https://unbound.nlnetlabs.nl/pipermail/unbound-users/attachments/20150823/4c1117f4/attachment.sig>