Maintained by: NLnet Labs

deadlock in unbound-control commands

W.C.A. Wijngaards
Mon Aug 24 12:54:17 CEST 2015


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Poprocks,

On 24/08/15 01:52, Poprocks via Unbound-users wrote:
> Hello,
> 
> I have encountered an issue in which unbound appears to hang when
> issuing commands via unbound-control. I am running unbound 1.5.4
> built on an Ubuntu 12.04 system. I've configured unbound with the
> same options available from the unbound package for Ubuntu
> precise:

Thank you for the report with all the details.  I could not reproduce
it, but code inspection reveals a deadlock that is caused when you
interrupt the printout of unbound-control list_local_data.

You have so much local data (10000s) that the interrupt carries back
to the daemon, and then there is a bug in the code path that deals
with the interrupted TLS stream.  That does not properly unlock the
zones.lock and thus the next lock attempt hangs.

It is fixed in the code repository for the next release.  Or, as a
workaround, do not ^C the printout of unbound-control list_local_data.

Best regards,
   Wouter

> 
> configured for x86_64-unknown-linux-gnu on Wed Feb  8 05:48:44 UTC
> 2012 with options: '--prefix=/usr' '--sysconfdir=/etc'
> '--disable-rpath' '--with-pidfile=/var/run/unbound.pid'
> '--with-libevent' '--with-pythonmodule' '--with-pyunbound'
> 
> This issue is somewhat difficult to reproduce, as it doesn't
> manifest itself regularly (perhaps 1 in 100 unbound-controll
> calls), but I have captured what info I can. I see first that the
> unbound-control shell call hangs; strace of the unbound-control
> process shows it is blocked on a read() call, reading from the
> unbound daemon. Unbound itself is hanging at the following
> syscall:
> 
> 
> root at server:~# strace -p 538 Process 538 attached - interrupt to
> quit futex(0x2f8ad3c, FUTEX_WAIT_PRIVATE, 0, NULL
> 
> 
> Below is a backtrace, showing what appears to be a deadlock in
> local_zones_add_RR:
> 
> 
> (gdb) bt full #0  pthread_rwlock_wrlock () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83 
> No locals. #1  0x000000000040eccd in local_zones_add_RR
> (zones=0x2026d70, rr=0x38c7dcec35c "privfoo.p-hex.com IN A
> 10.10.10.68") at services/localzone.c:1352 lockret_err = <optimized
> out> rr_name = 0x2a497d0 "\aprivfoo\005p-hex\003com" rr_class = 1 
> len = 19 labs = 4 z = <optimized out> r = <optimized out> #2
> 0x0000000000429011 in execute_cmd.7956 (rc=0x2021930,
> ssl=0x2a45430, cmd=<optimized out>, worker=0x1ef3010) at
> daemon/remote.c:1146 arg = 0x38c7dcec35c "privfoo.p-hex.com IN A
> 10.10.10.68" worker = 0x1ef3010 ssl = 0x2a45430 p = 0x38c7dcec351
> "local_data privfoo.p-hex.com IN A 10.10.10.68" #3
> 0x000000000046ea9b in handle_req.isra.19 (rc=0x2021930,
> ssl=0x2a45430) at daemon/remote.c:2555 r = <optimized out> pre =
> "UBCT1 \000\000", <incomplete sequence \352> magic = "UBCT1 " buf =
> " local_data privfoo.p-hex.com IN A
> 10.10.10.68\000\000\f\000\000\000\000\000\000\000\240\306\316}\214\003
\000\000\005\000\000\000\000\000\000\000@\252\f\260<\003\000\000pi\244\0
02\000\000\000\000\355s?<\003\000\000\220\304\316}\214\003\000\000\002E?
<\003\000\000\200\257\244\002\000\000\000\000\250\222U\260<\003\000\000\
326\301b\330?C?\306\316}\214\003\000\000\f\000\000\000\326QS\330\000\001
\004\000\000\000\000\000\340\304\316}\214\003\000\000\244W\244\002\000\0
00\000\000\060\000\000\000\060\000\000\000\364pW\260<\003\000\000
> \306\316}\214\003", '\000' <repeats 18 times>,
> "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060"
,
> '\000' <repeats 15 times>, "0\000\000\000\005\000\000\000\200"... 
> #4  0x000000000046ec09 in remote_control_callback (c=<optimized
> out>, arg=0x2a341e0, err=<optimized out>, rep=<optimized out>) at
> daemon/remote.c:2624 s = 0x2a341e0 rc = 0x2021930 r = 1 #5
> 0x0000033cb02f89cc in event_base_loop () from
> /usr/lib/libevent-2.0.so.5 No symbol table info available. #6
> 0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at
> util/netevent.c:305 ---Type <return> to continue, or q <return> to
> quit--- retval = <optimized out> #7  0x0000000000467ac6 in
> worker_work (worker=<optimized out>) at daemon/worker.c:1311 No
> locals. #8  daemon_fork (daemon=0x1e723f0) at daemon/daemon.c:566 
> No locals. #9  0x0000000000411eaa in run_daemon (debug_mode=0,
> cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf") at
> daemon/unbound.c:671 cfg = <optimized out> daemon = <optimized
> out> done_setup = 1 #10 main (argc=<optimized out>, argv=<optimized
> out>) at daemon/unbound.c:766 c = <optimized out> cfgfile =
> 0x48bc49 "/etc/unbound/unbound.conf" winopt = <optimized out> 
> cmdline_verbose = 0 debug_mode = 0
> 
> 
> The local_zones struct 'zones' passed into local_zones_add_RR
> appears as such:
> 
> 
> (gdb) print *(struct local_zones *)0x2026d70 $2 = {lock = {__data =
> {__lock = 0, __nr_readers = 1, __readers_wakeup = 0,
> __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued =
> 1, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags =
> 0}, __size = "\000\000\000\000\001", '\000' <repeats 15 times>,
> "\001", '\000' <repeats 34 times>, __align = 4294967296}, ztree =
> { root = 0x208e000, count = 103, cmp = 0x4489c0 <local_zone_cmp>}}
> 
> 
> In another instance of this I saw this hang in daemon/remote.c:
> 
> 
> (gdb) bt full #0  pthread_rwlock_wrlock () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83 
> No locals. #1  0x0000000000428d00 in do_zone_remove (arg=<optimized
> out>, worker=0x15a6e90, ssl=0x21996e0) at daemon/remote.c:1131 
> lockret_err = <optimized out> nm = <optimized out> nmlabs =
> <optimized out> nmlen = <optimized out> z = 0x3cab0594990 #2
> execute_cmd.7956 (rc=0x1551a20, ssl=0x21996e0, cmd=0x3cab0594990 "
> local_zone_remove p-hex.com", worker=0x15a6e90) at
> daemon/remote.c:2462 p = 0x3cab0594991 "local_zone_remove
> p-hex.com" #3  0x000000000046ea9b in handle_req.isra.19
> (rc=0x1551a20, ssl=0x21996e0) at daemon/remote.c:2555 r =
> <optimized out> pre = "UBCT1 \000\000\020p" magic = "UBCT1 " buf =
> " local_zone_remove
> p-hex.com\000\000\000\000\f\000\000\000\000\000\000\000\240\236\031\00
2\000\000\000\000\f\000\000\000\000\000\000\000\340LY\260\312\003\000\00
0\005\000\000\000\000\000\000\000 at j.\252\212\002\000\000@\326\031\002\00
0\000\000\000\355\063\370\251\212\002\000\000\320JY\260\312\003\000\000\
002\005\001\252\212\002\000\000\240\211\030\002\000\000\000\000\250Rw\25
2\212\002\000\000\326\301b\330?C\313\340LY\260\312\003\000\000\f\000\000
\000\326QS\330\000\001\004\000\000\000\000\000
> KY\260\312\003\000\000T\232\031\002\000\000\000\000\060\000\000\000\06
0\000\000\000\364\060y\252\212\002\000\000`LY\260\312\003",
> '\000' <repeats 18 times>,
> "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060"
,
> '\000' <repeats 15 times>, "0\000\000\000\005"... #4
> 0x000000000046ec09 in remote_control_callback (c=<optimized out>,
> arg=0x21884c0, err=<optimized out>, rep=<optimized out>) at
> daemon/remote.c:2624 s = 0x21884c0 rc = 0x1551a20 r = 1 #5
> 0x0000028aaa5149cc in event_base_loop () from
> /usr/lib/libevent-2.0.so.5 No symbol table info available. #6
> 0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at
> util/netevent.c:305 retval = <optimized out> #7  0x0000000000467ac6
> in worker_work (worker=<optimized out>) at daemon/worker.c:1311 No
> locals. #8  daemon_fork (daemon=0x1526270) at daemon/daemon.c:566 
> No locals. #9  0x0000000000411eaa in run_daemon (debug_mode=0,
> cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf") at
> daemon/unbound.c:671 cfg = <optimized out> daemon = <optimized
> out> done_setup = 1 #10 main (argc=<optimized out>, argv=<optimized
> out>) at daemon/unbound.c:766 c = <optimized out> cfgfile =
> 0x48bc49 "/etc/unbound/unbound.conf" winopt = <optimized out> 
> cmdline_verbose = 0 debug_mode = 0
> 
> 
> 
> In both of these cases it appears to be hanging on the 'lock_rw_t
> lock' member of the local_zones struct. Below is the relevant
> portions of the config (excluding only access control):
> 
> 
> root at server:~# sed -e '/#.*$/d' -e '/^$/d'
> /etc/unbound/unbound.conf remote-control: control-enable: yes 
> server: verbosity: 4 statistics-interval: 0 statistics-cumulative:
> no extended-statistics: no num-threads: 1 interface:
> 10.242.128.232 port: 53 do-ip4: yes do-udp: yes do-tcp: yes chroot:
> "" directory: "/etc/unbound" logfile: "/var/log/unbound.log" 
> use-syslog: "no" pidfile: "/var/run/unbound.pid" 
> auto-trust-anchor-file: /var/lib/unbound/root.key
> 
> root at server:~# unbound -h usage:  unbound [options] start unbound
> daemon DNS resolver. -h	this help -c file	config file to read
> instead of /etc/unbound/unbound.conf file format is described in
> unbound.conf(5). -d	do not fork into the background. -v	verbose
> (more times to increase verbosity) Version 1.5.4 linked libs:
> libevent 2.0.16-stable (it uses epoll), OpenSSL 1.0.1 14 Mar 2012 
> linked modules: dns64 validator iterator BSD licensed, see LICENSE
> in source package for details. Report bugs to
> unbound-bugs at nlnetlabs.nl
> 
> 
> I have found this problems occurs whether num-threads is set to 1
> or 8 (number of cores in my system). There are roughly 15000
> local-data entries in this machine, from half a dozen local-zone
> typetransparent entries. Unbound logging doesn't indicate any
> failure (the last entry is the one generated from the
> unbound-control call). This seems to occur most often when adding
> new local-data entries, but I have also seen this occur when
> calling 'local_zone_remove' (in this case it was also hanged on
> pthread_rwlock_wrlock). I have not been able to replicate this with
> the version of Unbound made available from the Ubuntu precise
> package (1.4.16). Please let me know what further information I can
> provide to assist in debugging this problem. Thank you!
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJV2vfZAAoJEJ9vHC1+BF+Nej8P/1TGt/GGfLv0cmhthDLolzBC
tCw9CJGvIvFV1W2YpswHKQg03wMWzOm/50xG3eLJvk8UO78TfI52WUWfzfnsOMYx
p4Ws7ft742usoPcuMtghic580kKqLbdv1fOqKFZLZRP6zMKwnZmWhU/2LvAZv1q6
ZFG8lvbweyean08WE97PvnECHpK+XxYMb/OSfeH1sWlOBohSVdbGoaOgEWL71Os2
8ptx/opEvQk1QJ37vszRcz88bZd5l0JjjXdxI14EKQe9KhDuJsEgNHU36u8XVeHZ
Ze4Bc7EPqtNIsNtDjnoiRSePCb7tgbXzyP93d0b/9nMR0rpUhzzgK1y3j8YOp3x9
+jyZ5yLadkA/pBoz6T4qabTVQqlJ9P0bEcOuoUUSYmUVSpVUVwrQpqcooL42XWYo
cWbCtLSdKZkQ5+vax3JDh6JtQDihTRPHYdR96nhO5yvCPpo5pblGbFzDL/+vQrdy
jNSRGNBjbP6/lZQ02HC8ZfTslQlJ1Hl6JzKyIq2x540CLp29nconz5Cb6NffUJAJ
o6afMT8xCxw39GjXkG9kpVldxOjQoHMGVwl5Be4fiJXqt6JRKTdXfL7q5BEimDuQ
kTWIUHNcL0SygzaySjVO23HmVl+s9armgvbHrbSjWEF/O60evgCxS0E81O7S5nOr
klyZNTZo4mZnJompdLLY
=SpWX
-----END PGP SIGNATURE-----