While debugging a deadlock in dask-ms I noticed that when a running all calls to python-casacore in a single threaded executor pool the library deadlocks with no exception when putcol is called without setting a writable user lock. It always seems to hang on a ISMColumn operation. The bug likely affects only this storage manager:
#0 0x00007f12c8d3c6d1 in casacore::ISMColumn::getScalarColumnCellsdoubleV (this=0x7f12c006f1c0, rownrs=..., values=0x7f12c4a77670) at /scratch/bhugo/oms-subset/casacore/tables/DataMan/ISMColumn.cc:464
#1 0x00007f12c8d86ad8 in casacore::StManColumn::getScalarColumnCellsV (this=0x7f12c006f1c0, rownrs=..., dataPtr=0x7f12c4a77670) at /scratch/bhugo/oms-subset/casacore/tables/DataMan/StManColumn.cc:227
#2 0x00007f12c8c7bbf2 in casacore::ScalarColumnData<double>::getScalarColumnCells (this=0x7f12c0066960, rownrs=..., val=0x7f12c4a77670)
at /scratch/bhugo/oms-subset/casacore/casacore/tables/Tables/ScaColData.tcc:151
#3 0x00007f12c8c99375 in casacore::ScalarColumn<double>::getColumnCells (this=0x7f12c4a77450, rownrs=..., vec=..., resize=false)
at /scratch/bhugo/oms-subset/casacore/casacore/tables/Tables/ScalarColumn.tcc:205
#4 0x00007f12c8c990d4 in casacore::ScalarColumn<double>::getColumnRange (this=0x7f12c4a77450, rowRange=..., vec=..., resize=false)
at /scratch/bhugo/oms-subset/casacore/casacore/tables/Tables/ScalarColumn.tcc:180
#5 0x00007f12c8c9921b in casacore::ScalarColumn<double>::getColumnRange (this=0x7f12c4a77450, rowRange=...) at /scratch/bhugo/oms-subset/casacore/casacore/tables/Tables/ScalarColumn.tcc:165
#6 0x00007f12c8cd343c in casacore::TableProxy::getValueFromTable (this=0x7f12c0001b30, colName=..., rownr=100, nrow=100, incr=1, isCell=false)
at /scratch/bhugo/oms-subset/casacore/tables/Tables/TableProxy.cc:2245
#7 0x00007f12c8cca084 in casacore::TableProxy::getColumn (this=0x7f12c0001b30, columnName=..., row=100, nrow=100, incr=1) at /scratch/bhugo/oms-subset/casacore/tables/Tables/TableProxy.cc:1149
#8 0x00007f12ca5f1970 in boost::python::detail::invoke<boost::python::to_python_value<casacore::ValueHolder const&>, casacore::ValueHolder (casacore::TableProxy::*)(casacore::String const&, int, int, int), boost::python::arg_from_python<casacore::TableProxy&>, boost::python::arg_from_python<casacore::String const&>, boost::python::arg_from_python<int>, boost::python::arg_from_python<int>, boost::python::arg_from_python<int> > (ac3=..., ac2=..., ac1=..., ac0=..., tc=<synthetic pointer>, f=
@0x21865c8: (casacore::ValueHolder (casacore::TableProxy::*)(casacore::TableProxy * const, const casacore::String &, int, int, int)) 0x7f12c8cc9fdc <casacore::TableProxy::getColumn(casacore::String const&, int, int, int)>, rc=...) at /scratch/bhugo/oms-subset/opt/include/boost/python/detail/invoke.hpp:86
#9 boost::python::detail::caller_arity<5u>::impl<casacore::ValueHolder (casacore::TableProxy::*)(casacore::String const&, int, int, int), boost::python::default_call_policies, boost::mpl::vector6<casacore::ValueHolder, casacore::TableProxy&, casacore::String const&, int, int, int> >::operator() (args_=<optimized out>, this=0x21865c8) at /scratch/bhugo/oms-subset/opt/include/boost/python/detail/caller.hpp:238
#10 boost::python::objects::caller_py_function_impl<boost::python::detail::caller<casacore::ValueHolder (casacore::TableProxy::*)(casacore::String const&, int, int, int), boost::python::default_call_policies, boost::mpl::vector6<casacore::ValueHolder, casacore::TableProxy&, casacore::String const&, int, int, int> > >::operator() (this=0x21865c0, args=<optimized out>, kw=<optimized out>)
at /scratch/bhugo/oms-subset/opt/include/boost/python/object/py_function.hpp:38
#11 0x00007f12c853b00d in boost::python::objects::function::call(_object*, _object*) const () from /usr/lib/x86_64-linux-gnu/libboost_python-py35.so.1.58.0
#12 0x00007f12c853b208 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py35.so.1.58.0
#13 0x00007f12c8543263 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const () from /usr/lib/x86_64-linux-gnu/libboost_python-py35.so.1.58.0
#14 0x00007f12c82f3b6a in boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>::operator() (this=0x2190e08, handler=..., f=...,
translate=0x7f12c82f2fbd <casacore::python::translate_iterexcp(casacore::IterError const&)>) at /scratch/bhugo/oms-subset/opt/include/boost/python/detail/translate_exception.hpp:46
#15 0x00007f12c82f38b5 in boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(casacore::IterError const&)> >::operator()<bool, boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>, boost::_bi::rrlist2<boost::python::detail::exception_handler const&, boost::function0<void> const&> > (this=0x2190e10, f=..., a=...)
at /scratch/bhugo/oms-subset/opt/include/boost/bind/bind.hpp:388
#16 0x00007f12c82f36c1 in boost::_bi::bind_t<bool, boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(casacore::IterError const&)> > >::operator()<boost::python::detail::exception_handler const&, boost::function0<void> const&> (this=0x2190e08, a1=..., a2=...)
at /scratch/bhugo/oms-subset/opt/include/boost/bind/bind.hpp:1318
#17 0x00007f12c82f352d in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(casacore::IterError const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke (
function_obj_ptr=..., a0=..., a1=...) at /scratch/bhugo/oms-subset/opt/include/boost/function/function_template.hpp:137
#18 0x00007f12c8543238 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const () from /usr/lib/x86_64-linux-gnu/libboost_python-py35.so.1.58.0
#19 0x00007f12c82f3b6a in boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>::operator() (this=0x2164e08, handler=..., f=...,
translate=0x7f12c82f2fbd <casacore::python::translate_iterexcp(casacore::IterError const&)>) at /scratch/bhugo/oms-subset/opt/include/boost/python/detail/translate_exception.hpp:46
#20 0x00007f12c82f38b5 in boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(casacore::IterError const&)> >::operator()<bool, boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>, boost::_bi::rrlist2<boost::python::detail::exception_handler const&, boost::function0<void> const&> > (this=0x2164e10, f=..., a=...)
at /scratch/bhugo/oms-subset/opt/include/boost/bind/bind.hpp:388
#21 0x00007f12c82f36c1 in boost::_bi::bind_t<bool, boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(casacore::IterError const&)> > >::operator()<boost::python::detail::exception_handler const&, boost::function0<void> const&> (this=0x2164e08, a1=..., a2=...)
at /scratch/bhugo/oms-subset/opt/include/boost/bind/bind.hpp:1318
#22 0x00007f12c82f352d in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<casacore::IterError, void (*)(casacore::IterError const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(casacore::IterError const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke (
function_obj_ptr=..., a0=..., a1=...) at /scratch/bhugo/oms-subset/opt/include/boost/function/function_template.hpp:137
#23 0x00007f12c854301d in boost::python::handle_exception_impl(boost::function0<void>) () from /usr/lib/x86_64-linux-gnu/libboost_python-py35.so.1.58.0
#24 0x00007f12c8538409 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py35.so.1.58.0
#25 0x00000000005c3bd7 in PyObject_Call () at ../Objects/abstract.c:2165
#26 0x00000000005354af in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7f12c4a77ed0, func=<optimized out>) at ../Python/ceval.c:4936
The minimum steps to reproduce the issue are the following. Note the deliberate bug induced in line 30. I'm very careful not to submit calls (even open and close calls) in anything other than the single executor thread:
1 from __future__ import print_function
2
3 import argparse
4 import numpy as np
5 import pyrap.tables as pt
6 from concurrent.futures import ThreadPoolExecutor
7 import threading
8 def create_parser():
9 p = argparse.ArgumentParser()
10 p.add_argument("ms")
11 p.add_argument("-rc", "--row-chunks", default=100, type=int)
12 p.add_argument("-nc", "--num-consumers", default=64, type=int)
13 return p
14
15 args = create_parser().parse_args()
16
17 futs = []
18
19 consumer = ThreadPoolExecutor(max_workers=args.num_consumers)
20 producer = ThreadPoolExecutor(max_workers=1) # ensure we don't interact with CASAcore in more than one thread
21
22 # get future to the table on thread 1 of the data producer queue
23 tbl_hnd = producer.submit(pt.table, args.ms, readonly=False, lockoptions="user")
24
25 def _proxy_table_method(handle, method, write, *args, **kwargs):
26 print("Running CC method '{}' on thread {}".format(method, threading.get_ident()))
27
28 table = handle.result()
29
30 table.lock(write=not write)
31
32 try:
33 result = getattr(table, method)(*args, **kwargs)
34
35 if write:
36 table.flush()
37 finally:
38 if method != "close":
39 table.unlock()
40
41 return result
42
43 nrows = producer.submit(_proxy_table_method, tbl_hnd, "nrows", False).result()
44 print("{} rows in this MS".format(nrows))
45
46
47 def _dummy_rw(colname, r, e):
48 """ Dummy method to read and write using user locks """
49 def _rw_batch(colname, r, e):
50 print("Submitting read write batch for rows {} through {}".format(r, e))
51 nddat = _proxy_table_method(tbl_hnd, "getcol", False, colname, startrow=r, nrow=e-r)
52 # whatever ops to be done here
53 _proxy_table_method(tbl_hnd, "putcol", True, colname, nddat, startrow=r, nrow=e-r)
54 return nddat
55 datshp = producer.submit(_rw_batch, colname, r, e).result().shape
56 return r, e-r, datshp
57
58 # Chunk up the data nice and fine so there are a lot of calls to CC
59 for r in range(0, nrows, args.row_chunks):
60 e = min(r + args.row_chunks, nrows)
61 futs.append(consumer.submit(_dummy_rw,"TIME", r, e))
62 futs.append(consumer.submit(_dummy_rw,"FLAG", r, e))
63
64 # ensure all rw ops finish
65 for f in futs:
66 f.result()
67
68 print(">>>>>>>>>>>>FINISHED<<<<<<<<<<<")
69 for f in futs:
70 sr, nr, res = f.result()
71 print(sr, nr, res)
72
73 import atexit
74 atexit.register(lambda: producer.submit(_proxy_table_method, tbl_hnd, "close", False).result())
with output as follows:
> $ python3 test.py msdir/subset-1563189318-20M.ms
Successful read/write open of user-locked table msdir/subset-1563189318-20M.ms: 24 columns, 6368888 rows
Running CC method 'nrows' on thread 139718585456384
6368888 rows in this MS
Submitting read write batch for rows 0 through 100
Running CC method 'getcol' on thread 139718585456384
Running CC method 'putcol' on thread 139718585456384
Submitting read write batch for rows 0 through 100
Running CC method 'getcol' on thread 139718585456384
Running CC method 'putcol' on thread 139718585456384
Submitting read write batch for rows 100 through 200
Running CC method 'getcol' on thread 139718585456384
Note that I get the expected user exception when executing in the parent thread:
In [5]: t = tbl("msdir/subset-1563189318-20M.ms", readonly=False, lockoptions="user")
Successful read/write open of user-locked table msdir/subset-1563189318-20M.ms: 24 columns, 6368888 rows
In [6]: t.lock(write=True)
In [7]: time = t.getcol("TIME")
In [8]: t.unlock()
In [9]: t.lock(write=False)
In [10]: t.putcol("TIME", time)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-10-d8534bd8f3fa> in <module>()
----> 1 t.putcol("TIME", time)
/scratch/bhugo/oms-subset/opt/home/hugo/.local/lib/python3.5/site-packages/casacore/tables/table.py in putcol(self, columnname, value, startrow, nrow, rowincr)
1152
1153 """
-> 1154 self._putcol(columnname, startrow, nrow, rowincr, value)
1155
1156 def putvarcol(self, columnname, value, startrow=0, nrow=-1, rowincr=1):
RuntimeError: ColumnSet::doLock: table /scratch/bhugo/oms-subset/msdir/subset-1563189318-20M.ms should be locked when using UserLocking
I've compiled casacore master (#2f45a1a7fa8474ffc) from source with debug symbols and installed the source distribution of python-casacore==3.1.1. We are running python 3.5.