i guess all these problems would be solved, along with a number of others, if i had a checkpoint/pause/resume/load coredump/reification/mobile computing type of capability. i've thought about trying to use pyro and such things, but it would be a significant effort.
maybe something like dmtcp and/or urdb for checkpointing (maybe even in the __exit__ method of a with context) is worth a look, especially since they specifically claim success with python, along with matlab, perl, and other binaries via gdb. (they even used it on ipython's parallel demo.) unlike other binary checkpointing packages, like blcr, there is no need for any violence against the kernel or binary. wow, they even claim it will work with files, pipes, sockets, etc., memmap, and x windows (minus extensions, gl and video). only on linux, but still... LGPL and pretty cool as long as the performance hit isn't too bad. ok, according to the paper, performance is virtually unaffected between checkpoints.
for programs to control their own checkpointing, there is a c api. probably easy to wrap, maybe even ctypes it if there is a shared object lib. section 1.1 of the paper, 'use cases', explicitly identifies save/restore, dump/undump, offline debugging, and bug report image as applications (as well as being robust to deadlock and race conditions by stepping back and retrying, though this is less interesting to me).
EDIT: ok, the ubuntu package only has a static lib for the api, but it also has the .c and .h files in /usr/lib/dmtcp/, so i just
gcc -fPIC -c dmtcpaware.c
gcc -shared -Wl,-soname,libdmtcpaware.so -o libdmtcpaware.so dmtcpaware.o
and i had a .so that opened with
a = ctypes.cdll.LoadLibrary('/usr/lib/dmtcp/libdmtcpaware.so')
now i run dmtcp_coordinator in another terminal and
dmtcp_checkpoint python -c "import ctypes; a = ctypes.cdll.LoadLibrary('/usr/lib/dmtcp/libdmtcpaware.so'); print a.dmtcpIsEnabled(); a.dmtcpGetCoordinatorStatus()"
works. some symbols are not in the .so, so those things need some more tweaking.
but a.dmtcpCheckpoint() runs and returns DMTCP_AFTER_CHECKPOINT. an ipython session with 2 checkpoints and a small numpy array is about 12.5 MB, and each one took a few seconds to generate. the dmtcp_restart_script.sh script in the dir where dmtcp_coordinator was run starts the process up again, and everything is in there! puts me right back to after the call to a.dmtcpCheckpoint(), except now it has returned DMTCP_AFTER_RESTART. works great for a simple example, except that it segfaults when i finish the thread.
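here is a rough sketch of how the ctypes calls above could be wrapped so a script can tell which side of a checkpoint it is on, and can see which symbols didn't make it into the .so. the helper names (load_dmtcp, missing_symbols, checkpoint) are made up for illustration, and the numeric values of DMTCP_AFTER_CHECKPOINT and DMTCP_AFTER_RESTART are assumptions that should be checked against dmtcpaware.h. only the three dmtcp calls already tried in this post are used.

```python
import ctypes

# Assumed values: check them against dmtcpaware.h on your system.
DMTCP_AFTER_CHECKPOINT = 1
DMTCP_AFTER_RESTART = 2

# Path of the hand-built shared object from this post.
LIB_PATH = '/usr/lib/dmtcp/libdmtcpaware.so'

# Only the three calls tried in this post; the header declares more.
SYMBOLS = ('dmtcpIsEnabled', 'dmtcpCheckpoint', 'dmtcpGetCoordinatorStatus')


def load_dmtcp(path=LIB_PATH):
    """Return a ctypes handle to the library, or None if it cannot be loaded."""
    try:
        return ctypes.cdll.LoadLibrary(path)
    except OSError:
        return None


def missing_symbols(lib, names=SYMBOLS):
    """List the names that the loaded library does not export."""
    # ctypes raises AttributeError for an unknown symbol, so hasattr works.
    return [name for name in names if not hasattr(lib, name)]


def checkpoint(lib):
    """Take a checkpoint and report which side of it we are on.

    Returns 'unavailable', 'checkpointed', 'restarted', or 'unknown (<rc>)'.
    """
    if lib is None or not lib.dmtcpIsEnabled():
        return 'unavailable'
    rc = lib.dmtcpCheckpoint()
    if rc == DMTCP_AFTER_CHECKPOINT:
        return 'checkpointed'
    if rc == DMTCP_AFTER_RESTART:
        return 'restarted'
    return 'unknown (%d)' % rc
```

run under dmtcp_checkpoint with a coordinator up, checkpoint(load_dmtcp()) should come back as 'checkpointed' the first time through and 'restarted' after the restart script, provided the assumed constants match the header. outside of dmtcp it just returns 'unavailable', so the same script still runs normally.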
so for offline debugging, i could put a top level in the __main__:
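class Wrapper(object):
    def __enter__(self):
        return self
    def __exit__(self, t, v, tb):
        if t is not None:  # "badness": an exception escaped the with block
            dump_checkpoint()  # placeholder, e.g. a call to a.dmtcpCheckpoint()
            tarball_dump_and_send_it_to_me()  # placeholder, not written yet
            import pdb; pdb.set_trace()

with Wrapper() as w:
    do_stuff()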
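a slightly more filled-in version of the same idea, as a sketch only. the class name CheckpointOnError and its on_error hook are hypothetical; the hook stands in for whatever does the dump and the tarball, for example a function that calls a.dmtcpCheckpoint(). it uses pdb.post_mortem on the traceback instead of pdb.set_trace, which is a swap on my part so the debugger opens at the frame that raised rather than inside __exit__.

```python
import pdb
import traceback


class CheckpointOnError(object):
    """Context manager that runs a hook when an exception escapes the block."""

    def __init__(self, on_error, debug=False):
        self.on_error = on_error   # hypothetical hook, e.g. wraps dmtcpCheckpoint
        self.debug = debug         # if True, open pdb at the failing frame
        self.failed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is None:
            return False           # clean exit, nothing to do
        self.failed = True
        traceback.print_exception(exc_type, exc_value, tb)
        self.on_error()            # dump before the traceback is discarded
        if self.debug:
            pdb.post_mortem(tb)    # inspect the frame that raised
        return False               # let the exception keep propagating
```

usage would look like `with CheckpointOnError(my_dump_function, debug=True): do_stuff()`, where my_dump_function is whatever ends up doing the checkpoint. returning False from __exit__ means the original exception is still raised afterwards, so nothing gets silently swallowed.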