Wednesday, July 23, 2008

python testing and debugging

looks like some commercial debuggers are talking about offline debugging like it's a great new thing. maybe i should dig up the latest version of pycrash i can find and make it my own. it would be cool, especially for getting better feedback from users and automated test suites. maybe something like this, which would allow the __main__ module to be pickled, would work as well, except that i would need the contexts for the whole stack down to where the exception happened. could i catch it with pdb and then pickle the debugger state instead? i don't see a really simple way to do it, since frames and tracebacks can't be pickled. i would have to walk through these data structures, pickle what is picklable, and pprint the rest, i guess (sketch below). also, i really should integrate cProfile into my testing utility. fuzzing and fault injection would be really nice, too. once i have a good debug dump working, maybe i could use an email process monitoring tool like gmailpm to send it to me automatically through gmail.
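here's roughly what i mean, as a rough sketch (dump_exception_state and the dump format are made up by me, and it's meant to be called from inside an except: block):

import pickle
import pprint
import sys
import traceback

def dump_exception_state(path='crash.dump'):
    # call this from an except: block. walks the traceback frames and
    # keeps each local if it survives a round-trip through pickle,
    # falling back to a pretty-printed repr otherwise.
    etype, value, tb = sys.exc_info()
    dump = {'traceback': traceback.format_exc(), 'frames': []}
    while tb is not None:
        frame = tb.tb_frame
        safe_locals = {}
        for name, obj in frame.f_locals.items():
            try:
                pickle.dumps(obj)        # does it pickle?
                safe_locals[name] = obj
            except Exception:
                safe_locals[name] = pprint.pformat(obj)
        dump['frames'].append({'function': frame.f_code.co_name,
                               'file': frame.f_code.co_filename,
                               'line': tb.tb_lineno,
                               'locals': safe_locals})
        tb = tb.tb_next
    with open(path, 'wb') as f:
        pickle.dump(dump, f)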
i guess all these problems would be solved, along with a number of others, if i had a checkpoint/pause/resume/load coredump/reification/mobile computing type of capability. i've thought about trying to use pyro and such things, but it would be a significant effort.
maybe something like dmtcp and/or urdb for checkpointing (maybe even from the __exit__ method of a with context) is worth a look, especially since they specifically claim success with python, along with matlab, perl, and other binaries via gdb. (they even used it on ipython's parallel demo.) unlike other binary checkpointing packages, like blcr, there is no need for any violence against the kernel or binary. wow, they even claim it will work with files, pipes, sockets, etc., memmap, and x windows (minus extensions, gl and video). only on linux, but still... LGPL and pretty cool as long as the performance hit isn't too bad. ok, according to the paper, performance is virtually unaffected between checkpoints. for programs to control their own checkpointing, there is a c api. probably easy to wrap, maybe even ctypes it if there's a shared object lib. section 1.1 of the paper, 'use cases', explicitly identifies save/restore, dump/undump, offline debugging, and bug report image as applications (as well as being robust to deadlock and race conditions by stepping back and retrying, though this is less interesting to me).
EDIT: ok, the ubuntu package only has a static lib for the api, but it also has the .c and .h files in /usr/lib/dmtcp/, so i just
gcc -fPIC -c dmtcpaware.c
gcc -shared -Wl,-soname,libdmtcpaware.so -o libdmtcpaware.so dmtcpaware.o
and i had a .so that opened with
a = ctypes.cdll.LoadLibrary('/usr/lib/dmtcp/libdmtcpaware.so')
now i run dmtcp_coordinator in another terminal and
dmtcp_checkpoint python -c "import ctypes; a = ctypes.cdll.LoadLibrary('/usr/lib/dmtcp/libdmtcpaware.so'); print a.dmtcpIsEnabled(); a.dmtcpGetCoordinatorStatus()"
works. some symbols didn't make it into the .so, though, so those calls need some more tweaking.
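to see which symbols made it in, ctypes only does the dlsym lookup on first attribute access and raises AttributeError when it fails, so a quick probe works (the function list here is just the handful i care about from dmtcpaware.h):

import ctypes

lib = ctypes.cdll.LoadLibrary('/usr/lib/dmtcp/libdmtcpaware.so')
for name in ('dmtcpIsEnabled', 'dmtcpCheckpoint',
             'dmtcpGetCoordinatorStatus'):
    try:
        getattr(lib, name)  # forces the symbol lookup
        print '%s: ok' % name
    except AttributeError:
        print '%s: missing' % name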
but a.dmtcpCheckpoint() runs and returns DMTCP_AFTER_CHECKPOINT. an ipython session with 2 checkpoints and a small numpy array is about 12.5 MB, and each checkpoint took a few seconds to generate. the dmtcp_restart_script.sh script in the dir where dmtcp_coordinator was run starts the process up again, and everything is in there! it puts me right back just after the call to a.dmtcpCheckpoint(), except now it has returned DMTCP_AFTER_RESTART. works great for a simple example, except that it segfaults when i finish the thread.
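that return value is the setjmp-style trick: the same call returns once when the checkpoint is written and again when the image is restarted, so the program can tell which side it's on. a sketch (i'm assuming the DMTCP_AFTER_* constants from dmtcpaware.h are 1 and 2; double-check against the header):

import ctypes

# return codes from dmtcpaware.h (values assumed, verify in the header)
DMTCP_AFTER_CHECKPOINT = 1
DMTCP_AFTER_RESTART = 2

lib = ctypes.cdll.LoadLibrary('/usr/lib/dmtcp/libdmtcpaware.so')

def checkpoint():
    # blocks for a few seconds while the coordinator writes the image
    status = lib.dmtcpCheckpoint()
    if status == DMTCP_AFTER_CHECKPOINT:
        print 'checkpoint written, carrying on'
    elif status == DMTCP_AFTER_RESTART:
        print 'woke up from a restarted image'
    return status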
so for offline debugging, i could put a top-level wrapper in __main__:
class Wrapper(object):
    def __enter__(self):
        return self
    def __exit__(self, t, v, tb):
        # t is the exception type, or None if the block exited cleanly
        if t is not None:
            dump_checkpoint()                 # placeholder: e.g. dmtcpCheckpoint() via ctypes
            tarball_dump_and_send_it_to_me()  # placeholder: package the image and mail it
            import pdb; pdb.post_mortem(tb)   # inspect the dead stack, not this frame
        return False  # falsey return lets the exception propagate

with Wrapper() as w:
    do_stuff()
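the detail that matters is the __exit__ return value: returning something falsey re-raises the original exception after the dump goes out, so the crash still looks like a crash in logs and test runs, while returning True would swallow it and the process would carry on as if nothing happened.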
