Tuesday, July 28, 2009

python data storage

after a little bit of optimization, i'm finding the bottleneck now is reading in the data. i think i've found about all the ways to speed up cPickle (use the most recent protocol, set Pickler().fast = True), so the next step will be a real database. i'm not sure the python builtins will buy me much, though, and if i'm going to have to install something it might as well be pytables. pytables is an interface layer on top of hdf5, so it's probably best suited for large volumes of numerical data. it only requires hdf5 (which built without problems: configure && make install) and numpy. not being root, i had to set HDF5_DIR and add paths to LD_LIBRARY_PATH and PYTHONPATH, but overall it was a very painless install.

the data structure is bound to be more complex than a simple pickle, but there are some good tutorials out there. the nmag project also has some good real-life experience using pytables for unstructured grid data (see hdf5*py in nmag-0.1/nsim/interface/nfem). uiuc and cei (the people who make ensight) also defined an hdf5 mesh api, but it looks pretty crufty now. another plus with pytables is that you can use vitables to interact with the data; that's even easier than a pickle.

EDIT: here's a site that covers a lot of the issues with scientific data storage and refers to specific examples, including hdf. one problem that might arise with pytables is that numpy arrays can be memory-mapped to a file on disk, but pytables can't do that, if i'm not mistaken. am i? according to this email exchange, pytables doesn't do mmap, but used properly it can be as fast or faster. sounds like i can still use pytables without losing performance, though i will need to reread that and some of the refs therein before implementing. here's an interesting conversation about large file i/o in python, with specific applications in finance.
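for reference, here's a minimal sketch of the two cPickle speedups mentioned above (highest protocol plus fast mode). the data dict is just a stand-in; the try/except import also lets it run on python 3, where cPickle was folded into pickle:

```python
# sketch: the two cPickle speedups discussed above.
# on python 3, cPickle no longer exists; plain pickle has the same knobs.
try:
    import cPickle as pickle
except ImportError:
    import pickle
import io

data = {"coords": [[0.0, 1.0], [2.0, 3.0]], "labels": ["a", "b"]}  # stand-in data

buf = io.BytesIO()
p = pickle.Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL)
p.fast = True  # skip memo bookkeeping; only safe if objects aren't shared/recursive
p.dump(data)

buf.seek(0)
restored = pickle.load(buf)
assert restored == data
```

note the caveat on fast mode: it disables the memo, so it will blow up on self-referential structures and will duplicate shared objects in the output.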
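a minimal pytables sketch along the lines above: write a numpy array into an hdf5 file and read a single row back without loading the whole thing. the filename and node name are made up, and the snake_case api names (open_file, create_array) are from current pytables; the 2009-era release spelled them openFile/createArray:

```python
# sketch: store a numpy array in hdf5 via pytables, then read a slice.
# "demo.h5" and the node name "grid" are arbitrary examples.
import numpy as np
import tables

data = np.arange(12.0).reshape(3, 4)

with tables.open_file("demo.h5", mode="w") as h5:
    h5.create_array(h5.root, "grid", data, title="example grid data")

with tables.open_file("demo.h5", mode="r") as h5:
    row = h5.root.grid[1]  # pulls just this row from disk, not the full array
    assert np.allclose(row, data[1])
```

slicing a node like this is the "used properly" part: you get out-of-core reads without mmap.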
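for comparison, this is the numpy memory-mapping being discussed: the array lives in a plain binary file and slices are paged in on demand. the filename and shape are arbitrary:

```python
# sketch: a disk-backed numpy array via np.memmap.
# "big.dat" and the shape are just illustrative choices.
import numpy as np

shape = (1000, 4)

# create and fill a disk-backed array
mm = np.memmap("big.dat", dtype="float64", mode="w+", shape=shape)
mm[:] = np.arange(4000.0).reshape(shape)
mm.flush()

# reopen read-only; only the touched pages get read from disk
ro = np.memmap("big.dat", dtype="float64", mode="r", shape=shape)
assert ro[500, 2] == 2002.0  # 500 * 4 + 2
```

the catch, as noted above, is that memmap files carry no metadata (dtype and shape have to be known out of band), which is exactly what hdf5 provides.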
