Thursday, May 13, 2010

memmaping numpy arrays

numpy.memmap seems to be the intended user interface for a flexible memory-mapped file (with offsets). numpy.load, with the mmap_mode option set, calls numpy.lib.format.open_memmap, but numpy.lib.format's docstring says that memory mapping will not work for object arrays (and mmap_mode is ignored for zipped or pickled files). numpy.lib.format.open_memmap does some checks and setup, and then it calls numpy.memmap.

so i think if i want to mmap binary files from a non-python source with variable-length arrays, i need to call memmap with the offset parameter for each variable-length array and build the final arrays out of the memmaps. (apparently there is also a < 2GB file size limit for python < 2.5.)

hmm. keeping references open on all these arrays, mapped to scattered locations in the same file, creates a separate file handle for each one, and it doesn't take long to hit the os open file limit. so i need to reuse the same mmap buffer. a quick look at memmap.py from numpy.core suggests this is pretty simple: the creation of the ndarray from the buffer is just a few lines at the end of the __new__ method, and the buffer itself is stored in self._mmap. so i could either make a subclass of memmap that loops that ndarray-creation step over one shared buffer, or just make one big memmap out of the whole file and follow the same pattern to build ndarrays out of it. (or just slice it, which makes views.)
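a minimal sketch of the two approaches, assuming a made-up binary file holding two variable-length float64 arrays back to back (the file path and array lengths are invented for illustration; the real file layout would come from whatever the non-python program wrote):

```python
import os
import tempfile
import numpy as np

# fabricate a binary file like one a non-python program might write:
# two float64 arrays, lengths 3 and 5, stored back to back.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
a = np.arange(3, dtype=np.float64)
b = np.arange(5, dtype=np.float64)
with open(path, "wb") as f:
    f.write(a.tobytes())
    f.write(b.tobytes())

# approach 1: one np.memmap per array, using the offset parameter.
# each call opens its own file handle, which is what eventually
# runs into the os open-file limit.
m1 = np.memmap(path, dtype=np.float64, mode="r", offset=0, shape=(3,))
m2 = np.memmap(path, dtype=np.float64, mode="r",
               offset=3 * 8, shape=(5,))  # skip 3 float64s = 24 bytes

# approach 2: map the whole file once and slice it. slices of a
# memmap are views that share the single underlying mmap buffer
# (stored in ._mmap), so there is only one open file handle.
whole = np.memmap(path, dtype=np.float64, mode="r")
v1 = whole[:3]
v2 = whole[3:8]

assert np.array_equal(m1, v1)
assert np.array_equal(m2, v2)
```

for the variable-length case, the slice boundaries would be computed from each array's recorded length and itemsize while walking the file once.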
