Thursday, May 27, 2010

time format in bash

wanted to change the output format of 'time' (the default posix three-liner is wasteful of terminal real estate), but the bash time builtin ignores the -f option and the TIME env var (contrary to the man page!). TIMEFORMAT, however, works:
export TIMEFORMAT="%E real, %U user, %S sys"
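with that format string, a timed command prints something like this (numbers here are just illustrative):
$ time sleep 2
2.004 real, 0.001 user, 0.002 sys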

mlbviewer.py fixed

looks like mlb.com decided to jack up the xml they were sending out a couple of weeks ago, and it took me a while to figure out what to do about it. apparently the sourceforge forums are not really used; all the discussion about mlbviewer happens at lq (linuxquestions.org). the critical point was in comment 2917:
$ cp MediaService.xsd ~/.mlb
$ rm -r /tmp/suds/*
that suds cache really had me stumped: nothing happened after i updated the MediaService files until i cleared it. and it's really annoying to debug these kinds of problems when the Sign-on Restriction Error locks me out for >15 minutes at a time.

anacrontab entry for nonroot

found a good way to run an anacron job as a non-root user. here's an example line from /etc/anacrontab:
1 23 py_anacron nice su --command="/bin/bash -i -c 'python ~/svn/Python/script.py'" username
this will run the script every day (as long as the computer is on at some point that day, unlike cron, which only fires if the computer is on at the scheduled time) after a 23-minute delay, with the user's full login environment. the /var/spool/anacron/py_anacron file records the date it was last run.
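same line again with the fields labeled (annotation mine; '#' comments are valid in anacrontab):
# period(days)  delay(minutes)  job-identifier  command
1 23 py_anacron nice su --command="/bin/bash -i -c 'python ~/svn/Python/script.py'" username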

Monday, May 24, 2010

hudson-ci

hudson puts a web interface on a continuous integration system for building, testing, revision tracking, and distributing code. if i ever have any users for my code, and especially if i have other people writing code, i need to check it out. too bad it's all java.
buildbot is apparently more python-oriented, and bitten (for trac) is in python. but hudson seems to get the vote for easy install and customization. here's a useful summary:
We use both Buildbot and Hudson for Jython development. Both are useful, but have different strengths and weaknesses.
Buildbot's configuration is pure Python and quite simple once you get the hang of it (look at the epydoc-generated API docs for the most current info). Buildbot makes it easier to define non-testing tasks and distribute the testers. However, it really has no concept of individual tests, just textual, HTML, and summary output, so if you want to have multi-level browsable test output and so forth you'll have to build it yourself, or just use Hudson.
Hudson has terrific support for drilling down from overall results into test suites and individual tests; it also is great for comparing test output between builds, but the distributed (master/slave) stuff is comparatively more complicated because you need a Java environment on the slaves too; also, Hudson is less tolerant of flaky network links between the master and slaves.
So, to get the benefits of both tools, we run a single instance of Hudson, which catches the common test failures, then we do multi-platform regression with Buildbot.
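since the summary mentions that buildbot's configuration is pure python, here's roughly what a minimal master.cfg looks like. this is a sketch from memory of the 0.7-era api; the project name, svn url, and credentials are all placeholders:
from buildbot.buildslave import BuildSlave
from buildbot.scheduler import Scheduler
from buildbot.process import factory
from buildbot.steps.source import SVN
from buildbot.steps.shell import ShellCommand
from buildbot.status import html

c = BuildmasterConfig = {}
c['slaves'] = [BuildSlave('testbot', 'password')]   # one build slave
c['slavePortnum'] = 9989                            # port the slaves connect to

# check out the source, then run the test suite
f = factory.BuildFactory()
f.addStep(SVN(svnurl='svn://example.org/pyproject/trunk'))
f.addStep(ShellCommand(command=['python', 'setup.py', 'test']))

c['builders'] = [{'name': 'runtests', 'slavename': 'testbot',
                  'builddir': 'runtests', 'factory': f}]
c['schedulers'] = [Scheduler(name='all', branch=None,
                             treeStableTimer=60,
                             builderNames=['runtests'])]
c['status'] = [html.WebStatus(http_port=8010)]      # the web status page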

Thursday, May 20, 2010

kriging with numpy

scipy.interpolate.Rbf does radial basis functions, and has all the interpolation types listed on p. 350 of 'A Taxonomy of Global Optimization Methods Based on Response Surfaces' (except that the scaling is uniform across dimensions and the kriging exponents are all fixed at 2). of course, any scaling has to be done on the variables before calling, because the interpolant depends only on radius. i'd like a more convenient kriging interface with that sort of thing taken care of.
okay, after a bit of testing, it looks like the rbf implementation in scipy needs some caution in use. with the default settings, the matrix it solves to get the weights can very easily become ill-conditioned with real-life data (for example, two points much closer together than the rest of the points), and the results are horrible, without any warning. there is an optional 'smooth' parameter that basically sets a minimum on the eigenvalues of that matrix, but you'd have to be smart enough to guess the max eigenvalue if you want to target a condition number. and, of course, the dense matrix solve is O(n**3), which becomes very impractical, very quickly. see below for some thoughts on this. hpgl looks like it has plenty of kriging available at c++ speed. it's under recent development, with no easy_install (but there is a debian package) and very few downloads. i'll see how easy it is to install.
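here's a minimal sketch of the failure mode and the workaround (data and parameter values are made up for illustration):
import numpy as np
from scipy.interpolate import Rbf

x = np.array([0.0, 1.0, 2.0, 3.0, 3.0000001])   # two nearly-coincident points
y = np.sin(x)

exact = Rbf(x, y, function='gaussian')                  # near-singular system, no warning
smoothed = Rbf(x, y, function='gaussian', smooth=1e-8)  # perturbs the diagonal

xi = np.linspace(0.0, 3.0, 50)
print(abs(exact(xi)).max())     # can be wildly large
print(abs(smoothed(xi)).max())  # stays near the data scale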
here are a couple of other links to check out:
(someday i should look up latent variable models with gaussian process models.)
i had another thought after the scipy rbf problems i ran into: why can't something like the fast multipole method be used with kriging? well, some people smarter than i am have thought about that, and i just read a good (surprisingly concise) paper on it: 'efficient kriging via fast matrix-vector products' by nargess memarsadeghi et al. her phd dissertation also looks interesting, with more on data fusion for remote sensing images and cokriging.

the summary paper cites a 1991 article by the inventor of the fmm, l. greengard: 'the fast gauss transform', which applies fmm ideas to sums of gaussians, the kernel that shows up in gaussian process models. the computational complexity is reduced from O(m*n) for m evaluations from n sources to O(m+n), but the constant factor is exponential in the number of dimensions. an improved fgt, creatively named ifgt, published in 2005 by yang, duraiswami, and davis, is asymptotically polynomial in the number of dimensions while still comparable in lower dimensions. further development with raykar made the error bound sufficiently tight to be practically useful, and refs 16 (article) and 24 (book) outline the implementation that memarsadeghi used. the method is derived for a squared-distance (gaussian) rbf, but i think it could be used for an exponent other than 2 if i'm careful with deriving the taylor series. (note: i think that the multi-index notation in the equation just following equation 4 means that the sum over |\alpha| \le p-1 refers to the total order of the components of \alpha, including all polynomial terms up to that order.)
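to spell out that notation for my own reference (these are the standard multi-index conventions, nothing specific to the paper): for \alpha = (\alpha_1, \ldots, \alpha_d) with nonnegative integer components,
|\alpha| = \alpha_1 + \cdots + \alpha_d, \qquad x^\alpha = x_1^{\alpha_1} \cdots x_d^{\alpha_d}, \qquad \alpha! = \alpha_1! \cdots \alpha_d!
so summing over |\alpha| \le p-1 picks up every monomial of total degree at most p-1, consistent with my reading above.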
there is also a description of the gtann algorithm (gt with nearest neighbors), which was developed as the FigTree method in raykar's 2007 phd dissertation. it is more effective when the gaussians have small ranges, whereas the ifgt is better for long-range gaussians. i'm not quite sure i understand the brief description, but it sounds like they just search for sources within a radius and use ordinary kriging on them. in their tests it's an order of magnitude faster than ifgt when the covariance distances are small (ifgt is generally about an order of magnitude faster than exact gt), but about the same speed as exact gt when the covariance distances are long.
they use the symmlq iterative solver from paige and saunders, 'solution of sparse indefinite systems of linear equations'. (can't use cg, because the system with the weight constraint is not positive definite.) there's a python implementation of symmlq at stanford by michael saunders, relicensed bsd so it could go into scipy. other parts of that package were either wrapped or translated (saunders thought a python version with blas at the core would be just as fast), but symmlq is the only part still missing as of 2010-03-31. however, jeffery kline (of cvxopt fame) put a translation from matlab up on his site; it requires cvxopt, even though it is bsd. there's also a pykrylov library (lgpl!) on github with a symmlq function; it already uses numpy instead of wacky cvxopt.matrix objects, so it might be easier to get running. there are a couple of fortran versions floating around out there, too.
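in the meantime, scipy's minres (scipy.sparse.linalg.minres, in newer scipy releases) targets the same symmetric indefinite case. a sketch of the ordinary-kriging saddle-point solve with it, where the covariance model, range, and data are all made up:
import numpy as np
from scipy.sparse.linalg import minres

rng = np.random.RandomState(0)
X = rng.rand(50, 2)                  # 50 scattered sample points in 2-d
z = np.sin(6.0 * X[:, 0]) + X[:, 1]  # toy observations

def gauss_cov(A, B, h=0.3):
    # gaussian covariance between two point sets
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / h ** 2)

# ordinary kriging: [K 1; 1' 0] [w; mu] = [k; 1]
# symmetric but indefinite because of the constraint row, so cg is out
n = len(X)
A = np.zeros((n + 1, n + 1))
A[:n, :n] = gauss_cov(X, X)
A[:n, n] = 1.0
A[n, :n] = 1.0

x0 = np.array([[0.4, 0.6]])          # prediction location
rhs = np.append(gauss_cov(X, x0)[:, 0], 1.0)

sol, info = minres(A, rhs)           # info == 0 means it converged
w, mu = sol[:n], sol[n]
print(np.dot(w, z))                  # kriging prediction at x0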

fft in python

the fft of a numpy array is simple and easy, as long as you don't care about performance. i'm finding that speed-critical use is a bit more tricky. when i zero-pad to a power of 2 (typically up to 1024), i get a huge (~50x) speedup. i'd expect a more well-tuned library to give an additional speedup over the default fftpack, but the optional backends were apparently removed from scipy in the 2008-2009 time frame. intel's mkl is reportedly the fastest, with a proprietary license. i think djbfft and fftw (fftw3) are probably comparable, though the public-domain djbfft looks like abandonware at this point, and fftw is gpl, so it might be problematic to depend on it for anything i'm shipping out. i've seen email list chatter about optional backends and conditional compilation, but there's certainly nothing like that in my epd scipy.
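the zero-padding itself is just the n= argument to the fft call; a quick sketch (the ~50x figure above is from my data, timings will vary):
import numpy as np

x = np.random.rand(1009)                # 1009 is prime: worst case for fftpack
n = 2 ** int(np.ceil(np.log2(len(x))))  # next power of 2: 1024
X_padded = np.fft.fft(x, n=n)           # numpy zero-pads x out to length n
X_plain = np.fft.fft(x)                 # direct transform at the awkward length
one caveat: padding changes the frequency grid (finer bin spacing), so it's not a drop-in replacement if you need the unpadded bins.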

Friday, May 14, 2010

corepy

corepy allows you to write/wrap assembly code from python. not sure i would ever (_ever_) want to go lower than c, but somehow it's good to know that it's there.

friction stir welding

found a good 60-page review article on issues with friction stir welding.

Thursday, May 13, 2010

numexpr

numexpr looks like a very handy package for optimizing numpy array expressions. it basically breaks arrays into cache-friendly chunks and evaluates the whole expression chunk by chunk, without python-level iteration. the 'vector virtual machine' runs in c, so it claims speed comparable to the scipy.weave.blitz expression compiler. and no extra c/c++ compilation step!
pytables comes with it, and with tables 2.2beta1 you can use the expressions on earray, etc. (disk-based homogeneous arrays), so i think it should work with mmap.
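basic usage is a one-liner; numexpr picks the arrays up from the calling namespace (toy expression, made-up sizes):
import numpy as np
import numexpr as ne

a = np.random.rand(10 ** 6)
b = np.random.rand(10 ** 6)

c_numpy = 2.0 * a ** 2 + 3.0 * b + 1.0            # numpy: several passes + temporaries
c_ne = ne.evaluate('2.0 * a**2 + 3.0 * b + 1.0')  # numexpr: one chunked pass in c
print(abs(c_numpy - c_ne).max())                  # should be ~0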

memmaping numpy arrays

numpy.memmap seems to be the intended user interface for a flexible memory-mapped file (with offsets). numpy.load, with the mmap_mode option set, calls numpy.lib.format.open_memmap, but numpy.lib.format's docstring says memory mapping will not work for object arrays (and it's ignored for zipped or pickled files). numpy.lib.format.open_memmap does some checks and setup and then calls numpy.memmap. so if i want to mmap binary files from a non-python source with variable-length arrays, i need to use memmap with the offset parameter for each variable-length array and build the final array out of the memmaps. apparently there is a <2GB file size limit for python < 2.5. hmm.

also, keeping references open on all these arrays mapped to random locations in the same file creates a separate file handle for each one, and unfortunately it doesn't take long to hit the os open-file limit. so i need to reuse the same mmap buffer. a quick look at memmap.py from numpy.core indicates that this is pretty simple: the creation of the ndarray from the buffer is just a few lines at the end of the __new__ method, and the buffer itself is stored in self._mmap. so i could either make a subclass of memmap and loop over that ndarray.__new__, or just make one big memmap of the whole file and follow the same pattern to build ndarrays out of it (or just slice it, which makes views).
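a sketch of the one-big-memmap approach (file name, offsets, and lengths are hypothetical):
import numpy as np

raw = np.memmap('data.bin', dtype=np.uint8, mode='r')  # one fd for the whole file

def varlen_view(offset, count, dtype=np.float64):
    # slice the shared buffer and reinterpret the bytes:
    # a view, not a copy, and no additional open file handle
    nbytes = count * np.dtype(dtype).itemsize
    return raw[offset:offset + nbytes].view(dtype)

first = varlen_view(0, 100)     # 100 float64s at the start of the file
second = varlen_view(800, 250)  # 250 more starting at byte 800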

Friday, May 7, 2010

optimization solvers in python

scikits has both 'optimization' and 'openopt' packages, though a cursory look at the repo for optimization (no docs) does not show anything obviously different from the simple stuff in scipy.optimize. openopt looks like a more extensive collection, put together by some ukrainian academics.
cvxopt looks like the other big player in the python world. the convex optimization that it advertises is (i think) more geared toward complex constraints that i'm not currently worried about. also, openopt wraps at least some cvxopt solvers.
i need to take a closer look to decide which solvers to try, but it is probably a good idea to start with ralg for domain-constrained problems. that's the only one that doesn't require the initial point to be inside the domain.
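the openopt calling convention, as far as i can tell from its docs, looks about like this (toy objective; treat the api details as unverified):
from openopt import NLP

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2     # toy objective
p = NLP(f, [5.0, 5.0], lb=[-2.0, -2.0], ub=[2.0, 2.0])  # x0 outside the box
r = p.solve('ralg')    # ralg is supposed to tolerate the infeasible start
print(r.xf)            # solution point
print(r.ff)            # objective value there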

Thursday, May 6, 2010

rietveld code review tool

codereview is a nice code review tool i'll try some time if i'm working on a foss project. it looks like it's written by guido, uses google app engine/django, and allows for patch submissions with comments, etc.