Thursday, May 20, 2010

kriging with numpy

scipy.interpolate.Rbf does radial basis function interpolation, and it has all the basis function types listed on p. 350 of 'A Taxonomy of Global Optimization Methods Based on Response Surfaces' (except that the scaling is uniform across dimensions and the kriging exponents are all fixed at 2). since the basis depends only on radius, any per-dimension scaling has to be applied to the variables before calling it. i would like a more convenient kriging interface with that sort of thing taken care of.
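for example, here's a minimal sketch of doing that pre-scaling by hand (the data and scale factors are made up; in practice the factors would come from a variogram fit or cross-validation):

    import numpy as np
    from scipy.interpolate import Rbf

    # toy scattered data in 2-d, with very different natural scales per axis
    x = np.random.uniform(0.0, 10.0, 20)
    y = np.random.uniform(0.0, 500.0, 20)
    z = np.sin(x) + 0.01 * y

    # Rbf only sees radius, so divide each dimension by its own length scale;
    # sx and sy are assumptions standing in for fitted ranges
    sx, sy = 1.0, 50.0
    rbf = Rbf(x / sx, y / sy, z, function='gaussian')

    # evaluation points have to be scaled the same way
    print(rbf(2.0 / sx, 250.0 / sy))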
okay, after a bit of testing, it looks like the rbf implementation in scipy needs some caution in use. with the default settings, the matrix it solves for the weights can very easily become ill-conditioned on real-life data (for example, two points much closer together than the rest of the points), and the results are horrible, without any warning. there is an optional smooth parameter that basically puts a floor under the eigenvalues of that matrix, but you have to be smart enough to guess the largest eigenvalue if you want to target a particular condition number. and, of course, the dense matrix solve is O(n**3), which becomes very impractical, very quickly. see below for some thoughts on this. it looks like hpgl has plenty of kriging available at c++ speed. it's a recent development with no easy_install and very few downloads (though it does have a debian package). i'll see how easy it is to install.
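here's a quick sketch of what i mean, using that optional smooth argument (which perturbs the diagonal of the interpolation matrix to regularize it; the data here are made up):

    import numpy as np
    from scipy.interpolate import Rbf

    # two nearly coincident points give the matrix two almost-identical
    # rows, so it is nearly singular and the weights can blow up
    x = np.array([0.0, 1.0, 2.0, 3.0, 3.0 + 1e-9])
    z = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

    bad = Rbf(x, z)               # solves the near-singular system silently
    ok = Rbf(x, z, smooth=1e-8)   # trades exact interpolation for stability

    xi = np.linspace(0.0, 3.0, 7)
    print(bad(xi))                # can be wildly wrong, with no warning
    print(ok(xi))                 # should stay near the data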
here are a couple of other links to check out:
(someday i should look up latent variable models with gaussian process models.)
i had another thought after the scipy rbf problems i ran into: why can't something like the fast multipole method be used with kriging? well, some people smarter than i am have thought about that, and i just read a good (surprisingly concise) paper on it: 'efficient kriging via fast matrix-vector products' by nargess memarsadeghi et al. her phd dissertation also looks interesting, with more on data fusion for remote sensing images and cokriging. the summary paper cites a 1991 article by l. greengard (an inventor of the fmm, with j. strain): 'the fast gauss transform', which applies fmm ideas to sums of gaussians, exactly the kernel sums that show up in gaussian process models. the computational complexity drops from O(m*n) for m evaluations from n sources to O(m+n), but the constant factor is exponential in the number of dimensions. an improved fgt, creatively named ifgt, was published in 2005 by yang, duraiswami, and davis; it is asymptotically polynomial in the number of dimensions while still comparable in lower dimensions. further development with raykar made the error bound tight enough to be practically useful, and refs 16 (article) and 24 (book) outline the implementation that memarsadeghi used. the method is derived for a squared-distance (gaussian) rbf, but i think it could be used for an exponent other than 2 if i'm careful with deriving the taylor series. (note: i think the multi-index notation in the equation just after equation 4 means that the sum over |\alpha| \le p-1 runs over the total degree of \alpha, i.e. it includes all polynomial terms up to a given order.)
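to make that note concrete, here's my paraphrase of the truncated factorization (my symbols, not the paper's: q_i are source weights, h the bandwidth, c the expansion center, p the truncation order). the gauss transform is

    G(y_j) = \sum_{i=1}^{n} q_i \, e^{-\|y_j - x_i\|^2 / h^2}

and splitting the exponential about a center c and expanding the cross term gives

    e^{-\|y-x\|^2/h^2} \approx e^{-\|y-c\|^2/h^2} \, e^{-\|x-c\|^2/h^2} \sum_{|\alpha| \le p-1} \frac{2^{|\alpha|}}{\alpha!} \left(\frac{x-c}{h}\right)^{\alpha} \left(\frac{y-c}{h}\right)^{\alpha}

where \alpha = (\alpha_1, \dots, \alpha_d) is a multi-index, |\alpha| = \alpha_1 + \cdots + \alpha_d, \alpha! = \alpha_1! \cdots \alpha_d!, and u^\alpha = u_1^{\alpha_1} \cdots u_d^{\alpha_d}. so the sum over |\alpha| \le p-1 really does pick up every mixed monomial of total degree less than p, which is how i read the equation after equation 4.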
there is also a description of the gtann algorithm (gt with nearest neighbors), which was developed as the FigTree method in raykar's 2007 phd dissertation. it is more effective when the gaussians have small ranges, whereas the ifgt is better for long-range gaussians. i'm not quite sure i understand the brief description, but it sounds like they just search for the sources within a radius of each target and sum only those, since everything farther away contributes negligibly. in their tests it's an order of magnitude faster than ifgt when the covariance distances are small (ifgt is generally about an order of magnitude faster than the exact gt), but it takes about the same time as the exact gt when the covariance distances are long.
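a toy version of that truncation idea, as i understand it (this is my sketch, not figtree's actual implementation): beyond a few bandwidths a gaussian term drops below any tolerance, so only sources inside a cutoff radius need to be summed.

    import numpy as np
    from scipy.spatial import cKDTree

    def truncated_gauss_transform(sources, weights, targets, h, tol=1e-6):
        # terms with exp(-(r/h)^2) < tol are dropped, so the cutoff radius is:
        r = h * np.sqrt(-np.log(tol))
        tree = cKDTree(sources)
        out = np.zeros(len(targets))
        for j, t in enumerate(targets):
            idx = tree.query_ball_point(t, r)   # sources within the cutoff
            if idx:
                d2 = np.sum((sources[idx] - t) ** 2, axis=1)
                out[j] = np.dot(weights[idx], np.exp(-d2 / h ** 2))
        return out

    # toy usage: 1000 random sources, 5 targets, short covariance range
    src = np.random.rand(1000, 2)
    w = np.random.rand(1000)
    tgt = np.random.rand(5, 2)
    print(truncated_gauss_transform(src, w, tgt, h=0.05))

the cost per target scales with the number of neighbors inside the cutoff, which is why this wins for short ranges and degrades toward the exact gt for long ones.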
they use the symmlq iterative solver from paige and saunders, 'solution of sparse indefinite systems of linear equations'. (cg can't be used, because the kriging system with the weight constraint is symmetric but not positive definite.) there's a python implementation of symmlq at stanford by michael saunders that he relicensed bsd so it could go into scipy. other parts of that package were either wrapped or translated (saunders thought a python version with a blas core would be just as fast), but symmlq is the only part still missing as of 31-03-2010. however, jeffery kline (of cvxopt fame) put a translation from matlab up on his site; it requires cvxopt, even though it is bsd-licensed. there's also a pykrylov library (lgpl!) on github with a symmlq function; it already uses numpy arrays instead of wacky cvxopt.matrix objects, so it might be easier to get running. there are a couple of fortran versions floating around out there, too.
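for reference, the system in question: ordinary kriging augments the covariance matrix with the unbiasedness constraint, giving a symmetric saddle-point matrix with a zero diagonal entry, hence indefinite. here's a sketch using scipy's minres, another paige-saunders method for symmetric indefinite systems that scipy does ship, standing in for symmlq (the covariance model and data are made up):

    import numpy as np
    from scipy.sparse.linalg import minres

    # toy 1-d data with an assumed gaussian covariance model
    x = np.array([0.0, 1.0, 2.5, 3.0])
    z = np.array([1.0, 2.0, 0.5, 1.5])
    cov = lambda d: np.exp(-d ** 2)

    # ordinary kriging system: [[K, 1], [1^T, 0]] [w; mu] = [k0; 1]
    n = len(x)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = cov(np.abs(x[:, None] - x[None, :]))
    A[:n, n] = A[n, :n] = 1.0   # constraint row/column; zero on the diagonal

    x0 = 1.7                    # prediction location
    b = np.append(cov(np.abs(x - x0)), 1.0)

    sol, info = minres(A, b)
    w = sol[:n]                 # kriging weights; sum to 1 by the constraint
    print(np.dot(w, z), w.sum())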
