Friday, October 30, 2009

outliers

just finished the book 'outliers' by malcolm gladwell. interesting, but first a disclaimer/critique: he is not a quantitatively oriented person. he makes some claims without backing them up with data, and other assertions are tenuously extrapolated from scant data. it's clear from the definition he gives both in the book and in later interviews that he does not understand the real meaning of the word 'outlier' in the statistical context. it is also clear that he has an agenda: to disprove the 'myth of individualism' in favor of 'the power of community' to explain personal success. it's true that talent and hard work are not enough for success; you also need opportunity. (well, duh.) but i think he tries to replace one silly straw man with another. he strongly emphasizes the opportunities, and how randomly or arbitrarily they occur for people, as the dominant factor determining success. but most opportunities don't just drop out of the sky into certain people's laps. it's precisely the people who are working hard, paying attention, and looking for opportunities who take advantage of them. maybe the author's mistake is the very common fallacy of confusing correlation with causation, then choosing a causal direction based on a preconceived notion. in that spirit, it is interesting to see the role these factors can play: people who are looking for those opportunities can increase their awareness and use it to their advantage.

estimating mutual information

just read a very interesting article on estimating mutual information from random variable samples. pretty much everything else i've seen on this subject is based on either histograms or kdes, so improving those algorithms comes down to tuning histogram or kernel parameters. 'estimating mutual information' by a. kraskov, h. stögbauer, and p. grassberger takes a unique approach based on nearest neighbors. it seems to do pretty well on few data points and on nearly independent sets. the paper also shows an interesting application to ica. certainly worth a try, especially if i'm comparing a number of approaches to mi estimation.

implementation notes: the main results to implement are equations 8 and 9. be careful that the definitions of n_x and n_y differ between 8 and 9, though they could be counted simultaneously (on the same pass). the paper points out that the norms in the marginal spaces need not be the same, or even act on the same space, so i could use the rank or any other transform (log is a popular one) to spread out some data or otherwise emphasize some parts, reducing estimation error without changing the theoretical result. compare fig 4 to fig 13; the exact value for I in the caption comes from eq 11. ref 34 goes into the uniqueness and robustness of the components that drop out of ica analysis. the paragraph under fig 17 has computed values from web-accessible data that could be used for testing code. it would be interesting to see if component 1 in fig 19 reflects phase differences in component 2 due to propagation delay. the second term of the second line in eq a4 equals 0, which confirms that reparameterization does not change mutual information. small values of k increase statistical errors while large k increases systematic errors, so it's probably best to try multiple ks and compare the trends to fig 4. the digamma function is implemented in scipy.special as psi.
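here's how i'd sketch eq. 8 with scipy's cKDTree (my own code, not from the paper: the function name and the slight radius shrink are mine; query_ball_point counts points at distance <= r, while the paper's n_x and n_y counts are strict):

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import psi  # the digamma function

    def ksg_mi(x, y, k=3):
        """eq. 8 estimate of I(X;Y); x is (n, dx), y is (n, dy)."""
        n = len(x)
        # eps(i): distance to the k-th nearest neighbor in the joint space (max-norm)
        xy = np.hstack([x, y])
        eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
        tx, ty = cKDTree(x), cKDTree(y)
        # n_x(i), n_y(i): points strictly closer than eps(i) in each marginal space
        nx = np.array([len(tx.query_ball_point(xi, e * (1 - 1e-10), p=np.inf)) - 1
                       for xi, e in zip(x, eps)])
        ny = np.array([len(ty.query_ball_point(yi, e * (1 - 1e-10), p=np.inf)) - 1
                       for yi, e in zip(y, eps)])
        return psi(k) + psi(n) - np.mean(psi(nx + 1) + psi(ny + 1))

correlated gaussians make a good sanity check, since the exact answer for correlation r is I = -(1/2) log(1 - r^2) (if i'm reading it right, that's what eq 11 gives for the fig 4 caption).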

Thursday, October 8, 2009

parallel python with ipython.kernel

the kernel module that comes with recent versions of ipython provides for interactive parallel python sessions. might have to check this out some time. this part of ipython is built on foolscap, a secure rpc framework. i wonder if it would make a good replacement for pyro. looks like it takes more up-front work, though.
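for my own reference, a minimal multiengine session would look something like this (written from memory of the docs and untested, so the exact names and the ipcluster incantation may be off):

    # start engines first with something like: ipcluster local -n 4
    from IPython.kernel import client

    mec = client.MultiEngineClient()   # connect to the running controller
    mec.get_ids()                      # engine ids, e.g. [0, 1, 2, 3]
    mec.execute('import numpy')        # run a statement on every engine
    mec.scatter('x', range(16))        # partition a sequence across engines
    mec.execute('y = [xi**2 for xi in x]')
    mec.gather('y')                    # gather the pieces back in order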

Tuesday, October 6, 2009

python packages

scikits has a number of packages paralleling the scipy effort, including some optimization and machine learning, audio and signal processing, a matlab wrapper, etc. fwrap looks to be a next-gen f2py; it makes interfaces to cython, c, and c++, but is still alpha. mpi4py provides a c++-like interface via cython; if i'm ever forced at gunpoint to use mpi again, this is what i will use. the mayavi people have made a recorder for use with the traits ui (from enthought) that generates a human-readable python script to reproduce gui actions. that sounds like a cheap and easy way to automate journal files without any effort on my part. maybe i could relax my principled stand against making guis in general....
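for the record, the mpi4py hello-world is about this short (standard mpi4py calls, though i'm writing it from memory):

    # run with e.g.: mpiexec -n 4 python hello.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # the lowercase methods (bcast, send, recv) pickle arbitrary python objects
    data = comm.bcast({'answer': 42} if rank == 0 else None, root=0)
    print 'rank %d of %d got %r' % (rank, size, data)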

bayesian inference books

travis oliphant (of enthought fame) recommends 'the algebra of probable inference' by richard t. cox and 'probability theory: the logic of science' by edwin t. jaynes as good references on using bayesian inference as a formalization of the scientific method (my interpretation). might have to check them out some time.

Monday, October 5, 2009

psuade

another uncertainty/optimization/sensitivity analysis package, this time from llnl. looks like they want to make their own built-in environment (why?! why?!) but at least it's gpl.

Friday, October 2, 2009

automatic debugging

saw some interesting work on automatic debugging. the idea is to use tests (a set of passing tests plus one failing test) to evaluate random code changes and find one that works. i don't really expect my computer to debug my code any time soon, but one interesting thought was that the ast nodes to change were weighted by positive and negative test coverage. maybe i could use coverage the same way to localize a bug: more likely in places covered by multiple negative tests, less likely in places covered by positive tests. that would help me find the bug, which is almost always the hardest part. refs: the spike black-box fuzzer from immunitysec.com, and strata dynamic binary translation (from virginia).
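a toy version of that weighting idea (entirely hypothetical: the weights and names here are made up, not from the paper) might look like:

    from collections import defaultdict

    def suspiciousness(coverage, passed):
        """coverage maps test -> set of covered lines; passed maps test -> bool."""
        score = defaultdict(float)
        for test, lines in coverage.items():
            weight = -1.0 if passed[test] else 2.0  # failing tests count for more
            for line in lines:
                score[line] += weight
        # highest score = most suspicious place to start hunting
        return sorted(score.items(), key=lambda item: -item[1])

    cov = {'t1': set([10, 11, 12]), 't2': set([10, 14]), 't3': set([11, 12])}
    ok = {'t1': True, 't2': True, 't3': False}
    print suspiciousness(cov, ok)  # lines 11 and 12 float to the top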

stable differentiation

found disappointingly little online about good numerical differentiation algorithms. the only thing i can think of is savitzky-golay filtering: scipy has a cookbook recipe, procoders has a page, and there's a crufty package out there i might steal from. i tried filtering a time series with a 3rd order butterworth before doing a simple central difference, and it made some huge errors at the beginning (probably the filter's startup transient). i hope the s-g filter works better.
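here's a bare-bones savitzky-golay derivative rolled from scratch (same idea as the cookbook recipe but my own code: least-squares polynomial fit per window, then read off the derivative at the center); the mirrored edges should at least dodge the startup-transient problem the butterworth had:

    import numpy as np

    def savgol_derivative(y, window=11, order=3, dt=1.0):
        """smoothed dy/dt; window must be odd and larger than order."""
        half = window // 2
        offsets = np.arange(-half, half + 1)
        # design matrix for fitting p(t) = c0 + c1*t + ... over each window
        A = offsets[:, None] ** np.arange(order + 1)
        # the derivative at the window center is c1: row 1 of the pseudoinverse
        weights = np.linalg.pinv(A)[1]
        # mirror the edges so the output keeps the input's length
        ypad = np.concatenate([y[half:0:-1], y, y[-2:-half - 2:-1]])
        return np.convolve(ypad, weights[::-1], mode='valid') / dt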

3d pdf objects

so i can now make 3d pdf objects, and i'd like to make them more useful. adobe put out some docs on their support for 3d, as well as javascript for 3d. they claim the possibility of animations, though i seem to recall from earlier reading that this works only via matrix transforms for rigid-body motion, not mesh deformation. i wonder if i can use insdljs.sty for the javascript....