Monday, November 30, 2009

printable diff

here are some handy commands for making printable landscape diff outputs with long (120 char) lines:

    diff -y --left-column --width=240 version1 version2 > /tmp/t
    a2ps --columns=1 -l 240 /tmp/t -o /tmp/t.ps

factorization of matrices with unknown elements

been thinking about a problem that can be represented as a factorization (like svd or ica) of a matrix when only some of the elements are known. clearly not a simple problem, but one that apparently comes up in image recognition. google turned up an interesting report from oxford from 5 years ago that gives a good review of the problem. good descriptions of a few algorithms, observations on optimization methods (that may or may not carry over to other applications), and some synthetic examples that show the effect of the distribution of known elements. the residual function uses a hadamard (elementwise) product with a mask matrix to represent the partial knowledge, although sec 4.3 points out that other forms might better incorporate prior knowledge.
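a minimal sketch of the masked residual idea, assuming numpy. the residual is the squared frobenius norm of W * (M - A B^T), where W is the 0/1 mask of known elements and * is the elementwise product. the fitting loop is plain alternating least squares, which is just one simple choice and not necessarily one of the algorithms the report recommends. the names masked_residual and masked_als, and the small ridge term, are my own for illustration.

```python
import numpy as np

def masked_residual(M, W, A, B):
    """Squared Frobenius norm of W * (M - A @ B.T), W being the 0/1 mask."""
    return float(np.sum((W * (M - A @ B.T)) ** 2))

def masked_als(M, W, rank, iters=50, ridge=1e-6, seed=0):
    """Fit M ~ A @ B.T using only the entries where W > 0.

    Alternating least squares: hold B fixed and solve for each row of A
    from the known entries in that row, then swap roles. The small ridge
    term (an assumption of this sketch) keeps each solve well posed when
    a row or column has few known entries.
    """
    rng = np.random.default_rng(seed)
    m, n = M.shape
    A = rng.standard_normal((m, rank))
    B = rng.standard_normal((n, rank))
    eye = ridge * np.eye(rank)
    for _ in range(iters):
        for i in range(m):
            known = W[i] > 0            # known entries in row i
            Bk = B[known]
            A[i] = np.linalg.solve(Bk.T @ Bk + eye, Bk.T @ M[i, known])
        for j in range(n):
            known = W[:, j] > 0         # known entries in column j
            Ak = A[known]
            B[j] = np.linalg.solve(Ak.T @ Ak + eye, Ak.T @ M[known, j])
    return A, B
```

the unknown elements never enter the solves, so whatever placeholder values sit in those slots of M have no effect on the result.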

more python testing tools

pester looks like something i should try. it uses mutation testing to test your tests, by seeing if there are changes to the code that still pass all tests. unfortunately, it hasn't been updated since 2002, even though jester (the java version) is more recent.

mutation testing is sorta related to fuzz testing, which feeds your code modified data until it chokes. fusil and peachfuzzer both look very feature-rich but complicated. peachfuzzer has a gag-me xml interface and fusil provides libs for writing python scripts that test cli programs.

svnmock might be a good tool to use if i write code that uses the svn python hooks. (been thinking about making my repository mail itself somewhere as a backup with each commit.) it also hasn't been updated in a while, though (2006 in this case).

pythoscope is a good tool to use if i ever start using unittest instead of just doctests. it will examine your code to build the start of a test file that you can fill in, almost as easily as a doctest. and it's been recently updated.

some day i should put a source checker like pylint or pychecker into my stream. maybe catch some problems before they start. clonedigger has more powerful tools than pylint for detecting duplication, for refactoring hints.

pytest is a test runner rumored to be buried in logilab-common. sounds similar to the script i wrote to run my doctests and do a little coverage analysis.

complexity analyzes cyclomatic complexity of python code, though it's written in perl and it doesn't seem to be actively developed.

pymetrics does cyclomatic complexity as well as loc and other metrics. written in python and looks pretty active. worth a try, i think.

pysizer does some interesting memory profiling, but it seems to be a little too alpha right now (only built-in types supported; usually these are the least of my memory troubles.) still, some nice features are already there to help trace where objects came from, who's using them, and plotting dep graphs. if only it could handle non built-in instances, this would be great.

guppy-pe is a combination of python dev tools. heapy is a memory profiler, and it seems to do basically the same thing as pysizer but is not limited to built-ins. gsl is a specification language, which i think i can skip as another example of how to javaify python. EDIT: hmm, i can't seem to get heapy to account for numpy arrays, although these people apparently did. sys.getsizeof() definitely does not. grrr.
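to make the mutation testing idea concrete, here is a toy sketch using only the standard library's ast module. it is not how pester works internally (i haven't checked); it just shows one mutation operator (swap + for -) and what it means for a mutant to survive a test. the function total and the two tests are made up for the example.

```python
import ast

SOURCE = """
def total(a, b):
    return a + b
"""

class SwapAddSub(ast.NodeTransformer):
    """Mutation operator: turn every binary + into -."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

def load(tree):
    """Compile a module tree and return its total() function."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["total"]

def weak_test(total):
    return total(0, 0) == 0      # true for both a + b and a - b

def strong_test(total):
    return total(2, 3) == 5      # only true for a + b

original = load(ast.parse(SOURCE))
mutated_tree = ast.fix_missing_locations(SwapAddSub().visit(ast.parse(SOURCE)))
mutant = load(mutated_tree)

def survives(test):
    """The mutant survives if the test passes on the original and on the mutant."""
    return test(original) and test(mutant)
```

a surviving mutant (survives(weak_test) here) is the signal that the test isn't really pinning down the behavior.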

Saturday, November 28, 2009

snaplogic for data integration

snaplogic is a tool for integrating data from disparate sources with a clean and simple interface. maybe i should look into it for dealing with some of my code that interacts with various formats of large datasets.

mock testing

been thinking about mock testing and i finally came to the point that i need it (to paper over some urllib calls). some googling around turned up a number of good links (an excellent list of python testing links, including mocking libs, etc., is at the cheesecake).

there are also a couple of controversies roiling out there that i don't really care about, such as the difference between stubs and mocks (i'll use what works for me, whatever it's called) and if mocks are inherently evil (use them at i/o boundaries with external resources that are expensive, unpredictable, or unavailable). some of the mock frameworks out there seem to be derived from java libs, which is an auto strike against in my book. (java coders seem always to decide beforehand that a simple task needs 10k loc, even if they're writing python.)

minimock looks interesting, and the only complaint i see for it is that it only works on doctests. i'm not sure that's even true anymore, and it's actively developed. since i only write doctests (i just can't see why it needs to be more complicated; another java artifact?), maybe it's exactly what i need. the examples on its website are easy to read and understand, and one of them shows how to use/mock smtp.

one issue that i might run into later is wrapping vs. complete mocking. wrappers let you use the real object part of the time or for part of its interface. this can avoid some of the evilness criticism that the mocked code could change and break things without breaking the tests. i can imagine that sometimes it's less complicated, sometimes it's more.
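a sketch of what papering over a urllib call can look like. note the swap: this uses unittest.mock from the standard library of later python 3 versions, not minimock, so treat it as an illustration of the idea rather than of minimock's api. fetch_first_line and the example url are made up. the last few lines show the wrapping flavor: a mock that records calls but delegates to the real object.

```python
import io
import urllib.request
from unittest import mock

def fetch_first_line(url):
    """Hypothetical code under test: return the first line of the page at url."""
    with urllib.request.urlopen(url) as response:
        return response.readline().decode("utf-8").strip()

def fake_urlopen(url):
    """Canned response standing in for the network."""
    return io.BytesIO(b"hello from the mock\nsecond line\n")

# complete mocking: the real urlopen is never called inside this block
with mock.patch("urllib.request.urlopen", side_effect=fake_urlopen) as fake:
    first_line = fetch_first_line("http://example.invalid/page")

# wrapping: calls are recorded, but the real object still does the work
real = io.BytesIO(b"abc")
wrapped = mock.Mock(wraps=real)
payload = wrapped.read()
```

after this runs, fake and wrapped.read both hold a record of how they were called, which is what the test assertions check.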

Wednesday, November 25, 2009

free linear algebra

here's a chart comparing free linalg code. handy.

pyret

python package for image deblurring. the demo page link is broken. deconvolution with svd/tikhonov regularization.

pyamg

a while ago i messed around with multigrid methods for solving sparse inverse problems. if i do it again, i'll try this: algebraic multigrid solvers in python.

text-to-speech on the iphone

surprising how little there is out there. i'm probably not looking in the right place. a number of people got flite running:

http://artofsystems.blogspot.com/2009/02/speech-synthesis-on-iphone-with-flite.html
http://www.voxtrek.com/
http://www.cmang.org/
http://www.embiggened.info/vocalizer.html

some of them have itunes links to buy the software, but they don't seem to be active. one is a navigation system, another is just flite with a little gui icing.

festival voices

had to try some different voices with festival, since the default one on the acer aspire sounded terrible. kal is the default on gentoo, and is the one i'm most used to. ked sounds pretty good, too; maybe i'll use it when i want to differentiate. both of these are diphone voices. all the arctic voices i listened to sounded pretty bad. to switch between them globally, just put this line into /etc/festival/siteinit.scm:

    (set! voice_default 'voice_kal_diphone)

Tuesday, November 24, 2009

execnet

execnet is another python remote execution package. looks kind of like a pythonic mpi with minimal setup overhead, since the sends and receives are manual. can give modules to remote instances, but you have to resolve import dependencies manually.

support vector machines with pyml

there are a number of svm codes out there, but the only non-swiged pure python one for which i've seen a third-party recommendation is pyml. decent general svm intro doc, too.

Information Theory, Inference, and Learning Algorithms

another book available online, along with its latex source and code in octave, perl, etc. might be a good ref on IT basics, neural nets, coding, compression, and monte carlo.

data clustering for screen scraping

i just had an idea for an application of unsupervised data clustering. a quick google popped up python-cluster. looks like it's been abandoned for a couple of years, but it has a hierarchical algorithm and (maybe) k-means. i might try it. also, scipy.cluster has kmeans and vector quantization, with self organized feature maps and other methods promised later.

the app is screen scraping web pages and trying to get the main content (an article, for example) without the ads, links, and other junk around the edges. i think it might be possible to look at each line (after tossing everything inside script tags) and separate the lines based on their length and the percent of the line inside html markup. the reasoning is that real content usually has long lines in the source and a small fraction of html taggage. i probably want to throw in line index as a third variable, since the lines i want will probably be close together.

another thing i could do is grab multiple pages from the same site: either multiple articles that should have the same format or multiple copies from different days for a frequently updated page. that would allow me to do two things. first, i could combine data points from multiple pages to get higher point density for the cluster detection. (might need to test if each individual page is sampled from the same distribution as the others, to throw out outliers.) second, i could detect identical lines to throw away as nonunique boilerplate.
the ubuntu python-mvpa package looks like it might fit the bill.
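a rough sketch of the feature extraction step described above, standard library only. the three features per line are line index, line length, and fraction of the line inside html tags; the resulting tuples are what would get handed to a clustering routine (scipy.cluster or whatever ends up working). the regexes are crude and the function names line_features and shared_lines are mine; real pages will need more care.

```python
import re

SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")

def line_features(html):
    """Return (line index, line length, markup fraction) for each non-blank line.

    Script blocks are removed first, so indices refer to the page
    after that removal, not to the raw source.
    """
    html = SCRIPT_RE.sub("", html)
    features = []
    for index, line in enumerate(html.splitlines()):
        line = line.strip()
        if not line:
            continue
        markup = sum(len(tag) for tag in TAG_RE.findall(line))
        features.append((index, len(line), markup / len(line)))
    return features

def shared_lines(pages):
    """Lines that appear in every page: candidate boilerplate to discard."""
    line_sets = [
        {line.strip() for line in page.splitlines() if line.strip()}
        for page in pages
    ]
    return set.intersection(*line_sets)
```

content lines should land in the long-length, low-markup corner of the feature space, and anything returned by shared_lines can be dropped before clustering.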

Wednesday, November 18, 2009

Software for Hidden Markov Models and Dynamical Systems

all the source and the whole build system for a book by andrew fraser is online. the code is written in python, and the typesetting uses rubber, the python wrapper for all things tex. not only is the subject matter interesting (an application chapter on obstructive sleep apnea), but i think it has some good examples of how to write a book or long report with python and latex. took a _long_ time to put everything together, but it generates the whole book. if i actually use it, i think i'll have to buy the book in gratitude. (had to comment out a \printnomenclature line to get latex to run. guess i have a bad nomencl.)

Tuesday, November 17, 2009

Flexible Algorithms for Image Registration

book by jan modersitzki that has an associated website with matlab (meh) software and other ref docs.

Scientific Computation

book from cambridge u press by gaston gonnet. i'm not sure the methods are extremely advanced, but it gives an interesting spread of applications, including protein structure and stock price prediction.