Tuesday, November 24, 2009

data clustering for screen scraping

i just had an idea for an application of unsupervised data clustering. a quick google popped up python-cluster. it looks like it's been abandoned for a couple of years, but it has a hierarchical algorithm and (maybe) k-means, so i might try it. scipy.cluster also has k-means and vector quantization, with self-organizing feature maps and other methods promised later.

the app is screen scraping web pages and trying to get the main content (an article, for example) without the ads, links, and other junk around the edges. i think it might be possible to look at each line (after tossing everything inside script tags) and separate the lines based on their length and the fraction of the line inside html markup. the reasoning is that real content usually has long lines in the source and only a small fraction of html taggage. i probably want to throw in line index as a third variable, since the lines i want will probably be close together.

another thing i could do is grab multiple pages from the same site: either multiple articles that should share the same template, or multiple copies of a frequently updated page from different days. that would let me do two things. first, i could combine data points from multiple pages to get higher point density for the cluster detection (though i might need to test whether each individual page is sampled from the same distribution as the others, to throw out outliers). second, i could detect identical lines across pages and throw them away as non-unique boilerplate.
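here's a rough sketch of the per-line features plus k-means using scipy.cluster.vq. the regex tag stripping, the particular features, and k=2 are just my guesses for illustration, not anything the library prescribes:

import re
import numpy as np
from scipy.cluster.vq import whiten, kmeans2

SCRIPT_RE = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]*>')

def line_features(html):
    # one row per source line: (length, markup fraction, normalized index)
    lines = SCRIPT_RE.sub('', html).splitlines()
    rows = []
    for i, line in enumerate(lines):
        stripped = TAG_RE.sub('', line)
        length = len(line)
        tag_frac = 1.0 - float(len(stripped)) / length if length else 0.0
        rows.append((length, tag_frac, float(i) / max(len(lines) - 1, 1)))
    return np.array(rows), lines

def cluster_lines(html, k=2):
    # whiten so raw length (hundreds of chars) doesn't swamp the two
    # 0..1 features, then k-means; content lines should land in the
    # cluster with long lines and a low markup fraction
    feats, lines = line_features(html)
    centroids, labels = kmeans2(whiten(feats), k, minit='points')
    return lines, labels

picking the "content" cluster could be as simple as taking the one whose centroid has the largest mean length, and the identical-line boilerplate check across pages could just be a set intersection on the raw lines before clustering.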
the ubuntu python-mvpa package looks like it might fit the bill.
