Thursday, December 23, 2010

dashboard and screen scraping

one thing that's been on my low-priority radar is a way to scrape through the complex flaming hoops that banks, credit cards, and investment brokerages put up, so i can have an auto dashboard showing me account balances and net worth at a glance.

mechanize looks like a nice package for performing many browser functions, including form interaction; probably the best of its kind i've found (and nice faq). however, it does mean writing a browsing session from scratch (read: lots of online debugging), and i'm not sure how well it can handle javascript, frames/windows, and all the other eye candy screen junk these sites like to throw at you.

someone out there recommended pyxpcom (combined with pydom in pythonext) as a way to do anything mozilla can. i think that must be true, since it seems to be just the pieces that mozilla-esque browsers are made of. it looks as powerful and as difficult to use as a build-your-own-ferrari kit.

the most promising option seems to be selenium, which is apparently merging with webdriver for version 2.0. it basically drives a real browser, but can also record and play back scripts in a variety of languages (including python). the webdriver type of interface seems to be the future of selenium, and it has the advantages of better navigation and less to install. it's written in java, but i think it can do python (though the docs are behind if so). so i'm not sure if i should just wait for an official release of 2.0, but it does look like selenium is what i'm after. here's the doc on using ide.

EDIT: did some more looking around with selenium, and wow! i love the ide/rc combo. i think i need to look at this blog post to get the most out of locators (css vs. xpath). some of the extra plugins for selenium-ide are worth getting, and the selenium.py module can apparently just be copied into the python path to use selenium-rc. 1.0.11 has firefox 4 support in the ide, but it's very recent (2011-04-12).
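
stepping back to the dashboard goal for a second: whichever tool ends up doing the scraping, the dashboard end of things is mostly string cleanup and a sum. here's a minimal sketch of that part, using only the standard library. the account names, the sample strings, and the "parentheses mean money owed" convention are all made-up assumptions, not anything a particular bank actually sends.

```python
from decimal import Decimal

# hypothetical scraped strings; real sites will format these differently
SCRAPED = {
    "checking": "$1,234.56",
    "brokerage": "$10,000.00",
    "credit card": "($250.10)",  # assumption: parentheses mean money owed
}


def parse_balance(text):
    """turn a scraped balance string into a signed Decimal."""
    cleaned = text.strip()
    negative = cleaned.startswith("(") or cleaned.startswith("-")
    digits = (
        cleaned.strip("()")
        .replace("-", "")
        .replace("$", "")
        .replace(",", "")
        .strip()
    )
    amount = Decimal(digits)
    return -amount if negative else amount


def net_worth(balances):
    """sum every parsed balance; debts count as negative."""
    return sum((parse_balance(t) for t in balances.values()), Decimal("0"))


def dashboard(balances):
    """one line per account, sorted by name, plus a net worth line."""
    lines = [
        "%-12s %12s" % (name, parse_balance(text))
        for name, text in sorted(balances.items())
    ]
    lines.append("%-12s %12s" % ("net worth", net_worth(balances)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(dashboard(SCRAPED))
```

Decimal instead of float keeps the cents exact, which seems like the right default for money.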
the selenium folks have put out a number of release candidates for v2; apparently the v2 release is coming summer 2011. no remote control javascript server is necessary for version 2 since it's integrated with webdriver. i need to know if the ide and python export will still work. right now i think python will work, but no ide yet (though 2.0 is probably backwards compatible, so it might run the code generated by the version 1 ide).

more selenium links: command locators, xpath/css/dom rosetta, css locators are faster than xpath, good info, stay up to date, good example.

managed to get selenium python bindings installed on a windows machine (not surprisingly, a bit more involved than on linux) with my epd python. had to manually download the tarball, run python setup.py install, and manually create the test dir structure that it would then complain about. maybe there's an option to make it skip tests, but the kludge was faster than looking that up. now i have selenium 2 with the webdriver interface, much better than rc!

and btw, my experiments confirm what others have said about locators: css is much faster than xpath, even on firefox. i've also found that, while the selenium ide is really good for getting started with the locators, it's often possible to find shorter, more informative, and likely more stable tags and ids by poking around in the html just a little rather than using the first thing that pops up in the ide table. so i'm not going to try to keep a drop-in interface to call into the ide-generated scripts; cut-n-paste of one-liners will be good enough for both dev and maintenance. still, there is tremendous value in starting with something that works, and that alone makes the ide worth the install.

some other things i've learned: the 'andWait' stuff is only relevant from the java interface. in python, there's no obvious way to keep running asynch while stuff is still loading. click, get, etc. only return to the python script once the page is fully loaded, so that can be a latency bottleneck. i did poke around and find a possible place to change that, but i'll see if i really need to.
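
to make the locator notes a bit more concrete, here's a rough sketch of the kind of helpers i mean. it assumes the find_element(by, value) call from current selenium python bindings (older releases spelled this differently), and the helper names read_text, time_lookup, and open_firefox are mine, not part of selenium. the helpers only need an object with get() and find_element(), so they can be tried out against a stand-in driver without a browser.

```python
import time

# locator strategy names; these are the strings that selenium's
# By.CSS_SELECTOR and By.XPATH constants stand for
CSS = "css selector"
XPATH = "xpath"


def read_text(driver, url, locator, by=CSS):
    """load a page and return the text of one element.

    driver.get() blocks until the page has loaded, which is the
    latency bottleneck mentioned above.
    """
    driver.get(url)
    return driver.find_element(by, locator).text


def time_lookup(driver, by, locator, repeats=20):
    """average seconds per find_element call on the already-loaded page."""
    start = time.perf_counter()
    for _ in range(repeats):
        driver.find_element(by, locator)
    return (time.perf_counter() - start) / repeats


def open_firefox():
    """start a real firefox session (needs selenium and firefox installed)."""
    # imported here so the helpers above work without selenium installed
    from selenium import webdriver

    return webdriver.Firefox()
```

with a real browser, something like driver = open_firefox(), then read_text(driver, url, "#balance"), then driver.quit() would be the whole session, where "#balance" is a made-up id standing in for whatever short, stable locator turns up in the page's html. calling time_lookup once with CSS and once with XPATH on the same element is a crude way to repeat the css vs. xpath comparison.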
