Friday, May 31, 2013

amazon aws/ec2, picloud

looks like amazon gives you up to a year to try out some free cpu time on their cloud computing nodes.
http://aws.amazon.com/free/terms/

also, their spot instances allow you to bid for time, rather than paying the fixed on-demand rates. looks like the discount is significant, if you can handle the unpredictability. the video below is a nice example of running a jenkins build slave on spot, too; it also refs princeton consultants and their optispotter, which helps smallish ($50mil) hedge funds find hft opportunities.
http://aws.amazon.com/ec2/spot-instances/
http://www.youtube.com/watch?v=-vAAuTs9iu4

picloud still looks like a good way to get started. can do fractional hours, and prices are comparable to ec2 on-demand. they allow you to create an environment on a virtual ubuntu, so you can install whatever you need as if you had a local filesystem.
http://aws.typepad.com/aws/2012/12/picloud-and-princeton-consultants-win-the-first-amazon-ec2-spotathon.html
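
for future reference, the client api was about this simple (a sketch from memory, so treat names and signatures as approximate; the credentials are made up):

import cloud
cloud.setkey(1234, 'my-api-secret')   # hypothetical credentials

def f(x):
    return x * x

jids = cloud.map(f, range(10))        # queue ten jobs on picloud
print(cloud.result(jids))             # block until done, collect results
# cloud.call/cloud.map also took an _env= argument to pick one of those
#  custom ubuntu environments, if i remember right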

Wednesday, April 17, 2013

jake vanderplas blog

http://jakevdp.github.io/

blog with lots of interesting python examples and demos. not just code, though; the guy seems plugged in to big-picture developments.

Saturday, April 13, 2013

python presentation videos

http://vimeo.com/pydata/videos/page:1/sort:date
http://pyvideo.org/category

no need to attend pycon or pydata, and no need to take notes if you do.

Friday, April 5, 2013

ipython parallel and acyclic graphs

heard an interesting tidbit from a talk by brian granger. apparently the ipython parallel kernel has the ability to take acyclic graph dependencies and intelligently distribute the computation. i need to look into that.
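
from what i can tell from the docs, the dependency api looks roughly like this (a sketch, assuming a running ipcluster; the task bodies are placeholders):

from IPython.parallel import Client

rc = Client()
view = rc.load_balanced_view()

a = view.apply(lambda: 'load')        # first task
with view.temp_flags(after=[a]):      # b won't be scheduled until a is done
    b = view.apply(lambda: 'clean')
with view.temp_flags(after=[b]):      # a -> b -> c, a little acyclic graph
    c = view.apply(lambda: 'fit')
print(c.get())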

Saturday, November 17, 2012

sobol sequences and python

tried the python sobol port below, and it seems to work just fine.

https://github.com/naught101/sobol_seq/blob/master/sobol_test_output.txt
http://people.sc.fsu.edu/~jburkardt/py_src/sobol/sobol.html
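
a quick smoke test against that expected output looks like this (function names are from burkardt's sobol.py; the sobol_seq repo seems to use the same ones):

from sobol import i4_sobol

seed = 0
for i in range(5):
    vec, seed = i4_sobol(2, seed)   # next 2-d quasirandom point, updated seed
    print(vec)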

modular toolkit for data processing

http://mdp-toolkit.sourceforge.net/

interesting project with a number of capabilities. python code for pca, ica, slow feature analysis, manifold learning methods ([Hessian] local linear embedding), classifiers, factor analysis, rbm, etc.

according to the 'intro to scipy' talk at pydata 2012, it has the fastest pca available in python (even if the interface is more difficult than scipy svd or sklearn.decomposition.PCA).
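
a sketch of what using it looks like (assuming the one-liner helper and the node interface from the mdp docs):

import mdp
import numpy as np

x = np.random.rand(500, 10)              # 500 observations of 10 variables
y = mdp.pca(x, output_dim=3)             # quick helper: top 3 principal components

node = mdp.nodes.PCANode(output_dim=3)   # the more flexible node interface
node.train(x)
y2 = node.execute(x)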

numba and cython

interesting comparison between numba and cython (and pure python). both are projects i want to keep an eye on.

http://jakevdp.github.com/blog/2012/08/24/numba-vs-cython/
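
the benchmark in that post is pairwise distances; the numba side is roughly this (the decorator spelling has changed across numba versions, so treat it as a sketch):

import numpy as np
from numba import jit

@jit   # compiles the loops to machine code on first call
def pairwise(X, D):
    n, m = X.shape
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(m):
                t = X[i, k] - X[j, k]
                d += t * t
            D[i, j] = d ** 0.5

X = np.random.rand(100, 3)
D = np.empty((100, 100))
pairwise(X, D)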

Friday, October 26, 2012

nuitka

http://www.nuitka.net/pages/overview.html

interesting project that generates c++ from python. not sure if it's ready for me to grab and use, without becoming a dev. but it's one to keep an eye on and file away with numba, shedskin, cython, py2exe, etc.

Friday, June 22, 2012

continuous integration

recently decided it was time to stop putting off trying continuous integration for software development. (i'm only a decade behind the times; not bad.)

since i mostly use python, buildbot was the first thing i looked at. apache gump and cruisecontrol also seemed like possibilities. but in the end i tried hudson, since i'd read it was easy to set up and use, and it really was. all i had to do was download the war file and run

java -jar .\hudson.war *>output.txt

(i had to redirect output so console blocking wouldn't make it wait for me to scroll or press a key.)

here are some motivational/informative quotes on ci:
wikipedia:
continuous integration -- the practice of frequently integrating one's new or changed code with the existing code repository -- should occur frequently enough that no intervening window remains between commit and build, and such that no errors can arise without developers noticing them and correcting them immediately.

martin fowler:
continuous integration doesn't get rid of bugs, but it does make them dramatically easier to find and remove. in this respect it's rather like self-testing code. if you introduce a bug and detect it quickly it's far easier to get rid of. since you've only changed a small bit of the system, you don't have far to look. since that bit of the system is the bit you just worked with, it's fresh in your memory -- again making it easier to find the bug. you can also use diff debugging -- comparing the current version of the system to an earlier one that didn't have the bug.

bugs are also cumulative. the more bugs you have, the harder it is to remove each one. this is partly because you get bug interactions, where failures show as the result of multiple faults -- making each fault harder to find. It's also psychological -- people have less energy to find and get rid of bugs when there are many of them...

if you have continuous integration, it removes one of the biggest barriers to frequent deployment. frequent deployment is valuable because it allows your users to get new features more rapidly, to give more rapid feedback on those features, and generally become more collaborative in the development cycle. this helps break down the barriers between customers and development -- barriers which i believe are the biggest barriers to successful software development.

paul duvall, cto, stelligent incorporated:
six anti-patterns:
infrequent checkins, which lead to delayed integration
broken builds, which prevent teams from moving on to other tasks
minimal feedback, which prevents action from occurring
receiving spam feedback, which causes people to ignore messages
possessing a slow machine, which delays feedback
relying on a bloated build, which reduces rapid feedback

Sunday, April 1, 2012

my own doctest runner

#!/cygdrive/c/Python27/python -i 'c:\crunch6SVN\python\pyTest.py'
#!/cygdrive/c/Python26_64/python  'c:\crunch6SVN\python\pyTest.py'
#!/usr/bin/env python
#!/usr/local/python/bin/python

# this works from powershell, but not from xterm or within spyder:
# C:\Python27\python ..\..\pyTest.py .\scanCoverage.py -g
# i think spyder puts in some trace hooks into pdb of its own
import sys
import os
#import ipdb
#dbg = ipdb.set_trace
from pdb import set_trace as dbg

def lineProfile(runStr,runContext={},module=None,moduleOnly=False):
    # with the run string set up, i can use cProfile to find the worst offenders
    import cProfile,pstats
    import line_profiler
    import sys
    prof = cProfile.Profile()
    #r = prof.runctx(runStr,{},{'p':p,'sout':sout})
    r = prof.runctx(runStr,{},runContext)
    # maybe use prof.dump_stats() to spit out to a file
    r = pstats.Stats(prof).strip_dirs().sort_stats('time').print_stats(5)

    #get line profiling on top 3 time hog functions
    ss = pstats.Stats(prof).sort_stats('time')
    # strip the extension so .py/.pyc/.pyo filenames compare equal
    #  (rstrip strips a character set, not a suffix)
    def b(fn): return os.path.splitext(fn)[0]
    if moduleOnly:
        # only show functions in this file
        hogs = [f[2] for f in ss.fcn_list if b(f[0])==b(__file__)][:3]
        ts = [ss.stats[f][2] for f in ss.fcn_list if b(f[0])==b(__file__)][:3]
    else:
        #hogs = [f[2] for f in ss.fcn_list][:3]
        hogs = ss.fcn_list[:3]
        ts = [ss.stats[f][2] for f in ss.fcn_list][:3]
    fts = [t/ss.total_tt for t in ts]
    # ignore any functions beyond what accounts for 80% of the time
    for i in range(len(fts)):
        if sum(fts[:i])>.8: break
    hogs,ts,fts = hogs[:i],ts[:i],fts[:i]
    hogs.reverse();ts.reverse();fts.reverse() # i want longest time last
    # can't line prof builtins, so take them out of the list
    fts = [f for f,h in zip(fts,hogs) if not h[0]=='~']
    ts = [t for t,h in zip(ts,hogs) if not h[0]=='~']
    hogs = [h for h in hogs if not h[0]=='~']
    # this probably won't work in pyTest:
    #fs = [[getattr(x,h) for x in locals().values() if hasattr(x,h)][0]
    #fs = [[getattr(x,h) for x in sys.modules.values() if hasattr(x,h)][0]
    # pstats only saves module filename, so match files and search within them
    # rstrip for .pyc, .pyo
    modules = [x.__file__.rstrip('oc') for x in sys.modules.values() if hasattr(x,'__file__')]
    indices = [modules.index(h[0].rstrip('oc')) for h in hogs]
    modules = [x for x in sys.modules.values() if hasattr(x,'__file__')]
    hogMods = [modules[i] for i in indices]
    # find functions/methods within module
    #     only searches down one level instead of a full tree search, so don't
    #      get too crazy with deeply nested defs
    fs = []
    for ln,h,m in zip(*zip(*hogs)[1:3]+[hogMods]):
        #import pdb;pdb.set_trace()
        if hasattr(m,h) and hasattr(getattr(m,h),'__code__') and getattr(m,h).__code__.co_firstlineno == ln: fs.append(getattr(m,h))
        else:
            for a in [getattr(m,x) for x in dir(m)]:
                if hasattr(a,h) and hasattr(getattr(a,h),'__code__') and getattr(a,h).__code__.co_firstlineno == ln:
                    fs.append(getattr(a,h))
                    break
    #fs = [[getattr(x,h) for x in runContext.values() if hasattr(x,h)][0]
    #      for h in hogs]
    lprof = line_profiler.LineProfiler()
    for f in fs: lprof.add_function(f)
    #stats = lprof.runctx(runStr,{},{'p':p,'sout':sout}).get_stats()
    stats = lprof.runctx(runStr,{},runContext).get_stats()
    for ((fn,lineno,name),timings),ft in zip(sorted(stats.timings.items(),reverse=True),fts):
       line_profiler.show_func(fn,lineno,name,stats.timings[fn,lineno,name],stats.unit)
       print 'this function accounted for \033[0;31m%2.2f%%\033[m of total time'%(ft*100)
    #import pdb;pdb.set_trace()


# monkey patches to allow coverage analysis to work
#     just a little disturbing that (as of 2.4) doctest and trace coverage
#      don't work together...
def monkeypatchDoctest():
    # stolen from http://coltrane.bx.psu.edu:8192/svn/bx-python/trunk/setup.py
    #
    # Doctest and coverage don't get along, so we need to create
    # a monkeypatch that will replace the part of doctest that
    # interferes with coverage reports.
    #
    # The monkeypatch is based on this zope patch:
    # http://svn.zope.org/Zope3/trunk/src/zope/testing/doctest.py?rev=28679&r1=28703&r2=28705
    #
    try:
        import doctest
        import pdb   # needed by the set_trace/set_continue overrides below
        _orp = doctest._OutputRedirectingPdb
        class NoseOutputRedirectingPdb(_orp):
            def __init__(self, out):
                self.__debugger_used = False
                _orp.__init__(self, out)

            def set_trace(self):
                self.__debugger_used = True
                #_orp.set_trace(self)
                pdb.Pdb.set_trace(self)

            def set_continue(self):
                # Calling set_continue unconditionally would break unit test coverage
                # reporting, as Bdb.set_continue calls sys.settrace(None).
                if self.__debugger_used:
                    #_orp.set_continue(self)
                    pdb.Pdb.set_continue(self)

        doctest._OutputRedirectingPdb = NoseOutputRedirectingPdb
    except:
        raise #pass
    return doctest

def monkeypatchTrace():
    import trace
    try:
        t = trace.Trace
        class NoDoctestCounts(t):
            def results(self):
                # doctests execute in pseudo-files like '<doctest foo[0]>';
                #  throw those 'files' away before reporting
                self.counts = dict((k,v) for k,v in self.counts.items()
                                   if not k[0].startswith('<'))
                return t.results(self)
        trace.Trace = NoDoctestCounts
    except:
        raise
    return trace

# the option parsing that originally sat here was lost; a minimal
#  reconstruction (the flag names are guesses, -g from the comments above):
doctest = monkeypatchDoctest()
trace = monkeypatchTrace()
debug,coverage,profile = '-g' in sys.argv,'-c' in sys.argv,'-p' in sys.argv
for n in sys.argv[1:]:
    if n.endswith('.py'):
        # python >= 2.6 will not allow import by filename
        # i should refactor the whole thing to use imp module
        sys.path.insert(0,os.path.dirname(n))
        n = os.path.splitext(os.path.basename(n))[0]
    if not n.startswith('-'):
        if True:#try:
            if debug:
                # __import__ needs a non-empty fromlist if it's a submodule
                if '.' in n:
                    try: m = __import__(n,None,None,[True,])
                    except ImportError: # just run doctests for an object
                            modName = '.'.join(n.split('.')[:-1])
                            #objName = n.split('.')[-1]
                            m = __import__(modName,None,None,[True,])
                            #doctest.run_docstring_examples(m.__dict__[objName],m.__dict__,name=objName)
                            doctest.debug(m,n,True)
                            import sys
                            sys.exit()
                else: m = __import__(n)
                for i in m.__dict__.values():
                    import abc
                    # if it's a class (from a metaclass or metametaclass) or function
                    if type(i) == type or type(i) == abc.ABCMeta or \
                       (type(type(i)) == type and hasattr(i,'__name__')) \
                       or type(i) == type(lineProfile):
                        try:
                            print 'Testing',i.__name__
                            doctest.debug(m,n+'.'+i.__name__,True)
                        except ValueError:
                            print 'No doctests for', i.__name__
            else:
                import pdb
                if coverage:
                    #### need a better way to get module filenames without
                    #     importing them. (after initial import, the class and
                    #     def lines will not be executed, so will erroneously
                    #     be flagged as not tested.)
                    #d,name = os.path.split(m.__file__)
                    d,name = '.',n
                    #bn = trace.fullmodname(name)
                    bn = name.split('.')[-1]
                    # ignore all modules except the one being tested
                    ignoremods = []
                    mods = [trace.fullmodname(x) for x in os.listdir(d)]
                    for ignore,mod in zip([bn != x for x in mods], mods):
                        if ignore: ignoremods.append(mod)
                    tracer = trace.Trace(
                        ignoredirs=[sys.prefix, sys.exec_prefix],
                        ignoremods=ignoremods,
                        trace=0,
                        count=1)
                    if '.' in n:
                        tracer.run('m = __import__(n,None,None,[True,])')
                    else: tracer.run('m = __import__(n)')
                    tracer.run('doctest.testmod(m)')
                    r = tracer.results()
                    r.write_results(show_missing=True, coverdir='.')
                else:
                    # __import__ needs a non-empty fromlist if it's a submodule
                    if '.' in n:
                        try: m = __import__(n,None,None,[True,])
                        except ImportError: # just run doctests for an object
                            modName = '.'.join(n.split('.')[:-1])
                            objName = n.split('.')[-1]
                            m = __import__(modName,None,None,[True,])
                            doctest.run_docstring_examples(m.__dict__[objName],m.__dict__,name=objName)
                            import sys
                            sys.exit()
                    else:
                        #import pdb; pdb.set_trace()
                        m = __import__(n)
                    # dangerously convenient deletion of any old coverage files
                    try: os.remove(trace.modname(m.__file__)+'.cover')
                    except OSError: pass
                    # need to call profile function from the doctest
                    # so that it can set up the context and identify the run string, because anything not passed back will get garbage collected
                    # and there's no way to pass anything back
                    # but how can i call something within pyTest from the doctest string? some kind of callback?
                    # i want pyTest to decide if it gets called, so i can switch from the command line

                    doctest.testmod(m)
                    if profile:
                        runStr,runContext = m._profile()
                        lineProfile(runStr,runContext,m)
        else:#except Exception,e:
            print 'Could not test '+n
            print e
            raise e

q = quit
from sys import exit as e
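
for my own future reference, invocation looks roughly like this (the -g/-c/-p spellings come from the reconstructed option handling above, so double-check):

python pyTest.py scanCoverage.py        # run the module's doctests
python pyTest.py scanCoverage.py -g     # step through doctests in pdb
python pyTest.py scanCoverage.py -c     # doctests plus a coverage report
python pyTest.py scanCoverage.py -p     # doctests plus line profiling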

Wednesday, January 11, 2012

debugging c++ extensions to python

trying to debug a python extension module written in c++ (wrapped with swig). i think this would be so much easier if i were using gcc, but python is built with msvc... setup.py wants the debug versions of the python libs, but i don't have them and don't really want to try to build python from scratch right now. these refs seem relevant:
http://www.velocityreviews.com/forums/t677466-please-include-python26_d-lib-in-the-installer.html
http://vtk.org/gitweb?p=VTK.git;a=blob;f=Wrapping/Python/vtkPython.h;h=9d01ac21bafae0a24252398f268b6b3563df62cd

Wednesday, September 14, 2011

python/c++ with microsoft visual c++

finally got a 64 bit pyd python extension working, compiled with ms visual c++ and visual studio for epd on windows. figured out that the version string in python (MSC v.1500 64 bit (AMD64)) was for visual c++ 2008 == v9.0, and the express edition only builds 32 bit. so i had to get the sdk (version 7 works with 2008) and make sure the amd64 stuff got installed with it. some notes say to install the service pack before the 64 bit stuff, but i didn't find this necessary.

i ran the 'Windows SDK Configuration Tool' from the start menu, since it sounded logical, and ticked the box to link the sdk with VC 2008. not sure if that was necessary or not. one change i had to kludge manually was changing the references in C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\vcvarsall.bat under the amd64 label. originally they were pointing to "%~dp0bin\amd64\vcvarsamd64.bat"; they need to be "%~dp0bin\vcvars64.bat".

at that point, running vcvarsall.bat amd64 should work, and a simple use of weave passes:

import scipy.weave as w
c = w.inline(r'printf("hi.");',verbose=2)

now that the compiler, etc., are set up, i can use swig and distutils to build bigger extensions (like in the swig docs), and it Just Works!
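
for the record, the swig + distutils route is about this much boilerplate (module and file names are hypothetical; distutils runs swig on the .i file itself):

from distutils.core import setup, Extension

setup(name='example',
      ext_modules=[Extension('_example',
                             sources=['example.i', 'example.cpp'],
                             swig_opts=['-c++'])])

then python setup.py build_ext --inplace from the vcvarsall-initialized shell drops _example.pyd next to the generated example.py.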

Saturday, May 7, 2011

karnickel: macros in python

not sure when exactly a macro would be useful in python. i remember seeing such a thing in some cython code, to deal with a c++ template, though karnickel deals with the python ast so probably not useful for that. but, there it is, if the need arises.

Thursday, May 5, 2011

python embedded in gdb

version >=7 of gdb has an embedded python interpreter. here's a tutorial on it. very handy if i need to debug c or c++. i'm guessing a million new debugger guis will be built on top of this.
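
to get a feel for it, a tiny custom command looks something like this (a sketch based on the gdb python api docs; load it with 'source whereami.py' at the gdb prompt):

import gdb

class WhereAmI(gdb.Command):
    """print the name of the currently selected frame's function"""
    def __init__(self):
        super(WhereAmI, self).__init__('whereami', gdb.COMMAND_USER)
    def invoke(self, arg, from_tty):
        gdb.write(str(gdb.selected_frame().name()) + '\n')

WhereAmI()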

Wednesday, February 16, 2011

pybtex

if i ever need to mess around with beastly bst (bibtex style) files, i won't. i'll switch to pybtex, which looks like it's a lot more fun to use.

Thursday, January 6, 2011

windows python in cygwin

finally solved a problem (or found a workaround, at least) for something that had bothered me for a while: when i tried to use windows python (not cygwin python, which worked fine) in an xterm, it seemed not to be connected to stdout, stderr, and stdin. neither the interpreter nor the debugger prompt would show up, and nothing happened when i used print or sys.stdout.write. the mysterious thing was that it would work from a non-x cygwin shell, but i needed mouse action on the desktop and in screen (which uses a text-based x windows server) remotely.

turns out the problem is how cygwin interfaces with a non-cygwin console app from the terminal. it talks to it through pipes rather than with a real pty, and the issues there are deep and woolly. so all these windows programs are buffering in the pipe, not realizing how impatient i'm getting on the other end.

fortunately, python has an easy workaround: the -i option makes it assume interactivity, skipping the tty check. i can use it on the cli or in a #! shebang, and now it's working. the only problem is that it drops me into an interpreter when the script finishes, so i have to type quit() (c-d, c-z, c-c are all ignored). ref here
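
for reference, the workaround in both forms (paths are from my setup; adjust to taste):

C:\Python27\python -i myscript.py

#!/cygdrive/c/Python27/python -i
# -i skips the tty check, so output reaches the xterm through cygwin's pipes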

parallel, numpy, shared memory,...

trying to figure out how to do parallel processing efficiently with python, and numpy in particular. i want something simple, closely related to the original serial code (sorry, mpi, you're not welcome here).

parallelpython holds some promise, dodging the gil by starting separate interpreters and piping pickles back and forth. similar to pyro, and it looks pretty seamless between smp and cluster. unfortunately, pp does not provide for any shared mem, so big data (even read only) must be copied (and pickled!) on smp.

multiprocessing is now built in to 2.6 and backported as far as 2.4 or 2.3. it doesn't handle remote processes, though the pp/pyro-type pickle server (manager) interfaces with inet ports. i think it basically forks the process to make the worker processes, so you get less overhead (os service vs. cranking up a new python), and there's no need to feed it modules or any other globals; these get copied on the fork. it has some capability to share memory, though i think these are only kinda raw ctype buffers. (i think all of this is similar to the approach posh used, though more generally for user-defined types -- high quality hackery but unmaintained since 2003.)

apparently some people have coaxed numpy into using these ctype arrays to make np arrays sit in shared memory land, with views available to the children. (maybe using this sort of thing.) the approach got an attaboy from the big man himself, travis oliphant, but (in the same dir) sturla has a sharedmem module written later (cleaned up and posted here) that looks like it makes lower level sys calls to create shared memory space manually. does that mean the multiprocessing shm is unsatisfactory? the paper does warn that it's a moving target, and the scipy cookbook indicates the same thing: 'this page was obsolete as multiprocessing's internals have changed.'

epd has a webinar coming up promising to demo multiprocessing with large arrays, so maybe i should see what they do. anyway, if i do use this for parallel stuff, this blog post might be useful. here's another page that looks very useful for multiprocessing. fabric looks interesting, too, though more geared toward sysadmin stuff. maybe similar to posh in some ways.
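
here's a minimal sketch of the ctypes-buffer trick with stdlib multiprocessing (my own toy example, not from any of the links above): the workers get a numpy view of the shared buffer instead of a pickled copy.

import multiprocessing as mp
import numpy as np

def colsum(raw, shape, i, out):
    a = np.frombuffer(raw, dtype=np.float64).reshape(shape)  # a view, not a copy
    out.put((i, a[i].sum()))

if __name__ == '__main__':
    shape = (4, 100000)
    raw = mp.RawArray('d', shape[0] * shape[1])   # unsynchronized shared buffer
    a = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    a[:] = np.random.rand(*shape)                 # fill through the numpy view
    out = mp.Queue()
    ps = [mp.Process(target=colsum, args=(raw, shape, i, out))
          for i in range(shape[0])]
    for p in ps: p.start()
    results = dict(out.get() for _ in ps)
    for p in ps: p.join()
    print(results)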

Thursday, December 23, 2010

dashboard and screen scraping

one thing that's been on my low-priority radar is a way to scrape through the complex flaming hoops that banks, credit cards, and investment brokerages put up, so i can have an auto dashboard showing me account balances and net worth at a glance.

mechanize looks like a nice package for performing many browser functions, including form interaction; probably the best of its kind i've found (and a nice faq). however, it does mean writing a browsing session from scratch (read: lots of online debugging), and i'm not sure how well it can handle javascript, frames/windows, and all the other eye-candy screen junk these sites like to throw at you.

someone out there recommended pyxpcom (combined with pydom in pythonext) as a way to do anything mozilla can. i think that must be true, since it seems to be just the pieces that mozilla-esque browsers are made of. as powerful and difficult to use as a build-your-own-ferrari kit.

i think the most promising option is selenium, which is apparently merging with webdriver for version 2.0. it basically drives a real browser, but can record and play back scripts in a variety of languages (including python). the webdriver type of interface seems to be the future of selenium, and it has the advantages of better navigation and less to install. written in java, but i think it can do python (though the docs are behind if so). so i'm not sure if i should just wait for an official release of 2.0, but it does look like selenium is what i'm after. here's the doc on using the ide.

EDIT: did some more looking around with selenium, and wow! i love the ide/rc combo. i think i need to look at this blog post to get the most out of locators (css vs. xpath). some of the extra plugins for selenium-ide are worth getting, and the selenium.py module can apparently just be copied into the python path to use selenium-rc. 1.0.11 has firefox 4 support in the ide, but it's very recent (2011-04-12). they have put out a number of rcs for v2; apparently the v2 release is coming summer 2011. no remote-control javascript server is necessary for version 2 since it's integrated with webdriver. i need to know if the ide and python export will still work. right now i think python will work, but no ide yet (though 2.0 is probably backwards compatible, so it might run the code generated by the version 1 ide). more selenium links: command locators, xpath/css/dom rosetta, css locators are faster than xpath, good info, stay up to date, good example.

managed to get the selenium python bindings installed on a windows machine (not surprisingly, a bit more involved than on linux) with my epd python. had to manually download the tar ball, python setup.py install, and manually create the test dir structure that it would then complain about. maybe there's an option to make it skip tests, but the kludge was faster than looking that up. now i have selenium 2 with the webdriver interface, much better than rc!

and btw, my experiments confirm what others have said about locators: css is much faster than xpath, even on firefox. i've also found that, while the selenium ide is really good for getting started with the locators, it's often possible to find shorter, more informative, and likely more stable tags and ids by poking around in the html just a little, rather than using the first thing that pops up in the ide table. so i'm not going to try to keep a drop-in interface to call into the ide-generated scripts; cut-n-paste of one-liners will be good enough for both dev and maintenance.

still, there is tremendous value in starting with something that works, and that alone makes the ide worth the install. some other things i've learned: the 'andWait' stuff is only relevant from the java interface. in python, there's no way to keep running asynch while stuff is still loading. click, get, etc. only return to the python script once the page is fully loaded, so that can be a latency bottleneck. i did poke around and find a possible place to change that, but i'll see if i really need to.
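
and for the record, driving a page with the webdriver bindings and css locators is about this simple (the url and selectors here are made up for illustration):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.example.com/login')   # blocks until the page loads
driver.find_element_by_css_selector('#user').send_keys('me')
driver.find_element_by_css_selector('#pass').send_keys('secret')
driver.find_element_by_css_selector('form input[type=submit]').click()
print(driver.title)
driver.quit()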

Friday, December 10, 2010

getting spyder and python to work on windows 7

had to struggle to get my python install working with numpy and spyder, probably because i copied them over from another install.
with spyder, i had to move the 2.0.0beta5 egg and give the --prefix option to setup.py in order to install and use 2.0.3.
with numpy, i was getting 'ImportError: DLL load failed: The specified module could not be found.' when i tried to import numpy, _unless_ i was in the Python26_64\Scripts dir. i think it's because of the mkl lapack dlls in there. it all worked once i added that dir to my PATH (not PYTHONPATH) with the help of cygwin -- just using export in bash let me peek at what the path would be in windowsese via os.environ['PATH'], so i could put it into the env editor in spyder. voila!
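
equivalently, from inside python before the import (the dll dir is from my install; yours will differ):

import os
# PATH (not PYTHONPATH) is what windows searches for dlls
os.environ['PATH'] = r'C:\Python26_64\Scripts' + os.pathsep + os.environ['PATH']
import numpy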

Monday, November 22, 2010

thinking in c++

both volumes of the 2nd edition are freely available online, as is a draft version of 'thinking in python'. they're all a bit old, though they look like worthwhile reads.