Quick parallelism in Python

This post gives a demonstration of how parallelism can be introduced to a Python program to increase performance, and in particular how this can be done in very few lines of code.

Introduction

I’ve been working on a project that involves calculating a “sentiment” value for a webpage; I will be blogging the results very soon. Evaluating a few dozen pages was taking over a minute, so I looked for ways to speed it up.

Given the nature of the problem (each webpage can be evaluated independently of the others), it made sense to look for a way to parallelise this operation. This was achieved with only a very minor re-shuffle of the code. The example here addresses a much simplified problem, but the parallelisation is performed in the same way.

Example problem

The illustrative example used here is finding the size of the landing page of various websites. Some starter code is below. I want to call the get_site_size() function for each website, then send the output to print_results(). The print function expects a total, being the sum of all the sizes, and the details for each site as a (url, size) tuple.

import urllib2

sites = ['www.google.com', 'www.bing.com', 'www.yahoo.com', 'www.wikipedia.org',
         'news.bbc.co.uk', 'www.cnn.com', 'www.theguardian.com', 'www.aljazeera.com',
         'www.theonion.com', 'www.bloomberg.com', 'www.reuters.com', 'www.nytimes.com',
         'www.liberation.fr', 'www.lemonde.fr', 'www.tdg.ch', 'www.24heures.ch']

def get_site_size(url):
    # Download the landing page and return its size in bytes
    return len(urllib2.urlopen('http://%s' % url).read())

def print_results(total, details):
    # details is a list of (url, size) tuples, printed in ascending order of size
    print 'Total size is %s' % total
    for r in sorted(details, key=lambda x: x[1]):
        print '%s - %s' % r

A serial solution

To achieve this in Python is straightforward enough using the code below.

def serial():
    # Fetch each site in turn, pairing the url with its page size
    res = [(url, get_site_size(url)) for url in sites]
    total = sum([x[1] for x in res])
    print_results(total, res)

The use of list comprehensions to iterate across lists and build up results is compact and idiomatic. However, on my set-up it takes around 10 seconds, and it doesn’t lend itself to parallelisation. The output is as below.

Total size is 3616211
www.google.com - 20162
www.wikipedia.org - 54537
www.bing.com - 74890
www.theonion.com - 90314
www.cnn.com - 104371
www.reuters.com - 106289
www.aljazeera.com - 158480
www.24heures.ch - 169361
www.nytimes.com - 176633
www.tdg.ch - 181578
news.bbc.co.uk - 209822
www.liberation.fr - 276102
www.lemonde.fr - 395189
www.yahoo.com - 416647
www.theguardian.com - 559838
www.bloomberg.com - 621998

Parallel processing in Python

Parallel processing in Python is constrained by the Global Interpreter Lock (GIL), which prevents more than one thread in a process from executing Python bytecode at the same time. However, parallelisation is still possible in two respects.

First, although Python bytecode cannot be executed in parallel within a single process, blocking operations such as IO release the GIL while they wait, so other threads can make progress in the meantime.

Secondly, while parallelisation of bytecode isn’t possible across threads in the same Python process, it can be achieved by launching multiple Python processes, each with its own interpreter and GIL.

The nice thing is how Python can take care of the mechanics of the parallelisation for you.
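To see the first point in action, here is a minimal sketch (separate from the webpage example) that uses time.sleep() as a stand-in for a blocking IO call. Because sleeping, like waiting on a socket, releases the GIL, the two threads finish in roughly one second rather than two.

import threading
import time

def blocking_task():
    # time.sleep() releases the GIL while it waits, just as socket IO does
    time.sleep(1)

start = time.time()
threads = [threading.Thread(target=blocking_task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print 'Elapsed: %.2fs' % (time.time() - start)   # roughly 1 second, not 2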

A functional refactor

Before introducing parallel execution we apply a small refactor to the code. Recall that the initial implementation looked like this.

def serial():
    res = [(url, get_site_size(url)) for url in sites]
    total = sum([x[1] for x in res])

We can change this to move away from the list comprehensions and instead use the built-in functions map, reduce and zip, which take their inspiration from functional programming languages.

def functional():
    res = zip(sites, map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)

The important change here is to introduce the map function. This performs a very similar role to the comprehension in that it iterates across a set of inputs and performs an operation on them, returning the results as a list.

The results are now combined with the urls into tuples using the zip function, rather than in place via the comprehension.

The sizes are then summed using the reduce function. There’s no real reason for doing this other than that it’s a more functional way of doing the same thing as the original sum statement. Note that despite the linguistic similarity, and the parallelisation opportunities, this is not the same as the MapReduce model, although map and reduce were an influence on it.
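As a quick aside, here is what these three functions do on a toy input (the values are made up purely for illustration):

lengths = map(len, ['a', 'bb', 'ccc'])           # [1, 2, 3]
pairs = zip(['a', 'bb', 'ccc'], lengths)         # [('a', 1), ('bb', 2), ('ccc', 3)]
total = reduce(lambda x, y: x + y[1], pairs, 0)  # 6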

The performance of these two methods is much the same.
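A simple way to compare them (not necessarily how the figures quoted in this post were produced) is to wrap each call with a timer:

import time

def timed(fn):
    # Run fn once and report the elapsed wall-clock time
    start = time.time()
    fn()
    print '%s took %.2fs' % (fn.__name__, time.time() - start)

timed(serial)
timed(functional)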

Introducing parallelisation

To introduce parallel processing we can now change the code to be as below. All we’ve done is create a multiprocessing pool and use its map method instead of the built-in map function.

from multiprocessing import Pool

def parallel_process():
    # Pool() starts one worker process per CPU by default
    pool = Pool()
    res = zip(sites, pool.map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)

For a small change this does an awful lot under the covers. The Pool represents a set of Python processes that can be used to execute code in parallel. By default it creates one process per CPU on the host machine, 4 in the case of my system. Now when we execute the map function the framework takes care of farming out the work to the sub-processes and collecting the results.
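If you want to control the number of workers yourself, Pool takes an optional processes argument. It’s also worth putting the entry point behind an if __name__ == '__main__' guard, because multiprocessing re-imports the module in each worker process (this is required on Windows). A minimal sketch, with parallel_process_explicit being an illustrative name rather than part of the original code:

from multiprocessing import Pool, cpu_count

def parallel_process_explicit():
    # Same as parallel_process(), but with the worker count spelled out
    pool = Pool(processes=cpu_count())
    res = zip(sites, pool.map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)
    print_results(total, res)

if __name__ == '__main__':
    parallel_process_explicit()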

With a very small change the parallel processing can be implemented using threads rather than sub-processes.

from multiprocessing.pool import ThreadPool

def parallel_thread():
    # Same API as Pool, but the workers are threads in this process
    pool = ThreadPool()
    res = zip(sites, pool.map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)

Again, the framework takes care of all the thread management behind the scenes.
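One housekeeping detail glossed over in these snippets: neither pool is shut down explicitly. For a short script that exits straight away this doesn’t matter much, but in a longer-running program you would normally close and join the pool once the work is done, along these lines:

pool = ThreadPool()
try:
    res = zip(sites, pool.map(get_site_size, sites))
finally:
    pool.close()   # no more work will be submitted to the pool
    pool.join()    # wait for the workers to finish and be cleaned up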

Results

Since this test depends on download times from external websites there is inevitably quite a bit of variation in the timings. However, in general:

  • The serial execution was 2 to 3 times slower than parallel execution.
  • There was not much difference between the parallel-process and parallel-thread performance.

The similarity in sub-process and threading performance suggests, as expected, that the work is IO bound, so GIL contention is less of an issue. Just to prove the point I replaced the get_site_size() call with sum(xrange(100000000)). Sure enough, now that the processing was CPU bound, the sub-process implementation was much quicker than the multi-threaded one, by about a factor of three. A sketch of that substitution is below.
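For reference, the CPU-bound substitution looks roughly like this (cpu_task and the two function names are illustrative, not from the original code):

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def cpu_task(_):
    # Pure bytecode work: only one thread at a time can execute this
    return sum(xrange(100000000))

def parallel_process_cpu():
    return Pool().map(cpu_task, range(4))        # spreads the work across CPUs

def parallel_thread_cpu():
    return ThreadPool().map(cpu_task, range(4))  # threads all contend for one GIL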

Conclusion

It’s no surprise that the performance of network-IO-bound processing can be improved through parallelisation. What’s impressive is how easily this can be done in Python, especially if the initial implementation already makes use of the map function.
