This post demonstrates how parallelism can be introduced to a Python program to improve performance, and in particular how little code it takes.
Introduction
I’ve been working on a project that involves calculating a “sentiment” value for a webpage; I’ll be blogging the results very soon. Evaluating a few dozen pages was taking over a minute, so I looked for ways to speed it up.
Since each webpage can be valued independently, it made sense to look for a way to parallelise the operation. This was achieved with only a very minor reshuffle of the code. The example here addresses a much simplified problem, but the parallelisation is performed in the same way.
Example problem
The illustrative example used here is finding the size of the landing page for various websites. Some starter code is below. I want to use the get_site_size() function for each website, then send the output to print_results(). The print function expects a total, being the sum of all sizes, and the details for each site as a (url, size) tuple.
import urllib2

sites = ['www.google.com', 'www.bing.com', 'www.yahoo.com', 'www.wikipedia.org',
         'news.bbc.co.uk', 'www.cnn.com', 'www.theguardian.com', 'www.aljazeera.com',
         'www.theonion.com', 'www.bloomberg.com', 'www.reuters.com', 'www.nytimes.com',
         'www.liberation.fr', 'www.lemonde.fr', 'www.tdg.ch', 'www.24heures.ch']

def get_site_size(url):
    return len(urllib2.urlopen('http://%s' % url).read())

def print_results(total, details):
    print 'Total size is %s' % total
    for r in sorted(details, key=lambda x: x[1]):
        print '%s - %s' % r
A serial solution
To achieve this in Python is straightforward enough using the code below.
def serial():
    res = [(url, get_site_size(url)) for url in sites]
    total = sum([x[1] for x in res])
    print_results(total, res)
The use of list comprehensions to iterate across lists and build results is compact and quite typical. However, on my set-up it takes around 10 seconds, and it doesn’t lend itself to parallelisation. The output is below.
Total size is 3616211
www.google.com - 20162
www.wikipedia.org - 54537
www.bing.com - 74890
www.theonion.com - 90314
www.cnn.com - 104371
www.reuters.com - 106289
www.aljazeera.com - 158480
www.24heures.ch - 169361
www.nytimes.com - 176633
www.tdg.ch - 181578
news.bbc.co.uk - 209822
www.liberation.fr - 276102
www.lemonde.fr - 395189
www.yahoo.com - 416647
www.theguardian.com - 559838
www.bloomberg.com - 621998
Parallel processing in Python
Parallel processing in Python is constrained by the Global Interpreter Lock (GIL), which prevents Python bytecode from executing in parallel within a single process. However, parallelisation is still possible in two respects.
First, although Python bytecode cannot be executed in parallel, blocking operations such as IO release the GIL while they wait, so threads can overlap that waiting time.
Secondly, while parallelisation of bytecode isn’t possible across threads in the same Python process, it can be achieved by launching multiple Python processes, each with its own interpreter and GIL.
The nice thing is how Python can take care of the mechanics of the parallelisation for you.
A functional refactor
Before introducing parallel execution we apply a small refactor to the code. Recall that the initial implementation looked like this.
def serial():
    res = [(url, get_site_size(url)) for url in sites]
    total = sum([x[1] for x in res])
We can change this to move away from list comprehensions and instead use the built-in functions map, reduce and zip, which take their inspiration from functional programming languages.
def functional():
    res = zip(sites, map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)
The important change here is to introduce the map function. This performs a very similar role to the comprehension in that it iterates across a set of inputs and performs an operation on them, returning the results as a list.
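As a minimal illustration of that equivalence, with toy data rather than anything from this post, the comprehension and map below produce the same list:

nums = [1, 2, 3, 4]
squares_comp = [n * n for n in nums]       # list comprehension
squares_map = map(lambda n: n * n, nums)   # map returns a list in Python 2
assert squares_comp == squares_map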
The results are now combined with the URLs into (url, size) tuples using the zip function, rather than in place via the comprehension.
The totals are summed using the reduce function. There’s no real reason for doing this other than that it’s a more functional way of doing the same thing as the original sum call. Note that despite the linguistic similarity, and the parallelisation opportunities, this is not the same as the MapReduce model, although functional map and reduce were an influence on it.
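To make the reduce step concrete, here is how that lambda accumulates a total from (url, size) tuples; the values are made up purely for illustration:

pairs = [('a.example', 10), ('b.example', 20), ('c.example', 30)]
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)
# The accumulator runs 0, then 10, 30 and finally 60, picking the size out of each tuple
assert total == 60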
The performance of these two methods is much the same.
Introducing parallelisation
To introduce parallel processing we can now change the code to be as below. All we’ve done is create a multiprocessing pool and use its map method instead of the built-in map function.
from multiprocessing import Pool

def parallel_process():
    pool = Pool()
    res = zip(sites, pool.map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)
For a small change this does an awful lot under the covers. The Pool represents a set of Python processes that can be used to execute code in parallel. By default it creates one process per CPU on the host machine, four in the case of my system. Now when we call map on the pool, the framework takes care of farming the work out to the sub-processes and collecting the results.
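If the default of one worker per CPU isn’t what you want, the pool size can be set explicitly; this snippet is illustrative rather than part of the original code:

from multiprocessing import Pool, cpu_count

print 'CPUs detected: %d' % cpu_count()   # Pool() starts this many workers by default
pool = Pool(processes=8)                  # or choose the worker count yourself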
With a very small change the parallel processing can be implemented using threads rather than sub-processes.
from multiprocessing.pool import ThreadPool

def parallel_thread():
    pool = ThreadPool()
    res = zip(sites, pool.map(get_site_size, sites))
    total = reduce(lambda x, y: x + y[1], res, 0)
Again, the framework takes care of all the thread management behind the scenes.
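One housekeeping detail these snippets gloss over: in a longer-running program it’s good practice to shut a pool down once you’re finished with it. This is standard Pool usage rather than something the code above does, reusing the names from the earlier snippets:

pool = ThreadPool()
try:
    res = zip(sites, pool.map(get_site_size, sites))
finally:
    pool.close()   # stop accepting new work
    pool.join()    # wait for the workers to finish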
Results
Since this test depends on download times from external websites there is inevitably quite a bit of variation in the timings (a sketch of a simple timing harness follows the list). However, in general:
- The serial execution was 2 to 3 times slower than parallel execution.
- There was not much difference in the parallel process and parallel threads performance.
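I haven’t shown the timing harness itself; a minimal sketch using timeit over the four functions defined earlier might look like this (the timeit usage here is my assumption):

import timeit

for name in ('serial', 'functional', 'parallel_process', 'parallel_thread'):
    # One pass per version is enough, since each call downloads every site
    secs = timeit.timeit('%s()' % name,
                         setup='from __main__ import %s' % name,
                         number=1)
    print '%s: %.1fs' % (name, secs)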
The similarity in sub-process and threading performance suggests, as expected, that the workload is IO bound, so GIL contention is less of an issue. Just to prove the point I replaced the get_site_size() call with sum(xrange(100000000)). Sure enough, now that the processing was CPU bound, the sub-process implementation was much quicker than the multi-threaded one, by about a factor of three.
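A rough sketch of that CPU-bound comparison is below; the cpu_work wrapper and the timing code are illustrative, while sum(xrange(100000000)) is the workload mentioned above:

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import time

def cpu_work(_):
    # Pure-Python arithmetic that holds the GIL for the whole computation
    return sum(xrange(100000000))

def time_pool(pool_cls):
    pool = pool_cls()
    start = time.time()
    pool.map(cpu_work, range(4))   # one task per worker on a 4-core machine
    pool.close()
    pool.join()
    return time.time() - start

if __name__ == '__main__':
    print 'Processes: %.1fs' % time_pool(Pool)
    print 'Threads:   %.1fs' % time_pool(ThreadPool)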
Conclusion
It’s no surprise that the performance of network IO bound processing can be improved through parallelisation. What’s impressive is how easily this can be done in Python, especially if the initial implementation makes use of the map function.