I found that cleaning a time series stored in a Python Pandas dataframe was taking longer than expected. With a few alterations to the way I was manipulating the data I was able to make it run around 25 times faster.
Introduction
I’ve embarked on re-writing in Python the trend analysis I previously implemented in R. The standard way of manipulating a time series is with a Pandas dataframe that has a date index; indeed, this is what you receive if you download a futures contract from the time-series source Quandl.
On loading historical time-series data the first thing you are likely to do is perform a bit of a clean. To replicate what I did in R I needed to perform the following:
- Consolidate the “Open Interest” column; it is sometimes delivered as the day’s open interest and sometimes as the previous day’s (see the sketch after this list).
- Remove the columns I didn’t need, such as “Volume” and “Last”, which are often not populated anyway.
- Clean the “High” and “Low” columns. While the settle and open prices seem to be pretty reliable, the high and low could sometimes be empty or zero. In this event I want to set them to the min/max of the open and settle.
- Add “Next Open” and “Previous Settle” columns which will be used later in the trend algorithm.
- Strip away any rows with an invalid settle price, or with no open interest.
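To make the first point concrete, here is a toy sketch (made-up numbers, not real Quandl data) of why a “Prev. Day Open Interest” column needs shifting back a row to line up with the date it actually describes:

```python
import pandas as pd

# The value reported on each date is the *previous* day's open interest,
# so shifting back by one row aligns it with the day it really refers to.
oi = pd.Series([100, 110, 120],
               index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06']),
               name='Prev. Day Open Interest')
oi.shift(-1)
# 2010-01-04    110.0   <- open interest for the 4th, reported on the 5th
# 2010-01-05    120.0
# 2010-01-06      NaN   <- filled with 0 in the cleaning code below
```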
First steps
This is the code I wrote to munge the data; it proceeds through the steps just described.
```python
import numpy as np
import pandas as pd

def clean_series_orig(series):
    clean = series.copy()

    # Create consistent open interest column
    if 'Prev. Day Open Interest' in clean.columns:
        clean['OpenInterest'] = clean['Prev. Day Open Interest'].shift(-1).fillna(0)
    else:
        clean.rename(columns={'Open Interest': 'OpenInterest'}, inplace=True)

    # Limit to columns we want.
    clean = clean[['Open', 'High', 'Low', 'Settle', 'OpenInterest']]

    # Deal with missing high and low prices.
    clean['High'] = clean.apply(lambda row: row['High']
                                if np.isfinite(row['High']) and row['High'] > 0.
                                else max(row['Open'], row['Settle']), axis=1)
    clean['Low'] = clean.apply(lambda row: row['Low']
                               if np.isfinite(row['Low']) and row['Low'] > 0.
                               else min(row['Open'], row['Settle']), axis=1)

    # Add a couple of columns we'll need later.
    clean['NextOpen'] = clean['Open'].shift(-1)
    clean['PreviousSettle'] = series['Settle'].shift(1)

    # Remove rows we're not interested in.
    clean = clean[(np.isfinite(clean.OpenInterest)) & (np.isfinite(clean.Settle)) &
                  (clean.OpenInterest > 0) & (clean.Settle > 0)]

    return clean
```
I invoked the code by downloading a futures series from Quandl and using the IPython Notebook %timeit magic command to see how long it took.
```python
import Quandl as ql

series = ql.get('CME/EDH2010')  # Mar 2010 Euro-dollar
%timeit clean = clean_series_orig(series)
```
The result was that it took around 100 ms on my system. This isn’t long enough to notice on a single instrument, but most futures have four or more expiries a year, so 40 years of history could take around 15 seconds. I’m looking at around 40 series, so it would take about 10 minutes to clean the lot. Probably not a big practical problem, as you would cache the result after doing it once, but I thought I’d look into whether it could be done quicker.
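For concreteness, the back-of-envelope arithmetic behind that estimate looks like this (figures taken from the paragraph above):

```python
# Figures from the paragraph above.
ms_per_series = 100                      # ~100 ms to clean one expiry series
expiries = 4 * 40                        # 4+ expiries a year over 40 years of history
futures = 40                             # number of futures I'm interested in

per_future = ms_per_series * expiries / 1000.   # ~16 seconds per future
total = per_future * futures / 60.              # ~10.7 minutes for the lot
```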
Bearing fruit
It didn’t take much investigation to realise that the two apply lines in the original code were responsible for a lot of the slow-down. I read that performance could be improved by treating rows as raw NumPy arrays, meaning you lose the key look-up and have to use integer indexes. So I made the replacement below; note the raw=True flags.
```python
# Deal with missing high and low prices.
# With raw=True each row is a plain ndarray, so the columns selected earlier
# are addressed by position: 0=Open, 1=High, 2=Low, 3=Settle.
clean['High'] = clean.apply(lambda row: row[1]
                            if np.isfinite(row[1]) and row[1] > 0.
                            else max(row[0], row[3]), axis=1, raw=True)
clean['Low'] = clean.apply(lambda row: row[2]
                           if np.isfinite(row[2]) and row[2] > 0.
                           else min(row[0], row[3]), axis=1, raw=True)
```
This did indeed speed things up a lot; it was now down to around 45 ms. However, I was still using apply, which isn’t a vectorised operation: it loops through the rows, so it is best avoided. I therefore re-phrased the code as below.
It’s probably clear what the “High” operation is doing, but the “Low” needs a bit more explanation. First it replaces any zero values with NaN, as zeros were the only bad values I needed to worry about. It then makes use of fmin(), which takes the element-wise minimum of two values, favouring the non-NaN value when the other isn’t defined. So we take the existing “Low” if it is valid, or the lowest of the “Open” and “Settle” otherwise.
```python
clean['High'] = clean[['High', 'Open', 'Settle']].max(axis=1)
clean['Low'] = np.fmin(clean[['Open', 'Settle']].min(axis=1),
                       clean['Low'].replace(0., np.nan))
```
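As a quick aside, here is a toy illustration (not from the original analysis) of the fmin() behaviour being relied on:

```python
import numpy as np

# fmin() returns the element-wise minimum, preferring the non-NaN value
# whenever one of the two inputs is NaN.
np.fmin(np.array([1., np.nan, 3.]), np.array([2., 5., np.nan]))
# -> array([1., 5., 3.])
```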
This made another small step forward; the time was now down to 40 ms.
What a load of garbage
The next thing I did was run another IPython Notebook magic command, %prun, which profiles the code and shows where the time is being spent.
```python
%prun -l 5 clean = clean_series_orig(series)
```
```
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.054    0.054    0.054    0.054 {gc.collect}
    1505    0.001    0.000    0.001    0.000 {isinstance}
      35    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       9    0.000    0.000    0.001    0.000 internals.py:2611(_rebuild_blknos_and_blklocs)
 107/106    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
```
Somewhat surprisingly, the results showed that the bulk of the time was being taken by garbage collection. After a bit of experimentation it transpired this was caused by those same two lines. I replaced them with the code below.
```python
# Deal with missing high and low prices.
clean = pd.DataFrame({'Open': clean['Open'],
                      'High': clean[['High', 'Open', 'Settle']].max(axis=1),
                      'Low': np.fmin(clean[['Open', 'Settle']].min(axis=1),
                                     clean['Low'].replace(0., np.nan)),
                      'Settle': clean['Settle'],
                      'OpenInterest': clean['OpenInterest']})
```
So now I’m replacing the entire dataframe with a new one, rather than updating individual columns one at a time. I’m performing the same operations as before on the High and Low columns, but effectively doing no-ops on the others. This gave a very dramatic improvement: the time was now down to only 8 ms, and the garbage collection call was nowhere to be seen in the profiler output.
Following this logic further, I modified the code so that dealing with the open interest, removing the unwanted columns and adding the new ones were all done in a single statement, along with the High and Low corrections. Furthermore, this removed the need to make an initial copy of the dataframe.
```python
clean = pd.DataFrame({'Open': series['Open'],
                      'High': series[['High', 'Open', 'Settle']].max(axis=1),
                      'Low': np.fmin(series[['Open', 'Settle']].min(axis=1),
                                     series['Low'].replace(0., np.nan)),
                      'Settle': series['Settle'],
                      'OpenInterest': series['Prev. Day Open Interest'].shift(-1).fillna(0.)
                                      if 'Prev. Day Open Interest' in series.columns
                                      else series['Open Interest'],
                      'NextOpen': series['Open'].shift(-1),
                      'PreviousSettle': series['Settle'].shift(1)})
```
As hoped, this brought a further improvement, now down to 5 ms.
One last push
The only remaining code was the filtering of rows with no settle price or open interest, which looked like this.
```python
# Remove rows we're not interested in.
clean = clean[(np.isfinite(clean.OpenInterest)) & (np.isfinite(clean.Settle)) &
              (clean.OpenInterest > 0) & (clean.Settle > 0)]
```
Using similar logic to before, this was replaced with the code below, trimming off another millisecond.
```python
clean = clean[(np.isfinite(clean.OpenInterest.replace(0., np.nan))) &
              (np.isfinite(clean.Settle.replace(0., np.nan)))]
```
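A toy illustration (not from the original post) of why the single replace(0., np.nan) and isfinite() test covers both the missing and the zero cases:

```python
import numpy as np
import pandas as pd

s = pd.Series([100., 0., np.nan, 250.])
np.isfinite(s.replace(0., np.nan))
# -> [True, False, False, True]: zeros and NaNs are both rejected in one pass
```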
I stumbled across another slight improvement: replacing the logical and operator above with the np.logical_and function brought a further gain, though only of about a quarter of a millisecond. So the code was now running in just under 4 ms, rather than the 100 ms of the original.
```python
clean = clean[np.logical_and(np.isfinite(clean.OpenInterest.replace(0., np.nan)),
                             np.isfinite(clean.Settle.replace(0., np.nan)))]
```
Final code
The final code is not only faster but also quite clear and compact. It also scales better than the original: it is about 25 times faster on 5 years of data, but nearer 35 times faster over 10 years of data.
```python
def clean_series(series):
    clean = pd.DataFrame({'Open': series['Open'],
                          'High': series[['High', 'Open', 'Settle']].max(axis=1),
                          'Low': np.fmin(series[['Open', 'Settle']].min(axis=1),
                                         series['Low'].replace(0., np.nan)),
                          'Settle': series['Settle'],
                          'OpenInterest': series['Prev. Day Open Interest'].shift(-1).fillna(0.)
                                          if 'Prev. Day Open Interest' in series.columns
                                          else series['Open Interest'],
                          'NextOpen': series['Open'].shift(-1),
                          'PreviousSettle': series['Settle'].shift(1)})

    clean = clean[np.logical_and(np.isfinite(clean.OpenInterest.replace(0., np.nan)),
                                 np.isfinite(clean.Settle.replace(0., np.nan)))]

    return clean
```
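Usage is the same as before; the timing figures will of course vary by machine, but on the same series as above:

```python
import Quandl as ql   # as imported earlier

series = ql.get('CME/EDH2010')          # Mar 2010 Euro-dollar, as before
%timeit clean = clean_series(series)    # just under 4 ms here, vs ~100 ms originally
```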
Conclusion
It’s well advertised that vectorised operations will be faster in Pandas than using apply, so it’s no surprise that making changes in this area helped. More of a surprise was that replacing the dataframe wholesale, rather than updating individual columns, made a very big improvement by removing an expensive garbage collection operation. I’d need to know more about the internals of Pandas to understand why this is the case.
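If I wanted to dig into that, a minimal sketch of the comparison might look like the hypothetical micro-benchmark below (assuming the same series variable as above; the helper names are my own, not from the analysis):

```python
import numpy as np
import pandas as pd

def update_columns(df):
    # Assign the corrected High and Low back into a copy, one column at a time.
    out = df.copy()
    out['High'] = out[['High', 'Open', 'Settle']].max(axis=1)
    out['Low'] = np.fmin(out[['Open', 'Settle']].min(axis=1),
                         out['Low'].replace(0., np.nan))
    return out

def rebuild_frame(df):
    # Build a brand new dataframe in a single construction instead.
    return pd.DataFrame({'Open': df['Open'],
                         'High': df[['High', 'Open', 'Settle']].max(axis=1),
                         'Low': np.fmin(df[['Open', 'Settle']].min(axis=1),
                                        df['Low'].replace(0., np.nan)),
                         'Settle': df['Settle']})

%timeit update_columns(series)
%timeit rebuild_frame(series)
```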
This short exercise also underlined the power of the IPython Notebook as an interactive development platform; the magic commands in particular really helped guide the changes to make.