
Upsampling time series data
In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.
Let's start with hourly data for a single day:
>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00    30
2015-04-29 09:00:00    27
2015-04-29 10:00:00    54
2015-04-29 11:00:00     9
2015-04-29 12:00:00    48
Freq: H, dtype: int64
If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:
>>> ts.resample('15min').asfreq().head()
2015-04-29 08:00:00    30.0
2015-04-29 08:15:00     NaN
2015-04-29 08:30:00     NaN
2015-04-29 08:45:00     NaN
2015-04-29 09:00:00    27.0
Freq: 15T, dtype: float64
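To make the effect of upsampling concrete, here is a minimal sketch that counts the new sample points. It assumes a recent Pandas version, where resample returns a resampler object and asfreq() yields the unfilled result. The fixed values and the name upsampled are illustrative only.

```python
import pandas as pd

# Ten hourly readings with fixed (made-up) values, so the result is reproducible.
rng = pd.date_range("2015-04-29 08:00", periods=10, freq=pd.Timedelta(hours=1))
ts = pd.Series(range(0, 100, 10), index=rng)

# Upsample to 15-minute steps without filling anything in.
upsampled = ts.resample("15min").asfreq()

# 9 one-hour gaps * 4 quarter-hours + 1 closing point = 37 sample points,
# of which only the original 10 carry a measurement.
print(len(ts), len(upsampled))
print(int(upsampled.isna().sum()))
```

Because NaN is a floating-point value, the upsampled series is stored as float64 even though the original data were integers.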
There are various ways to deal with missing values, which are selected by calling a fill method on the result of resample. Values can be filled either forward or backward:
>>> ts.resample('15min').ffill().head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    30
2015-04-29 08:30:00    30
2015-04-29 08:45:00    30
2015-04-29 09:00:00    27
Freq: 15T, dtype: int64
>>> ts.resample('15min').bfill().head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    27
2015-04-29 08:30:00    27
2015-04-29 08:45:00    27
2015-04-29 09:00:00    27
Freq: 15T, dtype: int64
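The difference between the two directions is easiest to see side by side. The following sketch assumes a recent Pandas version, in which filling is done with the ffill() and bfill() methods of the resampler. The three sample values and the name filled are illustrative.

```python
import pandas as pd

# Three hourly readings (made-up values).
rng = pd.date_range("2015-04-29 08:00", periods=3, freq=pd.Timedelta(hours=1))
ts = pd.Series([30, 27, 54], index=rng)

# Forward fill repeats the last known value; backward fill uses the next one.
filled = pd.DataFrame({
    "ffill": ts.resample("15min").ffill(),
    "bfill": ts.resample("15min").bfill(),
})
print(filled.head(5))
```

Between 08:00 and 09:00, the ffill column keeps the 08:00 reading, while the bfill column already shows the 09:00 reading.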
The limit parameter caps the number of consecutive missing values that are filled:
>>> ts.resample('15min').ffill(limit=2).head()
2015-04-29 08:00:00    30.0
2015-04-29 08:15:00    30.0
2015-04-29 08:30:00    30.0
2015-04-29 08:45:00     NaN
2015-04-29 09:00:00    27.0
Freq: 15T, dtype: float64
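As a small check of how limit behaves, the sketch below counts what remains unfilled. It assumes a recent Pandas version with the ffill(limit=...) method on the resampler. The data and the name limited are illustrative.

```python
import pandas as pd

# Three hourly readings (made-up values) leave two one-hour gaps.
rng = pd.date_range("2015-04-29 08:00", periods=3, freq=pd.Timedelta(hours=1))
ts = pd.Series([30, 27, 54], index=rng)

# Each gap holds three missing quarter-hours; limit=2 fills only the first two.
limited = ts.resample("15min").ffill(limit=2)

print(limited.tolist())
print(int(limited.isna().sum()))  # one NaN is left in each gap
```

With three missing points per hour and a limit of two, the entry at 45 minutes past each hour stays NaN.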
If you want to adjust the labels after resampling, you can add an offset to the index of the result (the loffset keyword argument that older Pandas versions offered for this has been removed):
>>> tsl = ts.resample('15min').ffill(limit=2)
>>> tsl.index = tsl.index + pd.Timedelta('5min')
>>> tsl.head()
2015-04-29 08:05:00    30.0
2015-04-29 08:20:00    30.0
2015-04-29 08:35:00    30.0
2015-04-29 08:50:00     NaN
2015-04-29 09:05:00    27.0
Freq: 15T, dtype: float64
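Shifting the labels changes only the timestamps, never the values. A minimal sketch of this, assuming a recent Pandas version where the offset is added to the index directly. The names filled and shifted are illustrative.

```python
import pandas as pd

# Three hourly readings (made-up values).
rng = pd.date_range("2015-04-29 08:00", periods=3, freq=pd.Timedelta(hours=1))
ts = pd.Series([30, 27, 54], index=rng)

filled = ts.resample("15min").ffill()

# Move every label five minutes later; the data stay where they are.
shifted = filled.copy()
shifted.index = filled.index + pd.Timedelta(minutes=5)

print(shifted.head())
```

Working on a copy keeps the unshifted series available for comparison.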
There is another way to fill in missing values. Instead of copying a neighboring value, we could employ an algorithm that constructs new data points to fit the existing ones according to some chosen model. This process is called interpolation.
We can ask Pandas to interpolate a time series for us:
>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00    30.00
2015-04-29 08:15:00    29.25
2015-04-29 08:30:00    28.50
2015-04-29 08:45:00    27.75
2015-04-29 09:00:00    27.00
Freq: 15T, dtype: float64
This is the default interpolate method, linear interpolation, in action: Pandas assumes a linear relationship between two existing points and places the new values on the straight line connecting them.
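The linear rule can be reproduced by hand. For two known values y0 and y1 with four equal steps between them, the k-th point is y0 + (y1 - y0) * k / 4. The sketch below compares that formula with the Pandas result. The two readings and the names interp and by_hand are illustrative.

```python
import pandas as pd

# Two hourly readings (made-up values) with nothing measured in between.
rng = pd.date_range("2015-04-29 08:00", periods=2, freq=pd.Timedelta(hours=1))
ts = pd.Series([30.0, 27.0], index=rng)

# Let Pandas interpolate linearly on a 15-minute grid.
interp = ts.resample("15min").interpolate()

# The same values from the straight-line formula, step k of 4.
y0, y1 = 30.0, 27.0
by_hand = [y0 + (y1 - y0) * k / 4 for k in range(5)]

print(interp.tolist())
print(by_hand)
```

Both lists contain 30.0, 29.25, 28.5, 27.75, and 27.0.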
Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
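As a starting point for such exploration, the following sketch contrasts two methods that do not need scipy: 'linear', which treats the points as equally spaced, and 'time', which weights by the elapsed time between index labels. The irregular timestamps and the names linear and by_time are illustrative.

```python
import pandas as pd

# Irregularly spaced timestamps: the missing point sits 15 minutes after the
# first reading and 45 minutes before the second (made-up values).
idx = pd.to_datetime(["2015-04-29 08:00", "2015-04-29 08:15", "2015-04-29 09:00"])
s = pd.Series([30.0, float("nan"), 27.0], index=idx)

linear = s.interpolate(method="linear")  # ignores the index: halfway between
by_time = s.interpolate(method="time")   # a quarter of the way in time

print(linear.iloc[1])
print(by_time.iloc[1])
```

On a regular grid the two methods agree. They differ only when the spacing between timestamps varies, as it does here (28.5 versus 29.25).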