When to Use Forward Filling (ffill) for Missing Values in pandas Resampling
Resampling in pandas
Resampling in pandas involves changing the frequency of a time series (typically a Series or DataFrame with a DatetimeIndex). This allows you to aggregate or manipulate data at different time intervals (e.g., daily to monthly, hourly to minutely).
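As a minimal sketch of the two directions of resampling, the snippet below uses a small made-up daily series (the names daily, two_day_sum, and half_day are illustrative only). Downsampling combines values with an aggregation, while upsampling creates new time slots that start out as NaN:

```python
import pandas as pd

# Hypothetical daily series used only for illustration
daily = pd.Series(
    [1, 2, 3, 4],
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

# Downsampling: fewer, coarser time slots; an aggregation combines the values
two_day_sum = daily.resample("2D").sum()

# Upsampling: more, finer time slots; asfreq() leaves the new slots as NaN
half_day = daily.resample(pd.Timedelta(hours=12)).asfreq()

print(two_day_sum)
print(half_day)
```

The NaN slots produced by upsampling are the ones that a fill method such as ffill is meant to populate.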
pandas.core.resample.Resampler.ffill
The ffill method, also known as forward filling, fills the time slots that have no observation after resampling. Each slot in the new index takes the most recent original observation at or before its timestamp:
For upsampling (increasing the number of data points):
- It fills the newly created time periods with the value from the last observation before the missing period.
For downsampling (reducing the number of data points):
- Each new, coarser timestamp takes the value of the most recent observation at or before it. No aggregation is performed, and NaN values already present in the original data are not replaced.
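The downsampling case is easy to misread, so here is a small sketch with made-up 6-hourly readings (readings, daily_ffill, and daily_last are illustrative names). It contrasts ffill with an aggregation such as last():

```python
import pandas as pd

# Hypothetical readings every 6 hours over two days (values 0..7)
index = pd.date_range("2023-01-01", periods=8, freq=pd.Timedelta(hours=6))
readings = pd.Series(range(8), index=index)

# ffill on a coarser grid: each day takes the observation at or before
# its own timestamp (midnight), so nothing later in the day is used
daily_ffill = readings.resample("D").ffill()

# For comparison, an aggregation such as last() looks at the whole day
daily_last = readings.resample("D").last()

print(daily_ffill.tolist())  # [0, 4]
print(daily_last.tolist())   # [3, 7]
```

If the goal of downsampling is to summarize each period, an aggregation is usually the better fit; ffill only samples the series at the new timestamps.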
Example
import pandas as pd
# Create a sample time series with no observation in April
data = {'date': ['2023-01-31', '2023-02-28', '2023-03-31', '2023-05-31'],
        'value': [1.0, 2.0, 3.0, 5.0]}
df = pd.DataFrame(data)
# resample() requires a DatetimeIndex, so parse the date strings first
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# Resample to month-end frequency using forward fill
# ('M' is the month-end alias; pandas 2.2+ spells it 'ME')
resampled_df = df.resample('M').ffill()
print(resampled_df)
This code will output:
            value
date
2023-01-31    1.0
2023-02-28    2.0
2023-03-31    3.0
2023-04-30    3.0  # Forward-filled from the previous observation (3.0)
2023-05-31    5.0
As you can see:
- April has no observation in the original data, so its newly created period is filled with 3.0 (the last observation before the missing period).
- Every other month already has an observation, so its value is carried over unchanged.
Important Points
- ffill is just one of several methods for handling missing values in resampled data. Other options include fillna (with strategies such as 'bfill' for backward fill), interpolation, or custom functions.
- ffill only fills the time slots introduced during resampling. Missing values that already exist in the original data remain unchanged.
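The second point can be checked with a short sketch. The series below is made up for illustration and already contains a NaN; the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical series whose middle observation is already missing
s = pd.Series(
    [1.0, np.nan, 3.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-05"]),
)

# Resampler.ffill fills the new daily slots based on the index only,
# so the NaN recorded on 2023-01-03 is kept and carried into 2023-01-04
upsampled = s.resample("D").ffill()

# A second, value-based ffill on the result fills the remaining NaN values
fully_filled = upsampled.ffill()

print(upsampled)
print(fully_filled)
```

If the original NaN values should also be filled, one option is to chain a regular Series or DataFrame ffill after the resampling step, as in the last line of the sketch.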
Forward Filling with Upsampling (Increasing Frequency)
This code shows how ffill works when upsampling to a higher frequency:
import pandas as pd
data = {'date': ['2023-01-01', '2023-01-15', '2023-01-30'], 'value': [10, 20, 30]}
df = pd.DataFrame(data)
# resample() requires a DatetimeIndex, so parse the date strings first
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# Resample to daily frequency using forward fill
resampled_df = df.resample('D').ffill()
print(resampled_df)
This will output (abridged):
            value
date
2023-01-01     10
2023-01-02     10  # Forward-filled from 2023-01-01
2023-01-03     10  # Forward-filled from 2023-01-01
2023-01-04     10  # Forward-filled from 2023-01-01
2023-01-05     10  # Forward-filled from 2023-01-01
...
2023-01-14     10  # Forward-filled from 2023-01-01
2023-01-15     20
2023-01-16     20  # Forward-filled from 2023-01-15
...
2023-01-29     20  # Forward-filled from 2023-01-15
2023-01-30     30
As you can see, the missing days are filled with the value from the last valid observation before the missing period.
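Carrying a value across a long gap can overstate how current it is. As a hedged sketch, the limit argument of ffill caps how many new slots each observation is carried into; the data and the name limited below are made up for illustration:

```python
import pandas as pd

# Hypothetical data: two observations five days apart
df = pd.DataFrame(
    {"value": [10, 20]},
    index=pd.to_datetime(["2023-01-01", "2023-01-06"]),
)

# Carry each observation forward into at most 2 new daily slots;
# slots further away from the last observation stay NaN
limited = df.resample("D").ffill(limit=2)
print(limited)
```

Here 2023-01-02 and 2023-01-03 receive the value 10, while 2023-01-04 and 2023-01-05 remain NaN.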
Forward Filling with Backfilling (Combining ffill and bfill)
You can combine ffill with bfill (backward filling) to create a hybrid approach. This helps when the original data already contains NaN values, which ffill carries forward rather than replaces:
import pandas as pd
data = {'date': ['2023-01-01', '2023-01-04', '2023-01-15', '2023-01-16', '2023-01-30'],
        'value': [10, None, 20, None, 30]}
df = pd.DataFrame(data)
# resample() requires a DatetimeIndex, so parse the date strings first
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# Resample to daily frequency, first filling forward then backward
resampled_df = df.resample('D').ffill().bfill()
print(resampled_df)
This will output (abridged):
            value
date
2023-01-01   10.0
2023-01-02   10.0  # Forward-filled from 2023-01-01
2023-01-03   10.0  # Forward-filled from 2023-01-01
2023-01-04   20.0  # Back-filled from the next valid day
2023-01-05   20.0  # Back-filled from the next valid day
...
2023-01-14   20.0  # Back-filled from the next valid day
2023-01-15   20.0
2023-01-16   30.0  # Back-filled from the next valid day
...
2023-01-29   30.0  # Back-filled from the next valid day
2023-01-30   30.0
Here, ffill first fills the new daily slots with the most recent observation, which leaves NaN wherever that observation was itself missing (2023-01-04 and 2023-01-16 onward). bfill then replaces those remaining NaN values with the next valid value.
Forward Filling with Custom Function
You can also define your own function to handle missing values during resampling:
import pandas as pd
def my_fill_function(frame):
    # Custom logic: fill each missing value with the average of the
    # previous and next valid observations
    return frame.fillna((frame.ffill() + frame.bfill()) / 2)
data = {'date': ['2023-01-01', '2023-01-10', None, '2023-01-20'], 'value': [10, 20
fillna with Different Strategies
- fillna(method='ffill', limit=n): The method argument selects a built-in strategy (such as 'ffill', 'bfill', or 'nearest'); it does not accept a custom function. Use limit to cap how many consecutive missing values are filled.
- fillna(method='nearest'): Fills missing values with the value from the nearest observation (either earlier or later) in the original data.
- fillna(method='bfill'): Uses backward fill, which replaces missing values with the value from the next observation in the original data.
Newer pandas releases favor the dedicated ffill(), bfill(), and nearest() methods over fillna(method=...).
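To make the difference between these strategies concrete, here is a small sketch on made-up data. It calls the dedicated Resampler methods ffill(), bfill(), and nearest() rather than fillna(method=...); the variable names are illustrative:

```python
import pandas as pd

# Hypothetical data: two observations three days apart
s = pd.Series(
    [10, 40],
    index=pd.to_datetime(["2023-01-01", "2023-01-04"]),
)

forward = s.resample("D").ffill()    # value of the previous observation
backward = s.resample("D").bfill()   # value of the next observation
nearest = s.resample("D").nearest()  # value of the closest observation in time

comparison = pd.DataFrame(
    {"ffill": forward, "bfill": backward, "nearest": nearest}
)
print(comparison)
```

For the two new days, ffill yields 10 and 10, bfill yields 40 and 40, and nearest yields 10 and 40.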
Interpolation
pandas provides various interpolation methods (linear, polynomial, spline) that can be used to estimate missing values based on surrounding valid data. You can access them through the interpolate method when resampling:
resampled_df = df.resample('D').interpolate('linear')
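A short sketch on made-up data shows how linear interpolation differs from forward filling (stepped and linear are illustrative names). The observations here fall exactly on the daily grid, which keeps the comparison simple:

```python
import pandas as pd

# Hypothetical data: two observations four days apart
s = pd.Series(
    [10.0, 30.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-05"]),
)

# Forward fill holds the last value until the next observation
stepped = s.resample("D").ffill()

# Linear interpolation moves evenly from one observation to the next
linear = s.resample("D").interpolate("linear")

print(pd.DataFrame({"ffill": stepped, "linear": linear}))
```

Forward fill keeps 10.0 until the jump to 30.0, while interpolation produces evenly spaced steps between the two observations.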
Custom Functions
You can define your own logic for handling missing values within a function:
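def my_fill_function(frame):
    # Custom logic: fill each missing value with the average of the
    # previous and next valid observations
    return frame.fillna((frame.ffill() + frame.bfill()) / 2)
# Upsample first (new rows are NaN), then apply the custom fill to the result;
# calling apply() on the resampler would run the function once per daily bin
resampled_df = df.resample('D').asfreq().pipe(my_fill_function)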
Choosing the Right Alternative
The best alternative depends on the specific characteristics of your data and the desired outcome:
- Use custom functions for complex logic or domain-specific knowledge.
- Use interpolation if you have a reasonable assumption about the underlying trend.
- Use bfill if each observation describes the period leading up to it, so that earlier gaps should take the next known value.