pandas join by time range
Usually we need to join two tables based on time range.
One table has timestamp, the second table has time_range_start and time_range_end, so we want to join two tables based on timestamp between range_start and range_end.
However this kind of joining on '<=' or '>=' is not supported in pandas at the moment. A work around is to full outer join the two tables and do filtering based on '<=' and '>=' conditions.
Pandas has now provide a cool merge function: pandas.merge_asof
pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')[source]
Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
The backward join direction simulates joining two time series where one timestamp must fall within the lag and current of other timestamp.
The forward join direction similarly simulates joining two time series where one timestamp must fall within the current and lead of other timestamp.
This also depends on which dataset is left or right.
The nearest is also useful when it assign rows based on the closest distance.
Range join example:
import pandas as pd
df1 = pd.DataFrame([['A', 1], ['B', 5], ['C', 10]], columns=['name', 'time'])
df2 = pd.DataFrame([['range1', 0, 3], ['range2', 6, 11]], columns=['range', 'start', 'end'])
print(df1)
print(df2)
#compare time with range start and range end separately
df = pd.merge_asof(df1, df2, left_on='time', right_on='start', direction='backward')
df = pd.merge_asof(df, df2, left_on='time', right_on='end', direction='forward')
#if time is within the range, then range_x must be the same as range_y
df = df.loc[df['range_x'] == df['range_y']][['name', 'time', 'range_x', 'start_x', 'end_y']]
print(df)
name time
0 A 1
1 B 5
2 C 10
range start end
0 range1 0 3
1 range2 6 11
name time range_x start_x end_y
0 A 1 range1 0 3
2 C 10 range2 6 11