Get rid of your ineffective code for filtering time series data
Every time I work with time series data, I end up writing complex and non-reusable code to filter it. Whether I’m doing simple filtering techniques like removing weekends, or more complex ones like removing specific time windows, I always resort to writing a quick and dirty function that works for the specific thing that I’m filtering in the moment, but never again.
I finally decided to break that horrible cycle by writing a processor that lets me filter time series, no matter how complex the condition, using very simple and concise inputs.
Here’s an example of how it works in practice:
- On weekdays, I want to remove < 6 am and ≥ 8 pm, and on weekends I want to remove < 8 am and ≥ 10 pm
df = pl.DataFrame(
{"date": [
# -- may 24th is a Friday, weekday
'2024-05-24 00:00:00', # < 6 am, should remove
'2024-05-24 06:00:00', # not < 6 am, should keep
'2024-05-24 06:30:00', # not < 6 am, should keep
'2024-05-24 20:00:00', # >= 8 pm, should remove
# -- may 25th is a Saturday, weekend
'2024-05-25 00:00:00', # < 8 am, should remove
'2024-05-25 06:00:00', # < 8 am, should remove
'2024-05-25 06:30:00', # < 8 am, should remove
'2024-05-25 20:00:00', # not >= 10 pm, should keep
]
}
).with_columns(pl.col("date").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
- Without processor: expressive, but verbose and non-reusable
df.filter(
pl.Expr.not_(
(
(pl.col("date").dt.weekday() < 6)
.and_(
(pl.col("date").dt.hour() < 6)
.or_(pl.col("date").dt.hour() >= 20)
)
)
.or_(
(pl.col("date").dt.weekday() >= 6)
.and_(
(pl.col("date").dt.hour() < 8)
.or_(pl.col("date").dt.hour() >= 22)
)
)
)
)
- With processor: equally expressive, yet concise and reusable
processor = FilterDataBasedOnTime(
"date", time_patterns=[
"<6wd<6h",
"<6wd>=20h",
">=6wd<8h",
">=6wd>=22h",
]
)
processor.transform(df)
In this article I’ll explain how I came up with this solution, starting with the string format I chose to define filter conditions, followed by a design of the processor itself. Towards the end of the article, I’ll describe how this pipeline can be used alongside other pipelines to enable complex time series processing with only a few lines of code.
If you’re interested in the code only, skip to the end of the article for a link to the repository.
Expressive, Concise and Flexible Time Conditions?
This was by far the hardest part of this task. Filtering time series based on time is conceptually easy, but it’s much harder to express in code. My initial thought was to use a string pattern that felt most intuitive to me:
# -- remove values between 6 am (inclusive) and 2 pm (exclusive)
pattern = '>=06:00,<14:00'
However, with this we immediately run into a problem: we lose flexibility. This is because 06:00 is ambiguous; it could mean min:sec or hr:min. So we’d almost always have to define the date format a priori.
This prevents us from allowing complex filtering techniques, such as filtering a specific time ON specific days (e.g. only remove values in [6am, 2pm) on a Saturday).
Extending my pattern to something resembling cron would not help either:
# cronlike pattern
pattern = '>=X-X-X 06:00:X, <X-X-X 20:00:X'
The above can help with selecting specific months or years, but doesn’t allow flexibility of things like weekdays. Further, it is not very expressive with all the X’s and it’s really verbose.
I knew that I needed a pattern that allows chaining of individual time series components or units. Effectively something that is just like an if-statement:
- IF day == Saturday
- AND time ≥ 06:00
- AND time < 14:00
So then I thought: why not use a pattern where you can attach any condition to a time component, with the implicit assumption that they are all AND conditions?
# -- remove values in [6am, 2pm) on Saturday
pattern = 'day==6,time>=06:00,time<14:00'
Now we have a pattern that is expressive, however it can still be ambiguous, since time implicitly assumes a time format. So I decided to go further:
# -- remove values in [6am, 2pm) on Saturday
pattern = 'day==6,hour>=6,hour<14'
Now to make it less verbose, I borrowed the Polars duration string format (this is the equivalent of “frequency” if you are more familiar with Pandas), and voilà:
# -- remove values in [6am, 2pm) on Saturday
pattern = '==6wd>=6h<14h'
What About Time Conditions that Need the OR Operator?
Let’s consider a different condition: remove anything ≤ 6 am, as well as anything > 2 pm. A pattern like the one below would fail:
# -- remove values in (-inf, 6am], and (2pm, inf)
pattern = '<=6h>14h'
Since we’d read it as: ≤ 6 am AND > 2 pm
No such value exists that satisfies these two conditions!
But the solution to this is simple: apply AND conditions within a pattern, and apply OR conditions across different patterns. So:
# -- remove values in (-inf, 6am], and (2pm, inf)
patterns = ['<=6h', '>14h']
Would be read as: ≤ 6 am OR > 2pm
Why not allow OR statements within a pattern?
I did consider adding support for an OR statement within the pattern, e.g. using |, or alternatively letting , denote the difference between a “left” and a “right” condition. However, I found that these would add unnecessary complexity to the parsing, without making the code any more expressive.
I much prefer it simple: within a pattern we apply AND, across patterns we apply OR.
Edge Cases
There is one edge case worth discussing here: the “if-statement”-like pattern doesn’t always work.
Let’s consider filtering timestamps > 06:00. If we simply defined:
# -- pattern to remove values > 06:00
pattern = '>6h'
Then do we interpret this as:
- Remove all values where hour>6
- Or remove all values where time>06:00?
The latter makes more sense, but the current pattern doesn’t allow us to express that. So to explicitly state that we wish to capture all timestamps past the 6th hour of the day, we must add what I call the cascade operator:
# -- pattern to remove values > 06:00
pattern = '>6h*'
Which would be read as:
- hour > 6
- OR (hour == 6 AND any(minute, second, millisecond, etc… > 0))
Which would be an accurate condition to capture time>06:00!
The Code
Here I highlight the important design bits behind the processor for filtering time series data.
Parsing Logic
Since the pattern is quite simple, parsing it is really easy. All we need to do is loop over each pattern and keep track of the operator characters. What remains is then a list of operators, and a list of durations that they are applied to.
# -- code for parsing a time pattern, e.g. "==7d<7h"
pattern = pattern.replace(" ", "")
operator = ""
operators = []
duration_string = ""
duration_strings = []
for char in pattern:
if char in {">", "<", "=", "!"}:
operator += char
if duration_string:
duration_strings.append(duration_string)
duration_string = ""
else:
duration_string += char
if operator:
operators.append(operator)
operator = ""
duration_strings.append(duration_string)
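For reference, the same loop wrapped into a reusable function (the wrapper and its name are my own; the repo’s structure may differ):

```python
def parse_time_pattern(pattern: str) -> tuple[list[str], list[str]]:
    """Split a pattern like '==7d<7h' into its operators and duration strings."""
    pattern = pattern.replace(" ", "")
    operator, operators = "", []
    duration_string, duration_strings = "", []
    for char in pattern:
        if char in {">", "<", "=", "!"}:
            operator += char
            if duration_string:  # an operator char ends the previous duration
                duration_strings.append(duration_string)
                duration_string = ""
        else:
            duration_string += char
            if operator:  # a duration char ends the previous operator
                operators.append(operator)
                operator = ""
    duration_strings.append(duration_string)
    return operators, duration_strings

print(parse_time_pattern("==7d<7h"))  # (['==', '<'], ['7d', '7h'])
```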
Now for each operator and duration string, we can extract metadata that helps us make the actual boolean rules later on.
# -- code for extracting metadata from a parsed pattern
# -- mapping to convert each operator to the Polars method
OPERATOR_TO_POLARS_METHOD_MAPPING = {
"==": "eq",
"!=": "ne",
"<=": "le",
"<": "lt",
">": "gt",
">=": "ge",
}
operator_method = (
OPERATOR_TO_POLARS_METHOD_MAPPING[operator]
)
# -- identify cascade operations
if duration_string.endswith("*"):
duration_string = duration_string[:-1]
how = "cascade"
else:
how = "simple"
# -- extract a polars duration, e.g. 7d7h, into its components: [(7, "d"), (7, "h")]
polars_duration = PolarsDuration(duration=duration_string)
decomposed_duration = polars_duration.decomposed_duration
# -- ensure that cascade operator only applied to durations that accept it
if how == "cascade" and any(
unit not in POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING
for _, unit in decomposed_duration
):
raise ValueError(
(
"You requested a cascade condition on an invalid "
"duration. Durations supporting cascade: "
f"{list(POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING.keys())}"
)
)
rule_metadata = {
"operator": operator_method,
"decomposed_duration": decomposed_duration,
"how": how,
}
We now have, for each pattern, dictionaries describing how to define the rules for each of its components. So if we went for a complicated example:
pattern = '==1mo>6d6h' # remove if month == Jan, and day > 6, and hour > 6
# parsed pattern
[
[
{
"operator": "eq",
"decomposed_duration": [(1, "mo")],
"how": "simple"
},
{
"operator": "gt",
"decomposed_duration": [(6, "d"), (6, "h")],
"how": "simple"
}
]
]
Notice that a single pattern can be split into multiple metadata dicts because it can be composed of multiple durations and operations.
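Putting the parsing and metadata-extraction steps together, a self-contained sketch of get_rule_metadata_from_patterns (the name used in the next section) could look like the following. The implementation is my own minimal version, with a regex standing in for the PolarsDuration class:

```python
import re

OPERATOR_TO_POLARS_METHOD_MAPPING = {
    "==": "eq", "!=": "ne", "<=": "le", "<": "lt", ">": "gt", ">=": "ge",
}

def get_rule_metadata_from_patterns(patterns):
    """For each pattern, return a list of metadata dicts, one per condition."""
    parsed = []
    for pattern in patterns:
        pattern = pattern.replace(" ", "")
        # -- split the pattern into operators and duration strings
        operators, duration_strings = [], []
        operator, duration_string = "", ""
        for char in pattern:
            if char in {">", "<", "=", "!"}:
                operator += char
                if duration_string:
                    duration_strings.append(duration_string)
                    duration_string = ""
            else:
                duration_string += char
                if operator:
                    operators.append(operator)
                    operator = ""
        duration_strings.append(duration_string)

        # -- turn each (operator, duration) pair into a metadata dict
        conditions = []
        for operator, duration_string in zip(operators, duration_strings):
            how = "simple"
            if duration_string.endswith("*"):
                duration_string, how = duration_string[:-1], "cascade"
            # regex stand-in for PolarsDuration: '6d6h' -> [(6, 'd'), (6, 'h')]
            decomposed = [
                (int(value), unit)
                for value, unit in re.findall(r"(\d+)([a-z]+)", duration_string)
            ]
            conditions.append({
                "operator": OPERATOR_TO_POLARS_METHOD_MAPPING[operator],
                "decomposed_duration": decomposed,
                "how": how,
            })
        parsed.append(conditions)
    return parsed

print(get_rule_metadata_from_patterns(["==1mo>6d6h"]))
```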
Creating Rules from Metadata
Having created metadata for each pattern, now comes the fun part of creating Polars rules!
Remember that within each pattern we apply an AND condition, but across patterns we apply an OR condition. So in the simplest case, we need a wrapper that can take the list of all the metadata for a specific pattern and apply the AND condition to it. We can store this expression in a list alongside the expressions for all the other patterns, before applying the OR condition.
# -- dictionary mapping each unit to the polars method that extracts its value
UNIT_TO_POLARS_METHOD_MAPPING = {
"y": "year",
"mo": "month",
"d": "day",
"h": "hour",
"m": "minute",
"s": "second",
"ms": "millisecond",
"us": "microsecond",
"ns": "nanosecond",
"wd": "weekday",
}
patterns = ["==6d<6h6s"]
patterns_metadata = get_rule_metadata_from_patterns(patterns)
# -- create an expression for the rule pattern
pattern_metadata = patterns_metadata[0] # list of length two
# -- let's consider the condition for ==6d
condition = pattern_metadata[0]
decomposed_duration = condition["decomposed_duration"] # [(6, 'd')]
operator = condition["operator"] # eq
conditions = [
getattr( # apply the operator method, e.g. pl.col("date").dt.hour().eq(value)
getattr( # get the value of the unit, e.g. pl.col("date").dt.hour()
pl.col(time_column).dt,
UNIT_TO_POLARS_METHOD_MAPPING[unit],
)(),
operator,
)(value) for value, unit in decomposed_duration # for each unit separately
]
# -- finally, we aggregate the separate conditions using an AND condition
final_expression = conditions.pop()
for expression in conditions:
final_expression = getattr(final_expression, 'and_')(expression)
This looks complex… but we can convert bits of it into functions, and the final code ends up quite clean and readable:
rules = [] # list to store expressions for each time pattern
for rule_metadata in patterns_metadata:
rule_expressions = []
for condition in rule_metadata:
how = condition["how"]
decomposed_duration = condition["decomposed_duration"]
operator = condition["operator"]
if how == "simple":
expression = generate_polars_condition( # function to do the final combination of expressions
[
self._generate_simple_condition(
unit, value, operator
) # this is the complex "getattr" code
for value, unit in decomposed_duration
],
"and_",
)
rule_expressions.append(expression)
rule_expression = generate_polars_condition(
rule_expressions, "and_"
)
rules.append(rule_expression)
overall_rule_expression = generate_polars_condition(
rules, "or_"
).not_() # we must negate because we're filtering!
Creating Rules for the Cascade Operator
In the above code, I had an if condition only for the “simple” conditions… so how do we handle the cascade conditions?
Remember from our discussion above that a pattern of “>6h*” means:
hour > 6 OR (hour == 6 AND any(min, s, ms, etc… > 0))
So what we need is to know, for each unit, what the subsequent smaller units are.
E.g. if I had “>6d*”, I should know to include “hour” in my any condition, thus:
day > 6 OR (day == 6 AND any(hr, min, s, ms, etc… > 0))
This is easily achieved using a dictionary that maps each unit to its “next” smaller unit, e.g. day → hour, hour → minute, and so on:
POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING = {
"y": {"next": "mo", "start": 1},
"mo": {"next": "d", "start": 1},
"d": {"next": "h", "start": 0},
"wd": {"next": "h", "start": 0},
"h": {"next": "m", "start": 0},
"m": {"next": "s", "start": 0},
"s": {"next": "ms", "start": 0},
"ms": {"next": "us", "start": 0},
"us": {"next": "ns", "start": 0},
}
The start value is necessary because the any condition isn’t always > 0: if I want to filter any values > February, then 2023-02-02 should be a part of it, but not 2023-02-01.
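The walk down this dictionary can be distilled into a small helper (my own illustration, not the repo’s code) that lists every child unit below a given unit, paired with the start value its condition compares against:

```python
POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING = {
    "y": {"next": "mo", "start": 1},
    "mo": {"next": "d", "start": 1},
    "d": {"next": "h", "start": 0},
    "wd": {"next": "h", "start": 0},
    "h": {"next": "m", "start": 0},
    "m": {"next": "s", "start": 0},
    "s": {"next": "ms", "start": 0},
    "ms": {"next": "us", "start": 0},
    "us": {"next": "ns", "start": 0},
}

def child_units(unit: str) -> list[tuple[str, int]]:
    """List every smaller unit below `unit`, paired with its 'start' value."""
    children = []
    metadata = POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING.get(unit)
    while metadata is not None:
        children.append((metadata["next"], metadata["start"]))
        metadata = POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING.get(metadata["next"])
    return children

print(child_units("d"))  # [('h', 0), ('m', 0), ('s', 0), ('ms', 0), ('us', 0), ('ns', 0)]
```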
With this dictionary in mind, we can then easily create the any condition:
# -- pattern example: >6h* cascade
simple_condition = self._generate_simple_condition(
unit, value, operator
) # generate the simple condition, e.g. hour>6
all_conditions = [simple_condition]
if operator == "gt": # cascade only affects > operator
equality_condition = self._generate_simple_condition(
unit, value, "eq"
) # generate hour==6
child_unit_conditions = []
child_unit_metadata = (
POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING.get(unit, None)
) # get the next smallest unit, e.g. minute
while child_unit_metadata is not None:
start_value = child_unit_metadata["start"]
child_unit = child_unit_metadata["next"]
child_unit_condition = self._generate_simple_condition(
child_unit, start_value, "gt"
) # generate minute > 0
child_unit_conditions.append(child_unit_condition)
child_unit_metadata = (
POLARS_DURATIONS_TO_IMMEDIATE_CHILD_MAPPING.get(
child_unit, None
)
) # now go on to seconds, and so on...
cascade_condition = generate_polars_condition(
[
equality_condition, # and condition for the hour unit
generate_polars_condition(child_unit_conditions, "or_"), # any condition for all the child units
],
"and_",
)
all_conditions.append(cascade_condition)
# -- final condition is hour>6 AND the cascade condition
overall_condition = generate_polars_condition(all_conditions, "or_")
The Bigger Picture
A processor like this isn’t just useful for ad-hoc analysis; it can be a core component of your data processing pipelines. One really useful use case for me is to use it alongside resampling. An easy filtering step enables me to easily calculate metrics on time series with regular disruptions or regular downtimes.
Further, with a few simple modifications I can extend this processor to allow easy labelling of my time series. This allows me to add regressors to bits that I know behave differently, e.g. if I’m modelling a time series that jumps at specific hours, I can add a step regressor to only those parts.
Concluding Remarks
In this article I outlined a processor that enables easy, flexible and concise time series filtration on Polars datasets. The logic discussed can be extended to your favourite dataframe processing library, such as Pandas, with some minor changes.
Not only is the processor useful for ad-hoc time series analysis, but it can be the backbone of data processing if chained with other operations such as resampling, or if used to create extra features for modelling.
I’ll conclude with some extensions that I have in mind to make the code even better:
- I’m thinking of creating a shortcut to define “weekend”, e.g. “==we”. This way I wouldn’t need to explicitly define “>=6wd”, which can be less clear.
- With proper design, I think it is possible to enable the addition of custom time identifiers. For example “==eve” to denote evening, the time for which can be user defined.
- I’m definitely going to add support for simply labelling the data, as opposed to filtering it
- And I’m going to add support for being able to define the boundaries as “keep”, e.g. instead of defining [“<6h”, “>=20h”] I can do [“>=6h<20h”]
Where to find the code
This project is in its infancy, so items may move around. As of 23.05.2024, you can find FilterDataBasedOnTime under mix_n_match/main.py.
GitHub – namiyousef/mix-n-match: repository for processing dataframes
All code, data and images by author unless specified otherwise
Intuitive Temporal DataFrame Filtration was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.