Pandas to Polars: what to know for time series analysis

Some important time series concepts work differently in Pandas and Polars. In this post I talk through the key differences to help you make the switch from Pandas to Polars.

I’m working with Polars version 0.20.6 here, but most of these differences should be independent of the version of Polars you are using.

No more string datetimes

In Pandas we can use date strings when working with dates and times. In Polars, on the other hand, we use Python datetime objects and we never use strings to do datetime operations.

To illustrate this we create a time series in Polars and then convert it to Pandas. To create a datetime column in Polars we use the confusingly-named datetime.datetime class in Python.

from datetime import datetime

import pandas as pd
import polars as pl
df_polars = pl.DataFrame(
    {
        "datetime": [
            datetime(2021,1,1), datetime(2021,1,2), datetime(2021,1,3)
        ], 
        "value": [1, 2, 3]
    }
)
df_pandas = df_polars.to_pandas()
df_polars
shape: (3, 2)
┌─────────────────────┬───────┐
│ datetime            ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2021-01-01 00:00:00 ┆ 1     │
│ 2021-01-02 00:00:00 ┆ 2     │
│ 2021-01-03 00:00:00 ┆ 3     │
└─────────────────────┴───────┘

In Pandas we can use datetime strings to filter datetimes like this:

df_pandas.loc[df_pandas["datetime"] > "2021-01-02"]

But in Polars we use the datetime.datetime class to filter dates:

df_polars.filter(pl.col("datetime") > datetime(2021,1,2))
shape: (1, 2)
┌─────────────────────┬───────┐
│ datetime            ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2021-01-03 00:00:00 ┆ 3     │
└─────────────────────┴───────┘

The Polars developers chose not to support string datetime representations because they are ambiguous. For example, 2021-01-02 could be the 2nd of January or the 1st of February depending on the locale.

Of course, we can still extract a string representation as a string column using the dt.strftime method:

df_polars.with_columns(pl.col("datetime").dt.strftime("%Y-%m-%d").alias("date_str"))
shape: (3, 3)
┌─────────────────────┬───────┬────────────┐
│ datetime            ┆ value ┆ date_str   │
│ ---                 ┆ ---   ┆ ---        │
│ datetime[μs]        ┆ i64   ┆ str        │
╞═════════════════════╪═══════╪════════════╡
│ 2021-01-01 00:00:00 ┆ 1     ┆ 2021-01-01 │
│ 2021-01-02 00:00:00 ┆ 2     ┆ 2021-01-02 │
│ 2021-01-03 00:00:00 ┆ 3     ┆ 2021-01-03 │
└─────────────────────┴───────┴────────────┘

Polars has different interval strings

In both Pandas and Polars we can represent intervals using strings, but the strings differ. In Pandas, for example, we use 30T (or 30min) for 30 minutes. In Polars we use 30m for 30 minutes. Here are some examples of interval strings in Polars:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 calendar day)
  • 1w (1 calendar week)
  • 1mo (1 calendar month)
  • 1q (1 calendar quarter)
  • 1y (1 calendar year)

We can compose these interval strings to create more complex intervals. For example, we can use 1h30m for 1 hour and 30 minutes.

Polars works with microseconds by default

In both libraries the datetime, date and duration dtypes are all based on an underlying integer representation of time. For example, with the pl.Datetime dtype, the integer is a count of time units since the start of the Unix epoch (1970-01-01).

Pandas counts in nanoseconds by default, whereas Polars counts in microseconds by default. The microseconds are denoted by us in the DataFrame schema below:

df_polars.schema
OrderedDict(
    [
        ('datetime', Datetime(time_unit='us', time_zone=None)),
        ('value', Int64)
    ]
)

However, Polars also supports nanosecond precision while Pandas also supports microsecond precision.

If we convert a Pandas DataFrame to a Polars DataFrame then the integer representations remain in nanoseconds. We can’t join two Polars DataFrames on a datetime if one has nanosecond precision and the other has microsecond precision. So when I convert from Pandas to Polars I normally cast datetime columns to microseconds straight away using the dt.cast_time_unit expression:

df_polars = pl.from_pandas(df_pandas).with_columns(pl.col("datetime").dt.cast_time_unit("us"))

A missing datetime in Polars is a null rather than a NaT

In Pandas a missing datetime in a datetime column is represented by NaT (not a time). In Polars a missing datetime is represented by the same value it is represented by in every column: null.

df_polars = pl.DataFrame(
    {
        "datetime": [
            datetime(2021,1,1), None, datetime(2021,1,3)
        ], 
        "value": [1, 2, 3]
    }
)
df_polars
shape: (3, 2)
┌─────────────────────┬───────┐
│ datetime            ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2021-01-01 00:00:00 ┆ 1     │
│ null                ┆ 2     │
│ 2021-01-03 00:00:00 ┆ 3     │
└─────────────────────┴───────┘

I find that having the same representation for missing values in every column makes it easier to work with missing values in Polars. This is because I don’t have to remember different representations of missing values for different dtypes, e.g. NaN versus NaT versus None in Pandas.

Temporal groupby in Polars has its own method

In Pandas you do a temporal groupby by passing a pd.Grouper object to groupby:

df_pandas.set_index("datetime").groupby(pd.Grouper(freq='D')).mean()

In Polars we have a dedicated method for temporal groupby: group_by_dynamic. In this example we get the mean value for each day:

df_polars.sort("datetime").group_by_dynamic("datetime", every="1d").agg(pl.col("value").mean())

Note that we sort the DataFrame by the datetime column before we do the groupby. This is because the group_by_dynamic method requires the data to be sorted by the column we are grouping by.

As in Pandas, we have lots of flexibility in how the grouping windows are set. For example, if we want to offset the start of the windows by 2 hours we can do this:

df_polars.sort("datetime").group_by_dynamic("datetime", every="1d", offset="2h").agg(pl.col("value").mean())

Polars has fast-path operations on sorted data

Polars can take advantage of sorted data to speed up operations using fast-path operations. These fast-path operations occur where Polars knows a column is sorted and can therefore use a faster algorithm to perform the operation. As time series data has a natural sort order it is particularly important to be aware of fast-paths for time series analysis.

We can adapt our filter code above for a simple example of a fast-path operation on time series data. This time we are looking for datetimes before the 2nd of January.

df_polars.filter(pl.col("datetime") < datetime(2021,1,2))

If Polars knows that the datetime column is sorted then the fast-path operation is to stop scanning the column once it finds the first row that is greater than or equal to the filter value. This can be much faster than scanning the whole column.

Other important time-series methods that support fast-path operations include group_by and join.

You can see my many other Polars posts here: https://www.rhosignal.com/tags/polars/

If you would like more detailed support on working with Polars then I provide consulting on optimising your data processing pipelines with Polars. You can also check out my online course to get you up-and-running with Polars.

This post is licensed under CC BY 4.0 by the author.