
Filling time series gaps in lazy mode

Two major advantages of Polars over Pandas are that Polars has a lazy mode with query optimization and that Polars can scale to larger-than-memory datasets with its streaming mode. Taking advantage of these sometimes requires re-thinking how you might write the same operations in Pandas…

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters

A gappy time series

In this simple example I have a time series that is missing some timestamps. I want to add rows for these missing times and interpolate over the gaps.

We define a time series that is missing an entry at 2020-01-01 02:00:00.

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 1, 1),
            datetime(2020, 1, 1, 3),
        ],
        "values": [0, 1, 3],
    }
)

This DataFrame looks like this:

shape: (3, 2)
┌─────────────────────┬────────┐
│ time                ┆ values │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ i64    │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0      │
│ 2020-01-01 01:00:00 ┆ 1      │
│ 2020-01-01 03:00:00 ┆ 3      │
└─────────────────────┴────────┘

Filling gaps in eager mode with upsample

We can fill gaps in eager mode with upsample - just as in Pandas.

df.set_sorted("time").upsample("time", every="1h")

The output with upsample looks like this:

shape: (4, 2)
┌─────────────────────┬────────┐
│ time                ┆ values │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ i64    │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0      │
│ 2020-01-01 01:00:00 ┆ 1      │
│ 2020-01-01 02:00:00 ┆ null   │
│ 2020-01-01 03:00:00 ┆ 3      │
└─────────────────────┴────────┘

The main difference between Polars and Pandas here is that Polars requires the column to be sorted before calling upsample. This is because the implementation requires sorted data and Polars wants to avoid doing an expensive sort if we can tell it that the data is already sorted.

The problem with this approach is that upsample is an eager operation. This means we have to load the whole DataFrame into memory before performing the operation. This is fine for small DataFrames but will not scale to larger-than-memory datasets.

Filling gaps in lazy mode

To fill gaps in lazy mode we first define a DataFrame that has the time series with no gaps, using the pl.date_range function.

pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2020, 1, 1),
            end=datetime(2020, 1, 1, 3),
            interval="1h",
            eager=True,
        )
    }
)

We then left join the original DataFrame to this DataFrame using the time column as the join key. The key point is that we call lazy on each of the DataFrames before joining them. This tells Polars to perform the join in lazy mode.

pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2020, 1, 1),
            end=datetime(2020, 1, 1, 3),
            interval="1h",
            eager=True,
        )
    }
).lazy().join(df.lazy(), on="time", how="left")

If we evaluate this code with collect we get the following output:

shape: (4, 2)
┌─────────────────────┬────────┐
│ time                ┆ values │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ i64    │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0      │
│ 2020-01-01 01:00:00 ┆ 1      │
│ 2020-01-01 02:00:00 ┆ null   │
│ 2020-01-01 03:00:00 ┆ 3      │
└─────────────────────┴────────┘

Interpolating over the gaps

We can now interpolate over the gaps using the interpolate expression.

pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2020, 1, 1),
            end=datetime(2020, 1, 1, 3),
            interval="1h",
            eager=True,
        )
    }
).lazy().join(df.lazy(), on="time", how="left").with_columns(
    pl.col("values").interpolate()
)

If we evaluate this code with collect we get the following output:

shape: (4, 2)
┌─────────────────────┬────────┐
│ time                ┆ values │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ i64    │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0      │
│ 2020-01-01 01:00:00 ┆ 1      │
│ 2020-01-01 02:00:00 ┆ 2      │
│ 2020-01-01 03:00:00 ┆ 3      │
└─────────────────────┴────────┘

Streaming mode

As I covered in a previous post, we can check whether a lazy query will use the streaming engine by calling explain(streaming=True) and looking for a code block bounded by PIPELINE markers.

pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2020, 1, 1),
            end=datetime(2020, 1, 1, 3),
            interval="1h",
            eager=True,
        )
    }
).lazy().join(df.lazy(), on="time", how="left").with_columns(
    pl.col("values").interpolate()
).explain(streaming=True)

In this case we get the following output:

 WITH_COLUMNS:
 [col("values").interpolate()]
  --- PIPELINE
LEFT JOIN:
LEFT PLAN ON: [col("time")]
  DF ["time"]; PROJECT */1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("time")]
  DF ["time", "values"]; PROJECT */2 COLUMNS; SELECTION: "None"
END LEFT JOIN  --- END PIPELINE

    DF []; PROJECT */0 COLUMNS; SELECTION: "None"

Here we see that the join can be done in streaming mode but the interpolation cannot. Interpolation is a challenging operation for streaming mode because it can require data from different batches.

Generalising this approach for multiple time series

The example here is relatively simple but can be generalised. In the ML time series forecasting pipelines that I build, for example, I typically have multiple time series in a DataFrame with an id column to distinguish them. In this case I do an extra cross join of the time steps and the IDs before doing the left join with the gappy data.


Next steps

Want to know more about Polars for high performance data science? Then you can:

This post is licensed under CC BY 4.0 by the author.