Embrace streaming mode in Polars

Polars can handle larger-than-memory datasets with its streaming mode. In this mode Polars processes your data in batches rather than all at once. However, streaming mode is not an emergency switch that you should only hit when you run out of memory. For many queries streaming mode is as quick as, or quicker than, non-streaming mode.

This means it is worth keeping streaming switched on if you are working with larger datasets, particularly if you are building pipelines that you want to be ready for larger datasets in the future.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters

Simple example of streaming and non-streaming

To work in streaming mode we simply pass the streaming=True argument to collect when we evaluate a query.

First we create a DataFrame with 1 million rows, 100 floating point columns and an integer ID column.

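import numpy as np
import polars as pl

N = 1_000_000  # number of rows
K = 100  # number of floating point columns

df = (
    # The N x K array of random floats becomes columns column_0 ... column_99
    pl.DataFrame(np.random.standard_normal((N, K)))
    # Add an integer ID column with values from 0 to 8
    .with_columns(pl.Series("id", np.random.randint(0, 9, N)))
)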

We then group by the id column and take the mean of the remaining columns. We execute the query in streaming mode with the streaming=True argument.

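(
    df
    .lazy()
    # group_by is the current method name (it was groupby before Polars 0.19)
    .group_by('id')
    .agg(
        # Mean of every non-key column within each group
        pl.all().mean()
    )
    .collect(
        # Run the query with the streaming engine
        streaming=True
    )
)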

Comparing this query with streaming=True and streaming=False (the default), I get an average of 75 ms for streaming and 120 ms for non-streaming. For comparison, the same query takes about 330 ms in Pandas.

Takeaway

For many queries, running in streaming mode may be a great default rather than an emergency button that should only be hit when you are struggling with memory.

I’m not going to guarantee that streaming will always be at least as fast as non-streaming, though. This is still a developing technology within Polars and there are surely use cases where streaming will be significantly slower. If you find such a case you are very welcome to discuss it on the Polars discord.

Also note that streaming is not supported for all operations in lazy mode at this point, but it does work for core operations such as groupby and join.

For more on streaming, check out my other posts on the topic, or the video where I process a 30 GB dataset on a not-very-impressive laptop.

This post is licensed under CC BY 4.0 by the author.