
Lazy mode's hidden timesaver in Polars

Lazy mode in Polars does more than provide query optimisation and let you work with larger-than-memory datasets. It also provides some type safety that can find errors in your pipeline before you start crunching through lots of data.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters

Basic setup

We illustrate the idea with a simple pipeline below where we create a DataFrame from some data and do a transformation on it in eager mode.

import polars as pl

df = (
    pl.DataFrame(
        {
            "groups":['a','a','b','b','c'],
            "values":[0,1,2,3,4]
        }
    )
    .with_columns(
        pl.col('values').round(0)
    )
)

The problem is that our data transformation isn’t valid: the values column has an integer dtype but the round expression can only be called on a column with a floating point dtype.

If we run the code above we get the following exception:

SchemaError: Int64 is not a floating point datatype

In this eager mode example Polars found the error after it had created the DataFrame and tried to do the with_columns transformation.

What happens in lazy mode?

We can try this again in lazy mode where we convert the DataFrame to a LazyFrame once it has been created but before we do the with_columns transformation.

df = (
    pl.DataFrame(
        {
            "groups":['a','a','b','b','c'],
            "values":[0,1,2,3,4]
        }
    )
    .lazy()
    .with_columns(
        pl.col('values').round(0)
    )
)

If we run this we see that…nothing happens. Polars has just created a LazyFrame containing the erroneous expression. This is because Polars doesn’t test for schema errors until we execute the pipeline.

If we try to execute the pipeline with collect (to process all the data) or fetch (to process a subset), then we see our SchemaError.

df.collect()
SchemaError: Int64 is not a floating point datatype

The key point here is that Polars finds this error when we call collect, but before the actual time-consuming part of processing the data.

In this way, lazy mode in Polars can help you find errors in your pipeline at the start of a pipeline run rather than a long way into it.

This post is licensed under CC BY 4.0 by the author.