ML pre-processing with Polars

This post was created while writing my Up & Running with Polars course. Check it out for a free preview of the first chapters.

I think we’ll see a nice ML pre-processing library develop around Polars in the next year. A recent addition to the library makes an important step on that journey easier…

Fill some nulls

A common step in ML pre-processing is applying values computed on the training set to the test set. For example, we may want to fill nulls in the test set with a statistic, such as the median, from the training set.
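To make the idea concrete, here is a minimal plain-Python sketch of that step, independent of Polars. The lists `train_ages` and `test_ages` and their values are hypothetical toy data, and `None` stands in for a null.

```python
from statistics import median

# Hypothetical toy data: None marks a missing Age value
train_ages = [20.0, 30.0, 40.0, None]
test_ages = [25.0, None, 35.0]

# Compute the fill value from the training set only, ignoring nulls
train_median = median(a for a in train_ages if a is not None)

# Apply that training-set value to the nulls in the test set
test_filled = [train_median if a is None else a for a in test_ages]
```

The key point is that the fill value comes only from the training data, so nothing about the test set leaks into the pre-processing step.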

The new with_context method in Polars does just that. It lets you reference columns from one dataframe in expressions evaluated on another dataframe!

In the example below we have nulls in the Age column.

import polars as pl

# test_df and train_df are assumed to be LazyFrames (e.g. created with .lazy())
(
    test_df
    .with_context(
        # Rename train columns to avoid a column name collision
        train_df.select(pl.all().name.suffix("_train"))
    )
    # Fill nulls in test with median from train
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age_train").median())
    )
)

We want to replace nulls in the test set with the median from the training set.

We do this by calling with_context on the test dataframe to bring the train dataframe into the context. Then we can fill some nulls!

Keeping it lazy

The advantage of with_context is that we stay in the powerful lazy mode in Polars, so we still take advantage of things like query optimisation.

In fact, with_context is only available in lazy mode, as this is how Polars brings the different parts of the query together.

Learn more

Want to know more about Polars for high performance data science and ML? Then you can:

or let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.