This post was created while writing my Up & Running with Polars course. Check it out here, with a free preview of the first chapters.

I think we’ll see a nice ML pre-processing library develop around Polars in the next year. A recent addition to the library makes an important step on that journey easier…

Fill some nulls

A common step in ML pre-processing is reusing statistics from the training set when transforming the test set. For example, we may want to fill nulls in the test set with values computed on the train set.

The new with_context method in Polars does just that. It lets expressions on one lazy dataframe refer to columns from another dataframe!

In the example below we have nulls in the Age column.

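import polars as pl

(
    # with_context is a LazyFrame method, so make both frames lazy
    test_df.lazy()
    .with_context(
        # Rename train columns to avoid a column name collision
        train_df.lazy().select(pl.all().name.suffix("_train"))
    )
    # Fill nulls in test with median from train
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age_train").median())
    )
    # Call .collect() on the result to run the query
)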

We want to replace nulls in the test set with the median from the training set.

We do this by calling with_context on the test dataframe to bring the train dataframe into the context. Then we can fill some nulls!

Keeping it lazy

The advantage of with_context is that we stay in the powerful lazy mode in Polars, so we still take advantage of things like query optimisation.

In fact, with_context is only available in lazy mode, as this is how Polars brings the different parts of the query together.

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.
