This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters.
I think we’ll see a nice ML pre-processing library develop around Polars in the next year. A recent addition to the library makes an important step on that journey easier…
A common step in ML pre-processing is sharing data from the training set with the test set. For example, we may want to fill nulls in the test set with values from the train set.
The with_context method in Polars does just that. It lets you use expressions from one dataframe inside another dataframe!
In the example below we have nulls in the Age column.
```python
(
    test_df
    .with_context(
        # Rename train columns to avoid a column name collision
        train_df.select(pl.all().name.suffix("_train"))
    )
    # Fill nulls in test with median from train
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age_train").median())
    )
)
```
We want to replace nulls in the test set with the median from the training set.
We do this by calling with_context on the test dataframe to bring the train dataframe into the context. Then we can fill some nulls!
The advantage of with_context is that we stay in the powerful lazy mode in Polars, so we still take advantage of things like query optimisation.
In fact, with_context is only available in lazy mode, as this is how Polars brings the different parts of the query together.
Want to know more about Polars for high performance data science and ML? Then you can:
- join my Polars course on Udemy
- follow me on Twitter
- connect with me on LinkedIn
- check out my YouTube videos
or let me know if you would like a Polars workshop for your organisation.