
Fit XGBoost models directly from Polars and Arrow

Published on: 11th October 2022

Can you use Polars and Apache Arrow to fit ML models?

This post was created while writing my Data Analysis with Polars course. Check it out on Udemy.

Update: The XGBoost developers may withdraw support for fitting models with Arrow - see my discussion with them in this issue. I recommend following their advice to call to_pandas on your Polars DataFrame before fitting. I wouldn’t lose too much sleep over this: in my current ML pipeline, which runs for about 5 minutes, the conversion adds about 2 seconds to the total run time.

Here’s the original blog post:

Polars is backed by Apache Arrow rather than NumPy. One argument you hear against working in Polars is that you’ll have to convert back to NumPy to fit ML models.

Does this argument against using Polars and Apache Arrow libraries hold water?

Nope - it’s not true now, and it will become even less true over time.

Let’s take a Polars DataFrame of the Titanic data as an example. We will:

  • Do some simple feature engineering
  • Pass it to XGBoost in its Arrow form
  • Fit the model
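import polars as pl
import xgboost as xgb

# Placeholder path: point this at your own copy of the Titanic CSV
csvPath = "titanic.csv"

df = pl.read_csv(csvPath)
# One-hot encode the passenger class, then hand the features over as an Arrow table
X = (
    df
    .select(["Pclass"])
    .to_dummies()
    .to_arrow()
)
y = df["Survived"]

model = xgb.XGBClassifier(objective="binary:logistic")
model.fit(X, y)

# Add the predicted probability of the positive class as a new column
df = pl.concat(
    [
        df,
        pl.DataFrame({"pos": model.predict_proba(X)[:, 1]}),
    ],
    how="horizontal",
)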

No NumPy or Pandas required.

We can do this because XGBoost introduced support for Arrow in recent months. Other ML and feature engineering libraries are working on Arrow support as well.

In addition, if your library does need a NumPy array, it’s often quicker to load and pre-process your data in Polars and convert to a NumPy array at the last minute than to do that work in Pandas.

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.