Fit Scikit-learn and XGBoost models directly from Polars

Published on: 11th October 2022. Updated: 1st November 2023.

Can you use Polars to fit ML models without Numpy?

The data in a Polars DataFrame is stored in the Apache Arrow memory format rather than in a Numpy array. In the early days this meant that we had to convert the data to a Numpy array manually before we could use it in machine learning libraries.

However, this is no longer the case. In this post we see that we can fit XGBoost and some scikit-learn models directly from a Polars DataFrame. The journey isn’t fully over though: the libraries are still likely to copy the data internally into their own preferred format.

This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters.

Let’s create a Polars DataFrame with some random data and see if we can fit an XGBoost model directly from it.

import polars as pl
import xgboost as xgb

# Set the number of rows in the DF
N = 100
# Create the DF with 2 features and a label
df = (
    pl.DataFrame(
        {
            # Use pl.arange to create a sequence of integers
            "feat1":pl.arange(0,N,eager=True),
            # Shuffle the sequence for the second feature
            "feat2":pl.arange(0,N,eager=True).shuffle(),
            # Create a label with 0s and 1s
            "label":[0]*(N//2) +  [1]*(N//2)
        }
    )
)

model = xgb.XGBClassifier(objective="binary:logistic")
# Select the features and the label as Polars DataFrames
X = df.select("feat1", "feat2")
y = df.select("label")
# Fit the model directly from the Polars DataFrames
model.fit(X=X, y=y)
# Add the predicted probability of the positive class to the DF
df = pl.concat(
    [
        df,
        pl.DataFrame(model.predict_proba(X)[:, 1], schema=["pos"]),
    ],
    how="horizontal",
)

This all works and we can let XGBoost handle any data conversions.

Now let’s try with a logistic regression model from scikit-learn.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
# Fit and predict directly from the Polars DataFrame
model.fit(df.select("feat1", "feat2"), df.select("label"))
model.predict(df.select("feat1", "feat2"))

Again this just works. Note that scikit-learn currently makes an internal copy from Polars to Numpy, but this support takes us a step towards full native support for Arrow data. Not all scikit-learn models and processes currently support Polars, but this is still a great step forward.

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.