Fitting linear models within Polars
Post
Cancel

# Fitting linear models within Polars

Linear models are a staple of data science and machine learning. They are simple to understand and quick to fit. With this great new Polars plugin you can now fit linear model - including Lasso and Ridge regression - directly within Polars.

This is based on work done by Amzy Rajab in this github repo.

The first step if you want to follow along is to install the plugin. You can do this with the following command

```1 pip install polars-ols patsy ```

where the `patsy` package is used to parse formulae (though we don’t use this below).

## Registering the namespace

We fit the models with the polars-ols package. Polars-ols is a Polars plugin. When we import a plugin the plugin registers its namespace with Polars. A namespace is a set of expressions that are gathered under a title and works in the same way as built-in namespaces such as `dt` for timeseries expressions or `str` for string expressions.

We start by importing Polars and the plugin

```1 2 import polars as pl import polars_ols as pls ```

When we import a plugin the plugin registers its namespace with Polars. We can then access the expressions in the namespace which in this case is called `least_squares`.

## Fitting a linear model

We create a `DataFrame` with a target column `y` that we regress against two predictor columns `x1` and `x2`. This is adapted from the example made by Amzy Rajab

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 df = pl.DataFrame( { "y": [1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47], "x1": [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27], "x2": [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75], } ) df.head() shape: (5, 3) ┌───────┬───────┬───────┐ │ y ┆ x1 ┆ x2 │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞═══════╪═══════╪═══════╡ │ 1.16 ┆ 0.72 ┆ 0.24 │ │ -2.16 ┆ -2.43 ┆ 0.18 │ │ -1.57 ┆ -0.63 ┆ -0.95 │ │ 0.21 ┆ 0.05 ┆ 0.23 │ │ 0.22 ┆ -0.07 ┆ 0.44 │ └───────┴───────┴───────┘ ```

We start by fitting an ordinary least squares (i.e. vanilla linear regression) model. We specify:

• the target column as `pl.col("y")`
• an ordinary least squares model with the `least_squares.ols` expression
• the predictors as a list of expressions inside `least_squares.ols`
• the name of the output column of predictions with `alias`
```1 2 3 4 5 6 7 8 ols_expr = ( pl.col("y") .least_squares.ols( pl.col("x1"), pl.col("x2") ) .alias("ols") ) ```

We can then add a column with the predictions by passing the expression to `with_columns`

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ( df .with_columns( ols_expr ) ) shape: (10, 4) ┌───────┬───────┬───────┬───────────┐ │ y ┆ x1 ┆ x2 ┆ ols │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 ┆ f32 │ ╞═══════╪═══════╪═══════╪═══════════╡ │ 1.16 ┆ 0.72 ┆ 0.24 ┆ 0.940459 │ │ -2.16 ┆ -2.43 ┆ 0.18 ┆ -2.196536 │ │ -1.57 ┆ -0.63 ┆ -0.95 ┆ -1.55357 │ │ 0.21 ┆ 0.05 ┆ 0.23 ┆ 0.275953 │ │ 0.22 ┆ -0.07 ┆ 0.44 ┆ 0.366057 │ │ 1.6 ┆ 0.65 ┆ 1.01 ┆ 1.632354 │ │ -2.11 ┆ -0.02 ┆ -2.08 ┆ -2.07331 │ │ -2.92 ┆ -1.64 ┆ -1.36 ┆ -2.945234 │ │ -0.86 ┆ -0.92 ┆ 0.01 ┆ -0.889025 │ │ 0.47 ┆ -0.27 ┆ 0.75 ┆ 0.476734 │ └───────┴───────┴───────┴───────────┘ ```

And voila! We have the predictions from the linear model in the `ols` column.

## Fitting a regularised model

The library also supports fitting regularised models such as Lasso and Ridge regression. We can fit a Lasso model by using `least_squares.lasso` instead of `least_squares.ols` . We also specify the `alpha` parameter which is the regularisation strength.

```1 2 3 4 5 6 7 8 9 lasso_expr = ( pl.col("y") .least_squares.lasso( pl.col("x1"), pl.col("x2"), alpha=0.0001, add_intercept=True ) ) ```

Note - I’ve compared the results of the Elastic Net model with the results from the Scikit-learn library and they closely match.

## Training a model for use on a test set

In the examples above we made predictions on the same data that we used to train the model. In practice we typically want to train the model on a training set and then apply it to new data. We can do this with `python-ols` because we can also have it output the regression coefficients instead of the predictions.

Returning to our original ordinary least squares model we can get the coefficients by setting `mode="coefficients"` in the `ols` method.

```1 2 3 4 5 6 7 8 9 10 ols_coef_expr = ( pl.col("y") .least_squares.ols( pl.col("x1"), pl.col("x2"), add_intercept=True, mode="coefficients" ) .alias("ols_intercept") ) ```

The coefficients are returned as a `pl.Struct` column that we unnest to get the values as separate columns.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ( df .select( ols_coef_expr ) .unnest("ols_intercept") ) shape: (1, 3) ┌──────────┬──────────┬──────────┐ │ x1 ┆ x2 ┆ const │ │ --- ┆ --- ┆ --- │ │ f32 ┆ f32 ┆ f32 │ ╞══════════╪══════════╪══════════╡ │ 0.977375 ┆ 0.987413 ┆ 0.000757 │ └──────────┴──────────┴──────────┘ ```

Now that we have the coefficients we can use them to predict the target variable on a new dataset.

To make this really streamlined we can use the classic Scikit-learn approach of having a `fit` and `transform` method (though you could also call it a `predict` method). We can do this by creating a linear regression class with these methods

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 from typing import List class LinearRegressor: def __init__( self, target_col:str="y", feature_cols:List[str]=["x1","x2"], model="ols", add_intercept:bool=False ): self.target_col = target_col self.feature_cols = [pl.col(feature) for feature in feature_cols] self.add_intercept = add_intercept if model == "ols": self.model_expr = ( pl.col(self.target_col) .least_squares.ols( *self.feature_cols, mode="coefficients", add_intercept=self.add_intercept ) .alias("coef") ) def fit(self, X): # Fit the model and save the coefficients in a DataFrame self.coefficients_df = ( X .select( self.model_expr ) .unnest("coef") ) self.coef_ = ( self.coefficients_df .select(self.feature_cols) .melt() ) if self.add_intercept: self.intercept_ = self.coefficients_df["const"][0] else: self.intercept_ = 0.0 return self def transform(self, X: pl.DataFrame): # Make predictions using the saved coefficients return ( X # Add a row index .with_row_index() .pipe( # Join the predictions lambda X: X.join( # Select the predictor columns X.select("index", *self.feature_cols) # Melt (so we can join the coefficients) .melt(id_vars="index",value_name="predictor") .join( # Join the coefficients ( X .select( self.model_expr ) .unnest("coef") .melt(value_name="coef") ), on="variable", ) # Multiply by the predictors .with_columns(pred=(pl.col("predictor") * pl.col("coef"))) # Gather back up into rows .group_by("index") .agg( pl.col("pred").sum() + self.intercept_ ), on="index", ) ) .sort("index") ) ```

The basic idea is that:

• in `fit` we create the `DataFrame` we saw earlier with the variable names and coefficients
• in `transform` we melt the feature matrix `X` to long format so that we can join the coefficients to the data on the variable names. We then multiply the coefficients by the data and sum across all data on the same row to get the predictions.

We then use this class to make predictions on a test `DataFrame` as follows:

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 df_train, df_test = df[:7], df[7:] linear_regressor = LinearRegressor(target_col="y",feature_cols=["x1","x2"],model="ols") reg.fit(X=df_train) reg.transform(X=df_test) shape: (3, 5) ┌───────┬───────┬───────┬───────┬───────────┐ │ index ┆ y ┆ x1 ┆ x2 ┆ pred │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞═══════╪═══════╪═══════╪═══════╪═══════════╡ │ 0 ┆ -2.92 ┆ -1.64 ┆ -1.36 ┆ -2.945234 │ │ 1 ┆ -0.86 ┆ -0.92 ┆ 0.01 ┆ -0.889025 │ │ 2 ┆ 0.47 ┆ -0.27 ┆ 0.75 ┆ 0.476734 │ └───────┴───────┴───────┴───────┴───────────┘ ```

This is just the start of what you can do with the polars-ols plugin. Topics I haven’t covered here include fitting models to different sub-groups of data, fitting models on rolling windows and fitting models with formulae. I cover these in more detail in my Up & Running with Polars course.

Next step - check it out on your own data!

## Next steps

Want to know more about Polars for high performance data science? Then you can:

This post is licensed under CC BY 4.0 by the author.
Recently Updated
Trending Tags
Contents
Trending Tags