
Cheatsheet for Pandas to Polars

Getting started with Polars? This post shows you how to convert some familiar Pandas commands to Polars. But it also tries to go beyond that to introduce you to some of the more fundamental differences between Pandas and Polars.

Want to get going with Polars? Check out my Up & Running with Polars course or join me at my next workshop on 14th March

In the following examples we compare Polars v0.20.10 with Pandas v2.2.0. I have automated testing for the snippets on this page and will endeavour to update it when things change.

We first create a sample dataset in Polars

import polars as pl
import polars.selectors as cs

import pandas as pd
import numpy as np

df = pl.DataFrame(
    {
        'grp': [1, 2, 1, 2, 1, 2],
        'x': list(range(6, 0, -1)),
        'y': list(range(4, 10)),
        'z': [3, 4, 5, 6, 7, None],
        "ref" : list('abcdef')
    }
)
shape: (6, 5)
┌─────┬─────┬─────┬──────┬─────┐
│ grp ┆ x   ┆ y   ┆ z    ┆ ref │
│ --- ┆ --- ┆ --- ┆ ---  ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64  ┆ str │
╞═════╪═════╪═════╪══════╪═════╡
│ 1   ┆ 6   ┆ 4   ┆ 3    ┆ a   │
│ 2   ┆ 5   ┆ 5   ┆ 4    ┆ b   │
│ 1   ┆ 4   ┆ 6   ┆ 5    ┆ c   │
│ 2   ┆ 3   ┆ 7   ┆ 6    ┆ d   │
│ 1   ┆ 2   ┆ 8   ┆ 7    ┆ e   │
│ 2   ┆ 1   ┆ 9   ┆ null ┆ f   │
└─────┴─────┴─────┴──────┴─────┘

For more posts like this check out:

  • Pandas to Polars time series differences

Accessing data in a DataFrame

There are two ways to access data in a Polars DataFrame:

  • using square brackets with [] (often called “indexing”) and
  • using the expression API with methods like filter, select and with_columns

These square bracket and expression API approaches have different use cases. The basic rule is that you should use the expression API unless you are doing a one-off operation such as:

  • inspecting the values of some rows or columns
  • converting a DataFrame column to a Series

In these cases use the [] approach.

The expression API is more powerful than the [] approach because:

  • operations with the expression API are run in parallel and
  • operations within the expression API can be optimised in lazy mode
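
As a minimal sketch of the two styles, using the sample DataFrame above:

# One-off inspection: square brackets are fine
first_rows = df[:2]   # first two rows
x_series = df["x"]    # extract a column as a pl.Series

# Transformations: prefer the expression API, which can evaluate column
# expressions in parallel (and optimise them in lazy mode)
out = df.select(
    (pl.col("x") * 2).alias("x_doubled"),
    (pl.col("y") - pl.col("z")).alias("y_minus_z"),
)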

Accessing data using the expression API

Selecting and transforming a DataFrame

| Operation | Pandas | Polars |
| --- | --- | --- |
| Select a subset of columns | `df[["x","y"]]` | `df.select("x","y")` |
| Select and transform columns | `df[["x","y"]].astype(float)` | `df.select(pl.col("x","y").cast(pl.Float64))` |
| Add a column from a constant | `df["w"] = 1` | `df.with_columns(w = pl.lit(1))` |
| | | `df.with_columns(pl.lit(1).alias("w"))` |
| Add a column from a list | `df["w"] = list(range(1,7))` | `df.with_columns(pl.Series("w", list(range(1,7))))` |
| Add a column from other columns | `df["w"] = df["x"] + df["y"]` | `df.with_columns(w = pl.col("x") + pl.col("y"))` |
| Change dtype of a column | `df.assign(x = lambda df: df["x"].astype("float"))` | `df.with_columns(pl.col("x").cast(pl.Float64))` |
| Rename a column | `df.rename(columns={"x":"x2"})` | `df.rename({"x":"x2"})` |
| Drop columns | `df.drop(columns=["x","y"])` | `df.drop(["x","y"])` |
| | | `df.drop("x","y")` |
| Sorting | `df.sort_values("x")` | `df.sort("x")` |
| Copying | `df.copy()` | `df.clone()` |
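
Putting a few of these together, a sketch of a typical chained transformation on the sample data:

out = (
    df.with_columns(w=pl.col("x") + pl.col("y"))  # derive a new column
    .select(pl.col("grp", "w").cast(pl.Float64))  # keep a subset and cast it
    .sort("w")                                    # sort by the new column
)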

In these examples we’ve typed out the column names explicitly. Polars also has lots of ways to select multiple columns based on their dtype or patterns in their names.

| Select | Polars |
| --- | --- |
| All columns | `df.select(pl.all())` |
| All integer columns | `df.select(pl.col(pl.INTEGER_DTYPES))` |
| … except `x` | `df.select(pl.col(pl.INTEGER_DTYPES).exclude('x'))` |
| All integer columns (with selectors) | `df.select(cs.integer())` |
| By regex | `df.select(cs.matches('x\|y'))` |
| By name pattern | `df.select(cs.starts_with('r'))` |
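
As a quick sketch of how selectors combine (cs.numeric and cs.by_name are also part of the selectors module):

# Take the mean of every numeric column in one go
df.select(cs.numeric().mean())

# Selectors support set algebra: integer columns except "x"
df.select(cs.integer() - cs.by_name("x"))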

Polars does not support in-place operations so when we do any transformations we re-assign the DataFrame variable. For example if we add a new constant column we must do it like this:

df = df.with_columns(w = pl.lit(1))

Creating a new DataFrame like this is cheap in Polars: it does not copy the underlying data, it just creates a new reference to it.

Filtering rows
| Operation | Pandas | Polars |
| --- | --- | --- |
| Filter rows | `df.loc[df.ref == 'c']` | `df.filter(pl.col("ref") == "c")` |
| | `df.query("ref == 'c'")` | `df.filter(ref = "c")` [1] |
| Filter rows (text operator) | `df.loc[df.ref.eq('c')]` | `df.filter(pl.col("ref").eq("c"))` |
| Multiple filters | `df.loc[(df.ref == 'c') & (df.x > 1)]` | `df.filter((pl.col("ref") == "c") & (pl.col("x") > 1))` |
| | | `df.filter(pl.col("ref") == "c", pl.col("x") > 1)` [2] |
| Multiple filters (optimised) | | `df.lazy().filter(pl.col("ref") == "c").filter(pl.col("x") > 1)` [3] |
| Multiple filters (OR condition) | `df.loc[(df.ref == 'c') \| (df.x > 1)]` | `df.filter((pl.col("ref") == "c") \| (pl.col("x") > 1))` |
| Is in condition | `df.loc[df.ref.isin(['a','b'])]` | `df.filter(pl.col("ref").is_in(['a','b']))` |
| Is between condition | `df.loc[df.x.between(2,4)]` | `df.filter(pl.col("x").is_between(2,4))` |

[1] In the second row of the Polars column we did a filter without using pl.col. Polars has a special syntax for filtering by column name: a keyword argument to filter - like ref = "c" - is treated as an equality condition on that column. This approach only works for equality conditions.

[2] Here we apply multiple filter conditions without the annoying brackets and &. Instead we pass a comma separated list of conditions to the filter method.

[3] In the optimised example for Polars we used separate filter calls. However, as we are in lazy mode the Polars query optimiser combines these into a single filter condition applied on a single pass through the data.
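
You can see this for yourself with LazyFrame.explain, which prints the optimised query plan (the exact plan text may vary between Polars versions):

lazy_query = (
    df.lazy()
    .filter(pl.col("ref") == "c")
    .filter(pl.col("x") > 1)
)
print(lazy_query.explain())  # the plan shows a single combined SELECTION
result = lazy_query.collect()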

There’s a lot more to filtering in Polars. For example, if you do a filter late in a query that could be done earlier (and so reduce the amount of data processed) then the Polars query optimiser moves the filter earlier in the query. Even better, if you apply a filter on Parquet data stored in the cloud, Polars tries to apply those filters in the cloud storage layer before the data is transferred across the network. Check out my course for more on this.

Accessing data using []

| Operation | Pandas | Polars |
| --- | --- | --- |
| Get column as Series | `df["grp"]` | `df["grp"]` |
| Cell indexing by location | `df.iloc[1, 1]` | `df[1, 1]` |
| Row slicing by location | `df.iloc[1:3]` | `df[1:3]` |
| Column slicing by location | `df.iloc[:, 1:]` | `df[:, 1:]` |
| Row indexing by label | `df.loc['c']` | `df.filter(pl.col("index") == "c")` |
| Column indexing by label | `df.loc[:, 'x']` | `df[:, "x"]` |
| | | `df.select("x")` |
| Column indexing by labels | `df.loc[:, ['x', 'z']]` | `df[:, ['x', 'z']]` |
| | | `df.select(['x', 'z'])` |
| Column slicing by label | `df.loc[:, 'x':'z']` | `df[:, "x":"z"]` |
| Mixed indexing | `df.loc['c'][1]` | `df.filter(pl.col("index") == "c")[0, 1]` |

Note that Polars has no row index, so label-based row lookups become filters on an ordinary column (the rows above assume the Pandas labels live in a Polars column called "index").

Note: when a query in Pandas returns a single row, that row is returned as a Series. If the row contains both floats and integers then Pandas casts the integers to floats in the Series. Polars instead returns a one-row DataFrame that keeps the original dtypes.
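
A small sketch of this difference, using a hypothetical mixed-dtype frame (pl.from_pandas assumes pyarrow is installed):

pdf = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
print(pdf.iloc[0])             # pandas Series: the int column becomes float64
print(pl.from_pandas(pdf)[0])  # Polars one-row DataFrame: "i" stays an integer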

Duplicates and missing values

| Operation | Pandas | Polars |
| --- | --- | --- |
| Keep unique rows | `df.drop_duplicates()` | `df.unique()` [1] |
| …based on a subset of columns | `df.drop_duplicates(subset=["ref"])` | `df.unique(subset=["ref"])` |
| Drop rows with missing values | `df.dropna()` | `df.drop_nulls()` |

[1] Be aware that in Polars the order of the output from df.unique() is not in general the same as the order of the input. In addition, the default choice of which of each duplicated row to keep is any rather than first as in Pandas. I looked at the reasons for this behaviour and how you can control it in this post.
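
If you want the Pandas-like behaviour you can opt in to it, at some extra cost:

# keep="first" picks the first of each duplicated row; maintain_order=True
# preserves the input row order - both are opt-in as they need extra work
df.unique(subset=["grp"], keep="first", maintain_order=True)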

The marker for a missing value in Pandas depends on the dtype of the column (NaN, NaT or None) whereas in Polars a missing value is null for all dtypes.
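
For example, to work with the null in the z column of the sample data:

df.select(pl.col("z").is_null().alias("z_missing"))  # boolean mask per row
df.null_count()                                      # null count per column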

Grouping data and aggregation

Polars has a group_by function to group rows. The following table illustrates some common grouping and aggregation operations. The code snippets are long, so scroll horizontally to see the Polars column.

| Operation | Pandas | Polars |
| --- | --- | --- |
| Agg by groups | `df.groupby('grp')['x'].mean()` | `df.group_by('grp').agg(pl.col("x").mean())` |
| Agg multiple columns | `df.agg({'x': 'max', 'y': 'min'})` | `df.select([pl.col("x").max(), pl.col("y").min()])` |
| | `df[['x', 'y']].mean()` | `df.select(["x","y"]).mean()` |
| | `df.filter(regex="^x").mean()` | `df.select(pl.col("^x$").mean())` |
| | | `df.select(cs.starts_with("x").mean())` |
| Rename column after aggregation | `df.groupby('grp')['x'].mean().rename("x_mean")` | `df.group_by("grp").agg(pl.col("x").mean().name.suffix("_mean"))` |
| Add aggregated data as column | `df.assign(x_mean=df.groupby("grp")["x"].transform("mean"))` | `df.with_columns(pl.col("x").mean().over("grp").name.suffix("_mean"))` |
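
As a short worked example on the sample data, computing several aggregates per group in one pass (the output column names here are my own choice):

agg = df.group_by("grp").agg(
    pl.col("x").mean().alias("x_mean"),
    pl.col("y").max().alias("y_max"),
    pl.len().alias("n_rows"),  # number of rows in each group
)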

The output of aggregations in Pandas can be a Series whereas in Polars it is always a DataFrame. Where the output is a Series in Pandas there is a risk of the dtype being changed, such as ints being cast to floats.

As noted for unique above, be aware that the order of the rows in the output of group_by in Polars is random by default.
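
If you need a deterministic order you can either sort the output or opt in with maintain_order (again at some performance cost):

df.group_by("grp", maintain_order=True).agg(pl.col("x").mean())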

Check out the many other posts I’ve written on Polars!

Learn more

Want to know more about Polars for high performance data science and ML? Then check out my Up & Running with Polars course, or let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.