Cheatsheet for Pandas to Polars

Most new Polars users are familiar with Pandas, so a mapping from Pandas code to Polars code might come in handy. As I show in my Polars quickstart notebook, there are a number of important differences between Polars and Pandas, including:

  • Pandas uses an index but Polars does not
  • Polars has a lazy mode but Pandas does not
  • Polars allows you to stream larger-than-memory datasets in lazy mode

I recommend reading this guide after you have covered the key concepts of Polars in the quickstart notebook.

This post was created while writing my Data Analysis with Polars course. Check it out on Udemy with a half price discount

The examples here are derived from this excellent page comparing Pandas with Julia’s DataFrames.jl.

In the following examples we compare Polars v0.15.1 with Pandas v1.5.2. I have automated testing for the snippets on this page and will endeavour to update it when things change.

We first create a sample dataset in Polars

import polars as pl
import polars.selectors as cs

import pandas as pd
import numpy as np

# Sample data: z holds one missing value so we can compare null handling later
df = pl.DataFrame(
    {
        'grp': [1, 2, 1, 2, 1, 2],
        'x': list(range(6, 0, -1)),
        'y': list(range(4, 10)),
        'z': [3, 4, 5, 6, 7, None],
        "ref" : list('abcdef')
    }
)

shape: (6, 5)
 grp  x    y    z     ref 
 ---  ---  ---  ---   --- 
 i64  i64  i64  i64   str 
 1    6    4    3     a   
 2    5    5    4     b   
 1    4    6    5     c   
 2    3    7    6     d   
 1    2    8    7     e   
 2    1    9    null  f   

Accessing data in a DataFrame

There are two ways to access data in a Polars DataFrame:

  • using square brackets with [] (often called “indexing”) and
  • using the expression API with methods like filter, select and with_columns

These square bracket and expression API approaches have different use cases. The basic rule is that you should use the expression API unless you are doing a one-off operation such as:

  • inspecting the values of some rows or columns
  • converting a DataFrame column to a Series

In these cases use the [] approach.

The expression API is more powerful than the [] approach because:

  • operations with the expression API are run in parallel and
  • operations within the expression API can be optimised in lazy mode

Accessing data using the expression API

Selecting and transforming a DataFrame
Operation                        Pandas                                             Polars
Select a subset of columns       df[["x","y"]]                                      df.select("x","y")
Select and transform columns     df[["x","y"]].astype(float)                        df.select(pl.col("x","y").cast(pl.Float64))
Add a column from a constant     df["w"] = 1                                        df.with_columns(w = pl.lit(1))
Add a column from a list         df["w"] = list(range(1,7))                         df.with_columns(pl.Series("w",list(range(1,7))))
Add a column from other columns  df["w"] = df["x"] + df["y"]                        df.with_columns(w = pl.col("x") + pl.col("y"))
Change dtype of a column         df.assign(x = lambda df: df["x"].astype("float"))  df.with_columns(pl.col("x").cast(pl.Float64))
Rename a column                  df.rename(columns={"x":"x2"})                      df.rename({"x":"x2"})
Drop columns                     df.drop(columns=["x","y"])                         df.drop(["x","y"])
Sorting                          df.sort_values("x")                                df.sort("x")
Copying                          df.copy()                                          df.clone()

Polars methods such as with_columns return a new DataFrame rather than modifying the existing one, so after any transformation we must re-assign the DataFrame variable. For example, if we add a new constant column we must do it like this:

df = df.with_columns(w = pl.lit(1))

Copying a DataFrame in Pandas is expensive as it copies the underlying data. In Polars copying a DataFrame is cheap as it just creates a new reference to the underlying data.

Filtering and selecting rows
Operation                        Pandas                                Polars
Filter rows                      df.loc[df.ref == 'c']                 df.filter(pl.col("ref") == "c")
                                 df.query("ref == 'c'")
Filter rows (text operator)      df.loc[df.ref.eq('c')]                df.filter(pl.col("ref").eq("c"))
Multiple filters                 df.loc[(df.ref == 'c') & (df.x > 1)]  df.filter((pl.col("ref") == "c") & (pl.col("x") > 1))
Multiple filters (optimised)                                           df.lazy().filter(pl.col("ref") == "c").filter(pl.col("x") > 1)
Multiple filters (OR condition)  df.loc[(df.ref == 'c') | (df.x > 1)]  df.filter((pl.col("ref") == "c") | (pl.col("x") > 1))
Select rows by row number        df.iloc[1]                            df.select(pl.all().take(1))
Select every Nth row             df.iloc[::2]                          df.select(pl.all().take_every(2))

In the Multiple filters (optimised) example for Polars we used separate filter calls chained together. The Polars query optimiser then combines these into a single filter operation.

Accessing data using []

Operation                   Pandas                 Polars
Get column as Series        df["grp"]              df["grp"]
Cell indexing by location   df.iloc[1, 1]          df[1, 1]
Row slicing by location     df.iloc[1:3]           df[1:3]
Column slicing by location  df.iloc[:, 1:]         df[:, 1:]
Row indexing by label       df.loc['c']            df.filter(pl.col("index") == "c")
Column indexing by label    df.loc[:, 'x']         df[:, "x"]
Column indexing by labels   df.loc[:, ['x', 'z']]  df[:, ['x', 'z']]
                                                   df.select(['x', 'z'])
Column slicing by label     df.loc[:, 'x':'z']     df[:, "x":"z"]
Mixed indexing              df.loc['c'][1]         df.filter(pl.col("index") == "c")[0, 1]

Note: when a query in Pandas returns a single row then that row is returned as a Series. If the row contains both floats and integers then Pandas casts the integers to floats in the Series. Polars returns a DataFrame with one row keeping the original dtypes.

Duplicates and missing values

Operation                      Pandas                Polars
Select unique rows             df.drop_duplicates()  df.unique()
Drop rows with missing values  df.dropna()           df.drop_nulls()

Be aware that in Polars the order of the output from df.unique() is not in general the same as the order of the input. In addition, the default choice of which of each duplicated row to keep is any rather than first as in Pandas. I looked at the reasons for this behaviour and how you can control it in this post.

The missing value in Pandas depends on the dtype of the column, whereas in Polars a missing value is null for all dtypes.

Grouping data and aggregation

Polars has a groupby function to group rows. The result of groupby is a GroupBy object in eager mode and a LazyGroupBy object in lazy mode. The following table illustrates some common grouping and aggregation usages.

Operation                        Pandas                                                      Polars
Aggregate by groups              df.groupby('grp')['x'].mean()                               df.groupby('grp').agg(pl.col("x").mean())
Aggregate multiple columns       df.agg({'x': max, 'y': min})                                df.select([pl.col("x").max(),pl.col("y").min()])
                                 df[['x', 'y']].mean()                                       df.select(["x","y"]).mean()
                                 df.filter(regex=("^x")).mean()                              df.select(pl.col("^x.*$").mean())
Rename column after aggregation  df.groupby('grp')['x'].mean().rename("x_mean")              df.groupby("grp").agg(pl.col("x").mean().suffix("_mean"))
Add aggregated data as column    df.assign(x_mean=df.groupby("grp")["x"].transform("mean"))  df.with_columns(pl.col("x").mean().over("grp").suffix("_mean"))

The output of aggregations in Pandas can be a Series whereas in Polars it is always a DataFrame. Where the output is a Series in Pandas there is a risk of the dtype being changed such as ints to floats.

As noted for unique above, be aware that the order of the rows in the output of groupby in Polars is not guaranteed by default. Pass maintain_order=True if you need the groups in order of first appearance.

Check out the many other posts I’ve written on Polars!

This post is licensed under CC BY 4.0 by the author.