
Cheatsheet for Pandas to Polars

Most new Polars users are familiar with Pandas, so a mapping from Pandas code to Polars code might come in handy. As I show in my Polars quickstart notebook, there are a number of important differences between Polars and Pandas, including:

  • Pandas uses an index but Polars does not
  • Polars has a lazy mode but Pandas does not
  • Polars allows you to stream larger-than-memory datasets in lazy mode

I recommend reading this guide after you have covered the key concepts of Polars in the quickstart notebook.

This post was created while writing my Data Analysis with Polars course. Check it out on Udemy with a half price discount

The examples here are derived from this excellent comparison page from Pandas to Julia’s DataFrames.jl.

In the following examples we compare Polars v0.15.1 with Pandas v1.5.2. I have automated testing for the snippets on this page and will endeavour to update it when things change.

We first create a sample dataset in Polars:

import polars as pl
import pandas as pd
import numpy as np

df = pl.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': list(range(6, 0, -1)),
                   'y': list(range(4, 10)),
                   'z': [3, 4, 5, 6, 7, None],
                   'index': list('abcdef')})

shape: (6, 5)
┌─────┬─────┬─────┬──────┬───────┐
│ grp ┆ x   ┆ y   ┆ z    ┆ index │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64  ┆ str   │
╞═════╪═════╪═════╪══════╪═══════╡
│ 1   ┆ 6   ┆ 4   ┆ 3    ┆ a     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 5   ┆ 5   ┆ 4    ┆ b     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ 4   ┆ 6   ┆ 5    ┆ c     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ 7   ┆ 6    ┆ d     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ 2   ┆ 8   ┆ 7    ┆ e     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 1   ┆ 9   ┆ null ┆ f     │
└─────┴─────┴─────┴──────┴───────┘

Polars does not have an index. In the examples below, the Pandas DataFrame uses the index column as its index.
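For reference, here is a minimal sketch of how the Pandas side of the comparison could be set up. The variable name df_pandas is my own choice for illustration; the tables below simply call both DataFrames df.

```python
import pandas as pd

# Same data as the Polars example, with the "index" column promoted to the Pandas index
df_pandas = pd.DataFrame(
    {
        "grp": [1, 2, 1, 2, 1, 2],
        "x": list(range(6, 0, -1)),
        "y": list(range(4, 10)),
        "z": [3, 4, 5, 6, 7, None],
        "index": list("abcdef"),
    }
).set_index("index")

# Label-based lookups such as df_pandas.loc["c"] now work against the "index" values
print(df_pandas.loc["c"])
```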

Accessing data

| Operation | Pandas | Polars |
|---|---|---|
| Cell indexing by location | df.iloc[1, 1] | df[1, 1] |
| Row slicing by location | df.iloc[1:3] | df[1:3] |
| Column slicing by location | df.iloc[:, 1:] | df[:, 1:] |
| Row indexing by label | df.loc['c'] | df.filter(pl.col("index") == "c") |
| Column indexing by label | df.loc[:, 'x'] | df[:, "x"] |
| | | df.select("x") |
| Column indexing by labels | df.loc[:, ['x', 'z']] | df[:, ['x', 'z']] |
| | | df.select(['x', 'z']) |
| Column slicing by label | df.loc[:, 'x':'z'] | df[:, "x":"z"] |
| Mixed indexing | df.loc['c'][1] | df.filter(pl.col("index") == "c")[0, 1] |

In some of these examples for Polars there is a method using [] and a method with the Expression API using filter or select. In Polars it is recommended to use the Expression API because:

  • the Expression API can be used in lazy mode
  • expressions can be optimised with the built-in query optimiser
  • multiple expressions are run in parallel

Note: when indexing in Pandas returns a single row, that row is returned as a Series. If the row contains both floats and integers, Pandas casts the integers to floats in the Series. Polars returns a DataFrame with one row, keeping the original dtypes.

Common operations

| Operation | Pandas | Polars |
|---|---|---|
| Reduce multiple values | df['z'].mean() | df['z'].mean() |
| | df[['z']].agg(['mean']) | df.select(pl.col("z").mean()) |
| Add new column | df.assign(z1 = df['z'] + 1) | df.with_column((pl.col("z") + 1).alias("z1")) |
| Rename columns | df.rename(columns = {'x': 'x_new'}) | df.rename({"x": "x_new"}) |
| Drop columns | df.drop(columns = ['x','y']) | df.drop(['x','y']) |
| Sort rows | df.sort_values(by = 'x') | df.sort("x") |
| Drop missing rows | df.dropna() | df.drop_nulls() |
| Select unique rows | df.drop_duplicates() | df.unique() |

The missing value in Pandas depends on the dtype of the column, whereas in Polars a missing value is null for all dtypes.

In Pandas you can add a new column by assigning the column:

df["z1"] = df["z"] + 1

However, in Polars you always add a new column using with_column:

df = df.with_column((pl.col("z") + 1).alias("z1"))

And you add multiple new columns using with_columns:

df = df.with_columns(
    [
        (pl.col("x") + 1).alias("x1"),
        (pl.col("z") + 1).alias("z1"),
    ]
)

Grouping data and aggregation

Polars has a groupby function to group rows. The result of groupby is a GroupBy object in eager mode and a LazyGroupBy object in lazy mode. The following table illustrates some common grouping and aggregation usages.

| Operation | Pandas | Polars |
|---|---|---|
| Aggregate by groups | df.groupby('grp')['x'].mean() | df.groupby('grp').agg(pl.col("x").mean()) |
| Aggregate multiple columns | df.agg({'x': max, 'y': min}) | df.select([pl.col("x").max(),pl.col("y").min()]) |
| | df[['x', 'y']].mean() | df.select(["x","y"]).mean() |
| | df.filter(regex=("^x")).mean() | df.select(pl.col("^x$").mean()) |
| Rename column after aggregation | df.groupby('grp')['x'].mean().rename("x_mean") | df.groupby("grp").agg(pl.col("x").mean().suffix("_mean")) |
| Add aggregated data as column | df.assign(x_mean=df.groupby("grp")["x"].transform("mean")) | df.with_column(pl.col("x").mean().over("grp").suffix("_mean")) |

The output of aggregations in Pandas can be a Series whereas in Polars it is always a DataFrame. Where the output is a Series in Pandas there is a risk of the dtype being changed such as ints to floats.

Follow me on twitter/linkedin for updates on this post.

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.