Most new Polars users are familiar with Pandas, so a mapping from Pandas code to Polars code might come in handy. As I show in my Polars quickstart notebook, there are a number of important differences between Polars and Pandas, including:
- Pandas uses an index but Polars does not
- Polars has a lazy mode but Pandas does not
- Polars allows you to stream larger than memory datasets in lazy mode
I recommend reading this guide after you have covered the key concepts of Polars in the quickstart notebook.
This post was created while writing my Data Analysis with Polars course. Check it out on Udemy with a half-price discount.
The examples here are derived from this excellent comparison page between Pandas and Julia's DataFrames.jl.
In the following examples we compare Polars v0.15.1 with Pandas v1.5.2. I have automated testing for the snippets on this page and will endeavour to update it when things change.
We first create a sample dataset in Polars:
import polars as pl
import polars.selectors as cs
import pandas as pd
import numpy as np

df = pl.DataFrame(
    {
        'grp': [1, 2, 1, 2, 1, 2],
        'x': list(range(6, 0, -1)),
        'y': list(range(4, 10)),
        'z': [3, 4, 5, 6, 7, None],
        "ref": list('abcdef')
    }
)
shape: (6, 5)
┌─────┬─────┬─────┬──────┬─────┐
│ grp ┆ x ┆ y ┆ z ┆ ref │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪══════╪═════╡
│ 1 ┆ 6 ┆ 4 ┆ 3 ┆ a │
│ 2 ┆ 5 ┆ 5 ┆ 4 ┆ b │
│ 1 ┆ 4 ┆ 6 ┆ 5 ┆ c │
│ 2 ┆ 3 ┆ 7 ┆ 6 ┆ d │
│ 1 ┆ 2 ┆ 8 ┆ 7 ┆ e │
│ 2 ┆ 1 ┆ 9 ┆ null ┆ f │
└─────┴─────┴─────┴──────┴─────┘
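The Pandas snippets in the tables below assume an equivalent Pandas DataFrame (referred to simply as df in the Pandas column). A minimal sketch of how it might be built; the ref-based index is my assumption, added so that the label-based .loc examples further down work, and np.nan stands in for the Polars null:

# hypothetical Pandas counterpart of the Polars DataFrame above
df_pandas = pd.DataFrame(
    {
        "grp": [1, 2, 1, 2, 1, 2],
        "x": list(range(6, 0, -1)),
        "y": list(range(4, 10)),
        "z": [3, 4, 5, 6, 7, np.nan],  # np.nan plays the role of the Polars null
        "ref": list("abcdef"),
    },
    index=list("abcdef"),  # assumed index so df.loc['c'] style lookups work
)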
Accessing data in a DataFrame
There are two ways to access data in a Polars DataFrame:
- using square brackets with [] (also called "indexing"), and
- using the expression API with methods like filter, select and with_columns
These square bracket and expression API approaches have different use cases. The basic rule is that you should use the expression API unless you are doing a one-off operation such as:
- inspecting the values of some rows or columns
- converting a DataFrame column to a Series

In these cases use the [] approach.
The expression API is more powerful than the [] approach because:
- operations with the expression API are run in parallel and
- operations within the expression API can be optimised in lazy mode
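As a rough sketch of the difference (the variable names here are just for illustration):

# one-off inspection with square brackets
first_rows = df[:2]        # first two rows
ref_series = df["ref"]     # a single column as a Series

# the same kind of selection with the expression API, which can run in
# parallel and be optimised in lazy mode
subset = df.select("x", "y")
filtered = df.filter(pl.col("ref") == "c")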
Accessing data using the expression API
Selecting and transforming a DataFrame
Operation | Pandas | Polars |
---|---|---|
Select a subset of columns | df[["x","y"]] | df.select("x","y") |
Select and transform columns | df[["x","y"]].astype(float) | df.select(pl.col("x","y").cast(pl.Float64)) |
Add a column from a constant | df["w"] = 1 | df.with_columns(w = pl.lit(1)) |
Add a column from a list | df["w"] = list(range(1,7)) | df.with_columns(pl.Series("w",list(range(1,7)))) |
Add a column from other columns | df["w"] = df["x"] + df["y"] | df.with_columns(w = pl.col("x") + pl.col("y")) |
Change dtype of a column | df.assign(x = lambda df: df["x"].astype("float")) | df.with_columns(pl.col("x").cast(pl.Float64)) |
Rename a column | df.rename(columns={"x":"x2"}) | df.rename({"x":"x2"}) |
Drop columns | df.drop(columns=["x","y"]) | df.drop(["x","y"]) or df.drop("x","y") |
Sorting | df.sort_values("x") | df.sort("x") |
Copying | df.copy() | df.clone() |
Polars does not support in-place operations so when we do any transformations we must re-assign the DataFrame
variable. For example if we add a new constant column we must do it like this:
df = df.with_columns(w = pl.lit(1))
Copying a DataFrame in Pandas is expensive as it copies the underlying data. In Polars copying a DataFrame is cheap as it just creates a new reference to the underlying data.
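As a quick sketch, a clone followed by a transformation leaves the original DataFrame untouched:

df_copy = df.clone()                         # cheap: no data is copied
df_copy = df_copy.with_columns(w=pl.lit(1))  # only df_copy gets the new column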
Filtering and selecting rows
Operation | Pandas | Polars |
---|---|---|
Filter rows | df.loc[df.ref == 'c'] or df.query("ref == 'c'") | df.filter(pl.col("ref") == "c") |
Filter rows (text operator) | df.loc[df.ref.eq('c')] | df.filter(pl.col("ref").eq("c")) |
Multiple filters | df.loc[(df.ref == 'c') & (df.x > 1)] | df.filter((pl.col("ref") == "c") & (pl.col("x") > 1)) |
Multiple filters (optimised) | | df.lazy().filter(pl.col("ref") == "c").filter(pl.col("x") > 1) |
Multiple filters (OR condition) | df.loc[(df.ref == 'c') \| (df.x > 1)] | df.filter((pl.col("ref") == "c") \| (pl.col("x") > 1)) |
Select rows by row number | df.iloc[1] | df.select(pl.all().take(1)) |
Select every Nth row | df.iloc[::2] | df.select(pl.all().take_every(2)) |
In the Multiple filters (optimised) example for Polars we used separate filter calls chained together. The Polars query optimiser then combines these into a single filter operation.
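A sketch of what that looks like end to end; explain prints the optimised plan in recent Polars versions, and collect runs the query:

# chain separate filters in lazy mode; the optimiser merges them
lazy_query = (
    df.lazy()
    .filter(pl.col("ref") == "c")
    .filter(pl.col("x") > 1)
)
print(lazy_query.explain())    # the two filters appear as one combined predicate
result = lazy_query.collect()  # execute the optimised query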
Accessing data using []
Operation | Pandas | Polars |
---|---|---|
Get column as Series | df["grp"] | df["grp"] |
Cell indexing by location | df.iloc[1, 1] | df[1, 1] |
Row slicing by location | df.iloc[1:3] | df[1:3] |
Column slicing by location | df.iloc[:, 1:] | df[:, 1:] |
Row indexing by label | df.loc['c'] | df.filter(pl.col("ref") == "c") |
Column indexing by label | df.loc[:, 'x'] | df[:, "x"] or df.select("x") |
Column indexing by labels | df.loc[:, ['x', 'z']] | df[:, ['x', 'z']] or df.select(['x', 'z']) |
Column slicing by label | df.loc[:, 'x':'z'] | df[:, "x":"z"] |
Mixed indexing | df.loc['c'][1] | df.filter(pl.col("ref") == "c")[0, 1] |
Note: when a query in Pandas returns a single row then that row is returned as a Series. If the row contains both floats and integers then Pandas casts the integers to floats in the Series. Polars returns a DataFrame with one row keeping the original dtypes.
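A small sketch of that difference using the hypothetical df_pandas defined earlier; only the int and float columns are selected so the upcasting is visible:

# Pandas: a single row from int and float columns comes back as a float64 Series
row_pandas = df_pandas[["x", "z"]].iloc[1]

# Polars: the same row is a one-row DataFrame and each column keeps its dtype
row_polars = df[["x", "z"]][1]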
Duplicates and missing values
Operation | Pandas | Polars |
---|---|---|
Select unique rows | df.drop_duplicates() | df.unique() |
Drop rows with missing values | df.dropna() | df.drop_nulls() |
Be aware that in Polars the order of the output from df.unique() is not in general the same as the order of the input. In addition, the default choice of which of each duplicated row to keep is any rather than first as in Pandas. I looked at the reasons for this behaviour and how you can control it in this post.
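If the input order and the kept row matter they can be requested explicitly; a sketch, assuming a Polars version where unique accepts the keep and maintain_order arguments:

# keep the first occurrence of each duplicated row and preserve the input order
df_unique = df.unique(keep="first", maintain_order=True)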
The missing value in Pandas depends on the dtype of the column whereas in Polars a missing value is null for all dtypes.
Grouping data and aggregation
Polars has a groupby function to group rows. The result of groupby is a GroupBy object in eager mode and a LazyGroupBy object in lazy mode. The following table illustrates some common grouping and aggregation usages. The code snippets are long so scroll horizontally to see the Polars column.
Operation | Pandas | Polars |
---|---|---|
Aggregate by groups | df.groupby('grp')['x'].mean() | df.groupby('grp').agg(pl.col("x").mean()) |
Aggregate multiple columns | df.agg({'x': max, 'y': min}) | df.select([pl.col("x").max(), pl.col("y").min()]) |
 | df[['x', 'y']].mean() | df.select(["x","y"]).mean() |
 | df.filter(regex="^x").mean() | df.select(pl.col("^x$").mean()) or df.select(cs.starts_with("x").mean()) |
Rename column after aggregation | df.groupby('grp')['x'].mean().rename("x_mean") | df.groupby("grp").agg(pl.col("x").mean().suffix("_mean")) |
Add aggregated data as column | df.assign(x_mean=df.groupby("grp")["x"].transform("mean")) | df.with_columns(pl.col("x").mean().over("grp").suffix("_mean")) |
The output of aggregations in Pandas can be a Series whereas in Polars it is always a DataFrame. Where the output is a Series in Pandas there is a risk of the dtype being changed, such as ints to floats.
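The last row of the table uses a window function: over broadcasts the group mean back onto every row, so the DataFrame keeps its original height. A short sketch, with _mean as an illustrative suffix:

# per-group mean of "x" added as a new column, without collapsing the rows
df_windowed = df.with_columns(
    pl.col("x").mean().over("grp").suffix("_mean")
)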
As noted for unique above, be aware that the order of the rows in the output of groupby in Polars is random by default.
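If a reproducible order is needed, groupby can be asked to maintain the order in which the groups first appear; a sketch, assuming the maintain_order argument is available in your Polars version:

# maintain_order trades some parallelism for a deterministic group order
df_grouped = df.groupby("grp", maintain_order=True).agg(pl.col("x").mean())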
Check out the many other posts I’ve written on Polars!
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- join my Polars course on Udemy
- follow me on Twitter
- connect with me on LinkedIn
- check out my YouTube videos
or let me know if you would like a Polars workshop for your organisation.