Most new Polars users are familiar with Pandas, so a mapping from Pandas code to Polars code might come in handy. As I show in my Polars quickstart notebook, there are a number of important differences between Polars and Pandas, including:
- Pandas uses an index but Polars does not
- Polars has a lazy mode but Pandas does not
- Polars allows you to stream larger-than-memory datasets in lazy mode
I recommend reading this guide after you have covered the key concepts of Polars in the quickstart notebook.
This post was created while writing my Data Analysis with Polars course. Check it out on Udemy with a half-price discount.
The examples here are derived from this excellent comparison page from Pandas to Julia’s DataFrames.jl.
In the following examples we compare Polars v0.15.1 with Pandas v1.5.2. I have automated testing for the snippets on this page and will endeavour to update it when things change.
We first create a sample dataset in Polars:
import polars as pl
import pandas as pd
import numpy as np
df = pl.DataFrame(
    {
        "grp": [1, 2, 1, 2, 1, 2],
        "x": list(range(6, 0, -1)),
        "y": list(range(4, 10)),
        "z": [3, 4, 5, 6, 7, None],
        "index": list("abcdef"),
    }
)
shape: (6, 5)
┌─────┬─────┬─────┬──────┬───────┐
│ grp ┆ x ┆ y ┆ z ┆ index │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪══════╪═══════╡
│ 1 ┆ 6 ┆ 4 ┆ 3 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 5 ┆ 5 ┆ 4 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 6 ┆ 5 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 7 ┆ 6 ┆ d │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 8 ┆ 7 ┆ e │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 9 ┆ null ┆ f │
└─────┴─────┴─────┴──────┴───────┘
Polars does not have an index. In the examples below, the Pandas DataFrame uses the index column as its index.
Accessing data
Operation | Pandas | Polars |
---|---|---|
Cell indexing by location | df.iloc[1, 1] | df[1, 1] |
Row slicing by location | df.iloc[1:3] | df[1:3] |
Column slicing by location | df.iloc[:, 1:] | df[:, 1:] |
Row indexing by label | df.loc['c'] | df.filter(pl.col("index") == "c") |
Column indexing by label | df.loc[:, 'x'] | df[:, "x"] |
| | | df.select("x") |
Column indexing by labels | df.loc[:, ['x', 'z']] | df[:, ['x', 'z']] |
| | | df.select(['x', 'z']) |
Column slicing by label | df.loc[:, 'x':'z'] | df[:, "x":"z"] |
Mixed indexing | df.loc['c'][1] | df.filter(pl.col("index") == "c")[0, 1] |
In some of these examples for Polars there is a method using [] and a method with the Expression API using filter or select. In Polars it is recommended to use the Expression API because:
- the Expression API can be used in lazy mode
- expressions can be optimised with the built-in query optimiser
- multiple expressions are run in parallel
Note: when indexing a Pandas DataFrame returns a single row, that row is returned as a Series. If the row contains both floats and integers, Pandas casts the integers to floats in the Series. Polars returns a DataFrame with one row, keeping the original dtypes.
Common operations
Operation | Pandas | Polars |
---|---|---|
Reduce multiple values | df['z'].mean() | df['z'].mean() |
| | df[['z']].agg(['mean']) | df.select(pl.col("z").mean()) |
Add new column | df.assign(z1 = df['z'] + 1) | df.with_column((pl.col("z") + 1).alias("z1")) |
Rename columns | df.rename(columns = {'x': 'x_new'}) | df.rename({"x": "x_new"}) |
Drop columns | df.drop(columns = ['x','y']) | df.drop(['x','y']) |
Sort rows | df.sort_values(by = 'x') | df.sort("x") |
Drop missing rows | df.dropna() | df.drop_nulls() |
Select unique rows | df.drop_duplicates() | df.unique() |
The missing value in Pandas depends on the dtype of the column, whereas in Polars a missing value is null for all dtypes.
In Pandas you can add a new column by assigning the column:
df["z1"] = df["z"] + 1
However, in Polars you always add a new column using with_column:
df = df.with_column((pl.col("z") + 1).alias("z1"))
And you add multiple new columns using with_columns:
df = df.with_columns(
    [
        (pl.col("x") + 1).alias("x1"),
        (pl.col("z") + 1).alias("z1"),
    ]
)
Grouping data and aggregation
Polars has a groupby function to group rows. The result of groupby is a GroupBy object in eager mode and a LazyGroupBy object in lazy mode. The following table illustrates some common grouping and aggregation usages.
Operation | Pandas | Polars |
---|---|---|
Aggregate by groups | df.groupby('grp')['x'].mean() | df.groupby('grp').agg(pl.col("x").mean()) |
Aggregate multiple columns | df.agg({'x': max, 'y': min}) | df.select([pl.col("x").max(),pl.col("y").min()]) |
| | df[['x', 'y']].mean() | df.select(["x","y"]).mean() |
| | df.filter(regex=("^x")).mean() | df.select(pl.col("^x.*$").mean()) |
Rename column after aggregation | df.groupby('grp')['x'].mean().rename("x_mean") | df.groupby("grp").agg(pl.col("x").mean().suffix("_mean")) |
Add aggregated data as column | df.assign(x_mean=df.groupby("grp")["x"].transform("mean")) | df.with_column(pl.col("x").mean().over("grp").suffix("_mean")) |
The output of aggregations in Pandas can be a Series, whereas in Polars it is always a DataFrame. Where the output is a Series in Pandas, there is a risk of the dtype being changed, such as ints to floats.