Getting started with Polars? This post shows you how to convert some familar Pandas commands to Polars. But it also tries to go beyond that to introduce you to some of the more fundamental differences between Pandas and Polars.
Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course
In the following examples we compare Polars v0.20.10 with Pandas v2.2.0. I have automated testing for the snippets on this page and will endeavour to update it when things change.
We first create a sample dataset in Polars
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import polars as pl
import polars.selectors as cs
import pandas as pd
import numpy as np
df = pl.DataFrame(
{
'grp': [1, 2, 1, 2, 1, 2],
'x': list(range(6, 0, -1)),
'y': list(range(4, 10)),
'z': [3, 4, 5, 6, 7, None],
"ref" : list('abcdef')
}
)
shape: (6, 5)
┌─────┬─────┬─────┬──────┬─────┐
│ grp ┆ x ┆ y ┆ z ┆ ref │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪══════╪═════╡
│ 1 ┆ 6 ┆ 4 ┆ 3 ┆ a │
│ 2 ┆ 5 ┆ 5 ┆ 4 ┆ b │
│ 1 ┆ 4 ┆ 6 ┆ 5 ┆ c │
│ 2 ┆ 3 ┆ 7 ┆ 6 ┆ d │
│ 1 ┆ 2 ┆ 8 ┆ 7 ┆ e │
│ 2 ┆ 1 ┆ 9 ┆ null ┆ f │
└─────┴─────┴─────┴──────┴─────┘
For more posts like this check out: -Pandas to Polars time series differences
Accessing data in a DataFrame
There are two ways to access data in a Polars DataFrame
:
- using square brackets with
[]
(other called “indexing”) and - using the expression API with methods like
filter
,select
andwith_columns
These square bracket and expression API approaches have different use cases. The basic rule is that you should use the expression API unless you are doing a one-off operation such as:
- inspecting the values of some rows or columns
- converting a
DataFrame
column to aSeries
In these cases use the []
approach.
The expression API is more powerful than the []
approach because:
- operations with the expression API are run in parallel and
- operations within the expression API can be optimised in lazy mode
Accessing data using the expression API
Selecting and transforming a DataFrame
Operation | Pandas | Polars |
---|---|---|
Select a subset of columns | df[["x","y"]] |
df.select("x","y") |
Select and transform columns | df[["x","y"]].astype(float) |
df.select(pl.col("x","y").cast(pl.Float64)) |
Add a column from a constant | df["w"] = 1 |
df.with_columns(w = pl.lit(1)) |
df.with_columns(pl.lit(1).alias("w")) |
||
Add a column from a list | df["w"] = list(range(1,7)) |
df.with_columns(pl.Series("w",list(range(1,7)))) |
Add a column from other columns | df["w"] = df["x"] + df["y"] |
df.with_columns(w = pl.col("x") + pl.col("y")) |
Change dtype of a column | df.assign(x = lambda df: df["x"].astype("float")) |
df.with_columns(pl.col("x").cast(pl.Float64)) |
Rename a column | df.rename(columns={"x":"x2"}) |
df.rename({"x":"x2"}) |
Drop columns | df.drop(columns=["x","y"]) |
df.drop(["x","y"]) |
df.drop("x","y") |
||
Sorting | df.sort_values("x") |
df.sort("x") |
Copying | df.copy() |
df.clone() |
In these examples we’ve typed out the column names explicitly. Polars also has lots of ways to select multiple columns based on their dtype or patterns in their names.
Select | Polars |
---|---|
All columns | df.select(pl.all()) |
All integer columns | df.select(pl.col(pl.INTEGER_DTYPES)) |
… except ‘x’ | df.select(pl.col(pl.INTEGER_DTYPES).exclude('x')) |
All integer columns | df.select(cs.integer()) |
By regex | df.select(cs.matches('x\|y')) |
By name pattern | df.select(cs.starts_with('r')) |
Polars does not support in-place operations so when we do any transformations we re-assign the DataFrame
variable. For example if we add a new constant column we must do it like this:
1
df = df.with_columns(w = pl.lit(1))
Creating a new DataFrame
like this is cheap in Polars as it does not copy the underlying data, Polars just creates a new reference to the underlying data.
Filtering rows
Operation | Pandas | Polars | |
---|---|---|---|
Filter rows | df.loc[df.ref == 'c'] |
df.filter(pl.col("ref") == "c") |
|
df.query("ref == 'c'") |
df.filter(ref = "c") [1] |
||
Filter rows (text operator) | df.loc[df.ref.eq('c')] |
df.filter(pl.col("ref").eq("c")) |
|
Multiple filters | df.loc[(df.ref == 'c') & (df.x > 1)] |
df.filter((pl.col("ref") == "c") & pl.col("x") > 1)) |
|
df.filter(pl.col("ref") == "c", pl.col("x") > 1) [2] |
|||
Multiple filters (optimised) | df.lazy().filter(pl.col("ref") == "c").filter(pl.col("x") > 1) [3] |
||
Multiple filters (OR condition) | df.loc[(df.ref == 'c') ` (df.x > 1)]` |
df.filter((pl.col("ref") == "c") |
` pl.col(“x”) > 1))` |
Is in condition | df.loc[df.ref.isin(['a','b']) ] |
` df.filter(pl.col(“ref”).is_in([‘a’,’b’]))` | |
Is between condition | df.loc[df.ref.between(2,4) ] |
` df.filter(pl.col(“x”).is_between(2,4))` |
[1] In the second row of the Polars column we did a filter without using pl.col
. This is because Polars has a special syntax for filtering by column name. This approach takes advantage of Python keyword arguments to a function to treat setting a keyword argument - like ref = "c"
- as a filter condition. This approach only works for equality conditions.
[2] Here we apply multiple filter conditions without the annoying brackets and &. Instead we pass a comma separated list of conditions to the filter
method.
[3] In the optimised example for Polars we used separate filter
calls. However, as we are in lazy mode the Polars query optimiser combines these into a single filter condition applied on a single pass through the data.
There’s a lot more to filtering in Polars. For example, if you do a filter late in a query that could be done earlier (and so reduce the amount of data processed) then the Polars query optimiser moves the filter earlier in the query. Even better, if you apply a filter on Parquet data stored in the clouds Polars tries to apply those filters in the cloud storage layer before the data is transferred across the network. Check out my course for more on this.
Accessing data using []
Operation | Pandas | Polars |
---|---|---|
Get column as Series | df["grp"] |
df["grp"] |
Cell indexing by location | df.iloc[1, 1] |
df[1, 1] |
Row slicing by location | df.iloc[1:3] |
df[1:3] |
Column slicing by location | df.iloc[:, 1:] |
df[:, 1:] |
Row indexing by label | df.loc['c'] |
df.filter(pl.col("index") == "c") |
Column indexing by label | df.loc[:, 'x'] |
df[:, "x"] |
df.select("x") |
||
Column indexing by labels | df.loc[:, ['x', 'z']] |
df[:, ['x', 'z']] |
df.select(['x', 'z']) |
||
Column slicing by label | df.loc[:, 'x':'z'] |
df[:, "x":"z"] |
Mixed indexing | df.loc['c'][1] |
df.filter(pl.col("index") == "c")[0, 1] |
Note: when a query in Pandas returns a single row then that row is returned as a Series. If the row contains both floats and integers then Pandas casts the integers to floats in the Series. Polars returns a DataFrame with one row keeping the original dtypes.
Duplicates and missing values
Operation | Pandas | Polars |
---|---|---|
Keep unique rows | df.drop_duplicates() |
df.unique() [1] |
…based a subset of columns | df.drop_duplicates(subset=["ref"]) |
df.unique(subset=["ref"]) |
Drop rows with missing values | df.dropna() |
df.drop_nulls() |
[1] Be aware that in Polars the order of the output from df.unique()
is not in general the same as the order of the input. In addition, the default choice of which of each duplicated row to keep is any
rather first
as in Pandas. I looked at the reasons for this behaviour and how you can control it in this post.
The missing value in Pandas depends on dtype of the column whereas in Polars a missing value is null
for all dtypes.
Grouping data and aggregation
Polars has a group_by
function to group rows. The following table illustrates some common grouping and aggregation usages. The code snippets are long so scroll horizontally to see Polars.
Operation | Pandas | Polars |
---|---|---|
Agg by groups | df.groupby('grp')['x'].mean() |
df.group_by('grp').agg(pl.col("x").mean() |
Agg multiple columns | df.agg({'x': 'max', 'y': 'min'}) |
df.select([pl.col("x").max(),pl.col("y").min()]) |
df[['x', 'y']].mean() |
df.select(["x","y"]).mean() |
|
df.filter(regex=("^x")).mean() |
df.select(pl.col("^x$").mean() |
|
df.select(cs.starts_with("x").mean() |
||
Rename column after aggregation | df.groupby('grp')['x'].mean().rename("x_mean") |
df.group_by("grp").agg(pl.col("x").mean().name.suffix("_mean")) |
Add aggregated data as column | df.assign(x_mean=df.groupby("grp")["x"].transform("mean")) |
df_polars.with_column(pl.col("x").mean().over("grp").name.suffix("_mean")) |
The output of aggregations in Pandas can be a Series
whereas in Polars it is always a DataFrame
. Where the output is a Series
in Pandas there is a risk of the dtype being changed such as ints to floats.
As noted for unique
above be aware that the order of the rows in the output of groupby
in Polars is random by default.
Check out the many other posts I’ve written on Polars!
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- join my Polars course on Udemy
- follow me on twitter
- connect with me at linkedin
- check out my youtube videos
or let me know if you would like a Polars workshop for your organisation.