What is a Polars expression?

I got a good question recently from a new Polars user: What is the difference between a Series and an expression in Polars?

Well, a Series is a 1D data structure. An expression is a function that operates on a Series to produce a new Series.

So we use an expression to transform a Series.

Want to get going with Polars? Check out my Polars course here

There are many examples of expressions in Polars:

  • the simplest example is the identity expression pl.col("id"), which just returns the Series it is given
  • we can transform the contents of a Series using expressions like pl.col("id").str.to_uppercase()
  • we can do arithmetic operations on Series using expressions like pl.col("value") + 1
  • we can aggregate Series using expressions like pl.col("value").sum()
  • we can change the name of the output Series using expressions like pl.col("value").alias("double_value")
  • we can apply expressions over groups using expressions like pl.col("value").sum().over("id")

What is the Expression API?

Next question: what is the expression API that the Polars docs are always talking about?

The expression API is the collective name for two things: the methods in Polars that take expressions as arguments, and the expressions themselves.

Methods that take expressions as arguments

Examples of DataFrame methods that take expressions as arguments include:

  • df.filter(pl.col("id") == 1)
  • df.select(pl.col("id").str.to_uppercase())
  • df.with_columns(double_value=pl.col("value") * 2)
  • df.group_by(pl.col("id")).agg(pl.col("value").sum()) (this method was spelled df.groupby before Polars 0.19)

When we use these methods with expressions we have an important concept to understand: context. The context tells us what actual data will be used as the input to the expression.

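For example, when we do df.select or df.with_columns we are in the select context. This context means that the whole column of the DataFrame is the input to the expression. An expression passed to df.filter is also evaluated on whole columns, although the Polars docs describe filtering as a context of its own.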

When we do df.group_by (spelled df.groupby before Polars 0.19) we are in the group-by context. This context means that the input to the expression is the group of rows that have the same value in the grouping column.

The expressions themselves

The other component of the expression API is the expressions themselves. These are the functions that we use to transform Series and aggregate Series.

These functions are listed in the Polars API documentation. The functions are grouped into categories like:

  • Aggregation for aggregating (obviously)
  • Computation for computing (obviously)
  • Columns / names for working with column names
  • Window for applying expressions over groups

There are also categories for expressions that apply only to certain dtypes such as:

  • Strings for string manipulation and matching
  • Temporal for working with dates and times
  • Array or List for working with arrays and lists

What are the benefits of the expression API?

The expression API allows you to extract all the power of Polars.

Firstly, when we apply multiple expressions in the same context Polars can run them in parallel.

Secondly, the expression API allows you to work in lazy mode. An expression is really an instruction to the Polars query engine of what you want to do. In lazy mode you can build up a complex data processing pipeline and then Polars can apply query optimisations before it is executed.

Finally, the expression API allows you to work with larger-than-memory datasets. In lazy mode Polars can handle datasets that are too large to fit into memory by processing the data in batches rather than loading it all at once. Polars calls this streaming, and it is only available for queries built with the expression API.

If you would like more detailed support on working with Polars then I provide consulting on optimising your data processing pipelines with Polars. You can also check out my online course to get you up-and-running with Polars.

This post is licensed under CC BY 4.0 by the author.