Polars ❤️ sorted data 2: groupby
Post
Cancel

# Polars ❤️ sorted data 2: groupby

In a previous post we saw that Polars has fast-track algorithms for calculating some statistics on sorted data. In this post we see that Polars also has a fast-track algorithm for getting `groupby` keys on a sorted column.

## Performance comparison

In this example we create a `DataFrame` with an ID column that we know is sorted. When generating the `DataFrame` below we set `N = 10_000_000` to create a `DataFrame` with 10 million rows.

We also set the `cardinality` to be 10 - this means that there are 10 distinct IDs that we group on. The `cardinality` is a key variable as we see below.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 import polars as pl import numpy as np N = 10_000_000 cardinality = 10 # Create a sorted array of id integers sorted_array = np.sort(np.random.randint(0,cardinality,N)) # Create a DataFrame with a sorted ID column and a random values column df = ( pl.DataFrame( { "id":[i for i in sorted_array], "values":np.random.standard_normal(N) } ) ) ```

For this comparison we `groupby` on the `id` column and take the mean of the `values` column.

```1 2 3 4 5 6 7 ( df .groupby("id") .agg( pl.col("values").mean() ) ) ```

At this point Polars doesn’t know that the `id` column is sorted and so won’t use the fast-track algorithm. If we do the groupby-aggregation above using the standard non-sorted algorithm it takes 70 ms.

As we saw in the previous post we tell Polars that the `id` column is sorted using the `set_sorted` expression.

```1 2 3 4 5 6 df = ( df .with_column( pl.col("id").set_sorted() ) ) ```

If we run the groupby-aggregation again on the column that Polars knows is sorted it takes 14 ms - that is 5 times faster that the standard algorithm.

## Cardinality is king

So we see that groupby-aggregations on sorted columns are much faster when Polars knows the column is sorted, right? Well, it’s not quite that simple…

In the example above we set `cardinality = 10` so in the 10 millions rows there are 10 distinct group IDs.

If we create a new `DataFrame` with `cardinality = 100_000` we find that the normal algorithm takes 90 ms and the fast-track sorted algorithm takes 60 ms. In this case the fast-track algorithm is just 1.5 times faster.

If we set `cardinality = 1_000_000` we find the two approaches take a similar amount of time (or the normal approach is slightly faster).

The reason for this convergence is that the fast-track algorithm doesn’t look through the `id` column element-by-element but instead tries to find chunks of the `id` column with the same value at the start and end. When the algorithm finds these chunks it can assume the whole chunk has the same value because it knows the column is sorted.

When cardinality is low there are many such uniform chunks and the fast-track algorithm can speed through the `id` column. As cardinality increases the length of the uniform chunks decreases until eventually the fast-track algorithm is essentially doing an element-by-element search. In this case the fast-track algorithm is still trying to take jumps through the column before doing element-by-element search so the sorted approach may even be slower when the cardinality is high.

With this caveat in mind this is still a powerful optimisation that you can apply if you know your data is sorted and the cardinality isn’t too high. I also think that there are further optimisations that can be applied to speed things up when cardinality is higher.

Found this useful? Get in touch if you would like to discuss how your pipelines can take full advantage of Polars speed and scalability.

Or check out my Data Analysis with Polars course with a half price discount

## Next steps

Want to know more about Polars for high performance data science and ML? Then you can: