In a previous post we saw that Polars has fast-track algorithms for calculating some statistics on sorted data. In this post we see that Polars also has a fast-track algorithm for getting `groupby` keys on a sorted column.

## Performance comparison

In this example we create a `DataFrame` with an ID column that we know is sorted. When generating the `DataFrame` below we set `N = 10_000_000` to create a `DataFrame` with 10 million rows.

We also set the `cardinality` to 10 - this means that there are 10 distinct IDs that we group on. The `cardinality` is a key variable as we see below.


```python
import polars as pl
import numpy as np

N = 10_000_000
cardinality = 10
# Create a sorted array of id integers
sorted_array = np.sort(np.random.randint(0, cardinality, N))
# Create a DataFrame with a sorted ID column and a random values column
df = pl.DataFrame(
    {
        "id": [i for i in sorted_array],
        "values": np.random.standard_normal(N),
    }
)
```

For this comparison we `groupby` on the `id` column and take the mean of the `values` column.


```python
(
    df
    .groupby("id")
    .agg(
        pl.col("values").mean()
    )
)
```

At this point Polars doesn’t know that the `id` column is sorted and so won’t use the fast-track algorithm. If we do the groupby-aggregation above using the standard non-sorted algorithm it takes 70 ms.

As we saw in the previous post, we tell Polars that the `id` column is sorted using the `set_sorted` expression.


```python
df = (
    df
    .with_column(
        pl.col("id").set_sorted()
    )
)
```

If we run the groupby-aggregation again on the column that Polars now knows is sorted it takes 14 ms - that is 5 times faster than the standard algorithm.

## Cardinality is king

So we see that groupby-aggregations on sorted columns are much faster when Polars knows the column is sorted, right? Well, it’s not quite that simple…

In the example above we set `cardinality = 10`, so in the 10 million rows there are 10 distinct group IDs.

If we create a new `DataFrame` with `cardinality = 100_000` we find that the normal algorithm takes 90 ms and the fast-track sorted algorithm takes 60 ms. In this case the fast-track algorithm is just 1.5 times faster.

If we set `cardinality = 1_000_000` we find the two approaches take a similar amount of time (or the normal approach is slightly faster).

The reason for this convergence is that the fast-track algorithm doesn’t look through the `id` column element-by-element but instead tries to find chunks of the `id` column with the same value at the start and end. When the algorithm finds such a chunk it can assume the whole chunk has the same value because it knows the column is sorted.
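The chunk-finding idea can be illustrated outside Polars with NumPy: on a sorted array, `np.searchsorted` locates each group's boundaries by binary search, so the work scales with the number of distinct keys rather than the number of rows. This is an illustrative sketch of the principle, not Polars' actual implementation.

```python
import numpy as np

ids = np.sort(np.random.randint(0, 10, 1_000_000))

# On a sorted array, each group occupies one contiguous chunk whose
# boundaries can be found with binary search - O(k log n) for k keys
keys = np.unique(ids)  # for illustration; Polars discovers keys as it scans
starts = np.searchsorted(ids, keys, side="left")
ends = np.searchsorted(ids, keys, side="right")

for key, lo, hi in zip(keys, starts, ends):
    # Every element in ids[lo:hi] equals key - no per-element check is needed
    assert ids[lo] == ids[hi - 1] == key

print(list(zip(keys, ends - starts))[:3])  # first few (key, chunk length) pairs
```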

When cardinality is low there are many such uniform chunks and the fast-track algorithm can speed through the `id` column. As cardinality increases, the length of the uniform chunks decreases until eventually the fast-track algorithm is essentially doing an element-by-element search. Because the fast-track algorithm still tries to take jumps through the column before falling back to element-by-element search, the sorted approach may even be slower when the cardinality is high.

With this caveat in mind this is still a powerful optimisation that you can apply if you know your data is sorted and the cardinality isn’t too high. I also think that there are further optimisations that can be applied to speed things up when cardinality is higher.

Found this useful? Get in touch if you would like to discuss how your pipelines can take full advantage of Polars speed and scalability.

Or check out my Data Analysis with Polars course with a half price discount

## Next steps

Want to know more about Polars for high performance data science and ML? Then you can: