
# Ordering of groupby and unique in Polars

Polars (and Apache Arrow) has been designed to be careful with your data, so you don’t get surprises like the following Pandas code, where the `ints` column has been cast to float because of the missing value:

```python
import pandas as pd

df = pd.DataFrame({'ints': [None, 1, 2], 'strings': ['a', 'b', 'c']})
print(df)
#    ints strings
# 0   NaN       a
# 1   1.0       b
# 2   2.0       c
```

However, every big library will do something that some users won’t expect. These are commonly referred to as gotchas. In this post we explore a few gotchas relating to the ordering of outputs from `group_by` and `unique` that I found while writing my course.

Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course

## Ordering of `group_by`

Let’s define a simple `DataFrame` and do a `group_by` aggregation:

```python
import polars as pl

df = pl.DataFrame(
    {
        "color": ["red", "green", "green", "red", "red"],
        "value": [0, 1, 2, 3, 4],
    }
)
(
    df
    .group_by("color")
    .agg(
        pl.col("value").count()
    )
)
```

If we run this we might get the following output:

```
shape: (2, 2)
┌───────┬───────┐
│ color ┆ value │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ green ┆ 2     │
│ red   ┆ 3     │
└───────┴───────┘
```

Fine - so the groups are ordered alphabetically, right?

Well no - run this a few more times and we will eventually get the following output with a different order of rows:

```
shape: (2, 2)
┌───────┬───────┐
│ color ┆ value │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ red   ┆ 3     │
│ green ┆ 2     │
└───────┴───────┘
```

We see the order of `group_by` output isn’t fixed either alphabetically or by the order of the inputs. This can be an issue if we want to ensure we get consistent ordering - for example when writing tests.

If we want to get a consistent output we have two choices. The first is to pass the `maintain_order = True` argument to `group_by`:

```python
(
    df
    .group_by("color", maintain_order=True)
    .agg(
        pl.col("value").count()
    )
)
```

Setting `maintain_order = True` ensures that the order of the groups is consistent with the order of the input data. However, using `maintain_order = True` prevents Polars from using the streaming engine for larger-than-memory data.

The second solution is to call `sort` on the output to impose an ordering on the groups:

```python
(
    df
    .group_by("color")
    .agg(
        pl.col("value").count(),
    )
    .sort("color")
)
```

As `sort` is now available in the streaming engine this solution can also run in streaming mode.

## Ordering of `unique`

We use `unique` to get the distinct rows of a `DataFrame` in relation to some columns. In this example we define a simple `DataFrame` where we define unique values by the `color` and `value` columns and track row order with the `row` column:

```python
df = pl.DataFrame(
    {
        "color": ["red", "green", "red", "green", "red"],
        "value": [0, 1, 0, 1, 2],
        "row": [0, 1, 2, 3, 4],
    }
)
df.unique(subset=["color", "value"])
```

Run it once and we might get output like this:

```
shape: (3, 3)
┌───────┬───────┬─────┐
│ color ┆ value ┆ row │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ red   ┆ 0     ┆ 0   │
│ green ┆ 1     ┆ 1   │
│ red   ┆ 2     ┆ 4   │
└───────┴───────┴─────┘
```

In earlier versions (i.e. before v0.17.0) of Polars we would have got this order every time.

This was because the `unique` method behaved differently from `group_by` in that `maintain_order` was set to `True`. This has now changed - `maintain_order` is set to `False` by default and so the output of `unique` no longer follows the row order of the input `DataFrame`. This means the output above could also, for example, be:

```
shape: (3, 3)
┌───────┬───────┬─────┐
│ color ┆ value ┆ row │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ green ┆ 1     ┆ 1   │
│ red   ┆ 0     ┆ 0   │
│ red   ┆ 2     ┆ 4   │
└───────┴───────┴─────┘
```

In some ways the previous ordered behaviour was intuitive, as we often think of `unique` as returning the input `DataFrame` without the duplicate rows. However, as with `group_by`, having a default of `maintain_order = True` would mean that `unique` would not work in streaming mode by default. Maintaining order is not streaming-friendly as it requires bringing together all the chunks in memory to compare the order of the rows.

With this change of default the developers want to ensure that Polars is ready to work with datasets of all sizes while allowing users to choose different behaviour if desired.

A related point is the choice of which row within each group of duplicates is kept by `unique`. In Pandas this defaults to the first row of each group of duplicates. In Polars the default is `keep="any"`, which makes no promise about which of the duplicate rows is kept, as this again allows more optimizations.

## Takeaway

When the order of outputs is important to you, check whether the function you are calling has a `maintain_order` argument. Some other functions that have this include:

• `partition_by`
• `pivot`
• `upsample`
• `cut` (applied to a Series)


