# Polars on a diet

This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters.

Polars has a built-in tool to go on a dtype diet.

Call the `shrink_dtype` expression and it will convert the column to the dtype that requires the least amount of memory based on the data in the column.

```
import polars as pl

(
    pl.DataFrame(
        {
            "a": [1, 2, 3],
            "b": [1, 2, 2 << 32],
            "c": [-1, 2, 1 << 30],
            "d": [-112, 2, 112],
            "e": [-112, 2, 129],
            "f": ["a", "b", "c"],
            "g": [0.1, 1.32, 0.12],
            "h": [True, None, False],
        }
    )
    .select(
        pl.all().shrink_dtype()
    )
)

┌─────┬────────────┬────────────┬──────┬──────┬─────┬──────┬───────┐
│ a   ┆ b          ┆ c          ┆ d    ┆ e    ┆ f   ┆ g    ┆ h     │
│ --- ┆ ---        ┆ ---        ┆ ---  ┆ ---  ┆ --- ┆ ---  ┆ ---   │
│ i8  ┆ i64        ┆ i32        ┆ i8   ┆ i16  ┆ str ┆ f32  ┆ bool  │
╞═════╪════════════╪════════════╪══════╪══════╪═════╪══════╪═══════╡
│ 1   ┆ 1          ┆ -1         ┆ -112 ┆ -112 ┆ a   ┆ 0.1  ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 2          ┆ 2          ┆ 2    ┆ 2    ┆ b   ┆ 1.32 ┆ null  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8589934592 ┆ 1073741824 ┆ 112  ┆ 129  ┆ c   ┆ 0.12 ┆ false │
└─────┴────────────┴────────────┴──────┴──────┴─────┴──────┴───────┘
```

Both floats and integers default to 64-bit precision. In the example above, taken from the API docs, Polars sees that column “a” fits in 8 bits, column “b” needs the full 64 bits, and column “c” fits in 32 bits.

Casting numeric columns from 64-bit to 32-bit is often the easiest win in data science. Memory usage halves and computation time might also be half that of 64-bit.

You do need to check that the loss of precision is OK. My sensors were accurate to 0.01, so a change of around 10^-6 was 👍

In my Udemy course I show that if you cast to 8 or 16 bits memory usage continues to fall proportionally…

…but computation time probably won’t be better than with 32 bits!

The usual explanation is that many CPUs lack fast native arithmetic for 8- or 16-bit values and have to emulate it, so it is worth measuring on your own hardware.

String columns with lots of repeated entries can also usefully be cast to categoricals. But that’s a story for another day.