Polars on a diet

This post was created while writing my Up & Running with Polars course, which has a free preview of the first chapters.

Polars has a built-in tool for putting your data on a dtype diet.

Call the shrink_dtype expression and it converts each numeric column to the dtype that needs the least memory for the data in that column. String and Boolean columns are left as they are.

import polars as pl

(
    pl.DataFrame(
        {
            "a": [1, 2, 3],
            "b": [1, 2, 2 << 32],
            "c": [-1, 2, 1 << 30],
            "d": [-112, 2, 112],
            "e": [-112, 2, 129],
            "f": ["a", "b", "c"],
            "g": [0.1, 1.32, 0.12],
            "h": [True, None, False],
        }
    )
    .select(
        pl.all().shrink_dtype()
    )
)
┌─────┬────────────┬────────────┬──────┬──────┬─────┬──────┬───────┐
 a    b           c           d     e     f    g     h     
 ---  ---         ---         ---   ---   ---  ---   ---   
 i8   i64         i32         i8    i16   str  f32   bool  
╞═════╪════════════╪════════════╪══════╪══════╪═════╪══════╪═══════╡
 1    1           -1          -112  -112  a    0.1   true  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 2    2           2           2     2     b    1.32  null  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 3    8589934592  1073741824  112   129   c    0.12  false 
└─────┴────────────┴────────────┴──────┴──────┴─────┴──────┴───────┘

Both floats and integers default to 64-bit precision. In the example above, taken from the API docs, Polars sees that column “a” could be 8-bit, column “b” must stay 64-bit, and column “c” could be 32-bit.

Casting numeric columns from 64-bit to 32-bit is often the easiest win in data science. Memory usage halves and computation time might also be half that of 64-bit.

You do need to check that the loss of precision is OK. I had sensors accurate to 0.01, so a change of 10^-6 was 👍

In my Udemy course I show that if you cast to 8 or 16 bits, memory usage continues to fall proportionally…

…but computation time probably won’t be any better than at 32 bits!

A likely reason is that many CPUs do little arithmetic natively at 8 or 16 bits, so narrower values are often widened before the computation runs.

String columns with lots of repeated entries can also usefully be cast to categoricals. But that’s a story for another day.

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.