Don't fear the List dtype in Polars
Post
Cancel

# Don't fear the List dtype in Polars

Published on: 6th October 2022

# What is the Polars `pl.List` dtype?

This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters

One of the first things you learn doing data analysis in python is that working with lists are slow.

So you might be wary of the `pl.List` dtype in Polars.

But you shouldn’t worry - this dtype is at the core of some fast parallel analysis.

In a List column every row is a Polars Series. And every Series has the same dtype.

Here we create a DataFrame with three list columns: ints, floats and strings. Type is inferred from the first element.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 import polars as pl dfLists = pl.DataFrame({ 'ints':[ [0,1], [2,3]], 'floats':[ [0.0,1], [2,3]], 'strings':[ ["0","1"],["2","3"]] }) dfLists shape: (2, 3) ┌───────────┬────────────┬────────────┐ │ ints ┆ floats ┆ strings │ │ --- ┆ --- ┆ --- │ │ list[i64] ┆ list[f64] ┆ list[str] │ ╞═══════════╪════════════╪════════════╡ │ [0, 1] ┆ [0.0, 1.0] ┆ ["0", "1"] │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤ │ [2, 3] ┆ [2.0, 3.0] ┆ ["2", "3"] │ └───────────┴────────────┴────────────┘ ```

# Doing some analysis

The great thing about the List dtype is that you can easily do the same analysis to the Series on each row.

```1 2 3 4 5 6 dfLists = pl.DataFrame({ 'ints':[ [0,1], [2,3]], 'floats':[ [0.0,1], [2,3]], 'strings':[ ["0","1"],["2","3"]] }) print(dfLists.select(pl.all().arr.max())) ```

This is a simple example but it can get as gnarly as you need (in parallel if you want).

Although we created them with python lists, they are fast Series.

# Mixing it up?

What happens if we pass a list with a mix of types?

In this case the column has a pl.Object dtype.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 dfMixed = pl.DataFrame({ 'ints':[ [0,1], [2,3]], 'mixed':[ ['a',0],['b',1]] }) dfMixed shape: (2, 2) ┌───────────┬──────────┐ │ ints ┆ mixed │ │ --- ┆ --- │ │ list[i64] ┆ object │ ╞═══════════╪══════════╡ │ [0, 1] ┆ ['a', 0] │ ├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ [2, 3] ┆ ['b', 1] │ └───────────┴──────────┘ ```

Every row in a pl.Object dtype is just a python list under the hood. While they can hold arbitrary objects these Object columns won’t be so efficient for analysis.

In summary, fear the lists (and maybe fear the `pl.Object`), but don’t fear the `pl.Lists`!

Want to know more about Polars for high performance data science and ML? Then you can:

or let me know if you would like a Polars workshop for your organisation.