Polars has 4 native nested column types. These can be very helpful for solving problems such as:
- working with ML embeddings
- splitting strings
- working with nested JSON data
- working with aggregations
To take advantage of them it’s important you understand the difference between the types. In this post I set out the key differences between the nested column types and give some examples of when you might use each one.
Nested column types overview
The 4 native nested column types in Polars are:
- pl.List
- pl.Array
- pl.Object
- pl.Struct
We can immediately split these into two groups:
- pl.List, pl.Array and pl.Object store some kind of sequence on each row
- pl.Struct is a nested collection of columns
The sequence types
The sequence types pl.List, pl.Array and pl.Object store some kind of sequence on each row. The main differences between them are how they store the sequence and whether the length of the sequence can be different on each row.
We can break the sequence types into two groups:
- pl.List and pl.Array store the data on each row in a Polars Series
- pl.Object stores the data on each row in a Python list
pl.List and pl.Array
On each row pl.List and pl.Array store the data in a Polars Series. As with any Polars Series, the data must have a homogeneous dtype, e.g. floats as pl.Float32 or strings as pl.Utf8. The dtype must also be the same for all rows in the column.
The difference between pl.List and pl.Array is that the length of the sequence can be different on each row for pl.List but must be the same for pl.Array. In this sense a pl.Array is more comparable to a 2D numpy array where the first dimension is the length of the DataFrame and the second dimension is the length of the array.
One further practical difference between pl.List and pl.Array is that pl.Array is relatively new and has less functionality. You may need to use pl.List while pl.Array is further developed.
To illustrate this we create a DataFrame with each of the sequence types. The floats column holds lists of floats and the mixed_object column holds lists that mix strings and integers. Polars infers the floats column as a pl.List with inner type pl.Float64, and infers the mixed_object column as pl.Object because the data types within each list are mixed.
We then create a new pl.Array column floats_array by casting the floats column to a pl.Array type. To do this we specify the width of the array as 2 and the inner type as pl.Float64.
import polars as pl

df = pl.DataFrame(
    {
        "floats": [[0.0, 1], [2, 3]],
        "mixed_object": [["a", 0], ["b", 1]],
    }
).with_columns(
    # Cast the variable-length list column to a fixed-width array column
    floats_array=pl.col("floats").cast(pl.Array(width=2, inner=pl.Float64))
)
shape: (2, 3)
┌────────────┬──────────────┬───────────────┐
│ floats     ┆ mixed_object ┆ floats_array  │
│ ---        ┆ ---          ┆ ---           │
│ list[f64]  ┆ object       ┆ array[f64, 2] │
╞════════════╪══════════════╪═══════════════╡
│ [0.0, 1.0] ┆ ['a', 0]     ┆ [0.0, 1.0]    │
│ [2.0, 3.0] ┆ ['b', 1]     ┆ [2.0, 3.0]    │
└────────────┴──────────────┴───────────────┘
Note that each list in the pl.Object column holds a mix of different data types.
The use cases of the sequence types include working with vector data and splitting strings.
The pl.List dtype is also used extensively internally - for example a group_by creates a pl.List column with the data for each group and aggregations happen on this pl.List column.
pl.DataFrame(
    {
        "grp": ["a", "a", "b"],
        "value": [0, 1, 2],
    }
).group_by("grp").agg(
    # With no aggregation function, the values of each group are gathered into a list
    pl.col("value")
)
shape: (2, 2)
┌─────┬───────────┐
│ grp ┆ value     │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ b   ┆ [2]       │
│ a   ┆ [0, 1]    │
└─────┴───────────┘
The pl.Struct type of nested columns
Whereas the sequence types above have a sequence on each row, the pl.Struct type is a nested collection of columns. The pl.Struct is really just a way of having a nested namespace for columns. The underlying columns are just normal Polars Series.
Of course, like any Polars Series, the data in each of the columns underlying the pl.Struct must have a homogeneous dtype.
In this example we have a trades column built from a list of Python dicts. Each dict has the same keys, and the values for each key have the same type.
df_struct = (
    pl.DataFrame(
        {
            "year": [2020, 2021],
            # Each dict becomes one row of the struct column
            "trades": [
                {"exporter": "India", "importer": "USA", "quantity": 0.0},
                {"exporter": "India", "importer": "USA", "quantity": 1.5},
            ],
        }
    )
)
shape: (2, 2)
┌──────┬─────────────────────┐
│ year ┆ trades              │
│ ---  ┆ ---                 │
│ i64  ┆ struct[3]           │
╞══════╪═════════════════════╡
│ 2020 ┆ {"India","USA",0.0} │
│ 2021 ┆ {"India","USA",1.5} │
└──────┴─────────────────────┘
If you have a pl.Struct column and want to un-nest its fields back into a flat DataFrame you can do so with unnest:
df_struct.unnest('trades')
shape: (2, 4)
┌──────┬──────────┬──────────┬──────────┐
│ year ┆ exporter ┆ importer ┆ quantity │
│ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64  ┆ str      ┆ str      ┆ f64      │
╞══════╪══════════╪══════════╪══════════╡
│ 2020 ┆ India    ┆ USA      ┆ 0.0      │
│ 2021 ┆ India    ┆ USA      ┆ 1.5      │
└──────┴──────────┴──────────┴──────────┘
Use cases of the pl.Struct type include working with nested JSON data and collapsing columns into groups when working with wide DataFrames.
That’s it for the intro to the nested column types. If you want to learn more about working with the pl.List dtype check out my post focused on that. For a more comprehensive intro check out my Data Analysis with Polars course.
Next steps
Want to know more about Polars for high performance data science? Then you can: