Understanding the Polars nested column types

Polars has 4 native nested column types. These can be very helpful for solving problems such as:

working with ML embeddings
splitting strings
working with nested JSON data
working with aggregations

To take advantage of them it’s important you understand the difference between the types. In this post I set out the key differences between the nested column types and give some examples of when you might use each one.

Nested column types overview

The 4 native nested column types in Polars are:

pl.List
pl.Array
pl.Object
pl.Struct

We can immediately split these into two groups:

pl.List, pl.Array and pl.Object store some kind of sequence on each row
pl.Struct is a nested collection of columns

The sequence types

The sequence types pl.List, pl.Array and pl.Object store some kind of sequence on each row. The main differences between them are how they store the sequence and whether the length of the sequence can be different on each row.

We can break the sequence types into two groups:

pl.List and pl.Array store the data on each row in a Polars Series
pl.Object stores the data on each row as an arbitrary Python object, which in this post is a Python list

`pl.List` and `pl.Array`

On each row pl.List and pl.Array store the data in a Polars Series. As with any Polars Series the data in the Series must have a homogeneous dtype e.g. floats as pl.Float32 or strings as pl.Utf8. The dtype must also be the same for all rows in the column.

The difference between pl.List and pl.Array is that the length of the sequence can be different on each row for pl.List but must be the same for pl.Array. In this sense a pl.Array is more comparable to a 2D NumPy array where the first dimension is the length of the DataFrame and the second dimension is the length of the array.

One further practical difference between pl.List and pl.Array is that pl.Array is relatively new and has less functionality. You may need to use pl.List while pl.Array is further developed.

To illustrate this we create a DataFrame with each of the sequence types.

In this example we create a DataFrame with a float pl.List column and a mixed pl.Object column. Polars infers the floats column as a pl.List with inner dtype pl.Float64. Because each list in the mixed_object column mixes strings and integers, Polars infers that column as pl.Object.

We then create a new pl.Array column floats_array by casting the floats column to a pl.Array type. To do this we specify the width of the array as 2 and the inner type as pl.Float64.

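import polars as pl

# "floats" holds lists of numbers; "mixed_object" mixes strings and integers on each row
df = pl.DataFrame(
    {
        "floats": [[0.0, 1], [2, 3]],
        "mixed_object": [["a", 0], ["b", 1]],
    }
).with_columns(
    # Cast the variable-length list column to a fixed-width array column
    floats_array=pl.col("floats").cast(pl.Array(width=2, inner=pl.Float64))
)
print(df)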
shape: (2, 3)
┌────────────┬──────────────┬───────────────┐
│ floats     ┆ mixed_object ┆ floats_array  │
│ ---        ┆ ---          ┆ ---           │
│ list[f64]  ┆ object       ┆ array[f64, 2] │
╞════════════╪══════════════╪═══════════════╡
│ [0.0, 1.0] ┆ ['a', 0]     ┆ [0.0, 1.0]    │
│ [2.0, 3.0] ┆ ['b', 1]     ┆ [2.0, 3.0]    │
└────────────┴──────────────┴───────────────┘

Note that the pl.Object column holds lists where each list has a mix of different data types.

The use cases of the sequence types include working with vector data and splitting strings.

The pl.List dtype is also used extensively internally. For example, a group_by creates a pl.List column with the data for each group and aggregations happen on this pl.List column.

pl.DataFrame(
    {
        "grp": ["a", "a", "b"], 
        "value": [0, 1, 2]
    }
).group_by("grp").agg(
    pl.col("value")
)
shape: (2, 2)
┌─────┬───────────┐
│ grp ┆ value     │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ b   ┆ [2]       │
│ a   ┆ [0, 1]    │
└─────┴───────────┘

The `pl.Struct` type of nested columns

Whereas the sequence types above have a sequence on each row, the pl.Struct type is a nested collection of columns. The pl.Struct is really just a way of having a nested namespace for columns. The underlying columns are just normal Polars Series.

Of course, like any Polars Series the data in the columns underlying the pl.Struct must have a homogeneous dtype.

In this example we have a trades column that is made from a list of Python dicts. Each dict has the same keys and the values have the same types.

df_struct = pl.DataFrame(
    {
        "year": [2020, 2021],
        "trades": [
            {"exporter": "India", "importer": "USA", "quantity": 0.0},
            {"exporter": "India", "importer": "USA", "quantity": 1.5},
        ],
    }
)
shape: (2, 2)
┌──────┬─────────────────────┐
│ year ┆ trades              │
│ ---  ┆ ---                 │
│ i64  ┆ struct[3]           │
╞══════╪═════════════════════╡
│ 2020 ┆ {"India","USA",0.0} │
│ 2021 ┆ {"India","USA",1.5} │
└──────┴─────────────────────┘

If you have a pl.Struct column and want to un-nest the columns back into a flat DataFrame you can do so with unnest.

df_struct.unnest('trades')
shape: (2, 4)
┌──────┬──────────┬──────────┬──────────┐
│ year ┆ exporter ┆ importer ┆ quantity │
│ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64  ┆ str      ┆ str      ┆ f64      │
╞══════╪══════════╪══════════╪══════════╡
│ 2020 ┆ India    ┆ USA      ┆ 0.0      │
│ 2021 ┆ India    ┆ USA      ┆ 1.5      │
└──────┴──────────┴──────────┴──────────┘

Use cases of the pl.Struct type include working with nested JSON data and collapsing columns into groups when working with wide DataFrames.

That’s it for the intro to the nested column types. If you want to learn more about working with the pl.List dtype check out my post focused on that. For a more comprehensive intro check out my Data Analysis with Polars course.
