Understanding the Polars nested column types

Polars has 4 native nested column types. These can be very helpful for solving problems such as:

  • working with ML embeddings
  • splitting strings
  • working with nested JSON data
  • working with aggregations

To take advantage of them it’s important you understand the difference between the types. In this post I set out the key differences between the nested column types and give some examples of when you might use each one.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters

Nested column types overview

The 4 native nested column types in Polars are:

  • pl.List
  • pl.Array
  • pl.Object
  • pl.Struct

We can immediately split these into two groups:

  • pl.List, pl.Array and pl.Object store some kind of sequence on each row
  • pl.Struct is a nested collection of columns

The sequence types

The sequence types pl.List, pl.Array and pl.Object store some kind of sequence on each row. The main differences between them are how they store the sequence and whether the length of the sequence can be different on each row.

We can break the sequence types into two groups:

  • pl.List and pl.Array store the data on each row in a Polars Series
  • pl.Object stores the data on each row as an arbitrary Python object, which in the examples here is a Python list

pl.List and pl.Array

On each row pl.List and pl.Array store the data in a Polars Series. As with any Polars Series the data in the Series must have a homogeneous dtype e.g. floats as pl.Float32 or strings as pl.Utf8. The dtype must also be the same for all rows in the column.

The difference between pl.List and pl.Array is that the length of the sequence can be different on each row for pl.List but must be the same for pl.Array. In this sense a pl.Array is more comparable to a 2D numpy array where the first dimension is the length of the DataFrame and the second dimension is the length of the array.

One further practical difference between pl.List and pl.Array is that pl.Array is relatively new and has less functionality. You may need to use pl.List while pl.Array is further developed.

To illustrate this we create a DataFrame with each of the sequence types. The floats column holds lists of numbers, so Polars infers it as a pl.List with inner dtype pl.Float64. The mixed_object column holds lists that mix strings and integers, so there is no single inner dtype and Polars falls back to pl.Object.

We then create a new pl.Array column floats_array by casting the floats column to a pl.Array type. To do this we specify the inner type as pl.Float64 and the width of the array as 2.

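import polars as pl

df = pl.DataFrame(
    {
        # Lists of numbers: inferred as list[f64]
        "floats": [[0.0, 1], [2, 3]],
        # Lists mixing strings and integers: no common inner dtype
        "mixed_object": [["a", 0], ["b", 1]],
    }
).with_columns(
    # Cast the list column to a fixed-width array of two Float64 values
    floats_array=pl.col("floats").cast(pl.Array(pl.Float64, 2))
)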
shape: (2, 3)
┌────────────┬──────────────┬───────────────┐
│ floats     ┆ mixed_object ┆ floats_array  │
│ ---        ┆ ---          ┆ ---           │
│ list[f64]  ┆ object       ┆ array[f64, 2] │
╞════════════╪══════════════╪═══════════════╡
│ [0.0, 1.0] ┆ ['a', 0]     ┆ [0.0, 1.0]    │
│ [2.0, 3.0] ┆ ['b', 1]     ┆ [2.0, 3.0]    │
└────────────┴──────────────┴───────────────┘

Note that the pl.Object column has lists where each list has a mix of different data types.

The use cases of the sequence types include working with vector data and splitting strings.

The pl.List dtype is also used extensively internally - for example a group_by creates a pl.List column with the data for each group and aggregations happen on this pl.List column.

# Passing a bare column to agg collects each group's values into a list
pl.DataFrame(
    {
        "grp": ["a", "a", "b"],
        "value": [0, 1, 2],
    }
).group_by("grp").agg(
    pl.col("value")
)
shape: (2, 2)
┌─────┬───────────┐
│ grp ┆ value     │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ b   ┆ [2]       │
│ a   ┆ [0, 1]    │
└─────┴───────────┘

The pl.Struct type of nested columns

Whereas the sequence types above have a sequence on each row, the pl.Struct type is a nested collection of columns. The pl.Struct is really just a way of having a nested namespace for columns. The underlying columns are just normal Polars Series.

Of course, like any Polars Series the data in the columns underlying the pl.Struct must have a homogeneous dtype.

In this example we have a trades column that is made from a list of Python dicts. Each dict has the same keys and the values have the same types.

df_struct = (
    pl.DataFrame(
        {
            "year": [2020, 2021],
            "trades": [
                {"exporter": "India", "importer": "USA", "quantity": 0.0},
                {"exporter": "India", "importer": "USA", "quantity": 1.5},
            ],
        }
    )
)
shape: (2, 2)
┌──────┬─────────────────────┐
│ year ┆ trades              │
│ ---  ┆ ---                 │
│ i64  ┆ struct[3]           │
╞══════╪═════════════════════╡
│ 2020 ┆ {"India","USA",0.0} │
│ 2021 ┆ {"India","USA",1.5} │
└──────┴─────────────────────┘

If you have a pl.Struct column and want to un-nest the columns back into a flat DataFrame you can do so with unnest.

df_struct.unnest('trades')
shape: (2, 4)
┌──────┬──────────┬──────────┬──────────┐
│ year ┆ exporter ┆ importer ┆ quantity │
│ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64  ┆ str      ┆ str      ┆ f64      │
╞══════╪══════════╪══════════╪══════════╡
│ 2020 ┆ India    ┆ USA      ┆ 0.0      │
│ 2021 ┆ India    ┆ USA      ┆ 1.5      │
└──────┴──────────┴──────────┴──────────┘

Use cases of the pl.Struct type include working with nested JSON data and collapsing columns into groups when working with wide DataFrames.

That’s it for the intro to the nested column types. If you want to learn more about working with the pl.List dtype check out my post focused on that. For a more comprehensive intro check out my Data Analysis with Polars course.

This post is licensed under CC BY 4.0 by the author.