
Nested dtypes in Polars 1: the `pl.List` dtype

Polars uses Apache Arrow to store its data in memory. One of the big advantages of Arrow is that it supports a variety of nested data types (or “dtypes”). In this post we look at the pl.List dtype in more detail:

  • we start with an overview of the pl.List dtype
  • we call expressions on each row of a pl.List column
  • we do aggregations with neural network embeddings
  • we do simple text analytics

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters.

Overview of the pl.List dtype

The pl.List dtype allows us to store an array of values on each row. The crucial point is that all values within an array must have the same type, and that type must be the same on every row.

In this example, we create a DataFrame with an integer, float and string pl.List column. Note that:

  • in the floats column we have a mix of floats and integers in one row and so Polars casts all values to a float type
  • the length of the arrays can vary within a column
import polars as pl

dfLists = pl.DataFrame({
    'ints':[ [0,1], [4,3,2]],
    'floats':[ [0.0,1], [2,3]],
    'strings':[ ["0","1"],["2","3"]]
})
dfLists
shape: (2, 3)
┌───────────┬────────────┬────────────┐
│ ints      ┆ floats     ┆ strings    │
│ ---       ┆ ---        ┆ ---        │
│ list[i64] ┆ list[f64]  ┆ list[str]  │
╞═══════════╪════════════╪════════════╡
│ [0, 1]    ┆ [0.0, 1.0] ┆ ["0", "1"] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 3, 2] ┆ [2.0, 3.0] ┆ ["2", "3"] │
└───────────┴────────────┴────────────┘

The key point to understand about the pl.List dtype is that each row is a pl.Series under the hood. This means that operations on a pl.List column are fast, vectorised operations.
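One quick way to see this (a sketch of my own, not from the course material) is to index into a single row of a pl.List column — in the Polars versions I am aware of this hands back a pl.Series holding that row's values:

# take the first row of the ints column
row0 = dfLists["ints"][0]
# row0 is a pl.Series (the exact repr varies by Polars version)
print(type(row0))
print(row0)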

Expressions within arrays

In the use cases later in this post we see how to apply expressions to the entire array. However, we can also apply expressions to the elements inside each array of a pl.List column.

In this example we rank the elements within each array.

(
    dfLists
    .select(
        pl.col("ints").arr.eval(
            pl.element().rank(method="ordinal")
        )
    )
)
shape: (2, 1)
┌───────────┐
│ ints      │
│ ---       │
│ list[u32] │
╞═══════════╡
│ [1, 2]    │
│ [3, 2, 1] │
└───────────┘

To call the rank expression inside each array we:

  • call arr.eval on the ints column
  • call pl.element inside arr.eval to start an expression on each row's array, and
  • call rank on pl.element to rank the elements of each row
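The same pattern works for any expression we can build from pl.element. As a further sketch of my own (not from the original post), here we sort the values inside each array instead of ranking them:

(
    dfLists
    .select(
        pl.col("ints").arr.eval(
            # sort the elements within each row's array
            pl.element().sort()
        )
    )
)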

Use cases

Analysis of embeddings

The pl.List dtype is a great option when you are working with embeddings from a neural network model alongside other metadata.

In the example below we have a doc_id column to identify the document each row came from, a text column showing a chunk of text from each document and an embeddings column with the embeddings for that text.

import numpy as np

df = (
    pl.DataFrame(
        {
            "doc_id": [0, 0, 1, 1, 2, 2],
            "text": [
                "Polars is a dataframe library",
                "Polars is written in Rust",
                "Expressions allow you to transform data",
                "Expressions run in parallel",
                "Apache Arrow supports nested data",
                "There are three nested dtypes",
            ],
        }
    )
    .with_columns(
        # attach a random integer "embedding" of length 3 to each row
        pl.Series(
            "embeddings",
            [pl.Series("", np.random.randint(0, 5, 3)) for _ in range(6)],
        )
    )
)
df
shape: (6, 3)
┌────────┬─────────────────────────────────────┬────────────┐
│ doc_id ┆ text                                ┆ embeddings │
│ ---    ┆ ---                                 ┆ ---        │
│ i64    ┆ str                                 ┆ list[i64]  │
╞════════╪═════════════════════════════════════╪════════════╡
│ 0      ┆ Polars is a dataframe library       ┆ [1, 1, 4]  │
│ 0      ┆ Polars is written in Rust           ┆ [2, 3, 3]  │
│ 1      ┆ Expressions allow you to transfo... ┆ [4, 3, 4]  │
│ 1      ┆ Expressions run in parallel         ┆ [4, 0, 0]  │
│ 2      ┆ Apache Arrow supports nested dat... ┆ [1, 2, 1]  │
│ 2      ┆ There are three nested dtypes       ┆ [1, 0, 0]  │
└────────┴─────────────────────────────────────┴────────────┘

We then get the document-averaged embeddings by doing a groupby on the doc_id column and averaging the embeddings.

(
    df
    .groupby(
        "doc_id"
        )
    .agg(
        pl.col("embeddings").arr.mean()
        )
)

We do the aggregation using arr.mean rather than just mean. By using arr.mean we take advantage of the array expressions for the pl.List dtype in the arr namespace. You can see the full set of expressions here.
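As a sketch of my own (not from the original post), the same arr namespace has other per-array expressions such as arr.sum and arr.max that we could apply to the embeddings in the same way:

(
    df
    .with_columns(
        [
            # per-row aggregations over each embeddings array
            pl.col("embeddings").arr.sum().alias("embeddings_sum"),
            pl.col("embeddings").arr.max().alias("embeddings_max"),
        ]
    )
)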

Word counts

Another use case for arrays is when we split strings. In this example we split the text column by whitespace to get individual words. This transforms the text column into a column with arrays of strings.

df2 = (
    df
    .with_columns(
        pl.col("text").str.split(" ")
    )
    .select(
        ["doc_id","text"]
    )
)
shape: (6, 2)
┌────────┬─────────────────────────────────────┐
│ doc_id ┆ text                                │
│ ---    ┆ ---                                 │
│ i64    ┆ list[str]                           │
╞════════╪═════════════════════════════════════╡
│ 0      ┆ ["Polars", "is", ... "library"]     │
│ 0      ┆ ["Polars", "is", ... "Rust"]        │
│ 1      ┆ ["Expressions", "allow", ... "da... │
│ 1      ┆ ["Expressions", "run", ... "para... │
│ 2      ┆ ["Apache", "Arrow", ... "data"]     │
│ 2      ┆ ["There", "are", ... "dtypes"]      │
└────────┴─────────────────────────────────────┘

With this array of strings we can then count the number of words on each row using arr.lengths.

(
    df2
    .with_columns(
        pl.col("text").arr.lengths()
    )
)
shape: (6, 2)
┌────────┬──────┐
│ doc_id ┆ text │
│ ---    ┆ ---  │
│ i64    ┆ u32  │
╞════════╪══════╡
│ 0      ┆ 5    │
│ 0      ┆ 5    │
│ 1      ┆ 6    │
│ 1      ┆ 4    │
│ 2      ┆ 5    │
│ 2      ┆ 5    │
└────────┴──────┘
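If we wanted the total number of words per document rather than per row, one option (a sketch of my own, not from the original post) is to combine arr.lengths with a groupby:

(
    df2
    .groupby("doc_id")
    .agg(
        # words per row, summed over each document
        pl.col("text").arr.lengths().sum().alias("n_words")
    )
)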

Alternatively, we can count the occurrence of each word. We do this by calling explode to transform the string arrays into separate rows.

(
    df2
    .select(["doc_id","text"])
    .explode("text")
)
shape: (30, 2)
┌────────┬───────────┐
│ doc_id ┆ text      │
│ ---    ┆ ---       │
│ i64    ┆ str       │
╞════════╪═══════════╡
│ 0      ┆ Polars    │
│ 0      ┆ is        │
│ 0      ┆ a         │
│ 0      ┆ dataframe │
│ ...    ┆ ...       │
│ 2      ┆ are       │
│ 2      ┆ three     │
│ 2      ┆ nested    │
│ 2      ┆ dtypes    │
└────────┴───────────┘

From this exploded format we can then count the word occurrences.

(
    df2
    .explode("text")
    ["text"]
    .value_counts(sort=True)
)
shape: (24, 2)
┌─────────────┬────────┐
│ text        ┆ counts │
│ ---         ┆ ---    │
│ str         ┆ u32    │
╞═════════════╪════════╡
│ Polars      ┆ 2      │
│ is          ┆ 2      │
│ in          ┆ 2      │
│ Expressions ┆ 2      │
│ ...         ┆ ...    │
│ There       ┆ 1      │
│ are         ┆ 1      │
│ three       ┆ 1      │
│ dtypes      ┆ 1      │
└─────────────┴────────┘
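We could also count word occurrences within each document rather than across the whole DataFrame — again a sketch of my own rather than part of the original post — by grouping on both doc_id and text after the explode:

(
    df2
    .explode("text")
    .groupby(["doc_id", "text"])
    .agg(
        # how often each word appears within each document
        pl.count().alias("counts")
    )
)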

Of course, this is just the start of what we can do with the pl.List dtype. Get in touch on Twitter/LinkedIn/YouTube if you find other interesting use cases, or check out my course to learn more.

Next steps

Want to know more about Polars for high performance data science? Then check out my Up & Running with Polars course or the preview of its first chapters.

This post is licensed under CC BY 4.0 by the author.