
Nested dtypes in Polars 1: the `pl.List` dtype

Polars uses Apache Arrow to store its data in memory. One of the big advantages of Arrow is that it supports a variety of nested data types (or “dtypes”). In this post we look at the pl.List dtype in more detail:

  • we start with an overview of the pl.List dtype
  • we call expressions on each row of a pl.List column
  • we do aggregations with neural network embeddings
  • we do simple text analytics

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters.

Overview of the pl.List dtype

The pl.List dtype allows us to store an array of values on each row. The crucial point is that all values within an array must have the same type, and that type must be the same on every row.

In this example, we create a DataFrame with an integer, float and string pl.List column. Note that:

  • in the floats column we have a mix of floats and integers in one row and so Polars casts all values to a float type
  • the length of the arrays can vary within a column
import polars as pl

dfLists = pl.DataFrame({
    'ints':[ [0,1], [4,3,2]],
    'floats':[ [0.0,1], [2,3]],
    'strings':[ ["0","1"],["2","3"]]
})
dfLists
shape: (2, 3)
┌───────────┬────────────┬────────────┐
│ ints      ┆ floats     ┆ strings    │
│ ---       ┆ ---        ┆ ---        │
│ list[i64] ┆ list[f64]  ┆ list[str]  │
╞═══════════╪════════════╪════════════╡
│ [0, 1]    ┆ [0.0, 1.0] ┆ ["0", "1"] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 3, 2] ┆ [2.0, 3.0] ┆ ["2", "3"] │
└───────────┴────────────┴────────────┘

The key point to understand about the pl.List dtype is that each row is a pl.Series under the hood. This means that operations on a pl.List column are fast, vectorised operations.
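One quick way to see this (a sketch of my own, not from the course material) is to index into a single row of a pl.List column — in the Polars versions I am aware of this hands back a pl.Series holding that row's values:

# take the first row of the ints column
row0 = dfLists["ints"][0]
# row0 is a pl.Series (the exact repr varies by Polars version)
print(type(row0))
print(row0)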

Expressions within arrays

In the use cases later in this post we see how to apply expressions to the entire array. However, we can also apply expressions to the elements inside each array of a pl.List column.

In this example we rank the elements within each array.

(
    dfLists
    .select(
        pl.col("ints").arr.eval(
            pl.element().rank(method="ordinal")
        )
    )
)
shape: (2, 1)
┌───────────┐
│ ints      │
│ ---       │
│ list[u32] │
╞═══════════╡
│ [1, 2]    │
│ [3, 2, 1] │
└───────────┘

To call the rank expression inside each array we:

  • call arr.eval on the ints column
  • call pl.element inside arr.eval to start an expression on each row's array, and
  • call rank on pl.element to rank the elements of each row
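The same pattern works for any expression we can build from pl.element. As a further sketch of my own (not from the original post), here we sort the values inside each array instead of ranking them:

(
    dfLists
    .select(
        pl.col("ints").arr.eval(
            # sort the elements within each row's array
            pl.element().sort()
        )
    )
)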

Use cases

Analysis of embeddings

The pl.List dtype is a great option when you are working with embeddings from a neural network model alongside other metadata.

In the example below we have a doc_id column to identify the document each row came from, a text column showing a chunk of text from each document and an embeddings column with the embeddings for that text.

import numpy as np

df = (
    pl.DataFrame(
        {
            "doc_id": [0, 0, 1, 1, 2, 2],
            "text": [
                "Polars is a dataframe library",
                "Polars is written in Rust",
                "Expressions allow you to transform data",
                "Expressions run in parallel",
                "Apache Arrow supports nested data",
                "There are three nested dtypes",
            ],
        }
    )
    .with_columns(
        # attach a random integer "embedding" of length 3 to each row
        pl.Series(
            "embeddings",
            [pl.Series("", np.random.randint(0, 5, 3)) for _ in range(6)],
        )
    )
)
df
shape: (6, 3)
┌────────┬─────────────────────────────────────┬────────────┐
│ doc_id ┆ text                                ┆ embeddings │
│ ---    ┆ ---                                 ┆ ---        │
│ i64    ┆ str                                 ┆ list[i64]  │
╞════════╪═════════════════════════════════════╪════════════╡
│ 0      ┆ Polars is a dataframe library       ┆ [1, 1, 4]  │
│ 0      ┆ Polars is written in Rust           ┆ [2, 3, 3]  │
│ 1      ┆ Expressions allow you to transfo... ┆ [4, 3, 4]  │
│ 1      ┆ Expressions run in parallel         ┆ [4, 0, 0]  │
│ 2      ┆ Apache Arrow supports nested dat... ┆ [1, 2, 1]  │
│ 2      ┆ There are three nested dtypes       ┆ [1, 0, 0]  │
└────────┴─────────────────────────────────────┴────────────┘

We then get the document-averaged embeddings by doing a groupby on the doc_id column and averaging the embeddings.

(
    df
    .groupby(
        "doc_id"
        )
    .agg(
        pl.col("embeddings").arr.mean()
        )
)

We do the aggregation using arr.mean rather than just mean. By using arr.mean we take advantage of the array expressions for the pl.List dtype in the arr namespace. You can see the full set of expressions here.
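As a sketch of my own (not from the original post), the same arr namespace has other per-array expressions such as arr.sum and arr.max that we could apply to the embeddings in the same way:

(
    df
    .with_columns(
        [
            # per-row aggregations over each embeddings array
            pl.col("embeddings").arr.sum().alias("embeddings_sum"),
            pl.col("embeddings").arr.max().alias("embeddings_max"),
        ]
    )
)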

Word counts

Another use case for arrays is when we split strings. In this example we split the text column by whitespace to get individual words. This transforms the text column into a column with arrays of strings.

df2 = (
    df
    .with_columns(
        pl.col("text").str.split(" ")
    )
    .select(
        ["doc_id","text"]
    )
)
shape: (6, 2)
┌────────┬─────────────────────────────────────┐
│ doc_id ┆ text                                │
│ ---    ┆ ---                                 │
│ i64    ┆ list[str]                           │
╞════════╪═════════════════════════════════════╡
│ 0      ┆ ["Polars", "is", ... "library"]     │
│ 0      ┆ ["Polars", "is", ... "Rust"]        │
│ 1      ┆ ["Expressions", "allow", ... "da... │
│ 1      ┆ ["Expressions", "run", ... "para... │
│ 2      ┆ ["Apache", "Arrow", ... "data"]     │
│ 2      ┆ ["There", "are", ... "dtypes"]      │
└────────┴─────────────────────────────────────┘

With this array of strings we can then count the number of words on each row using arr.lengths.

(
    df2
    .with_columns(
        pl.col("text").arr.lengths()
    )
)
shape: (6, 2)
┌────────┬──────┐
│ doc_id ┆ text │
│ ---    ┆ ---  │
│ i64    ┆ u32  │
╞════════╪══════╡
│ 0      ┆ 5    │
│ 0      ┆ 5    │
│ 1      ┆ 6    │
│ 1      ┆ 4    │
│ 2      ┆ 5    │
│ 2      ┆ 5    │
└────────┴──────┘
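If we wanted the total number of words per document rather than per row, one option (a sketch of my own, not from the original post) is to combine arr.lengths with a groupby:

(
    df2
    .groupby("doc_id")
    .agg(
        # words per row, summed over each document
        pl.col("text").arr.lengths().sum().alias("n_words")
    )
)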

Alternatively, we can count the occurrence of each word. We do this by calling explode to transform the string arrays into separate rows.

(
    df2
    .select(["doc_id","text"])
    .explode("text")
)
shape: (30, 2)
┌────────┬───────────┐
│ doc_id ┆ text      │
│ ---    ┆ ---       │
│ i64    ┆ str       │
╞════════╪═══════════╡
│ 0      ┆ Polars    │
│ 0      ┆ is        │
│ 0      ┆ a         │
│ 0      ┆ dataframe │
│ ...    ┆ ...       │
│ 2      ┆ are       │
│ 2      ┆ three     │
│ 2      ┆ nested    │
│ 2      ┆ dtypes    │
└────────┴───────────┘

From this exploded format we can then count the word occurrences.

(
    df2
    .explode("text")
    ["text"]
    .value_counts(sort=True)
)
shape: (24, 2)
┌─────────────┬────────┐
│ text        ┆ counts │
│ ---         ┆ ---    │
│ str         ┆ u32    │
╞═════════════╪════════╡
│ Polars      ┆ 2      │
│ is          ┆ 2      │
│ in          ┆ 2      │
│ Expressions ┆ 2      │
│ ...         ┆ ...    │
│ There       ┆ 1      │
│ are         ┆ 1      │
│ three       ┆ 1      │
│ dtypes      ┆ 1      │
└─────────────┴────────┘
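We could also count word occurrences within each document rather than across the whole DataFrame — again a sketch of my own rather than part of the original post — by grouping on both doc_id and text after the explode:

(
    df2
    .explode("text")
    .groupby(["doc_id", "text"])
    .agg(
        # how often each word appears within each document
        pl.count().alias("counts")
    )
)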

Of course, this is just the start of what we can do with the pl.List dtype. Get in touch on Twitter/LinkedIn/YouTube if you find other interesting use cases, or check out my course to learn more.

Next steps

Want to know more about Polars for high performance data science? Then check out my Up & Running with Polars course or the preview of its first chapters.

This post is licensed under CC BY 4.0 by the author.