Polars uses Apache Arrow to store its data in-memory. One of the big advantages of Arrow is that it supports a variety of nested data types (or “dtypes”). In this post we look at the pl.List dtype in more detail:
- we start with an overview of the pl.List dtype
- we call expressions on each row of a pl.List column
- we do aggregations with neural network embeddings
- we do simple text analytics
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course.
Overview of the pl.List dtype
The pl.List dtype allows us to store an array of values on each row. The crucial point is that the type of the values within each array must be the same and these types must be the same on all rows.
In this example, we create a DataFrame with integer, float and string pl.List columns. Note that:
- in the floats column we have a mix of floats and integers in one row and so Polars casts all values to a float type
- the length of the arrays can vary within a column
import polars as pl

# Each value is a Python list, so each column gets a pl.List dtype
dfLists = pl.DataFrame({
    'ints': [[0, 1], [4, 3, 2]],
    'floats': [[0.0, 1], [2, 3]],
    'strings': [["0", "1"], ["2", "3"]],
})
dfLists
shape: (2, 3)
┌───────────┬────────────┬────────────┐
│ ints      ┆ floats     ┆ strings    │
│ ---       ┆ ---        ┆ ---        │
│ list[i64] ┆ list[f64]  ┆ list[str]  │
╞═══════════╪════════════╪════════════╡
│ [0, 1]    ┆ [0.0, 1.0] ┆ ["0", "1"] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 3, 2] ┆ [2.0, 3.0] ┆ ["2", "3"] │
└───────────┴────────────┴────────────┘
The key point to understand with the pl.List dtype is that each row is a pl.Series under the hood. This means that operations on a pl.List column are fast vectorised operations.
Expressions within arrays
In the use cases later in this post we see how to apply expressions on the entire array. However, we can also apply expressions row-by-row on a pl.List column.
In this example we rank the elements within each array:
(
    dfLists
    # select keeps only the ranked ints column
    .select(
        pl.col("ints").arr.eval(
            pl.element().rank(method="ordinal")
        )
    )
)
shape: (2, 1)
┌───────────┐
│ ints      │
│ ---       │
│ list[u32] │
╞═══════════╡
│ [1, 2]    │
│ [3, 2, 1] │
└───────────┘
To call the rank expression inside each array we:
- call arr.eval on the ints column
- inside arr.eval we call pl.element to start the expression for each row and
- then we call rank on pl.element to do the rank expression on each row
Use cases
Analysis of embeddings
The pl.List dtype is a great option when you are working with embeddings from a neural network model alongside other metadata.
In the example below we have a doc_id column to identify the document each row came from, a text column showing a chunk of text from each document and an embeddings column with the embeddings for that text.
import numpy as np

df = (
    pl.DataFrame(
        {
            "doc_id": [0, 0, 1, 1, 2, 2],
            "text": [
                "Polars is a dataframe library",
                "Polars is written in Rust",
                "Expressions allow you to transform data",
                "Expressions run in paralell",
                "Apache Arrow supports nested data",
                "There are three nested dtypes",
            ],
        }
    )
    # Random integers stand in for real embeddings
    .with_columns(
        pl.Series(
            "embeddings",
            [pl.Series("", np.random.randint(0, 5, 3)) for _ in range(6)],
        )
    )
)
df
shape: (6, 3)
┌────────┬─────────────────────────────────────┬────────────┐
│ doc_id ┆ text                                ┆ embeddings │
│ ---    ┆ ---                                 ┆ ---        │
│ i64    ┆ str                                 ┆ list[i64]  │
╞════════╪═════════════════════════════════════╪════════════╡
│ 0      ┆ Polars is a dataframe library       ┆ [1, 1, 4]  │
│ 0      ┆ Polars is written in Rust           ┆ [2, 3, 3]  │
│ 1      ┆ Expressions allow you to transfo... ┆ [4, 3, 4]  │
│ 1      ┆ Expressions run in paralell         ┆ [4, 0, 0]  │
│ 2      ┆ Apache Arrow supports nested dat... ┆ [1, 2, 1]  │
│ 2      ┆ There are three nested dtypes       ┆ [1, 0, 0]  │
└────────┴─────────────────────────────────────┴────────────┘
We then get the document-averaged embeddings by doing a groupby on the doc_id column and averaging the embeddings:
(
    df
    .groupby("doc_id")
    .agg(
        pl.col("embeddings").arr.mean()
    )
)
We do the aggregation using arr.mean rather than just mean. By using arr.mean we take advantage of the array expressions for the pl.List dtype in the arr namespace. You can see the full set of expressions here.
Word counts
Another use case for arrays is when we split strings. In this example we split the text column by whitespace to get individual words. This transforms the text column into a column with arrays of strings.
df2 = (
    df
    .with_columns(
        pl.col("text").str.split(" ")
    )
    .select(
        ["doc_id", "text"]
    )
)
shape: (6, 2)
┌────────┬─────────────────────────────────────┐
│ doc_id ┆ text                                │
│ ---    ┆ ---                                 │
│ i64    ┆ list[str]                           │
╞════════╪═════════════════════════════════════╡
│ 0      ┆ ["Polars", "is", ... "library"]     │
│ 0      ┆ ["Polars", "is", ... "Rust"]        │
│ 1      ┆ ["Expressions", "allow", ... "da... │
│ 1      ┆ ["Expressions", "run", ... "para... │
│ 2      ┆ ["Apache", "Arrow", ... "data"]     │
│ 2      ┆ ["There", "are", ... "dtypes"]      │
└────────┴─────────────────────────────────────┘
With this array of strings we can then count the number of words on each row using arr.lengths:
(
    df2
    .with_columns(
        pl.col("text").arr.lengths()
    )
)
shape: (6, 2)
┌────────┬──────┐
│ doc_id ┆ text │
│ ---    ┆ ---  │
│ i64    ┆ u32  │
╞════════╪══════╡
│ 0      ┆ 5    │
│ 0      ┆ 5    │
│ 1      ┆ 6    │
│ 1      ┆ 4    │
│ 2      ┆ 5    │
│ 2      ┆ 5    │
└────────┴──────┘
Alternatively, we can count the occurrence of each word. We do this by calling explode to transform the string arrays into separate rows:
(
    df2
    .select(["doc_id", "text"])
    .explode("text")
)
shape: (30, 2)
┌────────┬───────────┐
│ doc_id ┆ text      │
│ ---    ┆ ---       │
│ i64    ┆ str       │
╞════════╪═══════════╡
│ 0      ┆ Polars    │
│ 0      ┆ is        │
│ 0      ┆ a         │
│ 0      ┆ dataframe │
│ ...    ┆ ...       │
│ 2      ┆ are       │
│ 2      ┆ three     │
│ 2      ┆ nested    │
│ 2      ┆ dtypes    │
└────────┴───────────┘
From this exploded format we can then count the word occurrences:
(
    df2
    .explode("text")
    ["text"]
    .value_counts(sort=True)
)
shape: (24, 2)
┌─────────────┬────────┐
│ text        ┆ counts │
│ ---         ┆ ---    │
│ str         ┆ u32    │
╞═════════════╪════════╡
│ Polars      ┆ 2      │
│ is          ┆ 2      │
│ in          ┆ 2      │
│ Expressions ┆ 2      │
│ ...         ┆ ...    │
│ There       ┆ 1      │
│ are         ┆ 1      │
│ three       ┆ 1      │
│ dtypes      ┆ 1      │
└─────────────┴────────┘
Of course, this is just the start of what we can do with the pl.List dtype. Get in touch on twitter/linkedin/youtube if you find other interesting use cases or check out my course to learn more.
Next steps
Want to know more about Polars for high performance data science? Then you can: