
Polars & Huggingface datasets

This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters.

One consequence of the Apache Arrow era is that libraries built on Arrow's in-memory format can integrate with each other more easily.

Here, for example, we load data from a Huggingface dataset into a Polars dataframe with zero-copy.

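from datasets import load_dataset
import polars as pl

# Load the training split as an Arrow-backed Dataset (downloaded on first use)
dataset = load_dataset("rotten_tomatoes", split="train")
# dataset.data.table is the underlying pyarrow.Table, which Polars can wrap directly
df = pl.from_arrow(dataset.data.table)
print(df.head(3))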

shape: (3, 2)
┌───────────────────────────────────────────────────────┬───────┐
│ text                                                  ┆ label │
│ ---                                                   ┆ ---   │
│ str                                                   ┆ i64   │
╞═══════════════════════════════════════════════════════╪═══════╡
│ the rock is destined to be the 21st century's new ... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ the gorgeously elaborate continuation of " the lor... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ effective but too-tepid biopic                        ┆ 1     │
└───────────────────────────────────────────────────────┴───────┘

Hopefully there will be an explicit to_polars() method in datasets.

I’ll be digging into this in more detail: can we exploit the memory-mapped datasets that datasets can produce with Polars’ new out-of-core capabilities?

Also: please don’t call libraries datasets😂

Learn more

Want to know more about Polars for high performance data science and ML? Let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.