This post was created while writing my Data Analysis with Polars course. Check it out on Udemy.
One consequence of the Apache Arrow era is that different libraries will integrate more easily.
Here, for example, we load data from a Hugging Face dataset into a Polars DataFrame with zero copy.
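from datasets import load_dataset
import polars as pl

# Load the training split; datasets keeps it as an Arrow table under the hood
dataset = load_dataset("rotten_tomatoes", split="train")
# Hand the underlying pyarrow Table to Polars
df = pl.from_arrow(dataset.data.table)
print(df.head(3))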
shape: (3, 2)
┌───────────────────────────────────────────────────────┬───────┐
│ text                                                  ┆ label │
│ ---                                                   ┆ ---   │
│ str                                                   ┆ i64   │
╞═══════════════════════════════════════════════════════╪═══════╡
│ the rock is destined to be the 21st century's new ... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ the gorgeously elaborate continuation of " the lor... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ effective but too-tepid biopic                        ┆ 1     │
└───────────────────────────────────────────────────────┴───────┘
Hopefully there will be an explicit to_polars() method in datasets.
I’ll be digging into this in more detail: can we exploit the memory-mapped datasets that datasets can produce with Polars’ new out-of-core capabilities?
Also: please don’t call libraries datasets😂
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- join my Polars course on Udemy
- follow me on Twitter
- connect with me on LinkedIn
- check out my YouTube videos
or let me know if you would like a Polars workshop for your organisation.