In a recent post I showed how to use Polars in AWS Lambda using the `smart_open` library. There are a variety of ways that you can work with Polars in AWS Lambda, however. In this post we look at how to use Polars with PyArrow.
Get in touch if you want to discuss ways to reduce your cloud bills. Or check out my Data Analysis with Polars course with a half-price discount.
In this example we use the `dataset` module in PyArrow. The `dataset` module has a number of helpful features for managing datasets that I hope to explore in future posts. Today we'll see how it combines with Polars to reduce data transfer when working with cloud storage.
Full example
In the example below we do a groupby on a Parquet file in S3. We set out the full example here and then explore the two key steps in more detail below.
```python
import polars as pl
import pyarrow.dataset as ds


def lambda_handler(event, context):
    try:
        s3_file_name = event["s3_file_name"]
        url = f"s3://braaannigan/{s3_file_name}"
        # Create a PyArrow dataset pointing at the Parquet file in S3
        s3_dataset = ds.dataset(url)
        # Lazy scan, groupby aggregation and streaming collect
        df = (
            pl.scan_ds(s3_dataset)
            .groupby("id1")
            .agg(pl.col("v1").mean())
            .collect(streaming=True)
        )
        return df.write_json()
    except Exception as err:
        print(err)
        raise
```
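For a quick local check you can call the handler directly with a sample event. This is just a sketch: the file name data.parquet is a placeholder for a Parquet file that exists in your bucket.

```python
# Hypothetical local test of the handler above.
# "data.parquet" is a placeholder file name.
event = {"s3_file_name": "data.parquet"}
print(lambda_handler(event, None))
```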
Step 1 - Create a dataset object
We pass the S3 URL of the object to `ds.dataset`:
```python
s3_dataset = ds.dataset(url)
```
In this step PyArrow finds the Parquet file in S3 and retrieves some crucial information from the file's metadata. Note that PyArrow does not download the whole file at this stage, so creating the dataset does not require a full transfer of the file.
The `s3_dataset` object now knows the schema of the Parquet file - that is, the dtypes of the columns. With the schema we can use Polars in lazy mode and apply query optimisations.
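To see what the dataset has learned, you can print its schema. A minimal sketch, assuming the same bucket as above and a placeholder file name data.parquet:

```python
import pyarrow.dataset as ds

# "data.parquet" is a placeholder - use a file that exists in your bucket
s3_dataset = ds.dataset("s3://braaannigan/data.parquet")

# The schema (column names and Arrow dtypes) comes from the Parquet
# metadata rather than the full file
print(s3_dataset.schema)
```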
Step 2 - Scan the dataset object
In the second step we do our groupby aggregation on the Parquet file.
```python
df = (
    pl.scan_ds(s3_dataset)
    .groupby("id1")
    .agg(pl.col("v1").mean())
    .collect(streaming=True)
)
```
The key point is that we use `pl.scan_ds` on the dataset object. As with any `pl.scan_` function, this tells Polars that we are working in lazy mode.
In lazy mode Polars can apply query optimisations. In this example the optimisation is that only the `id1` and `v1` columns are read from the file. This optimisation (known as projection pushdown) reduces the amount of data that we need to read from the file.
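You can check that the optimisation is applied by printing the optimised query plan before collecting. A sketch, with the caveats that pl.scan_ds has since been renamed pl.scan_pyarrow_dataset and that older Polars versions use describe_optimized_plan() instead of explain():

```python
import polars as pl
import pyarrow.dataset as ds

# "data.parquet" is a placeholder file name
s3_dataset = ds.dataset("s3://braaannigan/data.parquet")

# Print the optimised plan without running the query - the projection
# should list only the id1 and v1 columns
print(
    pl.scan_ds(s3_dataset)
    .groupby("id1")
    .agg(pl.col("v1").mean())
    .explain()
)
```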
This is just the start of what Polars can do in serverless environments. Get in touch if you’d like to discuss using Polars to reduce your cloud compute bills.
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- get in touch to discuss your data processing challenges
- join my Polars course on Udemy
- follow me on Twitter
- connect with me on LinkedIn
- check out my YouTube videos
or let me know if you would like a Polars workshop for your organisation.