
AWS Lambda with Polars II: PyArrow

In a recent post I showed how to use Polars in AWS Lambda with the smart_open library. There are a variety of ways to work with Polars in AWS Lambda, however. In this post we look at how to use Polars with PyArrow.

Get in touch if you want to discuss ways to reduce your cloud bills. Or check out my Data Analysis with Polars course with a half price discount.

In this example we use the dataset module in PyArrow. The dataset module has a number of helpful features for managing datasets that I hope to explore in future posts. Today we'll see how it combines with Polars to reduce data transfer when working with cloud storage.
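As a quick sketch of the module (a hypothetical example with a local file path, separate from the Lambda code below), a dataset object can be created from a Parquet file on disk in the same way as from S3:

import pyarrow.dataset as ds

# Hypothetical local example: point a dataset at a Parquet file on disk.
# The path is a placeholder - substitute one of your own files.
local_dataset = ds.dataset("data/example.parquet", format="parquet")
print(local_dataset.schema)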

Full example

In the example below we do a groupby on a Parquet file in S3. We set out the full example here and then explore the two key steps in more detail below.

import polars as pl
import pyarrow.dataset as ds


def lambda_handler(event, context):
    try:
        s3_file_name = event["s3_file_name"]
        url = f"s3://braaannigan/{s3_file_name}"
        # Create a PyArrow dataset that points at the Parquet file in S3
        s3_dataset = ds.dataset(url)
        df = (
            pl.scan_ds(s3_dataset)
            .groupby("id1")
            .agg(pl.col("v1").mean())
            .collect(streaming=True)
        )
        return df.write_json()
    except Exception as err:
        # Log the error and re-raise so the Lambda invocation fails visibly
        print(err)
        raise
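For reference, the handler expects an event with an s3_file_name key. A minimal local test might look like this (the file name is a hypothetical placeholder, and you need AWS credentials with read access to the bucket):

# Hypothetical local invocation of the handler defined above
if __name__ == "__main__":
    event = {"s3_file_name": "example.parquet"}
    print(lambda_handler(event, context=None))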

Step 1 - create a dataset object

We pass the S3 URL of the object to ds.dataset:

        s3_dataset = ds.dataset(url)

In this step PyArrow finds the Parquet file in S3 and retrieves some crucial metadata. PyArrow does not download the whole file at this stage - it only reads the metadata in the Parquet footer, so very little data is transferred.

The s3_dataset now knows the schema of the Parquet file - that is the dtypes of the columns.

With the schema we can use Polars in lazy mode and apply some query optimisations.
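We can confirm this by printing the schema from the dataset object (a quick sketch, assuming s3_dataset was created as above):

# The schema was read from the Parquet file's footer metadata
print(s3_dataset.schema)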

Step 2 - Scan the dataset object

In the second step we do our groupby aggregation on the Parquet file.

        df = (
            pl.scan_ds(s3_dataset)
            .groupby("id1")
            .agg(pl.col("v1").mean())
            .collect(streaming=True)
        )

The key point is that we use pl.scan_ds on the dataset object. As with any pl.scan_* function, this tells Polars that we are working in lazy mode.

In lazy mode Polars can apply query optimisations. In this example the optimisation is projection pushdown: Polars works out that only the id1 and v1 columns are needed, so only these columns are read from the file. This reduces the amount of data that we need to transfer from S3.
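We can check that Polars applies this optimisation by inspecting the query plan before collecting it. A quick sketch - note that the method for printing the plan depends on your Polars version (older releases use describe_optimized_plan rather than explain):

query = (
    pl.scan_ds(s3_dataset)
    .groupby("id1")
    .agg(pl.col("v1").mean())
)
# The optimised plan should show a projection of just the id1 and v1
# columns rather than all of the columns in the file
print(query.explain())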

This is just the start of what Polars can do in serverless environments. Get in touch if you’d like to discuss using Polars to reduce your cloud compute bills.

Learn more

Want to know more about Polars for high performance data science and ML? Then you can check out my Data Analysis with Polars course, or let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.