
AWS Lambda with Polars

Updated December 2023

This post was created while writing my Up & Running with Polars course. Check it out on Udemy with a 50% discount.

These days working in serverless environments with Polars is a lot more like working locally with Polars. As Polars now has built-in support for reading and writing from cloud storage like AWS S3 in both eager and lazy mode we can often write standard Polars syntax in our handler functions.

In this example I show you how to create an AWS Lambda function in a Docker image. One nice part of doing this in Docker is that you can test your lambda functions locally before deploying them to AWS.

We start off by defining our dependencies in a requirements.txt file. All we need are Polars and the libraries for working with files in cloud storage in eager mode.

polars
fsspec 
s3fs

In an actual production deployment I highly recommend pinning the versions of your dependencies. From experience I can tell you that the libraries for working with cloud storage are updated frequently and pip often struggles to reconcile different versions.
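A pinned requirements.txt could look something like this (the version numbers here are purely illustrative — pin whatever versions you have actually tested with):

polars==0.19.19
fsspec==2023.12.0
s3fs==2023.12.0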

We continue by defining our Docker image in a Dockerfile using a recent Python runtime. We use a base Docker image created for the purpose of running lambda functions.

One feature of this image is that it has an environment variable called LAMBDA_TASK_ROOT that points to the directory where the lambda function runs. This is useful for copying files into the image.

# Use a python image from AWS
FROM public.ecr.aws/lambda/python:3.11

# Copy requirements.txt into the right directory for the lambda function
COPY requirements.txt ${LAMBDA_TASK_ROOT}

# Install the specified packages (and cache the downloaded packages)
RUN --mount=type=cache,target=/root/.cache/pip  pip install -r requirements.txt

# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_function.handler" ]

On the line of the Dockerfile with pip install I prefixed a command to cache the downloaded python packages. This saves a lot of time when developing the function. I explain more about this here.
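Note that the --mount=type=cache syntax requires Docker BuildKit. Recent versions of Docker use BuildKit by default; on older versions you can enable it for a single build like this (reusing the image tag from the build script further down):

DOCKER_BUILDKIT=1 docker build --platform linux/amd64 -t docker-image:test .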

Now we need the python script that runs when the lambda function is invoked. In this function we read a Parquet file from S3 in eager mode, group by a column and calculate the mean of another column. We return the result as JSON.

import polars as pl

def handler(event, context):
    try:
        # Define the object url
        url = "s3://<bucket_name>/test_file.pq"
        # Download and read the parquet file
        df = (
            pl.read_parquet(
                url,
                columns=["id1", "v1"],
            )
            .group_by("id1")
            .agg(pl.col("v1").mean())
        )
        # Return the dataframe as json
        return df.write_json()

    except Exception as err:
        # Return the error message if something goes wrong
        # (the exception object itself is not JSON serialisable)
        return str(err)

When working with Docker I often have a shell script to build the image, run the container locally and deploy it to the cloud.

In this example shell script I mount my .aws folder to the .aws folder in the container when running it. This mounting allows me to use my AWS credentials to access the S3 bucket when running locally.

#!/bin/bash
# Build the docker image
docker build --platform linux/amd64 -t docker-image:test .
# Run the docker image locally
# Open port 9000 on the host and map it to port 8080 in the container
# Mount the .aws folder in the home directory to the .aws folder in the container
docker run --platform linux/amd64 -p 9000:8080 -v ~/.aws:/root/.aws docker-image:test

We can now test the lambda function locally by sending a request to the local endpoint at port 9000.

curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

If I run this locally I see the output of my simple function as JSON.

Next steps

From this point you need to create an Elastic Container Repository (ECR) in AWS and push your image to it. Then you can create a lambda function that uses your image as a container. See this AWS tutorial for more details on these steps.
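As a rough sketch, pushing the image typically looks something like this — the account ID, region and repository name below are placeholders you need to replace with your own:

#!/bin/bash
# Log Docker in to your ECR registry (account ID and region are placeholders)
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
# Create the repository (only needed the first time)
aws ecr create-repository --repository-name docker-image
# Tag the local image with the ECR repository URI and push it
docker tag docker-image:test 123456789012.dkr.ecr.eu-west-1.amazonaws.com/docker-image:test
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/docker-image:test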

There’s a lot more to say about optimising Polars and AWS Lambda. For example, you can use Polars to read and write from S3 in lazy mode and this allows Polars to apply query optimisations. I’ll cover this in a future post.
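As a taster, here is a minimal sketch of what a lazy version of this post's query could look like, using pl.scan_parquet so that Polars can push the column selection down into the Parquet scan and only download what the aggregation needs:

import polars as pl

def handler(event, context):
    url = "s3://<bucket_name>/test_file.pq"
    # scan_parquet builds a lazy query plan; nothing is downloaded yet
    df = (
        pl.scan_parquet(url)
        .group_by("id1")
        .agg(pl.col("v1").mean())
        .collect()  # the optimised query runs here
    )
    return df.write_json()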

We could also think about how we can speed up the query based on how we store the data in S3. For example, we could use partitioned Parquet files to make our queries more efficient. I cover this in my workshops and I’ll cover it in a future blogpost too.

Learn more

Want to know more about Polars for high performance data science and ML? Then check out my Up & Running with Polars course on Udemy, or let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.