
Reading and writing files on S3 with Polars

Updated June 2024 for Polars version 1.0

In this post we see how to read and write from a CSV or Parquet file in S3 with Polars. We also see how to filter the file on S3 before downloading it to reduce the amount of data transferred across the network.

Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course.

Writing a file to S3

We create a simple DataFrame with 3 columns which we will write to both a CSV and Parquet file in S3 using s3fs. The s3fs library allows you to read and write files to S3 with similar syntax to working on a local file system.

import polars as pl
import s3fs

bucket_name = "my_bucket"
csv_key = "test_write.csv"
parquet_key = "test_write.parquet"

fs = s3fs.S3FileSystem()
df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)
with fs.open(f"{bucket_name}/{csv_key}", mode="wb") as f:
    df.write_csv(f)
with fs.open(f"{bucket_name}/{parquet_key}", mode="wb") as f:
    df.write_parquet(f)

I recommend the Parquet format as it has a smaller file size, preserves dtypes and makes subsequent reads faster.
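To see the difference in practice we can compare the size of the two objects we just wrote. Here is a minimal sketch using s3fs (assuming the fs, bucket_name, csv_key and parquet_key variables defined above):

# Compare the size in bytes of the CSV and Parquet objects on S3
csv_size = fs.size(f"{bucket_name}/{csv_key}")
parquet_size = fs.size(f"{bucket_name}/{parquet_key}")
print(f"CSV: {csv_size} bytes, Parquet: {parquet_size} bytes")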

Reading a file from S3

We can use Polars to read the entire file back from S3 using pl.read_csv or pl.read_parquet.

df_csv = pl.read_csv(f"s3://{bucket_name}/{csv_key}")
df_parquet = pl.read_parquet(f"s3://{bucket_name}/{parquet_key}")

Internally Polars reads the remote file into a memory buffer using fsspec and then reads the buffer into a DataFrame. This is a fast approach, but it does mean that the whole file is read into memory. This is fine for small files but can be slow and memory-intensive for large files.
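To make this concrete, here is a rough sketch of the equivalent manual approach (assuming s3fs is installed so that fsspec can handle s3:// URLs, and re-using the bucket_name and parquet_key variables from above): fsspec opens the remote object, the bytes are downloaded into a buffer and Polars parses that buffer.

import fsspec
import polars as pl

# Open the remote object; the bytes are pulled into memory
# and Polars parses the in-memory buffer into a DataFrame
with fsspec.open(f"s3://{bucket_name}/{parquet_key}", mode="rb") as f:
    df_parquet = pl.read_parquet(f)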

However, reading the whole file is wasteful when we only want to read a subset of rows. With a Parquet file we can instead scan the file on S3 and only read the rows we need.

Scanning a file on S3 with query optimisation

With a Parquet file we can scan the file on S3 and build a lazy query. The Polars query optimiser applies:

  • predicate pushdown meaning that it tries to limit the number of rows to read from S3 and
  • projection pushdown meaning that it tries to limit the number of columns to read from S3

We can do this with pl.scan_parquet. This may also require some cloud-storage-provider-specific options to be passed (see this post for more on authentication).

import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "eu-west-1",
}
df = (
    # Scan the file on S3
    pl.scan_parquet(source, storage_options=storage_options)
    # Apply a filter condition
    .filter(pl.col("id") > 100)
    # Select only the columns we need
    .select("id", "value")
    # Collect the data
    .collect()
)

In this case Polars will only read the id and value columns from the Parquet file and only the rows where the id column is greater than 100. This can be much faster and more memory efficient than reading the whole file.
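One way to check that the pushdowns are being applied (the exact output varies between Polars versions) is to print the optimised query plan with explain before collecting:

lazy_query = (
    pl.scan_parquet(source, storage_options=storage_options)
    .filter(pl.col("id") > 100)
    .select("id", "value")
)
# Print the optimised query plan to confirm the filter and column
# selection are pushed down to the Parquet scan
print(lazy_query.explain())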

Since version 1.0 of Polars we can also scan a CSV file in cloud storage.

csv_source = "s3://bucket/*.csv"

df = (
    # Scan the CSV file on S3
    pl.scan_csv(csv_source, storage_options=storage_options)
    # Apply a filter condition
    .filter(pl.col("id") > 100)
    # Select only the columns we need
    .select("id", "value")
    # Collect the data
    .collect()
)

The limitations of CSV compared to Parquet still apply here, of course. Because CSV is not a columnar format, Polars cannot read just a subset of columns or rows from the file on S3 in the way it can with Parquet, so more data may still be transferred across the network.

Wrap-up

In this post we have seen how to read and write files from S3 with Polars. This is a fast-developing area so I’m sure I’ll be back to update this post in the future (again!) as Polars does more of this natively.

There are more sophisticated ways to manage data on S3. For example, you could use a data lake tool like Delta Lake to manage your data on S3. These tools allow you to manage your data in a more structured way and to perform operations like upserts and deletes. See this post by Matthew Powers for an intro to using Delta Lake with Polars.
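As a taster, Polars can scan a Delta table lazily with pl.scan_delta. This is a minimal sketch (assuming the deltalake package is installed, that a Delta table exists at this hypothetical path, and that AWS credentials are available in the environment):

import polars as pl

# Scan a Delta table on S3 lazily and collect the filtered result
# (credentials are picked up from the environment in this sketch)
df = (
    pl.scan_delta("s3://my_bucket/delta/my_table")
    .filter(pl.col("id") > 100)
    .collect()
)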

Again, I invite you to buy my Polars course if you want to learn the Polars API and how to use Polars in the real world.

Next steps

Want to know more about Polars for high-performance data science? Then check out my Up & Running with Polars course.

This post is licensed under CC BY 4.0 by the author.