
To go big you must be lazy

I was recently consulting for a client who needs to process hundreds of GB of CSV files. On their first pass with Polars they read their CSVs with a (simplified) pattern like the version below. I show it here with a breakdown underneath:

import glob

import polars as pl

# Create a list to hold a LazyFrame for each CSV
queries = []
# Glob the CSV files
csv_files = glob.glob("data_files/*.csv")
# Loop over the CSV files
for csv_file in csv_files:
    # Add the LazyFrame for each CSV file to the list
    queries.append(pl.scan_csv(csv_file))
# Evaluate all the queries in the list
queries = pl.collect_all(queries)
# Concatenate the DataFrames into a single DataFrame
polars_df = pl.concat(queries)
# Select a subset of columns
polars_df = polars_df.select(["date", "temperature", "humidity"])
...

The key points are that they:

  • used glob to iterate over their CSV files (šŸ‘)
  • used lazy mode by doing pl.scan_csv (😊)
  • added each LazyFrame from a pl.scan_csv into a list called queries (šŸ‘)
  • ran pl.collect_all on queries to evaluate each LazyFrame and create a list of DataFrames (😰)
  • concatenated the DataFrames into a single DataFrame (šŸ‘)
  • went on to do further transformations such as selecting a subset of columns (😢)

Unfortunately, with this pattern you lose out on Polars' ability to do query optimisation and to process large datasets in streaming mode.

So how do we do better?

Be lazy

The first way to improve things is to delay calling collect (or collect_all). You generally want to stay in lazy mode for as long as possible - hopefully for your entire query - so that Polars can apply query optimisation.

In this case we select a subset of columns in the last line. However, we do this after we have evaluated the CSV reads with pl.collect_all. This means Polars can't take advantage of query optimisation to only read the subset of columns from the CSV.

If we instead do the select in the same lazy query as pl.scan_csv then Polars only reads the subset of columns that we need from the CSV - this can be a major speedup and a memory saver.
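
As a minimal sketch (assuming a hypothetical single CSV file with these columns), keeping the select inside the lazy query makes the column selection part of the query plan, and you can check the optimised plan before collecting - with lazy_df.explain() in recent Polars releases, or describe_optimized_plan() in older ones:

import polars as pl

# Keep the whole query lazy: the select is now part of the query plan
lazy_df = (
    pl.scan_csv("data_files/2023.csv")  # hypothetical file name
    .select(["date", "temperature", "humidity"])
)

# Print the optimised plan - the projection is pushed down into the CSV scan,
# so only these three columns are read from disk
print(lazy_df.explain())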

Streaming to go big

To work with larger-than-memory datasets we want to evaluate the lazy queries with the streaming argument set to True. So we either call

lazy_df.collect(streaming=True)

on a LazyFrame or

pl.collect_all(queries, streaming=True)

on a list of LazyFrames. In this streaming mode Polars processes the data from each CSV file in chunks, allowing us to work with datasets that are much larger than the memory we have available.
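
As a minimal sketch of the list-of-LazyFrames variant (reusing the hypothetical data_files directory and columns from above):

import glob

import polars as pl

# One lazy query per CSV file, with the column selection kept inside each lazy query
queries = [
    pl.scan_csv(csv_file).select(["date", "temperature", "humidity"])
    for csv_file in glob.glob("data_files/*.csv")
]
# Evaluate all the queries in streaming mode to get a list of DataFrames
dfs = pl.collect_all(queries, streaming=True)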

Before we bring these ideas together: you can see how effective streaming is in this video where I use streaming to process a large dataset on a not-so-amazing laptop.

Bringing it all together

Here is how I would put this all together:

polars_df = (
    pl.scan_csv("data_files/*.csv")
    # Select a subset of columns
    .select(["date", "temperature", "humidity"])
    ...
    .collect(streaming=True)
)

This is not only much faster and more scalable, it's also much easier to read and write!

Let's break it down:

  • I used the glob string pattern inside pl.scan_csv to output a single LazyFrame. This means that Polars takes care of scanning all the CSV files and concatenating them into a single LazyFrame
  • I called select on the LazyFrame - this allows Polars to optimise the query by only reading the date, temperature and humidity columns from the CSVs
  • I called collect with the streaming=True argument to tell Polars I want it to evaluate the dataset in chunks

There are some caveats here:

  • streaming does not work for all operations (but it does work for core operations like filter, groupby and join). If streaming is not available for some of your operations Polars falls back to non-streaming mode and you may run out of memory with a large dataset
  • the output DataFrame must fit in memory, so we need some kind of filtering or aggregation (see the sketch below). This might change in a future release so that we could stream the whole query into a new output file
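
Here is a minimal sketch of such a reducing query (again using the hypothetical data_files directory and columns from above): the full CSV data is processed in chunks and only the much smaller per-date aggregates need to fit in memory.

import polars as pl

# Filter and aggregate in streaming mode - only the aggregated result is held in memory
daily_temperatures = (
    pl.scan_csv("data_files/*.csv")
    .select(["date", "temperature", "humidity"])
    .filter(pl.col("humidity") > 0)
    .groupby("date")  # renamed to group_by in more recent Polars releases
    .agg(pl.col("temperature").mean().alias("mean_temperature"))
    .collect(streaming=True)
)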

So now it's your turn! Let me know on social media how you get on with streaming on your large datasets.

Found this useful? Get in touch if you would like to discuss how your pipelines can take full advantage of Polars' speed and scalability.

Or check out my Data Analysis with Polars course with a half-price discount

Next steps

Want to know more about Polars for high-performance data science and ML? Then you can:

This post is licensed under CC BY 4.0 by the author.