Polars now allows you to write Parquet files even when the output is too large to fit in memory. It does this by using streaming to process data in batches and then writing these batches to a Parquet file with a method called sink_parquet.
In recent months Polars has added a streaming approach where larger-than-memory datasets are processed in batches so that they fit into memory. If you aren't familiar with streaming, see this video where I process a 30 GB dataset on a not-very-impressive laptop.
However, a limitation of streaming has been that the output of the query must fit in memory as a DataFrame. Polars now has a sink_parquet method, which means you can write the output of your streaming query to a Parquet file. This means that you can process large datasets on a laptop even if the output of your query doesn't fit in memory.
In this example we process a large Parquet file in lazy mode and write the output to another Parquet file with sink_parquet:
```python
import polars as pl

(
    pl.scan_parquet(large_parquet_file_path)
    .groupby("passenger_count")
    .agg(
        pl.col("tip_amount")
    )
    .sink_parquet(output_parquet_file_path)
)
```
Unlike a normal lazy query, we evaluate the query and write the output by calling sink_parquet instead of collect.
One great use case for sink_parquet is converting one or more large CSV files to Parquet so the data is much faster to work with. Doing this manually can be tedious and time-consuming, but with Polars we can do it as follows:
```python
(
    pl.scan_csv("*.csv")
    .sink_parquet(output_parquet_file_path)
)
```
There are some limitations to be aware of. Streaming does not work for all operations in Polars, so you may get an out-of-memory exception if your query does not support streaming.
You can learn more about sink_parquet in the official docs, including options to set the compression level and row group statistics.
Want to know more about Polars for high performance data science and ML? Then you can: