On the face of it, the concat, extend and vstack functions in Polars do the same job: they take two initial DataFrames and turn them into a single DataFrame. In this post I show that they do quite different things to your data under the hood, and that this can have a significant effect on your query performance.
Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters
Basic setup
This is the basic setup - we want to combine two DataFrames, df1 and df2.
import polars as pl

df1 = (
    pl.DataFrame(
        {
            "id": [0, 1],
            "values": ["a", "b"],
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪════════╡
│ 0 ┆ a │
│ 1 ┆ b │
└─────┴────────┘
df2 = (
    pl.DataFrame(
        {
            "id": [2, 3],
            "values": ["c", "d"],
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪════════╡
│ 2 ┆ c │
│ 3 ┆ d │
└─────┴────────┘
If we call any of concat, vstack or extend we get the following output:
shape: (4, 2)
┌─────┬────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪════════╡
│ 0 ┆ a │
│ 1 ┆ b │
│ 2 ┆ c │
│ 3 ┆ d │
└─────┴────────┘
So what’s the difference?
With two initial DataFrames the data sits in two different locations in memory. When we combine them into a new DataFrame there are three options:
- copy all the data to a single new location
- leave the data where it is and link the new DataFrame to the existing two locations in memory
- copy the data from one location and append it to the data in the other location
Note that in the last case, appending, there has to be enough space to append the data. If there isn’t, both are copied to a new location.
The three methods concat, vstack and extend use these three options:
- pl.concat([df1, df2]) copies all the data to a single new location
- df1.vstack(df2) doesn’t copy any data and just links the new DataFrame to the existing two locations in memory
- df1.extend(df2) copies the data from df2 and appends it to the data for df1
I’m simplifying things a little bit for this post but these are the basic paradigms.
Pros and cons?
There are obviously pros and cons of these different approaches:
- Copying all data to a new location is expensive. However, having the data in a single location makes subsequent queries faster and gives more consistent results in terms of timing.
- Not copying any data is very fast (perhaps sub-millisecond) but slows down subsequent queries.
- Appending the data from one location to the other is faster than copying both but it will be hard to predict when it won’t fit and both will need to be copied to a new location.
In my course I explore some relative timings of the different approaches in simple queries. I was surprised to see that vstack was faster than concat in some queries even when the query involved a groupby or sort after combining the DataFrames.
I was not surprised, however, to see that extend works really well when you are adding a small DataFrame to a large DataFrame, as you are only copying the data from the small DataFrame.
Also, bear in mind that when you read in multiple CSVs in eager or lazy mode there is a pl.concat that copies each DataFrame to a single location after each CSV file is read. If reading CSVs is a bottleneck it’s worth experimenting with not doing this copy with
pl.read_csv("path/to/*.csv",rechunk=False)
pl.scan_csv("path/to/*.csv",rechunk=False)
The best approach is very dependent on your problem, but I recommend comparing each of these methods if combining data is taking a lot of time in your pipeline.
Next steps
Want to know more about Polars for high performance data science? Then you can: