
Concat, extend or vstack?

On the face of it, the concat, extend and vstack functions in Polars do the same job: they take two DataFrames and combine them into a single DataFrame. In this post I show that they do quite different things to your data under the hood, and that this can have a significant effect on your query performance.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters

Basic setup

This is the basic setup: we want to combine two DataFrames, df1 and df2.

import polars as pl

df1 = (
    pl.DataFrame(
        {
            "id":[0,1],
            "values":["a","b"]
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
└─────┴────────┘
df2 = (
    pl.DataFrame(
        {
            "id":[2,3],
            "values":["c","d"]
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘

If we call any of concat, vstack or extend we get the following output:

shape: (4, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘

So what’s the difference?

With two initial DataFrames the data sits in two different locations in memory. When we combine them into a new DataFrame there are three options:

  • copy all the data to a single new location
  • leave the data where it is and link the new DataFrame to the existing two locations in memory
  • copy the data from one location and append it to the data in the other location

Note that in the last case of appending there has to be enough space to append the data. If there isn’t then both are copied to a new location.

The three functions concat, vstack and extend correspond to these three options:

  • pl.concat([df1, df2], rechunk=True) copies all the data to a single new location (the default value of rechunk has differed between Polars versions, so pass it explicitly if you rely on this behaviour)
  • df1.vstack(df2) doesn’t copy any data and just links the new DataFrame to the existing two locations in memory
  • df1.extend(df2) copies the data from df2 and appends it to the data for df1, modifying df1 in place

I’m simplifying things a little bit for this post but these are the basic paradigms.

Pros and cons?

There are obviously pros and cons of these different approaches:

  • Copying all the data to a new location is expensive. However, having the data in a single location makes subsequent queries faster and their timings more consistent.
  • Not copying any data is very fast (perhaps sub-millisecond) but can slow down subsequent queries.
  • Appending the data from one location to the other is faster than copying both, but it is hard to predict when the appended data won’t fit and both will need to be copied to a new location.

In my course I explore some relative timings of the different approaches in simple queries. I was surprised to see that vstack was faster than concat in some queries even when the query involved a groupby or sort after combining the DataFrames.

I was not surprised, however, to see that extend works really well when you are adding a small DataFrame to a large DataFrame as you are only copying the data from the small DataFrame.

Also, bear in mind that when you read multiple CSVs in eager or lazy mode, a pl.concat copies the data to a single location after the CSV files are read. If reading CSVs is a bottleneck, it’s worth experimenting with skipping this copy by passing rechunk=False:

pl.read_csv("path/to/*.csv", rechunk=False)
pl.scan_csv("path/to/*.csv", rechunk=False)

The best approach is very dependent on your problem, but I recommend comparing each of these methods if combining data is taking a lot of time in your pipeline.



This post is licensed under CC BY 4.0 by the author.