Concat, extend or vstack?

On the face of it, the concat, extend and vstack functions in Polars do the same job: they take two DataFrames and turn them into a single DataFrame. In this post I show that they do quite different things to your data under the hood, and that this can have a significant effect on your query performance.

Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course

Basic setup

This is the basic setup: we want to combine two DataFrames, df1 and df2.

import polars as pl

df1 = (
    pl.DataFrame(
        {
            "id":[0,1],
            "values":["a","b"]
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
└─────┴────────┘
df2 = (
    pl.DataFrame(
        {
            "id":[2,3],
            "values":["c","d"]
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘

If we call any of concat, vstack or extend we get the following output:

shape: (4, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘

So what’s the difference?

With two initial DataFrames the data sits in two different locations in memory. When we combine them into a new DataFrame there are three options:

copy all the data to a single new location
leave the data where it is and link the new DataFrame to the existing two locations in memory
copy the data from one location and append it to the data in the other location

Note that in the last case of appending there has to be enough space to append the data. If there isn’t then both are copied to a new location.

The three methods concat, vstack and extend map onto these three options:

pl.concat([df1, df2]) copies all the data to a single new location when rechunk=True is set (this was the default when this post was written; newer Polars releases may default to rechunk=False, so pass it explicitly if you rely on it)
df1.vstack(df2) doesn’t copy any data and just links the new DataFrame to the existing two locations in memory
df1.extend(df2) copies the data from df2 and appends it to the data for df1

I’m simplifying things a little for this post, but these are the basic paradigms. Under the hood, pl.concat carries out a series of .vstack operations (given a list of DataFrames) and then, when rechunk=True, does a rechunk operation to copy the data to a single location.

Pros and cons?

There are obviously pros and cons of these different approaches:

Copying all data to a new location is expensive. However, having the data in a single location makes subsequent queries faster and gives more consistent results in terms of timing.
Not copying any data is very fast (perhaps sub-millisecond) but slows down subsequent queries, because they have to work across separate locations in memory.
Appending the data from one location to the other is faster than copying both, but it is hard to predict when the appended data won’t fit and both will need to be copied to a new location.

In my course I explore some relative timings of the different approaches in simple queries. In general if you are going to do subsequent operations on a DataFrame then it’s normally worth copying the data to a single location with pl.concat. However, if you just want to combine the DataFrames to do something trivial - like checking the shape - then vstack is the way to go. If you are adding a small DataFrame to a large DataFrame then extend works really well as you are only copying the data from the small DataFrame.

The best approach depends heavily on your problem, but I recommend comparing each of these methods if combining data is taking a lot of time in your pipeline.


Next steps

Want to know more about Polars for high performance data science? Then you can:


Further Reading

What is a Polars expression?

Reading from S3 with Polars (or DeltaLake) using AWS SSO

Fitting linear models within Polars