Combining data with different schemas

This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters.

You’ve got a bunch of data files in your project and they all follow a consistent data schema 😊

You get a new file and see that from now on there will be some useful extra columns. How are you going to combine this file with the old stuff?? 😣

A vertical concatenation won’t work, as it expects every DataFrame to have the same columns and raises an error when they differ.

This is where diagonal concatenation in Polars comes in.

import polars as pl

# Old schema: year, exporter, importer
dfTrades2020 = pl.DataFrame(
    [
        {"year": 2020, "exporter": "China", "importer": "USA"},
        {"year": 2020, "exporter": "China", "importer": "USA"},
    ]
)
# New schema adds a value column
dfTrades2021 = pl.DataFrame(
    [
        {"year": 2021, "exporter": "China", "importer": "USA", "value": 10},
        {"year": 2021, "exporter": "China", "importer": "USA", "value": 100},
    ]
)
# Diagonal concatenation: keeps the union of the columns from both DataFrames
pl.concat([dfTrades2020, dfTrades2021], how="diagonal")

Diagonal concatenation appends your new records with their new columns, and fills the new columns with nulls for the old records to show that the data is missing. Problem solved.

Output of the diagonal concatenation

Learn more

Want to know more about Polars for high-performance data science and ML? Let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.