Home Filtering one df by another
Post
Cancel

Filtering one df by another

One of the most common questions we get on the Polars discord is how to filter rows in one dataframe by values in another.

I think people don’t realise this is a basically a join because they don’t want any of the columns from the other dataframe.

The good news is that functionality exists - it’s called either:

  • a semi join if you want to keep rows also found in the other dataframe or
  • an anti join if you want to remove rows also found in the other dataframe.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters

Here’s the example I made for the Polars user guide…

For a rental car company we have a cars dataframe with cars we own and their unique IDs…

1
2
3
4
5
6
df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)
1
2
3
4
5
6
7
8
9
10
shape: (3, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
│ b   ┆ toyota │
│ c   ┆ bmw    │
└─────┴────────┘

along with a repairs dataframe that tracks repairs made to each car.

1
2
3
4
5
6
df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
1
2
3
4
5
6
7
8
9
shape: (2, 2)
┌─────┬──────┐
│ id  ┆ cost │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ c   ┆ 100  │
│ c   ┆ 200  │
└─────┴──────┘

We want to find either:

  • all cars that have had repairs (the semi join) or
  • all cars that had not had repairs (the anti join)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
shape: (1, 2)
┌─────┬──────┐
 id   make 
 ---  ---  
 str  str  
╞═════╪══════╡
 c    bmw  
└─────┴──────┘

df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
shape: (2, 2)
┌─────┬────────┐
 id   make   
 ---  ---    
 str  str    
╞═════╪════════╡
 a    ford   
 b    toyota 
└─────┴────────┘

Notice that with these joins we don’t end up with any columns from the repairs dataframe.

The semi and anti joins have standard join advantages. For example you can do these joins based on conditions over multiple columns. In Polars you can do them in Polars lazy mode and with streaming for large data.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters )

Next steps

Want to know more about Polars for high performance data science? Then you can:

This post is licensed under CC BY 4.0 by the author.