
Exploding a Polars pivot for feature engineering

In my ML pipelines these days I find myself replacing some of the simpler scikit-learn metrics, such as root-mean-squared-error, with my own hand-rolled Polars expressions. This approach saves me from copying data to a different format and keeps the normal advantages of Polars such as parallelisation, query optimisation and scaling to large datasets.
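For example, here is a minimal sketch of RMSE as a Polars expression (the preds_df frame and its y and y_pred columns are purely for illustration):

import polars as pl

# Hypothetical predictions frame, just to illustrate the pattern
preds_df = pl.DataFrame({"y": [3.0, 5.0, 2.5], "y_pred": [2.5, 5.0, 4.0]})

# RMSE as a single Polars expression: stays inside Polars, no copy to NumPy
rmse = preds_df.select(
    ((pl.col("y") - pl.col("y_pred")) ** 2).mean().sqrt().alias("rmse")
)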

Recently I was adding a new section on pivoting data to my Up & Running with Polars course (check it out here) when it struck me that the CountVectorizer approach from scikit-learn is based on a pivot. I decided to see how much effort it would take to re-implement it in Polars.

For anyone not familiar with CountVectorizer: it is a feature engineering technique that turns a collection of documents into a 2D array where each column corresponds to a word and each row corresponds to a document. By default each cell counts how often that word occurs in that document; with binary=True the value is instead 1 if the word is present and 0 otherwise, and it is this binary variant we re-create here. See below for an example of the output.
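For comparison, here is the scikit-learn version, with two made-up documents just to show the shape of the output:

from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents purely for illustration
docs = ["the cat sat", "the dog sat down"]

# binary=True gives the 1/0 presence matrix we re-create in Polars below
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'down' 'sat' 'the']
print(X.toarray())  # [[1 0 0 1 1], [0 1 1 1 1]]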

Getting some fake data

I needed some fake text data for this exercise, so I asked ChatGPT to generate a small dataset of fake news articles along with publication names and titles. It delivered a truly fake dataset with articles from The Daily Deception and the Faux News Network:

import polars as pl

fake_news_df = pl.DataFrame(
    {
        'publication': [
            'The Daily Deception', 'Faux News Network', 'The Fabricator',
            'The Misleader', 'The Hoax Herald',
        ],
        'title': [
            'Scientists Discover New Species of Flying Elephant',
            'Aliens Land on Earth and Offer to Solve All Our Problems',
            'Study Shows That Eating Pizza Every Day Leads to Longer Life',
            'New Study Finds That Smoking is Good for You',
            "World's Largest Iceberg Discovered in Florida",
        ],
        'text': [
            'In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.',

            'In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.',

            'A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn\'t. The study has been hailed as a breakthrough in the field of nutrition.',

            'In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn\'t. The findings have sparked controversy among health experts.',

            'In a bizarre turn of events, the world\'s largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have',
        ],
    }
)

Split, explode and pivot

The first thing we need to do is convert the text to lowercase and split each article into separate words. We do this with expressions from the str namespace. We also add a column called placeholder with a value of 1. These are the 1s that will later populate our feature matrix:

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
)
shape: (5, 4)
┌─────────────────────┬───────────────────────────────┬──────────────────────────────┬─────────────┐
│ publication         ┆ title                         ┆ text                         ┆ placeholder │
│ ---                 ┆ ---                           ┆ ---                          ┆ ---         │
│ str                 ┆ str                           ┆ list[str]                    ┆ i32         │
╞═════════════════════╪═══════════════════════════════╪══════════════════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New       ┆ ["in", "a", … "zoology."]    ┆ 1           │
│                     ┆ Species …                     ┆                              ┆             │
│ Faux News Network   ┆ Aliens Land on Earth and      ┆ ["in", "a", … "out."]        ┆ 1           │
│                     ┆ Offer t…                      ┆                              ┆             │
│ The Fabricator      ┆ Study Shows That Eating Pizza ┆ ["a", "new", … "nutrition."] ┆ 1           │
│                     ┆ Ev…                           ┆                              ┆             │
│ The Misleader       ┆ New Study Finds That Smoking  ┆ ["in", "a", … "experts."]    ┆ 1           │
│                     ┆ is …                          ┆                              ┆             │
│ The Hoax Herald     ┆ World's Largest Iceberg       ┆ ["in", "a", … "have"]        ┆ 1           │
│                     ┆ Discover…                     ┆                              ┆             │
└─────────────────────┴───────────────────────────────┴──────────────────────────────┴─────────────┘

By splitting the string values we turn the string column into a column with the Polars pl.List(str) dtype. In a previous post I showed how the pl.List dtype allows fast operations because under the hood each row is a Polars Series rather than a slow Python list.
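As a quick, hedged illustration of why that matters, the list namespace lets us operate on those per-row Series directly, for example to count the words in each article (list.len is named list.lengths in some older Polars versions):

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" ")
    )
    # operate on each row's Series via the list namespace, no Python loop
    .select(
        pl.col("title"),
        pl.col("text").list.len().alias("word_count")
    )
)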

However, it would still be better to stretch out that pl.List column to have a row for each element of each list. At the same time, we want to keep the metadata of the original article such as the publication name and title.

We do this stretching out by calling explode on the text column to give us a row for each element of each list:

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
)
shape: (306, 4)
┌─────────────────────┬───────────────────────────────────┬────────────────┬─────────────┐
│ publication         ┆ title                             ┆ text           ┆ placeholder │
│ ---                 ┆ ---                               ┆ ---            ┆ ---         │
│ str                 ┆ str                               ┆ str            ┆ i32         │
╞═════════════════════╪═══════════════════════════════════╪════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New Species … ┆ in             ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ a              ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ groundbreaking ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ discovery,     ┆ 1           │
│ …                   ┆ …                                 ┆ …              ┆ …           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ this           ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ size           ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ could          ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ have           ┆ 1           │
└─────────────────────┴───────────────────────────────────┴────────────────┴─────────────┘

Note that the explode method can be used with the streaming engine in Polars, so you can use it on larger-than-memory datasets.
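As a sketch of what that could look like (articles.parquet is a hypothetical file, and the exact collect argument depends on your Polars version):

# Hypothetical larger-than-memory source file
lazy_exploded = (
    pl.scan_parquet("articles.parquet")
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
)

# Recent Polars releases use collect(engine="streaming");
# older ones use collect(streaming=True)
result = lazy_exploded.collect(engine="streaming")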

Now it’s time to transform the text column so that we have a column for each distinct word and a row for each article. We do this by calling pivot with the metadata columns (publication and title) as the index, the text column as the on argument that supplies the new column names (this argument was called columns in older Polars versions) and the placeholder column as the values. As a word can occur more than once in the same article we also pass aggregate_function="first" so the repeats collapse into a single 1.

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    .pivot(
        on="text",
        index=["publication", "title"],
        values="placeholder",
        # a word can occur many times in one article; "first" keeps a single 1
        aggregate_function="first",
        sort_columns=True
    )
)
shape: (5, 166)
┌─────────────────────┬────────────────────┬────────┬──────┬───┬─────────┬───────┬──────┬──────────┐
│ publication         ┆ title              ┆ 10,000 ┆ 100  ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ ---                 ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---     ┆ ---   ┆ ---  ┆ ---      │
│ str                 ┆ str                ┆ i32    ┆ i32  ┆   ┆ i32     ┆ i32   ┆ i32  ┆ i32      │
╞═════════════════════╪════════════════════╪════════╪══════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists         ┆ null   ┆ 1    ┆ … ┆ null    ┆ null  ┆ null ┆ 1        │
│                     ┆ Discover New       ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│                     ┆ Species …          ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ Faux News Network   ┆ Aliens Land on     ┆ null   ┆ null ┆ … ┆ null    ┆ null  ┆ null ┆ null     │
│                     ┆ Earth and Offer t… ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Fabricator      ┆ Study Shows That   ┆ 1      ┆ null ┆ … ┆ null    ┆ 1     ┆ null ┆ null     │
│                     ┆ Eating Pizza Ev…   ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Misleader       ┆ New Study Finds    ┆ null   ┆ null ┆ … ┆ null    ┆ null  ┆ 1    ┆ null     │
│                     ┆ That Smoking is …  ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Hoax Herald     ┆ World's Largest    ┆ null   ┆ 1    ┆ … ┆ 1       ┆ null  ┆ null ┆ null     │
│                     ┆ Iceberg Discover…  ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
└─────────────────────┴────────────────────┴────────┴──────┴───┴─────────┴───────┴──────┴──────────┘

Note that we use the sort_columns argument to get the word columns in lexicographic order.

The last stage is to replace the null values with 0s so it’s clear what we’re doing with them:

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    .pivot(
        on="text",
        index=["publication", "title"],
        values="placeholder",
        aggregate_function="first",
        sort_columns=True
    )
    .fill_null(0)
)
shape: (5, 166)
┌─────────────────────┬─────────────────────┬────────┬─────┬───┬─────────┬───────┬──────┬──────────┐
│ publication         ┆ title               ┆ 10,000 ┆ 100 ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ ---                 ┆ ---                 ┆ ---    ┆ --- ┆   ┆ ---     ┆ ---   ┆ ---  ┆ ---      │
│ str                 ┆ str                 ┆ i32    ┆ i32 ┆   ┆ i32     ┆ i32   ┆ i32  ┆ i32      │
╞═════════════════════╪═════════════════════╪════════╪═════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists Discover ┆ 0      ┆ 1   ┆ … ┆ 0       ┆ 0     ┆ 0    ┆ 1        │
│                     ┆ New Species …       ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ Faux News Network   ┆ Aliens Land on      ┆ 0      ┆ 0   ┆ … ┆ 0       ┆ 0     ┆ 0    ┆ 0        │
│                     ┆ Earth and Offer t…  ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Fabricator      ┆ Study Shows That    ┆ 1      ┆ 0   ┆ … ┆ 0       ┆ 1     ┆ 0    ┆ 0        │
│                     ┆ Eating Pizza Ev…    ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Misleader       ┆ New Study Finds     ┆ 0      ┆ 0   ┆ … ┆ 0       ┆ 0     ┆ 1    ┆ 0        │
│                     ┆ That Smoking is …   ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Hoax Herald     ┆ World's Largest     ┆ 0      ┆ 1   ┆ … ┆ 1       ┆ 0     ┆ 0    ┆ 0        │
│                     ┆ Iceberg Discover…   ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
└─────────────────────┴─────────────────────┴────────┴─────┴───┴─────────┴───────┴──────┴──────────┘

Of course there are still differences from the output of CountVectorizer - for example, CountVectorizer returns a sparse matrix by default. In addition, CountVectorizer uses a more sophisticated regex to separate the words, but we can reproduce this by using str.extract_all with CountVectorizer's default token pattern instead of str.split:

(
    fake_news_df
    .with_columns(
        # this raw-string regex mirrors CountVectorizer's default token_pattern
        pl.col("text").str.to_lowercase().str.extract_all(r"(?u)\b\w\w+\b"),
        pl.lit(1).alias("placeholder")
    )
)
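From there, if we want to hand the feature matrix to scikit-learn, one hedged option (with count_df standing in for the pivoted, null-filled DataFrame above) is to drop the metadata columns and convert to NumPy:

# count_df is a stand-in name for the pivoted, null-filled DataFrame above
feature_matrix = count_df.drop("publication", "title").to_numpy()

# keep the word order so we can map feature columns back to words
word_columns = [c for c in count_df.columns if c not in ("publication", "title")]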

So here we’ve seen how we can quickly implement a classic NLP feature engineering method using Polars. I’m sure we’ll see many more examples of Polars as an all-purpose data workhorse in the years to come.

Want to get going with Polars? This post is an extract from my Up & Running with Polars course - learn more here or check out the preview of the first chapters.

Next steps

Want to know more about Polars for high performance data science? Then check out the full Up & Running with Polars course.

This post is licensed under CC BY 4.0 by the author.