Too many conditional statements? Try this faster, more pythonic way.

#pandas #conditionals #if-else #numpy #python

Mar 24, 2023

We have all faced this: a continuous barrage of ifs and elifs over an existing pandas DataFrame or NumPy array, all to build a new model feature or a metric for a report. Sometimes you just can’t avoid checking multiple complex conditions, one by one, to get where you need to go.

Let's set up a toy example.

import pandas as pd

df = pd.DataFrame({"name": ["Rick","Morty","Beth","Summer","Jerry"], 
                   "age": [70, 14, 35, 17, 35]})
print(df)

     name  age
0    Rick   70
1   Morty   14
2    Beth   35
3  Summer   17
4   Jerry   35

Let’s say we want to create labels for the Smith family members based on age. A classic way might be to write a function with all of the necessary conditions and then apply that function row-wise.

def age_group(row):
    if row.age >= 0 and row.age < 20:
        return '0 - 20 yrs'
    elif row.age >= 20 and row.age < 40:
        return '20 - 40 yrs'
    elif row.age >= 40 and row.age < 60:
        return '40 - 60 yrs'
    elif row.age >= 60:
        return '60+ yrs'
    else:
        return 'invalid age'
    
df['apply_age'] = df.apply(age_group, axis=1)  # axis=1 calls age_group once per row
print(df)

     name  age    apply_age
0    Rick   70      60+ yrs
1   Morty   14   0 - 20 yrs
2    Beth   35  20 - 40 yrs
3  Summer   17   0 - 20 yrs
4   Jerry   35  20 - 40 yrs

But this is ugly and slow, mostly because pd.DataFrame.apply with axis=1 is not vectorized: it calls your Python function once for every row. So is there a more pythonic way to do this, one that is cleaner, more readable, and faster at the same time?

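Introducing NumPy’s select function.

This is how it works. You pass numpy.select a list of conditions, a list of values for each of those conditions (if, elif), and a default value (else). When more than one condition is true for the same element, the first one in the list wins, just as in an if/elif chain.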

numpy.select(condlist, choicelist, default)
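To make the ordering rule concrete, here is a minimal sketch on a made-up array. The `scores` values and the labels are purely illustrative and are not part of the example in this post.

```python
import numpy as np

# Hypothetical data, purely for illustration
scores = np.array([95, 72, 40, -5])

# The conditions overlap on purpose: 95 satisfies all three of them
conditions = [scores >= 90, scores >= 50, scores >= 0]
labels = ["high", "medium", "low"]

# The first true condition wins; elements matching none get the default
result = np.select(conditions, labels, default="invalid")
print(result)  # ['high' 'medium' 'low' 'invalid']
```

Because the first match wins, overlapping conditions should be listed from most specific to least specific, exactly as you would order the branches of an if/elif chain.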

So how does it work in action?

import numpy as np

# Conditions, checked in order (the first match wins)
c1 = (df.age >= 0) & (df.age < 20)
c2 = (df.age >= 20) & (df.age < 40)
c3 = (df.age >= 40) & (df.age < 60)
c4 = (df.age >= 60)

# Choices, one label per condition
values = ['0 - 20 yrs', '20 - 40 yrs', '40 - 60 yrs', '60+ yrs']

# Default, used when no condition matches
default = 'invalid age'

df['select_age'] = np.select([c1, c2, c3, c4], values, default=default)
print(df)

     name  age   select_age
0    Rick   70      60+ yrs
1   Morty   14   0 - 20 yrs
2    Beth   35  20 - 40 yrs
3  Summer   17   0 - 20 yrs
4   Jerry   35  20 - 40 yrs

I hope you agree that this is significantly cleaner and more readable, especially when you imagine working with tens or hundreds of such conditions, each more complex than these!

But, as you might remember, I claimed earlier that this is also much faster than the traditional way. That is because np.select is vectorized: it works on whole columns at once, unlike pd.DataFrame.apply, which calls your function one row at a time.
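As a small illustration of what "vectorized" means here, a single comparison expression on a pandas Series is evaluated for every row in one call and returns a boolean mask. The standalone `ages` Series below is just the age column from the toy example, rebuilt so the snippet runs on its own.

```python
import pandas as pd

# The ages from the toy example, as a standalone Series
ages = pd.Series([70, 14, 35, 17, 35], name="age")

# One expression covers every row at once and yields a boolean mask
under_20 = (ages >= 0) & (ages < 20)
print(under_20.tolist())  # [False, True, False, True, False]
```

np.select takes a list of such masks and picks a value for each row, so no Python-level loop over rows is needed.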

How fast, you ask?

Let’s scale the data and time both approaches with %%timeit:

df = pd.concat([df]*10000)  # Repeat the DataFrame 10k times (5 rows -> 50k rows)

%%timeit for apply_method
539 ms ± 3.48 ms per loop

%%timeit for np_select_method
3.84 ms ± 52.7 µs per loop

That’s roughly 140x faster at just 50k rows, and the absolute time saved keeps growing as the data size increases!
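If you want to reproduce the comparison outside a notebook, the sketch below uses the standard-library timeit module instead of the %%timeit magic. The helper names `apply_method` and `np_select_method` mirror the labels above, but their bodies are my assumption of what was timed, and the numbers you get will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Same toy data as in the post, repeated to 50k rows
base = pd.DataFrame({"name": ["Rick", "Morty", "Beth", "Summer", "Jerry"],
                     "age": [70, 14, 35, 17, 35]})
big = pd.concat([base] * 10000, ignore_index=True)

def age_group(row):
    if 0 <= row.age < 20:
        return '0 - 20 yrs'
    elif 20 <= row.age < 40:
        return '20 - 40 yrs'
    elif 40 <= row.age < 60:
        return '40 - 60 yrs'
    elif row.age >= 60:
        return '60+ yrs'
    return 'invalid age'

def apply_method(frame):
    # Row-wise: calls age_group once per row
    return frame.apply(age_group, axis=1)

def np_select_method(frame):
    # Vectorized: each condition is evaluated on the whole column
    age = frame["age"]
    conditions = [(age >= 0) & (age < 20), (age >= 20) & (age < 40),
                  (age >= 40) & (age < 60), age >= 60]
    values = ['0 - 20 yrs', '20 - 40 yrs', '40 - 60 yrs', '60+ yrs']
    return np.select(conditions, values, default='invalid age')

# Sanity check: both approaches should produce identical labels
same = apply_method(big).tolist() == np_select_method(big).tolist()
print("identical results:", same)

# Average seconds per call over a few runs
t_apply = timeit.timeit(lambda: apply_method(big), number=3) / 3
t_select = timeit.timeit(lambda: np_select_method(big), number=20) / 20
print(f"apply:     {t_apply * 1e3:.1f} ms per call")
print(f"np.select: {t_select * 1e3:.2f} ms per call")
```

Checking that both methods return the same labels before timing them is a cheap way to make sure the speedup is not coming from a subtly different computation.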

So next time you are working with a ton of ifs and elifs, know that there is a more pythonic way to do this, one your fellow data scientists will thank you for!

Data Science Philosophy
