Too many conditional statements? Try this faster, more pythonic way.
#pandas #conditionals #if-else #numpy #python
We have all faced this: a continuous barrage of if’s and elif’s over an existing pandas DataFrame or NumPy array, written to build a new model feature or a metric for a report. Sometimes you just can’t avoid checking multiple complex conditions, one by one, to get where you need to.
Let's set up a toy example.
import pandas as pd
df = pd.DataFrame({"name": ["Rick","Morty","Beth","Summer","Jerry"],
"age": [70, 14, 35, 17, 35]})
print(df)
name age
0 Rick 70
1 Morty 14
2 Beth 35
3 Summer 17
4 Jerry 35
Let’s say we want to create labels for the Smith family members based on age. A classic way might be to write a function with all of the necessary conditions and then apply that function row-wise.
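def age_group(row):
    # Map one row's age to a bracket label; anything that matches
    # no bracket (e.g. a negative age) falls through to the default.
    if row.age >= 0 and row.age < 20:
        return '0 - 20 yrs'
    elif row.age >= 20 and row.age < 40:
        return '20 - 40 yrs'
    elif row.age >= 40 and row.age < 60:
        return '40 - 60 yrs'
    elif row.age >= 60:
        return '60+ yrs'
    else:
        return 'invalid age'

# axis=1 calls age_group once per row
df['apply_age'] = df.apply(age_group, axis=1)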
print(df)
name age apply_age
0 Rick 70 60+ yrs
1 Morty 14 0 - 20 yrs
2 Beth 35 20 - 40 yrs
3 Summer 17 0 - 20 yrs
4 Jerry 35 20 - 40 yrs
But this is ugly and slow, mostly because pd.DataFrame.apply, when used row-wise like this, is not vectorized: it calls the Python function once for every row. So is there a more pythonic way to do this, one that is readable, cleaner, and faster at the same time?
Introducing NumPy’s select function
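This is how it works. You pass numpy.select a list of conditions, a list of values, one for each of those conditions (the if and elif branches), and a default value (the else branch). The conditions are checked in order, and the first one that is true for an element decides its value.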
numpy.select(condlist, choicelist, default=0)
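To make the ordering concrete, here is a tiny sketch with made-up numbers (the array `x` and the labels are purely illustrative and not part of the example data). The two conditions overlap on purpose: 5 satisfies both, and it gets the value paired with the first matching condition. The element that matches neither condition gets the default.

```python
import numpy as np

# Illustrative values only, not taken from the DataFrame above
x = np.array([5, 15, 25])

# Overlapping conditions: 5 is both < 10 and < 20, so the first one wins
conditions = [x < 10, x < 20]
choices = ["small", "medium"]

labels = np.select(conditions, choices, default="large")
print(labels.tolist())  # ['small', 'medium', 'large']
```

This mirrors an if / elif / else chain: list the most specific condition first, exactly as you would order the branches.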
So how does it work in action?
import numpy as np
#Conditions
c1 = (df.age >= 0) & (df.age < 20)
c2 = (df.age >= 20) & (df.age < 40)
c3 = (df.age >= 40) & (df.age < 60)
c4 = (df.age >= 60)
#Choices
values = ['0 - 20 yrs', '20 - 40 yrs', '40 - 60 yrs', '60+ yrs']
#Default
default = 'invalid age'
df['select_age'] = np.select([c1,c2,c3,c4], values, default=default)
print(df[["name", "age", "select_age"]])
name age select_age
0 Rick 70 60+ yrs
1 Morty 14 0 - 20 yrs
2 Beth 35 20 - 40 yrs
3 Summer 17 0 - 20 yrs
4 Jerry 35 20 - 40 yrs
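As a sanity check, it may be worth confirming that the two approaches label every row identically. The sketch below rebuilds the toy DataFrame so it runs on its own; `label_age` is a hypothetical helper name that mirrors the `age_group` logic on a single age value, and the chained comparisons are just a shorter spelling of the same conditions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Rick", "Morty", "Beth", "Summer", "Jerry"],
                   "age": [70, 14, 35, 17, 35]})

def label_age(age):
    # Hypothetical helper: same brackets as age_group, applied to one age
    if 0 <= age < 20:
        return "0 - 20 yrs"
    elif 20 <= age < 40:
        return "20 - 40 yrs"
    elif 40 <= age < 60:
        return "40 - 60 yrs"
    elif age >= 60:
        return "60+ yrs"
    return "invalid age"

# Row-wise approach
by_apply = df.apply(lambda row: label_age(row.age), axis=1)

# Vectorized approach
age = df["age"]
conditions = [(age >= 0) & (age < 20),
              (age >= 20) & (age < 40),
              (age >= 40) & (age < 60),
              age >= 60]
values = ["0 - 20 yrs", "20 - 40 yrs", "40 - 60 yrs", "60+ yrs"]
by_select = np.select(conditions, values, default="invalid age")

# Both approaches should produce the same labels, row for row
same = by_apply.tolist() == by_select.tolist()
print(same)
```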
I hope you agree with me when I say this is significantly cleaner and more readable, especially when you imagine working with tens or hundreds of such conditions, each more complex than these!
But, as you might remember, I claimed earlier that this is also much faster than the traditional way. That is because it is vectorized, unlike pd.DataFrame.apply, which applies your function one row at a time.
How fast, you ask?
Let’s scale the data and test this out with %%timeit:
df = pd.concat([df]*10000) #Repeating the dataframe 10k times
%%timeit for apply_method
539 ms ± 3.48 ms per loop
%%timeit for np_select_method
3.84 ms ± 52.7 µs per loop
That’s about 140x faster at just 50k rows, and the gap tends to widen as the data size increases!
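The %%timeit lines are IPython cell magics, so they only work in a notebook. Outside one, the same comparison can be sketched with the standard-library `timeit` module. The names `apply_method` and `np_select_method` follow the labels used in the timings, but their bodies here are my reconstruction, not the original benchmark code. Timings depend on the machine and library versions, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"name": ["Rick", "Morty", "Beth", "Summer", "Jerry"],
                     "age": [70, 14, 35, 17, 35]})
# 5 rows repeated 10,000 times = 50,000 rows (index reset for tidiness)
df = pd.concat([base] * 10000, ignore_index=True)

def age_group(row):
    if row.age >= 0 and row.age < 20:
        return "0 - 20 yrs"
    elif row.age >= 20 and row.age < 40:
        return "20 - 40 yrs"
    elif row.age >= 40 and row.age < 60:
        return "40 - 60 yrs"
    elif row.age >= 60:
        return "60+ yrs"
    return "invalid age"

def apply_method(frame):
    # Row-wise: one Python function call per row
    return frame.apply(age_group, axis=1)

def np_select_method(frame):
    # Vectorized: each condition is evaluated on the whole column at once
    age = frame["age"]
    conditions = [(age >= 0) & (age < 20),
                  (age >= 20) & (age < 40),
                  (age >= 40) & (age < 60),
                  age >= 60]
    values = ["0 - 20 yrs", "20 - 40 yrs", "40 - 60 yrs", "60+ yrs"]
    return np.select(conditions, values, default="invalid age")

# Best of 3 runs; the fast method is averaged over 10 calls per run
apply_time = min(timeit.repeat(lambda: apply_method(df), number=1, repeat=3))
select_time = min(timeit.repeat(lambda: np_select_method(df), number=10, repeat=3)) / 10

print(f"apply:     {apply_time * 1000:.2f} ms per call")
print(f"np.select: {select_time * 1000:.2f} ms per call")
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes on the machine.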
So next time you are working with a ton of if’s and elif’s, know that there is a more pythonic way to do this, one your fellow data scientists will thank you for!