Table of contents:
1. Introduction
2. Syntax
2.1 Adding more groups/levels
2.2 Adding more variables/features
3. Grouping
4. Aggregation
4.1 In-built aggregation methods
4.2 Custom functions with pandas apply
4.3 Multiple aggregations using agg method
4.4 Custom functions with agg method
5. Transform
6. Advanced Usage
6.1 Sequential/local grouping of a dataframe
6.2 Re-indexing to a fixed date range for each group
7. Other ways of grouping data
7.1 Using collections' defaultdict
7.2 Using numpy's split function
7.3 Using itertools' groupby
8. References
1. Introduction
"Groupby" is probably one of the most basic data pre-processing steps that a Data Scientist should master as soon as possible. Interestingly enough, you find it in almost every scripting language that claims to work well with databases.
Most of us would have been introduced to the SQL GROUPBY statement which allows a user to summarize or aggregate a given dataset. Python brings the pandas groupby method to the table, which is highly pythonic in its syntax and equally versatile, if not more. But the utility of a groupby is much more than just aggregation. In this notebook, I will showcase a few examples, where you could really exploit this method for various other use cases.
-- SQL query to group by Col1 and Col2
-- and get the mean of Col3 and the sum of Col4 respectively
SELECT Col1, Col2, AVG(Col3), SUM(Col4)
FROM Table
GROUP BY Col1, Col2
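Just as a teaser of what's to come, a rough pandas counterpart of this query might look like the following minimal sketch (the dataframe and its Col1..Col4 names are made up purely to mirror the query):
import pandas as pd

#Tiny made-up table mirroring the SQL example
df = pd.DataFrame({'Col1': ['x', 'x', 'y'],
                   'Col2': ['a', 'b', 'a'],
                   'Col3': [1.0, 2.0, 3.0],
                   'Col4': [10, 20, 30]})
#GROUP BY Col1, Col2 with AVG(Col3) and SUM(Col4)
df.groupby(['Col1', 'Col2']).agg({'Col3': 'mean', 'Col4': 'sum'}).reset_index()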
Before we can start writing code, let's explore the basics behind a groupby operation. The core concept behind any groupby operation is a three-step process called Split-Apply-Combine.
Split: Splitting the data into groups based on some criteria
Apply: Applying a function to each group independently
Combine: Combining the results into a data structure
Here is a diagram to make this more intuitive.
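To make the three steps concrete in code, here is a minimal, purely illustrative sketch that performs each step by hand on a tiny made-up dataframe; the groupby examples that follow do all of this in a single line.
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

#Split: one sub-dataframe per key
pieces = {k: g for k, g in df.groupby('key')}
#Apply: a function on each piece independently
sums = {k: g['val'].sum() for k, g in pieces.items()}
#Combine: collect the results back into a single structure
pd.Series(sums, name='val')   #a -> 3, b -> 6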
2. Syntax
The syntax for using the groupby method in pandas comprises two parts: the first is a grouper object and the second is the aggregator. The general structure looks like the following:
df.groupby(['groups'])['cols'].aggregations()
|____________________||____________________|
          |                     |
   grouper (split)    aggregation (apply & combine)
This will become clearer as we work through an example on actual data. So, let's start by loading a dataset to work with. For this notebook, I will use the Titanic dataset, which can either be downloaded from Kaggle or loaded directly using the visualization library Seaborn.
#Load dependencies
import pandas as pd
import numpy as np
import seaborn as sns
titanic = sns.load_dataset('titanic').dropna() #drop missing data
print(titanic.shape)
titanic.head()
Question: What is the total number of passengers from each class who survived?
#Step 1: Create grouper
grouper = titanic.groupby(['class'])
#Step 2: Filter column and apply aggregation
grouper['survived'].sum().reset_index()
You would usually do this in a single statement as the following:
titanic.groupby(['class'])['survived'].sum().reset_index()
Note: reset_index() brings the grouping columns back from the index into regular columns of the dataframe.
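Alternatively, you can pass as_index=False to groupby so that the grouping columns are kept as regular columns and no reset_index() is needed; this should give the same result as above.
#Equivalent result without reset_index()
titanic.groupby(['class'], as_index=False)['survived'].sum()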
Question: What was the average fare for passengers from each town?
titanic.groupby(['embark_town'])['fare'].mean().reset_index()
2.1 Adding more groups/levels
We can pass a list of features to groupby() to increase the levels for grouping the data, as below.
Question: What was the average fare for male vs female passengers from each town?
titanic.groupby(['embark_town','sex'])['fare'].mean().reset_index()
2.2 Adding more variables/features
Similarly, we can select a list of variables to which the aggregate function should be applied.
Question: What was the average fare and age for male vs female passengers from each town?
titanic.groupby(['embark_town','sex'])[['fare', 'age']].mean().reset_index()
3. Grouping
Before we go further and try other, more complex scenarios, let's try to understand the data structures we are working with, so that we can be much more creative with our approaches and get a deeper understanding of how they work.
You can imagine the pipeline of the above code as follows:
Step 1: Create a grouper object with titanic.groupby(['embark_town']), which splits the data into the relevant groups
Step 2: Select the column 'fare' from each of those groups
Step 3: Apply mean() on this column for each of the groups, combine the results, and return the aggregated dataset
Let's see what the grouper object looks like for a better understanding.
grouper = titanic.groupby(['embark_town'])
#Print dtype for each of the elements in the grouper
[(type(k),type(g)) for k,g in grouper]
[(str, pandas.core.frame.DataFrame),
(str, pandas.core.frame.DataFrame),
(str, pandas.core.frame.DataFrame)]
So, this shows that if we iterate over the grouper object, each element is nothing but a tuple of the group key and a dataframe.
Let's see what each of those is.
#Print shape for the dataframe groups
[(k,g.shape) for k,g in grouper]
[('Cherbourg', (65, 15)), ('Queenstown', (2, 15)), ('Southampton', (115, 15))]
The key for each of the tuples/groups is the value from the grouper column (in this case embark_town), and the value is just the complete dataframe filtered for that key! If we print one of the dataframes from this grouper, you can see that all the rows in this slice of data contain Queenstown as the embark_town, as shown below.
print(list(grouper)[1][0]) #print key
list(grouper)[1][1] #print dataframe
Similarly, let's see what the grouper object looks like for multiple grouping features. The 'key' in this case is just a tuple with all the group combinations, which, after aggregation, gets set as the index of the final output.
grouper = titanic.groupby(['embark_town','sex'])
#Print dtype for each of the elements in the grouper
[(k,g.shape) for k,g in grouper]
[(('Cherbourg', 'female'), (34, 15)),
(('Cherbourg', 'male'), (31, 15)),
(('Queenstown', 'female'), (1, 15)),
(('Queenstown', 'male'), (1, 15)),
(('Southampton', 'female'), (53, 15)),
(('Southampton', 'male'), (62, 15))]
#The grouping columns become the index after the groupby aggregation
titanic.groupby(['embark_town','sex'])['age'].mean()
embark_town sex
Cherbourg female 35.352941
male 39.774194
Queenstown female 33.000000
male 44.000000
Southampton female 30.952830
male 37.595484
Name: age, dtype: float64
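Since the grouping columns form a (multi-)index here, you can also reshape the result instead of resetting the index; for example, unstack() pivots the innermost level (sex) into columns, which is often handier for side-by-side comparison.
#Pivot the 'sex' level of the index into columns
titanic.groupby(['embark_town', 'sex'])['age'].mean().unstack()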
4. Aggregation
There are multiple ways of aggregating your grouper object.
First, understand that you can perform a single aggregation on multiple columns, OR multiple aggregations on different columns, or a combination of both.
Second, you can use apply() or agg() to write your own custom aggregators, but pandas makes it much easier by providing a ton of in-built aggregators such as sum() or mean(), as we saw in the examples above.
Let's try to go through a few scenarios and explore how we can use these aggregations.
4.1 In-built aggregation methods
Pandas provides a ton of in-built aggregation methods to quickly get the statistics you are looking for, such as sum(), mean(), median(), min(), max(), count(), std() and describe(); more details on these can be found in the official pandas documentation.
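As a quick illustration, here are a couple of the in-built methods applied to a grouper; describe() is a convenient one that bundles several statistics at once.
#A few of the in-built aggregations on the grouper
titanic.groupby(['class'])['fare'].median()
titanic.groupby(['class'])['age'].count()
#describe() returns count, mean, std, min, quartiles and max in one go
titanic.groupby(['class'])['fare'].describe()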
4.2 Custom functions with pandas apply
This is by far the most popular way of applying a custom function to a dataframe, or in this case, applying it to each of the dataframe slices for the groups defined by the grouper. The behavior of the apply() method with groupby is similar to the standard one.
You can apply it to each row (or column) of a dataframe input (if you have more than one column for aggregation) or to a series (if you have a single column for aggregation). Within the function, you can either work directly with the individual series or just write your own lambda function. Here are a few ways of using the apply function.
Question: Get the unique set of ages for each age category (the who column) from each town.
titanic.groupby(['embark_town', 'who'])['age'].apply(set).reset_index()
Question: Get the range (min-max) of ages for each age category (who
column) from each town.
titanic.groupby(['embark_town', 'who'])['age'].apply(lambda x: x.max()-x.min()).reset_index()
Question: Get the mean fare-by-age ratio for each age category (who
column) from each town.
titanic.groupby(['embark_town', 'who']).apply(lambda x: (x['fare']/x['age']).mean())
embark_town who
Cherbourg man 3.146083
woman 3.614103
Queenstown man 2.045455
woman 2.727273
Southampton child 28.956893
man 1.410745
woman 2.593897
dtype: float64
4.3 Multiple aggregations using agg method
Sooner or later, you will find it necessary to apply multiple aggregations over multiple columns at once. This is where the agg() method comes in. Here is a quick example of how you can use multiple in-built functions over multiple columns at once.
The general way to do this is to create a dictionary with the requirements and pass it to the agg() function. There are a few ways to structure the dictionary:
##Single function per column
{
'column1': 'function1',
'column2': 'function2'
}
##Multiple functions per column
{
'column1': ['function1', 'function2'],
'column2': ['function3', 'function4']
}
Question: Get the mean of fare, AND median of age for each age category (who
column) from each town
#Define aggregations as a dictionary
g = {'fare':'mean',
'age':'median'
}
titanic.groupby(['embark_town', 'who']).agg(g).reset_index()
Question: Get the sum & mean of fare, AND median, min, and max of age for each age category (who
column) from each town
#Define aggregations as a dictionary
g = {'fare':['sum', 'mean'],
'age':['median', 'min', 'max']
}
titanic.groupby(['embark_town', 'who']).agg(g).reset_index()
As of pandas >= 0.25, another way to use agg is named aggregation, where each output column is defined as new_name=('column', 'function').
Let's demonstrate that with an example.
Question: Get the sum & mean of fare, AND min and max of age for each age category (who
column) from each town, but rename columns
#Define aggregations directly as columns and tuples
titanic.groupby(['embark_town', 'who']).agg(A=('fare', 'sum'),
B=('fare', 'mean'),
C=('age', 'min'),
D=('age', 'max')).reset_index()
4.4 Custom functions with agg method
As you might expect, simply swapping the in-built aggregate names for lambda (or any custom) functions lets you apply your own logic to specific columns. Here is an example.
#Define aggregations as a dictionary
g = {'fare':lambda x: x.sum(),
'age' :lambda x: x.max()
}
titanic.groupby(['embark_town', 'who']).agg(g).reset_index()
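You can also mix in-built names and lambdas in the same dictionary, should you need both; note that the lambda columns get a generic '<lambda>' label in the output.
#Mix in-built aggregations and custom lambdas
g = {'fare': ['mean', lambda x: x.max() - x.min()],
     'age' : 'median'
     }
titanic.groupby(['embark_town', 'who']).agg(g).reset_index()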
5. Transform
Apart from just aggregating, you can use groupby to transform columns based on the grouper object. This requires the transform() function, which returns the same number of rows as the original dataset, but with the function applied per group. Let's consider the following example.
Question: Create a new column that returns the average fare for the age group (who
column) the passenger belongs to.
titanic.groupby('who')['fare'].transform(lambda x: x.mean())
1 88.817429
3 88.817429
6 69.821026
10 77.379485
11 88.817429
...
871 88.817429
872 69.821026
879 88.817429
887 88.817429
889 69.821026
Name: fare, Length: 182, dtype: float64
Notice that the output series is the length of the original titanic dataframe, but contains only 3 unique values [88.8, 69.8, 77.3], one for each of ['woman', 'man', 'child']. This makes the grouping object highly versatile in the way you would use it for data preprocessing.
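To actually create the new column asked for in the question, you would typically assign the transformed series back to the dataframe; the column name mean_fare_by_who below is just an illustrative choice, and the in-built function can be passed as a string instead of a lambda.
#Assign the per-group mean fare back as a new column
titanic['mean_fare_by_who'] = titanic.groupby('who')['fare'].transform('mean')
titanic[['who', 'fare', 'mean_fare_by_who']].head()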
6. Advanced Usage
Let's introduce a few advanced cases where you end up using groupby for data preprocessing.
6.1 Sequential/local grouping of a dataframe
The grouping key doesn't need to come from the dataframe itself. As long as its length matches the number of rows in the dataframe, you can pass any sequence to groupby to group the rows by.
df = pd.DataFrame({'A':[1,2,3,4,8,10,12,13],
'B':[1,2,2,3,1,3,2,3]
})
#custom grouping
even_odd = ['even' if i%2==0 else 'odd' for i in df['A']]
df.groupby(even_odd)['B'].mean()
even 2.2
odd 2.0
Name: B, dtype: float64
Question: Get the sum of the value column of the given dataframe based on sequentially occurring groups in the category column, i.e. in [1,1,2,2,1,1] the first run of 1's should be a separate group from the second run of 1's.
We can solve this by creating a custom grouper: shift the category column by 1 and compare it with the original. The comparison is True wherever the value changes, and taking a cumsum over these booleans assigns a new group id every time a new run starts. Here is the solution for a similar problem I solved on Stack Overflow.
df = pd.DataFrame({'A':[1,1,2,2,2,1,1,3,3], #<- column to group on
'B':[1,7,2,4,1,8,2,1,3] #<- column to aggregate
})
grouper = (df['A']!=df['A'].shift()).cumsum()
df.groupby(grouper).agg({'A':'mean','B':'sum'}).reset_index(drop=True)
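For the dataframe above this should produce four groups, since the first run of 1's and the later run of 1's are kept separate; the result should look roughly like:
     A   B
0  1.0   8
1  2.0   7
2  1.0  10
3  3.0   4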
6.2 Re-indexing to a fixed date range for each group
Question: A dataframe contains rows for only a few dates for each id. The goal is to re-index the dataframe to a fixed date range, for each id individually, and fill in the missing data with 0 values.
Here we can create a custom index using pandas.date_range. Then, after setting the original date column as the index, we can apply pandas.DataFrame.reindex along with groupby on the id column to reindex each group to the new date range, filling missing values with 0.
d = {'id': [11, 11, 11, 11, 13, 13, 13],
'date': ['2017-06-01','2017-06-03','2017-06-05','2017-06-06','2017-06-01','2017-06-02','2017-06-07'],
'value': [1, 7, 8, 2, 9, 2, 11]
}
df = pd.DataFrame(d)
df['date'] = pd.to_datetime(df['date'])
print("Input dataframe:")
df
#custom date range
idx = pd.date_range('2017-06-01','2017-06-07')
#set original date column as index
df.set_index('date', inplace=True)
#groupby and apply pd.DataFrame.reindex to apply the new index, filling missing values with 0
df.groupby('id').apply(pd.DataFrame.reindex, idx, fill_value=0).drop(columns='id').reset_index()
7. Other ways of grouping data
Here I discuss 3 ways that are popularly used to group data, depending on the data structures and libraries you are already working with:
Grouping using collections.defaultdict
Using numpy.split() to group an array
Chunking into groups using itertools.groupby()
Let's say we have a list of tuples with keys and values that we need to group.
data = list(zip(np.random.randint(0,4,(10,)), np.random.randint(0,100,(10,))))
print(data)
[(1, 41), (1, 30), (2, 70), (3, 82), (2, 68), (0, 18), (3, 97), (1, 37), (3, 8), (0, 51)]
7.1 Using collections' defaultdict
A useful way of grouping data is to use defaultdict. A defaultdict can store the grouping values as keys and accumulate the corresponding values in a list (which you can then aggregate with any function).
from collections import defaultdict
d = defaultdict(list)
for k,v in data:
d[k].append(v)
grouped_data = dict(d)
print(grouped_data)
{1: [41, 30, 37], 2: [70, 68], 3: [82, 97, 8], 0: [18, 51]}
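If you then need an aggregate per group, a plain dict comprehension over the result does the job; for example, the mean of each group:
#Aggregate each group, e.g. take the mean of the collected values
{k: sum(v) / len(v) for k, v in grouped_data.items()}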
7.2 Using numpy's split function
Another way of splitting an array into a list of sub-arrays based on a grouping key is to use np.split along with the index positions for each group returned by np.unique. The only important thing is that the array needs to be sorted by the grouping column first.
import numpy as np
#sorted numpy array (sorted by the grouping column)
a = np.array(data)
a = a[np.argsort(a[:, 0])]
#Take the index positions for the unique values using return_index
#and start from the second one to split the data
groups = np.split(a[:,1], np.unique(a[:,0], return_index=True)[1][1:])
print(groups)
[array([18, 51]), array([41, 30, 37]), array([70, 68]), array([82, 97, 8])]
7.3 Using itertools' groupby
Itertools provides a groupby API which is actually a sequential/local grouping method.
It is handy for splitting "AAABBBAACCC" into groups "AAA BBB AA CCC". To instead get global groups like "AAAAA BBB CCC", it's necessary to sort the data by the grouping key first.
import itertools
items = sorted(data, key=lambda x:x[0])
grouper = itertools.groupby(items, key=lambda x:x[0])
groups = [list(g) for k, g in grouper]
groups
[[(0, 18), (0, 51)],
[(1, 41), (1, 30), (1, 37)],
[(2, 70), (2, 68)],
[(3, 82), (3, 97), (3, 8)]]
8. References
http://www.scipy-lectures.org/packages/statistics/index.html#hypothesis-testing-comparing-two-groups
https://www.simple-talk.com/sql/t-sql-programming/sql-group-by-basics/
http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html