<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data Science Philosophy]]></title><description><![CDATA[Knowledge is Power]]></description><link>https://www.blog.datasciencephilosophy.com</link><image><url>https://substackcdn.com/image/fetch/$s_!tofb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5daf423-732b-4eb7-b26e-f8d1e2fe0786_410x410.png</url><title>Data Science Philosophy</title><link>https://www.blog.datasciencephilosophy.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 15:36:04 GMT</lastBuildDate><atom:link href="https://www.blog.datasciencephilosophy.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Akshay Sehgal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[akshaysehgal@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[akshaysehgal@substack.com]]></itunes:email><itunes:name><![CDATA[Akshay Sehgal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Akshay Sehgal]]></itunes:author><googleplay:owner><![CDATA[akshaysehgal@substack.com]]></googleplay:owner><googleplay:email><![CDATA[akshaysehgal@substack.com]]></googleplay:email><googleplay:author><![CDATA[Akshay Sehgal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Visualizing the Void: Unveiling Missing Data with Python's missingno]]></title><description><![CDATA[#python #missingno #missingdata #dataanalysis]]></description><link>https://www.blog.datasciencephilosophy.com/p/visualizing-the-void-unveiling-missing</link><guid 
isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/visualizing-the-void-unveiling-missing</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Tue, 12 Dec 2023 03:58:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/13f0d52d-59df-4e91-bc5b-67f5a336c33c_1322x944.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>Recently while setting up Python on a new device, I instinctively installed the <code>missingno</code> library, along with essentials like <code>pandas</code> and <code>numpy</code>. This made me reflect on why this underrated library has become as indispensable to me as these Python staples. Let's dive into how this little-known library became a favorite of mine, and perhaps might become yours too!</p></div><h3><strong>Introduction</strong></h3><p>Handling missing data is a common yet critical challenge in data analysis. The <code>missingno</code> library in Python offers a visually intuitive way to understand the distribution and patterns of missing data. In this article, we'll explore how to use <code>missingno</code> to analyze missing data effectively.</p><h3><strong>Installation</strong></h3><p>First, install <code>missingno</code> using pip:</p><pre><code>pip install missingno</code></pre><h3><strong>Importing Libraries</strong></h3><p>Let's start by importing the necessary libraries:</p><pre><code>import pandas as pd
import numpy as np
import missingno as msno</code></pre><h3><strong>Loading a Dataset</strong></h3><pre><code># Sample dataframe (fetch it from github, or read your own)

path = "https://raw.githubusercontent.com/Akshaysehgal2005/Datasets/main/titanic_extended.csv"
df = pd.read_csv(path)

print(df.shape) #(1310, 14)
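
# Optional numeric cross-check (a sketch, not part of the original post).
# `missing_summary` is a hypothetical helper: it tabulates, per column, the
# count and percentage of missing values, worst column first, so the plots
# that follow can be sanity-checked against plain numbers.
def missing_summary(frame):
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(1)
    return counts.to_frame("missing").assign(percent=percent).sort_values("missing", ascending=False)

# Usage, assuming `df` was loaded as above: print(missing_summary(df))
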
df.head()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5dL0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5dL0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 424w, https://substackcdn.com/image/fetch/$s_!5dL0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 848w, https://substackcdn.com/image/fetch/$s_!5dL0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 1272w, https://substackcdn.com/image/fetch/$s_!5dL0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5dL0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png" width="1282" height="189" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:189,&quot;width&quot;:1282,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5dL0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 424w, https://substackcdn.com/image/fetch/$s_!5dL0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 848w, https://substackcdn.com/image/fetch/$s_!5dL0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 1272w, https://substackcdn.com/image/fetch/$s_!5dL0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec6f4891-a0a2-45ec-b827-f4f1a467b531_1282x189.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3><strong>Visualizing Missing Data</strong></h3><p>The <code>missingno</code> library offers various visualization tools to analyze missing data. Let's explore some of them -</p><h4>1. 
Matrix</h4><p>The matrix provides a heatmap-like representation of the data's completeness:</p><pre><code>msno.matrix(df)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3dSt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3dSt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 424w, https://substackcdn.com/image/fetch/$s_!3dSt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 848w, https://substackcdn.com/image/fetch/$s_!3dSt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 1272w, https://substackcdn.com/image/fetch/$s_!3dSt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3dSt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png" width="1456" height="632" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35f33984-3221-4418-af6a-e9360f529f86_1476x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3dSt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 424w, https://substackcdn.com/image/fetch/$s_!3dSt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 848w, https://substackcdn.com/image/fetch/$s_!3dSt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 1272w, https://substackcdn.com/image/fetch/$s_!3dSt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35f33984-3221-4418-af6a-e9360f529f86_1476x641.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ol><li><p>In this matrix, white lines mark missing values, which makes it easy to spot trends in where values go missing.</p></li><li><p>The sparkline on the far right shows the number of non-missing values in each row and labels the minimum and maximum counts. For instance, 13 marks the most complete row, which holds 13 values, while 0 marks a row with no values at all, that is, an empty row.</p></li></ol><h4>2. 
Bar Chart</h4><p>The bar chart shows the completeness of the dataset:</p><pre><code>msno.bar(df)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jsut!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jsut!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 424w, https://substackcdn.com/image/fetch/$s_!jsut!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 848w, https://substackcdn.com/image/fetch/$s_!jsut!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 1272w, https://substackcdn.com/image/fetch/$s_!jsut!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jsut!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png" width="1425" height="672" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1425,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35502,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jsut!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 424w, https://substackcdn.com/image/fetch/$s_!jsut!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 848w, https://substackcdn.com/image/fetch/$s_!jsut!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 1272w, https://substackcdn.com/image/fetch/$s_!jsut!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62cf12b5-fc6f-40d1-aadb-7adcc4a3d378_1425x672.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This visualization is quite practical: each bar gives a straightforward count of the non-missing values in a column, so shorter bars flag the columns with more missing data.</p><h4>3. 
Heatmap</h4><p>The heatmap shows the correlation of missingness between different columns:</p><pre><code>msno.heatmap(df)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nENv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nENv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 424w, https://substackcdn.com/image/fetch/$s_!nENv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 848w, https://substackcdn.com/image/fetch/$s_!nENv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 1272w, https://substackcdn.com/image/fetch/$s_!nENv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nENv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png" width="1240" height="738" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nENv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 424w, https://substackcdn.com/image/fetch/$s_!nENv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 848w, https://substackcdn.com/image/fetch/$s_!nENv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 1272w, https://substackcdn.com/image/fetch/$s_!nENv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb306c-be46-496f-bfc0-e2d853a47f65_1240x738.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A positive value indicates a positive correlation between the missingness of two columns, and vice versa. This concept is called <strong>Nullity Correlation</strong>.</p><blockquote><p><strong>Nullity correlation</strong> is a measure that assesses the relationship between the presence of missing values in different variables or columns in a dataset. 
It is a specific type of correlation that focuses on the patterns of missing (null) data rather than the actual data values.</p></blockquote><p>Like a standard correlation coefficient, nullity correlation ranges from -1 to 1.</p><ul><li><p>A value close to +1 indicates that if one column has missing data, the other column is likely to have missing data as well.</p></li><li><p>A value close to -1 suggests that if one column has missing data, the other column is likely not to have missing data.</p></li><li><p>A value around 0 indicates no particular correlation between the missingness of the two columns.</p></li></ul><p>Here, we observe a weak inverse nullity correlation of -0.2 between the variables <em>Body</em> and <em>Boat</em>, indicating that rows with a value in one variable tend, weakly, to lack data in the other, and vice versa.</p><h4>4. Dendrogram</h4><p>The dendrogram, drawn with <code>msno.dendrogram(df)</code>, shows how columns cluster by their patterns of missing data:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aJpw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aJpw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 424w, https://substackcdn.com/image/fetch/$s_!aJpw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 848w, 
https://substackcdn.com/image/fetch/$s_!aJpw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 1272w, https://substackcdn.com/image/fetch/$s_!aJpw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aJpw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png" width="1456" height="655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:655,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aJpw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 424w, https://substackcdn.com/image/fetch/$s_!aJpw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 848w, 
https://substackcdn.com/image/fetch/$s_!aJpw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 1272w, https://substackcdn.com/image/fetch/$s_!aJpw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb11b3145-d2c0-4838-8624-88d41f3cd1ed_2002x901.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Columns clustered together have a similar pattern of missing data.</p><h3><strong>What&#8217;s next?</strong></h3><p>Once you have 
analyzed the gaps in your dataset, you can choose from many methods for handling the missing data:</p><ul><li><p>Drop every row that has any missing data</p></li><li><p>Drop rows only where specific variables are missing</p></li><li><p>Drop variables with high volumes of missing data</p></li><li><p>Impute missing data for numeric variables</p></li><li><p>Fill gaps directly with aggregates such as the mean, median, or mode</p></li><li><p>Estimate and fill in missing data using ML (e.g. <a href="https://scikit-learn.org/stable/modules/impute.html">nearest neighbor imputation</a>)</p></li><li><p>etc.</p></li></ul><p>The possibilities are endless, and each dataset and use case will require its own strategy to handle information loss due to missing data. Once your dataset is cleaned up, you might want to visualize it again with <code>missingno</code>, which might look something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TGtc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TGtc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 424w, https://substackcdn.com/image/fetch/$s_!TGtc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!TGtc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 1272w, https://substackcdn.com/image/fetch/$s_!TGtc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TGtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TGtc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 424w, https://substackcdn.com/image/fetch/$s_!TGtc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!TGtc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 1272w, https://substackcdn.com/image/fetch/$s_!TGtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39c9c5d9-0bda-4cfe-8b67-c7ace9ac4d86_1476x641.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Nothing is more satisfying after an hour of data cleaning than the visualization above!</p><h3><strong>Conclusion</strong></h3><p>In 
conclusion, <code>missingno</code> is a powerful tool for a preliminary check of missing data in your dataset. It helps you understand the pattern and extent of missingness, which is crucial for any data preprocessing step. By integrating <code>missingno</code> into your data analysis workflow, you can make more informed decisions about how to handle missing values in your dataset.</p><blockquote><p>Thanks to <a href="https://github.com/ResidentMario">Aleksey Bilogur</a> for authoring and maintaining <code>missingno</code> for all these years!</p></blockquote><h3><strong>References</strong></h3><ul><li><p>https://github.com/ResidentMario/missingno</p></li><li><p>https://www.kaggle.com/code/akshaysehgal/handling-missing-data-like-a-boss</p></li><li><p>https://scikit-learn.org/stable/modules/impute.html</p></li></ul>
]]></content:encoded></item><item><title><![CDATA[Stack Puzzle: Finding "patterns" in 2D arrays using Convolution Operations]]></title><description><![CDATA[#stackoverflow #python #numpy #convolution #datascience]]></description><link>https://www.blog.datasciencephilosophy.com/p/stack-puzzle-finding-patterns-in</link><guid isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/stack-puzzle-finding-patterns-in</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Sun, 02 Apr 2023 18:28:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Around January 2021, I solved an interesting problem on <strong>Stack Overflow<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></strong>. In this post, allow me to share the problem with you as a <strong>coding puzzle</strong>, and let&#8217;s explore an interesting solution to it.</p><div class="pullquote"><p><strong>Puzzle</strong>: <em>Given an imperfect 2D lattice/array with 1s and 0s forming a repeating pattern, find the &#8220;gaps&#8221; in the structure where some of the 1s are missing.</em></p></div><p>Here are a few examples of what the 2D lattices and the &#8220;gaps&#8221; in them look like:</p><div class="image-gallery-embed" 
data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbda2224-e910-4d2f-887f-0095e8c8d70d_362x217.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bde160d1-14bf-44ee-8bb8-78cafba37989_251x249.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76a08f36-3aca-41ef-a4a7-ac0580a94c3f_368x227.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74045d60-1043-4cbe-8e20-185e1502f600_368x198.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d29d42c2-cd92-4d42-a3f9-7dd9bca5cd5d_179x248.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1286b3a9-7866-43a4-8ba6-0d45d3f88718_318x248.png&quot;}],&quot;caption&quot;:&quot;Imperfect Lattices&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35c225f5-f6e4-4a89-b879-18286ca87b78_1456x964.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>The yellow points represent 1s and the rest represent 0s. Notice that in a few places where the regular structure should persist, a few 1s are missing. The goal of this puzzle is to write code that can identify these &#8220;gaps&#8221;. Here is the starter code to create one such 2D array in Python:</p><pre><code>import numpy as np

arr = np.array([[1, 0, 0, 0, 1, 0, 0, 0, 0], #one gap here 
                [0, 0, 1, 0, 0, 0, 1, 0, 0],
                [1, 0, 0, 0, 1, 0, 0, 0, 1],
                [0, 0, 1, 0, 0, 0, 0, 0, 0], #one gap here
                [1, 0, 0, 0, 1, 0, 0, 0, 1]])</code></pre><blockquote><p><strong>NOTE</strong>: There is one assumption here. There should exist at least one complete pattern in the lattice/array. Not every instance of the pattern can have a gap in it!</p></blockquote><p>And, let&#8217;s say you are given a function to generate such &#8220;impure&#8221; lattices on demand, so you can test your algorithm properly!</p><pre><code>import random
import numpy as np
import matplotlib.pyplot as plt

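def generate_random_lattice(gaps=0.1):
    """Return a random 0/1 lattice; each 1 is dropped with probability `gaps`."""
    #Generate random tile pattern (odd side lengths: 3, 5 or 7)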
    n,m = random.randrange(3,9,2), random.randrange(3,9,2)
    tile = np.zeros((n,m)).astype(int)
    tile[::tile.shape[0]-1, ::tile.shape[1]-1] = 1  #update corners
    tile[tile.shape[0]//2, tile.shape[1]//2] = 1    #update center
    tile = tile[:-1, :-1]                           #remove bottom row and right column
    
    #Create pure lattice
    x,y = np.random.randint(2,10), np.random.randint(2,10)
    lattice = np.tile(tile, (x,y))
    
    #Add impurities / gaps
    ones_shape = lattice[lattice.nonzero()].shape
    lattice[lattice.nonzero()] = np.random.binomial(n=1, p=1-gaps, size=ones_shape)
    
    #Plot lattice
    plt.imshow(lattice)
    return lattice

arr = generate_random_lattice()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0m1q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0m1q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 424w, https://substackcdn.com/image/fetch/$s_!0m1q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 848w, https://substackcdn.com/image/fetch/$s_!0m1q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 1272w, https://substackcdn.com/image/fetch/$s_!0m1q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0m1q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png" width="368" height="227" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:227,&quot;width&quot;:368,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0m1q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 424w, https://substackcdn.com/image/fetch/$s_!0m1q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 848w, https://substackcdn.com/image/fetch/$s_!0m1q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 1272w, https://substackcdn.com/image/fetch/$s_!0m1q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea46c5b-adbb-42e4-9e85-f721a523008f_368x227.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Do think about how you would solve this puzzle algorithmically or, better yet, attempt it before scrolling down to the (potential) solution!</p><p>There are a million ways to solve this problem, but I would like to explore one that involves an operation that is used very frequently in the world of Data Science and Machine Learning: 
<strong>Convolutions</strong>.</p><div><hr></div><h1>Solution: Using Convolutions</h1><p>Readers with a background in ML/Data Science will already know the term &#8220;Convolution&#8221;, as it is foundational to Computer Vision and Image Processing. <strong>Convolutional Neural Networks</strong> (CNNs) are a class of neural network architectures that, at their core, use the operation named <strong>Convolution</strong>.</p><p>I will not go much deeper into what the Convolution operation or CNNs are, but here is a quick overview.</p><blockquote><h4>What is a 2D Convolution?</h4><p>A 2D convolution operation involves a matrix/image and a filter. The filter is moved/convolved over the matrix and, at each position, a dot product between the filter and the patch it covers is computed and written to the output matrix. Since these positions are independent, an implementation can compute them in parallel in a vectorized manner.</p><p>We can use other operations instead of a dot product as well! The idea here is that if the pattern in the filter is found in the image exactly, it should return a very high value. Analyzing the output of the operation reveals what the filter and the operation are designed to discover.</p><p>There is additional complexity here, such as how padding can be leveraged, but I will skip that for this post. There is a lot of great material on this online, for example, <a href="https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1">this one</a>. 
Here is a visual representation of what this operation looks like (<a href="https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Full_padding_no_strides.gif">image source: wikipedia</a>)</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mM_F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mM_F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 424w, https://substackcdn.com/image/fetch/$s_!mM_F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 848w, https://substackcdn.com/image/fetch/$s_!mM_F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 1272w, https://substackcdn.com/image/fetch/$s_!mM_F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mM_F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif" width="417" height="485.23636363636365" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:495,&quot;resizeWidth&quot;:417,&quot;bytes&quot;:435605,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mM_F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 424w, https://substackcdn.com/image/fetch/$s_!mM_F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 848w, https://substackcdn.com/image/fetch/$s_!mM_F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 1272w, https://substackcdn.com/image/fetch/$s_!mM_F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b254680-7a0f-4a5c-8fd6-50dd557b5d70_495x576.gif 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><blockquote><p><em>An interesting thing here is that Convolutions are used for identifying &#8220;patterns&#8221; in an image, but they can also be used to find the lack of a given pattern, which is the case in this coding puzzle.</em></p></blockquote><p>With this knowledge, let&#8217;s design an algorithm to detect and mark these gaps in an &#8220;imperfect lattice&#8221;, that is, an array with a repeating pattern and some imperfections.</p><p><strong>Here is the pseudo-code</strong>:</p><ol><li><p>Identify the <strong>smallest convolution window</strong> size by analyzing the proximity of 1s, row-wise and column-wise.</p></li><li><p>Estimate the number of such <strong>repeating patterns</strong> in the array over both axes.</p></li><li><p><strong>Fetch all the convolution windows</strong> by iterating over the array.</p></li><li><p>Take the <strong>sum of each convolution window</strong> to search for &#8220;gaps&#8221;. 
The largest sum would indicate a complete pattern, while anything less would indicate incomplete patterns.</p></li><li><p>Fetch a copy of the <strong>complete pattern</strong> and subtract it (with broadcasting) from each convolution window to get the &#8220;gaps&#8221; as -1&#8217;s.</p></li><li><p>Check the location of the -1&#8217;s and <strong>mark them in the original array</strong>.</p></li></ol><div><hr></div><p>Let&#8217;s begin implementing it. First, let&#8217;s start by generating an example to work with.</p><pre><code>arr = generate_random_lattice() #using the function above
print(arr.shape)

(19, 17)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G-LN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G-LN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 424w, https://substackcdn.com/image/fetch/$s_!G-LN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 848w, https://substackcdn.com/image/fetch/$s_!G-LN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 1272w, https://substackcdn.com/image/fetch/$s_!G-LN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G-LN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png" width="238" height="248" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:248,&quot;width&quot;:238,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G-LN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 424w, https://substackcdn.com/image/fetch/$s_!G-LN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 848w, https://substackcdn.com/image/fetch/$s_!G-LN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 1272w, https://substackcdn.com/image/fetch/$s_!G-LN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce782d66-1223-44a8-bf0b-4047e59fb1c9_238x248.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Let&#8217;s start by analyzing the proximity of 1s over the 2 axes. When working on anything related to &#8220;distance between elements&#8221; or &#8220;span of a specific element&#8221; in arrays, I often find <code>numpy.cumsum()</code> useful. Even though the lattice has gaps, we can recover the pattern along the 2 axes by simply taking a sum over each axis, as below:</p><pre><code>print(arr.sum(1))
print(arr.sum(0))

[4 0 0 3 0 0 5 0 0 3 0 0 5 0 0 4 0 0 4] #pattern of x-2-x-2-x
[4 0 2 0 4 0 3 0 4 0 3 0 4 0 2 0 2]     #pattern of x-1-x-1-x</code></pre><p>We can see that this shows us the structure of the pattern over both axes while &#8220;ignoring&#8221; the gaps in the lattice. Let&#8217;s use this to get the shape of the convolution window that we want to fetch from the array.</p><pre><code>xdims = np.unique((arr.sum(1)!=0).cumsum(), return_counts=True)[1].max()*2+1
ydims = np.unique((arr.sum(0)!=0).cumsum(), return_counts=True)[1].max()*2+1
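
#Toy walk-through of the expression above (a sketch; the demo_* names are
#illustrative and not part of the original solution)
import numpy as np                                     #repeated so the sketch runs on its own
demo_sums = np.array([4, 0, 0, 3, 0, 0, 5])            #row sums: a row with 1s, two empty rows, ...
demo_labels = (demo_sums != 0).cumsum()                #[1 1 1 2 2 2 3], a new label at every row with 1s
demo_counts = np.unique(demo_labels, return_counts=True)[1]   #[3 3 1], rows per label
demo_dim = int(demo_counts.max()) * 2 + 1              #3*2+1 = 7: corner row, center row, closing corner row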

pattern_shape = (xdims, ydims)
print(pattern_shape)

(7, 5)</code></pre><p>With this, we now know that the smallest pattern shape for which we can create convolution windows is <code>(7,5)</code>. Now, we could create a perfect pattern of this shape and just use that to find imperfections in our lattice, but where is the fun in that? Let&#8217;s first find out how many sliding window operations are needed over both axes for a pattern of this shape to repeat and fit into the overall array.</p><pre><code>#Windows share an edge, so they step by (size-1); subtracting 1 keeps the last window inside the array
num_xpatterns = (arr.shape[0]-1)//(pattern_shape[0]-1)
num_ypatterns = (arr.shape[1]-1)//(pattern_shape[1]-1)
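
#Sanity check (a sketch with illustrative demo_* names): for the (19, 17) array and
#(7, 5) window of this walkthrough, list every window start that still fits when
#neighbouring windows share one edge row/column
demo_row_starts = list(range(0, 19 - 7 + 1, 7 - 1))    #[0, 6, 12], so 3 windows along axis 0
demo_col_starts = list(range(0, 17 - 5 + 1, 5 - 1))    #[0, 4, 8, 12], so 4 windows along axis 1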

print((num_xpatterns, num_ypatterns))

(3, 4)</code></pre><p>This is straightforward. Since neighbouring patterns share an edge instead of repeating independently, we drop the last row/column of pixels from the pattern shape before calculating how many times the window needs to move. Here, the <code>(7,5)</code> window is repeated 3 times downwards (axis 0) and 4 times to the right (axis 1).</p><p>Let&#8217;s now break the original array into these <code>3 X 4 = 12</code> sliding windows of shape <code>(7,5)</code>. I will use <code>numpy.lib.stride_tricks</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> for this.</p><pre><code>#Calculating the stride and shape that needs to be taken with stride_tricks
shp = (num_xpatterns, num_ypatterns, xdims, ydims)
strd = (arr.strides[0]*(xdims-1), arr.strides[1]*(ydims-1), arr.strides[0], arr.strides[1])
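
#Toy illustration of the same stride idea in 1D (a sketch; the demo_* names are
#illustrative): five values viewed as two windows of length 3 that share an edge
import numpy as np                                     #repeated so the sketch runs on its own
demo_1d = np.arange(5)                                 #[0 1 2 3 4]
demo_step = demo_1d.strides[0]                         #bytes between neighbouring elements
demo_view = np.lib.stride_tricks.as_strided(demo_1d, shape=(2, 3), strides=(demo_step * 2, demo_step))
#demo_view is [[0 1 2], [2 3 4]]: jump (3-1) elements between windows, 1 element inside a window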

#Generating rolling windows/convolutions over the image to separate the patterns.
convolve_pattern = np.lib.stride_tricks.as_strided(arr, shape=shp, strides=strd)
convolve_pattern.shape

(3, 4, 7, 5)</code></pre><p>As expected, we get a tensor of the shape (3,4,7,5) which means 3X4 matrices of shape (7,5). Let&#8217;s fetch one of them to check what it looks like.</p><pre><code>convolve_pattern[0,-1,:,:]

array([[1, 0, 0, 0, 0],   #&lt;- one gap here
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1]])</code></pre><p>This corresponds to this window of the array - </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6q6w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6q6w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 424w, https://substackcdn.com/image/fetch/$s_!6q6w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 848w, https://substackcdn.com/image/fetch/$s_!6q6w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 1272w, https://substackcdn.com/image/fetch/$s_!6q6w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6q6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png" width="238" height="248" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:248,&quot;width&quot;:238,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5721,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6q6w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 424w, https://substackcdn.com/image/fetch/$s_!6q6w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 848w, https://substackcdn.com/image/fetch/$s_!6q6w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 1272w, https://substackcdn.com/image/fetch/$s_!6q6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85f03d20-d6c8-4412-9d89-39843620c1a6_238x248.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Great! Next, we just calculate the sum of each of these windows to find the gaps.</p><pre><code>pattern_sums = convolve_pattern.sum(axis=(-1,-2))
pattern_sums

array([[4, 5, 5, 4],
       [5, 5, 5, 4],
       [5, 5, 5, 4]])</code></pre><p>So, in this (3,4) matrix, the values where the sum is 5 are examples of perfect patterns, while the ones &lt; 5 are patterns with gaps. Let&#8217;s fetch a perfect pattern from this view.</p><pre><code>pattern_sums = convolve_pattern.sum(axis=(-1,-2))
idx = np.unravel_index(np.argmax(pattern_sums), pattern_sums.shape)
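
#Worked example with the pattern_sums printed above (a sketch; the demo_* names are
#illustrative): argmax gives the flat position of the first maximum and
#unravel_index converts it back to (row, column) in the grid of windows
import numpy as np                                     #repeated so the sketch runs on its own
demo_pattern_sums = np.array([[4, 5, 5, 4], [5, 5, 5, 4], [5, 5, 5, 4]])
demo_idx = np.unravel_index(np.argmax(demo_pattern_sums), demo_pattern_sums.shape)
#demo_idx is (0, 1): the first window whose sum equals the maximum of 5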
truth_pattern = convolve_pattern[idx]
truth_pattern

array([[1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1]])</code></pre><p>Nice, we are almost there! Now, since we have this <code>(7,5)</code> pattern, we can subtract it from the convolved view of the array <code>(3,4,7,5)</code> with broadcasting to get the gap locations!</p><pre><code>gaps = convolve_pattern - truth_pattern[None, None, :, :]
gaps[0,-1,:,:]

array([[ 0,  0,  0,  0, -1],  #&lt;- gap marked with -1
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0]])</code></pre><p>Then we can use the beautiful concept of <strong>numpy views vs copies</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, to overwrite the 0s of these gap locations on the memory of the original matrix with -1s. Overwriting the locations on the <code>(3,4,7,5)</code> view also overwrites the original location in the <code>(19,17)</code> array since they share the same memory! This is the power of understanding how to work with views vs copies!</p><pre><code>for i in np.argwhere(gaps==-1):
    convolve_pattern[tuple(i)]=-1</code></pre><p>That's it! Here is the complete code and the function in action!</p><pre><code>#IDENTIFYING IMPERFECT PATTERNS IN A LATTICE

import numpy as np
import matplotlib.pyplot as plt

def find_gaps(x, verbose=False):
    #Identifying the size and shape of the repeating pattern
    xdims = np.unique((x.sum(1)!=0).cumsum(), return_counts=True)[1].max()*2+1
    ydims = np.unique((x.sum(0)!=0).cumsum(), return_counts=True)[1].max()*2+1
    pattern_shape = (xdims, ydims)

    #Calculating number of rolling windows that exist with that pattern
    num_xpatterns = x.shape[0]//(pattern_shape[0]-1)
    num_ypatterns = x.shape[1]//(pattern_shape[1]-1)
    
    #Calculating the stride and shape that needs to be taken with stride_tricks
    shp = (num_xpatterns, num_ypatterns, xdims, ydims)
    strd = (x.strides[0]*(xdims-1), x.strides[1]*(ydims-1), x.strides[0], x.strides[1])

    #Generating rolling windows/convolutions over the image to separate the patterns.
    convolve_pattern = np.lib.stride_tricks.as_strided(x, shape=shp, strides=strd)

    #Assuming at least 1 untouched pattern exists, finding that pure pattern
    pattern_sums = convolve_pattern.sum(axis=(-1,-2))
    idx = np.unravel_index(np.argmax(pattern_sums), pattern_sums.shape)
    truth_pattern = convolve_pattern[idx]

    #Printing Debugging info
    if verbose:
        print('x shape:',x.shape)
        print('pattern shape:',pattern_shape)

        print('convolved shape:',convolve_pattern.shape)

        print('')
        print('pattern sums')
        print(pattern_sums)

        print('')
        print('true pattern')
        print(truth_pattern)
    
    #Identifying the gaps by subtracting the convolved image with the truth pattern
    gaps = convolve_pattern - truth_pattern[None, None, :, :]

    #Setting the gaps as -1 directly into the location of memory of the original image
    for i in np.argwhere(gaps==-1):
        convolve_pattern[tuple(i)]=-1
    
    plt.imshow(x)
    return x

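#Illustrative sketch (an assumption for demonstration, not part of the original solution):
#checking on a tiny array that as_strided windows are views, so writes reach the source array
import numpy as np

demo = np.zeros((4, 4), dtype=int)
demo_shp = (2, 2, 2, 2)     #2x2 grid of non-overlapping 2x2 windows
demo_strd = (demo.strides[0]*2, demo.strides[1]*2, demo.strides[0], demo.strides[1])
demo_windows = np.lib.stride_tricks.as_strided(demo, shape=demo_shp, strides=demo_strd)

demo_windows[1, 0, 0, 1] = -1                  #write through the view
print(np.shares_memory(demo, demo_windows))    #True, same underlying buffer
print(demo[2, 1])                              #-1, the source array changed too
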
x = generate_random_lattice()     #Generate lattice
o = find_gaps(x, verbose=False)   #Find gaps in lattice</code></pre><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab58730f-d2da-48d8-bc38-e7b71d2dd12a_375x248.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea0b9bf-53e9-4f52-8a4a-00d879cde184_375x248.png&quot;}],&quot;caption&quot;:&quot;example 1&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5d2a1fe-59d0-4499-b175-62cff2c9542a_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e87ff527-a686-4bfd-877d-f037cf100db9_134x249.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaf9894b-3743-43da-8c06-9eb600ff87d2_134x249.png&quot;}],&quot;caption&quot;:&quot;example 2&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d86fc02f-d65e-48c7-adf2-b388a7c1eea9_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><div class="image-gallery-embed" 
data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69cb4a96-c57e-4998-acf9-07c86216b719_184x248.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f89f80ae-d78c-4bf9-a48d-1dcfa4df51b2_184x248.png&quot;}],&quot;caption&quot;:&quot;example 3&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e6eb71c-4f1b-4892-bcc5-02043a6d798b_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3161d9db-fda3-467d-b996-f602d81ec3a2_218x248.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0678f23-31fc-4176-9666-97c41b153134_218x248.png&quot;}],&quot;caption&quot;:&quot;example 4&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec6e62c1-aeb2-4a69-9bfd-5c1a89da00cc_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36655e53-50f3-44b2-aede-f2bfc0d5a3f8_182x248.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efabae77-7354-443a-9560-68110c9d46f4_182x248.png&quot;}],&quot;caption&quot;:&quot;example 
5&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/464c4f2f-2157-4376-a4b8-12ec7b8b9437_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p><strong>Thanks for reading all the way to the end! My goal here was to demonstrate the power of Convolution operations and Numpy in solving unique scenarios. Hope this was entertaining and educational for you, the reader!</strong></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>https://stackoverflow.com/questions/65296608</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>https://numpy.org/doc/stable/user/basics.copies.html</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Haskell style Infinite Sequences in Python? 
It's easier than you think!]]></title><description><![CDATA[#python #itertools #iterators #infinitesequences]]></description><link>https://www.blog.datasciencephilosophy.com/p/haskell-style-infinite-sequences</link><guid isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/haskell-style-infinite-sequences</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Tue, 28 Mar 2023 12:41:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/58547f37-a6a0-4e3c-adb6-70394c382a5d_1200x669.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let's say you need to continuously generate a bunch of information and send it as a signal to another system, or you need to perform a harmonic analysis for your project, or you have to perform an operation on an undefined number of inputs until a specific condition is met. These fall into the domain of "<strong>Infinite Sequences</strong>" and are not alien to the world of Data Science.</p><p>In fact, infinite sequences are natural to programming languages such as <strong>Haskell</strong>. Here is a great video from <a href="https://www.youtube.com/@Computerphile">Computerphile</a> to demonstrate that.</p><div id="youtube2-bnRNiE_OVWA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;bnRNiE_OVWA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/bnRNiE_OVWA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>But did you know that you can work with infinite sequences quite efficiently in Python as well? You can write <code>while</code> loops but a pythonic way would be to create iterators that you can use. 
In fact, <code>itertools</code> has some very interesting out-of-the-box methods called "<a href="https://docs.python.org/3/library/itertools.html">infinite iterators</a>" for you to work with infinite sequences.</p><p>There are 3 infinite iterators in <code>itertools</code> that you can work with - </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!93Sk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!93Sk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 424w, https://substackcdn.com/image/fetch/$s_!93Sk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 848w, https://substackcdn.com/image/fetch/$s_!93Sk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 1272w, https://substackcdn.com/image/fetch/$s_!93Sk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!93Sk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png" width="1456" height="462" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:333730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!93Sk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 424w, https://substackcdn.com/image/fetch/$s_!93Sk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 848w, https://substackcdn.com/image/fetch/$s_!93Sk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 1272w, https://substackcdn.com/image/fetch/$s_!93Sk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfec79f8-88c4-4633-b5da-eea0e4531db5_1656x526.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3>Working with Infinite Iterators</h3><p>Let's start with a simple example of <em>generating prime factors for an infinite sequence of multiples of 5 greater than 100</em> -</p><pre><code>#!pip install sympy
from itertools import count, islice
from sympy.ntheory import primefactors

itr = count(start=100,step=5)      #&lt;-- infinite sequence
f = lambda x: (x, primefactors(x)) #&lt;-- fx for prime factors

factors = map(f, itr)              #&lt;-- map the function

list(islice(factors, 10))          #&lt;-- get first 10 elements</code></pre><pre><code>[(100, [2, 5]),
 (105, [3, 5, 7]),
 (110, [2, 5, 11]),
 (115, [5, 23]),
 (120, [2, 3, 5]),
 (125, [5]),
 (130, [2, 5, 13]),
 (135, [3, 5]),
 (140, [2, 5, 7]),
 (145, [5, 29])]</code></pre><p>You can see the power of this. The <code>map()</code> + <code>count()</code> structure can get very complex, very fast, with more advanced operations, all without defining a closed set of inputs. Then, when you actually need the values, you can use <code>itertools.islice</code> or a simple for loop to evaluate them lazily, on demand.</p><p>Here is another example of <em>summing 2 infinite sequences</em> - </p><pre><code>iter1 = count(10,2)         #&lt;-- inf seq start 10, step 2
iter2 = count(5,3)          #&lt;-- inf seq start 5, step 3
g = lambda x, y: (x,y,x+y)  #&lt;-- sum function

inf_sum = map(g, iter1, iter2) #&lt;-- map the function

list(islice(inf_sum, 10))   #&lt;-- get first 10 elements</code></pre><pre><code>[(10, 5, 15),
 (12, 8, 20),
 (14, 11, 25),
 (16, 14, 30),
 (18, 17, 35),
 (20, 20, 40),
 (22, 23, 45),
 (24, 26, 50),
 (26, 29, 55),
 (28, 32, 60)]</code></pre><p>You have a few more options to build infinite sequences using <code>itertools.cycle</code> and <code>itertools.repeat</code> - </p><pre><code>from itertools import cycle

iter3 = cycle(['A','B','C'])
list(islice(iter3, 10))

## ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A']</code></pre><pre><code>from itertools import repeat

iter4 = repeat('hello world') #&lt;-- pass any object
list(islice(iter4, 5))

## ['hello world', 'hello world', 'hello world', 'hello world', 'hello world']</code></pre><p>And, you can mix and match these iterators with your functions using map().</p><p>I worked with these iterators for creating a batch-wise <code>Numpy</code> computation. My <code>Numpy</code> operation was taking a ton of memory due to a large number of input vectors and so I needed a strategy where I could break the large computation task into smaller batches. The function I wrote would take a batch of vectors, run a vectorized operation on them, save the output, and then fetch the next batch. This would work infinitely until it met a break condition.</p><p><strong>This was a very high-level overview of the world of infinite sequences using Python. Hope this gives you some ideas when working on your next project involving such functions!</strong></p>]]></content:encoded></item><item><title><![CDATA[Representing Hierarchical Data with Poincaré Embeddings!]]></title><description><![CDATA[#representationlearning #embeddings #neuralnetworks #poincar&#233;]]></description><link>https://www.blog.datasciencephilosophy.com/p/representing-hierarchical-data-with</link><guid isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/representing-hierarchical-data-with</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Sat, 25 Mar 2023 14:43:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I started exploring <a href="https://arxiv.org/abs/1206.5538">Representation Learning</a> early in my career, I was amazed at how critical this concept was for Applied Data Science and Machine Learning. And, it&#8217;s easy to understand why. 
Back then, all the buzz was around the use of Word2Vec, GloVe, FastText, etc, for generating embeddings for text, and it&#8217;s surprising that these models are still relevant today!</p><p><strong>The principle is simple</strong> - Embed words/sentences into an N-dimensional vector space where the distance between the vectors of 2 words encodes their semantic similarity. In practice, if done correctly, you would find that the embedding of the word &#8220;car&#8221; and &#8220;bike&#8221; would be clustered closer to each other than the embedding of &#8220;car&#8221; and &#8220;spaceship&#8221;. Given good vector representations or embeddings of your data, you could explore a large variety of downstream models or use them as features for secondary tasks! But this same concept is expandable to other types of data - images, tabular, graphs, etc.</p><p><strong>Flashback</strong> - At the time, I was working with a database of &#8220;Skillsets&#8221; which were structured as a hierarchy, for example, <code>Programming</code> &#8594; <code>Python</code> &#8594; <code>Pandas</code> &#8594; <code>Dataframe</code>. And, during my quest to &#8220;master&#8221; the world of Representation learning, I came across this very unique challenge of representing Hierarchical Data. I wanted to know how I could create embeddings for these &#8220;skills&#8221; using the hierarchical relationships between them and use these as part of my downstream estimators.</p><div><hr></div><h3>Enter Poincar&#233; Embeddings!</h3><p>This post will focus on implementing these rather than exploring the maths behind them but here is a quick overview. <strong>So feel free to skip this section!!</strong></p><blockquote><p><em>&#8220;However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. 
For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space &#8211; or more precisely into an n-dimensional Poincar&#233; ball.&#8221;</em></p><p>&#8212; <a href="https://arxiv.org/abs/1705.08039">Nickel, M., &amp; Kiela, D. (2017). Poincar&#233; embeddings for learning hierarchical representations. </a><em><a href="https://arxiv.org/abs/1705.08039">Advances in neural information processing systems</a></em><a href="https://arxiv.org/abs/1705.08039">, </a><em><a href="https://arxiv.org/abs/1705.08039">30</a></em><a href="https://arxiv.org/abs/1705.08039">.</a></p></blockquote><p>Poincar&#233; embeddings allow you to create hierarchical embeddings in a non-euclidean space. The vectors on the outside of the Poincar&#233; ball are lower in the hierarchy compared to the ones in the center.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gO_J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gO_J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 424w, https://substackcdn.com/image/fetch/$s_!gO_J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 848w, https://substackcdn.com/image/fetch/$s_!gO_J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gO_J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gO_J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png" width="818" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:818,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64249,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gO_J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 424w, https://substackcdn.com/image/fetch/$s_!gO_J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 848w, https://substackcdn.com/image/fetch/$s_!gO_J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gO_J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7720be1-01e8-4ba9-bbeb-cd677dde5c17_818x380.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The transformation to map a Euclidean metric tensor to a Riemannian metric tensor is an open d-dimensional unit ball.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!WdDI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WdDI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 424w, https://substackcdn.com/image/fetch/$s_!WdDI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 848w, https://substackcdn.com/image/fetch/$s_!WdDI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 1272w, https://substackcdn.com/image/fetch/$s_!WdDI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WdDI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png" width="894" height="152" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:152,&quot;width&quot;:894,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WdDI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 424w, https://substackcdn.com/image/fetch/$s_!WdDI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 848w, https://substackcdn.com/image/fetch/$s_!WdDI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 1272w, https://substackcdn.com/image/fetch/$s_!WdDI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eef9390-45e5-481b-92d4-ca5a60715a3b_894x152.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Distances between 2 vectors in this non-euclidean space are calculated as - </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!gJ1a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gJ1a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 424w, https://substackcdn.com/image/fetch/$s_!gJ1a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 848w, https://substackcdn.com/image/fetch/$s_!gJ1a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 1272w, https://substackcdn.com/image/fetch/$s_!gJ1a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gJ1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png" width="808" height="90" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:90,&quot;width&quot;:808,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16091,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gJ1a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 424w, https://substackcdn.com/image/fetch/$s_!gJ1a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 848w, https://substackcdn.com/image/fetch/$s_!gJ1a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 1272w, https://substackcdn.com/image/fetch/$s_!gJ1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd2db222-1296-4f6e-85ce-eebc6a2b7ef7_808x90.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Once plotted, your hierarchical data might look like this when mapped over a Poincar&#233; ball.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ZI7V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZI7V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 424w, https://substackcdn.com/image/fetch/$s_!ZI7V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 848w, https://substackcdn.com/image/fetch/$s_!ZI7V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!ZI7V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZI7V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png" width="1456" height="1339" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1339,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1322209,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZI7V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 424w, https://substackcdn.com/image/fetch/$s_!ZI7V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 848w, https://substackcdn.com/image/fetch/$s_!ZI7V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!ZI7V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828ef48c-c83d-45b7-a2fa-e8c0bdbbd6de_1494x1374.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <a href="https://arxiv.org/pdf/1705.08039.pdf">research paper for Poincar&#233; embeddings</a> is wonderfully written, and easy-to-use implementations are available in popular libraries. <strong>Needless to say, they are underrated</strong>.</p><p>Two popular implementations that you can use are found in -</p><ul><li><p><code>gensim.models.poincare</code></p></li><li><p><code>tensorflow_addons.layers.PoincareNormalize</code></p></li></ul><div><hr></div><h3>Implementing Poincar&#233; Embeddings!</h3><p>Let&#8217;s start by generating some hierarchical data. If you have access to such a dataset already, feel free to use that. 
But for exploring the concept, I find that the synset relationships in NLTK&#8217;s WordNet are perfect!</p><p>Here is a function I wrote to generate this data (inspired by <a href="https://github.com/TatsuyaShirakawa/poincare-embedding/blob/master/scripts/create_mammal_subtree.py">this repo</a>) - </p><pre><code>def generate_data(root='mathematics.n.01'):
    """
    Generate sample hierarchy (DAG) data as (descendant, ancestor) pairs under the input root synset
    """
    #INSTALL THE CORPUS WORDNET (UNCOMMENT FIRST RUN)
    #import nltk
    #nltk.download('wordnet')
    #nltk.download('omw-1.4')

    #IMPORT DEPENDENCIES
    import random
    from nltk.corpus import wordnet as wn

    #SET ROOT/TARGET SYNSET FROM THE INPUT NAME
    target = wn.synset(root)
    print('target', target.name())

    #FIND ALL NOUNS IN WORDNET
    words = wn.words()
    nouns = set([])
    for word in words:
        nouns.update(wn.synsets(word, pos='n'))  
    print(len(nouns), 'nouns')

    #FETCH ALL HYPERNYMS WITH PATH TO TARGET
    hypernyms = []
    for noun in nouns:
        paths = noun.hypernym_paths()
        for path in paths:
            try:
                pos = path.index(target)
                for i in range(pos, len(path)-1):
                    hypernyms.append((noun, path[i]))
            except ValueError:
                #TARGET IS NOT ON THIS PATH, SKIP IT
                continue
    hypernyms = list(set(hypernyms))
    print(len(hypernyms), 'hypernyms')

    #SHUFFLE AND SAVE ALL RELATIONSHIP TUPLES
    random.shuffle(hypernyms)
    data = [(n1.name(), n2.name()) for n1, n2 in hypernyms]
    print(len(data), 'relations')
    return data

#generating data with root as "bird"
data = generate_data('bird.n.01')

# target bird.n.01
# 82115 nouns
# 3435 hypernyms
# 3435 relations</code></pre><pre><code>#Show 20 sample datapoints
print(data[:20])

[('australian_turtledove.n.01', 'columbiform_bird.n.01'),
 ('snowy_egret.n.01', 'bird.n.01'),
 ('american_gallinule.n.01', 'gallinule.n.01'),
 ('banded_stilt.n.01', 'stilt.n.03'),
 ('plover.n.01', 'shorebird.n.01'),
 ('great_bowerbird.n.01', 'passerine.n.01'),
 ('jaeger.n.01', 'coastal_diving_bird.n.01'),
 ('blue_jay.n.01', 'new_world_jay.n.01'),
 ('new_world_jay.n.01', 'passerine.n.01'),
 ('auklet.n.01', 'aquatic_bird.n.01'),
 ('hedge_sparrow.n.01', 'passerine.n.01'),
 ('shoveler.n.02', 'anseriform_bird.n.01'),
 ('peahen.n.01', 'gallinaceous_bird.n.01'),
 ('pheasant_coucal.n.01', 'coucal.n.01'),
 ('goldfinch.n.02', 'oscine.n.01'),
 ('old_squaw.n.01', 'sea_duck.n.01'),
 ('cochin.n.01', 'domestic_fowl.n.01'),
 ('corncrake.n.01', 'aquatic_bird.n.01'),
 ('least_sandpiper.n.01', 'shorebird.n.01'),
 ('warbler.n.02', 'oscine.n.01')]</code></pre><p>This hierarchical data is structured as a list of 3,435 &#8220;<code>child</code> &#8594; <code>parent</code>&#8221; tuples (each node is paired with every ancestor up to the root, not just its direct parent). Let&#8217;s quickly plot it to see what the hierarchy looks like.</p><blockquote><p>Also, try using other roots such as <code>mathematics.n.01</code> or <code>mammal.n.01</code></p></blockquote><pre><code>import networkx as nx
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(18,14))
g = nx.Graph(data)
nx.draw(g, with_labels=True)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!36_B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!36_B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 424w, https://substackcdn.com/image/fetch/$s_!36_B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 848w, https://substackcdn.com/image/fetch/$s_!36_B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!36_B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!36_B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png" width="1310" height="1022" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1022,&quot;width&quot;:1310,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1629429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!36_B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 424w, https://substackcdn.com/image/fetch/$s_!36_B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 848w, https://substackcdn.com/image/fetch/$s_!36_B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!36_B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88280d28-4873-4051-97c5-68c9c8cf8fd3_1310x1022.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Not very readable but the local and global structure is somewhat visible.</p><p><strong>Let&#8217;s use Gensim to implement Poincare Embeddings.</strong> Want to know how easy it is? Just 2 lines of code as marked below - </p><pre><code>import numpy as np
from gensim.models.poincare import PoincareModel

#MODEL TRAINING
model = PoincareModel(data, size=50, negative=10)  #&lt;---
model.train(epochs=30)                             #&lt;---

#FETCH VECTORS AND STORE
vectors = {}
keys = model.kv.index_to_key
for k in keys:
    vectors[k] = model.kv.get_vector(k)

#STACKING EMBEDDINGS TO A NDARRAY
embeddings = np.stack(list(vectors.values()))

print(len(keys),'keys')
print(embeddings.shape[0], 'embeddings')
print(embeddings.shape[1], 'dimensions')</code></pre><pre><code>872 keys
872 embeddings
50 dimensions</code></pre><p>Great, so we have 872 nodes in our hierarchy and each one is represented with a 50-dimensional vector in this non-Euclidean space. Let&#8217;s use Gensim&#8217;s similarity score, which is based on the <strong>Poincar&#233; distance</strong>, to see if the hierarchical relationships are encoded in the model.</p><pre><code>print('owl vs hawk:', model.kv.similarity('owl.n.01','hawk.n.01'))

print('eagle vs bald_eagle:', model.kv.similarity('eagle.n.01','bald_eagle.n.01'))
print('golden_eagle vs bald_eagle:', model.kv.similarity('golden_eagle.n.01','bald_eagle.n.01'))
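
# Aside (a sketch for intuition, not Gensim's own code): the Poincare distance
# from the paper, d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))),
# written in plain NumPy. The name `poincare_distance` is assumed here for
# illustration; both points must lie strictly inside the unit ball.
import numpy as np

def poincare_distance(u, v):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_gap = np.sum((u - v) ** 2)                         # squared Euclidean gap
    shrink = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))  # tends to 0 near the boundary
    return np.arccosh(1 + 2 * sq_gap / shrink)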

print('owl vs barn_owl:', model.kv.similarity('owl.n.01','barn_owl.n.01'))
print('barn_owl vs spotted_owl:', model.kv.similarity('barn_owl.n.01','spotted_owl.n.01'))
print('barn_owl vs mandarin_duck:', model.kv.similarity('barn_owl.n.01','mandarin_duck.n.01'))</code></pre><pre><code>owl vs hawk: 0.31635084567961813

eagle vs bald_eagle: 0.8757754918787964
golden_eagle vs bald_eagle: 0.8758848440323967

owl vs barn_owl: 0.6982827095728305
barn_owl vs spotted_owl: 0.738832541978296
barn_owl vs mandarin_duck: 0.1329321668369148</code></pre><p>Excellent! Related nodes score higher, meaning they are closer to each other in this space, as expected! Let&#8217;s use <strong>Principal Component Analysis</strong> to plot this space in 3D, just to get a sense of how it is clustered.</p><blockquote><p><strong>NOTE</strong>: We are using PCA to project a non-Euclidean space into a 3-dimensional Euclidean space, so take it with a grain of salt!</p></blockquote><pre><code>import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

pca = PCA(3)
components = pca.fit_transform(embeddings)

keys = model.kv.index_to_key
cols = ['component1','component2','component3']
plot_df = pd.DataFrame(components, columns=cols, index=keys)

# PLOTTING ONLY 100 SAMPLE VECTORS
plot_df = plot_df.head(100)   #&lt;---

px.scatter_3d(plot_df, 
              x='component1', y='component2', z='component3', 
              text=plot_df.index, opacity=0.5, height=900, width=900)</code></pre><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e56fd4e8-9758-4988-8143-88cb9f0a075f_900x900.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a96493f-5673-4c86-b105-4cfd166f154d_900x900.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e2ce8af-f363-494d-ba6d-236eb0f07abf_900x900.png&quot;}],&quot;caption&quot;:&quot;PCA 3-D vector space&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f549558-ee33-465c-abc2-148e5f96d413_1456x474.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>And as expected, we see the outline of the sphere where these embeddings exist. Also, we see that the outer edge of the sphere has a higher density of nodes due to the nature of how these embeddings are mapped onto this curved space. The clusters of related nodes are quite visible in the additional zoomed-in views.</p><div><hr></div><h3>Want to use these in your neural networks?</h3><p>Worry not! You can use <code>tensorflow_addons.layers.PoincareNormalize</code> and add this layer on top of your Embedding layer!</p><pre><code>import numpy as np
from tensorflow.keras import layers, Model, utils
import tensorflow_addons as tfa

X = np.random.randint(0, 500, size=(100, 10))  #integer ids within the Embedding vocabulary of 500
y = np.random.random((100,))


inp = layers.Input((10,))
x = layers.Embedding(500, 5)(inp)
x = tfa.layers.PoincareNormalize(axis=-1)(x)  #&lt;-------
x = layers.Flatten()(x)
out = layers.Dense(1)(x)

model = Model(inp, out)
model.compile(optimizer='adam', loss='mse')  #continuous toy target with a linear output
utils.plot_model(model, show_shapes=True, show_layer_names=False)

model.fit(X, y, epochs=3)</code></pre><p><strong>So there you have it! Poincar&#233; Embeddings in all their glory. It&#8217;s a powerful concept that I hope you can now use as part of your model experiments when you have to work with hierarchical data!</strong></p><p></p>]]></content:encoded></item><item><title><![CDATA[Rolling windows / n-grams? Here are 10 ways to implement them in python!]]></title><description><![CDATA[#python #ngrams #rolling #iteration]]></description><link>https://www.blog.datasciencephilosophy.com/p/rolling-windows-n-grams-here-are</link><guid isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/rolling-windows-n-grams-here-are</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Fri, 24 Mar 2023 21:49:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/102d0a7e-8943-4c06-8469-1d52f51c3ec6_301x209.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Whether you are working with text, time series, or sequences of any type, sooner or later you will find yourself building a rolling window function or an n-gram generator to iterate over your sequence.</p><h4>Here are 10 ways you can implement rolling windows/n-grams in Python!</h4><p>Let&#8217;s start with a toy example as usual - </p><pre><code># Rolling window size / n in ngram
n = 3

# Example sentence / sequence
doc = "The quick brown fox jumps over the lazy dog"
tokens = doc.split(' ')
print(tokens)</code></pre><pre><code>['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']</code></pre><p>So, we have a humble yet overused sequence of tokens, and we need to iterate over it with a rolling window of size 3, which yields tri-grams. </p><p>Let&#8217;s get rolling (pun intended)!</p><h3>1. Classic <code>for</code> loop</h3><p>All we need to do here is iterate over the indices of the sequence and use basic <code>list</code> slicing as follows - </p><pre><code>out = []
for i in range(len(tokens)):
    out.append(tokens[i:i+n])
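
# Optional variant (a sketch; `full_windows` is a hypothetical helper, not part
# of the walkthrough): keep a slice only when it holds exactly `size` items,
# which drops the short tail windows. Usage would be full_windows(tokens, n).
def full_windows(seq, size):
    windows = []
    for i in range(len(seq)):
        window = seq[i:i+size]
        if len(window) == size:
            windows.append(window)
    return windows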

out</code></pre><pre><code>[['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog'],
 ['lazy', 'dog'],
 ['dog']]</code></pre><p>You can choose to skip the sublists which are smaller than the window size with a simple <code>if</code> condition.</p><h3>2. <a href="https://www.w3schools.com/python/python_lists_comprehension.asp">List Comprehension</a></h3><p>I know this might be a tiny bit of cheating, but some of us just love using list comprehensions everywhere, which are just another way of writing the above code.</p><pre><code>[tokens[i:i+n] for i in range(len(tokens))]</code></pre><pre><code>[['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog'],
 ['lazy', 'dog'],
 ['dog']]</code></pre><h3>3. <a href="https://www.w3schools.com/python/ref_func_zip.asp">ZIP</a> *</h3><p>This is my go-to method as I find it quite pythonic. Here is how you can use <code>zip</code> along with the unpacking operator <code>*</code> -</p><pre><code>list(zip(*[tokens[i:] for i in range(n)]))</code></pre><pre><code>[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]</code></pre><p><strong>How does this work?</strong> What's happening here is that list comprehension creates <code>L, L[1:], L[2:]</code> which then get zipped, meaning the first element of each of these (which are the 1st, 2nd, and 3rd elements of L) get clubbed together and the second elements get clubbed together, and so on. The unpacking operator just unpacks the list containing <code>L, L[1:], L[2:] </code>for <code>zip</code> to work with. Here is a helpful diagram that I made to explain this intuitively.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N4NT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N4NT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 424w, https://substackcdn.com/image/fetch/$s_!N4NT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 848w, https://substackcdn.com/image/fetch/$s_!N4NT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 1272w, https://substackcdn.com/image/fetch/$s_!N4NT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!N4NT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png" width="301" height="209" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:209,&quot;width&quot;:301,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15766,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N4NT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 424w, https://substackcdn.com/image/fetch/$s_!N4NT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 848w, https://substackcdn.com/image/fetch/$s_!N4NT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 1272w, https://substackcdn.com/image/fetch/$s_!N4NT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c081f5-ceb4-40e0-949e-a521fcd5d760_301x209.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>4. 
Pandas <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html">Rolling</a> Objects</h3><p>You might have used this for performing rolling averages or sums, but did you know that you can iterate over a pandas rolling object? Here is how you can do this -</p><pre><code>import pandas as pd
list(map(list,pd.Series(tokens).rolling(n)))</code></pre><pre><code>[['The'],
 ['The', 'quick'],
 ['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog']]</code></pre><p>All we have to do is to <code>map</code> a <code>list</code> class to typecast the iterable rolling object. Or you could use a list comprehension as well! However, if you want to get a &#8220;forward rolling window&#8221; style output, it just takes an extra step of defining the window type.</p><pre><code>import pandas as pd

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=n)
list(map(list,pd.Series(tokens).rolling(indexer)))</code></pre><pre><code>[['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog'],
 ['lazy', 'dog'],
 ['dog']]</code></pre><h3>5. Numpy <a href="https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html">Stride Tricks</a></h3><p>This one is a slightly more advanced method as it involves accessing a <code>numpy</code> array&#8217;s contiguous memory storage directly to create a &#8220;view&#8221;, but it&#8217;s super powerful and pretty much used universally behind the scenes for a lot of <code>numpy</code> high-level functions.</p><pre><code>import numpy as np

arr = np.array(tokens)
shape = (arr.shape[0] - n + 1, n)            # (7, 3)
strides = (arr.strides[0], arr.strides[0])   # (20, 20) bytes
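
# A safer alternative (a sketch, assuming NumPy 1.20 or newer): sliding_window_view
# works out the shape and strides itself and returns a read-only view. It is shown
# on a small stand-in array, `demo`, so the snippet runs on its own; with the data
# above you would pass arr and n instead.
demo = np.array(['a', 'b', 'c', 'd'])
safe_windows = np.lib.stride_tricks.sliding_window_view(demo, 3)   # shape (2, 3)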

np.lib.stride_tricks.as_strided(arr, shape=shape, strides=strides)</code></pre><pre><code>array([['The', 'quick', 'brown'],
       ['quick', 'brown', 'fox'],
       ['brown', 'fox', 'jumps'],
       ['fox', 'jumps', 'over'],
       ['jumps', 'over', 'the'],
       ['over', 'the', 'lazy'],
       ['the', 'lazy', 'dog']], dtype='&lt;U5')</code></pre><p>The &#8220;shape&#8221; is the expected output shape for the view of this <code>numpy</code> array and the &#8220;strides&#8221; are the number of bytes <code>numpy</code> has to step along each axis to reach the next element. Use this method at your own risk, as it can read or corrupt memory outside the array if not used properly!</p><h3>6. NLTK <a href="https://tedboy.github.io/nlps/generated/generated/nltk.ngrams.html">ngrams</a></h3><p>You can&#8217;t talk about n-grams without talking about NLTK. Most of us have learned our first implementations of n-grams using NLTK, and it works with any standard iterable!</p><pre><code>from nltk import ngrams
list(ngrams(tokens, n))</code></pre><pre><code>[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]</code></pre><h3>7. <code>more_itertools</code> <a href="https://pypi.org/project/more-itertools/">library</a></h3><p>This is a purpose-built library that extends the classic <code>itertools</code> library with some interesting functions. One of these is the <code>more_itertools.windowed </code>method.</p><pre><code>#pip install more_itertools
import more_itertools
list(more_itertools.windowed(tokens, n))</code></pre><pre><code>[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]</code></pre><h3>8. <code>toolz</code> <a href="https://toolz.readthedocs.io/en/latest/">library</a></h3><p>Another example of a library built as an extension to <code>itertools</code>, which comes inbuilt with the handy <code>sliding_window</code> method.</p><pre><code>#pip install toolz
import toolz
list(toolz.sliding_window(n, tokens))</code></pre><pre><code>[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]</code></pre><h3>9. Itertools <a href="https://docs.python.org/3/library/itertools.html">islice</a></h3><p>Implementing this with <code>itertools</code> is slightly more complex, but well worth learning as it exposes you to some underrated yet powerful <code>itertools</code> functions such as <code>islice</code> and <code>tee</code>.</p><pre><code>from itertools import islice, tee
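# The one-liner below, unrolled into a small helper for readability (a sketch;
# `sliding` is a hypothetical name): tee makes `size` independent iterators,
# islice(it, i, None) skips the first i items of the i-th copy, and zip
# stitches the staggered copies together, stopping at the shortest one.
def sliding(iterable, size):
    copies = tee(iterable, size)
    staggered = [islice(it, i, None) for i, it in enumerate(copies)]
    return zip(*staggered)
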
list(zip(*(islice(s, i, None) for i, s in enumerate(tee(tokens, n)))))</code></pre><pre><code>[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]</code></pre><h3>10. Scikit Learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a></h3><p>Lastly, this is kinda a cheat and mostly applicable to text documents/sentences but it&#8217;s something that data scientists use quite regularly as part of their modeling pipelines. The trick is to define your <code>CountVectorizer</code> with the <code>ngram_range</code> as <code>(n,n)</code> thus only fetching n-grams and not the uni-grams, bi-grams, etc.</p><pre><code>from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(n,n))
analyzer = cv.build_analyzer()
analyzer(doc)</code></pre><pre><code>['the quick brown',
 'quick brown fox',
 'brown fox jumps',
 'fox jumps over',
 'jumps over the',
 'over the lazy',
 'the lazy dog']</code></pre><p>Each n-gram is returned as a single space-joined (and, by default, lowercased) string rather than a tuple of tokens. Notice also that the input to the analyzer is the original document rather than the token list.</p><p><strong>So that&#8217;s 10 ways of quickly implementing rolling window iteration or n-grams in a given sequence using Python! Hope this has been useful for you, the reader!</strong></p><p></p>]]></content:encoded></item><item><title><![CDATA[Too many conditional statements? Try this faster, more pythonic way.]]></title><description><![CDATA[#pandas #conditionals #if-else #numpy #python]]></description><link>https://www.blog.datasciencephilosophy.com/p/too-many-conditional-statements-try</link><guid isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/too-many-conditional-statements-try</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Fri, 24 Mar 2023 02:24:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/806b9e53-24fb-4f19-ac62-32067d77023f_1456x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We have all faced this - a continuous barrage of <code>if</code>&#8217;s and <code>elif</code>&#8217;s over your existing <code>Pandas</code> DataFrame or <code>Numpy</code> array for a new model feature or a metric for your report. Sometimes, you just can&#8217;t avoid checking multiple complex conditions, one by one, to get where you need to.</p><p>Let's set up a toy example.</p><pre><code>import pandas as pd

df = pd.DataFrame({"name": ["Rick","Morty","Beth","Summer","Jerry"], 
                   "age": [70, 14, 35, 17, 35]})
print(df)</code></pre><pre><code>     name  age
0    Rick   70
1   Morty   14
2    Beth   35
3  Summer   17
4   Jerry   35</code></pre><p>Let&#8217;s say we want to create labels for the <strong>Smith family members</strong> based on age. A classic way might be to write a function with all of the necessary conditions and then apply that function row-wise.</p><pre><code>def age_group(row):
    if row.age &gt;= 0 and row.age &lt; 20:
        return '0 - 20 yrs'
    elif row.age &gt;= 20 and row.age &lt; 40:
        return '20 - 40 yrs'
    elif row.age &gt;= 40 and row.age &lt; 60:
        return '40 - 60 yrs'
    elif row.age &gt;= 60:
        return '60+ yrs'
    else:
        return 'invalid age'
    
df['apply_age'] = df.apply(age_group, axis=1)
print(df)</code></pre><pre><code>     name  age    apply_age
0    Rick   70      60+ yrs
1   Morty   14   0 - 20 yrs
2    Beth   35  20 - 40 yrs
3  Summer   17   0 - 20 yrs
4   Jerry   35  20 - 40 yrs</code></pre><p>But this is ugly and slow, mostly because a row-wise <code>pd.DataFrame.apply</code> in pandas is not <strong>vectorized</strong>: it calls your function once per row in a Python-level loop! So is there a more pythonic way to do this, one that is readable, cleaner, and faster at the same time?</p><div><hr></div><h3>Introducing Numpy&#8217;s <a href="https://numpy.org/doc/stable/reference/generated/numpy.select.html">select function</a></h3><p>This is how it works. You pass <code>numpy.select</code> a list of conditions, a list of values for each of those conditions (<code>if</code>, <code>elif</code>), and a default value (<code>else</code>).</p><pre><code>numpy.select(condlist, choicelist, default)</code></pre><p>So how does it work in action?</p><pre><code>import numpy as np

#Conditions
c1 = (df.age &gt;= 0) &amp; (df.age &lt; 20)
c2 = (df.age &gt;= 20) &amp; (df.age &lt; 40)
c3 = (df.age &gt;= 40) &amp; (df.age &lt; 60)
c4 = (df.age &gt;= 60)

#Choices
values = ['0 - 20 yrs', '20 - 40 yrs', '40 - 60 yrs', '60+ yrs']

#Default
default = 'invalid age'

df['select_age'] = np.select([c1,c2,c3,c4], values, default=default)
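
#Aside (an illustrative addition, not from the original post): where several
#conditions are True for the same element, np.select keeps the first match,
#just like an if/elif chain, so the order of the condition list matters.
#The names overlap and first_wins are hypothetical; np is the import above.
overlap = np.array([True, True])
first_wins = np.select([overlap, overlap], ['first', 'second'], default='none')
assert first_wins.tolist() == ['first', 'first']
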
print(df)</code></pre><pre><code>     name  age   select_age
0    Rick   70      60+ yrs
1   Morty   14   0 - 20 yrs
2    Beth   35  20 - 40 yrs
3  Summer   17   0 - 20 yrs
4   Jerry   35  20 - 40 yrs</code></pre><p>I hope you agree with me when I say this is significantly <em>cleaner</em> and more <em>readable</em>, especially when you imagine working with tens or hundreds of such conditions, each more complex than these!</p><p>But, as you might remember, I claimed earlier that this is also much <em>faster</em> than the traditional way. That is because it is vectorized, unlike <code>pd.DataFrame.apply</code>, which applies your function one row at a time.</p><h4>How fast, you ask?</h4><p>Let&#8217;s scale the data and test this out with <code>%%timeit</code>:</p><pre><code>df = pd.concat([df]*10000)  #Repeating the dataframe 10k times</code></pre><pre><code>%%timeit for apply_method
539 ms &#177; 3.48 ms per loop

%%timeit for np_select_method
3.84 ms &#177; 52.7 &#181;s per loop</code></pre><p>That&#8217;s <strong>140x faster</strong> at just 50k rows and this only gets better and better as the data size increases!</p><p><strong>So next time you are working with a ton of </strong><code>if&#8217;</code><strong>s and </strong><code>elif&#8217;</code><strong>s, know that there is a more pythonic way to do this which your fellow Data Scientists will thank you for!</strong></p>]]></content:encoded></item><item><title><![CDATA[Groupby & Aggregate using Pandas]]></title><description><![CDATA[#python #pandas #groupby #aggregate]]></description><link>https://www.blog.datasciencephilosophy.com/p/groupby-and-aggregate-using-pandas</link><guid isPermaLink="false">https://www.blog.datasciencephilosophy.com/p/groupby-and-aggregate-using-pandas</guid><dc:creator><![CDATA[Akshay Sehgal]]></dc:creator><pubDate>Fri, 17 Mar 2023 00:04:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!D4Gq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Table of contents:</h3><ol><li><p>Introduction</p></li><li><p>Syntax<br>2.1 Adding more groups/levels<br>2.2 Adding more variables/features</p></li><li><p>Grouping</p></li><li><p>Aggregation<br>4.1 In-built aggregation methods<br>4.2 Custom functions with pandas apply<br>4.3 Multiple aggregations using agg method<br>4.4 Custom functions with agg method</p></li><li><p>Transform</p></li><li><p>Advanced Usage<br>6.1 Sequential/local grouping of a dataframe<br>6.2 Re-indexing to a fixed date range for each group</p></li><li><p>Other ways of grouping data<br>7.1 Using collections' defaultdict<br>7.2 Using numpy's split function<br>7.3 Using itertools' groupby</p></li><li><p>References</p></li></ol><h2>1. 
Introduction</h2><p>"Groupby" is probably one of the most basic data pre-processing steps that a Data Scientist should master as soon as possible. Interestingly enough, you find it in almost every scripting language that claims to work well with databases.</p><p>Most of us would have been introduced to the <strong>SQL GROUP BY</strong> clause, which allows a user to summarize or aggregate a given dataset. Python brings the pandas groupby method to the table, which is highly pythonic in its syntax and equally versatile, if not more so. But a groupby is useful for much more than just aggregation. In this notebook, I will showcase a few examples where you could really exploit this method for various other use cases.</p><pre><code><em>#SQL Query to groupby Col1 and Col2</em>
<em>#and get mean and sum of Col3 and Col4 respectively</em>

SELECT Col1, Col2, AVG(Col3), SUM(Col4)
FROM Table
GROUP BY Col1, Col2</code></pre><p>Before we can start writing code, let's explore the basics behind a groupby operation. The core concept behind any groupby operation is a three-step process called <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html">Split-Apply-Combine</a>.</p><ul><li><p><strong>Split:</strong> Splitting the data into groups based on some criteria</p></li><li><p><strong>Apply:</strong> Applying a function to each group independently</p></li><li><p><strong>Combine:</strong> Combining the results into a data structure</p></li></ul><p>Here is a diagram to make this more intuitive.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D4Gq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D4Gq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 424w, https://substackcdn.com/image/fetch/$s_!D4Gq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 848w, https://substackcdn.com/image/fetch/$s_!D4Gq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 1272w, https://substackcdn.com/image/fetch/$s_!D4Gq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D4Gq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png" width="431" height="369" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:431,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22982,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D4Gq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 424w, https://substackcdn.com/image/fetch/$s_!D4Gq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 848w, https://substackcdn.com/image/fetch/$s_!D4Gq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 1272w, https://substackcdn.com/image/fetch/$s_!D4Gq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b34f582-2eb2-42b5-a4bd-90ea5169ec16_431x369.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>2. Syntax</h2><p>The syntax for using a groupby method in Pandas comprises 2 parts. The first is a grouper object and the second is the aggregator. The general structure looks like the following -</p><pre><code><code>        df.groupby(['groups'])['cols'].aggregations()
        |____________________||_____________________|
                   |                     |
            grouper(split)  aggregation(apply &amp; combine)</code></code></pre><p>This will become clearer with an example on actual data. So, let's start by loading a dataset to work with. For this notebook, I will use the Titanic dataset, which can either be downloaded from <a href="https://www.kaggle.com/c/titanic">Kaggle</a> or loaded directly using the visualization library <a href="https://github.com/mwaskom/seaborn-data">Seaborn</a>.</p><pre><code><em>#Load dependencies</em>
<strong>import</strong> pandas <strong>as</strong> pd
<strong>import</strong> numpy <strong>as</strong> np
<strong>import</strong> seaborn <strong>as</strong> sns

titanic <strong>=</strong> sns<strong>.</strong>load_dataset('titanic')<strong>.</strong>dropna() <em>#drop missing data</em>

print(titanic<strong>.</strong>shape)
titanic<strong>.</strong>head()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5BqC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5BqC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 424w, https://substackcdn.com/image/fetch/$s_!5BqC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 848w, https://substackcdn.com/image/fetch/$s_!5BqC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 1272w, https://substackcdn.com/image/fetch/$s_!5BqC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5BqC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png" width="1456" height="295" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:295,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88470,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5BqC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 424w, https://substackcdn.com/image/fetch/$s_!5BqC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 848w, https://substackcdn.com/image/fetch/$s_!5BqC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 1272w, https://substackcdn.com/image/fetch/$s_!5BqC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F600f8804-004f-44dc-9177-df2a0272d8bb_1826x370.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Question:</strong> What is the total number of passengers from each class who survived?</p><pre><code><em>#Step 1: Create grouper</em>
grouper <strong>=</strong> titanic<strong>.</strong>groupby(['class'])

<em>#Step 2: Filter column and apply aggregation</em>
grouper['survived']<strong>.</strong>sum()<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tdJW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tdJW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 424w, https://substackcdn.com/image/fetch/$s_!tdJW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 848w, https://substackcdn.com/image/fetch/$s_!tdJW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 1272w, https://substackcdn.com/image/fetch/$s_!tdJW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tdJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png" width="1456" height="174" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:174,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tdJW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 424w, https://substackcdn.com/image/fetch/$s_!tdJW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 848w, https://substackcdn.com/image/fetch/$s_!tdJW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 1272w, https://substackcdn.com/image/fetch/$s_!tdJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b86e52-7b4a-4d39-bbfa-2afdbeabd68b_1820x218.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>You would usually do this in a single statement as the following:</p><pre><code>titanic<strong>.</strong>groupby(['class'])['survived']<strong>.</strong>sum()<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!3g7y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3g7y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 424w, https://substackcdn.com/image/fetch/$s_!3g7y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 848w, https://substackcdn.com/image/fetch/$s_!3g7y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 1272w, https://substackcdn.com/image/fetch/$s_!3g7y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3g7y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png" width="1456" height="188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3g7y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 424w, https://substackcdn.com/image/fetch/$s_!3g7y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 848w, https://substackcdn.com/image/fetch/$s_!3g7y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 1272w, https://substackcdn.com/image/fetch/$s_!3g7y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0837ef-be69-4947-be70-792399e7efc3_1814x234.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>Note:</strong> The <code>reset_index()</code> helps bring the grouping columns from index, back as a column in a dataframe.</p></blockquote><p><strong>Question:</strong> What was the average fare for passengers from each 
town?</p><pre><code>titanic<strong>.</strong>groupby(['embark_town'])['fare']<strong>.</strong>mean()<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IxBg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IxBg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 424w, https://substackcdn.com/image/fetch/$s_!IxBg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 848w, https://substackcdn.com/image/fetch/$s_!IxBg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 1272w, https://substackcdn.com/image/fetch/$s_!IxBg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IxBg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png" width="1456" height="180" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:180,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34199,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IxBg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 424w, https://substackcdn.com/image/fetch/$s_!IxBg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 848w, https://substackcdn.com/image/fetch/$s_!IxBg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 1272w, https://substackcdn.com/image/fetch/$s_!IxBg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85efa718-053f-43fa-8c3f-f5cf9ea69f0b_1832x226.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>2.1 Adding more groups/levels</h4><p>We can pass a list of features in the <code>groupby()</code> to increase the levels for grouping the data as below.</p><p><strong>Question:</strong> What was the average fare for male vs female passengers from each 
town?</p><pre><code>titanic<strong>.</strong>groupby(['embark_town','sex'])['fare']<strong>.</strong>mean()<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ni3z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ni3z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 424w, https://substackcdn.com/image/fetch/$s_!ni3z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 848w, https://substackcdn.com/image/fetch/$s_!ni3z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 1272w, https://substackcdn.com/image/fetch/$s_!ni3z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ni3z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png" width="1456" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ni3z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 424w, https://substackcdn.com/image/fetch/$s_!ni3z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 848w, https://substackcdn.com/image/fetch/$s_!ni3z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 1272w, https://substackcdn.com/image/fetch/$s_!ni3z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b476773-0027-4b13-95d8-4b423bdb1fdc_1818x384.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>2.2 Adding more variables/features</h4><p>Similarly, we can select a list of variables for which you need to apply the aggregate function.</p><p><strong>Question:</strong> What was the average fare and age for male vs female passengers from each town?</p><pre><code>titanic<strong>.</strong>groupby(['embark_town','sex'])[['fare', 
'age']]<strong>.</strong>mean()<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o3AO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o3AO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 424w, https://substackcdn.com/image/fetch/$s_!o3AO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 848w, https://substackcdn.com/image/fetch/$s_!o3AO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 1272w, https://substackcdn.com/image/fetch/$s_!o3AO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o3AO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png" width="1456" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72570,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o3AO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 424w, https://substackcdn.com/image/fetch/$s_!o3AO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 848w, https://substackcdn.com/image/fetch/$s_!o3AO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 1272w, https://substackcdn.com/image/fetch/$s_!o3AO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cdc74d2-7dc0-4498-bf1a-47cfbfd7a27e_1818x388.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>3. 
Grouping</h2><p>Before we move on to more complex scenarios, let's understand the data structures we are working with, so that we can be more creative with our approaches and build a deeper understanding of how they work.</p><p>You can think of the pipeline in the code above as follows:</p><ul><li><p><strong>Step 1:</strong> Create a grouper object with <code>titanic.groupby(['embark_town'])</code>, which splits the data into the relevant groups</p></li><li><p><strong>Step 2:</strong> Select the column <code>'fare'</code> from each of those groups</p></li><li><p><strong>Step 3:</strong> Apply <code>mean()</code> to this column for each group, then combine the results and return the aggregated dataset</p></li></ul><p>Let's look at the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html">grouper object</a> to understand it better.</p><pre><code>grouper <strong>=</strong> titanic<strong>.</strong>groupby(['embark_town'])

<em>#Print dtype for each of the elements in the grouper</em>
[(type(k),type(g)) <strong>for</strong> k,g <strong>in</strong> grouper]</code></pre><pre><code><code>[(str, pandas.core.frame.DataFrame),
 (str, pandas.core.frame.DataFrame),
 (str, pandas.core.frame.DataFrame)]</code></code></pre><p>So, iterating over the grouper object yields one tuple per group, each holding the group key and a dataframe.</p><p>Let's see what each of those contains.</p><pre><code><em>#Print the key and the dataframe shape for each group</em>
[(k,g<strong>.</strong>shape) <strong>for</strong> k,g <strong>in</strong> grouper]</code></pre><pre><code><code>[('Cherbourg', (65, 15)), ('Queenstown', (2, 15)), ('Southampton', (115, 15))]</code></code></pre><p>The key of each tuple is a value from the grouping column (in this case <code>embark_town</code>), and the second element is simply the complete dataframe filtered for that value! If we print one of the dataframes from this grouper, we can see that every row in this slice of data has <code>Queenstown</code> as its <code>embark_town</code>, as shown below.</p><pre><code>print(list(grouper)[1][0]) <em>#print key</em>
list(grouper)[1][1]        <em>#print dataframe</em></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I8ek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I8ek!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 424w, https://substackcdn.com/image/fetch/$s_!I8ek!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 848w, https://substackcdn.com/image/fetch/$s_!I8ek!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 1272w, https://substackcdn.com/image/fetch/$s_!I8ek!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I8ek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png" width="1456" height="181" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:181,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I8ek!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 424w, https://substackcdn.com/image/fetch/$s_!I8ek!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 848w, https://substackcdn.com/image/fetch/$s_!I8ek!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 1272w, https://substackcdn.com/image/fetch/$s_!I8ek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd642efb-6b3e-45d8-8127-3e8562002a14_1814x226.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Similarly, let's see what the grouper object looks like for multiple grouping features. The 'key' in this case is just a tuple with all the group combinations, which, after aggregation, gets set as the index of the final output.</p><pre><code>grouper <strong>=</strong> titanic<strong>.</strong>groupby(['embark_town','sex'])

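<em>#Added sketch, not from the original post: groups maps each key to the row labels of its group,</em>
<em>#so the (embark_town, sex) keys can be listed without iterating over the dataframes.</em>
keys <strong>=</strong> list(grouper<strong>.</strong>groups)
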
<em>#Print the key and dataframe shape for each group in the grouper</em>
[(k,g<strong>.</strong>shape) <strong>for</strong> k,g <strong>in</strong> grouper]</code></pre><pre><code><code>[(('Cherbourg', 'female'), (34, 15)),
 (('Cherbourg', 'male'), (31, 15)),
 (('Queenstown', 'female'), (1, 15)),
 (('Queenstown', 'male'), (1, 15)),
 (('Southampton', 'female'), (53, 15)),
 (('Southampton', 'male'), (62, 15))]</code></code></pre><pre><code><em>#The grouping columns become the index after the groupby aggregation</em>
titanic<strong>.</strong>groupby(['embark_town','sex'])['age']<strong>.</strong>mean()</code></pre><pre><code><code>embark_town  sex   
Cherbourg    female    35.352941
             male      39.774194
Queenstown   female    33.000000
             male      44.000000
Southampton  female    30.952830
             male      37.595484
Name: age, dtype: float64</code></code></pre><h2>4. Aggregation</h2><p>There are multiple ways of aggregating your grouper object.</p><ol><li><p>First, you can apply one aggregation to multiple columns, apply multiple aggregations to different columns, or combine both.</p></li><li><p>Second, you can use <code>apply()</code> or <code>agg()</code> to write your own custom aggregators, but Pandas makes things much easier by providing a ton of in-built aggregators such as <code>sum()</code> or <code>mean()</code>, which we used in the examples above.</p></li></ol><p>Let's go through a few scenarios and explore how we can use these aggregations.</p><h4>4.1 In-built aggregation methods</h4><p>Pandas provides a ton of aggregation methods to quickly get the statistics you are looking for. A few of the common ones are listed below; more details can be found in the official <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html">pandas documentation</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Me8K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Me8K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Me8K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 848w, 
https://substackcdn.com/image/fetch/$s_!Me8K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!Me8K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Me8K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png" width="1456" height="881" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Me8K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Me8K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 848w, 
https://substackcdn.com/image/fetch/$s_!Me8K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!Me8K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591dfec3-5f6d-4918-88a4-34389690dc9a_1784x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>4.2 Custom functions with pandas apply</h4><p>This is by far the most popular way of applying a custom function to a dataframe, or 
in this case, applying it to each of the dataframe slices for the groups defined by the grouper. The behavior of the <code>apply()</code> method with groupby is similar to the standard one.</p><p>For each group, the function receives a dataframe (if more than one column is selected for aggregation) or a series (if a single column is selected). Within the function, you can either work directly with the individual series or just write your own lambda function. Here are a few ways to use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html">apply</a> function.</p><p><strong>Question:</strong> Get the unique set of ages for each age category (<code>who</code> column) from each town.</p><pre><code>titanic<strong>.</strong>groupby(['embark_town', 'who'])['age']<strong>.</strong>apply(set)<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7lJ7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7lJ7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 424w, https://substackcdn.com/image/fetch/$s_!7lJ7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 848w, https://substackcdn.com/image/fetch/$s_!7lJ7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7lJ7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7lJ7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png" width="1456" height="350" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57320841-f618-4f53-9706-60b754974efe_1814x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:350,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86715,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7lJ7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 424w, https://substackcdn.com/image/fetch/$s_!7lJ7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 848w, https://substackcdn.com/image/fetch/$s_!7lJ7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7lJ7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57320841-f618-4f53-9706-60b754974efe_1814x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Question:</strong> Get the range (min-max) of ages for each age category (<code>who</code> column) from each town.</p><pre><code>titanic<strong>.</strong>groupby(['embark_town', 'who'])['age']<strong>.</strong>apply(<strong>lambda</strong> x: x<strong>.</strong>max()<strong>-</strong>x<strong>.</strong>min())<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dg5d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dg5d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 424w, https://substackcdn.com/image/fetch/$s_!Dg5d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 848w, https://substackcdn.com/image/fetch/$s_!Dg5d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 1272w, https://substackcdn.com/image/fetch/$s_!Dg5d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dg5d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png" width="1456" height="349" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95acc984-d407-4167-838b-9345e5d67211_1810x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:349,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dg5d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 424w, https://substackcdn.com/image/fetch/$s_!Dg5d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 848w, https://substackcdn.com/image/fetch/$s_!Dg5d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 1272w, https://substackcdn.com/image/fetch/$s_!Dg5d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95acc984-d407-4167-838b-9345e5d67211_1810x434.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Question:</strong> Get the mean fare-by-age ratio for each age category 
(<code>who</code> column) from each town.</p><pre><code>titanic<strong>.</strong>groupby(['embark_town', 'who'])<strong>.</strong>apply(<strong>lambda</strong> x: (x['fare']<strong>/</strong>x['age'])<strong>.</strong>mean())</code></pre><pre><code><code>embark_town  who  
Cherbourg    man       3.146083
             woman     3.614103
Queenstown   man       2.045455
             woman     2.727273
Southampton  child    28.956893
             man       1.410745
             woman     2.593897
dtype: float64</code></code></pre><h4>4.3 Multiple aggregations using agg method</h4><p>Sooner or later, you will need to apply multiple aggregations over multiple columns at once. This is where the <code>agg()</code> method comes in. Here is how you can use multiple in-built functions over multiple columns at once.</p><p>The general way to do this is to create a dictionary that maps each column to the aggregation(s) you want and pass it to the <code>agg()</code> function. There are a few ways to structure the dictionary:</p><pre><code><code>##Single function per column
{
 'column1': 'function1', 
 'column2': 'function2'
}

##Multiple functions per column
{
 'column1': ['function1', 'function2'], 
 'column2': ['function3', 'function4']
}</code></code></pre><p><strong>Question:</strong> Get the mean of fare, AND median of age for each age category (<code>who</code> column) from each town</p><pre><code><code>#Define aggregations as a dictionary
g = {'fare':'mean', 
     'age':'median'
    }

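#Added sketch, not from the original post: as_index=False is another way to keep the
#grouping columns as regular columns, instead of calling reset_index() afterwards.
alt = titanic.groupby(['embark_town', 'who'], as_index=False).agg(g)
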
titanic.groupby(['embark_town', 'who']).agg(g).reset_index()</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kou8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kou8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 424w, https://substackcdn.com/image/fetch/$s_!Kou8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 848w, https://substackcdn.com/image/fetch/$s_!Kou8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 1272w, https://substackcdn.com/image/fetch/$s_!Kou8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kou8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png" width="1456" height="353" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:353,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77893,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kou8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 424w, https://substackcdn.com/image/fetch/$s_!Kou8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 848w, https://substackcdn.com/image/fetch/$s_!Kou8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 1272w, https://substackcdn.com/image/fetch/$s_!Kou8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c9ed4f0-7c91-40f4-8e8a-7f56e40e62e7_1814x440.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Question:</strong> Get the sum &amp; mean of fare, AND median, min, and max of age for each age category (<code>who</code> column) from each town</p><pre><code><em>#Define aggregations as a dictionary</em>
g <strong>=</strong> {'fare':['sum', 'mean'], 
     'age':['median', 'min', 'max']
    }

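<em>#Added sketch, not from the original post: a list of functions per column gives two-level</em>
<em>#column labels such as ('fare', 'sum'). Assuming flat names are preferred, join the levels:</em>
flat <strong>=</strong> titanic<strong>.</strong>groupby(['embark_town', 'who'])<strong>.</strong>agg(g)
flat<strong>.</strong>columns <strong>=</strong> ['_'<strong>.</strong>join(col) <strong>for</strong> col <strong>in</strong> flat<strong>.</strong>columns]
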
titanic<strong>.</strong>groupby(['embark_town', 'who'])<strong>.</strong>agg(g)<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lWg_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lWg_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 424w, https://substackcdn.com/image/fetch/$s_!lWg_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 848w, https://substackcdn.com/image/fetch/$s_!lWg_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 1272w, https://substackcdn.com/image/fetch/$s_!lWg_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lWg_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png" width="1456" height="394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101532,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lWg_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 424w, https://substackcdn.com/image/fetch/$s_!lWg_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 848w, https://substackcdn.com/image/fetch/$s_!lWg_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 1272w, https://substackcdn.com/image/fetch/$s_!lWg_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d81e87-600c-47ad-84f8-9e7756c97b99_1818x492.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As of <strong>Pandas &gt;= 0.25</strong>, you can also use named aggregation: pass <code>agg</code> keyword arguments of the form <code>new_column=('column', 'function')</code>, which names each output column directly.</p><p>Let's demonstrate that with an example.</p><p><strong>Question:</strong> Get the sum &amp; mean of fare, AND min and max of age for each age category (<code>who</code> column) from each town, but rename columns</p><pre><code><em>#Define each output column as new_name=('column', 'function')</em>
titanic<strong>.</strong>groupby(['embark_town', 'who'])<strong>.</strong>agg(A<strong>=</strong>('fare', 'sum'), 
                                            B<strong>=</strong>('fare', 'mean'),
                                            C<strong>=</strong>('age', 'min'),
                                            D<strong>=</strong>('age', 'max'))<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gp9E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gp9E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 424w, https://substackcdn.com/image/fetch/$s_!gp9E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 848w, https://substackcdn.com/image/fetch/$s_!gp9E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 1272w, https://substackcdn.com/image/fetch/$s_!gp9E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gp9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png" width="1456" height="353" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:353,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91768,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gp9E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 424w, https://substackcdn.com/image/fetch/$s_!gp9E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 848w, https://substackcdn.com/image/fetch/$s_!gp9E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 1272w, https://substackcdn.com/image/fetch/$s_!gp9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a28f2cf-7246-4aa5-97b1-5991b43675dd_1822x442.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>4.4 Custom functions with agg method</h4><p>As you might think, just modifying the aggregate functions to include lambda functions is a way to create your own custom functions applied to specific columns. Here are a few examples.</p><pre><code><em>#Define aggregations as a dictionary</em>
g <strong>=</strong> {'fare':<strong>lambda</strong> x: x<strong>.</strong>sum(), 
     'age' :<strong>lambda</strong> x: x<strong>.</strong>max()
    }

titanic<strong>.</strong>groupby(['embark_town', 'who'])<strong>.</strong>agg(g)<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qz0o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qz0o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 424w, https://substackcdn.com/image/fetch/$s_!qz0o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 848w, https://substackcdn.com/image/fetch/$s_!qz0o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 1272w, https://substackcdn.com/image/fetch/$s_!qz0o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qz0o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png" width="1456" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7747db36-e283-424e-b8bd-48fef889455e_1816x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qz0o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 424w, https://substackcdn.com/image/fetch/$s_!qz0o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 848w, https://substackcdn.com/image/fetch/$s_!qz0o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 1272w, https://substackcdn.com/image/fetch/$s_!qz0o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7747db36-e283-424e-b8bd-48fef889455e_1816x426.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>5. Transform</h2><p>Apart from just aggregating, you can use groupby to <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html">transform</a> columns based on the grouper object. 
This is done with the <code>transform()</code> method, which returns a result with the same number of rows as the original dataset, where each row holds the value computed for the group it belongs to. Let's consider the following question.</p><p><strong>Question:</strong> Create a new column that returns the average fare for the age group (<code>who</code> column) the passenger belongs to.</p><pre><code>titanic<strong>.</strong>groupby('who')['fare']<strong>.</strong>transform(<strong>lambda</strong> x: x<strong>.</strong>mean())</code></pre><pre><code>1      88.817429
3      88.817429
6      69.821026
10     77.379485
11     88.817429
         ...    
871    88.817429
872    69.821026
879    88.817429
887    88.817429
889    69.821026
Name: fare, Length: 182, dtype: float64</code></pre><p>Notice that the output series has the same length as the titanic dataframe (182 rows here), but contains only 3 unique values <code>[88.8, 69.8, 77.4]</code>, one for each of <code>['woman', 'man', 'child']</code>. This makes the grouping object highly versatile for data preprocessing.</p><h2>6. Advanced Usage</h2><p>Let's look at a few advanced cases where groupby is useful for data preprocessing.</p><h4>6.1 Sequential/local grouping of a dataframe</h4><p>The grouper doesn&#8217;t need to come from the dataframe itself. As long as its length equals the number of rows in the dataframe, you can pass any list or array to <code>groupby</code> as the grouper.</p><pre><code>df <strong>=</strong> pd<strong>.</strong>DataFrame({'A':[1,2,3,4,8,10,12,13],
                   'B':[1,2,2,3,1,3,2,3]
                  })

<em>#custom grouping</em>
even_odd <strong>=</strong> ['even' <strong>if</strong> i<strong>%2</strong>==0 else 'odd' for i in df['A']]

df<strong>.</strong>groupby(even_odd)['B']<strong>.</strong>mean()</code></pre><pre><code>even    2.2
odd     2.0
Name: B, dtype: float64</code></pre><p><strong>Question</strong>: Get the sum of a value column (<code>B</code> below) for each run of consecutive equal values in a category column (<code>A</code> below), i.e., in <code>[1,1,2,2,1,1]</code> the first group of <code>1's</code> should be a separate group from the second set of <code>1's</code>.</p><blockquote><p>We can solve this by creating a custom grouper: shift the column by one row and compare it with the original. The comparison is <code>True</code> wherever a value differs from the one before it, so taking a <code>cumsum</code> over the booleans gives a group label that increases each time the value changes. Here is the solution for a <a href="https://stackoverflow.com/a/69287629/4755954">similar problem</a> I solved on Stack Overflow.</p></blockquote><pre><code>df <strong>=</strong> pd<strong>.</strong>DataFrame({'A':[1,1,2,2,2,1,1,3,3], <em>#&lt;- column to group on</em>
                   'B':[1,7,2,4,1,8,2,1,3]  <em>#&lt;- column to aggregate</em>
                  })

<em>#label each run of equal values: here grouper is [1, 1, 2, 2, 2, 3, 3, 4, 4]</em>
grouper <strong>=</strong> (df['A']<strong>!=</strong>df['A']<strong>.</strong>shift())<strong>.</strong>cumsum()

df<strong>.</strong>groupby(grouper)<strong>.</strong>agg({'A':'mean','B':'sum'})<strong>.</strong>reset_index(drop<strong>=True</strong>)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7E3c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7E3c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 424w, https://substackcdn.com/image/fetch/$s_!7E3c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 848w, https://substackcdn.com/image/fetch/$s_!7E3c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 1272w, https://substackcdn.com/image/fetch/$s_!7E3c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7E3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png" width="1456" height="218" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7E3c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 424w, https://substackcdn.com/image/fetch/$s_!7E3c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 848w, https://substackcdn.com/image/fetch/$s_!7E3c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 1272w, https://substackcdn.com/image/fetch/$s_!7E3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22586ea-6ff1-4294-b6ad-7604d499725c_1820x272.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>6.2 Re-indexing to a fixed date range for each group</h4><p><strong>Question</strong>: A dataframe only contains rows for a few dates for each <code>id</code>. The goal is to re-index the dataframe for a fixed date range, but for each of the <code>id</code> individually. 
Also, fill in the missing data with 0 values.</p><blockquote><p>Here we can create a custom index using <code>pandas.date_range</code>. Then, after setting the original date column as the index, we can apply <code>pandas.DataFrame.reindex</code> along with groupby on the <code>id</code> column to reindex each group to the new date range, filling empty values with 0.</p></blockquote><pre><code>d <strong>=</strong> {'id': [11, 11, 11, 11, 13, 13, 13],
     'date': ['2017-06-01','2017-06-03','2017-06-05','2017-06-06','2017-06-01','2017-06-02','2017-06-07'],
     'value': [1, 7, 8, 2, 9, 2, 11]
    }

df <strong>=</strong> pd<strong>.</strong>DataFrame(d)
df['date'] <strong>=</strong> pd<strong>.</strong>to_datetime(df['date'])
print("Input dataframe:")
df</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s1Rq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s1Rq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 424w, https://substackcdn.com/image/fetch/$s_!s1Rq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 848w, https://substackcdn.com/image/fetch/$s_!s1Rq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 1272w, https://substackcdn.com/image/fetch/$s_!s1Rq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s1Rq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png" width="1456" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s1Rq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 424w, https://substackcdn.com/image/fetch/$s_!s1Rq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 848w, https://substackcdn.com/image/fetch/$s_!s1Rq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 1272w, https://substackcdn.com/image/fetch/$s_!s1Rq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F239a6268-4c23-4b77-b39c-127594ac66b6_1820x468.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><pre><code><em>#custom date range</em>
idx <strong>=</strong> pd<strong>.</strong>date_range('2017-06-01','2017-06-07')

<em>#set original date column as index</em>
df<strong>.</strong>set_index('date', inplace<strong>=True</strong>)

<em>#groupby and apply pd.DataFrame.reindex to apply the new index, filling missing values with 0</em>
df<strong>.</strong>groupby('id')<strong>.</strong>apply(pd<strong>.</strong>DataFrame<strong>.</strong>reindex, idx, fill_value<strong>=</strong>0)<strong>.</strong>drop(columns<strong>=</strong>'id', errors<strong>=</strong>'ignore')<strong>.</strong>reset_index()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o00u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o00u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 424w, https://substackcdn.com/image/fetch/$s_!o00u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 848w, https://substackcdn.com/image/fetch/$s_!o00u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 1272w, https://substackcdn.com/image/fetch/$s_!o00u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o00u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png" width="1456" height="642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93501,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o00u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 424w, https://substackcdn.com/image/fetch/$s_!o00u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 848w, https://substackcdn.com/image/fetch/$s_!o00u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 1272w, https://substackcdn.com/image/fetch/$s_!o00u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07c4bd5-0a4c-416e-8373-8f4cc3fe57ab_1818x802.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>7. Other ways of grouping data</h2><p>Here I discuss 3 ways that are popularly used to group data depending on the data structures and libraries you are already working with.</p><ul><li><p>Grouping using <code>collections.defaultdict</code></p></li><li><p>Using <code>numpy.split()</code> to group an array</p></li><li><p>Chunking into groups using <code>itertools.groupby()</code></p></li></ul><p>Let's say we have a list of tuples with keys and values that we need to group.</p><pre><code>data <strong>=</strong> list(zip(np<strong>.</strong>random<strong>.</strong>randint(0,4,(10,)), np<strong>.</strong>random<strong>.</strong>randint(0,100,(10,))))
print(data)
</code></pre><pre><code>[(1, 41), (1, 30), (2, 70), (3, 82), (2, 68), (0, 18), (3, 97), (1, 37), (3, 8), (0, 51)]
</code></pre><h4>7.1 Using collections' defaultdict</h4><p>A useful way of grouping data is to use <code>defaultdict</code>. A <a href="https://docs.python.org/3/library/collections.html#collections.defaultdict">defaultdict</a> can store the grouping keys as dictionary keys and collect the values for each key in a list (or in the result of a custom function applied to them).</p><pre><code><strong>from</strong> collections <strong>import</strong> defaultdict

d <strong>=</strong> defaultdict(list)

<strong>for</strong> k,v <strong>in</strong> data:
    d[k]<strong>.</strong>append(v)
    
grouped_data <strong>=</strong> dict(d)

print(grouped_data)
</code></pre><pre><code>{1: [41, 30, 37], 2: [70, 68], 3: [82, 97, 8], 0: [18, 51]}
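</code></pre><p>As a variation, the values can be aggregated while grouping instead of being collected into lists. The sketch below is not part of the original post: it assumes the same sample <code>data</code> as printed earlier, and the name <code>totals</code> is only illustrative. Because <code>defaultdict(int)</code> starts every missing key at 0, the running sum needs no explicit initialization.</p><pre><code>from collections import defaultdict

# Same (key, value) pairs as the sample output printed earlier
data = [(1, 41), (1, 30), (2, 70), (3, 82), (2, 68),
        (0, 18), (3, 97), (1, 37), (3, 8), (0, 51)]

# defaultdict(int) creates missing keys with the value 0,
# so each value can be added to its key's running total directly
totals = defaultdict(int)
for k, v in data:
    totals[k] += v

print(dict(totals))  # {1: 108, 2: 138, 3: 187, 0: 69}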
</code></pre><h4>7.2 Using numpy's split function</h4><p>Another way of <a href="https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function/43094244">splitting an array</a> into a list of sub-arrays based on a grouping key is to use <code>np.split</code> together with the start index of each group, as returned by <code>np.unique</code> with <code>return_index=True</code>. The one requirement is that the array must first be sorted by the grouping key.</p><pre><code><strong>import</strong> numpy <strong>as</strong> np

<em>#sorted numpy array (sorted by the grouping column)</em>
a <strong>=</strong> np<strong>.</strong>array(data)
a <strong>=</strong> a[np<strong>.</strong>argsort(a[:, 0])]

<em>#Take the index positions for the unique values using return_index </em>
<em>#and start from the second one to split the data</em>
groups <strong>=</strong> np<strong>.</strong>split(a[:,1], np<strong>.</strong>unique(a[:,0], return_index<strong>=True</strong>)[1][1:])
print(groups)
</code></pre><pre><code>[array([18, 51]), array([41, 30, 37]), array([70, 68]), array([82, 97,  8])]
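</code></pre><p>The split above returns only the value arrays, so the keys have to be tracked separately. A minimal sketch of one way to keep them together follows; it is an illustration rather than the original post's method. It assumes the same sample <code>data</code>, uses illustrative names (<code>keys</code>, <code>first_idx</code>, <code>grouped</code>), and passes <code>kind='stable'</code> to <code>np.argsort</code> because NumPy's default sort is not guaranteed to preserve the original order of rows that share a key.</p><pre><code>import numpy as np

# Same (key, value) pairs as the sample output printed earlier
data = [(1, 41), (1, 30), (2, 70), (3, 82), (2, 68),
        (0, 18), (3, 97), (1, 37), (3, 8), (0, 51)]
a = np.array(data)

# A stable sort keeps rows with equal keys in their original order
a = a[np.argsort(a[:, 0], kind='stable')]

# np.unique returns the sorted unique keys and the first index of each key
keys, first_idx = np.unique(a[:, 0], return_index=True)

# Split the value column at the start of every group except the first
groups = np.split(a[:, 1], first_idx[1:])

# Pair each key with its group of values
grouped = {int(k): g.tolist() for k, g in zip(keys, groups)}
print(grouped)  # {0: [18, 51], 1: [41, 30, 37], 2: [70, 68], 3: [82, 97, 8]}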
</code></pre><h4>7.3 Using itertools' groupby</h4><p><a href="https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function/43094244">Itertools</a> provides a <code>groupby</code> API, which is actually a sequential/local grouping method: it only groups consecutive items that share a key.</p><p>It's powerful for getting groups from "AAABBBAACCC" as "AAA BBB AA CCC". But to get the groups as "AAAAA BBB CCC", it&#8217;s necessary to first sort the data by the grouping key.</p><pre><code><strong>import</strong> itertools
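
# Illustration (not in the original post): on unsorted input, groupby only
# merges adjacent equal items, so the second run of "A" stays a separate group.
# The name `runs` is illustrative.
runs = ["".join(g) for _, g in itertools.groupby("AAABBBAACCC")]
# runs is now ['AAA', 'BBB', 'AA', 'CCC']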

items <strong>=</strong> sorted(data, key<strong>=lambda</strong> x:x[0])
grouper <strong>=</strong> itertools<strong>.</strong>groupby(items, key<strong>=lambda</strong> x:x[0])
groups <strong>=</strong> [list(g) <strong>for</strong> k, g <strong>in</strong> grouper]
groups
</code></pre><pre><code>[[(0, 18), (0, 51)],
 [(1, 41), (1, 30), (1, 37)],
 [(2, 70), (2, 68)],
 [(3, 82), (3, 97), (3, 8)]]</code></pre><h2>8. References</h2><ul><li><p><a href="http://www.scipy-lectures.org/packages/statistics/index.html#hypothesis-testing-comparing-two-groups">http://www.scipy-lectures.org/packages/statistics/index.html#hypothesis-testing-comparing-two-groups</a></p></li><li><p><a href="https://www.simple-talk.com/sql/t-sql-programming/sql-group-by-basics/">https://www.simple-talk.com/sql/t-sql-programming/sql-group-by-basics/</a></p></li><li><p><a href="http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/">http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/</a></p></li><li><p><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html">http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html</a></p></li></ul>]]></content:encoded></item></channel></rss>