Rolling windows / n-grams? Here are 10 ways to implement them in Python!
#python #ngrams #rolling #iteration
Whether you are working with text, time series, or sequences of any type, sooner or later you will find yourself building a rolling window function or an n-gram generator to iterate over your sequence.
Here are 10 ways you can implement rolling windows / n-grams in Python!
Let’s start with a toy example as usual -
# Rolling window size / n in ngram
n = 3
# Example sentence / sequence
doc = "The quick brown fox jumps over the lazy dog"
tokens = doc.split(' ')
print(tokens)
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
So, we have a humble yet overused sequence of tokens, and we need to iterate over this with a rolling window of size 3, otherwise known as tri-grams.
Let’s get rolling (pun intended)!
1. Classic for loop
All we need to do here is iterate over the range of the sequence's length and use basic list indexing as follows -
out = []
for i in range(len(tokens)):
out.append(tokens[i:i+n])
out
[['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog'],
['lazy', 'dog'],
['dog']]
You can choose to skip the sublists which are smaller than the window size with a simple if condition.
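For instance, a minimal sketch of that filter, reusing the same tokens:

```python
n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

out = []
for i in range(len(tokens)):
    window = tokens[i:i+n]
    if len(window) == n:  # keep only complete windows of size n
        out.append(window)
# out now holds only the 7 complete tri-grams
```

Equivalently, you could shorten the range itself with `range(len(tokens) - n + 1)` and drop the if altogether.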
2. List Comprehension
I know this might be a tiny bit of cheating, but some of us just love using list comprehensions everywhere, which are just another way of writing the above code.
[tokens[i:i+n] for i in range(len(tokens))]
[['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog'],
['lazy', 'dog'],
['dog']]
3. zip with *
This might be my go-to method as I find it quite pythonic. Here is how you can use zip along with the unpacking operator * -
list(zip(*[tokens[i:] for i in range(n)]))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
How does this work? The list comprehension creates the slices tokens, tokens[1:], and tokens[2:], and the unpacking operator * unpacks that list so zip receives the slices as separate arguments. zip then clubs together the first element of each slice (which are the 1st, 2nd, and 3rd elements of tokens), then the second elements, and so on, stopping when the shortest slice runs out. Here is a helpful diagram that I made to explain this intuitively.
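To see the intermediate slices concretely, here is the same trick spelled out step by step:

```python
n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

# The comprehension builds n progressively shifted slices of the sequence
slices = [tokens[i:] for i in range(n)]
print(slices[0][:3])  # ['The', 'quick', 'brown']
print(slices[1][:3])  # ['quick', 'brown', 'fox']

# zip(*slices) pairs them up column by column, stopping at the shortest slice
trigrams = list(zip(*slices))
```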
4. Pandas Rolling Objects
You might have used this for performing rolling averages or sums, but did you know that you can iterate over a pandas rolling object? Here is how you can do this -
import pandas as pd
list(map(list,pd.Series(tokens).rolling(n)))
[['The'],
['The', 'quick'],
['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog']]
All we have to do is map the list constructor over the rolling object to typecast each window. Or you could use a list comprehension as well! However, if you want a “forward rolling window” style output, it just takes an extra step of defining the window indexer.
import pandas as pd
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=n)
list(map(list,pd.Series(tokens).rolling(indexer)))
[['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog'],
['lazy', 'dog'],
['dog']]
5. Numpy Stride Tricks
This one is a slightly more advanced method, as it involves accessing a numpy array's contiguous memory storage directly to create a “view”, but it's super powerful and used pretty much universally behind the scenes by numpy's high-level functions.
import numpy as np
arr = np.array(tokens)
shape = (arr.shape[0] - n + 1, n) # (7, 3)
strides = (arr.strides[0], arr.strides[0]) # (20, 20) bytes
np.lib.stride_tricks.as_strided(arr, shape=shape, strides=strides)
array([['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog']], dtype='<U5')
The “shape” is the expected output shape for the view of this numpy array, and the “strides” are the number of bytes numpy has to move along each axis to reach the next element. Use this method at your own risk, as it can cause memory corruption if not used properly!
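If you are on NumPy 1.20 or newer, the sliding_window_view wrapper builds the same read-only view without hand-computing shapes and strides, which sidesteps most of the risk:

```python
import numpy as np

n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()
arr = np.array(tokens)

# sliding_window_view computes the shape and strides for you (NumPy >= 1.20)
view = np.lib.stride_tricks.sliding_window_view(arr, n)
print(view.shape)  # (7, 3)
```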
6. NLTK ngrams
You can’t talk about n-grams without talking about NLTK. Most of us learned our first n-gram implementations with NLTK, and its ngrams function works with any standard iterable!
from nltk import ngrams
list(ngrams(tokens, n))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
7. more_itertools library
This is a purpose-built library that extends the classic itertools library with some interesting functions. One of these is the more_itertools.windowed function.
#pip install more_itertools
import more_itertools
list(more_itertools.windowed(tokens, n))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
8. toolz library
Another library built as an extension to itertools, which comes with the handy sliding_window function.
#pip install toolz
import toolz
list(toolz.sliding_window(n, tokens))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
9. Itertools islice
Implementing this with itertools is slightly more complex, but very handy to learn, as it exposes you to some underrated yet powerful itertools functions such as islice and tee.
from itertools import islice, tee
list(zip(*(islice(s, i, None) for i, s in enumerate(tee(tokens, n)))))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
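Unpacking that one-liner into named steps makes the logic easier to follow (same technique, just spelled out):

```python
from itertools import islice, tee

n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

# tee gives us n independent copies of the same iterator
iterators = tee(tokens, n)

# advance the i-th copy by i positions, like the staggered slices in method 3
shifted = (islice(it, i, None) for i, it in enumerate(iterators))

# zip clubs the staggered iterators together into windows
trigrams = list(zip(*shifted))
```

Unlike the slicing approach, this works on any iterable, not just lists, since tee and islice never need random access.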
10. Scikit Learn CountVectorizer
Lastly, this is kind of a cheat and mostly applicable to text documents/sentences, but it's something that data scientists use quite regularly as part of their modeling pipelines. The trick is to define your CountVectorizer with ngram_range set to (n, n), thus fetching only the n-grams and not the uni-grams, bi-grams, etc.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(n,n))
analyzer = cv.build_analyzer()
analyzer(doc)
['the quick brown',
'quick brown fox',
'brown fox jumps',
'fox jumps over',
'jumps over the',
'over the lazy',
'the lazy dog']
Note that each n-gram comes back as a single lowercased string rather than a list of tokens (CountVectorizer lowercases by default), and that the input to the analyzer is the original document rather than the token list.
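If you need tuples of tokens to match the earlier methods, splitting each string back up is a one-liner. A small sketch, using a couple of the strings shown above in place of the live analyzer output:

```python
# strings as returned by the CountVectorizer analyzer (lowercased, space-joined)
grams = ['the quick brown', 'quick brown fox']

as_tuples = [tuple(g.split()) for g in grams]
```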
So that’s 10 ways of quickly implementing rolling window iteration or n-grams over a sequence in Python! Hope this has been useful for you, the reader!