Rolling windows / n-grams? Here are 10 ways to implement them in Python!
#python #ngrams #rolling #iteration
Whether you are working with text, time series, or sequences of any type, sooner or later you will find yourself building a rolling window function or an n-gram generator to iterate over your sequence.
Here are 10 ways you can implement rolling windows / n-grams in Python!
Let’s start with a toy example as usual -
# Rolling window size / n in ngram
n = 3
# Example sentence / sequence
doc = "The quick brown fox jumps over the lazy dog"
tokens = doc.split(' ')
print(tokens)
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
So, we have a humble yet overused sequence of tokens, and we need to iterate over this with a rolling window of size 3, otherwise known as tri-grams.
Let’s get rolling (pun intended)!
1. Classic for loop
All we need to do here is iterate over the range of the sequence's length and use basic list indexing as follows -
out = []
for i in range(len(tokens)):
out.append(tokens[i:i+n])
out
[['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog'],
['lazy', 'dog'],
['dog']]
You can choose to skip the sublists which are smaller than the window size with a simple if condition.
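For instance, a minimal sketch of that filter, reusing the same tokens:

```python
n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

out = []
for i in range(len(tokens)):
    window = tokens[i:i+n]
    if len(window) == n:  # keep only complete windows of size n
        out.append(window)
# out now holds only the 7 complete tri-grams
```

Equivalently, you could shorten the range itself with `range(len(tokens) - n + 1)` and drop the if altogether.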
2. List Comprehension
I know this might be a tiny bit of cheating, but some of us just love using list comprehensions everywhere, which are just another way of writing the above code.
[tokens[i:i+n] for i in range(len(tokens))]
[['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog'],
['lazy', 'dog'],
['dog']]
3. zip with *
This might be my go-to method as I find it quite pythonic. Here is how you can use zip along with the unpacking operator * -
list(zip(*[tokens[i:] for i in range(n)]))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
How does this work? The list comprehension creates the slices tokens, tokens[1:], and tokens[2:], and the unpacking operator * unpacks that list so zip receives the slices as separate arguments. zip then clubs together the first element of each slice (which are the 1st, 2nd, and 3rd elements of tokens), then the second elements, and so on, stopping when the shortest slice runs out. Here is a helpful diagram that I made to explain this intuitively.
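To see the intermediate slices concretely, here is the same trick spelled out step by step:

```python
n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

# The comprehension builds n progressively shifted slices of the sequence
slices = [tokens[i:] for i in range(n)]
print(slices[0][:3])  # ['The', 'quick', 'brown']
print(slices[1][:3])  # ['quick', 'brown', 'fox']

# zip(*slices) pairs them up column by column, stopping at the shortest slice
trigrams = list(zip(*slices))
```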
4. Pandas Rolling Objects
You might have used this for performing rolling averages or sums, but did you know that you can iterate over a pandas rolling object? Here is how you can do this -
import pandas as pd
list(map(list,pd.Series(tokens).rolling(n)))
[['The'],
['The', 'quick'],
['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog']]
All we have to do is map the list constructor over the rolling object to typecast each window. Or you could use a list comprehension as well! However, if you want a “forward rolling window” style output, it just takes an extra step of defining the window indexer.
import pandas as pd
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=n)
list(map(list,pd.Series(tokens).rolling(indexer)))
[['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog'],
['lazy', 'dog'],
['dog']]
5. Numpy Stride Tricks
This one is a slightly more advanced method, as it involves accessing a numpy array's contiguous memory storage directly to create a “view”, but it's super powerful and used pretty much universally behind the scenes by numpy's high-level functions.
import numpy as np
arr = np.array(tokens)
shape = (arr.shape[0] - n + 1, n) # (7, 3)
strides = (arr.strides[0], arr.strides[0]) # (20, 20) bytes
np.lib.stride_tricks.as_strided(arr, shape=shape, strides=strides)
array([['The', 'quick', 'brown'],
['quick', 'brown', 'fox'],
['brown', 'fox', 'jumps'],
['fox', 'jumps', 'over'],
['jumps', 'over', 'the'],
['over', 'the', 'lazy'],
['the', 'lazy', 'dog']], dtype='<U5')
The “shape” is the expected output shape for the view of this numpy array, and the “strides” are the number of bytes numpy has to move along each axis to reach the next element. Use this method at your own risk, as it can cause memory corruption if not used properly!
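If you are on NumPy 1.20 or newer, the sliding_window_view wrapper builds the same read-only view without hand-computing shapes and strides, which sidesteps most of the risk:

```python
import numpy as np

n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()
arr = np.array(tokens)

# sliding_window_view computes the shape and strides for you (NumPy >= 1.20)
view = np.lib.stride_tricks.sliding_window_view(arr, n)
print(view.shape)  # (7, 3)
```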
6. NLTK ngrams
You can’t talk about n-grams without talking about NLTK. Most of us learned our first n-gram implementations with NLTK, and its ngrams function works with any standard iterable!
from nltk import ngrams
list(ngrams(tokens, n))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
7. more_itertools library
This is a purpose-built library that extends the classic itertools library with some interesting functions. One of these is the more_itertools.windowed function.
#pip install more_itertools
import more_itertools
list(more_itertools.windowed(tokens, n))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
8. toolz library
Another library built as an extension to itertools, which comes with the handy sliding_window function.
#pip install toolz
import toolz
list(toolz.sliding_window(n, tokens))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
9. Itertools islice
Implementing this with itertools is slightly more complex, but very handy to learn, as it exposes you to some underrated yet powerful itertools functions such as islice and tee.
from itertools import islice, tee
list(zip(*(islice(s, i, None) for i, s in enumerate(tee(tokens, n)))))
[('The', 'quick', 'brown'),
('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'),
('fox', 'jumps', 'over'),
('jumps', 'over', 'the'),
('over', 'the', 'lazy'),
('the', 'lazy', 'dog')]
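Unpacking that one-liner into named steps makes the logic easier to follow (same technique, just spelled out):

```python
from itertools import islice, tee

n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

# tee gives us n independent copies of the same iterator
iterators = tee(tokens, n)

# advance the i-th copy by i positions, like the staggered slices in method 3
shifted = (islice(it, i, None) for i, it in enumerate(iterators))

# zip clubs the staggered iterators together into windows
trigrams = list(zip(*shifted))
```

Unlike the slicing approach, this works on any iterable, not just lists, since tee and islice never need random access.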
10. Scikit Learn CountVectorizer
Lastly, this is kind of a cheat and mostly applicable to text documents/sentences, but it's something that data scientists use quite regularly as part of their modeling pipelines. The trick is to define your CountVectorizer with ngram_range set to (n, n), thus fetching only the n-grams and not the uni-grams, bi-grams, etc.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(n,n))
analyzer = cv.build_analyzer()
analyzer(doc)
['the quick brown',
'quick brown fox',
'brown fox jumps',
'fox jumps over',
'jumps over the',
'over the lazy',
'the lazy dog']
Note that each n-gram comes back as a single lowercased string rather than a list of tokens (CountVectorizer lowercases by default), and that the input to the analyzer is the original document rather than the token list.
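If you need tuples of tokens to match the earlier methods, splitting each string back up is a one-liner. A small sketch, using a couple of the strings shown above in place of the live analyzer output:

```python
# strings as returned by the CountVectorizer analyzer (lowercased, space-joined)
grams = ['the quick brown', 'quick brown fox']

as_tuples = [tuple(g.split()) for g in grams]
```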
So that’s 10 ways of quickly implementing rolling window iteration or n-grams over a sequence in Python! Hope this has been useful for you, the reader!