# Rolling windows / n-grams? Here are 10 ways to implement them in Python!

### #python #ngrams #rolling #iteration

Whether you are working with text, time series, or sequences of any type, sooner or later you will find yourself building a rolling window function or an n-gram generator to iterate over your sequence.

#### Here are 10 ways you can implement rolling windows/n-grams in Python!

Let’s start with a toy example as usual -

```
# Rolling window size / n in ngram
n = 3
# Example sentence / sequence
doc = "The quick brown fox jumps over the lazy dog"
tokens = doc.split(' ')
print(tokens)
```

`['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']`

So, we have a humble yet overused sequence of tokens, and we need to iterate over this with a rolling window of size 3, otherwise known as tri-grams.

Let’s get rolling (pun intended)!

### 1. Classic `for` loop

All we need to do here is iterate over the range of the length of the sequence and use basic `list` indexing as follows -

```
out = []
for i in range(len(tokens)):
    out.append(tokens[i:i+n])
out
```

```
[['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog'],
 ['lazy', 'dog'],
 ['dog']]
```

You can choose to skip the sublists which are smaller than the window size with a simple `if` condition.
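For example, here is the same loop with that `if` condition added, so only full-size windows survive (a minimal sketch):

```
n = 3
tokens = "The quick brown fox jumps over the lazy dog".split()

out = []
for i in range(len(tokens)):
    window = tokens[i:i + n]
    # keep only windows that reached the full size n
    if len(window) == n:
        out.append(window)

print(out[0])    # ['The', 'quick', 'brown']
print(len(out))  # 7
```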

### 2. List Comprehension

I know this might be a tiny bit of cheating, but some of us just love using list comprehensions everywhere, which are just another way of writing the above code.

`[tokens[i:i+n] for i in range(len(tokens))]`

```
[['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog'],
 ['lazy', 'dog'],
 ['dog']]
```

### 3. `zip` with `*`

This might be my go-to method as I find it quite pythonic, but here is how you can use `zip` along with the unpacking operator `*` -

`list(zip(*[tokens[i:] for i in range(n)]))`

```
[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]
```

**How does this work?** The list comprehension creates `L, L[1:], L[2:]`, which then get zipped, meaning the first elements of each of these (which are the 1st, 2nd, and 3rd elements of `L`) get clubbed together, the second elements get clubbed together, and so on. The unpacking operator just unpacks the list containing `L, L[1:], L[2:]` for `zip` to work with. Here is a helpful diagram that I made to explain this intuitively.
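The shifted copies that `zip` consumes can also be inspected directly, which makes the mechanism easy to see (a quick illustration):

```
tokens = "The quick brown fox jumps over the lazy dog".split()
n = 3

# n progressively shifted views of the token list
shifted = [tokens[i:] for i in range(n)]
print(shifted[0][:3])  # ['The', 'quick', 'brown']
print(shifted[1][:3])  # ['quick', 'brown', 'fox']
print(shifted[2][:3])  # ['brown', 'fox', 'jumps']

# zip stops at the shortest shifted copy, so only full windows are produced
trigrams = list(zip(*shifted))
print(len(trigrams))   # 7
```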

### 4. Pandas Rolling Objects

You might have used this for performing rolling averages or sums, but did you know that you can iterate over a pandas rolling object? Here is how you can do this -

```
import pandas as pd
list(map(list, pd.Series(tokens).rolling(n)))
```

```
[['The'],
 ['The', 'quick'],
 ['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog']]
```

All we have to do is `map` the `list` constructor over the iterable rolling object to typecast each window. Or you could use a list comprehension as well! However, if you want to get a “forward rolling window” style output, it just takes an extra step of defining the window indexer.

```
import pandas as pd
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=n)
list(map(list, pd.Series(tokens).rolling(indexer)))
```

```
[['The', 'quick', 'brown'],
 ['quick', 'brown', 'fox'],
 ['brown', 'fox', 'jumps'],
 ['fox', 'jumps', 'over'],
 ['jumps', 'over', 'the'],
 ['over', 'the', 'lazy'],
 ['the', 'lazy', 'dog'],
 ['lazy', 'dog'],
 ['dog']]
```
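If you want the plain (backward) rolling object but only the full-size windows, one option is to filter by length (a small sketch; iterating over rolling objects requires a reasonably recent pandas):

```
import pandas as pd

tokens = "The quick brown fox jumps over the lazy dog".split()
n = 3

# each window is a Series; keep only those that reached full size
full = [list(w) for w in pd.Series(tokens).rolling(n) if len(w) == n]
print(full[0])    # ['The', 'quick', 'brown']
print(len(full))  # 7
```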

### 5. Numpy Stride Tricks

This one is a slightly more advanced method as it involves accessing a `numpy` array’s contiguous memory storage directly to create a “view”, but it’s super powerful and pretty much used universally behind the scenes for a lot of `numpy` high-level functions.

```
import numpy as np
arr = np.array(tokens)
shape = (arr.shape[0] - n + 1, n) # (7, 3)
strides = (arr.strides[0], arr.strides[0]) # (20, 20) bytes
np.lib.stride_tricks.as_strided(arr, shape=shape, strides=strides)
```

```
array([['The', 'quick', 'brown'],
       ['quick', 'brown', 'fox'],
       ['brown', 'fox', 'jumps'],
       ['fox', 'jumps', 'over'],
       ['jumps', 'over', 'the'],
       ['over', 'the', 'lazy'],
       ['the', 'lazy', 'dog']], dtype='<U5')
```

The “shape” is the expected output shape for the view of this `numpy` array, and the “strides” are the numbers of bytes `numpy` has to move along each axis to reach the next element. Use this method at your own risk, as it can cause memory corruption if not used properly!
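If you'd rather not compute strides by hand, newer NumPy versions (1.20+) ship `sliding_window_view`, which builds the same view safely:

```
import numpy as np

tokens = "The quick brown fox jumps over the lazy dog".split()
n = 3
arr = np.array(tokens)

# computes the shape and strides internally, with bounds checking
windows = np.lib.stride_tricks.sliding_window_view(arr, n)
print(windows.shape)     # (7, 3)
print(list(windows[0]))  # ['The', 'quick', 'brown']
```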

### 6. NLTK ngrams

You can’t talk about n-grams without talking about NLTK. Most of us have learned our first implementations of n-grams using NLTK, and it can be used for any standard iterator!

```
from nltk import ngrams
list(ngrams(tokens, n))
```

```
[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]
```

### 7. `more_itertools` library

This is a purpose-built library that extends the classic `itertools` library with some interesting functions. One of these is the `more_itertools.windowed` method.

```
#pip install more_itertools
import more_itertools
list(more_itertools.windowed(tokens, n))
```

```
[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]
```

### 8. `toolz` library

Another example of a library built as an extension to `itertools`, which comes with the handy `sliding_window` method.

```
#pip install toolz
import toolz
list(toolz.sliding_window(n, tokens))
```

```
[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]
```

### 9. Itertools `islice`

Implementing this in `itertools` is a bit more complex, but it's very handy to learn as it exposes you to some underrated yet powerful `itertools` methods such as `islice` and `tee`.

```
from itertools import islice, tee
list(zip(*(islice(s, i, None) for i, s in enumerate(tee(tokens, n)))))
```

```
[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumps'),
 ('fox', 'jumps', 'over'),
 ('jumps', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dog')]
```
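For comparison, the sliding-window recipe from the `itertools` documentation uses a `deque` instead, which also works for one-shot iterators that can't be cheaply `tee`d (a sketch of that recipe):

```
from collections import deque
from itertools import islice

def sliding_window(iterable, n):
    """Yield tuples of n consecutive items (itertools recipes style)."""
    it = iter(iterable)
    # pre-fill the deque with the first n-1 items
    window = deque(islice(it, n - 1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

tokens = "The quick brown fox jumps over the lazy dog".split()
trigrams = list(sliding_window(tokens, 3))
print(trigrams[0])    # ('The', 'quick', 'brown')
print(len(trigrams))  # 7
```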

### 10. Scikit Learn CountVectorizer

Lastly, this is kind of a cheat and mostly applicable to text documents/sentences, but it’s something that data scientists use quite regularly as part of their modeling pipelines. The trick is to define your `CountVectorizer` with the `ngram_range` as `(n, n)`, thus only fetching n-grams and not the uni-grams, bi-grams, etc.

```
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(n,n))
analyzer = cv.build_analyzer()
analyzer(doc)
```

```
['the quick brown',
 'quick brown fox',
 'brown fox jumps',
 'fox jumps over',
 'jumps over the',
 'over the lazy',
 'the lazy dog']
```

The output returned is a list of strings rather than lists of tokens, and notice that the input to the analyzer is the original document rather than the token list. Also note that the analyzer lowercases text by default, which is why `The` became `the`.
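If you need token tuples rather than joined strings, splitting each n-gram back apart is straightforward (a small sketch):

```
from sklearn.feature_extraction.text import CountVectorizer

doc = "The quick brown fox jumps over the lazy dog"
n = 3

cv = CountVectorizer(ngram_range=(n, n))
analyzer = cv.build_analyzer()

# split each space-joined n-gram string back into a tuple of tokens
grams = [tuple(g.split()) for g in analyzer(doc)]
print(grams[0])  # ('the', 'quick', 'brown')
```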

**So that’s 10 ways of quickly implementing rolling window iteration or n-grams over a given sequence in Python! Hope this has been useful for you, the reader!**