2008-07-20

Groupsofn, itertools grouper Recipe, and Padding

This post is a moron's attempt to understand something that most others take for granted, a recipe given in the itertools documentation. This is not a solution to any real problem.

Groups of n

Groups of n is a function that takes an iterable and returns n consecutive elements from that iterable.

I needed something like this, while I was trying to convert a rather silly ID-generation function written in Perl to Python. It turned out to be a simple hexadecimal encoding/decoding function.

Under normal circumstances, this is how you would implement that in Python:

import codecs

def hexenc(input):
    return codecs.getencoder("hex")(input)[0]

def hexdec(input):
    return codecs.getdecoder("hex")(input)[0]

In any case, I was halfway through, and so I decided to continue, and ended up with the following implementation:

def hexenc(input):
    return "".join(
        hex(ord(char))[2:].rjust(2, "0")
            for char in input
    )

def hexdec(input):
    return "".join(
        chr(int("".join(hx), 16))
            for hx in groupsofn(2, input)
    )

Notice that the hexdec function assumes that a function called “groupsofn” exists and that it splits the input into n consecutive items.

Plain Generator Implementation

This is a plain generator implementation of it, and incredibly dumb:

def groupsofn(n, iterable, pad=False, padwith=None):
    from itertools import repeat
    group = []
    for i, item in enumerate(iterable):
        group.append(item)
        if (i + 1) % n == 0:
            yield group
            group = []
    
    if pad and 0 < len(group) < n:
        group.extend(repeat(padwith, n - len(group)))
    
    if group:
        yield group

The itertools Recipe

The itertools documentation provides the following recipe:

# from itertools recipe
def grouper(n, iterable, padvalue=None):
    """grouper(3, 'abcdefg', 'x') --> ('a','b','c'),
                              ('d','e','f'), ('g','x','x')
    """
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

How does this work? This left me puzzled for some time. (I am sure expert programmers would not find anything to be puzzled here, but I am just a plain, dumb programmer.)

How the itertools Recipe Works?

Let us assume for the time being that we are not concerned with padding. Now, the implementation simplifies to the following:

def grouper(n, iterable):
    return izip(*[iterable]*n)

The [iterable]*n bit makes n references to the same iterator. Then when izip picks one from each iterator, we simply get consecutive elements. It is like taking apples from the same basket, but we think we are taking it from three different baskets.

Consider the example of invoking grouper with 3 (I am expanding the variables in each step):

def grouper(3, iterable):
    return izip(*([iterable] * 3))

# multiplying the list creates a list with 3 copies
def grouper(3, iterable):
    return izip(*[iterable, iterable, iterable])

# Unpacking makes the list into 3 params to izip
def grouper(3, iterable):
    return izip(iterable, iterable, iterable)

# Now, let us look just at izip (conceptually, it 
# yields, one from each iterator packed into a tuple)
def izip(iter1, iter2, iter3):
    # ... imaginary izip implementation
    yield (iter1.next(), iter2.next(), iter3.next())

# Note that iter1 is iter2 is iter3 == True.
# so, it is essentially this:
def izip(iterable, iterable, iterable)
    # ... imaginary izip implementation
    yield (iterable.next(), iterable.next(), iterable.next())
Adding the Padding

The chain(iterable, repeat(padvalue, n-1) bit ensures that there will be enough items in the iterable for izip to go through, because izip bails out when the shortest iterator ends.

Let us consider the case, where there is one odd (not the mathematical sense) item at the end (that is, len(iterable) % n == 1). If n - 1, padding is always added to the end of the iterable, then, izip will not exit and leave out that one last item.

Consider the other extreme, when, there are no odd ones out at the end, then the padded n-1 values, will not give enough elements for the iterator, to consume and that extra padding will be left behind.

In other cases, then having the additional padding of n-1, will ensure that that 1 element is returned, and for anything larger, the excess padding will be skipped.

X: padding value
 *: values in iterable
 n == 3

 len(iterable) == 7
 * * *
 * * *
 * X X

 len(iterable) == 6
 * * *
 * * *
 X X

 len(iterable) == 8
 * * *
 * * *
 * * X
 X

A consequence of this is that it does not provide a way to leave out padding altogether.

A grouper with Optional Padding

To add optional padding, I could use the simple generator based code listed earlier. However, after you see the itertools recipe, you just don't feel like manually accumulating anything.

The groupby Solution

The itertools.groupby looked promising, and after a few trials, came up with the following:

def grouper(n, iterable):
    for k, g in groupby(enumerate(iterable), lambda x: x[0] // n):
        yield (e for i, e in g)

I chose to follow the itertools.groupby's behavior of keeping the groups as generators.

A similar implementation was pointed out by David Isaac in the Python mailing list.

Adding Padding to the groupby Solution

To add padding, we could do something dumb like this:

def pad(length, iterable, padding=None):
    return islice(chain(iterable, repeat(padding)), length)

def grouper(n, iterable, pad=False, padding=None):
    if pad:
        for k, g in groupby(enumerate(iterable), 
                                  lambda x: x[0] // n):
            yield pad(n, (e for i, e in g), padding)
    else:
        for k, g in groupby(enumerate(iterable), 
                                  lambda x: x[0] // n):
            yield (e for i, e in g)

The padding routine is an adaptation of the padnone recipe from the itertools documentation.

However, This implementation has a big problem. We pointlessly pad every group, when, we might only need to do it for one.

The Hybrid

We already have a solution that works well for the case where padding is required. So, the obvious solution is to use a hybrid.

Here is the full version of the hybrid code:

def grouper(n, iterable, pad=False, padding=None):
    if pad:
        return izip(*[chain(iterable, repeat(padding, n-1))]*n)
    else:
        return ((e for i, e in group) for key, group in 
                    groupby(enumerate(iterable), lambda x: x[0] // n)
        )

Resources