I was reading through the standard library and came across the groupby function in the itertools module, but didn't grasp how to use it from the documentation:
groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    # Store group iterator as a list
    groups.append(list(g))
    uniquekeys.append(k)
After Googling I came across a few examples that clarified it for me.
The data argument must be an iterable of objects, each of which can be identified by a key. In the following examples, the beers list is that iterable. Note that in all of these examples the list is ordered so that items with the same key sit next to each other.
For example, a tuple in the form (group, item):
beers = [
    ('IPA', 'Sierra Nevada'),
    ('IPA', 'Goose Island'),
    ('Porter', 'Deschutes Black Butte'),
    ('Porter', 'Stone Smoked Porter'),
    ('Pilsener', 'Sierra Nevada Pilsener'),
    ('Pilsener', 'Pilsener Urquell')
]
# The key for the group is the 0th item in the tuple
keyfunc = lambda beer: beer[0]
Or an object:
class Beer(object):
    def __init__(self, category, brand):
        self.category = category
        self.brand = brand

beers = [
    Beer('IPA', 'Sierra Nevada'),
    Beer('IPA', 'Goose Island'),
    Beer('Porter', 'Deschutes Black Butte'),
    Beer('Porter', 'Stone Smoked Porter'),
    Beer('Pilsener', 'Sierra Nevada Pilsener'),
    Beer('Pilsener', 'Pilsener Urquell')
]
keyfunc = lambda beer: beer.category
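Before building anything with it, it may help to see what groupby actually yields. The following is a minimal sketch using a shortened version of the tuple form of beers from above. Each iteration produces a (key, group) pair, which the for loop unpacks into two names, and group is itself an iterator that is consumed here with list(). The name pairs is only for illustration.

```python
from itertools import groupby

beers = [
    ('IPA', 'Sierra Nevada'),
    ('IPA', 'Goose Island'),
    ('Porter', 'Deschutes Black Butte'),
    ('Porter', 'Stone Smoked Porter'),
]

# groupby yields 2-tuples of (key, group iterator);
# "for key, group in ..." simply unpacks each 2-tuple.
pairs = [(key, list(group)) for key, group in groupby(beers, lambda beer: beer[0])]

for key, items in pairs:
    # items is the list of original tuples that share this key
    print(key, [brand for _, brand in items])
```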
Now let's use the itertools.groupby function to populate a dictionary whose key is the style category and whose value is a list of the brand names. I'm going to use namedtuple for my container of items.
from collections import namedtuple
from itertools import groupby
# A namedtuple is a tuple that permits attribute access.
# In this case, beer.category maps to beer[0], and
# beer.brand maps to beer[1]
Beer = namedtuple('Beer', ['category', 'brand'])
# Note that the beers array is sorted
beers = [
    Beer('IPA', 'Sierra Nevada'),
    Beer('IPA', 'Goose Island'),
    Beer('Porter', 'Deschutes Black Butte'),
    Beer('Porter', 'Stone Smoked Porter'),
    Beer('Pilsener', 'Sierra Nevada Pilsener'),
    Beer('Pilsener', 'Pilsener Urquell')
]

beer_map = {}
for key, group in groupby(beers, lambda beer: beer.category):
    beer_map[key] = [beer.brand for beer in group]
# Or, preferably using dict comprehensions:
beer_map = {
    key: [beer.brand for beer in group]
    for key, group in groupby(beers, lambda beer: beer.category)
}
print(beer_map)
#{
# 'IPA': ['Sierra Nevada', 'Goose Island'],
# 'Pilsener': ['Sierra Nevada Pilsener', 'Pilsener Urquell'],
# 'Porter': ['Deschutes Black Butte', 'Stone Smoked Porter']
#}
Note that it is important that the container be sorted by the key. Strictly speaking, items with the same key only need to be next to each other in the list: in the example above, Pilsener comes after Porter with no ill effects.
Check the output of this one, where the beers array doesn't have consecutive keys:
from collections import namedtuple
from itertools import groupby
Beer = namedtuple('Beer', ['category', 'brand'])
beers = [
    Beer('IPA', 'Sierra Nevada'),
    Beer('Porter', 'Deschutes Black Butte'),
    Beer('Pilsener', 'Sierra Nevada Pilsener'),
    Beer('IPA', 'Goose Island'),
    Beer('Porter', 'Stone Smoked Porter'),
    Beer('Pilsener', 'Pilsener Urquell')
]

beer_map = {
    key: [beer.brand for beer in group]
    for key, group in groupby(beers, lambda beer: beer.category)
}
beer_map
#{
# 'IPA': ['Goose Island'],
# 'Pilsener': ['Pilsener Urquell'],
# 'Porter': ['Stone Smoked Porter']
#}
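Each category shows up as three separate one-item groups here, and the later group overwrites the earlier one in the dictionary. If the input cannot be guaranteed to have equal keys adjacent, one common remedy (the same one the documentation snippet at the top uses) is to sort by the same key function before grouping. A minimal sketch, reusing the unsorted beers list from above; the names by_category and sorted_map are only illustrative:

```python
from collections import namedtuple
from itertools import groupby

Beer = namedtuple('Beer', ['category', 'brand'])

# Same unsorted input as above: equal categories are not adjacent
beers = [
    Beer('IPA', 'Sierra Nevada'),
    Beer('Porter', 'Deschutes Black Butte'),
    Beer('Pilsener', 'Sierra Nevada Pilsener'),
    Beer('IPA', 'Goose Island'),
    Beer('Porter', 'Stone Smoked Porter'),
    Beer('Pilsener', 'Pilsener Urquell'),
]

by_category = lambda beer: beer.category

# Sort with the same key function first so that equal keys become adjacent;
# sorted() is stable, so brands keep their original relative order.
sorted_map = {
    key: [beer.brand for beer in group]
    for key, group in groupby(sorted(beers, key=by_category), by_category)
}
print(sorted_map)
```

Sorting needs the whole input in memory, so this mostly makes sense when the data is already a list.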
One thing to note about all these examples is that they are contrived and would not really be useful.
Instead of using itertools.groupby, which requires the container's keys to be adjacent to one another, can't we group our objects in fewer characters of code and without requiring that keys be sorted?
from collections import defaultdict
beer_map = defaultdict(list)
for beer in beers:
    beer_map[beer.category].append(beer.brand)
I asked this question on Stack Overflow and received a good response:
Generally the point of using iterators is to avoid keeping an entire data set in memory. In your example, it doesn't matter because (1) the input is already all in memory, and (2) you're just dumping everything into a dict, so the output is also all in memory.
So the only reason to use itertools.groupby in the first place is when you are working with an iterable that is not already fully in memory, e.g., pulling rows out of a DB cursor and writing them to a remote queue. Perhaps that is obvious for a function in the itertools module.
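To make that concrete, here is a rough sketch of the streaming case. The fetch_rows generator is a hypothetical stand-in for a DB cursor that returns rows already ordered by category (for example via ORDER BY), and send_batch is a hypothetical stand-in for writing to a remote queue; neither is a real API.

```python
from itertools import groupby

def fetch_rows():
    """Hypothetical stand-in for a DB cursor: yields rows lazily, ordered by category."""
    yield ('IPA', 'Sierra Nevada')
    yield ('IPA', 'Goose Island')
    yield ('Porter', 'Deschutes Black Butte')
    yield ('Porter', 'Stone Smoked Porter')

sent = []  # records what was "sent", for illustration only

def send_batch(category, brands):
    """Hypothetical stand-in for pushing one batch to a remote queue."""
    sent.append((category, brands))

# groupby pulls rows from the generator on demand, so only the
# current group's rows are held in memory at any one time.
for category, rows in groupby(fetch_rows(), lambda row: row[0]):
    send_batch(category, [brand for _, brand in rows])

print(sent)
```

Here the rows are never collected into one big list or dict; each group is handed off as soon as its last row has been read.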
Name: Sergio
Creation Date: 2018-07-02
Thanks for the post, it has really helped me to understand the idea!
Name: Varad
Creation Date: 2018-08-25
Thank you its easy and quick to refresh
Name: RaulS
Creation Date: 2018-10-09
I still do not understand how the for can have two elements on it. for key, groups in groupby(data, keyfunc) At first I thought it was because of the second argument, then I ran it without a keyfunction and it kept working. Later I thought it was because somehow groupby worked with a matrix. I tried to use a for with two element arguments on a matrix but, to no avail. I've been struggling with the documentation and even with your article for around three days now. I'm growing desperate, haha. Please, send help.