10Gen Mongo-db for Developers Week 5 - Aggregation Notes

The aggregate() function retrieves aggregate information.

Syntax information:

These aggregation operations are all accessible by way of the aggregate() method. While all examples in this document use this method, aggregate() is merely a wrapper around the database command aggregate. The following prototype aggregation operations are equivalent:
db.people.aggregate( <pipeline> )  
> db.people.aggregate( [<pipeline>] )  
> db.runCommand( { aggregate: "people", pipeline: [<pipeline>] } )
These operations perform aggregation routines on the collection named people. <pipeline> is a placeholder for the aggregation pipeline definition. aggregate() accepts the stages of the pipeline (i.e. <pipeline>) as an array, or as arguments to the method.

The Aggregation Pipeline in Mongo-db

Mongo uses a pipe system similar to the unix world. Collections are piped through the pipeline document by document to get a result set. There is a $project stage, a $match stage, a $group stage, a $sort stage, a $skip stage, a $limit stage, and an $unwind stage. Each of the items in the array passed to the aggregate() function is a stage.

$project selects which keys you are interested and allows you to "reshape" the document. Project produces documents on a 1:1 basis. $project is used to include fields from the original document, insert computed fields, rename documents or create and populate fields that hold sub-documents.

$match filters like a where clause. $match produces documents on a n:1 basis. (n<1). $match pipes the documents that match its conditions to the next operator in the pipeline. It is a good idea to place the $match as early as possible in the agregation pipeline, as the $match minimizes the amount of processesing down the piipe because it limits the total number of documents; $match at the very beginning of the pipeline qill take advantage of indexes.

$group is the aggregation. $group produces documents on a n:1 basis. (n<1). The output of $group depends on how you define groups. Begin by specifying an identifier(an _id field) for the group you're creating with this pipeline. You can specify a single field from the documents in the pipeline, a previously computed value, or an aggregate key made up from several incoming fields. $group cannot output nested docuemnts. Every $group expression must specify an _id field. You may specify the _id field as a dotted field path reference, a documet with mutliple fields enclosed in braces, or a constant value. You can use $project to rename the grouped field(s) if necessary.

$sort is 1:1. $sort can take advantage of an index when placed before $project, $unwind, or $group, or when plcaed at the beginning of the pipeline.

$skip is n:1 (n<1)

$limit is n:1 (n<1)

$unwind unjoins prejoined (embedded) data and essentially normalizes a embedded document by exploding it into multiple elements to make it easier to operate on. $unwind peels off the elements of an array individually ,returning a stream of documents. $unwind returns one document for every member of the unwound array within every source document; each returned document is identical except for the value specified by $unwind. The dollar sign ($) must proceed the field specification handed to the $unwind operator. $unwind is most useful in combination with $group. You may undo the effects of unwind operations with the $group pipeline operator. If you specify a target field for $unwind that does not ecist in an input document, the pipeline will generate no result documents. If you specify a target field for $unwind that is not an array, the aggregate() command generates an error. If you specify a target field for $unwind that holds an empty array ([]) in an input document, the pipeline ignores the input document and will generate no result documents.

Stages can happen in any order and can happen multiple times.

select manufacturer, count(*) from product group by manufacturer;

.aggregate([ {$group : { _id : "$manufacturer", num_products: {$sum : 1}}} ]) ;

Compound Grouping in Mongo-db

select manufacturer, category, count(*) from products group by manufacturer, category;

.aggregate([ {$group: { _id : { "manufacturer" : $manufacturer", "category" : $category" } , num_products : { $sum : 1 } } } ] )

Note the compound _id key.

Using a document for _id in Mongo

db.foo.insert( { _id : { name : "matthew", class : "m101" } , hometown : "SJ" } )

This is legal, as long as the _id is unique

Aggregate Expressions in Mongo

Each field defined for the $group phase must use one of the following aggregation functions to generate its composite value:

$sum, $avg, $min, $max, $push, $addToSet, $first, $last are aggregation expressions used in the $group pipe.

$push and $addToSet build arrays. $push willadd repeats, $addToSet will not. $first and $last require you to sort the document

Using $sum

db.products.aggregate([ {$group: { _id : { maker : $manufacturer" }, sum_prices : { $sum : "$price" } } } ])

Using $avg

db.products.aggregate([ {$group : { _id : { category : "$category" }, avg_price: {$avg : "$price"} } } ]);

Using $addToSet

$addToSet has no direct parallel from SQL.

db.products.aggregate([ {$group : $ _id : { maker : "$manufacturer"}. categories: {$addToSet:"$category"} } } ]) ;

Using $push

Using $max and $min

For $min, if some, but not all, documents for the $min operation have either a null value for the field or are missing the field, the $min operator only considers the non-null and the non-missing values for the field.

Double $group stages

db.grades.aggregate([ {$group: {id : {class_id : "$class_id", student_id: "$student_id}", 'average' : {$avg : "$score"}}}, {$group: {_id: "$_id,class_id", 'average' : {$avg : "$average"}}}]) ;

$project

$match

$match performs a filter as they pass through the pipe. It has a reducing effect: there may be n documents in for every 1 document out.

$sort

Sort can be a big memory hog. If the sort is after the grouping phase, it cannot use an index. The sort can become before or after the group phase, you can sort multiple times, and sort is really useful.

$limit and $skip

When using find(), the order you gave skip and limit didn't matter, but in the aggregation framework, it does matter.

Revisiting $first and $last

$first and $last are group operators that allow you to get the first or last value in each group as the aggregation pipeline processes each document. There is no point in using $first or $last unless it follows a $sort operations, otherwise, the results are unpredictable.

$unwind

Documents can have arrays, and its not easy to group within an array unless it is moved out of the array form and made more flat. An array in a document is like prejoined data. Unwind unjoins it and makes it easier to group on. It results in a data explosion

[$unwind

example](https://www.youtube.com/watch?feature=player_embedded&v=XiWCJr4Lqag)

Which grouping operator will enable you to reverse the effects of an $unwind? $push. $addToSet would only work if each key in an array was unique.

[Double

$unwind](https://www.youtube.com/watch?feature=player_embedded&v=sgnidsHyFeU)

You may want to double $unwind if you have two arrays in a document. This will create a Cartesian product of the two arrays.

Can you reverse the effects of a double unwind with the $push operator? Yes. By using two $push's in a row you can do this.

[Mapping between SQL and

Aggregation](https://www.youtube.com/watch?feature=player_embedded&v=auL2R0XKlyM)

The Where clause is equivalent to the $match phase.

The HAVING clause is equivalent to the $match phase, which happens after the $group phase.

COUNT() is used by $sum:1

[Some common SQL

examples](https://www.youtube.com/watch?feature=player_embedded&v=ep2gLSR6C0U)

Visit this link ASAP

[Limitations of the Aggregation

Framework](https://www.youtube.com/watch?feature=player_embedded&v=8BQzKXI- _wE)

Result sets are limited to 16mb of memory. You cannot use 10% of the memory on a machine; you can reduce the size by the use of $match and $project. In a Sharded environment, after the first $group or $sort phase, the data needs to come back to the mongos router. Usually it the mongos is colocated with the application on the same machine, so this could slow down the application and other applications running on the machine. If you are unable to run a query, you can use the mapreduce option available in mongo, or use the hadoop connector to do it through mapreduce.. The pipeline cannot operatore on the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope.

Documentation Notes

[Pipeline Operators and

Indexes](http://docs.mongodb.org/manual/core/aggregation/#pipeline-operators- and-indexes)

When occurring at the beginning of the pipeline, or when placed before the $project, $unwind, and $group aggregation operators, the following pipeline operators can capitalize an index: $match, $sort, $limit, and $skip. Otherwise, they cannot take advantage of an index. A $geoNear pipeline must appear as the first stage in an aggregation pipeline.

A $match pipeline stage preceeded by a a $sort stage in the beginning of the pipeline is logically equivalent to a single query with a sort and can take advantage of an index.