mongoDB aggregation pipelines

Why use an aggregation pipeline?

In this post we will give an example of an aggregation pipeline step by step. It would be as a light tutorial to touch on the subject. We will start with the given context and expose the problem afterwards, then write an aggregation pipeline and dissect it!

The aggregate pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. https://docs.mongodb.com/manual/core/aggregation-pipeline/

READY? Let's go.

Given

Given a bunch of documents in a mongoDB collection like the below.
Let's imagine for a moment that those are events generated by a game, concerning win or loss (only, for simplification). Each kind of event has some specific data to it. So we would have a bunch of documents like so:

{
  "playerId": "c5f348a003840f8",
  "level": 0,
  "event": {
    "score": 54,
    "damagedBy": "ASTEROID",
    "levelKey": {
      "id": 0
    },
    "type": "lost"
  }
}

{
  "playerId": "5f348a0038cb8d",
  "level": 2,
  "event": {
    "flawless": true,
    "collector": false,
    "levelKey": {
      "id": 2
    },
    "score": 315,
    "retries": 3,
    "type": "won"
  }
}
 ... imagine many events of those, let's say from a game  ...

The problem

Lets say we need to construct for a given player (identified by it's id), an aggregate for each level. This aggregate would tell how many times the level was played, how many times it was won, if it was flawless, if it has the collector property and the score (as coins).

To give you a better idea, the document should look like so:

{
  "played": 1,
  "won": 0,
  "flawless": false,
  "collector": false,
  "coins": 38,
  "level": 5
}
  • played = count of events
  • won =count of event with type=won
  • coins = sum(score),
  • flawless and collector are true when at least one event for the level has them true
  • level = the level id.

Solution

The below aggregation pipeline does that. Have a look at it and then we will break it down bit by bit.

[{
    "$match": {
        "playerId": "c5f348a003840f8"
    }
}, {
    "$group": {
        "_id": "$level",
        "played": {
            "$sum": 1
        },
        "won": {
            "$sum": {
                "$cond": [{
                    "$eq": [
                        "$event.type",
                        "won"
                    ]
                },1,0]
            }
        },
        "flawless": {
            "$sum": {
                "$cond": [{
                    "$eq": [
                        "$event.flawless",
                        true
                    ]
                },1,0]
            }
        },
        "collector": {
            "$sum": {
                "$cond": [{
                    "$eq": [
                        "$event.collector",
                        true
                    ]
                },1,0]
            }
        },
        "coins": {
            "$sum": "$event.score"
        }
    }
}, {
    "$project": {
        "level": '$_id',
        "coins": 1,
        "played": 1,
        "won": 1,
        "flawless":  { $gt: ['$flawless', 0] } , 
        "collector": { $gt: ['$collector', 0] } 
    }
    }]

Let's break it down

A bit of definitions first

The aggregate pipeline is

a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. https://docs.mongodb.com/manual/core/aggregation-pipeline/

https://docs.mongodb.com/manual/_images/aggregation-pipeline.png

In layman's terms, a pipeline consists of several stages. Each stage transforms the data as it passes through it, and pass it along the next stage. Easy, right?

Ok, so next logical question you ask is what kind of stages are there?
I am glad you asked! At the time of writing there is around 2Oish operator. (too many to list or explain - check here if you want to see them all)

For the concern of this blog post, we will use 3 stages to perform our data transformation, namely the match, group and project.

$match stage

Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. $match uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).

Remember that in the context of our problem we are interested in filtering the events for a given player. We do not want to mix results from different players. That wouldn't make any sense for the problem we want to solve!

To match the documents for a given player, let's say id=c5f348a003840f8, our $match stage is as easy as this:

{ "$match": { "playerId": "c5f348a003840f8" } }

Great. Now we need to start munching some data, grouping and transforming it. Shall we?

$group stage

Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.

If all you've understood resembles, blah blah blah blah, don't be discouraged. Let me try to explain with our example.

Let's start small. (Most things start like that.)
If we want to see how many times the player played a given level, we would have to group by the level id, and count the number of documents related to that level. (remember we only generate won and lost event in our "complex" game)

{ "$group": {
      "_id": "$level",
      "played": { "$sum": 1 }
  }
}
  • _id: this is the key you pass to SQL's GROUP BY, or RxJava groupBy key. Set this to the things you want to group on
  • played: here we use an accumulator, the sum. Practically we would add 1 each time we see a document. Thus counting them.

Sweet! Let's make it a bit harder.

In order to see how many time the player won his level, we'd have to count (accumulate the sum) the documents having the event.type == "won"

"won": {
    "$sum": {
        "$cond": [{
            "$eq": [
                "$event.type",
                "won"
            ]
        }, 1, 0]
    }
}

A bit harder, but not by much! We basically used an equality conditional ($cond and $eq), which operates like a ternary operator, and returns 1 if true or 0 otherwise. and then we sum those. Basically adding 1 each time we see a won.

After adding almost the same for the other fields, our documents look like so:

{
   "_id": 7,
   "played": 4,
   "won": 2,
   "flawless": 2,
   "collector": 2,
   "coins": 1044
}

Pretty close, but not there yet.

inner voice: I wish there were a way to reshape somehow the document.

$project stage

Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.

Great! Santa DO EXIST! :'(

Finally, we will use a projection in order to filter out the _id field, and rename or reshape the others:

 "$project": {
       "level": '$_id',
       "_id": 0,
       "coins": 1,
       "played": 1,
       "won": 1,
       "flawless":  { $gt: ['$flawless', 0] } ,
       "collector": { $gt: ['$collector', 0] }
 }

Here is what we are saying:

  • level: we set it to be equal
  • _id: we restrict it from showing into the final document by specifying it as 0
  • coins played and won are passed as they are - so we set them to 1
  • collector and flawless: we convert it to boolean by checking if we had at least 1 as the sum

and that's it! It was easy :)

I'm pretty sure we could have written that in another way. Maybe you want to tell me about it in the comments?

Comments

comments powered by Disqus