Text 24 Jan MongoDB aggregation using Scala (1/2)

This is something that you probably were looking for if you’re using Casbah and recently upgraded mongo to support the famous aggregation framework.

In this post we will work through how to add this implementation to your code base, and a simple example that will cover most of the use cases for the aggregation pipeline.

Let’s assume we have a collection “city” filled with documents that look like:

[
    {
        "name" : "Washington",
        "state" : "DC",
        "coast" : "EAST",
        "population" : 617996,
        "mayor" : "Vincent C. Gray"
    },
    {
        "name" : "Baltimore",
        "state" : "MD",
        "coast" : "EAST",
        "population" : 619493,
        "mayor" : "Stephanie R. Blake"
    },
    {
        "name" : "Annapolis",
        "state" : "MD",
        "coast" : "EAST",
        "population" : 38880,
        "mayor" : "Joshua J. Cohen"
    },
    {
        "name" : "Philadelphia",
        "state" : "PA",
        "coast" : "EAST",
        "population" : 1536471,
        "mayor" : "Michael A. Nutter"
    },
    {
        "name" : "Scranton",
        "state" : "PA",
        "coast" : "EAST",
        "population" : 75995,
        "mayor" : "Christopher Doherty"
    },
    {
        "name" : "San Francisco",
        "state" : "CA",
        "coast" : "WEST",
        "population" : 805235,
        "mayor" : "Edwin M. Lee"
    },
    {
        "name" : "Sacramento",
        "state" : "CA",
        "coast" : "WEST",
        "population" : 472178,
        "mayor" : "Kevin Johnson"
    }
]

Now let’s try to get the mayor of the most populated city for each state that is on the East cost. But first, let’s break down the problem into smaller steps:

1) Filter out the non east-coast cities.

2) Sort the cities by population.

3) Group cities by state, and return the mayor name.

// step 1: Filtering
val matchStatement = MongoDBObject("$match" -> MongoDBObject("coast" -> "EAST"))

// step 2: Sorting
val sortStatement =  MongoDBObject("$sort" -> MongoDBObject("population" -> -1))

// step 3: Grouping
val groupStatement = MongoDBObject("$group" -> MongoDBObject(
      "_id" -> MongoDBObject("state" -> "$state"),
      "mayor" -> MongoDBObject("$first" -> "$mayor"),
      "population" -> MongoDBObject("$first" -> "$population")
    ))
    
val pipeLine = MongoDBList(matchStatement, sortStatement, groupStatement)

val result = db.command(MongoDBObject("aggregate" -> "city", "pipeline" -> pipeline)).get("result")

Now, let’s explain a bit of what happened here:

1) $match: Pretty straight forward clause that will only forward filtered states to the group statement. One can argue that this step can be done after the grouping, but (in case of a huge set of data) that will simply use much more memory/time to process since all this computation takes place in your mongo box RAM.

2) $sort: This will help fetching the highest population when grouping by using the $first operator. This step can be replaced by $max within the $group clause.

3) $group: As any group statement, this will group by the “_id” passed which is the “state” field in our case (please note that when using the aggregation framework, the DB fields should be preceded with $, it’s also highly recommended to name the fetched field the same as the parsed). Also as we used the $sort prior to the $group, we can take advantage of the $first operator that will ‘hold’ the first document for each (state) group.

Comments
blog comments powered by Disqus

Design crafted by Prashanth Kamalakanthan. Powered by Tumblr.