Counting Tags with CouchDB and Map-Reduce

January 28, 2009

My previous post covered adding a simple view to CouchDB, but what happens when a plain map isn't enough? Say we want a list of every tag used across all articles, along with a count of how many articles use each one. Sure, we could emit doc.tags and crunch the arrays on the client side, but wouldn't it be nicer if CouchDB did the heavy lifting for us?

Good news: it can.

Here's a reminder of what the article documents look like:

{
  "_id": "monkeys-are-awesome",
  "_rev": "1534115156",
  "type": "article",
  "title": "Monkeys are awesome",
  "posted_at": "2008-09-14T20:45:14Z",
  "tags": [
    "monkeys",
    "awesome"
  ],
  "status": "Live",
  "author_id": "craig@barkingiguana.com",
  "updated_at": "2008-09-14T21:23:59Z",
  "body": "The article body would go here..."
}

First, we write a map function that emits each tag individually with a value of 1:

function(doc) {
  if(doc.type == 'article') {
    // Declare i with var so it doesn't leak into the global scope
    for(var i in doc.tags) {
      emit(doc.tags[i], 1);
    }
  }
}

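For the example document above, this would emit ("monkeys", 1) and ("awesome", 1). CouchDB sorts view rows by key, so they show up in the view as ("awesome", 1) followed by ("monkeys", 1). If several documents are tagged "monkeys", we'd see ("monkeys", 1) appear multiple times in the output.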
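
To see the emitted pairs without a CouchDB server, here is a minimal sketch that runs the map function in plain JavaScript. The emit stand-in and the second and third sample documents are assumptions made up for illustration; real CouchDB supplies emit itself and sorts the resulting rows by key.

```javascript
// Stand-in for CouchDB's emit(); here it just records each key/value pair.
var emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// The map function from the post, given a name so we can call it.
function map(doc) {
  if (doc.type == 'article') {
    for (var i in doc.tags) {
      emit(doc.tags[i], 1);
    }
  }
}

// Hypothetical sample documents; only the first mirrors the article above.
var docs = [
  { _id: 'monkeys-are-awesome', type: 'article', tags: ['monkeys', 'awesome'] },
  { _id: 'more-monkeys', type: 'article', tags: ['monkeys'] },
  { _id: 'some-comment', type: 'comment', tags: ['ignored'] }
];

docs.forEach(function (doc) { map(doc); });

// Pairs in emission order (unsorted); the non-article document emits nothing.
console.log(JSON.stringify(emitted));
// [["monkeys",1],["awesome",1],["monkeys",1]]
```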

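Now we need to reduce those results down to a list of unique tags with their totals. CouchDB calls the reduce function with an array of keys and an array of the values that were emitted for them; when the view is queried with grouping, the values are combined per unique key. CouchDB may also call the function again on its own earlier results (a "rereduce"), so it has to cope with being handed partial totals. Since our values are all 1s and our partial results are sums, we just add everything up: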

function(keys, counts) {
  // counts holds either the 1s from the map step or partial sums from a rereduce
  var sum = 0;
  for(var i = 0; i < counts.length; i++) {
    sum += counts[i];
  }
  return sum;
}
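
CouchDB does not always hand the reduce function every value for a key in one go: it can reduce chunks of rows separately and then feed those partial results back through the same function, passing true as a third argument. The sketch below simulates that by hand to show why a plain sum survives it. The document ids and the way the rows are chunked are made up for illustration.

```javascript
// The reduce function from the post, written with the full three-argument signature.
function reduce(keys, values, rereduce) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) {
    sum += values[i];
  }
  return sum;
}

// Rows for a single tag as the map step might produce them.
// On a first pass, keys are [emittedKey, documentId] pairs.
var keys = [['monkeys', 'doc-a'], ['monkeys', 'doc-b'], ['monkeys', 'doc-c']];
var values = [1, 1, 1];

// Reducing everything in one pass...
var direct = reduce(keys, values, false);

// ...matches reducing two chunks and then re-reducing the partial sums.
// During a rereduce the keys argument is null.
var partialA = reduce(keys.slice(0, 2), values.slice(0, 2), false);
var partialB = reduce(keys.slice(2), values.slice(2), false);
var combined = reduce(null, [partialA, partialB], true);

console.log(direct, combined); // 3 3
```

A reduce that counted values.length instead of summing would give the wrong answer on the rereduce pass, which is why emitting 1 and summing is the safer pattern.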

Install the reduce function alongside the map function using the "reduce" key in the design document:

{
  "tags": {
    "map": "function(doc) { if(doc.type == 'article') { for(var i in doc.tags) { emit(doc.tags[i], 1); }}}",
    "reduce": "function(keys, counts) { var sum = 0; for(var i = 0; i < counts.length; i++) { sum += counts[i]; } return sum; }"
  }
  // other views omitted for brevity
}
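
Squeezing functions into one-line JSON strings by hand is error-prone. One way around it, sketched here as an assumption rather than something CouchDB requires, is to write the functions normally and let JavaScript serialise them. The variable names are hypothetical, and the resulting object still has to be placed in the design document by whatever means you already use.

```javascript
// Write the view functions as ordinary, readable JavaScript.
function map(doc) {
  if (doc.type == 'article') {
    for (var i in doc.tags) {
      emit(doc.tags[i], 1);
    }
  }
}

function reduce(keys, counts) {
  var sum = 0;
  for (var i = 0; i < counts.length; i++) {
    sum += counts[i];
  }
  return sum;
}

// Function.prototype.toString() returns the function's source text,
// and JSON.stringify() takes care of escaping quotes and newlines.
var views = {
  tags: {
    map: map.toString(),
    reduce: reduce.toString()
  }
};

var json = JSON.stringify(views, null, 2);
console.log(json);
```

The map function is never called here, so it doesn't matter that emit is undefined outside CouchDB.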

Viewing the tags view in Futon gives you a nicely formatted list of tags and counts. To use the view via the HTTP API, you need to tell CouchDB to group results by key; otherwise the reduce collapses every row into a single grand total:

// GET http://localhost:5984/blog/_view/articles/tags?group=true&group_level=1

{"rows":[
  {"key":"agile","value":2},
  {"key":"ajax","value":2},
  {"key":"apache","value":2},
  {"key":"api","value":1},
  {"key":"awesome","value":1},
  {"key":"caching","value":1},
  {"key":"coding","value":7},
  {"key":"conference","value":1},
  // and so on ...
]}
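
On the client, the rows are easy to reshape. Here is a minimal sketch, assuming the response has already been fetched and parsed; tagCounts and byPopularity are hypothetical helper names, not part of any CouchDB client library, and the sample data is a trimmed copy of the response above.

```javascript
// A trimmed copy of the response above, as it would look after JSON.parse().
var response = { rows: [
  { key: 'agile', value: 2 },
  { key: 'api', value: 1 },
  { key: 'coding', value: 7 }
] };

// Turn the rows into a tag -> count lookup.
function tagCounts(rows) {
  var counts = {};
  rows.forEach(function (row) {
    counts[row.key] = row.value;
  });
  return counts;
}

// Most-used tags first, e.g. for sizing entries in a tag cloud.
// slice() copies the array so the original row order is left alone.
function byPopularity(rows) {
  return rows.slice().sort(function (a, b) {
    return b.value - a.value;
  });
}

var counts = tagCounts(response.rows);
var ranked = byPopularity(response.rows).map(function (row) { return row.key; });

console.log(counts.coding); // 7
console.log(ranked.join(', ')); // coding, agile, api
```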

And there it is: a tag cloud's worth of data, computed entirely inside CouchDB. Map-reduce is one of those things that clicks beautifully once you see it in action.

Questions or thoughts? Get in touch.