Simple fulltext analysis in CouchDB

In my previous post I presented a simple map function to query WordPress articles I had imported into CouchDB. That map function looked at the categories / terms manually assigned to the articles. I decided to take this a step further and analyze the actual text of the posts to extract keywords.

I created a very simple parser which:

  • Strips out HTML
  • Removes (English) stopwords
  • Counts the number of occurrences of each word to provide a hint for ‘scoring’ results

The mapping code looks like this:

[javascript]
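// Returns true if the array holds obj (strict equality).
Array.prototype.contains = function(obj) {
  var i = this.length;
  while (i--) {
    if (this[i] === obj) {
      return true;
    }
  }
  return false;
};

// Returns how often obj occurs in the array.
Array.prototype.count = function(obj) {
  var count = 0;
  var i = this.length;
  while (i--) {
    if (this[i] === obj) {
      count++;
    }
  }
  return count;
};

// Removes HTML tags and the literal text "nbsp".
function stripHTML(w) {
  return w.replace(/(<([^>]+)>)|nbsp/ig, "");
}

// Collapses every run of non-letters into a single space.
function stripNonWords(w) {
  return w.replace(/[^a-zA-Z]+/ig, " ");
}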

stopwords = ['a','about','above','across','after','afterwards','again','against','all','almost','alone','along','already','also','although','always','am','among','amongst','amoungst','amount','an','and','another','any','anyhow','anyone','anything','anyway','anywhere','are','around','as','at','back','be','became','because','become','becomes','becoming','been','before','beforehand','behind','being','below','beside','besides','between','beyond','bill','both','bottom','but','by','call','can','cannot','cant','co','computer','con','could','couldnt','cry','de','describe','detail','do','done','down','due','during','each','eg','eight','either','eleven','else','elsewhere','empty','enough','etc','even','ever','every','everyone','everything','everywhere','except','few','fifteen','fify','fill','find','fire','first','five','for','former','formerly','forty','found','four','from','front','full','further','get','give','go','had','has','hasnt','have','he','hence','her','here','hereafter','hereby','herein','hereupon','hers','herself','him','himself','his','how','however','hundred','i','ie','if','in','inc','indeed','interest','into','is','it','its','itself','keep','last','latter','latterly','least','less','ltd','made','many','may','me','meanwhile','might','mill','mine','more','moreover','most','mostly','move','much','must','my','myself','name','namely','neither','never','nevertheless','next','nine','no','nobody','none','noone','nor','not','nothing','now','nowhere','of','off','often','on','once','one','only','onto','or','other','others','otherwise','our','ours','ourselves','out','over','own','part','per','perhaps','please','put','rather','re','same','see','seem','seemed','seeming','seems','serious','several','she','should','show','side','since','sincere','six','sixty','so','some','somehow','someone','something','sometime','sometimes','somewhere','still','such','system','take','ten','than','that','the','their','them','themselves','then','thence','there','thereafter','thereby','therefore'
,'therein','thereupon','these','they','thick','thin','third','this','those','though','three','through','throughout','thru','thus','to','together','too','top','toward','towards','twelve','twenty','two','un','under','until','up','upon','us','very','via','was','we','well','were','what','whatever','when','whence','whenever','where','whereafter','whereas','whereby','wherein','whereupon','wherever','whether','which','while','whither','who','whoever','whole','whom','whose','why','will','with','within','without','would','yet','you','your','yours','yourself','yourselves'];

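map = function(doc) {
  // Normalise the post body to lowercase words separated by whitespace.
  var body = stripNonWords(stripHTML(doc.body)).toLowerCase();
  var terms = []; // words already handled for this document
  var words = body.split(/\s+/);

  var i = words.length;
  while (i--) {
    var word = words[i];
    // Skip very short words and stopwords.
    if (word.length > 2 && !stopwords.contains(word)) {
      if (!terms.contains(word)) {
        terms.push(word);
        var weight = words.count(word);
        // Only emit words that occur more than once in the post.
        if (weight > 1) {
          emit([word, weight], {title: doc.title});
        }
      }
    }
  }
};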
[/javascript]
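To try the idea outside CouchDB, something like the following may help. It is only a sketch under a few assumptions: it runs in plain Node.js, `emit` is a stand-in that collects rows in an array, the stopword list is shortened, and the sample document is made up. It also swaps the array helpers for a plain counter object, so it is not the exact code of the view above.

```javascript
// Shortened stopword list for illustration only; the full list is above.
var stopwords = ['a', 'and', 'the', 'with', 'for', 'this', 'that'];

function stripHTML(w) {
  return w.replace(/(<([^>]+)>)|nbsp/ig, "");
}

function stripNonWords(w) {
  return w.replace(/[^a-zA-Z]+/ig, " ");
}

// Stand-in for CouchDB's emit(): collects the rows in an array.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

function map(doc) {
  var words = stripNonWords(stripHTML(doc.body)).toLowerCase().split(/\s+/);

  // Count each remaining word once; no prototype chain, so words such as
  // "constructor" cannot collide with inherited properties.
  var counts = Object.create(null);
  words.forEach(function (word) {
    if (word.length > 2 && stopwords.indexOf(word) === -1) {
      counts[word] = (counts[word] || 0) + 1;
    }
  });

  // Same rule as the view: only words occurring more than once are emitted.
  Object.keys(counts).forEach(function (word) {
    if (counts[word] > 1) {
      emit([word, counts[word]], { title: doc.title });
    }
  });
}

// Hypothetical sample document.
map({
  title: "Sample post",
  body: "<p>Groovy and Grails: groovy scripts with the Groovy console.</p>"
});
console.log(JSON.stringify(rows));
```

Only ‘groovy’ occurs more than once in the sample body, so a single row with key `["groovy", 3]` is collected.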

The resulting view can be queried in the same way as the previous one I described:

http://log4p.com:5984/articles/_design/split/_view/withoutStopWords?startkey=["groovy",{}]&endkey=["groovy",0]&descending=true

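  • startkey=["groovy",{}] – the highest key which may be returned; {} sorts after every number in CouchDB's key ordering, so it acts like numerical infinity
  • endkey=["groovy",0] – the lowest key to return
  • descending=true – return the highest keys first, which is why startkey is the upper bound
  • limit=10 – optional; the maximum number of results to return (not used in the URL above)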
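When the URL is built in code rather than typed into a browser, the JSON keys need to be URL-encoded. A minimal sketch, assuming plain JavaScript; `buildTermQuery` is a hypothetical helper name, not part of CouchDB:

```javascript
// Hypothetical helper: builds the view URL for one search term.
function buildTermQuery(baseUrl, term, limit) {
  var params = {
    startkey: JSON.stringify([term, {}]), // upper bound: {} sorts after any number
    endkey: JSON.stringify([term, 0]),    // lower bound
    descending: "true",
    limit: String(limit)
  };

  // URL-encode every value so brackets, quotes and braces survive transport.
  var query = Object.keys(params).map(function (name) {
    return name + "=" + encodeURIComponent(params[name]);
  }).join("&");

  return baseUrl + "?" + query;
}

var url = buildTermQuery(
  "http://log4p.com:5984/articles/_design/split/_view/withoutStopWords",
  "groovy",
  10
);
console.log(url);
```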

Calling the URL above will return posts containing the word ‘groovy’ ordered by the number of occurrences:

[javascript]
{"total_rows":3527,"offset":2253,"rows":[
{"id":"301","key":["groovy",11],"value":{"title":"Grails – Soap"}},
{"id":"432","key":["groovy",9],"value":{"title":"Running your griffon application in fullscreen mode"}},
{"id":"362","key":["groovy",8],"value":{"title":"Using propertyMissing to enhance Date (in Groovy)"}},
{"id":"380","key":["groovy",7],"value":{"title":"How Elvis showed me a neat way of using operators in Ruby"}},
{"id":"232","key":["groovy",7],"value":{"title":"Spring and scripting languages… don’t go together?"}},
{"id":"278","key":["groovy",6],"value":{"title":"Grails – associations"}},
{"id":"361","key":["groovy",5],"value":{"title":"Ranges with dates (in Groovy)"}}
]}
[/javascript]
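On the consuming side, each row carries the occurrence count as the second element of its key. A rough sketch of turning such a response into a ranked list, assuming plain JavaScript; `toResults` is a hypothetical helper and the sample holds just two rows copied from the response above:

```javascript
// Two rows copied from the response above, trimmed for brevity.
var response = {
  rows: [
    { id: "301", key: ["groovy", 11], value: { title: "Grails – Soap" } },
    { id: "432", key: ["groovy", 9], value: { title: "Running your griffon application in fullscreen mode" } }
  ]
};

// Hypothetical helper: flattens view rows into {id, title, score} objects.
function toResults(viewResponse) {
  return viewResponse.rows.map(function (row) {
    return { id: row.id, title: row.value.title, score: row.key[1] };
  });
}

var results = toResults(response);
results.forEach(function (r) {
  console.log(r.score + "  " + r.title);
});
```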

I have now modified my WordPress templates to use this view, and it seems to yield better results.

Note
One thing I noticed while writing the mapping function is that altering JavaScript's Array prototype (i.e. adding my contains and count methods to it) seems to result in unpredictable problems. Still investigating.

update
I probably made a mistake with the prototype extensions. I refactored them and everything works now; the code above has been updated.
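One well-known side effect of extending Array.prototype, which may or may not be what went wrong here, is that the added methods are enumerable and therefore show up in for...in loops over any array. The sketch below demonstrates this with a throwaway method and then shows standalone helpers as an alternative; `demoContains`, `contains` and `count` are hypothetical names chosen for illustration, and the code assumes a plain JavaScript runtime such as Node.js.

```javascript
// Add a throwaway method to the prototype to show the side effect.
Array.prototype.demoContains = function (obj) {
  return this.indexOf(obj) !== -1;
};

var seen = [];
for (var key in ['a', 'b']) {
  seen.push(key);
}
console.log(seen); // includes "demoContains" next to the indices

// Clean up so the rest of the program is unaffected.
delete Array.prototype.demoContains;

// Standalone helpers leave the prototype untouched.
function contains(list, obj) {
  return list.indexOf(obj) !== -1;
}

function count(list, obj) {
  return list.filter(function (item) {
    return item === obj;
  }).length;
}

console.log(contains(['a', 'b'], 'b'), count(['a', 'b', 'a'], 'a'));
```

With helpers like these, the checks in the map function would read `contains(stopwords, word)` and `count(words, word)` instead of calling methods on the arrays.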


One Response to Simple fulltext analysis in couchdb

  1. Where did you put those helper functions? I’d use couchapp, put them in something like helpers/fulltextsearch.js and use a macro to import the code into your map function.

    Or you could maybe just use couchdb-lucene: http://rnewson.github.com/couchdb-lucene/
