In my previous post I presented a simple map function to query WordPress articles I had imported into CouchDB. That map function looked at the categories / terms manually assigned to the articles. I decided to take this a step further and analyze the actual text of the posts to extract keywords.
I created a very simple parser which:
- Strips out HTML
- Removes (English) stopwords
- Counts the number of occurrences of each word to provide a hint for ‘scoring’ results
The mapping code looks like this:
[javascript]
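// Returns true if the array holds obj (strict equality).
Array.prototype.contains = function(obj) {
  var i = this.length;
  while (i--) {
    if (this[i] === obj) {
      return true;
    }
  }
  return false;
};
// Returns how many times obj occurs in the array.
Array.prototype.count = function(obj) {
  var count = 0;
  var i = this.length;
  while (i--) {
    if (this[i] === obj) {
      count++;
    }
  }
  return count;
};
// Removes HTML tags and the literal text "nbsp".
function stripHTML(w){
  return w.replace(/(<([^>]+)>)|nbsp/ig, "");
}
// Replaces every run of non-letter characters with a single space.
function stripNonWords(w){
  return w.replace(/[^a-zA-Z]+/ig, " ");
}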
stopwords = ['a','about','above','across','after','afterwards','again','against','all','almost','alone','along','already','also','although','always','am','among','amongst','amoungst','amount','an','and','another','any','anyhow','anyone','anything','anyway','anywhere','are','around','as','at','back','be','became','because','become','becomes','becoming','been','before','beforehand','behind','being','below','beside','besides','between','beyond','bill','both','bottom','but','by','call','can','cannot','cant','co','computer','con','could','couldnt','cry','de','describe','detail','do','done','down','due','during','each','eg','eight','either','eleven','else','elsewhere','empty','enough','etc','even','ever','every','everyone','everything','everywhere','except','few','fifteen','fify','fill','find','fire','first','five','for','former','formerly','forty','found','four','from','front','full','further','get','give','go','had','has','hasnt','have','he','hence','her','here','hereafter','hereby','herein','hereupon','hers','herself','him','himself','his','how','however','hundred','i','ie','if','in','inc','indeed','interest','into','is','it','its','itself','keep','last','latter','latterly','least','less','ltd','made','many','may','me','meanwhile','might','mill','mine','more','moreover','most','mostly','move','much','must','my','myself','name','namely','neither','never','nevertheless','next','nine','no','nobody','none','noone','nor','not','nothing','now','nowhere','of','off','often','on','once','one','only','onto','or','other','others','otherwise','our','ours','ourselves','out','over','own','part','per','perhaps','please','put','rather','re','same','see','seem','seemed','seeming','seems','serious','several','she','should','show','side','since','sincere','six','sixty','so','some','somehow','someone','something','sometime','sometimes','somewhere','still','such','system','take','ten','than','that','the','their','them','themselves','then','thence','there','thereafter','thereby','therefore'
,'therein','thereupon','these','they','thick','thin','third','this','those','though','three','through','throughout','thru','thus','to','together','too','top','toward','towards','twelve','twenty','two','un','under','until','up','upon','us','very','via','was','we','well','were','what','whatever','when','whence','whenever','where','whereafter','whereas','whereby','wherein','whereupon','wherever','whether','which','while','whither','who','whoever','whole','whom','whose','why','will','with','within','without','would','yet','you','your','yours','yourself','yourselves'];
map = function(doc) {
  var body = stripNonWords(stripHTML(doc.body)).toLowerCase();
  var terms = []; // words already handled for this document
  var words = body.split(/\s+/);
  var i = words.length;
  while (i--) {
    var word = words[i];
    // Skip very short words and stopwords.
    if (word.length > 2 && !stopwords.contains(word)) {
      if (!terms.contains(word)) {
        terms.push(word);
        var weight = words.count(word);
        // Only index words that occur more than once in the post.
        if (weight > 1) {
          emit([word, weight], {title: doc.title});
        }
      }
    }
  }
};
[/javascript]
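To try the parsing steps listed above outside CouchDB, something like the following stand-alone sketch can help. It is not the view code itself: `extractKeywords`, `mapDoc`, the sample document and the four-word stopword list are made-up names and data for illustration, and the word counts are kept in an object instead of being recomputed per word.

```javascript
// Illustrative stand-alone harness; none of these names exist in the view.
var sampleStopwords = ['the', 'and', 'for', 'with'];

// Strip HTML, drop non-letters, lowercase, then count the remaining words.
function extractKeywords(html, stopwords) {
  var text = html
    .replace(/(<([^>]+)>)|nbsp/ig, '')  // same pattern as stripHTML
    .replace(/[^a-zA-Z]+/g, ' ')        // same pattern as stripNonWords
    .toLowerCase();
  var counts = Object.create(null);     // no inherited keys such as "constructor"
  text.split(/\s+/).forEach(function (word) {
    if (word.length > 2 && stopwords.indexOf(word) === -1) {
      counts[word] = (counts[word] || 0) + 1;
    }
  });
  return counts;
}

// Collect the rows a map function would emit: one [word, weight] key for
// every word that occurs more than once in the document body.
function mapDoc(doc, stopwords) {
  var emitted = [];
  var counts = extractKeywords(doc.body, stopwords);
  Object.keys(counts).forEach(function (word) {
    if (counts[word] > 1) {
      emitted.push({ key: [word, counts[word]], value: { title: doc.title } });
    }
  });
  return emitted;
}

var rows = mapDoc(
  { title: 'Demo', body: '<p>Groovy and Grails: Groovy closures with Groovy</p>' },
  sampleStopwords
);
console.log(JSON.stringify(rows));
// [{"key":["groovy",3],"value":{"title":"Demo"}}]
```

Only ‘groovy’ is emitted here, because ‘grails’ and ‘closures’ occur once each and ‘and’ / ‘with’ are filtered out as stopwords.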
The resulting view can be queried in much the same way as the previous one I described:
- startkey=["groovy",{}] – the highest key which may be returned; {} sorts after any number in CouchDB's view collation, so it acts like numerical infinity
- endkey=["groovy",0] – the lowest key to return
- descending=true – order direction
- limit=10 – max number of results to return
Querying the view with the parameters above returns posts containing the word ‘groovy’, ordered by the number of occurrences:
[javascript]
{"total_rows":3527,"offset":2253,"rows":[
{"id":"301","key":["groovy",11],"value":{"title":"Grails – Soap"}},
{"id":"432","key":["groovy",9],"value":{"title":"Running your griffon application in fullscreen mode"}},
{"id":"362","key":["groovy",8],"value":{"title":"Using propertyMissing to enhance Date (in Groovy)"}},
{"id":"380","key":["groovy",7],"value":{"title":"How Elvis showed me a neat way of using operators in Ruby"}},
{"id":"232","key":["groovy",7],"value":{"title":"Spring and scripting languages… don’t go together?"}},
{"id":"278","key":["groovy",6],"value":{"title":"Grails – associations"}},
{"id":"361","key":["groovy",5],"value":{"title":"Ranges with dates (in Groovy)"}}
]}
[/javascript]
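For reference, the query parameters can be assembled into a request URL roughly like this. The host, database name (`blog`), design document (`search`) and view name (`keywords`) are placeholders I made up for the sketch, and the exact path layout depends on your CouchDB version; only the parameter values follow the list above.

```javascript
// Builds a view query URL for a given word. The key values are JSON, so
// they have to be URL-encoded before being put in the query string.
function buildViewUrl(base, word, limit) {
  var params = {
    startkey: JSON.stringify([word, {}]), // {} collates after every number
    endkey: JSON.stringify([word, 0]),
    descending: 'true',
    limit: String(limit)
  };
  var query = Object.keys(params).map(function (name) {
    return name + '=' + encodeURIComponent(params[name]);
  }).join('&');
  return base + '?' + query;
}

// Hypothetical base URL; replace with your own database and view.
var url = buildViewUrl(
  'http://localhost:5984/blog/_design/search/_view/keywords', 'groovy', 10);
console.log(url);
```

Because `descending=true` reverses the sort order, `startkey` has to be the upper bound and `endkey` the lower bound; swapping them yields an empty result.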
I modified my WordPress templates to use this view now and it seems to yield better results.
Note
One thing I noticed while writing the mapping function is that altering JavaScript's Array prototype (i.e. adding my contains and count methods to it) seems to result in unpredictable problems. Still investigating.
Update
I probably made a mistake with the prototype extensions. I refactored it back and it works now; the code above has been updated.
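If you would rather not touch Array.prototype at all, the two helpers can also be written as plain functions. The sketch below is just one alternative, not the code used in the view; `contains`, `count` and `demoContains` are illustrative names. It also shows one well-known pitfall of prototype extensions, which may or may not be related to the problems mentioned above: properties assigned to a prototype are enumerable, so they show up in `for...in` loops over any array.

```javascript
// Plain helper functions; nothing is added to Array.prototype.
function contains(list, item) {
  return list.indexOf(item) !== -1;
}

function count(list, item) {
  var n = 0;
  for (var i = 0; i < list.length; i++) {
    if (list[i] === item) {
      n++;
    }
  }
  return n;
}

var words = ['groovy', 'grails', 'groovy'];
console.log(contains(words, 'grails')); // true
console.log(count(words, 'groovy'));    // 2

// Pitfall demo: a method assigned to the prototype is enumerable and
// therefore appears when iterating an array with for...in.
Array.prototype.demoContains = function () {};
var seen = [];
for (var key in ['a']) {
  seen.push(key);
}
delete Array.prototype.demoContains; // clean up again
console.log(seen); // [ '0', 'demoContains' ]
```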
Where did you put those helper functions? I’d use couchapp, put them in something like helpers/fulltextsearch.js and use a macro to import the code into your map function.
Or you could maybe just use couchdb-lucene: http://rnewson.github.com/couchdb-lucene/