In my previous post I presented a simple map function to query the WordPress articles I imported into CouchDB. That map function looked at the categories and terms manually assigned to the articles. I decided to take this a step further and analyze the actual text of the posts to extract keywords.
I created a very simple parser which:
- Strips out HTML
- Removes (English) stopwords
- Counts the number of occurrences of each word to provide a hint for ‘scoring’ results
The mapping code looks like this:
[javascript]
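// Returns true if the array contains obj (strict equality).
Array.prototype.contains = function(obj) {
  var i = this.length;
  while (i--) {
    if (this[i] === obj) {
      return true;
    }
  }
  return false;
};
// Returns how many times obj occurs in the array.
Array.prototype.count = function(obj) {
  var count = 0;
  var i = this.length;
  while (i--) {
    if (this[i] === obj) {
      count++;
    }
  }
  return count;
};
// Removes HTML tags and the text "nbsp".
function stripHTML(w){
  return w.replace(/(<([^>]+)>)|nbsp/ig, "");
}
// Replaces every run of non-letter characters with a single space.
function stripNonWords(w){
  return w.replace(/[^a-zA-Z]+/ig, " ");
}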
stopwords = ['a','about','above','across','after','afterwards','again','against','all','almost','alone','along','already','also','although','always','am','among','amongst','amoungst','amount','an','and','another','any','anyhow','anyone','anything','anyway','anywhere','are','around','as','at','back','be','became','because','become','becomes','becoming','been','before','beforehand','behind','being','below','beside','besides','between','beyond','bill','both','bottom','but','by','call','can','cannot','cant','co','computer','con','could','couldnt','cry','de','describe','detail','do','done','down','due','during','each','eg','eight','either','eleven','else','elsewhere','empty','enough','etc','even','ever','every','everyone','everything','everywhere','except','few','fifteen','fify','fill','find','fire','first','five','for','former','formerly','forty','found','four','from','front','full','further','get','give','go','had','has','hasnt','have','he','hence','her','here','hereafter','hereby','herein','hereupon','hers','herself','him','himself','his','how','however','hundred','i','ie','if','in','inc','indeed','interest','into','is','it','its','itself','keep','last','latter','latterly','least','less','ltd','made','many','may','me','meanwhile','might','mill','mine','more','moreover','most','mostly','move','much','must','my','myself','name','namely','neither','never','nevertheless','next','nine','no','nobody','none','noone','nor','not','nothing','now','nowhere','of','off','often','on','once','one','only','onto','or','other','others','otherwise','our','ours','ourselves','out','over','own','part','per','perhaps','please','put','rather','re','same','see','seem','seemed','seeming','seems','serious','several','she','should','show','side','since','sincere','six','sixty','so','some','somehow','someone','something','sometime','sometimes','somewhere','still','such','system','take','ten','than','that','the','their','them','themselves','then','thence','there','thereafter','thereby','therefore'
,'therein','thereupon','these','they','thick','thin','third','this','those','though','three','through','throughout','thru','thus','to','together','too','top','toward','towards','twelve','twenty','two','un','under','until','up','upon','us','very','via','was','we','well','were','what','whatever','when','whence','whenever','where','whereafter','whereas','whereby','wherein','whereupon','wherever','whether','which','while','whither','who','whoever','whole','whom','whose','why','will','with','within','without','would','yet','you','your','yours','yourself','yourselves'];
// Map function: emits [word, weight] for every non-stopword longer than
// two characters that occurs more than once in the post body.
map = function(doc) {
  var body = stripNonWords(stripHTML(doc.body)).toLowerCase();
  var terms = []; // words already handled for this document
  var words = body.split(/\s+/);
  var i = words.length;
  while (i--) {
    var word = words[i];
    if (word.length > 2 && !stopwords.contains(word)) {
      if (!terms.contains(word)) {
        terms.push(word);
        var weight = words.count(word);
        if (weight > 1) {
          emit([word, weight], {title: doc.title});
        }
      }
    }
  }
}
[/javascript]
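To try the idea outside CouchDB, here is a minimal standalone sketch of the same three steps (strip HTML, drop stopwords, count occurrences). It is not the view code itself: the names `SAMPLE_STOPWORDS`, `countTerms` and `mapLike` are made up for this example, the stopword list is a tiny subset, word counts are kept in a plain object instead of going through the Array prototype, and tags are replaced by a space rather than an empty string, which is my assumption about the desired behavior.

```javascript
// Hypothetical, tiny stopword subset used only for this sketch.
var SAMPLE_STOPWORDS = { the: true, and: true, "with": true };

// Strips tags and non-letters, then counts every remaining word
// that is longer than two characters and not a stopword.
function countTerms(html, stopwords) {
  var text = html
    .replace(/<[^>]+>/g, " ")     // drop HTML tags
    .replace(/[^a-zA-Z]+/g, " ")  // keep letters only
    .toLowerCase();
  var counts = Object.create(null); // no inherited keys such as "constructor"
  text.split(/\s+/).forEach(function (word) {
    var isStopword = Object.prototype.hasOwnProperty.call(stopwords, word);
    if (word.length > 2 && !isStopword) {
      counts[word] = (counts[word] || 0) + 1;
    }
  });
  return counts;
}

// Mimics what the map function emits: one row per word that occurs
// more than once, keyed by [word, weight].
function mapLike(doc, stopwords) {
  var counts = countTerms(doc.body, stopwords);
  return Object.keys(counts)
    .filter(function (word) { return counts[word] > 1; })
    .map(function (word) {
      return { key: [word, counts[word]], value: { title: doc.title } };
    });
}

var rows = mapLike(
  { title: "Demo", body: "<p>Groovy and Grails: groovy with the Groovy shell</p>" },
  SAMPLE_STOPWORDS
);
console.log(JSON.stringify(rows));
// → [{"key":["groovy",3],"value":{"title":"Demo"}}]
```

Counting in a single pass also avoids rescanning the whole word list for every distinct word, which may matter for long posts.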
The resulting view can be queried in much the same way as the previous one I described:
http://log4p.com:5984/articles/_design/split/_view/withoutStopWords?startkey=["groovy",{}]&endkey=["groovy",0]&descending=true
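- startkey=["groovy",{}] – the highest key which may be returned; {} sorts after any number in CouchDB's collation order, so it works like numerical infinity
- endkey=["groovy",0] – the lowest key to return
- descending=true – order direction (highest weight first)
- limit=10 – maximum number of results to return (optional; not included in the URL above)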
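The key parameters are JSON values, so when the URL is built in code rather than typed into a browser they should be URL-encoded. A small sketch of how that could look follows; `buildViewUrl` is a hypothetical helper written for this example, not something CouchDB provides.

```javascript
// Hypothetical helper: builds the query URL for a given word.
// baseUrl is the view URL without a query string.
function buildViewUrl(baseUrl, word, limit) {
  var params = {
    startkey: JSON.stringify([word, {}]), // upper bound: {} sorts after numbers
    endkey: JSON.stringify([word, 0]),    // lower bound
    descending: "true"
  };
  if (limit) {
    params.limit = String(limit);
  }
  var query = Object.keys(params).map(function (name) {
    return name + "=" + encodeURIComponent(params[name]);
  }).join("&");
  return baseUrl + "?" + query;
}

var url = buildViewUrl(
  "http://log4p.com:5984/articles/_design/split/_view/withoutStopWords",
  "groovy",
  10
);
console.log(url);
```

Using `JSON.stringify` for the keys keeps the quoting correct even if the search word itself contains a quote character.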
Calling the URL above will return posts containing the word ‘groovy’ ordered by the number of occurrences:
[javascript]
{"total_rows":3527,"offset":2253,"rows":[
{"id":"301","key":["groovy",11],"value":{"title":"Grails – Soap"}},
{"id":"432","key":["groovy",9],"value":{"title":"Running your griffon application in fullscreen mode"}},
{"id":"362","key":["groovy",8],"value":{"title":"Using propertyMissing to enhance Date (in Groovy)"}},
{"id":"380","key":["groovy",7],"value":{"title":"How Elvis showed me a neat way of using operators in Ruby"}},
{"id":"232","key":["groovy",7],"value":{"title":"Spring and scripting languages… don’t go together?"}},
{"id":"278","key":["groovy",6],"value":{"title":"Grails – associations"}},
{"id":"361","key":["groovy",5],"value":{"title":"Ranges with dates (in Groovy)"}}
]}
[/javascript]
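On the consuming side, a response like the one above only needs a little reshaping before it can be shown as a list of related posts. A sketch, assuming the response has already been parsed from JSON; `toResults` is a hypothetical name, and the sample data is the first three rows from the output above.

```javascript
// First three rows of the response shown above, already parsed.
var response = {
  total_rows: 3527,
  offset: 2253,
  rows: [
    { id: "301", key: ["groovy", 11], value: { title: "Grails – Soap" } },
    { id: "432", key: ["groovy", 9], value: { title: "Running your griffon application in fullscreen mode" } },
    { id: "362", key: ["groovy", 8], value: { title: "Using propertyMissing to enhance Date (in Groovy)" } }
  ]
};

// Hypothetical helper: flattens each row into id, title and weight.
// The weight is the second element of the [word, weight] key.
function toResults(response) {
  return response.rows.map(function (row) {
    return { id: row.id, title: row.value.title, weight: row.key[1] };
  });
}

var results = toResults(response);
results.forEach(function (result) {
  console.log(result.weight + "  " + result.title);
});
```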
I have now modified my WordPress templates to use this view, and it seems to yield better results.
Note
One thing I noticed while writing the mapping function is that altering JavaScript's Array prototype (I wanted to add my contains and count methods to it) seems to result in unpredictable problems. Still investigating.
Update
I probably made a mistake with the prototype extensions. I refactored the code back and it works now; the code above has been updated.
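For what it's worth, one well-known way that Array prototype extensions cause surprises is through `for...in` loops: a method added by plain assignment is enumerable, so it shows up as an extra key. I can't say whether this is what went wrong in my case; the sketch below only illustrates that general pitfall. It assumes an engine that supports ES5's `Object.defineProperty`, which may not hold for every CouchDB view server.

```javascript
// Plain assignment creates an enumerable property on the prototype.
Array.prototype.contains = function (obj) {
  return this.indexOf(obj) !== -1;
};

var words = ["groovy", "grails"];

// for...in walks enumerable keys, including inherited ones.
var seen = [];
for (var key in words) {
  seen.push(key);
}
console.log(seen); // → [ '0', '1', 'contains' ]

// Defining the method as non-enumerable hides it from for...in.
delete Array.prototype.contains;
Object.defineProperty(Array.prototype, "contains", {
  value: function (obj) { return this.indexOf(obj) !== -1; },
  enumerable: false,
  configurable: true,
  writable: true
});

var seenAfter = [];
for (var key2 in words) {
  seenAfter.push(key2);
}
console.log(seenAfter); // → [ '0', '1' ]
console.log(words.contains("groovy")); // → true
```

Index-based loops such as `while (i--)` in the map function are not affected either way, since they never enumerate keys.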