log4p

Peter Maas’s Weblog

Archive for May, 2009

Simple fulltext analysis in couchdb

In my previous post I presented a simple map function to query the Wordpress articles I imported into CouchDB. The map function looked at the categories / terms manually assigned to the articles. I decided to take this a step further and analyze the actual text of the posts to extract keywords.

I created a very simple parser which:

  • Strips out HTML
  • Removes (English) stopwords
  • Counts the number of occurrences of each word to provide a hint for 'scoring' results

The mapping code looks like this:

  // true if the array contains obj (strict equality)
  Array.prototype.contains = function(obj) {
    var i = this.length;
    while (i--) {
      if (this[i] === obj) {
        return true;
      }
    }
    return false;
  }

  // number of times obj occurs in the array
  Array.prototype.count = function(obj) {
    var count = 0;
    var i = this.length;
    while (i--) {
      if (this[i] === obj) {
        count++;
      }
    }

    return count;
  }

  function stripHTML(w){
    return w.replace(/(<([^>]+)>)|nbsp/ig,"");
  }

  function stripNonWords(w){
    return w.replace(/[^a-zA-Z]+/ig," ");
  }

  stopwords = ['a','about','above','across','after','afterwards','again','against','all','almost','alone','along','already','also','although','always','am','among','amongst','amoungst','amount','an','and','another','any','anyhow','anyone','anything','anyway','anywhere','are','around','as','at','back','be','became','because','become','becomes','becoming','been','before','beforehand','behind','being','below','beside','besides','between','beyond','bill','both','bottom','but','by','call','can','cannot','cant','co','computer','con','could','couldnt','cry','de','describe','detail','do','done','down','due','during','each','eg','eight','either','eleven','else','elsewhere','empty','enough','etc','even','ever','every','everyone','everything','everywhere','except','few','fifteen','fifty','fill','find','fire','first','five','for','former','formerly','forty','found','four','from','front','full','further','get','give','go','had','has','hasnt','have','he','hence','her','here','hereafter','hereby','herein','hereupon','hers','herself','him','himself','his','how','however','hundred','i','ie','if','in','inc','indeed','interest','into','is','it','its','itself','keep','last','latter','latterly','least','less','ltd','made','many','may','me','meanwhile','might','mill','mine','more','moreover','most','mostly','move','much','must','my','myself','name','namely','neither','never','nevertheless','next','nine','no','nobody','none','noone','nor','not','nothing','now','nowhere','of','off','often','on','once','one','only','onto','or','other','others','otherwise','our','ours','ourselves','out','over','own','part','per','perhaps','please','put','rather','re','same','see','seem','seemed','seeming','seems','serious','several','she','should','show','side','since','sincere','six','sixty','so','some','somehow','someone','something','sometime','sometimes','somewhere','still','such','system','take','ten','than','that','the','their','them','themselves','then','thence','there','thereafter','thereby','therefore','therein','thereupon','these','they','thick','thin','third','this','those','though','three','through','throughout','thru','thus','to','together','too','top','toward','towards','twelve','twenty','two','un','under','until','up','upon','us','very','via','was','we','well','were','what','whatever','when','whence','whenever','where','whereafter','whereas','whereby','wherein','whereupon','wherever','whether','which','while','whither','who','whoever','whole','whom','whose','why','will','with','within','without','would','yet','you','your','yours','yourself','yourselves'];

  map = function(doc) {
    var body = stripNonWords(stripHTML(doc.body)).toLowerCase();
    var terms = [];
    var words = body.split(/\s+/);

    var i = words.length;
    while (i--) {
      var word = words[i];
      // skip very short words and stopwords
      if (word.length > 2 && !stopwords.contains(word)) {
        // handle every distinct word only once
        if (!terms.contains(word)) {
          terms.push(word);
          var weight = words.count(word);
          // only words that occur more than once are emitted
          if (weight > 1) {
            emit([word, weight], {title: doc.title});
          }
        }
      }
    }
  }

The resulting view can be queried in the same way as the one I described previously:

http://log4p.com:5984/articles/_design/split/_view/withoutStopWords?startkey=["groovy",{}]&endkey=["groovy",0]&descending=true

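  • startkey=["groovy",{}] - the highest key that may be returned; {} sorts after every number in CouchDB's key ordering, so it acts as an upper bound
  • endkey=["groovy",0] - the lowest key to return
  • descending=true - order direction (highest number of occurrences first)
  • limit=10 - max number of results to return (optional; not used in the URL above)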

Calling the URL above will return posts containing the word 'groovy' ordered by the number of occurrences:

  {"total_rows":3527,"offset":2253,"rows":[
  {"id":"301","key":["groovy",11],"value":{"title":"Grails - Soap"}},
  {"id":"432","key":["groovy",9],"value":{"title":"Running your griffon application in fullscreen mode"}},
  {"id":"362","key":["groovy",8],"value":{"title":"Using propertyMissing to enhance Date (in Groovy)"}},
  {"id":"380","key":["groovy",7],"value":{"title":"How Elvis showed me a neat way of using operators in Ruby"}},
  {"id":"232","key":["groovy",7],"value":{"title":"Spring and scripting languages... don't go together?"}},
  {"id":"278","key":["groovy",6],"value":{"title":"Grails - associations"}},
  {"id":"361","key":["groovy",5],"value":{"title":"Ranges with dates (in Groovy)"}}
  ]}
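
The query URL used here can also be assembled in code, which takes care of encoding the JSON keys. This is only a minimal sketch: `buildViewUrl` is a hypothetical helper name, not part of CouchDB or of my templates.

```javascript
// Hypothetical helper: builds a query URL for a view whose keys
// have the form [term, weight].
function buildViewUrl(viewUrl, term, limit) {
  var params = [
    // With descending=true the startkey is the upper bound; {} collates
    // after every number, so [term, {}] sits above any [term, weight].
    "startkey=" + encodeURIComponent(JSON.stringify([term, {}])),
    "endkey=" + encodeURIComponent(JSON.stringify([term, 0])),
    "descending=true"
  ];
  if (limit !== undefined) {
    params.push("limit=" + limit);
  }
  return viewUrl + "?" + params.join("&");
}

var url = buildViewUrl(
  "http://log4p.com:5984/articles/_design/split/_view/withoutStopWords",
  "groovy"
);
console.log(url);
```

Passing a third argument, e.g. `buildViewUrl(viewUrl, "groovy", 10)`, appends `limit=10`.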

I modified my Wordpress templates to use this view now and it seems to yield better results.

Note
One thing I noticed while writing the mapping function is that altering JavaScript's Array prototype (i.e. I wanted to add my contains and count methods to it) seems to result in unpredictable problems. Still investigating.

Update
I probably made a mistake with the prototype extensions; I refactored it back and it works now. The code above has been updated.
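
For anyone who would rather not touch the Array prototype at all, the same counting can be done with a plain object as a counter. This is a minimal sketch, not the code running on this blog: `termWeights` is a hypothetical name and `STOPWORDS` is a tiny stand-in for the full list above.

```javascript
// Hypothetical stand-in for the full stopword list used in the map function.
var STOPWORDS = { the: true, and: true, "with": true, "for": true };

function has(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns an object mapping each non-stopword longer than two characters
// to the number of times it occurs in the given HTML fragment.
function termWeights(html) {
  var text = html
    .replace(/<[^>]+>/g, " ")     // strip tags
    .replace(/[^a-zA-Z]+/g, " ")  // keep letters only
    .toLowerCase();
  var words = text.split(/\s+/);
  var weights = {};
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    if (word.length > 2 && !has(STOPWORDS, word)) {
      weights[word] = (has(weights, word) ? weights[word] : 0) + 1;
    }
  }
  return weights;
}

var weights = termWeights("<p>Groovy and Grails: groovy with the <b>Groovy</b> console</p>");
console.log(weights);
```

Inside a map function one would then loop over the keys of the returned object and emit `[word, weight]` for every weight above 1. Each word is counted in a single pass, instead of rescanning the whole word list for every distinct word.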


Wordpress, Couchdb and Ruby

I did a small test to see how complex it would be to put a Wordpress database into CouchDB. This might not seem very useful, and as long as CouchDB only receives data from Wordpress it isn't. The idea is that, in the future, other applications would publish content to the same database as well.

To get my posts into CouchDB I wrote the following Ruby script (disclaimer: this is a quick and dirty hack, don't use it in a production environment):

  require 'rubygems'
  require 'mysql'
  require 'json'
  require 'couchdb.rb'

  # replace the three placeholders with your own credentials
  database = Mysql.real_connect("localhost", "===database user===", "===database pass===", "===database name===")

  # utility function for storing articles
  def store_article(couchdb_server, id, article)
    begin
      # an existing document can only be overwritten when its current _rev is supplied
      existing = couchdb_server.get("/articles/#{id}")
      if existing.code == '200'
        article["_rev"] = JSON.parse(existing.body)["_rev"]
      end
    rescue
      # ignore for now...
    end

    couchdb_server.put("/articles/#{id}", article.to_json)
  end


  puts "connected to #{database}"

  # query will return cartesian product, num_category*blogposts
  # grouping will be done afterwards. The query will only return published blogposts.
  res = database.query("select
                p.id as id,
                p.post_title,
                p.post_content,
                t.name
            from
              wp_posts p
              join wp_term_relationships tr on tr.object_id = p.id
              join wp_term_taxonomy wtt on wtt.term_taxonomy_id = tr.term_taxonomy_id
              join wp_terms t on t.term_id = wtt.term_id
            where
              post_type = 'post'
              and post_status = 'publish'
          ")


  # Convert the results to the internal datastructure
  data = Hash.new()
  while row = res.fetch_row do
    post_id = row[0].to_i
    post = data[post_id] ? data[post_id] : {:terms => []}

    post[:title] = row[1]
    post[:body] = row[2]
    post[:terms] << row[3]

    data[post_id] = post
  end
  # one row per post/term combination, so there are fewer posts than rows
  puts "#{res.num_rows} rows queried (#{data.size} posts), posting to couchdb"
  res.free

  # setup the couchdb class and post all articles
  couchdb_server = Couch::Server.new("log4p.com", "5984")
  data.each do |k,v|
    store_article(couchdb_server, k,v)
  end

As you can see, the bulk of the code is in the data retrieval SQL and conversion. The Couch module was taken from the CouchDB wiki and provides some really basic wrappers for the CouchDB REST interface.
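
The conversion step is the part that is easiest to get wrong: the join returns one row per post/term combination, and those rows have to be folded into one document per post. Here is that step on its own, sketched in JavaScript to match the view code. `groupByPost` is a hypothetical name, and the second post in the sample rows is made up for illustration.

```javascript
// Rows as returned by the SQL join: [post id, title, body, term name].
var rows = [
  ["454", "ADP1", ".....", "gadgets"],
  ["454", "ADP1", ".....", "android"],
  ["454", "ADP1", ".....", "g1"],
  ["600", "Another post", ".....", "java"] // made-up sample row
];

// Folds the rows into one object per post id, collecting all terms.
function groupByPost(rows) {
  var posts = {};
  rows.forEach(function (row) {
    var id = parseInt(row[0], 10);
    if (!Object.prototype.hasOwnProperty.call(posts, id)) {
      posts[id] = { title: row[1], body: row[2], terms: [] };
    }
    posts[id].terms.push(row[3]);
  });
  return posts;
}

var posts = groupByPost(rows);
console.log(JSON.stringify(posts[454].terms));
```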

After executing the script above, all blog posts are stored in CouchDB in JSON format:

  {
     "_id": "454",
     "_rev": "1-888454205",
     "terms": [
         "gadgets",
         "android",
         "g1"
     ],
     "body": ".....",
     "title": "ADP1"
  }

One thing I wanted to do was to create a simple API to retrieve articles based on their category. To do this I created this simple view in CouchDB:

  function(doc) {
    // a plain loop instead of the SpiderMonkey-only "for each ... in"
    for (var i = 0; i < doc.terms.length; i++) {
      emit([doc.terms[i], parseInt(doc._id, 10)], {title: doc.title});
    }
  }

This emits one row per term for every post, which makes it possible to query like this:

http://log4p.com:5984/articles/_design/list/_view/category?startkey=["java",{}]&endkey=["java",0]&descending=true&limit=10

Ouch, that's a lot of parameters! Here's what they do:

  • startkey=["java",{}] - the highest key which may be returned; {} sorts after every number, so it works like a numerical infinity ;)
  • endkey=["java",0] - the lowest key to return
  • descending=true - order direction
  • limit=10 - max number of results to return

This should return posts like this:

  {"total_rows":403,"offset":151,"rows":[
  {"id":"600","key":["java",600],"value":{"title":"Binding mmbase nodes to strongly typed object graphs"}},
  {"id":"596","key":["java",596],"value":{"title":"Oracle buys Sun..."}},
  {"id":"577","key":["java",577],"value":{"title":"Composited objects with shared id's in  Hibernate"}},
  {"id":"555","key":["java",555],"value":{"title":"CouchDB meetup in Amsterdam"}},
  {"id":"467","key":["java",467],"value":{"title":"Ioke @ Amsterdam.rb"}},
  {"id":"428","key":["java",428],"value":{"title":"I want closures \"bolted on to Java\""}},
  {"id":"424","key":["java",424],"value":{"title":"Review: \"Clean Code: A handbook of agile software craftmanship\""}},
  {"id":"397","key":["java",397],"value":{"title":"JavaOne 2008 - Summary & Reflection"}},
  {"id":"381","key":["java",381],"value":{"title":"Closures and the return of the return"}},
  {"id":"359","key":["java",359],"value":{"title":"CPD with maven2 and PMD"}}
  ]}
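
Why does {} work as an upper bound? CouchDB orders view keys by type first (null, false, true, numbers, strings, arrays, objects), so an object always sorts after a number. The sketch below imitates that ordering on an in-memory list. It is a simplification with hypothetical function names: strings are compared by plain code-unit order rather than CouchDB's ICU collation, all objects are treated as equal, and the sample keys are only illustrative.

```javascript
// Rank of each JSON type in the view key ordering:
// null < false < true < numbers < strings < arrays < objects.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6; // plain object
}

// Simplified comparator: negative if a sorts before b, positive if after.
function collate(a, b) {
  var ra = typeRank(a), rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  if (ra === 4) return a < b ? -1 : (a > b ? 1 : 0);
  if (ra === 5) {
    // arrays compare element by element; a shorter prefix sorts first
    for (var i = 0; i < Math.min(a.length, b.length); i++) {
      var c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  return 0;
}

// Emulates startkey/endkey with descending=true on an in-memory key list.
function rangeDescending(keys, startkey, endkey, limit) {
  return keys
    .filter(function (k) {
      return collate(k, startkey) <= 0 && collate(k, endkey) >= 0;
    })
    .sort(function (a, b) { return collate(b, a); })
    .slice(0, limit);
}

var keys = [["groovy", 362], ["java", 381], ["java", 600], ["java", 596], ["ruby", 380]];
var result = rangeDescending(keys, ["java", {}], ["java", 0], 10);
console.log(JSON.stringify(result)); // [["java",600],["java",596],["java",381]]
```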

Just to test the API I wrote the following code (see it in action underneath the 'full' post view) and added it to the single_post view of my blog:

  <?php
  require_once("class_couchdb.php");
  $couchdb = new CouchDB('articles', '79.170.94.41', 5984);
  ?>

  <?php foreach (get_the_category() as $category) {
    try {
        $result = $couchdb->send('_design/list/_view/category?limit=10&startkey=["' . $category->name . '",{}]&endkey=["' . $category->name . '",0]&descending=true');
        // here we get the decoded json from the response
        $all_docs = $result->getBody(true);

        foreach ($all_docs->rows as $r => $row) { ?>
          <li><a href="http://log4p.com?p=<?= $row->id ?>"><?= htmlspecialchars($row->value->title) ?></a></li>
        <?php }
    } catch (CouchDBException $e) {
      // errors are silently ignored: the list is simply left out
    }
  } ?>
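
The rendering step does not depend on PHP; any client that can read the JSON response can produce the same list. A minimal sketch in JavaScript, operating on an already decoded view response: `renderRelated` and `escapeHtml` are hypothetical helper names, and the single row is copied from the result shown earlier.

```javascript
// Escapes the characters that are significant in HTML text and attributes.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turns the rows of a decoded view response into <li> elements.
function renderRelated(response) {
  return response.rows.map(function (row) {
    return '<li><a href="http://log4p.com?p=' + encodeURIComponent(row.id) + '">' +
      escapeHtml(row.value.title) + "</a></li>";
  }).join("\n");
}

var response = { rows: [
  { id: "397", key: ["java", 397], value: { title: "JavaOne 2008 - Summary & Reflection" } }
] };
console.log(renderRelated(response));
```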

As I said before, this is an experiment and I wouldn't use this specific setup in its current form: there is no validation of data, deleting posts is not implemented, etc.


Binding mmbase nodes to strongly typed object graphs

In past years I've spent quite some time converting MMBase node graphs to strongly typed object graphs. One of the reasons for doing this is to define 'meta' models on top of the cloud; such a model answers questions like 'What is a newsitem?' (i.e. which rules need to be applied to get all the needed data from MMBase).

Due to recent developments within the VPRO I decided to have another go at it, and I came up with a working prototype of something which I think might be useful to others, or on which others might be able to provide valuable feedback!

The small framework I created is annotation based; one specifies the bindings to MMBase using annotations:

  // --------- NewsItem.java
  @Entity(builder = "news", root = true)
  public class NewsItem {
    private Long number;
    private String title;
    private String subtitle;
    private String credits;

    @Field(nodeField = "intro")
    private String description;
    private String body;

    @Embedded(builder = "mmevents", field = "start", convertor = EpochDateConvertor.class)
    private Date created;

    @PosRel(orderDirection = Direction.DESC, queryDirection = QueryDirection.BOTH)
    private List<Image> image;

    @Rel(orderDirection = Direction.DESC, orderField = "value", queryDirection = QueryDirection.DESTINATION)
    private List<Tag> tag;
  }

  // --------- Image.java
  @Entity(builder = "images")
  public class Image {
    private Long number;
    private String title;
  }

  // --------- Tag.java
  @Entity(builder = "tags")
  public class Tag {
    private Long number;
    private String value;
  }

The implementation is still in concept phase, but as you can see it is already possible to define mappings for:

  • associations (works for typed collections only)
  • fields (populated by default; use the @Field annotation to override properties)
  • embedded values (one-to-one associations which are treated as embedded objects)

Note: at the moment I'm only considering read operations.
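
To illustrate only the field rule (populate by default, override where a different node field is named), here is a rough sketch in JavaScript rather than Java. It is not the populator itself: the mapping object stands in for the annotations, a plain object stands in for an MMBase node, and `populate`, `newsItemMapping` and the sample values are all hypothetical.

```javascript
// Hypothetical mapping, loosely mirroring the NewsItem fields above:
// a property without an override reads the node field of the same name.
var newsItemMapping = {
  properties: ["number", "title", "subtitle", "credits", "description", "body"],
  overrides: { description: { nodeField: "intro" } }
};

// Copies node fields onto a new object according to the mapping.
function populate(node, mapping) {
  var target = {};
  mapping.properties.forEach(function (prop) {
    var override = mapping.overrides[prop];
    var field = override ? override.nodeField : prop;
    target[prop] = node[field];
  });
  return target;
}

// Made-up node data; a real MMBase node is not a plain object.
var node = { number: 1, title: "Hello", intro: "Short intro", body: "Text" };
var item = populate(node, newsItemMapping);
console.log(item.description);
```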

Entity definitions are automatically retrieved at startup of a simple MMBase module (using Spring's ClassPathBeanDefinitionScanner), after which binding can be done as follows:

  NewsItem item = (NewsItem) populator.unmarshallNode(newsItemNode, "news");

There is still a lot of ground to cover, but the basics work, and the populator class is still less than 200 lines of code! No public source code yet, but I'd be more than happy to contribute it or make it available in the near future if others are interested.

Looking forward to ideas, criticism, etc.
