log4p

Peter Maas’s Weblog

Unconventional architectures that might scale (I)

I'm experimenting with some unconventional architectural ideas (at least from a Java developer's point of view) while designing a new architecture for Dutch broadcaster VPRO. In this series of blog posts I'll try to explain some of the concepts; please feel free to point out issues or caveats I failed to notice!

Some aims of the architecture

  • Reliable
  • Performant & Scalable (VPRO is running high traffic sites like 3voor12 and cinema.nl)
  • Language agnostic frontend (and preferably backend)
  • Web Oriented
  • Maintainable
  • Makes it possible to create new concepts and deliver existing content to as yet unknown platforms

Conceptually I would like to solve this by creating 'services' (no, not talking about 'classical' SOA here) around specific domains. For example, everything related to writing articles, such as workflow, would live in the 'articles' service. A service has its own master repository for the data (relational database, file system etc.) it operates on.

On top of that a REST/HTTP based layer would be created to provide the necessary API on which websites, widgets or mobile applications can be built. Ideally the service layers would 'push' (using JMS or something like Kestrel) content to this layer in an open format like XML, JSON or YAML. Schematically it might look like this:

[Figure: arch2_concept]
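
To make the 'push' idea a bit more concrete, here is a minimal sketch of what a service could publish to the REST layer. The envelope fields (`service`, `type`, `id`, `publishedAt`, `content`) and the function name `toPushMessage` are assumptions made for illustration only; the actual transport (JMS, Kestrel) is left out.

```javascript
// Hypothetical envelope for content pushed from a domain service to the REST layer.
// All field names are assumptions made for this sketch.
function toPushMessage(service, type, entity, now) {
  return JSON.stringify({
    service: service,               // originating domain service, e.g. "articles"
    type: type,                     // kind of entity, e.g. "Article"
    id: type + ":" + entity.id,     // "<Type>:<id>" identifier
    publishedAt: (now || new Date()).toISOString(),
    content: entity                 // the entity itself, in an open format (JSON here)
  });
}

var message = toPushMessage("articles", "Article", { id: 1, title: "Title 1" },
                            new Date(Date.UTC(2009, 2, 24)));
console.log(message);
```

Because the message is plain JSON, any consumer (PHP, JavaScript, Ruby) can read it without knowing anything about the service that produced it.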

Techniques used to implement the views on top of the REST layer can differ between projects and implementations, but we would like to get rid of the heavyweight Java Servlet containers (and yes, I call Tomcat heavyweight) we currently use in this layer.

Proof-of-Concept implementation based on CouchDB

One possible solution would be to use a fast document database as a queryable cache. Document databases are not as mature as relational databases, but their popularity is rising quickly. Unlike relational databases, document databases do not store data in tables with uniformly sized fields for each record. Instead, each record is stored as a document that has certain characteristics. Any number of fields of any length can be added to a document, and fields can contain multiple pieces of data. Amazon has its SimpleDB, Ruby folks seem to like MongoDB, and rumors about the rise of Voldemort are spreading throughout the web. I like CouchDB, mainly because it's built on a solid functional language (Erlang).
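
As a small illustration of the "any number of fields" point, here are two documents that could live side by side in the same document database. The `tags` field is made up for this sketch; the movie fields mirror the ones used later in this post.

```javascript
// Two documents in the same database; neither has to follow a fixed schema.
var article = {
  _id: "Article:1",
  title: "Title 1",
  tags: ["music", "festival"]        // one field holding multiple values (hypothetical field)
};
var movie = {
  _id: "movie:342339",
  title: "Cadaveri eccellenti",
  appreciation: 10,
  genres: { genre: "Thriller" }      // nested data inside a single field
};

// List the field names each document happens to have.
function fieldNames(doc) {
  return Object.keys(doc).sort();
}

console.log(fieldNames(article));
console.log(fieldNames(movie));
```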

CouchDB offers a JSON-based document model. Documents are manipulated over HTTP using REST. CouchDB also supports replication between instances. Custom views of the documents in the database (also accessed via HTTP) are written in JavaScript (where JSON is a native data structure).
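
A rough sketch of what "manipulated using REST" means in practice: each operation on a document maps onto an HTTP method on `/<database>/<document id>`. The helper names below are mine, and the functions only describe the requests rather than send them; wrapping the payload in a `content` field is simply the convention used in the PoC.

```javascript
// Describe (but do not send) the HTTP requests CouchDB expects for a document.
function getDoc(db, id) {
  return { method: "GET", path: "/" + db + "/" + encodeURIComponent(id) };
}

function putDoc(db, id, content, rev) {
  var body = { content: content };
  if (rev) {
    body._rev = rev; // updating an existing document requires its current revision
  }
  return {
    method: "PUT",
    path: "/" + db + "/" + encodeURIComponent(id),
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body)
  };
}

function deleteDoc(db, id, rev) {
  return {
    method: "DELETE",
    path: "/" + db + "/" + encodeURIComponent(id) + "?rev=" + encodeURIComponent(rev)
  };
}

console.log(putDoc("news", "Article:1", { title: "Title 1" }));
```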

The proof-of-concept (PoC) setup I created to experiment with this scenario looks like this:

[Figure: couch_poc1]

I added a simple service to Groovy which facilitates writing/updating entities to CouchDB via the REST interface (it would probably be better to use Hibernate's interceptors to do this, but that doesn't change the PoC):

    import groovyx.net.http.RESTClient
    import net.sf.json.groovy.JsonSlurper
    import net.sf.json.JSONObject

    class CouchDBAccessorService {

      boolean transactional = true
      def restClient = new RESTClient("http://localhost:5984")

      // Write (or update) a domain object as a document in the 'news' database.
      def sendToDatabase(obj) {
        // Document ids look like "Article:1"
        def docId = "${obj.class.simpleName}:${obj.id}".toString()

        def json = new JSONObject()
        json.put('_id', docId)
        json.put('content', obj)
        if (obj.extVersion) { // we have written to CouchDB before, use revision number
          json.put('_rev', obj.extVersion)
        }

        def resp = restClient.put(path: "/news/${docId}", body: json.toString(0), requestContentType: "application/json")
        storeNewRevisionNumber(obj, resp)
      }

      // Remember the revision CouchDB returned so the next update can send it as _rev.
      def storeNewRevisionNumber(obj, resp) {
        obj.extVersion = new JsonSlurper().parse(resp.data).get("rev")
        obj.save()
      }
    }
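
The revision bookkeeping in the service matters because CouchDB answers a successful PUT with a small JSON body containing the new revision, and rejects a write that carries a stale `_rev` with a 409 Conflict. A minimal sketch of that decision, with a function name of my own choosing:

```javascript
// Decide what to do with CouchDB's answer to a document PUT.
// Returns the revision to remember, or throws when the write did not succeed.
function revisionFromPutResponse(status, bodyText) {
  var body = JSON.parse(bodyText);
  if (status === 201 || status === 202) {
    return body.rev; // store this (extVersion in the PoC) for the next update
  }
  if (status === 409) {
    // Our _rev was stale: the document was updated in the meantime.
    throw new Error("conflict: refetch the document and retry");
  }
  throw new Error("unexpected status " + status + ": " + body.error);
}

console.log(revisionFromPutResponse(201, '{"ok":true,"id":"Article:1","rev":"2-abc"}'));
```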

In CouchDB I created a really simple view to list articles. The map function of the view looks like this (no reduce function at the moment):

    function(doc) {
      emit(null, doc.content.title);
    }

This view will return JSON like this:

    {"total_rows":3,"offset":0,"rows":[
    {"id":"Article:1","key":null,"value":"Title 1"},
    {"id":"Article:2","key":null,"value":"Title 2"},
    {"id":"Article:3","key":null,"value":"Title 3"}
    ]}
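
To see how a map function leads to rows like these, here is a simplified stand-in for the view machinery. It is an illustration only, not how CouchDB is actually implemented: `runView` is a made-up helper, and `emit` is passed in as an argument, whereas CouchDB makes it available as a global.

```javascript
// Simplified stand-in for how a map function turns documents into view rows.
function runView(docs, mapFn) {
  var rows = [];
  docs.forEach(function (doc) {
    mapFn(doc, function emit(key, value) {
      rows.push({ id: doc._id, key: key, value: value });
    });
  });
  // All keys are null in this example, so ordering by document id is enough here;
  // a real view orders rows by key first.
  rows.sort(function (a, b) { return a.id < b.id ? -1 : a.id > b.id ? 1 : 0; });
  return { total_rows: rows.length, offset: 0, rows: rows };
}

var docs = [
  { _id: "Article:2", content: { title: "Title 2" } },
  { _id: "Article:1", content: { title: "Title 1" } }
];
var result = runView(docs, function (doc, emit) {
  emit(null, doc.content.title);
});
console.log(JSON.stringify(result));
```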

CouchDB views can be paged, sorted and filtered via specific parameters. Mind you, this is not a showcase for what CouchDB views can do... if you're interested in that feel free to browse the map/reduce snippets.
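
For example, paging and filtering come down to query parameters such as `limit`, `skip`, `descending`, `startkey` and `endkey`. A small sketch of a URL builder (the helper name `viewUrl` is an assumption of mine); note that key parameters have to be JSON-encoded, so a string key is sent including its quotes.

```javascript
// Build a view URL from query options.
function viewUrl(db, design, view, options) {
  var jsonParams = { key: true, startkey: true, endkey: true }; // values sent as JSON
  var query = Object.keys(options || {}).map(function (name) {
    var value = jsonParams[name] ? JSON.stringify(options[name]) : String(options[name]);
    return encodeURIComponent(name) + "=" + encodeURIComponent(value);
  });
  var path = "/" + db + "/_design/" + design + "/_view/" + view;
  return query.length ? path + "?" + query.join("&") : path;
}

// Third page of ten rows, highest keys first.
console.log(viewUrl("news", "list", "minimal", { limit: 10, skip: 20, descending: true }));
```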

A PHP script to render the above view might look like this:

    <html>
    <head>
        <script type="text/javascript" src="jquery-1.3.2.js"></script>
        <script type="text/javascript" src="news.js"></script>
    </head>
    <body>
    <?php
    require_once("class_couchdb.php");
    // create a CouchDB object that uses the 'news' database
    $couchdb = new CouchDB('news');
    try {
        $result = $couchdb->send('/_design/list/_view/minimal');
    } catch(CouchDBException $e) {
        die($e->errorMessage()."\n");
    }
    // decode the JSON body of the response
    $all_docs = $result->getBody(true);
    ?>
    <h1>News List</h1>
    <ul>
    <?php foreach($all_docs->rows as $row) { ?>
        <li><a href="article?id=<?= htmlspecialchars($row->id) ?>"><?= htmlspecialchars($row->value) ?></a></li>
    <?php } ?>
    </ul>
    </body>
    </html>

The CouchDB class was taken from the CouchDB wiki. The script above displays a list of articles. We use a little bit of jQuery to fetch each article's synopsis via Ajax and display it:

    $(function () {
        // Replace the default click behavior of every article link in the list.
        $("ul > li > a").bind("click", function () {
            var link = $(this);
            // The href ends in the numeric article id, e.g. "article?id=Article:1".
            var id = link.attr("href").match(/(\d+)$/)[1];

            // Create the synopsis container once, directly after the link.
            if (link.parent().find("> div").length == 0) {
                link.after('<div id="desc' + id + '"></div>');
            }

            $.ajax({
                url: "/couchdb/news/Article:" + id,
                dataType: "json",
                success: function (data) {
                    // Collapse the synopsis that is currently open, if any.
                    $("div.active").slideUp(function () {
                        $(this).removeClass("active");
                    });

                    // Fill and reveal the synopsis of the clicked article.
                    $("#desc" + id).css("display", "none")
                        .html(data.content.synopsis).slideDown(120).addClass("active");
                }
            });

            return false; // suppress normal navigation
        });
    });

This piece of JavaScript leverages jQuery's selectors to find all hyperlinks in the list and extracts each article's identifier from the link target. The default behavior of the anchor is replaced by retrieving data from CouchDB via Ajax and displaying it just underneath the link.

The setup mentioned above uses a mod_proxy configuration to access CouchDB. The CouchDB administrative interfaces are not proxied.

One step back

Ouch... stop... that's a lot of code! Well.. actually for what we did it really isn't. And look at what we get back from it:

  • Separation of concerns
  • Really loosely coupled systems
  • A very performant and scalable system (more on that later)
  • A language agnostic stack
  • A single API used by all views (in this case PHP and JS)
  • Maintaining CouchDB itself is a bit tricky, but the REST connections are easy to maintain: they're plain HTTP, so traditional ACLs, caching proxies and routing (e.g. all POSTs to a single master, load-balanced GET requests) can be used
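
The routing mentioned in the last point could be as simple as the following sketch: writes go to a single master, reads are spread round-robin over the other nodes. The host names are placeholders and `createRouter` is a hypothetical helper; in practice this logic would more likely live in a proxy or load balancer configuration than in application code.

```javascript
// Route writes to a single master and spread reads over replicas (round-robin).
function createRouter(master, replicas) {
  var next = 0;
  return function route(method) {
    var isRead = method === "GET" || method === "HEAD";
    if (!isRead || replicas.length === 0) {
      return master; // all writes (and reads when there are no replicas) hit the master
    }
    var target = replicas[next];
    next = (next + 1) % replicas.length;
    return target;
  };
}

// Placeholder host names, for illustration only.
var route = createRouter("http://master:5984", ["http://node1:5984", "http://node2:5984"]);
console.log(route("GET"), route("POST"), route("GET"));
```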

Performance & Scaling

I benchmarked the above on my laptop using the default OS X Apache with mod_php enabled: no tweaking, no caching. Running ApacheBench with a concurrency of 128, I managed to get about 65 requests/sec out of my laptop. In this case PHP/Apache is the bottleneck; if I run the benchmark against the REST interface directly I get about 500 requests/sec.

The above numbers were obtained from a CouchDB instance with almost no data in it. Let's fix that. The following Ruby script retrieves 15000 movie documents from the internal REST API of cinema.nl and stores them in CouchDB:

    require 'rubygems'
    require 'open-uri'
    require 'json'
    require 'couchdb.rb'

    server = Couch::Server.new("145.58.169.174", "5984")

    # Fetch the list of all movie ids, then copy the first 15000 movies into CouchDB.
    all_movies = JSON.load(open("http://www.cinema.nl/api/1/rest/movie.json"))
    all_movies['idList']['ids']['id'][0...15000].each do |id|
      movie_data = JSON.load(open("http://www.cinema.nl/api/1/rest/movie/#{id}.json"))
      server.put("/movies/movie:#{id}", movie_data.to_json)
    end

I wrote a more elaborate view:

    function(doc) {
      if(doc.movie.genres.genre == 'Thriller'){
        emit(doc.movie.appreciation, {title: doc.movie.title, appreciation: doc.movie.appreciation});
      }
    }

This creates a view that can be used to query 'popular' movies:

"GET /couchdb/movies/_design/by_genre/_view/thriller?limit=10&descending=true" returns:

    {"total_rows":247,"offset":0,"rows":[
    {"id":"movie:342339","key":10,"value":{"title":"Cadaveri eccellenti","appreciation":10}},
    {"id":"movie:331195","key":9,"value":{"title":"Blind Terror","appreciation":9}},
    {"id":"movie:269882","key":9,"value":{"title":"Apprentice to Murder","appreciation":9}},
    {"id":"movie:350847","key":8,"value":{"title":"Delusion","appreciation":8}},
    {"id":"movie:350143","key":8,"value":{"title":"D.O.A.","appreciation":8}},
    {"id":"movie:349981","key":8,"value":{"title":"Deadly Strangers","appreciation":8}},
    {"id":"movie:349973","key":8,"value":{"title":"Dark City","appreciation":8}},
    {"id":"movie:349835","key":8,"value":{"title":"Defence of the Realm","appreciation":8}},
    {"id":"movie:349808","key":8,"value":{"title":"A Deadly Puzzle","appreciation":8}},
    {"id":"movie:349479","key":8,"value":{"title":"Diagnosis: Murder","appreciation":8}}
    ]}

I benchmarked this view with 5000, 10000 and 15000 documents and got about 360 requests/second out of it in all cases. Not bad for something that can easily be cached.
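
To illustrate the caching remark: since a view is just a URL, its response can be kept for a while and reused. Below is a minimal in-memory sketch with a time-to-live. `createCache` and its parameters are assumptions of mine, and the fetch function is synchronous to keep the example short; a real setup would more likely put an HTTP caching proxy in front of CouchDB.

```javascript
// Tiny in-memory cache with a time-to-live, keyed by request URL.
function createCache(ttlMs, fetchFn, nowFn) {
  var entries = {};
  var now = nowFn || Date.now;
  return function get(url) {
    var entry = entries[url];
    if (entry && now() - entry.storedAt < ttlMs) {
      return entry.value;            // still fresh: no request to the backend
    }
    var value = fetchFn(url);        // missing or stale: ask the backend
    entries[url] = { value: value, storedAt: now() };
    return value;
  };
}

// Usage with a fake backend that counts how often it is called.
var backendCalls = 0;
var cachedView = createCache(60000, function (url) {
  backendCalls += 1;
  return "response for " + url;
});
cachedView("/couchdb/movies/_design/by_genre/_view/thriller?limit=10&descending=true");
cachedView("/couchdb/movies/_design/by_genre/_view/thriller?limit=10&descending=true");
console.log(backendCalls);
```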

Horizontal scaling using replication

Where relational databases are expensive to replicate because of the complexity of keeping track of changes, document-oriented databases are somewhat better suited to it. After an initial sync, CouchDB only replicates changes. It would be fairly easy to give each node its own CouchDB instance and replicate the necessary changes from a central master database, thus creating (almost) autonomous nodes:

[Figure: couch_arch_scale3]

This would result in a very scalable architecture. Configuring and monitoring it could, however, prove to be a menace.
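
Replication itself is triggered over HTTP as well, by POSTing a source and a target to `/_replicate`. The sketch below only builds that request for one node; the URLs are placeholders, `replicationRequest` is a made-up helper, and the `continuous` flag assumes a CouchDB version that supports continuous replication.

```javascript
// Describe the request that asks a node to pull changes from the master.
function replicationRequest(masterUrl, nodeUrl, db, continuous) {
  return {
    method: "POST",
    url: nodeUrl + "/_replicate",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      source: masterUrl + "/" + db,  // central master database
      target: nodeUrl + "/" + db,    // the node's own copy
      continuous: !!continuous       // keep replicating as new changes arrive
    })
  };
}

// Placeholder host names, for illustration only.
console.log(replicationRequest("http://master:5984", "http://node1:5984", "movies", true));
```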

Caveats

Obviously this architecture has a couple of (potential) downsides:

  • The document format is not validated by the database
  • Synchronization of backend services with CouchDB might prove tricky
  • Maintaining the CouchDB setup in general
  • The depicted architecture is mostly focused on 'read' operations; writing (POST) from frontends will probably involve some additional plumbing
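
One way to soften the first caveat would be to check documents in the backend service before they are written. A minimal sketch, assuming the article documents from the PoC; the rules themselves (a required title, an optional string synopsis) are assumptions of mine and not an actual VPRO schema.

```javascript
// Application-side check run before a document is written to CouchDB.
// Returns a list of problems; an empty list means the document looks fine.
function validateArticle(doc) {
  var errors = [];
  if (typeof doc._id !== "string" || doc._id.indexOf("Article:") !== 0) {
    errors.push("_id must start with 'Article:'");
  }
  if (!doc.content || typeof doc.content.title !== "string" || doc.content.title === "") {
    errors.push("content.title is required");
  }
  if (doc.content && doc.content.synopsis !== undefined &&
      typeof doc.content.synopsis !== "string") {
    errors.push("content.synopsis must be a string");
  }
  return errors;
}

console.log(validateArticle({ _id: "Article:1", content: { title: "Title 1" } }));
console.log(validateArticle({ _id: "movie:1", content: {} }));
```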

And now...

I'm really interested to see what other people think of this concept. Any form of constructive criticism would really be appreciated!

8 Comments so far

  1. Chad March 24th, 2009 3:20 am

    Nice work - The architecture approach seems valid. Like you say, read optimized, but with some additions could scale and perform well. I've been looking for thoughts on grails fronting a couchdb repository for some of these same reasons. And your groovy service and this comment was of interest to me.

    >> I added a simple service to Groovy which facilitates writing/updating entities to CouchDB via the REST interface (It would probably be better to use Hibernates' interceptors to do this, doens't change the PoC though)

    Do you mean it would be ideal to work through the hibernate .save, .delete operations with some type of couchdb directive on the grails domain objects that are to be saved to the couchdb?

    thanks,
    chad.

  2. peter March 24th, 2009 7:09 am

    Well 'ideal' is a big word, but yes I think it would be worthwhile to work through those operations. Lucene based frameworks like Compass keep their Lucene indexes up-to-date in that fashion. But you will probably need some sort of mapping strategy to decouple changes to your model from changes to the document model in CouchDB.

  3. Nils Breunese March 24th, 2009 9:51 am

    I hear Varnish is a great HTTP accelerator to put on top of CouchDB: http://varnish.projects.linpro.no/

    Also of interest might be CouchDB’s show and list functions, which convert documents and views into non-JSON formats, if you’d like that: http://wiki.apache.org/couchdb/Formatting_with_Show_and_List Looks pretty awesome.

  4. Giorgio Sironi March 26th, 2009 11:32 pm

    Maybe ignoring the writing is oversimplifying, I was wondering 'How he will do the master part?' while reading...

  5. peter March 26th, 2009 11:46 pm

    @Giorgio I'm not sure what you are actually asking... which master part, and what reading?

  6. Remco March 28th, 2009 1:00 pm

    The Book Of JOSH ??

  7. [...] databases are all the rage again. Several prototypes for the architecture I am currently designing are also based on the use of such a database. A good moment to [...]

  8. Wordpress & Couchdb and Ruby « log4p May 28th, 2009 1:15 pm

    [...] This might not seem very useful, and when only receiving data from Wordpress it isn't. In the future other applications would also publish content to the same [...]
