Recently we had to look for a solution to the ever-growing problem of searching huge amounts of data.
Requirements
Initially the dataset will contain about 1,000,000 records, linked to each other in complex ways through a lot of ternary relations.
I had used Lucene for search before, but was never really satisfied with the robustness of a ‘raw’ Lucene installation. In complex situations (think heavy load and clustered environments) it is immensely hard to configure correctly. There is also the matter of keeping the indexes synchronized with the data they are based upon.
While searching for a solution we were bound by the following requirements:
- Integration with Hibernate
- Support for running in a clustered environment
- Good performance
- Open Source
- Preferably avoid the use of specific value objects
- Provide ‘fuzzy’ search functionality
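To make the ‘fuzzy’ requirement a bit more concrete: Lucene's fuzzy queries are based on Levenshtein (edit) distance, so a term matches when it is only a few single-character edits away from the query. The sketch below is plain Java and is not Compass or Lucene code; the class `FuzzyMatch` and the helper `fuzzyMatches` are names made up for this illustration.

```java
// Illustrative sketch only: plain Java, not Compass or Lucene code.
final class FuzzyMatch {

    // Dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions or substitutions turning a into b.
    static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Hypothetical helper: a term matches when it is within maxEdits of the query.
    static boolean fuzzyMatches(String query, String term, int maxEdits) {
        return editDistance(query.toLowerCase(), term.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("lucene", "lucine"));         // prints 1
        System.out.println(fuzzyMatches("Hibernate", "hibernat", 1)); // prints true
        System.out.println(fuzzyMatches("compass", "campus", 1));     // prints false
    }
}
```

A real search engine does not compare the query against every term like this; the sketch only shows what "close enough" means for a single pair of strings.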
Selection
We tested some frameworks (Hibernate’s Lucene integration, Spring’s Lucene integration) and a database solution (tsearch2). The Lucene frameworks seemed to lack maturity and should be considered experimental. TSearch2 proved to be a bit too slow for the type of searches we wanted to execute. This is when we found out about Compass, a Lucene-based framework which is part of the OpenSymphony project.
Compass is a first-class open source Java Search Engine Framework, licensed under the Apache License (V2), which enables developers to declaratively add search capabilities to an application. A really strong aspect of Compass is the way it integrates with the leading ORM frameworks and Spring. Also, Compass does not try to hide Lucene’s features: all of Lucene’s functionality is available through Compass.
Compass consists of three modules: Compass Core, Compass GPS and Compass Spring integration. Compass Core is the most fundamental part of Compass. It holds Lucene extensions for transactional indexing, search engine abstraction, an ORM-like API and transaction management integration. The Compass GPS module contains the functionality needed to integrate with all supported data sources. And, as the name probably gave away, the Spring integration module contains everything needed to configure and integrate Compass with Spring.
The GPS device for Hibernate 3 uses the Hibernate event system to mirror changes to the data into the underlying index in real time. When configured from Spring, Compass and Hibernate can use the same transaction manager (SpringSyncTransaction), which reduces discrepancies between the index and the actual data to an absolute minimum.
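The idea of event-based mirroring can be sketched without any framework: every save, update or delete of a record triggers the matching change in the index, so the two never drift apart. The code below is only a conceptual illustration in plain Java. `MirroredIndex` and its methods are hypothetical names, none of this is the Compass or Hibernate API, and transactions are ignored entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: none of these types belong to Compass or Hibernate.
final class MirroredIndex {

    // term -> ids of the records containing it (a tiny inverted index)
    private final Map<String, Set<Long>> postings = new HashMap<>();
    // id -> terms currently indexed for that record, so an update can be undone
    private final Map<Long, Set<String>> termsById = new HashMap<>();

    // Would be called from a (hypothetical) post-insert or post-update event.
    void onSaveOrUpdate(long id, String text) {
        onDelete(id); // drop stale terms first so an update leaves no leftovers
        Set<String> terms = new HashSet<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
                postings.computeIfAbsent(term, t -> new TreeSet<Long>()).add(id);
            }
        }
        termsById.put(id, terms);
    }

    // Would be called from a (hypothetical) post-delete event.
    void onDelete(long id) {
        Set<String> terms = termsById.remove(id);
        if (terms == null) {
            return; // record was never indexed
        }
        for (String term : terms) {
            Set<Long> ids = postings.get(term);
            ids.remove(id);
            if (ids.isEmpty()) {
                postings.remove(term);
            }
        }
    }

    // Returns a sorted copy of the ids whose text contains the exact term.
    Set<Long> find(String term) {
        Set<Long> ids = postings.get(term.toLowerCase());
        if (ids == null) {
            return new TreeSet<Long>();
        }
        return new TreeSet<Long>(ids);
    }
}
```

Indexing two records and then updating one of them shows the mirroring: after `onSaveOrUpdate(1, "Compass in Action")` and `onSaveOrUpdate(2, "Lucene in Action")`, `find("action")` contains both ids; after `onSaveOrUpdate(1, "Hibernate Search")` it contains only id 2.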
Examples
Preparing your objects to be indexed can be done in various ways. Since we already used Hibernate annotations, we decided to go for the annotations provided by Compass. By default Compass provides a lot of useful annotations, which can be found in the org.compass.annotations package.
The following piece of code (taken from this nice article on InfoQ) illustrates how the annotations are used:
[java]
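import java.util.Date;
import java.util.List;
import org.compass.annotations.*;

@Searchable
public class Author {
    @SearchableId
    private Long id;
    // Embedded value object: its content is indexed as part of the Author.
    @SearchableComponent
    private Name name;
    // Book is assumed to be another @Searchable class (not shown here).
    @SearchableReference
    private List<Book> books;
    @SearchableProperty(format = "yyyy-MM-dd")
    private Date birthdate;
}
// …
@Searchable
public class Name {
    @SearchableProperty
    private String firstName;
    @SearchableProperty
    private String lastName;
}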
[/java]
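For readers wondering how such declarative mapping works under the hood, here is a minimal sketch of the general technique: reflection walks the fields of an object and copies the annotated ones into a flat "document". This is an assumption-laden illustration, not how Compass is implemented; `IndexedField`, `Person` and `AnnotationMapper` are hypothetical names and deliberately not the org.compass.annotations types.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a mapping annotation; NOT org.compass.annotations.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface IndexedField {
}

// Example class that opts two of its three fields into the index.
final class Person {
    @IndexedField
    private final String firstName;
    @IndexedField
    private final String lastName;
    private final String password; // not annotated, so never indexed

    Person(String firstName, String lastName, String password) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.password = password;
    }
}

final class AnnotationMapper {
    // Collects every annotated field of the object as field name -> string value.
    static Map<String, String> toDocument(Object entity) {
        Map<String, String> document = new TreeMap<>();
        for (Field field : entity.getClass().getDeclaredFields()) {
            if (field.isAnnotationPresent(IndexedField.class)) {
                field.setAccessible(true); // allow reading private fields
                try {
                    document.put(field.getName(), String.valueOf(field.get(entity)));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot read " + field.getName(), e);
                }
            }
        }
        return document;
    }
}
```

Calling `AnnotationMapper.toDocument(new Person("Jane", "Doe", "secret"))` yields a map with only the `firstName` and `lastName` entries; the unannotated `password` field is skipped.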
When Compass is used in conjunction with Spring, searching for entities can be done using the CompassDaoSupport/CompassTemplate:
[java]
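import org.compass.core.CompassCallback;
import org.compass.core.CompassHits;
import org.compass.core.CompassSession;
import org.compass.spring.CompassDaoSupport;

public class ExampleDao extends CompassDaoSupport {
    // Returns the first hit for the query, or null when nothing matches.
    public Author findFirstMatchingAuthor(final String query) {
        return (Author) getCompassTemplate().execute(new CompassCallback() {
            public Object doInCompass(CompassSession session) {
                CompassHits hits = session.find(query);
                // Guard against an empty result before reading the first hit.
                return hits.length() > 0 ? hits.data(0) : null;
            }
        });
    }
}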
[/java]
There is much more to Compass than the short examples here are able to demonstrate; most of it can be found in the online documentation or the example applications provided in the download.
Conclusion
Although the framework has some quirks (documentation can be a bit scarce), we are very pleased with the functionality provided by the Compass framework; it really succeeds in helping a developer to use Lucene in a sensible way.
–update–
We had a lot of unexpected errors with Compass lately, which were caused by a nasty bug concerning the caching of blobs within Compass. My colleague managed to track this down and fix it… but this fix is not yet available in Compass itself.
Sounds pretty cool!
Especially the annotation stuff, and doInCompass. If it all works, it will speed up integration a lot.
But if it can manage your distributed data updates in a good way it will be a real winner!
I will check this out the next time I need a Lucene-based framework.