HAvroBase: a searchable, evolvable entity store on top of HBase and Solr

Building out a social consumer internet product that can change quickly and evolve over time puts special requirements on the underlying data store. You need to be prepared for scale without investing too much too early; your business may need to pivot in different directions, so data models can’t be set in stone; and you need to be able to search that data to enable many of the features users expect from an online social product. I had the additional requirement that all the entities in the system, regardless of where they are stored, be accessed the same way and be described by the same data description language, for consistency and maintainability. I’ve created what I think is a novel solution to these requirements in the form of HAvroBase.

The first choice you have to make against these requirements is which data definition language you are going to use. Instead of depending on the native format of the storage system, I’ve decided to use Avro. Similar to Protocol Buffers and Thrift, Avro lets you define your entities using schemas and store them efficiently in a binary format. Additionally, as long as you have the original schema available, you can load old data into a new schema. This lets you evolve your stored rows lazily and not have to completely update your storage system when a field is added or removed.
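As a concrete illustration of the resolution mechanism (this is plain Avro, not HAvroBase code):

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class SchemaEvolution {
  // Decode bytes written with the old (writer) schema into the current
  // (reader) schema. Fields added by the reader schema are filled from
  // their declared defaults; fields it no longer has are dropped.
  public static GenericRecord readWithNewSchema(
      byte[] bytes, Schema writerSchema, Schema readerSchema) throws IOException {
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    return reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
  }
}
```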

Whereas the data definition choice is basically a commodity at this point and can be somewhat arbitrary, the choice of storage technology has more trade-offs to consider. After looking at the features and communities of a variety of projects including HBase, Cassandra, Redis, Riak, MySQL, Membase, Terracotta, etc., I finally chose HBase, for a few reasons that may not be that important to you. First, I settled on a BigTable-style system based on the data model. That left HBase and Cassandra as contenders. There are also a few things I think are advantages of HBase:

  • multiple tables
  • better ordered key support
  • Zookeeper and HDFS were already in my solution
  • Hive and Hadoop support against the native data format
  • consistency
  • compare and set
  • atomic increment
  • versioning

Cassandra definitely has its own advantages that didn’t outweigh the other considerations, including:

  • performance
  • simple deployment
  • always writable

You might make different tradeoffs, or even use both solutions for different problems. I’m also somewhat biased, as I am sharing an office with Cloudera and get top-notch support at a moment’s notice.

Whether I chose HBase or Cassandra, the implementation would have been much the same, and with the HAvroBase framework you can have multiple independent storage systems. The framework does somewhat shield you from this decision, at least in your code.

When it comes to text search, you really don’t get better than Lucene in open source, and the features that Solr builds on top of Lucene make it even better. I don’t think there is a reasonable argument for using something besides Solr at this point, especially with the sharding and replication support that comes with Solr Cloud. It also has the nice benefit that, like HBase with tables, it supports multiple cores, so I can efficiently separate entities and allow them to scale independently without managing an additional system.

The last choice I made was runtime configuration. There are a few options, including Spring, Guice, and rolling my own, and I finally landed on Guice. I just like typed, programmatic configuration better than XML files, and since that is the focus of Guice, I think it is the best solution for this.

The first thing that you need to do to use the system is to define an entity in Avro:
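An entity definition in Avro’s JSON schema syntax looks something like this (a simplified User, sketched here rather than copied from the project):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "havrobase",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
```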

You can use the Avro compiler (or the Maven plugin) to convert that definition to classes (HAvroBase requires this). Then you do the typical things in code to set up your HBase connection, but within a Guice module, and set some of the configuration parameters for HAB:
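Here is a sketch of such a module; the exact binding names and configuration keys may differ from what’s in the repository:

```java
import com.google.inject.AbstractModule;
import com.google.inject.name.Names;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTablePool;

public class HABModule extends AbstractModule {
  @Override
  protected void configure() {
    // Shared pool of HBase table connections for the HAB implementation.
    bind(HTablePool.class).toInstance(new HTablePool(HBaseConfiguration.create(), 10));
    // HAB-specific: the table and column family where Avro schemas are kept
    // (binding names here are illustrative).
    bindConstant().annotatedWith(Names.named("schema.table")).to("schema");
    bindConstant().annotatedWith(Names.named("schema.family")).to("avro");
    // The Solr core used to index and search this entity type.
    bindConstant().annotatedWith(Names.named("solr.url")).to("http://localhost:8983/solr/user");
  }
}
```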

That configuration can then be used to instantiate the HAB class, which implements the connection to HBase, like this:
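Again a sketch; the exact generics and factory methods may differ from the repository:

```java
import com.google.inject.Guice;
import com.google.inject.Injector;

public class Main {
  public static void main(String[] args) {
    // Build the injector from the module above and ask it for the
    // HBase-backed implementation of the AvroBase interface.
    Injector injector = Guice.createInjector(new HABModule());
    @SuppressWarnings("unchecked")
    AvroBase<User> userHAB = injector.getInstance(HAB.class);
  }
}
```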

You’ll notice that configuration is somewhat split between things that the base system needs to know and things that the HAB needs to know. For example, which table the schema is stored in isn’t relevant to the base system, as an implementer is free to store schemas however they like. But the assumption is that every system will have the concept of a table and a column family in order to separate entities in the underlying storage system.

The Solr support is pretty elegant, I think. Rather than specifying in more than one place which fields should be indexed, at runtime it queries the Solr core’s schema.xml, which defines them, and when you put an entity in the system it automatically indexes those fields. Here is an example of using some of the APIs:
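(Sketched below; the exact method signatures may differ from the code in the repository.)

```java
import org.apache.avro.util.Utf8;
import org.apache.hadoop.hbase.util.Bytes;

public class UserExample {
  public static void run(AvroBase<User> userHAB) {
    // Create a new User from the generated Avro class.
    User user = new User();
    user.firstName = new Utf8("Sam");
    user.lastName = new Utf8("Pullara");
    user.email = new Utf8("sam@example.com");

    byte[] row = Bytes.toBytes("spullara");
    userHAB.put(row, user);              // store in HBase, index in Solr

    Row<User> byRow = userHAB.get(row);  // look it up by row key
    for (Row<User> match : userHAB.search("firstName:Sam", 0, 10)) {
      // found via the firstName field indexed in Solr
    }
    userHAB.delete(row);                 // remove from both HBase and Solr
  }
}
```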

This example creates a new User, puts it into the HAvroBase, looks it up by row, then searches for it by an indexed field, and finally deletes it. In any fully implemented AvroBase system this should work identically. You’ll notice that in the memcached implementation only put/get/delete are implemented; it doesn’t support scan or search at all. I’ve considered breaking out that functionality into separate interfaces but haven’t done that yet.

If you are unhappy with the way the default Avro compiler generates code, you can use my templated Avro compiler, which lets you change the generated code. I have done that for my project to make the generated classes a little more developer-friendly.

Things left to do:

  • implement guarantees around indexing
  • optimize schema keys (use an abbreviation table instead of a hex SHA-256 key)
  • indexing nested fields and arrays in Solr

I’m sure there are more things left undone but those are some of the more obvious issues.

  • Edwink

    Very interesting. Did you consider at any point in the process using MongoDB?

  • http://otis.myopenid.com/ Otis Gospodnetic

    Wow, this looks interesting!
    Am I understanding this correctly – this is NOT like Lucandra or HBasene, which decided to store the whole Lucene index in Cassandra and HBase instead of the default filesystem (where the indices live in various index files). This is much more like what one would get with, say, Hibernate Search, except here data lives in HBase instead of some RDBMS. Correct?

  • http://www.javarants.com spullara

    I looked at the HBasene project as part of the analysis before choosing Solr. It is too immature at this point and doesn't support enough of the power of Lucene. That said, I would love to leverage either HBase or HDFS rather than the raw file system.

  • http://www.javarants.com spullara

    I looked at it. I found it to be lacking in a few areas:

    - schemaless rather than formal evolvable schemas
    - querying & indexing isn't search and I needed search (stemming, synonyms, relevancy ranking, etc) as you find in Lucene

    The other features of MongoDB are pretty attractive.

  • ewhauser

    Good stuff. You could probably do a whole blog post about this subject itself, but what approaches are you considering for implementing guaranteed indexes?

  • outerthought

    Looks really interesting! How did you bridge HBase with SOLR? For Lily, we've built a library of HBase-backed message queues for that.

  • http://jots.mypopescu.com Alex Popescu

    Sam,

    There is something I think I'm missing. If your objects are avro-encoded (and so becoming opaque binaries) why HBase and not simply a KV?

  • http://otis.myopenid.com/ Otis Gospodnetic

    Doesn't MongoDB have some hooks to index everything that goes in it into an external Solr instance? Or maybe in external Lucene indices?

  • http://otis.myopenid.com/ Otis Gospodnetic

    Hm, but I was trying to point out/ask whether HAvroBase is actually using a different approach than HBasene altogether.

    I believe HBasene is essentially saying “instead of storing Lucene indices in the raw FS, we'll model the indices in various HBase column families, so the actual index will live in an HBase database”.

    vs. how I understand HAvroBase from your post, which is:

    “I want to store data in a DB (HBase), but I also want it searchable, so let me create a component that both puts data in HBase and also indexes it in regular, external Lucene indices (i.e. they still end up living in the raw/local FS)”.

    Maybe I misunderstood how HAvroBase uses Lucene? Does HAvroBase also model Lucene indices inside HBase and when one uses HAvroBase there are no Lucene indices in the local FS?

    Thanks.

  • http://www.javarants.com spullara

    You could use any key value store that has all the properties listed above. I chose HBase but you could choose MySQL or Memcache as long as it has the right properties. For example, I want to be able to scan by key, so that leaves out Memcache. I want to be able to scale across many machines easily, that leaves out MySQL. I want atomic compare and set, so that leaves out the current version of Cassandra.

  • http://www.javarants.com spullara

    There are a couple of strategies that I am thinking about. HBase is adding coprocessors, and that might give me the right hooks. Simplest would be writing to HBase twice for each update (once to insert the update with a dirty bit, and again when the index is complete), periodically scanning for inconsistencies. Kind of a hack, but for my use case it could be OK.

  • http://www.javarants.com spullara

    Right now I'm doing this in the client. Sounds like you are a little further down the “implement guarantees around indexing” track than I am right now.

  • http://www.javarants.com spullara

    If MongoDB has those hooks, which I did not investigate, it seems like it could be a reasonable solution. Here is a reference to a few other libraries that do something similar for other NoSQL stores + Solr:

    http://nosql.mypopescu.com/post/383437318/integ

  • http://twitter.com/sbtourist Sergio Bossa

    Also, Terrastore (http://code.google.com/p/terrastore/) provides all the requested features: key ranges, atomic updates, evolvable schemas through custom serializers/deserializers on the client side, and Lucene integration via ElasticSearch … you may want to take a look ;)

    Cheers,

    Sergio B.

  • http://www.javarants.com spullara

    Terrastore looks promising, but the choice of ElasticSearch over Solr doesn't seem well thought out. The advanced search and indexing capabilities of Solr far outweigh its deployment difficulties.

  • http://twitter.com/sbtourist Sergio Bossa

    Thanks for your kind words, Sam.
    Solr's merits over ElasticSearch (or vice versa) aside, coding a Solr integration for Terrastore is a matter of hours, thanks to Terrastore's event system.

    Anyways, good luck with your system ;)

    Sergio B.

  • http://www.javarants.com spullara

    This is using Solr completely independently of HBase and no indexes (except primary) are stored in HBase.