Building out a social consumer internet product that could change quickly and evolve over time puts special requirements on the underlying data store. You need to be prepared for scale but not investing too much too early, your business may need to pivot in different directions so data models can’t be set in stone and you need to be able to search that data to enable many of the features users expect from an online social product. I have the additional requirement that I wanted all the entities in the system, regardless of where they are stored, to be accessed the same way and be described by the same data description language for consistency and maintainability. I’ve created what I think to be a novel solution to these requirements in the form of HAvroBase.
The first choice you have to make against these requirements is which data definition language are you going to use? Instead of depending on the native format of the storage system I’ve decided to use Avro. Similar to Protocol Buffers and Thrift, Avro lets you define your entities using schemas and store them efficiently in a binary format. Additionally, as long as you have the original schema available you can load old data into a new schema. This will let you evolve your stored rows lazily and not have to completely update your storage system when a new field is added or a field is removed.
Whereas the data definition choice is basically commodity at this point and your choice can be somewhat arbitrary, the choice of storage technology will likely be something that has more trade-offs to consider. After looking at the features and communities of a variety of projects including HBase, Cassandra, Redis, Riak, MySQL, Membase, Terracotta, etc. I finally chose HBase for a few reasons that may not be that important to you. First I settled on a BigTable type choice based on the data model. That left HBase and Cassandra as contenders. There also a few things I think are advantages:
- multiple tables
- better ordered key support
- Zookeeper and HDFS were already in my solution
- Hive and Hadoop support against the native data format
- compare and set
- atomic increment
Cassandra definitely has its own advantages that didn’t out-weigh the other considerations including
- simple deployment
- always writable
You might make different tradeoffs or even use both solutions for different problems. I’m also biased somewhat as I am sharing an office with Cloudera and get top notch support at a moments notice.
Whether I chose HBase or Cassandra the implementation would have been much the same and with the HAvroBase framework you can have multiple independent storage systems. The framework does somewhat shield you from this decision, at least in your code.
When it comes to text search you really don’t get better than Lucene in open source and the features that Solr builds on top of Lucene make it even better. I don’t think there is reasonable argument for using something besides Solr at this point. Especially with their support for sharding and replication that comes with Solr Cloud. It also has the nice benefit that it supports multiple tables, like HBase, so I can efficiently separate entities and allow them to scale independently without managing an additional system.
The last choice that I made was runtime configuration. There are a few solutions including Spring, Guice and rolling my own and I finally landed on the side of Guice. I just like typed, programmatic configuration better than XML files and since that is the focus of Guice, I think it is the best solution for that.
The first thing that you need to do to use the system is to define an entity in Avro:
You can use the Avro compiler (or the Maven plugin) to convert that definition to classes (HAvroBase requires this). Then you do the typical things in code to setup your HBase connection but within a Guice module and set some of the configuration parameters for HAB:
That configuration can then be used to instantiate an instance of the HAB class that implements the connection to HBase like this:
You’ll notice that configuration is somewhat split between things that the base system needs to know and things that the HAB needs to know. For example, what table the schema is store in isn’t relevant to the base system as an implementer is free to store schemas however they like. But the assumption is that every system will have the concept of a table and a column family in order to separate entities in the underlying storage system. The Solr support is pretty elegant I think. Rather than specifying in more than one place which fields should be indexed, instead at runtime it queries the Solr core’s schema.xml that defines them and then when you put an entity in the system it automatically indexes those fields. Here is an example of using some of the APIs:
This example creates a new User, puts it into the HAvroBase, looks it up by row and then searches for it by field that is indexed and finally deletes it. In any fully implemented AvroBase system this should work identically. You’ll notice in the memcached implementation that only put/get/delete are implemented and it doesn’t support scan or search at all. I’ve considered breaking out that functionality into separate interfaces but haven’t done that yet.
If you are unhappy with the way the default Avro compiler generates code you can use my templated Avro compiler that lets you change the generated code. I have done that for my project to make them a little more developer friendly.
Things left to do:
- implement guarantees around indexing
- optimize schema keys (use an abbrev table instead of a hex sha-256 key)
- indexing nested fields and arrays in Solr
I’m sure there are more things left undone but those are some of the more obvious issues.