Recipe for a Distributed Realtime Tweet Search System
Recipe for a Distributed Realtime Tweet Search System:
Ingredients:
Method:
- Place Voldemort, Kafka, and Sensei on a couple of servers.
-
Arrange them with taste:
-
Spray a large quantity of tweets on the system
Preparation time:
24 hours
Notes:
For more servings, add the appropriate number of servers.
Result:
Reviews
- One design choice was letting the process that writes to Voldemort also be a Kafka consumer. Although this would be cleaner, we would risk a data-race where search may return hit array before they are yet added to Voldemort. By making sure it is first added to Voldemort, we can rely on it being an authoritative storage for our tweets.
- You may have already realized Kafka is acting as a proxy for twitter stream, and we could have also streamed tweets directly into the search systems, bypassing the Kafka layer. What we would be missing is the ability to play back tweet events from a specific check-point. One really nice feature about Kafka is that you can keep a consumption point to have data replayed. This makes reindexing for cases such as data corruption and schema changes, etc., possible. Furthermore, to scale search, we would have a growing number of search nodes consume from the same Kafka stream. Kafka is written in a way where adding consumers does not affect through-put of the system really helps in scaling the entire system.
- Another important design decision was on using Voldemort for storage. One solution would be instead store tweets in the search index, e.g. Lucene stored fields. The benefits with this approach would be stronger consistency between search and store, and also the stored data would follow the retention policy of that’s defined by the search system. However, other than the fact that Lucene stored field is no-where near as optimal comparing to a Voldemort cluster (an implementation issue), there are more convincing reasons:
- We can first see the consistency benefit for having search and store be together is negligible. Actually, if we follow our assumption of tweets being append-only and we always write to Voldemort first, we really wouldn’t have consistency issues. Yet, having data storage reside on the same search system would disproportionally introduce contention for IO bandwidth and OS cache, as data volume increases, search performance can be negatively impacted.
- The point about retention is rather valid. As search index guarantees older tweets to be expired, Voldemort store would continue to grow. Our decision ultimately came down to two points: 1) Voldemort’s growth factor is very different, e.g. adding new records into the system is much cheaper, so it is feasible to have a much longer data retention policy. 2) Having have cluster of tweet storage allows us to integrate with other systems if desired for analytics, display etc.
Original title and link: Recipe for a Distributed Realtime Tweet Search System (NoSQL databases © myNoSQL)