A Data Engineering Design Pattern and Trial of HBase and Redis (continued)

In my last blog post, I compared HBase and Redis for my fast-lookup needs. The motivation was to see whether I needed the full power of a distributed HBase (Hadoop) cluster, or whether a single-node, in-memory lookup service like Redis would do.

My initial conclusion was that even though my HBase footprint was 4 GB, I could not fit the same data into Redis on a machine with 4 GB of memory. However, I found a better way: by leveraging the hash data structure that Redis provides, I fit the data into a footprint small enough for a single, normal-sized machine. Furthermore, it actually uses less than half of HBase’s footprint! Let’s dive into the schema differences:

The original schema’s key structure contained a namespace for logical table separation, the number of bedrooms of the listing, the geographic location, and the date, each separated by the ‘|’ character. Here is an example key:

CI|3|ca-san_francisco|2010-07-31

where ‘CI’ stands for the city namespace (other namespaces include states, counties, and zip codes). The geographic labels of cities and counties include a state, with the city/county name separated from the state by a ‘-’. All geographic labels are lowercase, with any spaces replaced by a ‘_’. The value of the key is a textual representation of a dictionary containing the number of listings and the average listing price for that week:

"{'a':374531,'n':13936}"

A problem with this schema is that there is a lot of redundancy in the keys. Many keys are mostly identical (i.e., each week of data for San Francisco for k bedrooms gets its own key), and there is some overhead per key. The result was that this schema used a ton of memory and made Redis unresponsive. (I will note that the rest of my system still held together quite well and stayed responsive.) Another problem is that with this schema, I do not know which dates I have data for without searching the keys, and Redis does not recommend running the KEYS command in production because of its O(n) cost.

The new schema is similar to the one I used for HBase in that I group together all of the records for a specific number of bedrooms and geographic location. The new key structure, which should be easy to understand given the previous schema, is as follows:

CI|3|ca-san_francisco

The value of this key is a hash that is keyed by the listing weeks for which Trulia had data. From the Redis CLI:

> hkeys 'CI|3|ca-san_francisco'
1) "2010-07-31"
2) "2011-01-15"
3) "2011-07-16"
...

The value within a hash is the same as in the previous schema, a textual representation of a dictionary:

> hget 'CI|3|ca-san_francisco' '2010-07-31'
"{'a':1187840,'n':490}"

I ran through all of the input statements as before and checked the INFO command from the Redis CLI. There is a lot of useful output, but focusing just on the used-memory footprint, it lists only 1.7 GB! That is a huge reduction from last time, when I hit 8 GB with memory full and more data still left to read in. And my performance numbers are about the same as they were with HBase: a 10-row lookup takes about 100 ms, and extracting the useful data out of the key takes about another 100 ms. Recall that with schema version 1 (before I ran out of memory) each of these took only about 20 ms.
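
If you prefer to check the footprint programmatically rather than eyeballing INFO in the CLI, something like this (again assuming redis-py) will do it:

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# INFO memory returns a dict of memory stats; used_memory_human is the
# human-readable footprint (roughly 1.7G after loading schema version 2).
mem = r.info('memory')
print(mem['used_memory'], mem['used_memory_human'])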

So, great, it works. I’ve managed to fit my data into the memory of regular-sized hardware, but let’s talk about tradeoffs. How does this scale? A way I like to think about this abstract question is to look at the different ways the system could fail if various dimensions were 100x larger. This matters for a couple of reasons. Imagine your new prototype is wildly successful, attracting lots of users; you want to be able to handle that much demand. On the other hand, optimizing your code/infrastructure for more than 100x growth may be a waste of time, and you may miss your mark in terms of time to market (or time to the boss’s desk). Okay, so some hypothetical performance estimation:

– If I had 100x the amount of data (weekly listings in my case), it would not fit into memory, and the result would not just be degraded functionality; it simply wouldn’t work. I would likely need a distributed system like Hadoop/HBase, Cassandra, or Redis Cluster.

– If I had 100x more hits to a particular set of keys (hotspotting), I could have issues. It’s not easy to say at what hit rate a hotspot would cause a slowdown, but it is worth noting that by default Cassandra hashes nearby keys to different nodes to distribute load. If I have more time, I’ll build a more sophisticated performance rig to evaluate hotspots.

– If I had a lot of updates to entries, I’d have some issues with the distributed systems, because there would be thrashing among nodes during compaction. Relatedly, Redis can become fragmented with adds/updates/deletes, so care may be required to rebuild the hashes from scratch to keep reads contiguous (a rough sketch follows below).
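
On that last point, here is a rough sketch of how one might rebuild a hash to fight fragmentation: copy its fields into a fresh key and atomically swap it over the original. The ':rebuild' temporary-key suffix is just for illustration, and this assumes a redis-py version new enough to support HSET with a mapping:

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def rebuild_hash(key):
    # Read all fields (fine for a few hundred weekly entries; consider HSCAN for huge hashes),
    # write them contiguously into a temporary key, then swap it over the original.
    fields = r.hgetall(key)
    if not fields:
        return
    tmp = key + ':rebuild'
    pipe = r.pipeline()
    pipe.delete(tmp)
    pipe.hset(tmp, mapping=fields)
    pipe.rename(tmp, key)  # RENAME atomically replaces the original key
    pipe.execute()

rebuild_hash('CI|3|ca-san_francisco')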

Anyway, these are some thoughts on the Redis schema redesign that let me keep a couple of gigs of records in memory as the serving layer for my use case. Hope you enjoyed.