A Data Engineering Design Pattern and Trial of HBase and Redis

Key-value stores can be a useful piece of your software stack, but choosing the right one for your use cases can be hard. Moreover, as open-source offerings evolve, it’s nice to be able to do quick performance analysis on how a new offering can improve your software stack. Today, I’ll be sharing my experience with two different key-value store offerings with different strengths and weaknesses: HBase and Redis.

The use case I’ll focus on is covered in my earlier blog post about my Insight Data Engineering project. In a nutshell, I created a simple REST API that exposes real estate listing data I gathered from Trulia’s API. One can query a geographic location (state, county, city, or ZIP code) for a date range and get back the average listing price for that location and 10 nearby locations.

If you are new to Redis, it is an in-memory key-value store that can store several different data structures as values. Redis also has a cluster-based approach. It’s unclear to me whether it is stable, because it has been marked alpha or unstable for over 2 years, but I am eager to give it a try, particularly given what follows. If you are new to HBase, it is a NoSQL database that runs on top of HDFS, which means it can scale with a Hadoop installation. Additionally, by default HBase uses a least-recently-used (LRU) block cache to keep blocks of data in memory at the region servers (typically co-located with HDFS data nodes).

Why am I comparing these two quite different technologies? For my particular use case, I didn’t have that much data in HBase. Maybe I could just store it all in memory using something simpler than standing up a Hadoop cluster. And how can I use a good software design pattern so that I can accommodate switching the underlying technology?

To avoid being tied to a particular technology, it is good practice to abstract the different functionalities in your codebase. This is way easier said than done. Let me share with you how my project, Theft-Market, evolved. First, I had separate functions for fetching data from Trulia, parsing the XML response, and writing to a datastore. But these were all in the same class and their functionalities were intertwined (i.e., the parsing output was a dictionary that matched my HBase schema). Then, I moved the parsing functions into Cython for a 2x speedup there. Finally, I redefined the contract between the parser and the datastore writer so that the writer accepts a more schema-agnostic dictionary for putting into a datastore. I also defined the same set of functions in both my HBase Manager and Redis Manager classes. This last change let me swap what the self.kv_store_manager object was, both in the Trulia data fetcher for writes and in the Flask web server for reads. The next minor step is to move this choice to configuration! To go a step further with this concept of abstraction, it could make sense to put these abstractions behind internal REST APIs, so that each piece of functionality is decoupled from a particular language and can run on separate computing hardware.
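The shape of that last refactoring can be sketched roughly as follows. This is a minimal illustration rather than the actual Theft-Market code: the names KVStoreManager, put_listing, get_listings, make_manager, and the "kv_store" config key are all assumptions made up for the example, and an in-memory stand-in takes the place of the real HBase and Redis managers so the sketch runs without either datastore.

```python
from abc import ABC, abstractmethod


class KVStoreManager(ABC):
    """Contract every backend manager implements (hypothetical name)."""

    @abstractmethod
    def put_listing(self, location, date, record):
        """Store one schema-agnostic record for a location and ISO date."""

    @abstractmethod
    def get_listings(self, location, start_date, end_date):
        """Return records for a location within an inclusive date range."""


class InMemoryManager(KVStoreManager):
    """Stand-in backend so the sketch runs without HBase or Redis."""

    def __init__(self):
        self._rows = {}

    def put_listing(self, location, date, record):
        self._rows[(location, date)] = dict(record)

    def get_listings(self, location, start_date, end_date):
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        return [
            self._rows[(loc, day)]
            for loc, day in sorted(self._rows)
            if loc == location and start_date <= day <= end_date
        ]


# An HBase-backed or Redis-backed manager would be registered here too.
BACKENDS = {"memory": InMemoryManager}


def make_manager(config):
    """Pick the backend from configuration instead of hard-coding it."""
    return BACKENDS[config["kv_store"]]()


class TruliaDataFetcher:
    """Writer side: only ever talks to the KVStoreManager contract."""

    def __init__(self, kv_store_manager):
        self.kv_store_manager = kv_store_manager

    def store(self, parsed):
        self.kv_store_manager.put_listing(
            parsed["location"], parsed["date"], parsed["record"]
        )
```

With this layout, switching datastores means changing one configuration value; the fetcher and the web server never need to know which backend sits behind self.kv_store_manager.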

The performance comparison I’ll present is more qualitative and anecdotal than precise and scientific, but it exemplifies the stark underlying differences between Redis and HBase. First, the setup: I’m using a 4-node cluster of 3 x m1.medium and 1 x m1.large with Cloudera Manager version 5.1.0, and I’m using the Thrift API with HappyBase for accessing HBase from Python. For Redis, I used the standard standalone install: apt-get install redis-server for the server and pip install redis for the Python client. I’m using the m1.large with 8 GB of memory for running the standalone Redis store. See the HBase/Redis schema part of the Theft-Market Schema page for the key and value structure.

[Figure: Webserver Pipeline]

First, HBase: getting 10 rows that were near each other (thanks to good row-key design) took around 100 ms. I noticed that when I was running puts to HBase (while fetching additional data), the response time would triple but never stall for seconds. One issue that I continue to find is that the HBase Thrift server, which is what HappyBase connects to, is fragile and needs strict type checking of inputs before they are sent. I also increased the Java heap size, which seemed to help a bit. It still needs restarting occasionally, so this is definitely something to be aware of. I may try the native Java API next time.
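One way to do that type checking is to validate and encode every put before it gets anywhere near the Thrift server. The helper below is a sketch under my own assumptions: the function name to_thrift_safe, the row-key format in the usage note, and the "stats" column family are hypothetical, not the real Theft-Market schema. It relies only on the fact that HBase columns are addressed as family:qualifier and that keys and values end up as bytes.

```python
def _as_bytes(value):
    """Encode text as UTF-8; pass bytes through untouched."""
    return value if isinstance(value, bytes) else value.encode("utf-8")


def to_thrift_safe(row_key, columns):
    """Validate a put and return (row_key, columns) encoded as bytes.

    Raises TypeError or ValueError locally instead of letting a bad
    value reach the Thrift server.
    """
    if not isinstance(row_key, (str, bytes)) or not row_key:
        raise TypeError("row key must be a non-empty str or bytes")

    encoded = {}
    for column, value in columns.items():
        if not isinstance(column, (str, bytes)):
            raise TypeError("column name must be str or bytes")
        column = _as_bytes(column)
        if b":" not in column:
            raise ValueError("column must look like family:qualifier")
        if isinstance(value, (int, float)):
            value = str(value)  # store numbers as their text form
        if not isinstance(value, (str, bytes)):
            raise TypeError(
                "unsupported value type %s for %r" % (type(value).__name__, column)
            )
        encoded[column] = _as_bytes(value)
    return _as_bytes(row_key), encoded
```

The result can then be handed to HappyBase, for example table.put(*to_thrift_safe("ca|94301|2014-08-01", {"stats:avg_price": 1250000})), where table is a HappyBase Table object.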

As I mentioned earlier, the total data footprint was not huge at 4 GB (as reported by hdfs dfs -du -h /hbase), so I thought I could put all of it into memory with an in-memory datastore solution! I used an m1.large (without Hadoop installed) for testing Redis. When I put in a small fraction of my dataset, around 1/100 to 1/1000 of it, I was getting really fast access times, around 20 ms! Things stayed really responsive as more and more of the dataset went into Redis, until I reached 80% of my 8 GB of memory. First, I was surprised that I was up to 6 GB of memory usage with 40% more data yet to be input. I read that Redis memory can become quite fragmented, causing up to 10x more memory usage. Next time I try this out, I’ll check the memory fragmentation ratio available from the INFO command in redis-cli. I’m not sure why I would have so much fragmentation when I am not overwriting any values associated with a key. Second, I was even more surprised that my lookups were really slow, on the order of 5 seconds for that same 10-row lookup as before. The performance was poor both while I was adding data and after I stopped adding more to Redis. [EDIT: see a follow-up post on how to use an advanced Redis feature to meet my use case]
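For that fragmentation check, here is a small sketch of what I have in mind. It reads the used_memory, used_memory_rss, and mem_fragmentation_ratio fields that Redis reports in the memory section of INFO. The function names and the 1.5 warning threshold are my own assumptions, not Redis recommendations, and the live lookup assumes a Redis server on localhost.

```python
GIB = 1024 ** 3


def fragmentation_report(info, warn_ratio=1.5):
    """Summarize memory use from a dict of INFO memory fields.

    warn_ratio is an arbitrary threshold chosen for this sketch.
    """
    used = info["used_memory"]      # bytes allocated by Redis
    rss = info["used_memory_rss"]   # bytes the OS sees for the process
    ratio = info.get("mem_fragmentation_ratio")
    if ratio is None:
        # Fall back to computing the ratio from the two byte counts.
        ratio = rss / used
    return {
        "used_gib": used / GIB,
        "rss_gib": rss / GIB,
        "ratio": ratio,
        "suspicious": ratio > warn_ratio,
    }


def live_memory_info(host="localhost", port=6379):
    """Fetch the memory section of INFO from a running server."""
    import redis  # pip install redis

    return redis.Redis(host=host, port=port).info("memory")
```

Calling fragmentation_report(live_memory_info()) periodically while loading data would show whether the gap between the data size and the memory footprint comes from fragmentation (a ratio well above 1) or from something else.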

Overall, I was disappointed that I could not use Redis for my use case [EDIT: see the follow-up post], but I was excited that I had a better understanding of when to use it and a stronger argument for Hadoop on non-gigantic datasets. As an aside, I do feel Hadoop is useful with 1 TB or less of data more often than one would think, particularly if there is the potential for more data to become part of a future use case. Lastly, I was also really excited about using the simple practice of abstraction in the process of this investigation. It made my software more extensible for future tools, like when I try out things like Redis Cluster!
