My Insight Data Engineering Project

The goal is to analyze historical real-estate listing data using data engineering tools, so I built a tool called Theft Market. It leverages Trulia's API to gather their historical data; see Trulia's developer page for an overview of the API.

Overview

To bootstrap the data pipeline, Theft Market repeatedly calls Trulia's API to get the list of states, cities, and zipcodes in the US. It parses the XML responses and puts this information into its Meta Store, a MySQL database. The TruliaInfoFetcher calls getStates, getCitiesInStates, and getZipCodesInState to populate the Meta Store; see the information library page for more about these calls.
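As an illustration of what one of these metadata calls involves, here is a minimal sketch of a getStates-style fetch feeding the Meta Store; the endpoint parameters, XML element names, and table schema below are my assumptions for the example, not the actual TruliaInfoFetcher code:

import requests
import xml.etree.ElementTree as ET
import pymysql

API_URL = "http://api.trulia.com/webservices.php"  # assumed base URL

def fetch_states(api_key):
    """Call the LocationInfo/getStates function and yield (code, name) pairs."""
    params = {"library": "LocationInfo", "function": "getStates", "apikey": api_key}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    for state in root.iter("state"):  # element/field names are assumptions
        yield state.findtext("stateCode"), state.findtext("name")

def populate_meta_store(api_key, conn):
    """Insert the fetched states into the MySQL Meta Store (assumed 'states' table)."""
    with conn.cursor() as cur:
        for code, name in fetch_states(api_key):
            cur.execute("INSERT INTO states (state_code, name) VALUES (%s, %s)", (code, name))
    conn.commit()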

With information about the different geographic areas in hand, Theft Market repeatedly calls Trulia's API to get real-estate data about each area, which takes approximately 50,000 API calls. The TruliaDataFetcher functions used are getStateStats, getCityStats, and getZipCodeStats; see Trulia's stats library page for more about these calls. The results of these calls are then split in the pipeline.
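To give a sense of the stats-gathering loop (roughly 50,000 calls across states, cities, and zipcodes), here is a hedged sketch of the zipcode case; the Meta Store table name and the stats-library parameter names are assumptions:

import requests

API_URL = "http://api.trulia.com/webservices.php"  # assumed base URL, as above

def fetch_all_zipcode_stats(api_key, conn, start_date, end_date):
    """Loop over every zipcode in the Meta Store and request its listing stats."""
    with conn.cursor() as cur:
        cur.execute("SELECT zipcode FROM zipcodes")  # assumed Meta Store table
        zipcodes = [row[0] for row in cur.fetchall()]
    for zipcode in zipcodes:
        params = {
            "library": "TruliaStats",       # parameter names are assumptions
            "function": "getZipCodeStats",
            "zipCode": zipcode,
            "startDate": start_date,
            "endDate": end_date,
            "apikey": api_key,
        }
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        yield zipcode, resp.content  # raw XML; parsing and splitting happen next in the pipeline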

Pipeline Details

The TruliaDataFetcher uses HappyBase to put data directly into HBase (via the Thrift API) when it finishes parsing a stats response. The stats are also sent to HDFS using Fluentd, which writes to the WebHDFS port on the HDFS NameNode. Fluentd appends each record to a file of records in HDFS; as currently configured, files are partitioned hourly. Each line of these record files contains a JSON object for one record, which allows flexibility in what is parsed out of the XML.
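A minimal sketch of that write path, assuming a 'city_stats' HBase table with the weekly-date column layout described later, and the standard fluent-logger client for the Fluentd side; the table name, column family, and record fields are illustrative:

import json
import happybase
from fluent import sender

# Connect to the HBase Thrift gateway and to the local Fluentd agent.
hbase_conn = happybase.Connection("hbase-thrift-host", port=9090)
stats_table = hbase_conn.table("city_stats")  # assumed table name
fluent_logger = sender.FluentSender("theftmarket", host="localhost", port=24224)

def store_weekly_stat(row_key, week_ending, avg_listing, num_listings):
    """Write one parsed weekly stat to HBase and emit it to Fluentd (bound for HDFS)."""
    cell = json.dumps({"a": str(avg_listing), "n": str(num_listings)})
    # Column qualifier is the week-ending date, as described in the timings section.
    stats_table.put(row_key.encode(), {b"stats:" + week_ending.encode(): cell.encode()})
    # Fluentd appends this JSON record to the current hourly file via WebHDFS.
    fluent_logger.emit("stats", {
        "row_key": row_key,
        "week_ending": week_ending,
        "avg_listing": avg_listing,
        "num_listings": num_listings,
    })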

Subsequently, these large files in HDFS are processed by Hadoop Streaming with these Python map-reduce jobs. This translates the JSON objects into structured, tab-separated files that are used as Hive external tables. To create the Hive tables, use the create table script; see the external table creation query for an example that creates a city table. One could also write Pig, Cascading, etc. scripts on top of the nicely structured output of the Hadoop Streaming processing mentioned above. Following that, there are a handful of other ad hoc queries in the hive directory. This concludes the batch-processing, deep-dive analytics layer of Theft Market.
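The map-reduce step boils down to a map-only JSON-to-TSV conversion. Here is a hedged sketch of such a mapper (the field names are assumptions based on the record layout above, not the project's exact schema); a map-only job with zero reducers is enough, since Hive imposes structure downstream:

#!/usr/bin/env python
# Hadoop Streaming mapper: read one JSON record per line from stdin and
# emit tab-separated fields suitable for a Hive external table.
import json
import sys

FIELDS = ["state_code", "city", "week_ending", "num_bedrooms",
          "avg_listing", "num_listings"]  # illustrative field names

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip malformed records rather than failing the job
    print("\t".join(str(record.get(field, "")) for field in FIELDS))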

Webserver Pipeline

The user is exposed to a simple (read-only) REST API for getting statistics about particular geographic areas. Each REST call is handled by a combination of Apache (the server), WSGI, and Flask. The Flask application holds an instance of an object that handles calls to the MySQL manager and an instance of the HBase manager. Apache runs multiple Flask threads, each with its own MySQL and HBase manager. The Flask web server routes calls to functions in the RestCallHandler, which coordinates a combination of MySQL and HBase queries to rapidly answer the REST call.
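Conceptually, the Flask layer looks something like the sketch below; the RestCallHandler method and the manager placeholders are stand-ins for the real classes, and only one of the four routes is shown:

import json
from flask import Flask, request, jsonify

app = Flask(__name__)

class RestCallHandler:
    """Coordinates MySQL (geography lookup) and HBase (stats) queries for one thread."""
    def __init__(self, mysql_manager, hbase_manager):
        self.mysql = mysql_manager
        self.hbase = hbase_manager

    def city_average(self, query):
        # 1. Resolve the city/state to HBase row keys via the MySQL manager.
        # 2. Fetch the weekly columns from HBase and recompute the average.
        return {"average": None}  # stand-in for the real implementation

# In the real deployment each Apache/WSGI thread holds its own managers;
# the None placeholders here just keep the sketch self-contained.
handler = RestCallHandler(mysql_manager=None, hbase_manager=None)

@app.route("/data/city/average")
def city_average():
    query = json.loads(request.args["q"])  # e.g. {"state_code": "MA", "city": "Boston", ...}
    return jsonify(handler.city_average(query))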

MySQL and HBase timings

These operations are quite fast. The GPS latitude/longitude distance query takes about a millisecond, whereas the HBase access and the manipulation of a row's columns take 100-200 ms, depending on how many weeks are in the date range. The column values are keyed by the date of the listing week (e.g. 2014-06-30) and store textual JSON containing the average listing price and the number of listings that make up that average (e.g. value="{'a':'1000000','n':'13'}"). Both the average and the count are needed to recompute the average over the desired date range.
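Recomputing the range average is then just a weighted average of the weekly cells. A small sketch of that arithmetic, assuming the cell values are stored as valid JSON and fetched with a HappyBase row() call on the schema above:

import json

def average_over_range(row, start_date, end_date):
    """Recompute the average listing price from weekly {'a': avg, 'n': count} cells.

    `row` maps column qualifiers such as b'stats:2014-06-30' to JSON values,
    e.g. as returned by happybase's table.row(row_key).
    """
    total, count = 0.0, 0
    for column, value in row.items():
        week = column.decode().split(":", 1)[1]
        if start_date <= week <= end_date:  # ISO dates compare correctly as strings
            cell = json.loads(value)
            avg, n = float(cell["a"]), int(cell["n"])
            total += avg * n                # undo the per-week average
            count += n
    return total / count if count else None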

The schema could be enhanced to avoid parsing a Python dictionary for each key, but it gives acceptable performance with the data I have; I noticed that Trulia's historical data goes back to 2008 at most (so 52 weeks/year * 6 years = 312 columns maximum per row). I also don't take advantage of separate column families for other attributes that I would not often access alongside listing averages.

The API format is straightforward: the caller passes a dictionary (described below) to a base URL corresponding to the query of interest. The following URLs are supported:

ipaddress/data/city/volume
ipaddress/data/city/average
ipaddress/data/zipcode/volume
ipaddress/data/zipcode/average

where volume is the aggregation of listings over the time period and average is the average listing price over the time period. The dictionary carries the remainder of the search parameters. All dictionaries contain the following key-value pairs:

– start_date (YYYY-MM-DD)
– end_date (YYYY-MM-DD)
– num_bedrooms (positive integer)

For city queries provide:

– state_code (XX)
– city (city name) (some browsers may need a ‘%20’ inserted for spaces in the city name)

For zipcode queries provide:

– zipcode (XXXXX)

A full example of a city average listings call:

ipaddress/data/city/average?q={"state_code":"MA","city":"Boston","num_bedrooms":3,"start_date":"2012-01-01","end_date":"2014-01-01"}

A full example of a zipcode volume listings call:

ipaddress/data/zipcode/volume?q={"zipcode":"02458","num_bedrooms":3,"start_date":"2012-01-01","end_date":"2014-01-01"}
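For completeness, calling the API from Python looks something like the following (the host placeholder and the exact response shape are assumptions):

import json
import requests

query = {
    "zipcode": "02458",
    "num_bedrooms": 3,
    "start_date": "2012-01-01",
    "end_date": "2014-01-01",
}
# The q parameter carries the JSON-encoded dictionary described above.
resp = requests.get("http://<ipaddress>/data/zipcode/volume",
                    params={"q": json.dumps(query)},
                    timeout=10)
print(resp.json())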

Overall, this was a really fun project, with a functioning prototype after 2-3 weeks of work. One issue I had was using the Thrift API to put data into HBase; I would have used the Java API had I known how flaky Thrift is. Also, having port 80 open on Amazon Web Services machines is asking for trouble. I saw dozens of different types of access in the Apache server access log, from researchers' web crawlers to the commonly known w00tw00t attack. I'd need to learn how to configure Apache not to crash on seemingly simple GETs (and here I thought Apache was a more hardened server platform than Flask).

Right now the project is offline because AWS costs money. The 4-node cluster (1 x1large, 3 x1mediums) was paid for by Insight. I'd like to see if I can get the REST API portion of this project running on a single node with decent performance soon. Without a doubt I will be distracted by working hard as I get started at my next job (to be determined).


7 thoughts on “My Insight Data Engineering Project”

  1. Hello Ryan, this is Sathyan here. I am planning to join the Insight Data Engineering program in January 2015. I just wanted to check with you on how the journey with the Insight program was. I got admission in September 2014 but, unfortunately, was not able to attend the program, so I am planning to join in January 2015.

  2. hayden says:

    Hi Ryan! This is awesome! I am really interested in this program after reading your post. Amazing! Is it hard to get into it? Any must-have / requirement for the program?

    • Hi hayden,

      Sorry for the delayed response! I’m glad you enjoyed the blog post. I had a ton of fun writing it. The program is difficult to get into, but I would encourage anyone interested in the program to give it a shot. Must-haves (in my opinion) are technical ability in programming/linux and the ability to communicate ideas effectively with a positive attitude. Lastly, you should be passionate about continually learning about engineering systems. I hope this helps!
