Simple Python Developer Practices

I dedicate this post to Insight Data Engineering.  It’s a great program that brought me to California one year ago this week.

I’ve missed several months of blogging because of a new job, buying a condo, and finishing the move across the country. In this simple blog, I’ll share how I like to set up a Python project. There will be no holy wars over editors or my choice to focus on Python. Rather, this is the missing blog post I was looking for when I was learning best practices for software engineering in Python. Specifically, it walks through how I would set up a Python project as of mid-2015 with tools including virtualenv, tox, make, pytest, mock, and a basic code structure.

Virtualenv

Virtual environments are essential for developing in Python when non-native Python libraries are needed, like a MySQL client.  Run the following command at the top-level of your repo to create a directory called virtualenv_run with the Python 2.7 interpreter inside.

virtualenv virtualenv_run --python=python2.7

To enter the new virtualenv, execute source virtualenv_run/bin/activate.  I execute that command so often, I have a bash alias src for this and name all my virtualenvs the same. While inside the virtualenv, you can now install packages with pip and they will be installed in the virtualenv_run directory and not in a directory owned by root (so stop using sudo pip install some_package).

Next, three files need to be added at the top level of the repo: setup.py, requirements.txt, and requirements-dev.txt.  These files specify metadata about your repo/project and what dependencies you need. The format of setup.py is shown below. Other arguments including description, author, author_email, and url can also be included.

from setuptools import find_packages
from setuptools import setup

setup(
    name="MyProject",
    version="0.1",
    packages=find_packages(exclude=['tests']),
    setup_requires=['setuptools'],
    install_requires=[
        "MySQL-python"
    ],
)

In this example, the package MySQL-python is listed as a dependency via the install_requires kwarg without a pinned version, so the latest available version of the package will be installed. It is common practice to include dependencies in setup.py (over requirements.txt) as much as possible if your project is a library. If your project is a standalone application, service, or something with a main, it is more common to include dependencies with pinned version numbers in requirements.txt. See this blog for a more detailed explanation.

The requirements-dev.txt (shown below) is designed to include the tools needed for testing.  I also include a link to install the requirements.txt file at the top. Later, I’ll show uses for each package in my adopted workflow.

-r requirements.txt
coverage
ipdb
ipython
flake8
mock
pytest

Below is an example requirements.txt with the MySQL-python dependency pinned. If you have installed packages via pip in your project, you can run pip freeze to get the version numbers of the installed packages. It may be easier to pip install a package and then capture the version number with pip freeze; pinning this way ensures that a newly released version of a dependency doesn’t break your project. Lastly, the -e . installs the package in the current directory (your project) into the virtualenv.

-e .
MySQL-python==1.2.5

To build a virtualenv with these files simply execute pip install -r requirements-dev.txt.

Tox

The last component before starting development is to set up tox, a virtualenv-based test automation tool.  Tox will run all of your unit tests, check for code style issues, report code coverage, and can even kick off acceptance/integration tests.  I’ll cover the first three uses in this blog.  Below is an example tox.ini file.

[tox]
envlist = py27

[testenv]
deps = -rrequirements-dev.txt
commands =
 coverage run --source=<YOUR_PROJECT>/,tests/ -m pytest --strict {posargs:tests}
 flake8 .

[flake8]
ignore = E125
max-line-length = 120
max-complexity = 10
exclude = .git,.tox,virtualenv_run

With this tox configuration file, the command tox will know how to run your suite of tests. It automatically builds a clean virtualenv in a directory called .tox, runs all unit tests in the tests directory, and reports code coverage with the coverage tool. Always shoot for 100% code coverage with your tests. Certain lines, like running main, can be excluded with a .coveragerc file. If main is excluded from coverage, I’d strongly suggest main consist of only an object creation and a single call to a method like run.
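As an illustration, a minimal .coveragerc along those lines (this follows coverage.py’s configuration format; the omit paths are just the directory names used in this post) might be:

```ini
[report]
exclude_lines =
    pragma: no cover
    if __name__ == .__main__.:

[run]
omit =
    virtualenv_run/*
    .tox/*
```

Note that the exclude_lines entries are regular expressions matched against source lines, which is why dots stand in for the quote characters around __main__.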

Tox also checks your code for conformance to proper Python coding style with flake8. By default flake8 is very strict about how whitespace is managed. Feel free to add to the list of ignores using the error codes that flake8 reports. Before writing code, create a simple makefile so that make venv builds the virtualenv for development and make test executes tox.
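A minimal makefile along those lines might look as follows (target names are my own convention, and recipe lines must be indented with real tabs):

```make
venv:
	virtualenv virtualenv_run --python=python2.7
	virtualenv_run/bin/pip install -r requirements-dev.txt

test:
	tox

clean:
	rm -rf virtualenv_run .tox

.PHONY: venv test clean
```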

Directory Structure

Finally, let’s write some code! I’ll share my opinions about high-level structure of your repo. First, create a tests directory for all tests. It is not by chance or accident that I mention tests first; the test code is as important as the production code! Second, create a directory for all of your production code to live. This name can even be the same as the repo name if no other name seems more appealing.

The remaining parts of the directory structure require a brief discussion about software engineering. Think about separating the functionalities needed for your use cases. You’ll probably have some core objects and some object containers, along with logic for manipulating those objects or containers. You’ll also have some end delivery system for your use case, like handling web requests, document creation, or database inserts.

The most important thing is to separate the core objects/logic from the delivery system because it makes switching delivery systems easier, and it makes managing the code easier. At the very least, create a directory called components as a sub-directory of your production code directory; the other sub-directory can be named after the delivery system (e.g., webserver). Also, the directory structure of the tests directory should match the adopted directory structure of the production code, so your tests are easier to find.

Pytest

The last thing I’d like to cover is how to use a test framework in Python. I’ll share some examples using pytest and mock, as well as quick pointers on leveraging breakpoints in IPython.  Test files can be named test_module_to_be_tested.py in the directory that mirrors the module directory in the production code.  I opt for classes for my tests instead of individual functions in a module.  Test classes are named starting with “Test” and test methods are prefixed with “test_”.

Inside each test class there are typically parameters that can be hardcoded for setting up or interacting with the thing being tested.  I put these constants in pytest fixtures, which are objects that are constructed once for each test method that includes them as an argument.  Below is an example use of fixtures with the assumption that this is in a test class and you are testing the Person class.  Lastly, assume that name is a property of Person.

@pytest.fixture
def some_name(self):
    return "Bob"

@pytest.fixture
def some_person(self, some_name):
    return Person(name=some_name)

def test_person_has_right_name(self, some_person, some_name):
    assert some_person.name == some_name

I strive for putting any hardcoded strings or integers at the top of each test class.  This makes refactoring the test code or production code easier.  In the above example, if we wanted to change the name property to be a namedtuple or name object for additional features, we would only need to change the some_name fixture!

The next essential testing concept is mocking.  Use mocking for simulating external functionalities like writes to a database, posts to a client for a web response, or simply calls to a logger.  Mocks can also be used to isolate functionalities of a class as will be shown below.  Also, it’s a good idea to have mocks inside of contexts using a combination of with or contextlib.nested and maybe a @pytest.yield_fixture if the mock can be re-used.

Here is an example where raw result processing will be tested (but not raw result generation/gathering), so we will mock out the retrieval of raw results to isolate what is being tested. Assume that DataProcessor is the class being tested and get_sum_of_results() calls get_raw_results() en route to summing the raw results.

@pytest.fixture
def data_processor(self):
    return DataProcessor()

@pytest.fixture
def fake_results(self):
    return [0, 2, 4, 6, 10]

@pytest.yield_fixture
def raw_results_patch(
    self, 
    data_processor, 
    fake_results
):
    with mock.patch.object(
        data_processor, 
        'get_raw_results', 
        return_value=fake_results
    ) as mock_get_raw_results:
        yield mock_get_raw_results

def test_data_processor_sums_correctly(
    self, 
    data_processor, 
    raw_results_patch, 
    fake_results
):
    summed_results = data_processor.get_sum_of_results()
    assert summed_results == sum(fake_results)

This is a nice way to mock out the raw results so that summing can be tested independently of result retrieval, and if the raw_results_patch yield fixture is not included as an argument to another test, that test will not mock out the get_raw_results call. See mock examples for more on mocking, and the pytest documentation for more on pytest.

Sometimes, I like to step into the code at a breakpoint using IPython to explore language semantics, mocking, or to debug why my test isn’t passing. To do so with the setup endorsed in this blog, I use the ipdb package and invoke the test run without tox. Specifically, at the spot in the code I want to investigate, I paste in import ipdb;ipdb.set_trace(). Then, I run the test with the invocation python -m pytest -s tests/components/test_a_component.py.

There is a lot of stuff in this post.  I hope you enjoyed it and found it helpful! Feel free to share feedback, and thanks for reading!

Data Engineering: From Basic Prototype to Production

Designing a scalable production-worthy system is a huge challenge that requires a careful balance of developing extensible code that is fast enough for production, managing your time to hit milestones, and avoiding premature optimization and dirty hacks. Here are the ways I like to think about developing a scalable system:

  1. Write a prototype as a single program.  It’s important to get something working end-to-end for a sanity check.
  2. Refactor prototype code into classes, parameterize where necessary, and write unit tests for each class (use test-driven development!).  Refactoring should involve abstracting away things like DB connections and queries, and making things configurable.
  3. Re-test end-to-end functionality, and create integration tests.  All of this test writing gives us confidence later when making any changes.

At this point, you have some decent looking code and a nice test suite, but it is not obvious how to scale from here.  You might spot some places where parallelizing execution of your code could help. However, this can be tricky business when extending code that was not designed for parallel execution, and testing/debugging can be difficult because of the exponential combinations of state that need to be right for the program to work. I prefer using what I’ll call a workflow framework.

Let’s define a workflow framework as a tool or set of tools that separates execution into multiple tasks computing simultaneously, with finer-grained control and visibility on each task. As an example, think about submitting a large number of small tasks to a job server like Gearman, or publishing jobs to a queue using Redis.

This greatly helps software development because each piece can be developed and tested individually with ease. Failures in production can be isolated to a task and restarted automatically. This isolation also enables different tools and languages to be used for different tasks; imagine submitting tasks from user input in a Python web layer and having workers process the requests in a lower-level language like C for performance reasons.
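To make the idea concrete, here is a minimal sketch of the task-isolation pattern using the standard library’s queue and threads as a stand-in for a real job server like Gearman or a Redis queue (the squaring “task” is made up purely for illustration):

```python
import threading
try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2

def worker(tasks, results):
    # Each worker pulls small, isolated tasks off the queue until it
    # sees the sentinel value None.
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put(item * item)  # stand-in for real task logic

tasks, results = Queue(), Queue()
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)  # one sentinel per worker
for t in threads:
    t.join()

squared = sorted(results.queue)
print(squared)
```

Because each task is independent, a failed task can be retried without restarting the whole pipeline, and the worker body can be swapped out without touching the submission side.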

Lastly, you will find that if you have isolated your classes well and have a good test suite, it will not be too difficult to move logic between different frameworks. Much of the code can be moved to worker classes that listen to a job server (Gearman) or a Redis queue.

This blog provides a high-level methodology for going from a rough prototype hitting a real use case to a maintainable and scalable system in production.

A Data Engineering Design Pattern and Trial of HBase and Redis (continued)

In my last blog post, I looked at a comparison of HBase vs Redis for addressing my needs for fast lookups. The motivation was to see if I needed the full power of having a distributed system in an HBase (Hadoop) cluster vs a single-node, in-memory lookup service Redis.

My initial conclusion was that even though my HBase footprint was 4 GB, I could not fit the data into Redis on a machine with 8 GB of memory. However, I found a better way: leveraging the hash data structure provided by Redis to fit the data into a footprint small enough for a single normal machine. Furthermore, it actually uses less than half of the footprint of HBase! Let’s dive into the schema differences:

The original schema key structure contained a namespace for logical table separation, the number of bedrooms of the listing, the geographic location, and the date, with each separated by the ‘|’ character. Here is an example key:

CI|3|ca-san_francisco|2010-07-31

where ‘CI’ stands for the city namespace (other namespaces include states, counties, and zipcodes). The geographic labels of cities and counties include a state, with the city/county name separated from the state with a ‘-‘. All geographic labels are in lower case with any space replaced with a ‘_’. The value of the key is a textual representation of a dictionary with the number of listings and average listing price for that week:

"{'a':374531,'n':13936}"
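Since the value is stored as text, clients have to parse it back into a dictionary on read. One safe way to do that in Python (a sketch of the idea, not necessarily what my code does) is ast.literal_eval, which handles the single-quoted format without the dangers of eval:

```python
import ast

def parse_value(raw):
    # Parse a stored value like "{'a':374531,'n':13936}" into a dict.
    # literal_eval only accepts Python literals, so arbitrary code in a
    # corrupted value cannot execute.
    return ast.literal_eval(raw)

stats = parse_value("{'a':374531,'n':13936}")
print(stats['a'], stats['n'])
```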

A problem with this schema is that there is a lot of redundancy in the keys: many keys are mostly the same (i.e., each week of data for San Francisco for k bedrooms has its own key), and there is some overhead per key. The result was that this schema used a ton of memory and made Redis lookups very slow (I will note that my system as a whole still held together quite well and stayed responsive). Another problem with this schema is that I do not know which dates I have data for without searching for the keys, and Redis does not recommend running the keys command in production because of its O(n) computation.

The new schema is similar to the schema I used for HBase in that I grouped together all of the records for a specific number of bedrooms and geographic location. The new schema, which should be easy to understand based on the previous schema, is as follows:

CI|3|ca-san_francisco

The value of this key is a hash that is keyed by the listing weeks for which Trulia had data. From the Redis CLI:

> hkeys 'CI|3|ca-san_francisco'
1) "2010-07-31"
2) "2011-01-15"
3) "2011-07-16"
...

The value within a hash is the same as in the previous schema, a textual representation of a dictionary:

> hget 'CI|3|ca-san_francisco' '2010-07-31'
"{'a':1187840,'n':490}"
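The move from the old flat keys to the new hash schema amounts to splitting each old key into a hash key plus a field. A sketch of that mapping (the function name is mine, for illustration):

```python
def split_key(flat_key):
    # Split an old flat key like 'CI|3|ca-san_francisco|2010-07-31'
    # into the new (hash_key, field) pair, where the trailing date
    # becomes the field inside the hash.
    parts = flat_key.split('|')
    return '|'.join(parts[:-1]), parts[-1]

hash_key, field = split_key('CI|3|ca-san_francisco|2010-07-31')
print(hash_key, field)
```

With redis-py, writes would then go through hset(hash_key, field, value) instead of set(flat_key, value), collapsing every week for a given geography into a single key.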

I ran through all of the input statements as before and checked the info command from the Redis CLI. There is a lot of useful output, but focusing just on the used-memory footprint, it lists only 1.7 GB! That is a huge reduction from last time, when I filled 8 GB of memory with more data still to read in. And my performance numbers are about the same as they were with HBase: a 10-row lookup takes about 100 ms, and extracting the useful data out of the key takes about another 100 ms. Recall that with schema version 1 (before I ran out of memory) each of these took only about 20 ms.

So, great: it works. I’ve managed to fit my data into the memory of regular-sized hardware, but let’s talk about tradeoffs. How does this scale? A way I like to think about this abstract question is to look at the different ways the system could fail if different dimensions were 100x larger. This matters for a few reasons: imagine your new prototype is wildly successful, attracting lots of users; you want to be able to handle that much demand. On the other hand, optimizing your code/infrastructure for larger than 100x may be a waste of time, and you may miss your mark in terms of time to market (or time to boss’s desk). Okay, so hypothetical performance estimation:

– If I had 100x the amount of data (weekly listings in my case), it would not fit into memory, and the system would not merely perform badly, it simply wouldn’t work. I might need a distributed system like Hadoop/HBase, Cassandra, or Redis Cluster.

– If I had 100x more hits to a particular set of keys (or hotspotting), I could have issues. It’s not easy to see at what hit rate hotspots would cause a slowdown, but it is important to note that by default Cassandra hashes nearby keys to different nodes to distribute load. If I have more time, I’ll develop a more sophisticated performance rig to evaluate hotspots.

– If I had a lot of updates to entries, I’d have some issues with the distributed systems because there would be thrashing among nodes during compaction. Relatedly, Redis can become fragmented with adds/updates/deletes, so care may be required to rebuild the hashes from scratch to ensure more contiguous reads.

Anyway, these are some thoughts on my Redis schema redesign that enabled me to store a couple of gigs of records in memory for a serving layer for my use case. Hope you enjoyed.

A Data Engineering Design Pattern and Trial of HBase and Redis

Key-value stores can be a useful piece of your software stack, but choosing the right one for your use cases can be hard. Moreover, as open-source offerings evolve, it’s nice to be able to do quick performance analysis on how a new offering can improve your software stack. Today, I’ll be sharing my experience with two different key-value store offerings with different strengths and weaknesses: HBase and Redis.

The use case I’ll focus on is covered in my earlier blog post about my Insight Data Engineering project. In a nutshell, I created a simple REST API that exposes real estate listing data I gathered from Trulia’s API. One can query a geographic location (state, county, city, or zipcode) for a date range and get back the average listing price for that location and 10 nearby locations.

If you are new to Redis, it is an in-memory key-value store that can store several different data structures as values. Redis also has a cluster offering; it’s unclear whether it is stable, as it has been marked alpha or unstable for over 2 years, but I would be eager to give it a try, particularly based on what follows. If you are new to HBase, it is a NoSQL database that runs on top of HDFS, which means it can scale with a Hadoop installation. Additionally, by default HBase uses least-recently-used block caching for storing blocks of data in memory at the region servers (typically co-located with HDFS data nodes).

Why am I comparing these two quite different technologies? For my particular use case, I didn’t have that much data in HBase. Maybe I could just store it all in memory using something simpler than standing up a Hadoop cluster. How can I use a good software design pattern so that I can accommodate switching underlying technologies?

To avoid being tied to a particular technology, it is a good practice to abstract different functionalities in your codebase. This is way easier said than done. Let me share how my project, Theft-Market, evolved. First, I had separate functions for fetching data from Trulia, parsing the XML response, and writing to a datastore, but these were all in the same class and their functionalities were intertwined (i.e., the parsing output was a dictionary that matched my HBase schema). Then, I moved the parsing functions into Cython for a 2x improvement there. Finally, I re-contracted the parser-to-datastore-writer interface to accept a more schema-agnostic dictionary for putting into a datastore, and I defined the same functions in my HBase Manager and Redis Manager classes. This last change enabled me to swap out what the self.kv_store_manager object was in the Trulia data fetcher for writes and in the Flask web server for reads. The next minor step is to move this choice into configuration! To go a step further with this concept of abstraction, it could make sense to make abstractions across internal REST APIs, so you can decouple functionalities from a particular language and separate computing hardware.
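The pattern boils down to having every datastore manager implement the same small interface, so the rest of the code only ever sees self.kv_store_manager. A hedged sketch (class and method names are mine, not Theft-Market’s actual API):

```python
class KVStoreManager(object):
    """Interface every datastore manager implements."""
    def put_record(self, key, record):
        raise NotImplementedError

    def get_record(self, key):
        raise NotImplementedError

class InMemoryManager(KVStoreManager):
    """Dict-backed manager, handy for tests; a hypothetical
    HBaseManager or RedisManager would implement the same
    two methods against the real datastore."""
    def __init__(self):
        self._store = {}

    def put_record(self, key, record):
        self._store[key] = record

    def get_record(self, key):
        return self._store.get(key)

# Callers depend only on the interface, so swapping backends becomes
# a one-line (or one-config-entry) change.
manager = InMemoryManager()
manager.put_record('CI|3|ca-san_francisco', {'a': 374531, 'n': 13936})
print(manager.get_record('CI|3|ca-san_francisco'))
```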

The performance comparison I’ll present is more qualitative and anecdotal than precise and scientific, but it exemplifies the stark underlying differences between Redis and HBase. First, the setup: I’m using a 4-node cluster of 3 x m1-medium and 1 x m1-large with Cloudera Manager version 5.1.0, and I’m using the Thrift API with HappyBase for accessing HBase from Python. Redis is the standard standalone install: apt-get install redis-server, plus pip install redis for the client. I’m using the m1-large with 8 GB of memory for running the standalone Redis store. See the HBase/Redis schema part of the Theft-Market Schema page for the key and value structure.

Performance Comparison

First, getting 10 rows from HBase that were near each other (thanks to good row-key design) took around 100 ms. I noticed that when I was running puts to HBase (while fetching additional data), the response time would triple but never stall for seconds. One issue that I continue to find is that the HBase Thrift server, which is what HappyBase connects to, is fragile and needs strong type checking prior to input. I also increased the Java heap size, which seemed to help a bit. Occasionally, it needs restarting anyway, so this is definitely something to be aware of. I may try the native Java API next time.

Like I mentioned earlier, the total data footprint was not huge at 4 GB (per hdfs dfs -du -h /hbase), so I thought I could put all of this into memory with an in-memory datastore solution! I used an m1-large (without Hadoop installed) for testing Redis. When I put in a small fraction of my dataset, around 1/100 to 1/1000 of it, I was getting really fast access times, around 20 ms! Things seemed to stay really responsive as more and more of the dataset went into Redis, until I reached 80% of my 8 GB of memory. First, I was surprised that I was up to 6 GB of memory usage with 40% of the data yet to be input. I had read that Redis can become quite fragmented, causing up to 10x more memory usage. Next time I try this out, I’ll check the memory fragmentation ratio available from the INFO command in the redis-cli; I’m not sure why I would have so much fragmentation when I am not overwriting any values associated with a key. Second, I was more surprised that my lookups became really slow, on the order of 5 seconds for the same 10-row lookup as before. The performance was poor both before and after I stopped adding more to Redis. [EDIT: see a follow-up post on how to use an advanced Redis feature to meet my use case]

Overall, I was disappointed that I could not use Redis for my use case [EDIT: see the follow-up post], but I was excited that I had a better understanding of when to use it and a stronger argument for Hadoop on non-gigantic datasets. As an aside, I do feel Hadoop is useful with 1 TB or less of data more often than one would think particularly if there is the potential for more data to become part of a future use case. Lastly, I was also really excited about using the simple practice of abstraction in the process of this investigation. It made my software more extensible for future tools, like when I try out things like Redis cluster!

Cython Performance Case Study

I’ve been learning different programming languages since the early 2000s. It can be tempting to clutch onto a language and use it no matter what. There are advantages to mastering a single language on a personal level and company level. However, the disadvantages include:

(1) not using new languages and their inherent strengths,

(2) being reliant on language support for new tools,

(3) your build and testing infrastructure are stuck in a language (I really like Makefiles because they work in all languages!),

(4) missing out on how using multiple languages can guide your implementation toward clearly defined modules, and

(5) not learning new languages can limit us as programmers (see Nathan Marz’s blog that suggests learning new languages makes us better programmers).

I started programming with C and C++, but higher-level languages like Python are attractive in terms of code density and readability, which in my experience greatly help maintainability. Many Java and C++ software engineers dismiss Python as a production-grade language because it is much slower and it is not statically typed (oftentimes they dismiss each other as well :-D, but that’s another topic).

I wanted to do a short case study on sorting unique numbers in linear time using bit vectors in each of these languages. I wanted to see how much slower Python would be compared to the statically typed languages for bit manipulation, and, as the title alludes, compare all of these to Cython.

The precise task: sort a file of unique numbers and write the numbers out to a file in sorted order. I plan to use a bit vector of an integer type, with the idea of setting the bit at the position corresponding to each number’s value. Once I have all the numbers represented with one bit each, I can loop through each bit of the bit vector and print out the numbers in order.
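Before the full implementations, the algorithm itself can be sketched in a few lines using Python’s arbitrary-precision integers as the bit vector (this is just to show the idea, not one of the benchmarked versions):

```python
def bitvector_sort(numbers):
    # Sort unique non-negative integers by setting one bit per value.
    bv = 0
    for n in numbers:
        bv |= 1 << n          # set the bit at position n
    out = []
    bit = 0
    while bv:                 # walk the bits from low to high
        if bv & 1:
            out.append(bit)
        bv >>= 1
        bit += 1
    return out

print(bitvector_sort([5, 2, 9, 0]))
```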

For me it was natural to think about implementing this algorithm in C. Here is a link to the full 32-bit based implementation. The important snippets below:

// bit vector
uint32_t *bvp = (uint32_t*)malloc(bv_size*sizeof(uint32_t));

// clear bit vector
for (idx = 0; idx < bv_size; idx++) {
  bvp[idx] = 0;
}

...

// ---- reads file and sets bits in bit vector ----
while (fscanf(ifp, "%d", ph_num) != EOF) {

  idx = *ph_num/32;    // calc which int in bv to set
  bit = *ph_num % 32;    // calc which bit in int to set

  comp_int = 1;
  comp_int<<=bit;
  *(bvp + idx) = (*(bvp + idx) | comp_int); // set bit
}

// ---- outputs sorted numbers in file ----
for (idx = 0; idx < bv_size; idx++) {

  comp_int = 1;
  if (*(bvp + idx) > 0) { // small optimization
    for (bit = 0; bit < 32; bit++) {
      if (comp_int & *(bvp + idx))
        fprintf(ofp,"%d\n",idx*32+bit);
      comp_int<<=1;
    }
  } 
}

Then, I wrote this in Python. I had never cared deeply about an integer’s bit length before in Python, so my first attempt was to use NumPy’s number types. Here is a link to the full Python implementation. The important snippets below:

import numpy

bv_size = 312500   # 10M / 32 supports 0 - 9999999
int_size = 32

bit_vector = numpy.zeros(bv_size, dtype=numpy.uint32)

# ---- reads file and sets bits in bit vector ----
with open('unsorted_phonebook.txt','r') as ifh:
  for line in ifh:
    ph_num = int(line)
    bit = ph_num % int_size
    idx = ph_num / int_size
    comp_int = numpy.uint32(1)
    comp_int <<= bit
    bit_vector[idx] |= comp_int

# ---- outputs sorted numbers in file ----
with open('sorted_phonebook_python.txt','w') as ofh:
  for idx in xrange(bv_size):
    comp_int = numpy.uint32(1)
    if bit_vector[idx] > 0: # small optimization
      for bit in xrange(int_size):
        if comp_int & bit_vector[idx]:
          ofh.write(str(idx*int_size + bit) + '\n')
        comp_int <<= 1  # shift every iteration, not only when the bit is set

The file handling is included here, showing the nice code density of Python, but it’s slow of course (how much slower we’ll see below). Here is a link to the full Cython implementation:

def sort_phonebook():
    
    DEF bv_size = 312500 # 10M / 32 supports 0 - 9999999 (or ph:999-9999)
    DEF int_size = 32

    cdef int bit
    cdef int idx
    cdef int ph_num
    cdef unsigned int comp_int
    cdef unsigned int bit_vector_i

    cdef unsigned int bit_vector[bv_size]

    for idx in xrange(bv_size):
        bit_vector[idx] = 0
    
    # reading from file
    with open('unsorted_phonebook.txt','r') as ifh:
         for line in ifh:
             ph_num = int(line)
             bit = ph_num % int_size
             idx = ph_num / int_size
             comp_int = 1
             comp_int <<= bit
             bit_vector[idx] |= comp_int

    with open('sorted_phonebook_cython.txt','w') as ofh:
         for idx in xrange(bv_size):
             comp_int = 1
             if bit_vector[idx] > 0: # small optimization
                 for bit in xrange(int_size):
                     if comp_int & bit_vector[idx]:
                         ofh.write(str(idx*int_size + bit) + '\n')
                     comp_int <<= 1    

Notice how the Cython code is not very different from the Python code! Cython will need to compile this pyx file for Python to be able to call this function. The setup file is here:

from distutils.core import setup
from Cython.Build import cythonize
setup(
    ext_modules = cythonize("sort_phonebook_cython.pyx")
)

And then execute the setup with the following command:

python setup_cython_phonebook_sort.py build_ext --inplace

Now we can call this function in another Python program with:

import sort_phonebook_cython
sort_phonebook_cython.sort_phonebook()

For more comprehensive benchmarking, I also wrote this algorithm in Java. For brevity, see the Java implementation here.

On to the performance results. I sorted 6- and 7-digit numbers with these programs.
The results for 6-digits:


$ make
python create_phonebook_file.py
gcc sort_phonebook.c -o sort_phonebook_in_c
javac SortPhonebook.java
python setup_cython_phonebook_sort.py build_ext --inplace
running build_ext
python run_all_tests.py
0.608454942703 seconds to sort in c
1.06622695923 seconds to sort in cython
12.8049960136 seconds to sort in python
1.31410098076 seconds to sort in java
diff sorted_phonebook_c.txt sorted_phonebook_cython.txt
diff sorted_phonebook_c.txt sorted_phonebook_python.txt
diff sorted_phonebook_c.txt sorted_phonebook_java.txt

The results for 7-digits (I excluded the creation of the file since this is really slow in Python with my current implementation):

$ make run_tests
python run_all_tests.py
4.33113193512 seconds to sort in c
10.2070810795 seconds to sort in cython
130.906479836 seconds to sort in python
10.8574941158 seconds to sort in java
diff sorted_phonebook_c.txt sorted_phonebook_cython.txt
diff sorted_phonebook_c.txt sorted_phonebook_python.txt
diff sorted_phonebook_c.txt sorted_phonebook_java.txt

A couple of points:

(1) Confirmed (anecdotally) that the runtime growth of these functions is mostly linear, roughly 10x going from 6-digit to 7-digit numbers (a 10x growth in the count of numbers, 1M to 10M)

(2) The Python implementation was 20x slower than C for sorting 1M numbers and 30x slower than C for sorting 10M numbers

(3) Cython edged Java for sorting 1M numbers, but both implementations were 2-2.5x slower than C.

(4) There are 25 source lines of code (SLOC) for Python, 35 SLOC for Cython, 51 SLOC for Java, and 56 SLOC for C. One could make an argument about developer-time performance in terms of this count.

There you have it. Try Cython if you have a computationally heavy function in Python (or one with large constant factors, as in this evaluation) that you’d like to avoid re-writing in C or Java. But the ultimate performance king is still C.

My Insight Data Engineering Project

The goal is to analyze historical real-estate listing data using data engineering tools, so I built a tool called Theft Market. I leverage Trulia’s API to gather their historic data. See Trulia’s developer page for an overview of their API.

Overview

To bootstrap the data pipeline, Theft Market repeatedly calls Trulia’s API to get the list of states, cities, and zipcodes in the US. It parses the XML responses and puts this information into its Meta Store, a MySQL database. The TruliaInfoFetcher calls getStates, getCitiesInStates, and getZipCodesInState to populate the Meta Store; see the information library page for more about these calls.
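The XML-parsing step can be sketched with the standard library; the response layout below is made up for illustration and does not reproduce Trulia’s actual schema:

```python
import xml.etree.ElementTree as ET

# Illustrative response body only; Trulia's real getStates response differs.
response_xml = """
<TruliaWebServices>
  <response>
    <LocationInfo>
      <location><name>California</name><stateCode>CA</stateCode></location>
      <location><name>Massachusetts</name><stateCode>MA</stateCode></location>
    </LocationInfo>
  </response>
</TruliaWebServices>
"""

def parse_states(xml_text):
    """Extract (name, state_code) pairs for insertion into the Meta Store."""
    root = ET.fromstring(xml_text)
    return [(loc.findtext("name"), loc.findtext("stateCode"))
            for loc in root.iter("location")]
```

Each parsed pair then becomes a row in the Meta Store’s MySQL tables.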

With information about different geographic areas, Theft Market repeatedly calls Trulia’s API to get real estate data about each of the areas, which takes approximately 50,000 API calls. The TruliaDataFetcher functions used are getStateStats, getCityStats, and getZipCodeStats; see Trulia’s stats library page for more about these calls. The results of these calls are then split in the pipeline.

Pipeline Details

The TruliaDataFetcher uses HappyBase to put data directly into HBase (via the Thrift API) when it finishes parsing a stats response. The stats are also sent to HDFS using FluentD, via the WebHDFS port on the HDFS NameNode. FluentD appends each record to a file of records in HDFS, and files are partitioned hourly as currently configured. Each line of these record files holds one JSON object, allowing flexibility in what is parsed out of the XML.
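The one-JSON-object-per-line layout can be sketched as follows (the field names are hypothetical, not the actual record schema):

```python
import json

# Two hypothetical stats records; real records carry whatever fields
# were parsed out of the XML response, so they need not match.
records = [
    {"state_code": "MA", "city": "Boston", "num_bedrooms": 3, "avg_listing": 1000000},
    {"state_code": "CA", "city": "Palo Alto", "num_bedrooms": 2},  # fields may vary
]

# One JSON object per line, as FluentD appends them to the file in HDFS.
lines = "\n".join(json.dumps(r) for r in records)

# Each line parses independently, so downstream jobs tolerate schema drift.
parsed = [json.loads(line) for line in lines.splitlines()]
assert parsed == records
```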

Subsequently, these large files in HDFS are processed by Hadoop Streaming with these Python map-reduce jobs. This translates the JSON objects into structured, tab-separated files that are used as Hive external tables. To create the Hive tables, use the create table script; see the external table creation query for an example that creates a city table. One could also write Pig, Cascading, etc. scripts on top of the nicely structured files produced by the Hadoop Streaming step. Following that, there are a handful of other ad hoc queries in the hive directory. This concludes progress on the batch-processing, deep-dive analytics layer of Theft Market.
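A Hadoop Streaming mapper for that JSON-to-TSV translation can be sketched like this (field names hypothetical); since Streaming just pipes lines through stdin/stdout, the script can be tested locally:

```python
import json
import sys

# Hypothetical output schema for the city table; tab-separated for Hive.
FIELDS = ["state_code", "city", "num_bedrooms", "avg_listing"]

def map_line(line):
    """Turn one JSON record into a tab-separated row for a Hive external table."""
    record = json.loads(line)
    # Missing fields become empty columns, so schema drift does not break rows.
    return "\t".join(str(record.get(field, "")) for field in FIELDS)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds records on stdin and collects rows from stdout.
    for line in stdin:
        line = line.strip()
        if line:
            stdout.write(map_line(line) + "\n")
```

Locally, `cat records.json | python mapper.py` would exercise the same path Hadoop Streaming does.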

Webserver Pipeline

The user is exposed to a simple (read-only) REST API for getting statistics about particular geographic areas. REST calls are handled by a combination of Apache (the server), WSGI, and Flask. The Flask application holds an instance of an object that handles MySQL calls and an instance of an HBase manager. Apache runs multiple Flask threads, each with its own MySQL and HBase manager. The Flask web server routes calls to functions in the RestCallHandler, which coordinates a combination of MySQL and HBase queries to rapidly answer the REST call (see the diagram above).

MySQL and HBase timings

These operations are quite fast. The GPS latitude/longitude distance query takes about a millisecond, whereas the HBase access and the manipulation of a row’s columns take 100-200 ms depending on how many weeks are in the date range. The column values are keyed by the date of the listing week (e.g. 2014-06-30), and each value is textual JSON storing the average listing price and the number of listings that comprise the average (e.g. value="{'a':'1000000','n':'13'}"). Both the average and the count are necessary for recomputing the average over the desired date range.

The schema could be enhanced to avoid parsing a Python dictionary for each key, but it gives acceptable performance with the data I have; I noticed that Trulia’s historical data goes back to at most 2008, so that is 52 weeks/year * 6 years = 312 maximum columns per row. I also don’t take advantage of separate column families for other attributes that would not often be accessed along with listing averages.
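Recomputing the average over a date range from those {'a': average, 'n': count} values is just a weighted mean. A sketch, assuming the values have already been parsed to numbers:

```python
def combine_weekly_stats(weeks):
    """Combine per-week (average, count) pairs into one average over a range.

    Each element mirrors one HBase column value:
    {'a': average listing price, 'n': number of listings behind that average}.
    """
    total = sum(w["n"] for w in weeks)
    if total == 0:
        return None, 0
    weighted = sum(w["a"] * w["n"] for w in weeks)
    return weighted / float(total), total

# Two weeks: 13 listings averaging 1,000,000 and 7 listings averaging 900,000.
avg, volume = combine_weekly_stats([{"a": 1000000, "n": 13},
                                    {"a": 900000, "n": 7}])
# avg is 965000.0 over a volume of 20 listings
```

This is exactly why storing only the per-week average would not be enough: without 'n', the weeks cannot be reweighted correctly.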

The API format is straightforward: the caller passes a dictionary (described below) to a base URL corresponding to the query of interest. The following URLs are supported:

ipaddress/data/city/volume
ipaddress/data/city/average
ipaddress/data/zipcode/volume
ipaddress/data/zipcode/average

where volume is the aggregation of listings over the time period and average is the average listing price over the time period. The dictionary passes the remainder of the search parameters. All dictionaries contain the following key-value pairs:

– start_date (YYYY-MM-DD)
– end_date (YYYY-MM-DD)
– num_bedrooms (positive integer)

For city queries provide:

– state_code (XX)
– city (city name) (some browsers may need a ‘%20’ inserted for spaces in the city name)

For zipcode queries provide:

– zipcode (XXXXX)

A full example of a city average listings call:

ipaddress/data/city/average?q={"state_code":"MA","city":"Boston","num_bedrooms":3,"start_date":"2012-01-01","end_date":"2014-01-01"}

A full example of a zipcode volume listings call:

ipaddress/data/zipcode/volume?q={"zipcode":"02458","num_bedrooms":3,"start_date":"2012-01-01","end_date":"2014-01-01"}
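Server-side, the q parameter is plain JSON, so decoding and validating it is straightforward. A sketch (key names from the API above; the validation logic here is illustrative, not the actual RestCallHandler code):

```python
import json

# Keys every query must carry, per the API description above.
REQUIRED_KEYS = {"start_date", "end_date", "num_bedrooms"}

def parse_query(q):
    """Decode the q= parameter and check the required keys."""
    params = json.loads(q)
    missing = REQUIRED_KEYS - set(params)
    if missing:
        raise ValueError("missing keys: %s" % ", ".join(sorted(missing)))
    # A query must identify an area: either a zipcode, or state_code plus city.
    if "zipcode" not in params and not {"state_code", "city"} <= set(params):
        raise ValueError("provide either zipcode, or state_code and city")
    return params

params = parse_query('{"zipcode": "02458", "num_bedrooms": 3, '
                     '"start_date": "2012-01-01", "end_date": "2014-01-01"}')
```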

Overall, this was a really fun project, with a functioning prototype after 2-3 weeks of work. Some issues I had were with using the Thrift API to put data into HBase; I would have used the Java API had I known how flaky Thrift was. Also, having port 80 open on Amazon Web Services machines is asking for trouble. I saw dozens of different types of access in the Apache server access log, from researchers’ web crawlers to the commonly known w00tw00t attack. I’d need to learn how to configure Apache not to crash on seemingly simple GETs (and here I thought Apache was a more hardened server platform than Flask).

Right now the project is offline because AWS costs money. The 4-node cluster (1 x1large, 3 x1mediums) was paid for by Insight. I’d like to see if I can get the REST API portion of this project running on a single node with decent performance soon. Without a doubt, I will be distracted by working hard to get started at my next job (to be determined).


Insight Data Engineering First Two Weeks


(I’m definitely going to get this shirt from Zazzle.com)

The first two weeks in the inaugural Insight Data Engineering Fellows Program have been really fun.  We have met with people with experience at Facebook, Jawbone, LinkedIn, Databricks, Datastax, Netflix, Twitter, Yammer, Intuit, Apple, and others that I’ve momentarily forgotten (and we’ll meet with many more as the data engineering program starts to do more company visits).  At a high level, they all shared stories of how they found insights hidden in their data.  I was blown away by how many ways data is used to help solve real problems (other than finding cat videos on YouTube).

I’ll share a few interesting use cases.  I’ll leave out the social networking graph analytics angle on big data as that is an obvious (and still very powerful) use case.

Analytical company roadmapping:  Are the products you or your company are focusing on providing the highest ROI?  What would a PDF (probability density function) of all your users versus some usage dimension look like?  One company showed us how such a plot saved the company by redirecting the ship toward areas closer to what the large population of their users was actually doing.

Large scale A/B testing:  How do you know whether what you are building will work better or be used more?  Multiple (if not all) companies mentioned the power of deploying A/B tests for performance analysis and new UI testing to answer these questions. (See http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ for how people-you-may-know came about at LinkedIn, which was an inadvertent A/B test.)

Logical engineering bugs: One company noticed that adoption was lagging in a foreign country, and upon drilling down using big data tools, discovered that a critical sign-on page was in the wrong language.  They would not have known exactly where to look for the problem without clear organization of the data pointing directly to the logical bug.

On the engineering tools side, we had a two-week crash course in the Hadoop stack, Cassandra, and Spark.  Although I was familiar with some of the tools, I learned many of the finer points of how these systems work and how to get them working together.  Here are some of the more humorous points that either I or other engineering fellows made.

Hadoop: Why did my job take 30 seconds when I had 10 rows of text in my only table?  (another commented) Oh yeah, well wait until you try a join!

Spark: I thought this was a spark shared cluster where everyone could run jobs simultaneously?  Someone is capturing 1% of all twitter feeds and is hogging 63 gigs of memory?

Cassandra: So you’re telling me that, eventually, my data will be consistent? hmpf.

We’ve been thrown into the proverbial deep end with all these data engineering tools.  Over the next few weeks, we each solve a data engineering problem with components including batch jobs, streaming, and (external) query serving.  I’ll blog more about that as my project in organizing real-estate data progresses.

Java, JUnit, and Maven Jumpstart

It’s great to keep up with modern build tools and not get stuck doing a lot of manual configuration. I was stuck using make, which can be fine for simple things, but hopefully after reading this post, you’ll use Maven, especially for the simple things.

I’ll skip the details of the install and configuration since it was quite a breeze. For OS X, I downloaded a maven bin.tar.gz archive and for my Debian-based systems, I installed it through apt-get.

The most important configuration file in Maven is pom.xml, where “pom” stands for project object model. The pom.xml file bootstraps your project’s directory structure. I’ll share my pom.xml in pieces and start there.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.your-name</groupId>
    <artifactId>project-name</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <packaging>jar</packaging>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>

The opening lines are typical boilerplate for Maven projects. The three tags groupId, artifactId, and version uniquely identify the code by group, project, and point in time respectively, and all three are required. More on this trio later. The packaging tag specifies the output to be a jar (Java archive) file. The properties tags specify that the Java version to use is 1.7 (aka JDK 7), which must be installed on the system and available on the PATH. Onward:

    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.java</include>
                </includes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.17</version>
            </plugin>
        </plugins>
    </build>

The directory tag tells Maven where to find the base of the project’s source code. The includes block, with its single wild-carded include pattern, says to include all files ending in .java in subdirectories of that base directory. This is particularly nice because you’ll be able to avoid incremental build-file changes when adding code (i.e., your Makefile copy-and-paste days are over!).

The plugins section specifies a basic set of plugins; I’ve chosen maven-surefire-plugin to run the JUnit tests. Notice how the trio groupId, artifactId, and version appears together again. This is a good place to highlight Maven’s organizational strengths: no matter what happens to the maven-surefire-plugin project, version 2.17 works for our project for now, and Maven gives us this fine-grained control over the versions of software our project uses.

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.1</version>
            <scope>test</scope> 
        </dependency>
    </dependencies>
</project>

Last, we need to include the JUnit dependency. By the way, if you think you don’t need unit tests for your very simple program, you are most likely wrong.

Let’s move onward briefly to source code and testing. Instead of going into details about a particular sorting algorithm, let’s pretend we wrote one and just use the java.util.Arrays sorting algorithm instead. Here is MySort.java:

package com.yourname.projectname.algorithms;
import java.util.Arrays;
public class MySort {
    public static void arraySort(int a[]) {
        Arrays.sort(a);
    }
}

This file lives in src/main/java/com/yourname/projectname/algorithms (note that hyphens are fine in Maven coordinates but not legal in Java package names, so drop them in the package path). The MySortTest.java test class can look like:

package com.yourname.projectname.algorithms;
import static org.junit.Assert.assertTrue;
import org.junit.Test;
public class MySortTest{
    @Test
    public void testMySort() {
        int arr[] = {4,3,2,5,7};
        MySort.arraySort(arr);
        for (int i = 0; i < arr.length-1; i++) {
            assertTrue("Elements not in sorted order", arr[i] <= arr[i+1]);
        }
    }
}

This class lives in src/test/java/com/yourname/projectname/algorithms (note the directory includes test, as opposed to main for the MySort.java source file). The @Test annotation tells JUnit that this function is a test to run.

Then run Maven to build and test: mvn clean install (within the same directory as the pom.xml).
This places the compiled main and test source code into the target directory. Classes from the main tree end up in the classes directory, test classes in the test-classes directory, and the results from the tests are logged in the surefire-reports directory.

The great thing about this setup is that you can keep adding directories and classes. For example, you can create a datastructures directory with a HashTable.java class at src/main/java/com/yourname/projectname/datastructures and its test classes at src/test/java/com/yourname/projectname/datastructures, and Maven will automatically find and run the tests! The Maven command that compiles and runs the tests is mvn test (no need to do a clean install every time).

References:

Personal thanks to Joe Lust for helping me get started (see lustforge.com and the newly minted joelust.com)

http://maven.apache.org/pom.html

http://maven.apache.org/surefire/maven-surefire-plugin/examples/junit.html

Appendix:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.your-name</groupId>
    <artifactId>project-name</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <packaging>jar</packaging>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>
    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.java</include>
                </includes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.17</version>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.1</version>
            <scope>test</scope> 
        </dependency>
    </dependencies>
</project>

Python Decorators and Exception Handling

On a recent project, I was using Python Flask (behind Apache/WSGI) to answer REST calls and access data in a relational database.  Both MySQL and PostgreSQL datastores were supported.  Following the DRY (Don’t Repeat Yourself) principle, I tried to re-use database access code as much as possible, but often I had to create yet another method with a tweak in the query logic.  This led to the traditional try-catch-finally blocks appearing in many access methods to ensure things sailed smoothly on any exceptions from malformed REST calls or other unforeseen inconsistencies.

This seems tolerable at first, but when improving how things are logged or how the locking mechanism works, I quickly felt the pain.  I finally found the time to learn how decorators can help, and decided to share it!

Here is an example of the old way:

def simple_select_all_query(conn, table_str):
    retval = None
    cursor = conn.cursor()
    try:
        cursor.execute("BEGIN")
        cursor.execute("select * from " + table_str)
        retval = cursor.fetchall()
        cursor.execute("COMMIT")
    except Exception, e:
        cursor.execute("ROLLBACK")
        print "Exception in simple_select_all_query():", str(e)
    finally:
        cursor.close()
    return retval

and is called via

result = simple_select_all_query(conn, table_str)

Making additional database access functions is a pain with this approach.  Instead, try a decorator (defined further below): remove all the try-catch-finally mess, and change the function’s first argument to be a cursor instead of a connection.

@database_query_wrapper_as_func
def simple_select_all_query(cursor, table_str):
    cursor.execute("select * from " + table_str)
    return cursor.fetchall()

The database query wrapper decorator can be a function or a class.  I’ll show you both. As a function it can look like:

def database_query_wrapper_as_func(func):
    def new_func(conn, *args, **kwargs):
        retval = None
        cursor = conn.cursor()
        try:
            cursor.execute("BEGIN")
            retval = func(cursor, *args, **kwargs)
            cursor.execute("COMMIT")
        except Exception, e:
            cursor.execute("ROLLBACK")
            print "Exception in %s():" % func.__name__, str(e)
        finally:
            cursor.close()
        return retval

    return new_func

The new function created here contains all the exception handling instead.  It accepts a database connection, creates a cursor, and calls the wrapped function inside the try block.  For anyone new to *args and **kwargs: they are Python’s way of passing the remaining arguments through to the wrapped function.

It is important to emphasize that the function is called the same way as above, but the function definition now takes a cursor object as its first argument instead of a connection object.
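To make the pattern concrete, here is a runnable end-to-end sketch against the standard library’s sqlite3 (shown in Python 3 syntax, and committing via the connection rather than explicit BEGIN/COMMIT statements to suit sqlite3’s transaction handling):

```python
import sqlite3

def database_query_wrapper(func):
    """Hand the wrapped function a cursor; commit, roll back, and clean up."""
    def new_func(conn, *args, **kwargs):
        cursor = conn.cursor()
        try:
            retval = func(cursor, *args, **kwargs)
            conn.commit()
            return retval
        except Exception as e:
            conn.rollback()
            print("Exception in %s(): %s" % (func.__name__, e))
            return None
        finally:
            cursor.close()
    return new_func

@database_query_wrapper
def select_all(cursor, table_str):
    cursor.execute("select * from " + table_str)
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("create table phonebook (name text, number text)")
conn.execute("insert into phonebook values ('Ryan', '5551234')")
rows = select_all(conn, "phonebook")  # [('Ryan', '5551234')]
```

Querying a table that doesn’t exist exercises the rollback path and returns None instead of raising.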

As promised, here is the wrapper as a class that has the same functionality:

class database_query_wrapper_as_class(object):
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, type=None):
        return self.__class__(self.func.__get__(obj, type))
    def __call__(self, conn, *args, **kw):
        cursor = conn.cursor()
        try:
            cursor.execute("BEGIN")
            retval = self.func(cursor, *args, **kw)
            cursor.execute("COMMIT")
        except Exception, e :
            cursor.execute("ROLLBACK")
            retval = None
        finally:
            cursor.close()
        return retval

This has the advantage of being a little more organized, but it’s largely a style choice in my opinion. The function being decorated with this class should have @database_query_wrapper_as_class above its definition.

If your program is object-oriented, the decorators above can wrap class methods without any modification. First, the call to a wrapped method:

    self.create_table(self.conn, table_str, schema_str)

and the full method is a single line!

    @database_query_wrapper_as_class
    def create_table(self, cursor, table_str, schema_str):
        cursor.execute("create table " + table_str + schema_str)

It’s a little strange to pass a class object to another class method, but it’s required with this decorator approach. I suppose you could have the wrapper as part of the class, but that topic may be for the next blog as this is more than I had planned to write.

As I mentioned earlier, I wanted to improve the logging from plain print statements to Python’s logging package. Here is a snippet that could be placed into the above except block. This would be a pain to copy and paste into all those functions without decorators!

    ...
    except Exception, e:
        logging.warning('Exception in %s' % self.func)
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(e).__name__, e.args)
        logging.exception(message)
    ...

Lastly, I expanded the database-query wrapping approach to handle exceptions in other functions more generally, with this “general purpose” wrapper. It just passes any number of arguments through to the wrapped function.

class general_function_handler(object):
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, type=None):
        return self.__class__(self.func.__get__(obj, type))
    def __call__(self, *args, **kwargs):
        try:
            retval = self.func(*args, **kwargs)
        except Exception, e :
            ... # logging
            # TODO filter which exceptions are fatal and non-fatal for your purpose:
            # Fatal: AttributeError, KeyError, NameError, ProgrammingError, etc.
            sys.exit(1) # Exit on all exceptions for now
        return retval

There are many variants and opinions on how you can do this, but here is a simple place to start.
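One refinement worth noting: wrappers like these replace the wrapped function’s name and docstring with the wrapper’s. functools.wraps fixes that. A sketch in Python 3 syntax, logging and re-raising rather than exiting, so the caller decides what is fatal (the parse function below is a made-up example):

```python
import functools
import logging

def general_function_handler(func):
    """Log any exception from the wrapped function, then re-raise it."""
    @functools.wraps(func)  # keep func.__name__ and __doc__ on the wrapper
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            logging.exception("An exception of type %s occurred in %s",
                              type(e).__name__, func.__name__)
            raise
    return wrapper

@general_function_handler
def parse_listing_count(raw):
    """Parse a listing-count field that should be an integer."""
    return int(raw)

# Thanks to functools.wraps, the wrapper is transparent to introspection.
assert parse_listing_count.__name__ == "parse_listing_count"
```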

References (apologies if I left any out or misinterpreted your post!):

http://www.kylev.com/2009/05/22/python-decorators-and-database-idioms/

http://stackoverflow.com/questions/9823936/python-how-do-i-know-what-type-of-exception-occured

http://blog.dscpl.com.au/2014/01/how-you-implemented-your-python.html


ryan = new DataEngineeringBlogger()

I wonder if I should’ve used a factory method instead :-/

This is my first post on my new life out West.  For the last 6 years, I have been living in Boston working as a Network Scientist / Software Engineer / BBNer at Raytheon BBN Technologies (BBN) doing cutting-edge research for DARPA, or in Blacksburg, VA finishing up my PhD at Virginia Tech.  The opportunity: join Insight Data Engineering and the world’s leading tech innovators in the San Francisco Bay Area.  A side bonus is the escape from next winter’s series of polar vortices (yes, the plural is necessary if you’ve lived in Boston for the half of the year known as winter).

The opportunity to learn from the leaders of data engineering was too great to pass up, and I’ll have several weeks to develop my own “big data” project under my own direction, with whatever technologies I want, to demonstrate that I can write solid, maintainable, and usable code.  Awesome indeed.  Leaving BBN and Boston was a tough decision, but this is going to be a fun experience.

More to come as I’ll try to blog my experience and the technologies I encounter…