Cython Performance Case Study

I’ve been learning different programming languages since the early 2000s. It can be tempting to cling to one language and use it no matter what. There are advantages to mastering a single language, at both a personal and a company level. However, the disadvantages include:

(1) you miss out on new languages and their inherent strengths,

(2) you become reliant on that one language’s support for new tools,

(3) your build and testing infrastructure is stuck in one language (I really like Makefiles because they work with all languages!),

(4) you miss how using multiple languages can guide your implementation toward clearly defined modules, and

(5) not learning new languages can limit us as programmers (see Nathan Marz’s blog, which suggests learning new languages makes us better programmers).

I started programming with C and C++, but higher-level languages like Python are attractive in terms of code density and readability, which in my experience greatly help maintainability. Many Java and C++ software engineers dismiss Python as a production-grade language because it is much slower and is not statically typed. (Oftentimes they dismiss each other as well :-D, but that’s another topic.)

I wanted to do a short case study on sorting unique numbers in linear time using bit vectors in each of these languages. I wanted to see how much slower Python would be than the statically typed languages for bit manipulation and, as the title alludes to, compare all of these to Cython.

The precise task: sort a file of unique numbers and write the numbers out to a file in sorted order. I plan to use a bit vector built from integers, setting the bit whose position corresponds to each number’s value. Once every number is represented by one bit, I can loop through each bit of the bit vector and print the numbers out in order.
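To make the idea concrete before diving into the per-language implementations, here is a minimal pure-Python sketch (the function name and list-of-words representation are mine, not part of the benchmarked code):

```python
# Sketch of the bit-vector sort described above: record each unique
# number n by setting bit (n % 32) of word (n // 32), then walk the
# bits in ascending order to emit the numbers sorted.
INT_SIZE = 32

def bitvector_sort(numbers, max_value):
    bv_size = max_value // INT_SIZE + 1
    bv = [0] * bv_size                          # bit vector as a list of words
    for n in numbers:
        bv[n // INT_SIZE] |= 1 << (n % INT_SIZE)  # set the bit for n
    out = []
    for idx in range(bv_size):
        if bv[idx] == 0:                        # skip empty words
            continue
        for bit in range(INT_SIZE):
            if bv[idx] & (1 << bit):
                out.append(idx * INT_SIZE + bit)  # bit set => number present
    return out
```

For example, `bitvector_sort([5, 1, 99, 42], 100)` returns `[1, 5, 42, 99]`. Note this only works because the inputs are unique; duplicates would collapse into a single bit.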

For me it was natural to think about implementing this algorithm in C. Here is a link to the full 32-bit based implementation. The important snippets below:

// bit vector
uint32_t *bvp = (uint32_t*)malloc(bv_size*sizeof(uint32_t));

// clear bit vector
for (idx = 0; idx < bv_size; idx++) {
  bvp[idx] = 0;
}

...

// ---- reads file and sets bits in bit vector ----
while (fscanf(ifp, "%d", ph_num) != EOF) {

  idx = *ph_num/32;    // calc which int in bv to set
  bit = *ph_num % 32;    // calc which bit in int to set

  comp_int = 1;
  comp_int<<=bit;
  *(bvp + idx) = (*(bvp + idx) | comp_int); // set bit
}

// ---- outputs sorted numbers in file ----
for (idx = 0; idx < bv_size; idx++) {

  comp_int = 1;
  if (*(bvp + idx) > 0) { // small optimization
    for (bit = 0; bit < 32; bit++) {
      if (comp_int & *(bvp + idx))
        fprintf(ofp,"%d\n",idx*32+bit);
      comp_int<<=1;
    }
  } 
}

Then, I wrote this in Python. I had never cared deeply about an integer’s bit length before in Python, so my first attempt was to use NumPy’s number types. Here is a link to the full Python implementation. The important snippets below:

bit_vector = numpy.zeros(bv_size, dtype=numpy.uint32)

for idx in xrange(bv_size):
  bit_vector[idx] = 0
    
# reading from file
with open('unsorted_phonebook.txt','r') as ifh:
  for line in ifh:
    ph_num = int(line)
    bit = ph_num % int_size
    idx = ph_num / int_size
    comp_int = numpy.uint32(1)
    comp_int <<= bit
    bit_vector[idx] |= comp_int

with open('sorted_phonebook_python.txt','w') as ofh:
  for idx in xrange(bv_size):
    comp_int = numpy.uint32(1)
    if bit_vector[idx] > 0: # small optimization
      for bit in xrange(int_size):
        if comp_int & bit_vector[idx]:
          ofh.write(str(idx*32 + bit) + '\n')
        comp_int <<= 1
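As an aside, Python’s arbitrary-precision integers make it possible to treat one single int as the entire bit vector, which shrinks the pure-Python version even further (a sketch of an alternative, not the benchmarked implementation; the function name is mine):

```python
def sort_with_big_int(numbers):
    # One arbitrary-precision int serves as the whole bit vector.
    bv = 0
    for n in numbers:
        bv |= 1 << n            # set the bit at position n
    out = []
    pos = 0
    while bv:
        if bv & 1:
            out.append(pos)     # bit set => pos was in the input
        bv >>= 1
        pos += 1
    return out
```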

Error handling for opening the file comes free with the with statement, showing off Python’s nice code density, but of course it’s slow (how much slower, we’ll see below). Here is a link to the full Cython implementation:

def sort_phonebook():
    
    DEF bv_size = 312500 # 10M / 32 supports 0 - 9999999 (or ph:999-9999)
    DEF int_size = 32

    cdef int bit
    cdef int idx
    cdef int ph_num
    cdef unsigned int comp_int
    cdef unsigned int bit_vector_i

    cdef unsigned int bit_vector[bv_size]

    for idx in xrange(bv_size):
        bit_vector[idx] = 0
    
    # reading from file
    with open('unsorted_phonebook.txt','r') as ifh:
         for line in ifh:
             ph_num = int(line)
             bit = ph_num % int_size
             idx = ph_num / int_size
             comp_int = 1
             comp_int <<= bit
             bit_vector[idx] |= comp_int

    with open('sorted_phonebook_cython.txt','w') as ofh:
         for idx in xrange(bv_size):
             comp_int = 1
             if bit_vector[idx] > 0: # small optimization
                 for bit in xrange(int_size):
                     if comp_int & bit_vector[idx]:
                         ofh.write(str(idx*int_size + bit) + '\n')
                     comp_int <<= 1    

Notice how the Cython code is not very different from the Python code! Cython will need to compile this .pyx file before Python can call the function. The setup file is here:

from distutils.core import setup
from Cython.Build import cythonize
setup(
    ext_modules = cythonize("sort_phonebook_cython.pyx")
)

And then execute the setup with the following command:

python setup_cython_phonebook_sort.py build_ext --inplace

Now we can call this function in another Python program with:

import sort_phonebook_cython
sort_phonebook_cython.sort_phonebook()
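The driver script reports wall-clock times for each implementation; a minimal sketch of how run_all_tests.py might do that (the helper name time_call is mine):

```python
import time

def time_call(label, fn):
    # Wall-clock timing, matching the "seconds to sort in ..." lines below.
    start = time.time()
    fn()
    elapsed = time.time() - start
    print('%s seconds to sort in %s' % (elapsed, label))
    return elapsed

# e.g. time_call('cython', sort_phonebook_cython.sort_phonebook)
```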

For more comprehensive benchmarking, I wrote this algorithm in Java as well. For brevity, see the Java implementation here.

On to the performance results. I sorted 6- and 7-digit numbers with these programs.
The results for 6-digits:


$ make
python create_phonebook_file.py
gcc sort_phonebook.c -o sort_phonebook_in_c
javac SortPhonebook.java
python setup_cython_phonebook_sort.py build_ext --inplace
running build_ext
python run_all_tests.py
0.608454942703 seconds to sort in c
1.06622695923 seconds to sort in cython
12.8049960136 seconds to sort in python
1.31410098076 seconds to sort in java
diff sorted_phonebook_c.txt sorted_phonebook_cython.txt
diff sorted_phonebook_c.txt sorted_phonebook_python.txt
diff sorted_phonebook_c.txt sorted_phonebook_java.txt

The results for 7-digits (I excluded the creation of the file since this is really slow in Python with my current implementation):

$ make run_tests
python run_all_tests.py
4.33113193512 seconds to sort in c
10.2070810795 seconds to sort in cython
130.906479836 seconds to sort in python
10.8574941158 seconds to sort in java
diff sorted_phonebook_c.txt sorted_phonebook_cython.txt
diff sorted_phonebook_c.txt sorted_phonebook_python.txt
diff sorted_phonebook_c.txt sorted_phonebook_java.txt

A few points:

(1) Confirmed (anecdotally) that the runtime growth of these functions is roughly linear: a 10x growth in the number of numbers (1M to 10M, going from 6-digit to 7-digit) produced roughly a 10x growth in runtime.

(2) The Python implementation was 20x slower than C for sorting 1M numbers and 30x slower than C for sorting 10M numbers

(3) Cython edged Java for sorting 1M numbers, but both implementations were 2-2.5x slower than C.

(4) The source line counts are 25 source lines of code (SLOC) for Python, 35 SLOC for Cython, 51 SLOC for Java, and 56 SLOC for C. One could argue that this count is itself a performance metric, in terms of developer time.

There you have it. Try Cython if you have a computationally expensive function in Python (or one with high big-O constants, as in this evaluation) that you’d like to avoid rewriting in C or Java. But the ultimate performance king is still C.
