Data Engineering: From Basic Prototype to Production

Designing a scalable, production-worthy system is a huge challenge. It requires balancing extensible code that is fast enough for production against the time pressure of hitting milestones, while avoiding premature optimization and dirty hacks. Here are the ways I like to think about developing a scalable system:

  1. Write a prototype as a single program.  It’s important to get something working end-to-end for a sanity check.
  2. Refactor the prototype code into classes, parameterize where necessary, and write unit tests for each class (use test-driven development!).  Refactoring should involve abstracting away things like DB connections and queries, and making things configurable.
  3. Re-test end-to-end functionality, and create integration tests.  All of this test writing gives us confidence later when making any changes.
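
To make step 2 concrete, here is a minimal sketch of what the refactor might look like. Everything in it is an assumption made for illustration: the `Config` and `EventStore` names, the `events` table, and the use of an in-memory SQLite database as a test double are hypothetical and not taken from any particular project.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Config:
    """Settings that were hard-coded in the prototype (hypothetical names)."""
    table: str = "events"
    min_score: float = 0.5


class EventStore:
    """Hides the DB connection and the SQL so callers never touch either."""

    def __init__(self, conn, config):
        self.conn = conn
        self.config = config

    def high_scores(self):
        # The table name comes from trusted config; the threshold is a bound parameter.
        query = (
            f"SELECT name, score FROM {self.config.table} "
            "WHERE score >= ? ORDER BY name"
        )
        return self.conn.execute(query, (self.config.min_score,)).fetchall()


def make_test_store():
    """Build a store backed by an in-memory database, for use in unit tests."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (name TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("a", 0.9), ("b", 0.1), ("c", 0.7)],
    )
    return EventStore(conn, Config())


def test_high_scores():
    # Only rows at or above the configured threshold should come back.
    store = make_test_store()
    assert store.high_scores() == [("a", 0.9), ("c", 0.7)]
```

Because the connection is passed in rather than created inside the class, the same `EventStore` could be pointed at a production database or at a throwaway test database without changing its code.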

At this point, you have some decent-looking code and a nice test suite, but it is not obvious how to scale from here.  You might spot some places where parallelizing execution of your code could help. However, retrofitting parallelism onto code that was not designed for it is tricky, and testing and debugging can be difficult because the number of state combinations that must be right for the program to work grows exponentially. I prefer using what I’ll call a workflow framework.

Let’s define a workflow framework as a tool or set of tools that separates execution into multiple tasks that run simultaneously, with finer-grained control and visibility over each task. As an example, think about submitting a large number of small tasks to a job server like Gearman, or publishing jobs to a queue using Redis.
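
As a rough illustration of the idea, the sketch below submits many small tasks to a queue and lets a pool of workers pick them up, recording a status per task. It is a stand-in only: it uses Python's standard-library `queue.Queue` and threads instead of Gearman or Redis, and `run_tasks` is a hypothetical helper name.

```python
import queue
import threading


def run_tasks(task_fn, inputs, num_workers=4):
    """Run task_fn over inputs on worker threads.

    Returns {input: (status, value)} so every task is visible on its own.
    Inputs are assumed to be hashable and not None.
    """
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            try:
                outcome = ("ok", task_fn(item))
            except Exception as exc:
                # A failing task is recorded; it does not stop the others.
                outcome = ("failed", repr(exc))
            with lock:
                results[item] = outcome
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in inputs:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for thread in threads:
        thread.join()
    return results
```

With a real job server, the queue and the workers would live in separate processes or machines, but the shape stays the same: small independent tasks go in, and a per-task status comes out.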

This greatly helps software development because each piece can be developed and tested individually with ease. Failures in production can be isolated to a single task, which can then be restarted automatically. This isolation also allows different tools and languages to be used for different tasks; imagine submitting tasks from user input in a Python web layer and having workers process the requests in a lower-level language like C for performance reasons.
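
One way to picture the automatic restart is a small retry wrapper around a single task. This is only a sketch under assumptions: `run_with_retries` and the `max_attempts` parameter are hypothetical, and a real job server would typically handle retries itself, often with a delay between attempts.

```python
def run_with_retries(task_fn, payload, max_attempts=3):
    """Run one task, restarting it on failure up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            value = task_fn(payload)
            return {"status": "ok", "attempts": attempt, "value": value}
        except Exception as exc:
            # The failure stays inside this task; remember it and try again.
            last_error = exc
    return {
        "status": "failed",
        "attempts": max_attempts,
        "error": repr(last_error),
    }
```

Since the failure is confined to one task and one payload, the report that comes back says exactly which unit of work broke and how many times it was tried.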

Lastly, you will find that if you isolated your classes well and have a good test suite, it will not be too difficult to move logic between different frameworks. Much of the code can be moved into worker classes that listen to a job server (Gearman) or a Redis queue.
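
A minimal sketch of such a worker class follows, assuming the job source is hidden behind a tiny interface with a single `next_job()` method. The `InMemoryJobSource` and `Worker` names are hypothetical; a Gearman- or Redis-backed source would need to expose the same method, and its client calls are deliberately not shown here.

```python
from collections import deque


class InMemoryJobSource:
    """Test double for a job server or queue; hands out jobs in FIFO order."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)

    def next_job(self):
        # Returns None when there is nothing left to process.
        return self._jobs.popleft() if self._jobs else None


class Worker:
    """Pulls jobs from any source with next_job() and applies the handler."""

    def __init__(self, source, handler):
        self.source = source
        self.handler = handler

    def run_until_empty(self):
        results = []
        while True:
            job = self.source.next_job()
            if job is None:
                break
            results.append(self.handler(job))
        return results


def word_count(text):
    """Example business logic, unchanged from the refactored prototype."""
    return len(text.split())
```

The handler knows nothing about where jobs come from, so moving to another framework would mean writing a new source class while the tested logic stays put. For example, `Worker(InMemoryJobSource(["a b", "c"]), word_count).run_until_empty()` processes both jobs in order.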

This post provides a high-level methodology for going from a rough prototype that addresses a real use case to a maintainable and scalable system in production.