Insight Data Engineering First Two Weeks


(I’m definitely going to get this shirt from

The first two weeks of the inaugural Insight Data Engineering Fellows Program have been really fun.  We have met with people with experience at Facebook, Jawbone, LinkedIn, Databricks, Datastax, Netflix, Twitter, Yammer, Intuit, Apple, and others that I’ve momentarily forgotten (and we’ll meet with many more as the data engineering program starts to do more company visits).  At a high level, they have all shared how they found stories hidden in their data.  I was blown away by how many ways data is used to help solve real problems (other than finding cat videos on YouTube).

I’ll share a few interesting use cases.  I’ll leave out the social networking graph analytics angle on big data as that is an obvious (and still very powerful) use case.

Analytical company roadmapping:  Are the products you or your company are focusing on providing the highest ROI?  What would a PDF (probability density function) of all your users versus some usage dimension look like?  One company showed us how such a plot saved the company by redirecting the ship toward areas closer to what the bulk of their users were actually doing.
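For the curious, the plot in question is essentially a normalized histogram. A minimal sketch in Python, with made-up session counts standing in for whatever usage dimension you care about:

```python
from collections import Counter

# Hypothetical data: sessions per week for each user
# (a stand-in for any usage dimension you might plot).
sessions_per_user = [0, 1, 1, 2, 2, 2, 3, 5, 8, 2, 1, 0, 2, 3, 2]

# Bucket users by the usage dimension, then normalize so the mass sums
# to 1: an empirical PDF of users versus sessions per week.
counts = Counter(sessions_per_user)
total = len(sessions_per_user)
pdf = {sessions: n / total for sessions, n in sorted(counts.items())}

for sessions, p in pdf.items():
    print(f"{sessions} sessions/week: {p:.2f}")
```

If the bulk of the mass sits somewhere your roadmap isn’t, that one plot is worth a lot of meetings.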

Large-scale A/B testing:  How do you know if what you are building will work better or be used more?  Multiple if not all of the companies mentioned the power of deploying A/B tests for performance analysis and new UI testing to answer these questions. (See how people-you-may-know came about at LinkedIn, which was an inadvertent A/B test.)
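None of the companies shared their actual testing code, but the statistics behind a basic A/B test fit in a few lines. A toy two-proportion z-test in Python, with entirely hypothetical conversion numbers:

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-statistic for an A/B test (toy sketch).

    conv_*: number of conversions, n_*: number of users in each arm.
    Roughly, |z| > 1.96 means significant at the 5% level.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical numbers: variant B converts 5.5% vs. 5.0% for control A.
z = ab_test_z(500, 10000, 550, 10000)
print(f"z = {z:.2f}")
```

With these particular made-up numbers the lift isn’t significant yet, which is exactly why the big shops run these tests at scale.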

Logical engineering bugs: One company noticed that adoption was lagging in a foreign country, and upon drilling down using big data tools, it discovered that a critical page in the sign-on flow was in the wrong language.  They would not have known exactly where to look for the problem without clear organization of the data pointing directly at the logical bug.

On the engineering tools side, we had a two-week crash course in the Hadoop stack, Cassandra, and Spark.  Although I was familiar with some of the tools, I learned many of the finer points of how these systems work and how to get them working together.  Here are some of the more humorous remarks that either I or other engineering fellows made.

Hadoop: Why did my job take 30 seconds when I had 10 rows of text in my only table?  (Another fellow commented:) Oh yeah?  Well, wait until you try a join!

Spark: I thought this was a shared Spark cluster where everyone could run jobs simultaneously?  Someone is capturing 1% of all Twitter feeds and hogging 63 gigs of memory?

Cassandra: So you’re telling me that, eventually, my data will be consistent? Hmpf.
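The eventual-consistency grumbling has a real knob behind it: Cassandra lets you tune consistency per operation. Here’s a toy Python simulation (not the real driver API, just the idea) of the R + W > N rule with three replicas:

```python
# Toy model: with N replicas, a write acknowledged by W of them and a
# read that polls R of them is guaranteed to overlap whenever R + W > N.
N = 3
replicas = [{"value": "old", "ts": 0} for _ in range(N)]

def write(value, ts, w):
    # The write is acknowledged once w replicas have it; the rest lag behind.
    for r in replicas[:w]:
        r.update(value=value, ts=ts)

def read(r_count):
    # Poll r_count replicas (deliberately the ones most likely to lag here)
    # and keep the value with the newest timestamp.
    polled = replicas[-r_count:]
    return max(polled, key=lambda r: r["ts"])["value"]

write("new", ts=1, w=2)   # QUORUM-style write: 2 of 3 replicas
print(read(1))            # ONE: polls only the lagging replica -> "old"
print(read(2))            # QUORUM: 2 + 2 > 3, must see the write -> "new"
```

With a QUORUM write and a QUORUM read, the read set is guaranteed to overlap a replica that saw the write; read at ONE and you take your chances. So yes, eventually consistent, but the “eventually” is tunable.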

We’ve been thrown into the proverbial deep end with all these data engineering tools.  Over the next few weeks, we will each solve a data engineering problem with components including batch jobs, streaming, and (external) query serving.  I’ll blog more about that as my project in organizing real-estate data progresses.