
Insight Data Engineering First Two Weeks

[Image: funny big data T-shirt]

(I’m definitely going to get this shirt from Zazzle.com)

The first two weeks in the inaugural Insight Data Engineering Fellows Program have been really fun.  We have met with people with experience at Facebook, Jawbone, LinkedIn, Databricks, DataStax, Netflix, Twitter, Yammer, Intuit, Apple, and others that I’ve momentarily forgotten (and we’ll meet with many more as the data engineering program starts to do more company visits).  At a high level, they all shared with us how they found stories hidden in their data.  I was blown away by how many ways data is used to help solve real problems (other than finding cat videos on YouTube).

I’ll share a few interesting use cases.  I’ll leave out the social networking graph analytics angle on big data as that is an obvious (and still very powerful) use case.

Analytical company roadmapping:  Are the products you or your company focus on providing the highest ROI?  What would a PDF (probability density function) of all your users versus some usage dimension look like?  One company showed us how such a plot saved the company by redirecting the ship to work in areas more closely related to what the bulk of their users were doing.
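To make that concrete, here is a minimal sketch of the kind of distribution such a plot would show, assuming PDF means an empirical probability density (a normalized histogram). The class name, the bucket width, and the sample numbers are all made up for illustration; the usage dimension could be anything, such as sessions per week.

```java
import java.util.Arrays;

// Hypothetical sketch: an empirical distribution of users over one usage dimension.
class UsageHistogram {

    // Returns the fraction of users that fall into each bucket of width bucketWidth,
    // covering [0, bucketWidth * bucketCount). Values outside that range are clamped
    // into the first or last bucket.
    static double[] fractions(double[] usagePerUser, double bucketWidth, int bucketCount) {
        double[] result = new double[bucketCount];
        if (usagePerUser.length == 0) {
            return result; // no users, so every fraction stays 0
        }
        for (double usage : usagePerUser) {
            int bucket = (int) Math.floor(usage / bucketWidth);
            bucket = Math.max(0, Math.min(bucketCount - 1, bucket));
            result[bucket] += 1.0;
        }
        for (int i = 0; i < bucketCount; i++) {
            result[i] /= usagePerUser.length;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up data: weekly sessions for eight users.
        double[] sessions = {1, 2, 2, 3, 12, 14, 15, 40};
        // prints [0.5, 0.375, 0.0, 0.125]
        System.out.println(Arrays.toString(fractions(sessions, 10.0, 4)));
    }
}
```

In this toy data half of the users sit in the lowest bucket, which is the sort of shape that tells you where most of your users actually are.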

Large scale A/B testing:  How do you know if what you are building will work better or be used more?  Several if not all of the companies mentioned the power of deploying A/B tests for performance analysis and new UI testing to answer these questions. (For how people-you-may-know came about at LinkedIn through an inadvertent A/B test, see http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ )
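None of the companies told us which statistics they run on their tests, so take this as an illustrative sketch only: one common way to compare two variants is a two-proportion z-test on their conversion rates. The class name, method name, and counts below are hypothetical.

```java
// Hypothetical sketch: compare the conversion rates of variants A and B.
class AbTestSketch {

    // z statistic for the difference in conversion rates (B minus A),
    // using the pooled proportion for the standard error.
    // The result is NaN or infinite when nobody or everybody converted.
    static double zScore(long conversionsA, long usersA, long conversionsB, long usersB) {
        if (usersA <= 0 || usersB <= 0) {
            throw new IllegalArgumentException("each variant needs at least one user");
        }
        double rateA = (double) conversionsA / usersA;
        double rateB = (double) conversionsB / usersB;
        double pooled = (double) (conversionsA + conversionsB) / (usersA + usersB);
        double standardError =
                Math.sqrt(pooled * (1.0 - pooled) * (1.0 / usersA + 1.0 / usersB));
        return (rateB - rateA) / standardError;
    }

    public static void main(String[] args) {
        // Made-up counts: 2.0% of 10,000 users convert on A, 2.6% of 10,000 on B.
        double z = zScore(200, 10000, 260, 10000);
        // |z| > 1.96 corresponds to roughly the 5% level for a two-sided test.
        System.out.println("z = " + z + ", significant at 5%: " + (Math.abs(z) > 1.96));
    }
}
```

With these made-up counts z comes out at roughly 2.83, so the difference would count as significant at the 5% level. A real deployment would also have to worry about sample sizes and running many tests at once, which this sketch ignores.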

Logical engineering bugs: One company noticed that adoption growth was lagging in a foreign country, and upon drilling down with big data tools, they discovered that a critical page in the sign-on flow was in the wrong language.  They would not have known exactly where to look for the problem without clear organization of the data pointing to the logical bug.

On the engineering tools side, we had a two-week crash course in the Hadoop stack, Cassandra, and Spark.  Although I was familiar with some of the tools, I learned many of the finer points of how these systems work and how to get them working together.  Here are some of the more humorous points that either I or other engineering fellows made.

Hadoop: Why did my job take 30 seconds when I had 10 rows of text in my only table?  (another commented) Oh yeah, well wait until you try a join!

Spark: I thought this was a shared Spark cluster where everyone could run jobs simultaneously?  Someone is capturing 1% of all Twitter feeds and is hogging 63 gigs of memory?

Cassandra: So you’re telling me that eventually, my data will be consistent? Hmpf.

We’ve been thrown into the proverbial deep end with all these data engineering tools.  Over the next few weeks, we will each solve a data engineering problem with components including batch jobs, streaming, and (external) query serving.  I’ll blog more about that as my project on organizing real-estate data progresses.

Java, JUnit, and Maven Jumpstart

It’s great to keep up with the most sophisticated build tools and not get stuck doing a lot of manual configuration. I was stuck using make, which can be fine for simple things, but hopefully after reading this post, you’ll use Maven especially for the simple things.

I’ll skip the details of the install and configuration since it was quite a breeze. For OS X, I downloaded a Maven bin.tar.gz archive, and for my Debian-based systems, I installed it through apt-get.

The most important configuration file in Maven is the pom.xml, where “pom” stands for project object model. The pom.xml file bootstraps your project directory structure. I’ll share my pom.xml in pieces and start there.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.your-name</groupId>
    <artifactId>project-name</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <packaging>jar</packaging>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>

The opening lines include typical boilerplate for Maven projects. The three tags groupId, artifactId, and version are all required, and together they uniquely identify the code by group, project, and point in time respectively. More on this trio later. The packaging tag specifies that the build output is a jar (Java archive) file. The properties tags specify that the Java version to use is 1.7 (aka JDK 7); this JDK must be installed on the system and on the PATH system variable. Onward:

    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.java</include>
                </includes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.17</version>
            </plugin>
        </plugins>
    </build>

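The directory tag inside resources names a base directory, here the same src/main/java where Maven looks for the project’s source code by convention. The includes tag and its single wild-carded include select all files ending in .java in that directory and its sub-directories. Note that Maven compiles everything under src/main/java by default; a resources block like this one additionally copies the matching files into the build output. Either way, this is particularly nice because you avoid incremental build-file changes when adding code (i.e., your Makefile copy and paste days are over!).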

The plugins section lists the build plugins to use. I’ve chosen maven-surefire-plugin to run the JUnit tests. Notice how the trio groupId, artifactId, and version appear together again. This is a good place to highlight Maven’s organizational strengths. No matter what happens to the maven-surefire-plugin project, version 2.17 works for our project for now, and Maven gives us this fine-grained control over the versions of software our project uses.

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.1</version>
            <scope>test</scope> 
        </dependency>
    </dependencies>
</project>

Last, we need to include the JUnit dependency. By the way, if you thought you didn’t need unit tests for your very simple program, you are most likely wrong.

Let’s move onward briefly to source code and testing. Instead of going into details about a particular sorting algorithm, let’s pretend we wrote one and just use the java.util.Arrays sorting algorithm instead. Here is MySort.java:

package com.yourname.projectname.algorithms;

import java.util.Arrays;

// Hyphens are not legal in Java package names, so the your-name and
// project-name placeholders are written without them here.
public class MySort {
    // Sorts the array in place, delegating to the JDK implementation.
    public static void arraySort(int[] a) {
        Arrays.sort(a);
    }
}

This is in the directory src/main/java/com/yourname/projectname/algorithms. The MySortTest.java test class can look like:

package com.yourname.projectname.algorithms;

import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class MySortTest {
    @Test
    public void testMySort() {
        int[] arr = {4, 3, 2, 5, 7};
        MySort.arraySort(arr);
        // Every element must be no larger than its successor.
        for (int i = 0; i < arr.length - 1; i++) {
            assertTrue("Elements not in sorted order", arr[i] <= arr[i + 1]);
        }
    }
}

This class is in src/test/java/com/yourname/projectname/algorithms (note that the directory includes test, as opposed to main for the MySort.java source file). The @Test annotation tells JUnit that this function is a test to run.
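One caveat about the test: it only checks ordering, so a broken sort that overwrote every element with zero would still pass. Here is a sketch of a slightly stronger check that also verifies the output holds the same values as the input. It is written in plain Java so it runs without JUnit on the classpath, and the class and method names are hypothetical; inside a JUnit test you would wrap the call in assertTrue.

```java
import java.util.Arrays;

// Hypothetical helper: checks that a sort ordered the elements and kept all of them.
class SortCheck {

    // True when the array is in non-decreasing order.
    static boolean isSorted(int[] values) {
        for (int i = 0; i + 1 < values.length; i++) {
            if (values[i] > values[i + 1]) {
                return false;
            }
        }
        return true;
    }

    // True when 'after' is sorted and contains exactly the same values as 'before'.
    static boolean isSortedPermutation(int[] before, int[] after) {
        int[] expected = before.clone();
        int[] actual = after.clone();
        Arrays.sort(expected); // sorting both copies lets us compare them as multisets
        Arrays.sort(actual);
        return isSorted(after) && Arrays.equals(expected, actual);
    }

    public static void main(String[] args) {
        int[] input = {4, 3, 2, 5, 7};
        int[] output = input.clone();
        Arrays.sort(output); // stands in for MySort.arraySort(output)
        System.out.println(isSortedPermutation(input, output)); // prints true
    }
}
```

Keeping a copy of the input before sorting is the only extra step the test itself needs.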

Then run Maven to do the install and test: mvn clean install (within the same directory as the pom.xml).
This compiles the main source code and test source code into the target directory. You’ll see that the classes from the main directory are in the classes directory, and the test classes will be in the test-classes directory. The results from the tests will be logged in the surefire-reports directory.

The great thing about this setup is that you can create more directories and classes. For example, you can create a datastructures directory with a HashTable.java class alongside algorithms in the main source tree, with its test class in a matching datastructures directory in the test tree, and Maven will automatically find and run the test! The Maven command that compiles and runs the tests is mvn test (no need to do a clean install every time).
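To give a flavor of that second package, here is a minimal sketch of what a HashTable.java could contain. I only named the file above, so everything about the design (separate chaining, int keys and values, a fixed bucket count) and the method names is an assumption for illustration. In a real project the class would be public and start with a package declaration matching its directory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a tiny hash table using separate chaining.
class HashTable {
    // Each bucket holds {key, value} pairs whose keys map to the same index.
    private final List<List<int[]>> buckets = new ArrayList<List<int[]>>();

    HashTable(int bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<int[]>());
        }
    }

    // Maps a key (possibly negative) to its bucket.
    private List<int[]> bucketFor(int key) {
        int n = buckets.size();
        return buckets.get(((key % n) + n) % n);
    }

    // Inserts the pair, or overwrites the value if the key is already present.
    void put(int key, int value) {
        List<int[]> bucket = bucketFor(key);
        for (int[] entry : bucket) {
            if (entry[0] == key) {
                entry[1] = value;
                return;
            }
        }
        bucket.add(new int[] {key, value});
    }

    // Returns the stored value, or null when the key is absent.
    Integer get(int key) {
        for (int[] entry : bucketFor(key)) {
            if (entry[0] == key) {
                return entry[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        HashTable table = new HashTable(4);
        table.put(1, 10);
        table.put(5, 50); // lands in the same bucket as key 1 when there are 4 buckets
        table.put(1, 11); // overwrites the earlier value for key 1
        System.out.println(table.get(1) + " " + table.get(5) + " " + table.get(9)); // prints 11 50 null
    }
}
```

A matching HashTableTest.java in the test tree would then be picked up by mvn test without touching the pom.xml.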

References:

Personal reference to Joe Lust for helping me get started (see lustforge.com and newly minted joelust.com)

http://maven.apache.org/pom.html

http://maven.apache.org/surefire/maven-surefire-plugin/examples/junit.html

Appendix:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.your-name</groupId>
    <artifactId>project-name</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <packaging>jar</packaging>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>
    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <includes>
                    <include>**/*.java</include>
                </includes>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.17</version>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.1</version>
            <scope>test</scope> 
        </dependency>
    </dependencies>
</project>