Tuesday, December 24, 2013

Hello Hive!

New year....new job!

So the next chapter is in ad-tech and big data. Until a few weeks ago I had only heard of Hadoop and Hive from a couple of articles I had read. I wasn't really sure what MapReduce even meant, other than:

"MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster" - Wikipedia

.....

umm right.....

Fast forward to now. After some more reading, the above statement is much clearer. Hadoop basically cuts a "job" up, and a master node tells a bunch of worker nodes which part of the data to work on and how to work on it. This is all abstracted away from the programmer, however: once Hadoop is configured, the programmer just runs the job and Hadoop handles the distribution.
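To make the "cut the job up" idea concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps as a word count. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the "workers" are just iterations of a loop.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    pairs = []
    # On a real cluster each chunk would be handed to a different worker node.
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["i am the best", "i am"]))
```

The point of the split is that map_phase only ever sees one chunk, so the chunks can be processed independently before the results are brought together.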

So what's Hive? Hive makes Hadoop SQL-esque. It's an abstraction over Hadoop that lets programmers reuse their SQL skills instead of having to write Hadoop MapReduce classes all day long.

Anyway, on to an example!

The source is available @ https://github.com/asharif/hello_hive

You need to install Hadoop and Hive and have both available on the PATH.

There are only two files here: hello_hive.q and hello_hive.py.

hello_hive.q

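-- One STRING column, one row per line of the input file.
CREATE TABLE words (word STRING) ROW FORMAT DELIMITED LINES TERMINATED BY '\n' STORED AS TEXTFILE;
-- Change this path to wherever your input file lives.
LOAD DATA LOCAL INPATH '/Users/asharif/development/hello_hive/input' OVERWRITE INTO TABLE words;

-- Make the Python script available to the job.
ADD FILE hello_hive.py;

-- Stream every row through the script and write the results back into the table.
INSERT OVERWRITE TABLE words
SELECT
TRANSFORM (word) USING 'python hello_hive.py'
AS (word)
FROM words;

SELECT * FROM words;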

The above Hive script creates a table backed by a plain text file and loads it from the file at the given path (don't forget to change that path!). It uses '\n' as the row delimiter, so each line of the file becomes a row in the table.

Then the fun part. A simple Python script is added to the session, and every row of the table is piped through it. The script's output then overwrites the contents of the table.

hello_hive.py


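import sys

# Hive pipes each row of the table to stdin, one row per line.
for line in sys.stdin:
    # `line` keeps its trailing newline and print adds one more.
    # The parentheses make this run under both Python 2 and Python 3.
    print("hive row: " + line)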

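As you can see, all the Python script does is prepend the text "hive row: " to each row. Note that each line read from stdin still ends with its newline and print adds another, so every input row comes back as two rows: the prefixed word and an empty one.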
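If the blank row after each word (visible in the output further down) isn't wanted, one possible variant strips the newline before printing. This is only a sketch and not what the repository contains; the helper name prefix_rows is made up here so the logic can be tried without Hive.

```python
import sys


def prefix_rows(lines):
    """Yield each input line with its trailing newline removed and the prefix added."""
    for line in lines:
        yield "hive row: " + line.rstrip("\n")


if __name__ == "__main__":
    # print supplies exactly one newline per row, so no empty rows are emitted.
    for row in prefix_rows(sys.stdin):
        print(row)
```

With this version each input row should come back as exactly one output row.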

Now if you run this

hive -f hello_hive.q


you will get the following output:


hive row: i

hive row: am

hive row: the

hive row: best

hive row: i

hive row: am

Time taken: 0.098 seconds, Fetched: 12 row(s)
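
The 12 fetched rows for six prefixed words can be reproduced without a cluster. The sketch below only mimics the stdin/stdout contract that TRANSFORM uses (one row per line in, one row per line out). It assumes Python 3 and that the input file holds the six words shown above. SCRIPT is an inline stand-in for hello_hive.py and run_transform is a made-up helper, not part of Hive.

```python
import subprocess
import sys

# Inline stand-in for hello_hive.py so the sketch runs on its own.
SCRIPT = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('hive row: ' + line)\n"
)


def run_transform(rows):
    """Feed rows to the script one per line and read its stdout back as rows."""
    payload = "".join(row + "\n" for row in rows)
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT],
        input=payload,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Every line of output counts as a row, including the empty ones.
    return result.stdout.splitlines()


if __name__ == "__main__":
    out = run_transform(["i", "am", "the", "best", "i", "am"])
    for row in out:
        print(repr(row))
    print("rows out:", len(out))
```

Each word goes in as one line and comes back as two, because the script prints the word's own newline plus the one print adds.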