Hadoop, trying it.

Full Project: https://github.com/tsolakp/hadoop-sample

There has been a lot of talk about Hadoop lately, but with very little explanation of what it actually is. Unfortunately, some of the tutorials have not been simple enough to give me the big picture of what this technology is all about.
After reading the comprehensive tutorial on Yahoo at http://developer.yahoo.com/hadoop/tutorial/ and trying things out myself, I want to share my observations.

What it is and what it is not.

First of all, Hadoop is simply a batch application. It is designed to process large amounts of data (for example, in the form of large data files) in parallel over a distributed system where each machine is called a node.

It is not a database, unlike the related Apache projects Hive and HBase.

It consists of two major parts. One is the distributed batch application that manages the jobs on each node, and the second is HDFS (the file system), which allows Hadoop to transparently distribute large files across nodes so that the job on each node can access and process the file or parts of it.

One probably obvious but crucial aspect of Hadoop is that it makes sure the processing code (i.e. the Mapper and Reducer Java classes) runs close to the data. This means that, unlike Spring Batch for example, it consists of multiple processes running on each node, controlled by a master (the “jobtracker” for MapReduce jobs, with the “namenode” playing the master role for HDFS).
This makes it difficult to debug, but fortunately you can run Hadoop jobs just like a plain Java “main” application for testing, before deploying to a “pseudo-distributed” (all nodes run on a single machine) or “fully distributed” (nodes are actually separate machines) environment.
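For illustration, here is a minimal sketch of making that local test mode explicit. The property names are the Hadoop 2.x ones (older releases used fs.default.name and mapred.job.tracker instead), so treat the exact keys as an assumption about your version:

import org.apache.hadoop.conf.Configuration;

public class LocalTestConf {
    public static void main(String[] args) {
        // With no cluster configuration on the classpath Hadoop already
        // defaults to local mode; setting the properties explicitly just
        // makes the intent obvious.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");          // local disk instead of HDFS
        conf.set("mapreduce.framework.name", "local"); // single-JVM LocalJobRunner
        // ...build and submit a Job with this conf, as in the WordCount
        // example further down.
    }
}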

HDFS

HDFS is a special file system, but unlike a regular Windows or Linux file system you cannot simply browse it with your plain file explorer. Think of it as an abstraction on top of the regular file systems that transparently lets you access a whole file as if it were local, even though it can physically be distributed across multiple machines.
It is also much faster at sequential reads than traditional file systems, but slower at random access, though this is not an issue since the whole MapReduce architecture is built around sequential reads.
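As a concrete illustration of that “access it like a local file” point, here is a minimal sketch using Hadoop's FileSystem API; the file path is a hypothetical placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // The Configuration picks up core-site.xml from the classpath;
        // fs.defaultFS there decides whether paths resolve to HDFS or local disk.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open and read the file exactly as if it were local, even though
        // its blocks may be spread across many datanodes.
        Path file = new Path("/user/me/input.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}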

FileInputFormat, RecordReader, FileOutputFormat, etc.

These just handle the reading, writing, and splitting of HDFS files during the different stages of a MapReduce job.
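Here is a minimal sketch of where these classes get wired into a job, using the newer org.apache.hadoop.mapreduce API; the paths are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format-example");
        // TextInputFormat (a FileInputFormat) splits the input files and
        // feeds each line to the Mapper through a LineRecordReader.
        job.setInputFormatClass(TextInputFormat.class);
        // TextOutputFormat writes each output pair as a "key<TAB>value" line.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // hypothetical path
        FileOutputFormat.setOutputPath(job, new Path("/output")); // must not exist yet
    }
}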

MapReduce

The Yahoo tutorial does an excellent job of explaining in detail how MapReduce works. Here is a nice diagram from that tutorial showing the full MapReduce flow:

[Diagram: the full MapReduce process, from the Yahoo tutorial]

I’ll just show the modified WordCount code with comments, to give a high-level explanation of how it works and how you can test it yourself.
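My modified version lives in the GitHub project linked below; as a stand-in here, this is a minimal sketch along the lines of the canonical WordCount example that ships with Hadoop (Job.getInstance is the Hadoop 2.x form; on 1.x you would use new Job(conf, ...)):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every line of input, emit (word, 1) for each word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: the framework groups values by key, so here we just sum
    // all the counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the Mapper and Reducer into a Job and submits it.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it as a plain Java “main” with two arguments, an input directory and an output directory. With no cluster configuration present, the whole map/shuffle/reduce pipeline executes in a single JVM, which is exactly the testing setup described above.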

Maven Project

Here is the GitHub link to the full Maven project to pull and play with: https://github.com/tsolakp/hadoop-sample
