Full Project: https://github.com/tsolakp/hadoop-sample
There has been a lot of talk about Hadoop lately but with very little explanation of what it actually is. Unfortunately some of tutorial have not been simple enough to give me a big picture of what this technology is all about.
After some reading http://developer.yahoo.com/hadoop/tutorial/ comprehensive tutorial on Yahoo and trying I want to share my observations.
What it is and what it is not.
First of all Hadoop is simply a batch application. It is designed to process large amount of data (for example in forms of large data files) in parallel over distributed system where each machine is called a node.
It is not a database like the related Apache projects called Hive or HBase.
It consist of two major parts. One is distributed batch app that manages job on each node and second is the HDFS (file system) which allows Hadoop to transparently distribute large files across nodes so that each job on the node could access and process the file or its parts.
One probably obvious but crucial aspect of Hadoop is that it makes sure the processing code (i.e Mapper’s or Reducer’s Java classes) are run closer to the data which means that unlike Spring Batch, for example, it consists of multiple processes running on each node and controller by master “namenode”.
This makes it difficult to debug, but fortunately you can run Hadoop jobs just like plain Java “main” application for testing before deploying to “pseudo” (all nodes are run on single machine) or “real” (nodes are actually separate machines) environments.
