INTRODUCTION TO HADOOP
Hadoop is one of the most popular open-source implementations of MapReduce, a powerful tool designed for the deep analysis and transformation of very large datasets. Hadoop lets you explore complex data, applying custom analyses tailored to your information and questions. It is a system that allows unstructured data to be distributed across hundreds or thousands of machines forming a shared cluster. Hadoop provides its own distributed file system, which replicates data across multiple nodes: with the default replication factor of three, if one node holding data goes down, there are at least two other nodes from which the data can be retrieved. This protects the data and ensures its availability despite node failures, which is critical when a cluster contains many nodes; conceptually, it is RAID at the server level.
HOW HADOOP WORKS
Hadoop restricts the communication that can take place between processes, as each individual record is processed by a task in isolation from the others. While this sounds like a limitation at first, it makes the whole framework much more reliable. Hadoop will not simply run any program and distribute it across a cluster: programs must be written to conform to a particular programming model, named "MapReduce."
In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together by a second set of tasks, called Reducers, where results from different Mappers are merged.
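The Mapper/Reducer split described above can be sketched in a few lines of plain Python. This is an illustrative in-memory model of a word-count job, not Hadoop code; the function names `mapper`, `reducer`, and `run_job` are made up for the example, but the structure mirrors what the framework does: each mapper call sees one record in isolation, and the framework groups the mappers' (key, value) pairs by key before any reducer runs.

```python
from collections import defaultdict

def mapper(line):
    """Process one input record in isolation: emit (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Merge all values emitted for a single key."""
    return (word, sum(counts))

def run_job(records):
    # Shuffle phase (done by the framework in Hadoop): group the
    # intermediate (key, value) pairs from all mappers by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one reducer invocation per distinct key.
    return dict(reducer(k, v) for k, v in groups.items())

result = run_job(["the quick brown fox", "the lazy dog"])
print(result)  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```

Note that `mapper` never sees more than one record and `reducer` never sees more than one key; that isolation is exactly what lets Hadoop run many copies of each in parallel on different machines.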
Pieces of data can be labeled with key names, which tell Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data-transfer and cluster-topology issues.
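One common way such keyed routing can work is a stable hash of the key modulo the number of reducers, so that every record with the same key lands on the same destination regardless of which mapper emitted it. The sketch below is illustrative Python, not Hadoop's actual partitioner; the `partition` function is a hypothetical name for the example.

```python
import zlib

def partition(key, num_reducers):
    # A stable hash: the same key always maps to the same reducer,
    # no matter which mapper (or machine) computes it.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
routed = {}
for key, value in pairs:
    routed.setdefault(partition(key, 4), []).append((key, value))

# All "apple" records are routed to one partition, so a single
# reducer sees every value for that key.
print(routed)
```

The design point is determinism: because routing depends only on the key, no coordination between mappers is needed to bring related data together.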
By restricting the communication between nodes, Hadoop makes the distributed system much more reliable. Individual node failures can be handled by restarting tasks on other machines. Because user-level tasks do not communicate explicitly with one another, no messages need to be exchanged by user programs, and nodes do not need to roll back to pre-arranged checkpoints to partially restart the computation. The other nodes continue to operate as though nothing went wrong, leaving the challenging aspects of partially restarting the program to the underlying Hadoop layer.
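This recovery strategy works because an isolated task is effectively a pure function of its input: re-running it on another machine yields the same result, so no rollback is needed. The following is a minimal illustrative sketch (assumed names, not Hadoop code) of that scheduling idea, where a "node failure" on the first attempt is simulated and the task is simply rescheduled.

```python
def flaky_node(task, attempt):
    # Assumption for the demo: the first attempt "crashes" to
    # simulate a node failure mid-computation.
    if attempt == 0:
        raise RuntimeError("node failure")
    return task()

def run_with_retries(task, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return flaky_node(task, attempt)
        except RuntimeError:
            # Reschedule on another machine; because the task is
            # isolated and deterministic, no rollback is required.
            continue
    raise RuntimeError("task failed on all attempts")

# An isolated task: same input, same output, no shared state.
result = run_with_retries(lambda: sum(x * x for x in range(5)))
print(result)  # 30
```

If tasks could exchange messages mid-run, a failure would force their partners to roll back too; isolation is what keeps recovery this simple.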