Big Data is a term for data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. What counts as "big" is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes in a single data set.
In short, it is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Big Data is commonly described by three characteristics: volume, velocity, and variety.
We already have RDBMSs to store and process structured data. But of late we have been getting data in the form of videos, images, and free text. Such data is called unstructured or semi-structured data, and it is difficult to store and process efficiently using an RDBMS.
So we definitely need an alternative way to store and process this unstructured and semi-structured data.
What is Hadoop?
Hadoop is one technology for efficiently storing and processing large data sets. It is entirely different from a traditional distributed file system, and it overcomes many of the problems that exist in traditional distributed systems.
Hadoop is an open-source framework, written in Java, for storing data in a distributed file system and processing it in parallel across a cluster of commodity nodes.
There are two main components of Hadoop:
1. HDFS: Hadoop Distributed File System-
HDFS is a distributed file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
· HDFS is a logical file system layered over the local file systems of all the nodes; it provides special capabilities to handle the storage of big data efficiently.
· When we give a file to HDFS, it divides the file into a number of blocks of manageable size (either 64 MB or 128 MB, based on configuration).
· These blocks are replicated three times (the default, which is configurable), and each replica is stored in the local file system as a separate file.
· Blocks are replicated across multiple machines, known as DataNodes. A DataNode is a slave machine in the Hadoop cluster running the DataNode daemon (a continuously running process).
· A master node (with a high-end configuration such as dual power supplies, dual network cards, etc.) called the NameNode (a node running the NameNode daemon) keeps track of which blocks make up a file and where those blocks are located; this information is known as the metadata.
· The NameNode keeps the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes store the blocks themselves and constantly report to the NameNode to keep the metadata current.
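The splitting and replication described above can be sketched as a toy model. This is purely illustrative: the names (`split_into_blocks`, `place_replicas`, `BLOCK_SIZE`) are made up for this example and are not part of any Hadoop API, and real HDFS placement is rack-aware rather than round-robin.

```python
# Toy model of HDFS block splitting and NameNode-style replica metadata.
# All names here are hypothetical; this is not real Hadoop code.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB block size (configurable in HDFS)
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """NameNode-style metadata: block id -> list of DataNodes holding it."""
    metadata = {}
    for b in range(num_blocks):
        # simple round-robin placement; real HDFS uses rack awareness
        metadata[b] = [datanodes[(b + r) % len(datanodes)]
                       for r in range(replication)]
    return metadata

blocks = split_into_blocks(300 * 1024 * 1024)        # a 300 MB file
meta = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # -> 3 (two 128 MB blocks plus one 44 MB block)
print(meta[0])       # -> ['dn1', 'dn2', 'dn3']
```

Note how the last block is smaller than the configured block size: HDFS does not pad files out to a full block.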
2. MapReduce-
MapReduce is a massively parallel processing technique for processing data distributed on a commodity cluster. It is a method for distributing a task across multiple nodes, where each node processes data stored locally on that node whenever possible.
Hadoop can run MapReduce programs written in various languages. Generally we write MapReduce programs in Java, because Hadoop itself is developed in Java. Support for other languages such as C++, Ruby, and Python is provided through Hadoop Streaming.
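With Hadoop Streaming, the framework feeds input records to a mapper script one line at a time on standard input and expects tab-separated key/value pairs on standard output. A minimal word-count mapper in that style might look like this (the function name `map_lines` is just for this sketch):

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming-style mapper for word count.
# The framework pipes input lines to stdin; we emit "key<TAB>value" on stdout.
import sys

def map_lines(lines):
    """Yield a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

if __name__ == "__main__":
    for word, count in map_lines(sys.stdin):
        print(f"{word}\t{count}")
```

A matching reducer script would read the sorted key/value lines from stdin and sum the counts per word; both scripts are then passed to the streaming jar with `-mapper` and `-reducer` options.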
MapReduce consists of two phases: Map and Reduce.
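The two phases can be simulated in a few lines of plain Python. This is only a sketch of the idea, not how Hadoop is implemented: the framework performs the map phase on many nodes, groups intermediate values by key (the shuffle), then runs the reduce phase. The helper name `run_mapreduce` is invented for this example.

```python
# Toy single-process simulation of the two MapReduce phases (word count).
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    grouped = defaultdict(list)
    for record in records:                 # map phase: record -> (key, value) pairs
        for key, value in mapper(record):
            grouped[key].append(value)     # shuffle: group values by key
    return {key: reducer(key, values)      # reduce phase: key + values -> result
            for key, values in grouped.items()}

wordcount = run_mapreduce(
    ["big data", "big cluster"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, counts: sum(counts),
)
print(wordcount)   # -> {'big': 2, 'data': 1, 'cluster': 1}
```

The developer supplies only the mapper and reducer functions; everything between them (grouping, sorting, moving data between nodes) is the "housekeeping" that the framework handles.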
Features of MapReduce:
- Automatic parallelization and distribution.
- Fault tolerance, status reporting, and monitoring tools.
- A clean abstraction for programmers.
- MapReduce programs are usually written in Java, but they can be written in other languages using the Hadoop Streaming API.
- MapReduce abstracts all the "housekeeping" away from the developer, who can concentrate simply on writing the Map and Reduce functions.
- In Hadoop 1.x, the JobTracker's default port is 8021 and its web UI runs on port 50030.