In the early 2000s, Google designed and implemented a programming model called MapReduce for processing and generating large data sets in parallel across large clusters of machines. Google later published the programming model for anyone to implement and use.
MapReduce has had a huge impact on the computing world by making it possible to process terabytes of data in parallel across clusters of commodity machines. MapReduce also led to the popularity of Apache Hadoop, an open source implementation of MapReduce, and a host of other Big Data technologies.
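The idea behind the model is easiest to see with the classic word-count example. The following is a minimal, single-process Java sketch of the concept, not a Hadoop program: a map function turns each input record into key/value pairs, and a reduce function combines all values that share a key. In a real cluster, the framework runs the map calls in parallel across many machines and performs the grouping (shuffle) between the two phases.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MapReduceConcept {

        // Map phase: each input record (here, a line of text) is turned into
        // (word, 1) pairs. Each call is independent, which is what allows the
        // framework to run this step in parallel across machines.
        static List<SimpleEntry<String, Integer>> map(String line) {
            List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
            return pairs;
        }

        // Reduce phase: all values emitted for the same key are combined
        // into a single result (here, a sum).
        static int reduce(List<Integer> counts) {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            return sum;
        }

        public static void main(String[] args) {
            List<String> input = List.of("the quick brown fox", "the lazy dog");

            // Shuffle step (handled by the framework in a real cluster):
            // group the mapped pairs by key.
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (String line : input) {
                for (SimpleEntry<String, Integer> pair : map(line)) {
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                           .add(pair.getValue());
                }
            }

            grouped.forEach((word, counts) ->
                    System.out.println(word + " -> " + reduce(counts)));
        }
    }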
Hadoop is an implementation of the MapReduce programming model that enables distributed and parallel processing of large data sets across clusters of computers using simple programming models.
Hadoop is designed to scale from a single server to clusters of thousands of machines, utilizing the local computation and storage of each server. This enables Hadoop to process massive amounts of data in time frames that would not be achievable on a single machine or with traditional distributed computing approaches.
Hadoop is designed to detect node failures in the cluster and handle them at the application layer. Hence Hadoop provides a highly available service on top of a cluster of computers, each of which may be prone to failure.
Apache Hadoop contains the following core modules.
1. Hadoop Common: Hadoop Common contains the common utilities and libraries that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): Hadoop HDFS is a distributed file system that provides high-throughput and fault-tolerant access to application data.
3. Hadoop YARN: Hadoop YARN is a framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: Hadoop MapReduce is the implementation of the MapReduce programming model that enables parallel processing of large data sets.
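As a concrete example of how these modules fit together, the following sketch is essentially the standard Hadoop word-count job written against the MapReduce Java API: the mapper emits (word, 1) pairs, the reducer sums them per word, YARN schedules the map and reduce tasks across the cluster, and the input and output paths (supplied here as command-line arguments) typically point at HDFS.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: tokenize each line and emit (word, 1) for every token.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sum all counts emitted for the same word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The job is typically packaged as a JAR and submitted with a command such as hadoop jar wordcount.jar WordCount <input> <output>, where the JAR and class names here are only illustrative.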
The following are some Apache projects related to Hadoop, grouped by category.
Database:
1. HBase - HBase is a distributed and scalable database that supports structured data storage for very large tables.
2. Hive - Hive is a data warehouse platform that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
3. Cassandra - Cassandra is a scalable multi-master NoSQL database with no single points of failure.
Data-Flow:
Pig - Pig is a high-level data-flow language and execution framework for parallel computation.
Tez - A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases.
Compute:
Spark - A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation; a minimal Java sketch follows this list.
Mahout - A scalable machine learning and data mining library.
Administration:
Ambari - A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
Zookeeper - A high-performance coordination service for distributed applications.
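As a small illustration of the Spark programming model mentioned above, the following Java sketch counts the lines of a file that contain the word "ERROR". The input path and the local[*] master URL are assumptions for a local test run; in a real deployment the master URL is normally supplied by spark-submit and the input would typically live on HDFS.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ErrorLineCount {
        public static void main(String[] args) {
            // local[*] runs Spark inside this JVM using all cores; a cluster job
            // would normally get its master URL from spark-submit instead.
            SparkConf conf = new SparkConf().setAppName("error-line-count").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input path; could just as well be an hdfs:// URI.
            JavaRDD<String> lines = sc.textFile("app.log");

            // Transformations are lazy; count() triggers the actual computation.
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("Lines containing ERROR: " + errors);

            sc.stop();
        }
    }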
Hadoop is an open source project developed by the Apache community. In addition to core Hadoop, Apache has many other projects that are related to Hadoop and the Big Data ecosystem.
Third-party vendors package together the core Apache Hadoop framework and related projects, add proprietary utilities and capabilities on top, and distribute the result as a unified product.
Cloudera, Hortonworks, and MapR are three popular commercial distributions of Hadoop.