HDFS, which stands for 'Hadoop Distributed File System', is the distributed file system that ships with the Hadoop platform. HDFS is designed to store large files on clusters of commodity hardware and to provide high-throughput streaming access to file system data.
Large Data Sets - HDFS is designed to store very large data sets across hundreds or thousands of commodity machines. Individual files can be hundreds of megabytes, gigabytes, or terabytes in size.
Streaming Data Access - HDFS is designed for a write-once, read-many-times access pattern. It suits applications that need high throughput of data access rather than low-latency access to individual records.
Fault Tolerant - An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. HDFS is architected for detection of faults and quick, automatic recovery from node failures.
Simple data coherency model - HDFS is designed for applications that need a write-once, read-many-times access model for files. A file, once created, written, and closed, need not be changed except by appends and truncates. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary offset. This assumption simplifies data coherency issues and enables high-throughput data access.
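The append-only model can be illustrated with the Hadoop FileSystem API (org.apache.hadoop.fs), which is covered in more detail later in this section. The snippet below is a minimal sketch only; the URI is a placeholder, and append support must be enabled on the cluster.

String uri = "hdfs://localhost/interviewgrid/hadoop_questions.txt";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
// append() returns a stream positioned at the end of the existing file;
// there is no API for overwriting bytes at an arbitrary offset.
FSDataOutputStream out = fs.append(new Path(uri));
out.write("one more question\n".getBytes(StandardCharsets.UTF_8));
out.close();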
HDFS follows a master-slave architecture. Following are the main components of an HDFS installation.
NameNode - An HDFS cluster consists of a single NameNode, which is a master server that manages the file system namespace and regulates access to files by clients.
DataNode - An HDFS cluster consists of many DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.
Data Blocks - Files in HDFS are split into one or more blocks and these blocks are stored and replicated across a set of DataNodes.
HDFS follows a master-slave architecture. NameNode is a piece of software that acts as the master and is designed to run on any commodity server. Following are the key responsibilities of the NameNode.
- The NameNode manages the HDFS filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in that tree. The NameNode stores this information on disk in two files: the namespace image and the edit log.
- The NameNode executes file system namespace operations like opening files, closing files, and renaming files and directories.
- The NameNode maintains the mapping between the blocks of a data file and the DataNodes on which those blocks are stored.
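The block-to-DataNode mapping can be observed through the FileSystem API. The sketch below is illustrative only; it assumes a FileSystem handle fs obtained with FileSystem.get(), as shown in the API examples later in this section, and reuses the placeholder file path from those examples.

FileStatus status = fs.getFileStatus(new Path("/interviewgrid/hadoop_questions.txt"));
// Ask the NameNode which DataNodes hold each block of the file.
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    // Each block reports the hosts (DataNodes) that store a replica of it.
    System.out.println("bytes " + block.getOffset() + "-" + (block.getOffset() + block.getLength())
            + " stored on " + Arrays.toString(block.getHosts()));
}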
HDFS splits large files into smaller chunks called data blocks, stores them as independent units, and replicates the blocks across multiple nodes in a cluster.
The default HDFS block size is 128 MB.
HDFS is designed to be fault-tolerant. Large HDFS data files are split into smaller chunks called blocks, and each block is stored on multiple DataNodes across the cluster. The block size and the replication factor can be configured per file.
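As a sketch of per-file configuration, the create() overload on FileSystem accepts the replication factor and block size directly. The path and values below are illustrative only, and fs is assumed to be a FileSystem handle obtained as in the API examples later in this section.

// Create a file with a replication factor of 2 and a 64 MB block size;
// the simpler create(Path) overload falls back to the cluster defaults.
FSDataOutputStream out = fs.create(new Path("/interviewgrid/custom_blocks.txt"),
        true,               // overwrite if the file already exists
        4096,               // I/O buffer size in bytes
        (short) 2,          // replication factor for this file
        64 * 1024 * 1024L); // block size in bytes
// The replication factor of an existing file can also be changed afterwards.
fs.setReplication(new Path("/interviewgrid/custom_blocks.txt"), (short) 3);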
HDFS is rack aware in multi-rack environments and takes this into consideration when replicating blocks for fault tolerance. HDFS ensures that block replicas are placed on DataNodes in different racks, so if a rack goes down the data is still available from DataNodes on other racks.
The NameNode manages and stores the metadata of the HDFS filesystem namespace. If the NameNode goes down, or if its metadata gets corrupted, the HDFS filesystem cannot be used.
Hadoop provides two mechanisms to make the NameNode resilient to failures.
1. Backup files - The first mechanism is to back up the files that store the persistent state of the filesystem, so that if those files are corrupted the backup data can be used. Hadoop can be configured so that the NameNode writes its persistent state to multiple filesystems.
2. Standby NameNode - The second mechanism is to run two NameNodes in the same cluster in an active-standby (High Availability) configuration. The standby node keeps its state synchronized with the active node and takes over as the active node when the active node fails. (This is distinct from the checkpointing Secondary NameNode, which only merges the namespace image with the edit log periodically and cannot take over on its own.)
You can copy files from the HDFS filesystem to the local filesystem through the command-line interface by using the -copyToLocal sub-command:
% hadoop fs -copyToLocal /temp/test1.txt /docs/test1.txt
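The same copy can also be done programmatically. This is a rough sketch using FileSystem.copyToLocalFile() with the paths from the command above; fs is assumed to be a FileSystem handle obtained as in the API examples below.

// Programmatic equivalent of the -copyToLocal sub-command.
fs.copyToLocalFile(new Path("/temp/test1.txt"), new Path("/docs/test1.txt"));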
The Hadoop API provides the class org.apache.hadoop.fs.FileSystem to access the HDFS filesystem. You obtain a FileSystem instance with the static get() method and then call open(), passing the HDFS file path as a parameter. The open() method returns an FSDataInputStream from which you can read the data.
String uri = "hdfs://localhost/interviewgrid/hadoop_questions.txt";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = fs.open(new Path(uri));
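As a usage note, the FSDataInputStream returned by open() can be read like any java.io.InputStream; a common pattern, sketched below with org.apache.hadoop.io.IOUtils, is to stream the contents to standard output.

try {
    // Copy the file contents to stdout, 4 KB at a time, leaving stdout open.
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}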
The Hadoop API provides the class org.apache.hadoop.fs.FileSystem to access the HDFS filesystem. You call the create() method on a FileSystem instance and pass the HDFS file path as a parameter. The create() method returns an FSDataOutputStream to which you can write the data.
String uri = "hdfs://localhost/interviewgrid/hadoop_questions.txt";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataOutputStream out = fs.create(new Path(uri));
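As a usage note, data can then be written to the returned FSDataOutputStream and the stream closed to flush the file to HDFS; the content written below is purely illustrative.

// Write some illustrative content to the newly created file and close the stream.
out.write("What is HDFS?\n".getBytes(StandardCharsets.UTF_8));
out.close();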