Hadoop Distributed File System (HDFS) Platform

From GM-RKB
(Redirected from Apache HDFS System)
Jump to navigation Jump to search

A Hadoop Distributed File System (HDFS) Platform is a distributed file system platform within the Hadoop framework (to create HDFS instances) .



References

2020

  • (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system Retrieved:2020-8-23.
    • The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Some consider it to instead be a data store due to its lack of POSIX compliance, but it does provide shell commands and Java application programming interface (API) methods that are similar to other file systems. A Hadoop is divided into HDFS and MapReduce. HDFS is used for storing the data and MapReduce is used for processing data.
    • HDFS has five services as follows:
      1. Name Node
      2. Secondary Name Node
      3. Job tracker
      4. Data Node
      5. Task Tracker
    • Top three are Master Services/Daemons/Nodes and bottom two are Slave Services. Master Services can communicate with each other and in the same way Slave services can communicate with each other. Name Node is a master node and Data node is its corresponding Slave node and can talk with each other.

2015

2012

2011

  • http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System
    • … Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes form the HDFS cluster. The situation is typical because each node does not require a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate between each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB), across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files. The HDFS does not provide High Availability.

2007