What is HDFS?

HDFS stands for Hadoop Distributed File System. It is a sub-project of Hadoop. HDFS connects the nodes of a cluster over which data files are distributed, and it does so in a fault-tolerant way.

You can then access and store the data files as one seamless file system. HDFS has several design goals, among them:

  • Fault tolerance + automatic recovery
  • Data access via MapReduce streaming
  • Simple and robust coherency model
  • Portability across heterogeneous operating systems + hardware
  • Scalability to reliably store and process large amounts of data

HDFS is written in Java, so any machine supporting Java can run HDFS.

How can you access data stored in HDFS?

  • Using the native Java API (a short sketch follows this list)
  • Using a C language wrapper around the Java API
  • Using a web-browser interface
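
As an illustration of the Java API route, here is a minimal sketch of a client that reads a text file from HDFS. The name node address and the file path are placeholders, not values from any particular cluster; in practice they come from your own configuration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the name node; host and port are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/example.txt")),
                                           StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Compile it against the Hadoop client libraries and run it from a machine that can reach the cluster; the C wrapper mentioned above exposes essentially the same operations.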

What is HDFS made of?

HDFS is an interconnected cluster of nodes where files and directories reside. HDFS splits input data into blocks of 64 or 128 MB and stores them on computers called DataNodes. A 1 GB file stored with 128 MB blocks, for example, is split into eight blocks, each placed on a data node (and replicated).

An HDFS cluster consists of:

  • A single node, known as the NameNode, which manages the file system namespace and regulates client access to files. It handles namespace operations such as opening, closing, and renaming files and directories, and it maps data blocks to the data nodes that serve read and write requests from HDFS clients.
  • DataNodes, which create, delete, and replicate data blocks according to instructions from the governing name node.

Data nodes continuously poll the name node for instructions. The file system is similar to most other existing file systems: you can create, rename, relocate, and remove files, and organize them in directories. HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3).
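
To make those everyday namespace operations concrete, here is a small sketch using Hadoop's FileSystem API to create a directory, rename a file, and delete another. The paths and the name node address are invented for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Create a directory.
            fs.mkdirs(new Path("/reports/2024"));

            // Relocate (rename) a file into it.
            fs.rename(new Path("/tmp/raw.csv"), new Path("/reports/2024/raw.csv"));

            // Remove a file (second argument: recursive delete).
            fs.delete(new Path("/tmp/old.csv"), false);
        }
    }
}
```

Each of these calls is a request to the name node, which updates the namespace; any block-level work is then carried out by the data nodes.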

Process

When a client creates a file in HDFS (a client-side sketch follows this list):

  • It first caches the data in a temporary local file.
  • It then redirects subsequent writes to that temporary file.
  • When the temporary file accumulates enough data to fill an HDFS block, the client reports this to the name node, which allocates a permanent block on a data node.
  • The client then closes the temporary file and flushes any remaining data to the newly allocated data node.
  • The name node then commits the file creation to its persistent store.
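
From the client's point of view, all of that buffering and block allocation is hidden behind an ordinary output stream. Here is a minimal sketch of creating and writing a file with the Java API, again with a placeholder host and path.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/new-file.txt"))) {
            // The client-side caching and block placement described above
            // happen behind this write call; closing the stream completes the file.
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```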

Data Storage Reliability

Remember that HDFS needs to stay reliable even when failures occur in:

  • name nodes,
  • data nodes,
  • or network partitions.

How do we detect that a node is failing? HDFS uses a heartbeat: each data node periodically sends a small message to the name node. As long as the heartbeats arrive, the node is considered healthy; once they stop, the name node treats the node as dead and can re-replicate its blocks elsewhere.

HDFS also protects transaction and data integrity. It uses checksum validation on the contents of HDFS files, storing the computed checksums in separate, hidden files in the same namespace as the actual data. When a client retrieves file data, it can verify that the data received matches the checksum stored in the associated file.
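
To illustrate the checksum idea (this is a simplified sketch of the principle, not the exact on-disk format HDFS uses), here is a small Java example that records a CRC32 checksum when a block of data is written and verifies it when the data is read back.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over a block of bytes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes(StandardCharsets.UTF_8);

        // At write time: compute the checksum and store it alongside the data.
        long stored = checksum(block);

        // At read time: recompute the checksum and compare it with the stored one.
        byte[] received = block.clone();
        boolean intact = checksum(received) == stored;
        System.out.println("block intact: " + intact);
    }
}
```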

Conclusion: I hope this high-level overview was clear and helpful. I’d be happy to answer any questions you might have in the comments section.