So far, we have seen that Hadoop generally consists of two major components: HDFS and MapReduce. In this blog post we are going to talk about the storage side of Hadoop, i.e., HDFS. By now you probably know that Hadoop is basically a distributed system with its own flavour and architecture. Since Hadoop deals with large datasets (remember, we are talking about big data, so datasets here can go beyond hundreds of terabytes), the storage capacity of a single physical machine is not enough. To deal with such large amounts of data, it becomes necessary to partition it across a number of separate machines.
Filesystems that manage storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem of its own, called HDFS, or the Hadoop Distributed File System. Let's first look at the official definition of HDFS, which we will discuss in detail later:
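To make the "one big file system spread over many machines" idea concrete, here is a minimal sketch in Java using Hadoop's FileSystem API. The NameNode URI hdfs://namenode:8020 and the directory being listed are placeholders for your own cluster; the point is simply that the client addresses a single filesystem URI, not individual machines.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // hdfs://namenode:8020 is a placeholder; use your cluster's NameNode URI.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // One call lists a directory whose blocks may live on many DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}
```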
"Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers."
We will now explain the keywords used in the definition to further understand HDFS:
- By the name Hadoop Distributed File System, we understand that HDFS is a distributed file system which presents all the storage in the cluster as a single, unified storage space.
- Java-based file system:- HDFS is written in Java, which makes it portable across operating systems and hardware and fits naturally with the rest of the Hadoop stack, which is also Java-based. The file system itself is responsible for keeping data consistent, available, and safely accessible even though it is spread across many machines and shared by many clients.
- Scalable:- HDFS has a scale-out architecture, i.e., to increase the capacity of the cluster we just add more servers, without disturbing the running HDFS cluster (a small sketch after this list shows how a client can query the DataNodes that make up the cluster).
- Reliable data storage:- HDFS is built so that data, once stored, survives failures with as much fault tolerance as possible. HDFS was built with the assumption that hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, each having a non-trivial probability of failure, some component of HDFS is effectively always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
- Commodity servers:- Hadoop doesn't require expensive, highly reliable hardware. It is designed to run on clusters of commodity hardware for which the chance of node failure is high, at least for large clusters. That said, it doesn't mean that in a real production environment we can use a normal PC as a cluster node. Commodity hardware here means typical, widely available servers made by a range of vendors.
- Streaming data access:- Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users, so the emphasis is on high throughput of data access rather than low latency.
- Simple coherency model:- HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access (see the write-then-read sketch after this list).
- Flexible access:- HDFS can be accessed in several ways: it works with multiple open serialization frameworks and also supports file system mounts.
- Load Balancing:- HDFS places data intelligently for maximum efficiency and utilization so that data is equally distributed throughout the cluster.
- Tunable replication:- HDFS stores multiple copies of each file (the replication factor is configurable per file), which provides both data protection and better computational performance (see the replication sketch after this list).
- Security:- Hadoop provides POSIX-style file permissions for users and groups, with optional LDAP integration, giving controlled, secure access to data in HDFS (see the permissions sketch after this list).
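As a rough illustration of the scale-out point above, the sketch below asks the NameNode for the list of DataNodes and their capacity; when new servers are added, they simply show up in this report. It assumes the same placeholder NameNode URI as before, and the cast to DistributedFileSystem is only valid when the URI really points at HDFS.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cast is safe only when the URI scheme is hdfs://.
        DistributedFileSystem dfs = (DistributedFileSystem)
                FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Every DataNode in the cluster appears here; adding servers grows this list.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.printf("%s: %d of %d bytes used%n",
                    node.getHostName(), node.getDfsUsed(), node.getCapacity());
        }
        dfs.close();
    }
}
```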
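The streaming-access and write-once-read-many points go together: a typical HDFS client writes a file once, closes it, and later reads it back as a continuous stream. The sketch below does exactly that; the path /tmp/events.txt and the NameNode URI are illustrative only.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/tmp/events.txt");  // illustrative path

        // Write once: create the file, stream the data, close it.
        FSDataOutputStream out = fs.create(file, true);  // true = overwrite if it exists
        out.writeBytes("event-1\nevent-2\nevent-3\n");
        out.close();

        // Read many: open the closed file and stream it back, here to stdout.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, false);  // false = keep the streams open
        in.close();

        fs.close();
    }
}
```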
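For tunable replication, the replication factor can be set cluster-wide through the dfs.replication property (3 by default) or per file through the API. A small sketch, again with placeholder paths:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TuneReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication used for files this client creates (3 is HDFS's own default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Per-file override: keep 5 copies of a particularly important file.
        Path important = new Path("/tmp/important.dat");  // illustrative path
        boolean changed = fs.setReplication(important, (short) 5);
        System.out.println("Replication changed: " + changed);

        // Read the current replication factor back from the file's status.
        System.out.println("Now stored with replication = "
                + fs.getFileStatus(important).getReplication());
        fs.close();
    }
}
```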
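Finally, the POSIX-style permission model means every file and directory has an owner, a group, and rwx permission bits, much as on a Unix filesystem. The sketch below sets mode 640 on a file; the path, and the assumption that the caller has rights to change it, are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SetHdfsPermissions {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/tmp/important.dat");  // illustrative path

        // rw- for the owner, r-- for the group, nothing for others (octal 0640).
        fs.setPermission(file, new FsPermission((short) 0640));

        // Read the permissions back, printed POSIX-style, e.g. "rw-r-----".
        System.out.println(fs.getFileStatus(file).getPermission());
        fs.close();
    }
}
```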