Does Hadoop upload split data serially or in parallel?
Before proceeding, let’s first understand the basics…
Hadoop is an open-source Apache framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. It uses the Hadoop Distributed File System (HDFS) to tackle Big Data problems.
According to popular articles, Hadoop uses the concept of parallelism to upload split data, addressing the Velocity problem of Big Data. Hadoop is said to be good at processing large amounts of data in parallel: the idea is to break the large input down into smaller chunks, each of which can be processed separately on a different machine.
But when we (my team and I) tried it in practice, we found that Hadoop uses the concept of SERIALISM rather than PARALLELISM to upload split data.
So, let’s see the proof of the above statement.
✅ First, we set up a Hadoop cluster with one NameNode (master), four DataNodes (slaves), and one client.
✅ We noted down the Public IP and Private IP of each system so we could identify where the traffic was coming from.
- DATANODE 1
- DATANODE 2
- DATANODE 3
- DATANODE 4
✅ Then we uploaded a small file from the client with a replication factor of 4.
Note: The replication factor is the number of times the Hadoop framework replicates each data block. Blocks are replicated to provide fault tolerance. The default replication factor is 3, which can be configured as required: it can be decreased to 2 (less than 3) or increased (more than 3).
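For reference, the replication factor is controlled by the `dfs.replication` property in `hdfs-site.xml`; the sketch below sets it to 4, matching our experiment (a standard config fragment, not taken from our cluster's actual files):

```xml
<!-- hdfs-site.xml: default replication factor for newly created files -->
<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
```

It can also be overridden per upload (e.g. `hdfs dfs -D dfs.replication=4 -put file.txt /dir`) or changed for an existing file with `hdfs dfs -setrep 4 /dir/file.txt`.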
✅ Before uploading the file, we ran tcpdump on port 50010 (the DataNode data-transfer port used by HDFS) to watch the traffic.
As soon as we uploaded the file, we noticed that traffic arrived at the DataNodes one by one (serially), not all at once (in parallel).
- First, the split data goes from the client (184.108.40.206) to Datanode 2 (172.31.37.144).
- Then it goes from Datanode 2 (220.127.116.11) to Datanode 1 (172.31.33.221).
- Then it goes from Datanode 1 (18.104.22.168) to Datanode 3 (172.31.13.12).
- Finally, it goes from Datanode 3 (22.214.171.124) to Datanode 4 (172.31.45.186).
So, in this way, we showed that Hadoop uses the concept of serialism for uploading split data. Moreover, we also found that the client sends data to only one DataNode; the data then travels from one DataNode to the next, forming a kind of chain. In HDFS terms, this chain is known as the replication pipeline.
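The chain we observed can be sketched in a few lines of Python. This is a simulation of the hop order, not Hadoop code; the node names `DN1`–`DN4` stand in for our DataNodes, and the chain order matches the tcpdump trace above:

```python
def pipeline_write(block, chain):
    """Simulate the serial replication pipeline: the client streams the
    block to ONE datanode, and each datanode forwards it to the next."""
    hops = []
    sender = "client"
    for dn in chain:
        hops.append((sender, dn))  # one traffic hop, as seen in tcpdump
        sender = dn                # the receiver forwards to the next node
    return hops

# The chain we observed: client -> DN2 -> DN1 -> DN3 -> DN4
hops = pipeline_write("block-0", ["DN2", "DN1", "DN3", "DN4"])
for src, dst in hops:
    print(f"{src} -> {dst}")
```

Note that the client appears as the sender in only the first hop; every later hop is DataNode-to-DataNode, which is exactly why the traffic showed up on the nodes one by one.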