Hadoop Explained

Jonathan Schein
3 min read · Nov 6, 2020


What is Hadoop?

Hadoop exists so that we can take advantage of the opportunities big data has to offer. It helps us face the challenges that come with working with data at scale and simplifies problems that would otherwise stand between a question and its answer. Big data has been around for a while now, but Hadoop gives individuals and large companies alike a practical way to work with it, and it has powered a huge number of insights since its release.

Hadoop is “an open-source, Java-based programming framework that supports the processing of large datasets in a distributed computing environment.” It also takes advantage of a distributed computation framework called MapReduce.
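
To make the MapReduce idea concrete, here is a minimal word-count sketch in Java, the classic introductory example: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums those pairs for each word. It uses the org.apache.hadoop.mapreduce API; the class names are just illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: split each input line into words and emit (word, 1) for every word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. ("hadoop", 1)
        }
    }
}

// Reduce step: Hadoop groups the pairs by word, so each reduce call receives
// one word plus all of its 1s, which it sums into a total count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // e.g. ("hadoop", 42)
    }
}
```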

Many people have a hard time understanding the use cases of Hadoop because they are already familiar with RDBMS, or Relational Database Management Systems. An RDBMS stores data in tables made up of rows and columns; Hadoop, by contrast, is not limited to that structure.

Hadoop is made up of 4 open source libraries that each do something unique and add to the competitive advantage that Hadoop has (the driver sketch after this list shows how they fit together):

  • Hadoop Common: shared utilities and libraries used by the other modules
  • Hadoop MapReduce: parallel processing of large datasets across the cluster
  • HDFS, or Hadoop Distributed File System: distributed storage and file system management across the cluster
  • Hadoop YARN: resource management and job scheduling across the cluster (introduced in Hadoop 2 to take scheduling out of MapReduce)
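
Here is what a hypothetical driver for the word count sketched above could look like, showing how the modules divide the work: the input and output live in HDFS, YARN schedules the job on the cluster, MapReduce does the processing, and Hadoop Common supplies the shared plumbing. Class and path names are again illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the word-count mapper and reducer into a job, points it at
// input/output directories in HDFS, and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);   // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in HDFS; the paths are passed on the command line,
        // e.g. hadoop jar wordcount.jar WordCountDriver /data/logs /data/word-counts
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // YARN schedules the job's map and reduce tasks across the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```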

So, why do people use Hadoop?

When you compare Hadoop to other systems, it is much more cost-efficient, since it runs on clusters of ordinary commodity hardware. It also has a large, active community, and the project keeps evolving over time.

And what is it used for? Companies that receive large amounts of data in different forms and want to combine all of it benefit the most: clickstream data, social media, transactions, or really any other format. It also suits enterprise projects that need clusters of servers but have limited specialized data-management and programming skills. That said, Hadoop only makes sense once the amount of data reaches terabytes or petabytes; for anything smaller, tools such as Microsoft Excel or Postgres are more useful.

Now, how does Hadoop actually work?

Hadoop is an ecosystem of libraries, and each library performs its own function. HDFS follows a write-once, read-many pattern: data is written to the cluster's servers once and then read over and over again. Compared to file systems built around many small read and write operations, this makes HDFS much faster, giving it an advantage when working with a high volume and variety of data.
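
A minimal sketch of that write-once, read-many pattern, using the HDFS Java client (org.apache.hadoop.fs.FileSystem); the file path and cluster address below are placeholders, not values from the article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a file into HDFS once, then read it back as many times as needed.
public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster's core-site.xml,
        // e.g. hdfs://namenode:8020 (hypothetical address).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/example/events.txt");   // illustrative path

        // Write once: the NameNode records the file's metadata and the
        // DataNodes store its blocks, replicated across the cluster.
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("first event\nsecond event\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: every later job or client simply streams the blocks back.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```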

In classic MapReduce, the JobTracker is the master node and gives commands to all of its slave nodes, called TaskTrackers. Whenever data is required, a request is sent to the NameNode, which is the master node of HDFS and manages all of the slave nodes, called DataNodes. The DataNodes then process the request and serve the data.

MapReduce and YARN are used together for scheduling, processing, and executing jobs across the cluster.

Conclusion

Hadoop can do many things and serve many purposes, but it is up to the data scientist to put the necessary components of the software to work.

Sources

  1. https://www.youtube.com/watch?v=oT7kczq5A-0
  2. https://www.dezyre.com/article/hadoop-explained-how-does-hadoop-work-and-how-to-use-it-/237
