Apache Spark and Big Data

Jonathan Schein
11 min read · Dec 7, 2020


Introduction

In this blog post I will introduce the idea of big data and discuss the tools that data scientists use daily to manage it. Big data is data that exhibits the three V’s: Volume, Variety and Velocity. This makes it very challenging for data scientists to run their routine analysis the way they would on a small, simple dataset. In this post, I will discuss how to overcome this challenge through parallel and distributed computation. I will discuss Apache Spark, an open-source distributed cluster-computing framework. Finally, I will discuss PySpark, the Python API for Apache Spark.

Before I delve into these ideas, I will explain a few key terms up front. Parallel computing refers to the idea that complex data-focused tasks can be distributed across and executed by a cluster of interconnected computers rather than just one. MapReduce is a programming model that processes large datasets as sets of key:value pairs.

Big Data: Introduction

Big data has become more and more popular over the past few years, and businesses aim to use it to make data-driven decisions. Aside from its size, there is no fundamental difference between “normal” data and big data, but the size alone changes which analytics tools can be used to examine the data closely and make meaningful decisions.

To data scientists, big data refers to datasets that are so large that they cannot be handled with traditional database management systems and standard analytical tools. You need advanced storage and processing systems to work on these big data projects within a reasonable time.

Big datasets can reach petabytes in size. Today we have software that can store and analyze this data to ultimately extract meaningful information. According to Doug Laney, big data is properly defined by the three V’s: Volume, Velocity and Variety. Volume refers to the size of the data, velocity refers to the rate at which the data is generated and changes, and variety refers to the different formats and types of data and the different ways to analyze it.

Analytics

While data has been growing over the past few years, so have the tools used to analyze it. Having the data is not enough; analyzing it and making meaningful decisions from it is equally important.

  • Visualization and Analytics: Tableau, Microsoft, Looker, R, Python
  • Computation: Spark, Hive, Azure, DigitalOcean, Hadoop
  • Storage: Amazon S3, Apache HBase
  • Distribution and Data Warehouse: Oracle, Amazon EMR, Cassandra, Cloudera

Parallel and Distributed Computing with MapReduce

MapReduce allows its users to scale big data analysis jobs across many servers. The term MapReduce combines two steps, map and reduce. Map takes one set of data and transforms it into another set of data broken down into key:value pairs. Reduce takes the output from map as its input and combines those key:value pairs into even smaller sets of aggregated values. In summary, MapReduce turns a big data problem into many regular-sized problems by using parallel and distributed computing.

Distributed processing means a group of computers in a network working together to accomplish a task. The computers in the network do not share a hard drive or memory; instead, they communicate through messages sent over the network. The individual computers are referred to as nodes. The two common ways to organize computers in a distributed system are the client-server model and the peer-to-peer model.

In the client-server model, nodes make requests to a central server, which then chooses whether or not to accept each request. In the peer-to-peer model, nodes communicate with each other directly without requiring authorization from a central server.

Parallel processing is based on the well-known principle of divide and conquer: when there is a very large and arduous task, the best way to handle it is to divide it into smaller tasks and complete them at the same time as your colleagues. Individual processors keep getting faster, but they still cannot handle the volume of data that comes with big datasets, which is why companies use multiple processors to handle large amounts of data. A small single-machine sketch of this pattern follows the summary list below.

Summary of parallel computing

  • A large problem is divided into many small problems
  • The small problems follow an algorithm
  • The program is executed simultaneously on many processors
  • All the answers are combined to form one final answer
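
As a small illustration of this divide-and-conquer pattern on a single machine, here is a hedged sketch in plain Python (not Spark): it splits a large sum into chunks, hands the chunks to several worker processes and combines the partial results. The chunk size and worker count are arbitrary choices made for the example.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker solves one "small problem": summing its own chunk.
    return sum(chunk)

if __name__ == '__main__':
    # Divide the large problem (summing a million numbers) into 4 smaller chunks.
    chunk_size = 250_000
    chunks = [range(i, i + chunk_size) for i in range(0, 1_000_000, chunk_size)]

    # Execute the small problems simultaneously on multiple processes.
    with Pool(processes=4) as pool:
        partial_results = pool.map(partial_sum, chunks)

    # Combine all the answers into one final answer.
    print(sum(partial_results))  # 499999500000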

MapReduce

This is where MapReduce comes into play. It coordinates how the nodes communicate with each other: it is software that distributes big data and provides a parallel processing environment over several nodes connected to each other in a cluster. In just a few words, MapReduce can be described as “split-apply-combine”.

Example

Situation: Someone is trying to calculate the total number of animals in all zoos across the country. They receive a file that contains all this information but it is not organized.

  • Map and Split: the dataset needs to be transformed into key:value pairs. Each computing cluster is assigned a number of map tasks, which are then distributed among its nodes. In this example there are 5 nodes, so the first step is to split the data from one file into five files (or however many nodes are being used). The map function then creates a series of key:value pairs of animal:number of animals per zoo. Intermediate key:value pairs are created throughout the process.
  • Shuffling: the mapped objects are shuffled to make it easier to reduce the key:value pairs. The number of new fragments will be the same as the number of reduce tasks.
  • Reducing: every shuffled segment has a reduce task applied to it, and the final output is written to a file system; the underlying file system is HDFS, the Hadoop Distributed File System. One worthwhile point is that MapReduce is best used with big data rather than small projects, where the overhead will actually slow things down. The job tracker is a “master” node that tells the other nodes which tasks to complete, while the task tracker is a “worker” node that completes them. Several MapReduce jobs can be chained to complete a larger job: when one MapReduce job finishes, its output becomes the input for the next. A minimal single-machine sketch of this split, map, shuffle and reduce flow follows this list.
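
To make the flow concrete, here is a hedged, single-machine sketch of the zoo example in plain Python. The records, split sizes and variable names are invented for illustration; real MapReduce runs these stages on separate nodes with HDFS underneath.

from collections import defaultdict

# Unorganized records of (zoo, animal, count) -- made-up data for the example.
records = [
    ('Bronx Zoo', 'lion', 4), ('San Diego Zoo', 'lion', 6),
    ('Bronx Zoo', 'zebra', 10), ('Lincoln Park Zoo', 'zebra', 7),
    ('San Diego Zoo', 'penguin', 30), ('Lincoln Park Zoo', 'lion', 3),
]

# Split: hand each "node" a slice of the file (three slices here instead of five).
splits = [records[0:2], records[2:4], records[4:6]]

# Map: every node turns its slice into animal:count key:value pairs.
mapped = [(animal, count) for split in splits for (_zoo, animal, count) in split]

# Shuffle: group the intermediate pairs so all values for a key land together.
shuffled = defaultdict(list)
for animal, count in mapped:
    shuffled[animal].append(count)

# Reduce: each group is collapsed into a final total per animal.
totals = {animal: sum(counts) for animal, counts in shuffled.items()}
print(totals)  # {'lion': 13, 'zebra': 17, 'penguin': 30}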

Downloading Apache Spark Through a Docker Container

  • All the tools that are downloaded are open source.

Docker: We will run PySpark on a single machine in a virtualized environment using Docker, a container technology that allows packaging and distribution of software, which makes it simpler to set up environments, configure logging and options, etc. Spark is known to be difficult to install, which is why it is easier to go through Docker.

Mac: https://download.docker.com/mac/stable/Docker.dmg

Windows: https://hub.docker.com/editions/community/docker-ce-desktop-windows

Kitematic makes it easy to install containers in Docker running on your computer and lets you control your app containers from a GUI. This is easier than configuring virtual environments, and it is automatically included with newer versions of Docker. Once Docker is installed, perform the following tasks to create a PySpark-enabled notebook.

  • Click on the Docker icon in the Mac toolbar and select Kitematic.
  • Sign up on Docker Hub: although this step is optional, it is highly recommended so you can share your Docker containers and run them on different machines. This step is accessed via the “My Repos” section in the Kitematic GUI.
  • Search for the “pyspark-notebook” repo and click on the image provided by jupyter.
  • Once the image is downloaded, run it. This will start an IPython kernel. To open the Jupyter notebook, click on the right half of Kitematic where it says “web preview”.
  • Add the token ID, which you can find by going back to Kitematic and checking the bottom left of the terminal-like screen, where a string will say token?= ___. Then the Jupyter notebook will open and you can run programs in Spark.

Testing:

In a new Jupyter notebook, run the script below to test it out.

import pyspark

sc = pyspark.SparkContext('local[*]')

rdd = sc.parallelize(range(1000))

rdd.takeSample(False, 5)

If the output is a list of five random numbers between 0 and 999, for example [941, 60, 987, 542, 718], then it works! (takeSample returns a random sample, so your exact numbers will differ.)

Spark Continued

Cluster Resource Manager — A feature of Spark which splits the physical resources of a cluster of machines between multiple Spark applications.

Standalone Cluster Manager — this operates in the standalone mode and allows Spark to manage its own clusters.

SparkContext() — One needs this in order to use Spark and its API. It is a client of Spark’s execution environment, acts as the master of the Spark application and sets up internal services.

When you are running Spark, you start a new Spark application by creating a new SparkContext. The application’s driver program launches parallel operations on executor JVMs that run either in a cluster or on the same machine. PySparkShell is the driver program when you run Spark locally.

Code

  • Start a Spark application by importing PySpark, creating a spark context as sc and printing out the type of sc.

import pyspark

sc = pyspark.SparkContext('local[*]')

  • Type of spark context

type(sc)

  • The dir() function gets you a list of all the attributes and methods accessible through the sc object

dir(sc)

  • The help() function prints more detailed documentation for the sc object, including its methods

help(sc)

  • SparkContext’s documentation page

https://spark.apache.org/docs/0.6.0/api/core/spark/SparkContext.html

  • Number of cores being used

sc.defaultParallelism

  • Current version of spark

sc.version

  • Name of the application currently being used in the Spark environment

sc.appName

  • Shut down SparkContext

sc.stop()

Once you shut it down, you can no longer access Spark functionality until you start a new SparkContext.

RDD — Resilient Distributed Datasets

RDDs are the fundamental data structures of Spark. An RDD is Spark’s representation of a set of data that is spread across multiple machines, with APIs for operating on it. RDDs can be built from any data source, including text files, databases, JSON files, etc.

What are RDDs?

  • Resilient — they have built-in fault tolerance, which means that if one of their nodes goes offline, the RDD can restore the data. This is in contrast to a standard computer, which loses the information if it dies while performing the operation.
  • Distributed — the data is contained on multiple nodes of a cluster-computing operation, which allows for parallelism.
  • Dataset — the dataset is partitioned across multiple nodes.

RDDs are the building blocks that higher-level Spark operations are built upon; if you are performing an action in Spark, it is probably using RDDs. The first thing to know about RDDs is that they are immutable: once an RDD is created, it cannot be modified. The second thing to know is that they are lazily evaluated: an RDD will not be evaluated until an action is triggered. This lets Spark organize a chain of transformations into a single plan, and it saves unnecessary computation and memory load. The third thing to know is that they are in-memory: operations are performed in memory rather than in a database, which allows Spark to perform fast operations on big data.

First you create a base RDD and then perform operations on it. Each transformation of an RDD creates a new RDD, and then you can apply actions to the result.

  • Transformations — they create a new dataset from an existing one by passing the dataset through a function, with a new dataset as the output. A transformation is an operation on an RDD that returns a new RDD.
  • Actions — they return the final results of RDD computations. An action returns a value to the Spark driver. A short example of both follows this list.
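
To make the distinction concrete, here is a short hedged example (assuming the sc created earlier): the map and filter calls are transformations that only build new RDDs, and nothing is actually computed until the collect action at the end.

rdd = sc.parallelize(range(10))

# Transformations: each call returns a new RDD and is evaluated lazily.
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Action: only now does Spark run the computation and return a value to the driver.
print(evens.collect())  # [0, 4, 16, 36, 64]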

Code

  • To initialize an RDD, first import pyspark, then create a SparkContext and set it to the variable sc. Use 'local[*]' as the master.

import pyspark

sc = pyspark.SparkContext('local[*]')

  • The .parallelize() method creates an RDD whose data is distributed across multiple cores. Here nums is an example collection (defined below) and numSlices=10 gives the RDD 10 partitions.

nums = list(range(1000))  # an example collection to distribute
rdd = sc.parallelize(nums, numSlices=10)
print(type(rdd))

  • See how many partitions are being used

rdd.getNumPartitions()

  • Total count of items in the RDD

rdd.count()

  • Returns the first item in the RDD

rdd.first()

  • Returns the first n items in the RDD

rdd.take(n)

  • Returns the top n items in the RDD

rdd.top(n)

  • Returns everything from your RDD. If you are dealing with big data, this can crash your machine, because all of the data is pulled back to the driver

rdd.collect()

Machine Learning with Spark

There are two different libraries for machine learning in Spark, mllib and ml, and they are very similar to each other. Mllib is built on RDDs, while ml is built on the higher-level Spark DataFrames, which have methods and attributes similar to pandas. The ml library is the one under active development and will be used more than mllib going forward. Neither library has as many features as pandas and scikit-learn.

Previously, SparkContext has been the main way to connect to a Spark application. Now I will introduce SparkSession, the entry point for Spark’s DataFrame and SQL functionality. It is the bridge between Python and the Spark application, is built on the Spark SQL API and sits at a higher level than RDDs.

Code

  • Necessary libraries

from pyspark import SparkContext

from pyspark.sql import SparkSession

  • Creating a SparkSession

spark = SparkSession.builder.master('local').getOrCreate()

  • Reading in a pyspark dataframe

spark_df = spark.read.csv('./forestfires.csv', header=True, inferSchema=True)

Most other methods work much like those of a regular (pandas) data frame, such as .head(), .columns, etc.
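
For example, assuming the forestfires.csv file loaded above (the column names temp, wind and area come from the UCI forest fires dataset and may differ in your copy), a few pandas-like calls look like this:

spark_df.columns                          # list of column names
spark_df.head(3)                          # first three rows as Row objects
spark_df.select('temp', 'wind').show(5)   # display two columns, five rows
spark_df.describe('area').show()          # summary statistics for one column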

PySpark’s developers say they used scikit-learn as an inspiration for their implementation of a machine learning library. These are the three main concepts in the ML library.

  • Transformer — an algorithm that transforms one PySpark DataFrame into another DataFrame.
  • Estimator — an algorithm that can be fit on a PySpark DataFrame to produce a Transformer (a fitted model).
  • Pipeline — similar to a scikit-learn pipeline, it chains together different stages; a short sketch follows this list.
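
As a hedged sketch of how these three pieces fit together, the snippet below assumes the spark_df loaded from forestfires.csv above, with numeric feature columns temp, RH, wind and rain and the target column area (an assumption about that dataset; swap in your own columns as needed).

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Transformer: assembles the chosen numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=['temp', 'RH', 'wind', 'rain'],
                            outputCol='features')

# Estimator: fitting it produces a Transformer (a fitted regression model).
lr = LinearRegression(featuresCol='features', labelCol='area')

# Pipeline: chains the stages together, much like a scikit-learn pipeline.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(spark_df)
model.transform(spark_df).select('area', 'prediction').show(5)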

Summary

  • Big Data refers to datasets that fit the three V’s; the data is so big that it cannot be handled by traditional database management systems.
  • The data can be up to petabytes in size.
  • MapReduce splits the data into smaller sets and distributes the work over several machines.
  • Install Docker and Kitematic in your environment.
  • Test the installation with a small PySpark script before going further.
  • Create a SparkContext() when working with PySpark.
  • RDDs are the fundamental building blocks of Spark operations.
  • Machine learning on big data can be done in Spark using the ml library.

