Spark with Python


Spark With Python

Athul Dev

Preface

The creation of this book has been, for me, Athul Dev, an exciting personal journey in discovering how Apache Spark can be used today for Big Data Analytics and Machine Learning. What we have in abundance in today's era is data itself, and in the coming years it will, without doubt, keep expanding at an exponential scale. When we have such abundant data from various sources, we can extract insights that could have a major impact on our society. When working with data, say healthcare data, even a small insight can mean a ray of hope for someone's life. That is what makes this work so special and gives it a purpose.

To find insights or learn something from the data we need tools, and that is when Apache Spark comes to our rescue. There are many tools that could meet our requirements, but since Spark can work with large amounts of data with ease, that is the one we will learn. This book focuses on the hands-on implementation of Spark using Python from scratch. We will learn and understand the fundamental concepts of Spark and Machine Learning, go through the steps of setting up Spark on our machines in detail, and brush up on our Python scripting skills along the way. To make the concepts easier to understand, most sections include example code snippets together with their outputs, showing the actual results of the respective snippets for easy comparison and code mapping. I sincerely hope you enjoy learning and discovering the exciting possibilities of Spark and have as much fun with it as I did writing this book.

To bring clarity to our code snippets, I have adopted certain colorization conventions. Components of the Python programming language like keywords and numerics are colored green, variables are colored black, strings are colored orange, and comments are colored light blue.

#Example of Code Color Representation
text = "Welcome to the World of Possibilities"
print(text)

Additionally, to make the source code and the data of the code snippets used in the book easy to access, a colored icon is provided with the corresponding link; clicking it redirects you to those files, which you can download and use. This should help you perform quicker implementations and give you an idea of what the data actually looks like so you can relate it to the code.

For convenience I have placed all the source code files and their respective data from the code snippet examples featured in this book on GitHub. You can access these materials from here.

www.github.com/athul-dev/spark-with-python

About this Book

Nowadays the internet is an integral part of our lives. Right from the moment we wake up, we indulge in the world of the internet, creating a Facebook post or watching a YouTube video, and in the process we create data. Now think of the entire human population participating in this process of creating data every day, every minute and every second; that is a lot of data. Storage is an issue, but the bigger issue is managing this data: it is difficult and confusing to handle it and to extract insights from it that improve the user experience and provide society with the precise information it requires. But how do we handle this data, and how do we get insights from it?

Before answering that, let us virtually visit a hospital. There we see patients waiting in long queues and paying large sums of money to avail various medical services. With the amount of historical medical data available to us, how can we handle it and extract insights that would, in turn, help the patients who need these services get them faster and more cheaply? We can achieve this by making diagnostics easier for doctors or making medical equipment function better, and all of this can be done by handling the respective medical data and finding insights in it. In a similar fashion we can find insights for various problems in society and address problems in industries like aviation, transportation, and automobiles.

Now we understand the importance of data and the need to handle and process it. To do so we need tools that help us perform various operations on data, and one such powerful tool is Apache Spark. Therefore, in this book we will learn about Apache Spark, how to handle data with Spark's DataFrames, and how to obtain insights and make predictions using Machine Learning with Spark. The book is designed to start from scratch: understanding the fundamentals, going through the step-by-step installation of Spark, brushing up our Python skills for Spark, working with data in Spark, and finally entering the Machine Learning section with Spark. This book can be easily followed by anyone, with or without a programming background, but on completing it, I am sure my readers will be confident writing programs in Python and will also be in a position to write Machine Learning scripts using Python and Spark. Since every concept and topic is demonstrated using code snippets and their outputs, it is really easy to follow along and execute the same.

Contents

1 Introducing the Spark to our Life
  1.1 What is "Big Data"?
  1.2 Local v/s Distributed Systems
  1.3 Hadoop Overview
    1.3.1 Hadoop Distributed File System – HDFS – Distributed Storage
    1.3.2 MapReduce with Hadoop
  1.4 Spark
    1.4.1 Spark v/s MapReduce
    1.4.2 Spark RDDs
    1.4.3 Spark DataFrames

2 Setting up our Tools
  2.1 Spark on Ubuntu VirtualBox – Local Setup
    2.1.1 Setting up Ubuntu on VirtualBox
    2.1.2 Installing Python, Jupyter Notebook and Spark onto Ubuntu
  2.2 Spark on Databricks Notebook System
    2.2.1 Setting up our Databricks Account
    2.2.2 Setting up Spark and uploading some Data to our Cluster in Databricks

3 Python Essentials for Spark
  3.1 Python Basics
    3.1.1 Numbers, Strings and Casting in Python
    3.1.2 Lists and Dictionary in Python
    3.1.3 Tuples and Sets in Python
  3.2 Controlling the flow of execution in Python
    3.2.1 Operators in Python
    3.2.2 if, if else and elif
    3.2.3 Looping using While
    3.2.4 Looping with For Loops
  3.3 Functions, Map, Filter & Reduce and List Comprehensions in Python
    3.3.1 User defined Functions
    3.3.2 Lambda Functions
    3.3.3 map(), filter() and reduce()
    3.3.4 List Comprehensions

4 Spark-ling Journey Begins
  4.1 Basics of Spark DataFrames
  4.2 Working with Rows and Columns in Spark Dataframe
    4.2.1 Working with Columns in Spark
    4.2.2 Working with Rows in Spark
  4.3 Using SQL in Spark Dataframes
  4.4 Operations and Functions with Spark Dataframes
    4.4.1 Filter Function
    4.4.2 GroupBy Function
    4.4.3 Aggregate Function
    4.4.4 orderBy Function
    4.4.5 Using Standard Functions in our Operations
  4.5 Dealing with Missing Data in Spark Dataframes
    4.5.1 Dropping the rows or data points which contain null values
    4.5.2 Filling the null values with another value
    4.5.3 Filling calculated values in place of missing or null values
    4.5.4 Replacing specific value(s) in the DataFrame by another value
  4.6 Working with Date & Time in Spark Dataframe

5 Machine Learning with Spark
  5.1 Supervised Learning
  5.2 Unsupervised Learning
  5.3 Spark MLlib

6 Supervised Learning with Spark
  6.1 Linear Regression with Spark
    6.1.1 Linear Regression with Single Variable
    6.1.2 Mathematics behind Linear Regression
    6.1.3 Sneak peek on the Gradient Descent Algorithm
    6.1.4 Gradient Descent for our Linear Regression
    6.1.5 Evaluation Metrics for Linear Regression
    6.1.6 Linear Regression with Multiple Variables
  6.2 Logistic Regression with Spark
    6.2.1 Hypothesis Representation of Logistic Regression
    6.2.2 Decision Boundary
    6.2.3 Cost Function
    6.2.4 Simplified Cost Function for Logistic Regression
    6.2.5 Gradient Descent for Logistic Regression
    6.2.6 Evaluating Logistic Regression
    6.2.7 ROC Curve based Evaluation for Binary Classification
    6.2.8 Pipelines

7 Tree Methods with Spark
  7.1 Decision Trees
    7.1.1 Splitting Process
    7.1.2 Entropy and Information Gain
  7.2 Random Forests
  7.3 Gradient Boosted Trees

8 Unsupervised Learning with Spark
  8.1 K-Means Clustering Algorithm
    8.1.1 Simplified Mathematical Representation of the K-means clustering algorithm
    8.1.2 Determining the Number of clusters to be used for our K-means algorithm
    8.1.3 Elbow Method to determine the correct number of clusters
    8.1.4 Silhouette Score to determine the correct number of clusters
    8.1.5 Scaling of Data for the K-Means Algorithm

9 Natural Language Processing for Spark
  9.1 Basic Understanding of Regular Expressions for our Natural Language Processing (NLP)
    9.1.1 Metacharacters and Special Sequences
    9.1.2 Special Sequences
    9.1.3 Sets
  9.2 Spark Tools for Natural Language Processing
    9.2.1 Tokenization
    9.2.2 Stop Words Removal
    9.2.3 TF-IDF (Term Frequency – Inverse Document Frequency)
    9.2.4 Performing TF-IDF with the CountVectorizer transformer

CHAPTER 1
Introducing the Spark to our Life

I agree that introductions are quite boring, but when we observe the rate at which the volume, variety, and velocity of data are increasing, it puts us in a difficult situation when handling this data, and we end up saying "this is boring". To avoid this, let us introduce Spark into our lives to help us handle and analyze this data with ease. Without much suspense, let us jump right into the fields of "Data Analytics" and "Machine Learning", and, oh yes, if the data is too big to fit on our hard disk, let's just call it "Big Data Analytics".

As the title says, we will be learning the most powerful, the most exuberant, the most popular, the friendliest tool, and the one that is faster than the Flash: it is none other than Apache Spark, which is very well suited to handle the Big Data that we have got. Before we get our hands dirty working with Apache Spark, let us understand some basic concepts that are important if we want to apply our skills to solve real-life problems. There should not be a scenario where we write efficient Spark scripts to tackle big data problems and then, when someone asks us what exactly Big Data is and how Spark is better than Hadoop, we start talking about comets and stars; rather, we should be in a position to convey it clearly without any hassle. Therefore, in order to get a clear-cut understanding, we shall sprint through a few important topics and get a bird's-eye view of the fundamental concepts.

1.1 What is "Big Data"?

Big Data, Bigg Data, Big Dataa: it is a term that echoes around the workspace and constantly buzzes in the minds of Data Engineers. As a matter of fact it is just a catchy phrase, and it means any quantity of data that cannot fit on a local computer for processing with respect to the RAM. For example, if we have a 1 TB hard disk and 16 GB of RAM and our dataset is around 1-15 GB, this is not a Big Data problem, because the data can fit entirely within the limits of the RAM for faster processing; treating such a scenario as a Big Data problem would itself create a big problem. If instead we have 1 TB of data in place of 1-15 GB, then it is a Big Data problem with respect to the volume of the dataset, and that is when our champ Apache Spark comes into action.

Generally the term Big Data is associated with three V's with respect to the data one deals with: "Volume – Variety – Velocity". In simple terms, Volume is the amount of data you have in GBs, TBs, PBs and so on. Variety concerns the kind of data we are dealing with: whether it is a structured database, like the kind we have in Excel, or more like images, tweets or social network data, which are categorized as unstructured data. The third V, Velocity, is based on the frequency or momentum of the data: whether it is already available on the disk as a batch, or whether it is real-time streaming data.

Anyway, when we have this large, beautiful data which cannot fit into our local system for processing, we should not just ignore it and stop working on the problem; rather we can use the concept of distributed systems to tackle the data volume issue. Let us quickly understand what distributed systems really mean.

1.2 Local v/s Distributed Systems

A local system is probably the one we are used to or are currently using. It is a single, isolated computer that shares the same RAM and hard drive. Local systems are always constrained by the number of cores available to them, and processing happens with respect to those available cores. The important catch here is that local processes use the computational resources of a single machine.

A distributed system can be thought of as a master-slave setup where we have one main computer, the master node, and the data and the processes are distributed onto the other computers, the slave nodes, which are typically weaker machines compared to the master node. In this kind of setup we can leverage the combined processing power and storage of the slave nodes, making it more efficient than a single powerful local system. And Spark is a specialist at setting up and handling this kind of setup.

Figure 1.1: Local System versus Distributed System Setup


Let us see why Spark favors this kind of setup. A distributed process has access to the computational resources of a number of machines connected through a network. Also, after a certain point we might have so much data that we want to add more machines to the network to handle it, and in this kind of setup it is easier to scale out to many lower-CPU machines than to try to scale up a single high-CPU machine, as we would in a local system setup.

A distributed system also takes care of fault tolerance: when one machine fails, the whole network can still carry on, whereas in a local setup any error brings the entire system down. One might wonder how this is even possible, since every node holds some data; if one node goes down, the data stored on that failed node cannot be processed, and we would get wrong results. To our rescue comes the concept of data replication: the data on every node is replicated to two other nodes, and on the failure of a node, the computation on the failed node's data is carried out elsewhere. So let us look at a typical distributed architecture that uses a framework like Hadoop.

1.3 Hadoop - Overview

Hadoop is the big player out there when it comes to Big Data. Hadoop is a framework that facilitates the whole process of storing, distributing, processing, and analyzing large data. It is a way to distribute very large files, or big data, across multiple machines or commodity hardware, forming a cluster-like setup. A few of the most integral components of the Hadoop framework are discussed below.

Hadoop uses the Hadoop Distributed File System (HDFS), which allows us to work with these large sets of data. HDFS is the primary data storage system used by Hadoop applications. HDFS also duplicates blocks of data for fault tolerance; remember the example of one machine going down while we still get correct results? Yes, exactly that technique.

Hadoop also uses MapReduce, a programming model suitable for processing large (big) data. MapReduce programs are parallel in nature, which makes them very useful for performing large-scale data analysis using multiple machines in the cluster. So, in a gist, HDFS is the way we get a big dataset distributed across multiple machines, and MapReduce is the idea that allows computations across that distributed dataset.


1.3.1 Hadoop Distributed File System – HDFS – Distributed Storage

HDFS likes the master-slave concept very much and, no surprise, it uses it: the master node, called the Name Node, has its own CPU and RAM and controls the distribution of storage and calculations to the slave nodes, which are called Data Nodes. These poor guys, who do most of the hard work, have their own CPUs and RAM as well.

Figure 1.2: HDFS Setup Overview

So, what HDFS does is split data into blocks with a size of 128 megabytes by default; each of these blocks is replicated 3 times, and the blocks are distributed in such a way as to enable the fault tolerance mechanism. Hence a machine can go down, but our data stays strong. These smaller blocks of data also enable more parallelization during processing, and the multiple copies of a block prevent loss of data even on failure of a node.
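To make the numbers concrete, here is a small back-of-the-envelope sketch (an illustrative addition, not one of the book's numbered snippets) that assumes the default 128 MB block size and a replication factor of 3, and works out how many blocks and stored copies a hypothetical 1 GB file would need.

import math

file_size_mb = 1024        # a hypothetical 1 GB file, just for illustration
block_size_mb = 128        # HDFS default block size
replication_factor = 3     # each block is replicated 3 times by default

blocks = math.ceil(file_size_mb / block_size_mb)   # number of blocks the file is split into
stored_copies = blocks * replication_factor        # total block copies kept across the cluster
print(blocks, stored_copies)                       # 8 24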

1.3.2 MapReduce with Hadoop

MapReduce is a way of splitting computational tasks over a distributed set of files, such as those on the Hadoop Distributed File System, and it comprises a Job Tracker and multiple Task Trackers; Spark also incorporates the idea of trackers. In MapReduce, the Job Tracker sends the code to run on the Task Trackers, and the Task Trackers allocate CPU and memory for the tasks and monitor them on the worker nodes. We can visualize the Job Tracker as the manager, the Task Trackers as the team leaders, and the worker nodes as the engineers who actually work on the tasks.

Figure 1.3: MapReduce Setup Overview
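To get a feel for the map and reduce idea itself, here is a tiny, purely conceptual Python sketch (it is not Hadoop code and not one of the book's numbered snippets): the map step emits (word, 1) pairs and the reduce step sums the counts per word, which is exactly the kind of work the Task Trackers perform in parallel on their portions of the data.

lines = ["spark is fast", "hadoop and spark", "spark is easy"]

# "Map" step: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# "Reduce" step: sum the counts for each word
counts = {}
for word, one in mapped:
    counts[word] = counts.get(word, 0) + one

print(counts)  # {'spark': 3, 'is': 2, 'fast': 1, 'hadoop': 1, 'and': 1, 'easy': 1}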

In short, what we covered can be thought of as two distinct steps: first distributing large datasets, and then using MapReduce to distribute a computational task over the distributed dataset. So let us dive directly into the interesting part, the latest technology in this space, known as Spark. Spark improves on this idea of distributed processing. We shall understand Spark in depth and also see how we can leverage the power of Spark over MapReduce. We will learn about Spark Resilient Distributed Datasets (RDDs), a core data structure of Spark, and understand how to use Spark DataFrames in detail. All these Spark concepts enable us to quickly and easily handle Big Data and Machine Learning problems.

1.4 Spark

Presenting the most awaited topic: Spark. Spark is an open-source Apache project, which means that we have access to all the source code, and how is that for a treat! It was initially released in February 2013 and has exploded in popularity due to its ease of use and its speed. One can think of Spark as a flexible alternative to MapReduce. If we find MapReduce difficult or we encounter challenges writing MapReduce programs, there is no need to look for anything other than Spark; it is the best alternative to MapReduce. We should not think of it in terms of Hadoop versus Spark, but rather MapReduce versus Spark. Spark can also use data stored in a variety of systems, such as Cassandra, AWS S3, HDFS, and more.


1.4.1 Spark v/s MapReduce

MapReduce and Hadoop generally go hand in hand, so MapReduce requires files to be stored in HDFS, whereas Spark does not have that constraint; it can work with a wide variety of data formats, including HDFS. The most common claim we read when we search for Spark on the internet is that Spark can perform operations up to 100 times faster than MapReduce, and this is one of its main selling points: 100 times faster is a huge leap ahead in terms of performance. Let us see how Spark achieves this speed. MapReduce writes most of the data to disk after each map and reduce operation, which is very expensive, whereas Spark keeps most of the data in memory after each transformation and also has the ability to spill over to disk if the memory is filled.
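As a small hedged sketch of what "keeping data in memory" looks like in practice (it assumes a running SparkContext named sc and a hypothetical input file, and is not one of the book's numbered snippets), we can ask Spark to cache an RDD so that repeated actions reuse the in-memory data instead of recomputing it from the source.

# Assumes an existing SparkContext `sc`
rdd = sc.textFile("some_big_file.txt")        # hypothetical input path
words = rdd.flatMap(lambda line: line.split())
words.cache()                                 # keep the result in memory after the first action

print(words.count())   # first action: computes the RDD and caches it
print(words.count())   # second action: served from memory, no recomputation from disk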

1.4.2 Spark RDDs

Now let us focus on Spark in particular. At the core of Spark is the idea of an RDD, a Resilient Distributed Dataset. An RDD has four major features:

- It is a distributed collection of data
- It is fault-tolerant
- It supports parallel operation
- It can use many data sources

RDDs are immutable and lazily evaluated, which means execution does not start until an action is triggered. This has various advantages: it increases manageability and reduces complexity by allowing users to freely organize their Spark program into smaller operations, reducing the number of passes over the data by grouping operations. It also saves computation and increases speed by avoiding calculation overhead, since a value need not be calculated until it is used or addressed; only the necessary values are computed. It also saves trips between the driver and the cluster, which speeds up the process.

There are basically two types of Spark operations:

- Transformations
- Actions

Transformations are general instructions, a recipe to follow, while actions actually perform what the recipe says and return the tasty dish as the result. This behavior carries over to the syntax when coding: most of the time we write a method call, but we will not see any result until we call an action.
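Here is a minimal sketch of that behavior (an illustrative addition, not one of the book's numbered snippets), assuming a local SparkSession: the map and filter calls are transformations, so nothing is computed until the collect action is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))                 # an RDD of the numbers 1 to 10
squares = numbers.map(lambda x: x * x)                 # transformation: the recipe, nothing runs yet
even_squares = squares.filter(lambda x: x % 2 == 0)    # another transformation, still lazy

print(even_squares.collect())   # action: triggers the actual computation
# [4, 16, 36, 64, 100]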


In the context of Big Data this makes sense: with such large datasets, we do not want to compute all the transformations until we are sure we need to perform them. Along our journey of learning Spark and its syntax, we will see Spark's RDD syntax versus its DataFrame syntax. With the release of Spark 2.0, Spark has been moving towards a DataFrame-based syntax. The point to keep in mind is that the way files are physically distributed can still be thought of as RDDs; it is only the syntax we type that has changed.

1.4.3 Spark DataFrames

A Dataset is a distributed collection of data, and a DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a pandas DataFrame in Python, but with richer optimizations under the hood. A Spark DataFrame is a distributed collection of data organized under named columns that provides operations to filter, group, process, and aggregate the available data. Spark DataFrames are also the standard way of using Spark's Machine Learning capabilities, which we will deal with in the upcoming chapters. DataFrames are designed to be multifunctional, and we need them to facilitate the following (a short illustrative sketch follows this list):

- Multiple data sources
- Multiple programming languages
- Processing structured and semi-structured data
- Slicing and dicing the data
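As a quick illustrative sketch (an addition, not one of the book's numbered snippets), assuming a SparkSession is available, a DataFrame with named columns can be created directly from a small Python list and then filtered; this is a preview of the operations covered in the upcoming chapters.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

people = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]    # made-up sample rows
df = spark.createDataFrame(people, ["name", "age"])     # named columns: name, age

df.printSchema()                   # shows the column names and their inferred types
df.filter(df["age"] > 30).show()   # keeps only the rows where age is greater than 30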

OK, that is enough overview and theory; now let us jump right into setting up and sharpening our tools in order to build and craft our Spark competency.


CHAPTER 2
Setting up our Tools

In this chapter let us explore the various ways to set up a working environment for Spark. We will mainly focus on two options; you can go through both and choose the one that suits you better and works best for you. Both installation options will work on any operating system, because we will either use VirtualBox to set up a Linux-based system locally or connect online to a Linux-based system. The reason we focus on Linux is that, realistically speaking, Spark will generally not be running on a single machine; that is basically the whole point of Spark. Since the data will be so large that it no longer fits on a single machine, we will need Spark to run on a cluster or on a service like AWS or Google Cloud, and these cluster services will pretty much always be Linux-based systems. The two methods that we are going to explore are:

- Setting up Ubuntu, Python, and Spark on VirtualBox: in this method we set up VirtualBox on our local computer, irrespective of its OS, and install Ubuntu, Python, and Spark locally on this virtual machine.
- Setting up the Databricks Notebook System for Spark: in this method we set up a freely hosted Databricks notebook, which is similar to a Jupyter notebook, and run our Spark online.

2.1 Spark on Ubuntu VirtualBox – Local Setup

In this section we shall first see the steps to download and set up the VirtualBox software, which lets us run another OS, in our case Ubuntu, on top of our currently installed OS, say Windows or Mac. If our system is already running Ubuntu, we can skip the VirtualBox setup step and jump directly to the Python and Spark setup for Ubuntu section.

Installation Video Link – https://youtu.be/N8-3kCscjrU

2.1.1 Setting up Ubuntu on VirtualBox

First we need to download the VirtualBox software and the Ubuntu OS from the internet. VirtualBox can be downloaded from the download section of the www.virtualbox.org website, and after downloading the executable we can install it with the default settings via the installer. Similarly, to get the Ubuntu OS, we can go to the download section of the www.ubuntu.com website and get the Ubuntu Desktop version.

Alright, once we have VirtualBox installed and the Ubuntu OS image file downloaded, we can proceed to the next step, which is installing Ubuntu onto this VirtualBox.

Steps to install the Ubuntu OS onto VirtualBox:
1. First we open the installed VirtualBox software. It opens the VirtualBox Manager window.
2. We then click on the 'New' option in the VirtualBox Manager window.
3. A new window appears asking for the name and the type of the OS to be installed. Here we can give a name to identify the OS. The Type should be Linux and the Version should be Ubuntu.
4. After filling in the Name and Type, we click Next. A window appears where we allocate the memory size, or RAM, for the OS/virtual machine. The RAM can be allocated depending on our machine's installed RAM; it is recommended to provide about half of the host machine's RAM to the virtual machine for better performance.
5. After allocating the RAM and clicking Next, we are prompted to set up a hard disk for the new OS. We do this by selecting the "Create a virtual hard disk now" option, which creates a hard disk consuming around 8.00 GB of our installed storage. On clicking Next, in the next window we select VDI (VirtualBox Disk Image) as the hard disk file type to successfully set up the hard disk for the new OS.
6. On clicking Next after the hard disk file type selection, we choose the type of storage allocation on our physical hard disk. We have two options: a dynamically allocated hard disk file or a fixed size hard disk file. It is recommended to select the fixed size option, because the dynamically allocated option tends to slow down the input-output speed to some extent.
7. Finally, after clicking Next following the hard disk file type selection, we allocate the amount of storage we want the new OS to have. It is ideal to provide around 20-30 GB of storage for storing some files and accounting for the OS itself.
8. We then click the Create button, and it takes a few minutes to set up the virtual machine.
9. After setting up the virtual machine, we have to install the Ubuntu OS onto it. For this we click the Start option, or double-click the virtual machine configured in the steps above, in the VirtualBox Manager window.
10. On starting the virtual machine for the first time, a "select start-up disk" window appears asking us to provide a virtual optical disk file or an image file. This is where the Ubuntu OS ISO file we downloaded comes into the picture. We can select the ISO file from the drop-down if it is present, or else browse for it and select it. Then we click the Start button, and this installs Ubuntu onto our virtual machine.

A few points that will help us avoid any confusion while installing Ubuntu onto the virtual machine:
- When the installation asks us to choose between Try Ubuntu and Install Ubuntu, we select Install Ubuntu.
- We can select "Download updates while installing Ubuntu" to start using it right away after the install.
- When it asks whether we want to "Erase disk and install Ubuntu", we can select this option. Note that it will not delete any of the files present on our computer; it only erases or deletes files within the scope of our virtual machine.

Finally, our Ubuntu OS is ready for action. Next we shall equip it to run Spark; for this we shall install Python, Jupyter Notebook, and Spark itself onto our Ubuntu.

2.1.2 Installing Python, Jupyter Notebook and Spark onto Ubuntu

Let us start with Python first. Generally, Ubuntu comes with Python built in. To check whether it is installed, we can open the Ubuntu terminal and type the command python3; if it is already installed, details such as its version are displayed in the terminal. For our learning we use the Python 3 version. Once we see this, we can be sure Python is installed and we are ready to go.

Next, let us install the Jupyter Notebook to work with ease. For this we need to follow the steps below:
1. Install pip for Python 3, which is used to install various tools or packages for Python. This can be done using the command:
sudo apt install python3-pip
2. Next, we install Jupyter by giving the command:
pip3 install jupyter
3. Once the download and install are complete, we add it to the path so we can easily start the Jupyter notebook:
export PATH=$PATH:~/.local/bin
4. Once the path is exported, we can check whether it has been installed correctly by giving the command:
jupyter notebook

So far we are all good with our Jupyter Notebook and Python installation. All we have to do next is download Spark and connect it with Python, so let us go ahead and do that. The following steps connect Spark with Python and get it all running together.
1. We first need Java installed on our machine. For this we give the following commands in the terminal:
sudo apt-get update
sudo apt-get install default-jre
We can then check whether Java was installed correctly by giving the command:
java -version
2. Once Java is successfully installed, we go about installing Scala. We can do this with the following command:
sudo apt-get install scala
We can then check whether Scala is installed correctly with:
scala -version
3. The next thing we need to do is install a library called py4j, which connects Java and Scala with Python. We can install py4j with:
pip3 install py4j
4. Finally we are ready to install Spark and Hadoop. For this we go to the Spark website, https://spark.apache.org/, and click on the download section. On the downloads page we choose a stable Spark release version, select the package type "Pre-built for Apache Hadoop 2.7", and download it to our computer.
5. Once the Spark and Hadoop package is downloaded, we move it to our home folder to access it with ease. Next we unzip the archive, or tgz file, with the following command:
sudo tar -zxvf downloaded_spark_filename
Example: sudo tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz
This unzips the Spark files and gives us the folder that we need to work with.
6. Next we have to connect Python with Spark. To do this we provide the following commands:
export SPARK_HOME='/home/ubuntu/spark-2.4.5-bin-hadoop2.7'
Note: The Spark folder name can change depending on the version we have downloaded. Make sure you provide the correct name; the name we provide here is the extracted folder name from step 5.
export PATH=$SPARK_HOME:$PATH
Note: This tells the shell where to find Spark Home and adds it to the Path.
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
7. Finally, to connect and work with the Jupyter notebook for Spark, we provide a few more commands as shown:
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3

And Voila! We are all done with our installation process.

Note: Generally the Spark folder will be locked and we may face some permission issues when using the Jupyter notebook with Spark. To fix any possible permission issues, we change the access rights for these folders with the following commands:
1. sudo chmod 777 spark-2.4.5-bin-hadoop2.7
Syntax: sudo chmod 777 name_of_the_folder_or_file. Make sure the name matches the Spark version that we have installed.
2. cd spark-2.4.5-bin-hadoop2.7/
sudo chmod 777 python
cd python/
sudo chmod 777 pyspark

Now let us do the final success check of our installation by navigating to the python directory under our Spark installation folder, opening the Jupyter notebook, and importing pyspark there. For this we can:
1. Open the terminal and navigate to the python directory under the Spark folder:
cd spark-2.4.5-bin-hadoop2.7/python
2. Then open the Jupyter notebook by giving the command:
jupyter notebook
3. Once the Jupyter notebook is open, inside the Jupyter home click on New and create a new Python 3 notebook.
4. Inside a cell, type the command below and execute it:
import pyspark
Once pyspark is imported successfully, we are done with the installation. That's it, we have successfully installed Spark and we are ready to start working with it.

Note: Any time in the future, if we want to work with pyspark and Jupyter, we can just follow the 4 simple steps shown above.

Currently we are able to import pyspark by navigating to the python folder under the Spark installation folder, launching the terminal from that location, and opening the Jupyter notebook from there. But this is a tedious way to do it, so let us make it simple and be able to import pyspark from anywhere, irrespective of the directory we are in. To do that we perform the following steps:
1. To find Spark easily, we install a library called findspark, which can be done with:
pip3 install findspark
2. Next we use findspark to connect to our installed Spark. For this we need the full path of our Spark install, which we can get with the following commands:
cd spark-2.4.5-bin-hadoop2.7/
pwd
After executing pwd we get the full path, which will look something like '/home/your_username/spark-2.4.5-bin-hadoop2.7'. We copy this full path.
3. Next, we navigate back to the home directory with:
cd
4. Next, we open Python by giving the command:
python3
5. We import the findspark library and connect to pyspark by giving the full path copied in step 2:
import findspark
findspark.init('paste_copied_full_path_here')
6. And that is it; we can now check this by importing pyspark with:
import pyspark

The same steps can be followed if we are using a Jupyter notebook started from the home directory or any other directory location. And yes, that is it. Now we have dealt with all the issues and are done with the complete installation of Spark.
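As one final optional check (a hedged sketch under the assumption that the findspark steps above succeeded; it is not one of the book's numbered snippets), we can actually start a SparkSession and print its version, which confirms that Spark itself, and not just the pyspark package, is reachable.

import findspark
findspark.init('paste_copied_full_path_here')   # same placeholder path as in step 5 above

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("install-check").getOrCreate()
print(spark.version)   # prints the installed Spark version, e.g. 2.4.5
spark.stop()           # shut the session down when done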

Next, let us see another way to install Spark, this time not locally but online.

2.2 Spark on Databricks Notebook System

Databricks provides clusters that run on top of Amazon Web Services. It is a web-based platform for working with Spark and adds the convenience of a notebook system that is already set up, with the ability to quickly add files from our local storage, Amazon S3, and so on. The Databricks version we are going to use for our learning is the Databricks Community Edition. This community edition comes with a 6 GB cluster, which is sufficient for our learning and can later be scaled up depending on our needs. The Databricks setup is the quickest and easiest way to set up and start working with Spark; it can also be accessed anywhere, anytime, as it is a web-based platform. Now let us see how to set up a Databricks account, how to set up our pyspark, and how to upload our data so we can access it in our cluster.

2.2.1 Setting up our Databricks Account
1. First we have to sign up to Databricks to access the community edition. This can be done by going to the website https://databricks.com/try-databricks and providing some basic information like our name and e-mail address, which will later be used to log in to the Databricks platform.
2. Once we click the sign-up button, we receive a confirmation mail at the e-mail address provided at signup. In this confirmation mail there is a hyperlink that confirms our e-mail or identity and takes us to the Databricks platform.


3. And that's it, we are all set to use Databricks. The next time we want to log in, we can just go to the Databricks login page, enter our credentials, and resume our work.

2.2.2 Setting up Spark and uploading some Data to our Cluster in Databricks
1. Once we are logged in to Databricks and are at the home page, we will be able to view some featured notebooks and some documentation that we can go through. But for now, to set up and work with Spark, we first need to create a cluster.
2. To create a cluster, we go to the New section and click on the Cluster option. A new page loads where we fill in some details, such as the cluster name we want to use to identify our cluster and the Apache Spark version we want to use. The rest can be kept as is, and then we click the Create Cluster button; our cluster will be created.
3. Once we create our cluster it takes about a minute or so to get into the running state. Once its status is running, we go to the Databricks home and, from the New section, click on the Notebook option. We are prompted to give a name for the new notebook and to select the language we prefer; we select Python, as we are going to work with Python. It also asks us which cluster we want to use, and here we select the cluster created in the previous step. Then we click the Create button.
4. On clicking Create, a notebook is created for us and we are redirected to it. To check that we have done the setup correctly and everything is working and ready to go, we try to import pyspark: we type the command 'import pyspark' in a notebook cell and execute it by pressing Shift+Enter or clicking the play button for that cell.
5. Once pyspark is imported successfully, we are good to go and our setup is a success.
6. To work with Spark we need some data on the Databricks platform, so let us quickly import some data and use it in our notebook. For this we click on the Tables option in the menu bar and then click Create Table. We are prompted to select the data source from which we need to fetch or upload the data; the options include File, S3, and so on. Since we want to load data from our local machine, we select the File option, browse for the data file locally, and click Upload.
7. Once the upload is complete, we have an option to preview the data/table. On clicking Preview Table, we provide a table name, select the type of the file, and then click the Create Table button.
8. Now that our table is created, let us access this data in our notebook. We open the notebook we created and, in a cell, run the following script to create a DataFrame from the created table:
our_dataframe = sqlContext.sql("SELECT * FROM created_tablename")
We can then see the data by typing and executing:
our_dataframe.show()


And yes, we are all through with our installation methods. We can choose either of them; both will work fine on our Spark journey. The first one is a bit lengthy, but we have everything locally, the data or cluster size is not limited, and the method is very useful as we learn to set up a working environment from scratch. The Databricks method, on the other hand, is super easy and quick to set up; we need not worry about installing various libraries since they are already present, and it can be accessed from anywhere as it provides a web interface, but since we use the community edition the data and cluster size are limited. Now that we have the working environment ready, in the next chapter let us refine our Python knowledge by going through the Python essentials that will help us accelerate our Spark learning.


CHAPTER 3
Python Essentials for Spark

As we will be learning Spark using Python, we shall quickly refresh our Python knowledge and learn some Python essentials for Spark. This chapter is more like a crash course in Python. It will help boost our pace of learning Spark, and we will be able to grasp the upcoming topics very easily. For this chapter on learning Python we can use the Jupyter Notebook or any other IDE already installed on our computer, but since we have already installed the Jupyter Notebook on our Ubuntu virtual machine in the previous chapter, we can use that.

3.1 Python Basics

In this section we shall quickly go through some basic concepts of Python which will come in handy when dealing with the various scripts along our learning journey.

3.1.1 Numbers, Strings and Casting in Python

In Python there are mainly three numeric data types: 'int', 'float', and 'complex'. The 'int' data type is used to store integer values, 'float' is used to store decimal values, and 'complex' is used to store complex values. We mainly use the 'int' and 'float' data types in our scripts. All the numeric operations like addition, multiplication and so on behave as usual. Python also performs implicit conversions; that is, when adding an int to a float it converts the values to float implicitly.

When writing complex scripts and dealing with multiple objects or variables, at some point we will need to know the type of a variable in order to perform certain actions. For this we can make use of the type function.

- type() – this function gives us the type of the variable, whether it is an integer, float, and so on.

Let us quickly check out a script where we generate a random integer, check its data type, perform an addition with a decimal number, and check the type of the result as well.


#Code Snippet 1
import random #importing the random library to generate a random number
x = random.randrange(1,10) #generating a random integer between 1 and 9 (10 is excluded)
print(x) #printing the generated value
print(type(x)) #printing the type of the generated value
y = x + 2.5 #adding the generated value and a decimal value
print(y) #printing the sum value
print(type(y)) #printing the type of the sum value
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%201.ipynb

5
<class 'int'>
7.5
<class 'float'>

Now, let us quickly go through the string data type, its methods, and its uses. Strings in Python are surrounded by either single or double quotation marks. Strings in Python behave like arrays, and their characters can be accessed using square brackets by passing the index of the element to be accessed. Since strings are array-like, we can easily perform slicing of strings by providing a range in the form of a start index and an end index, separated by a colon, to return a substring or part of the string. A few important methods that are performed on strings are:

- len() – returns the length of the particular string.
- split() – splits the string into substrings wherever it finds instances of the separator or delimiter passed as a parameter to the method.
- replace() – replaces a substring with another string, where the substring to be replaced and its replacement are given as parameters to the method.

Now let us see an example script which uses these methods so that we can understand how we can use them.

#Code Snippet 2
text = "Apache Spark is Awesome"
sliced_text = text[7:12]
print(sliced_text)
length_text = len(text)
print(length_text)
replaced_text = text.replace("Awesome","Fast and Easy")
print(replaced_text)
split_text = text.split("is")
print(split_text)
string_present = "Spark" in text
print(string_present)
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%202.ipynb


Spark
23
Apache Spark is Fast and Easy
['Apache Spark ', ' Awesome']
True

The output of the split method gives us a list, which we will learn about in the next section, but the values of the list can be accessed by their indexes and assigned to a variable. Also, the "in" operator is very useful to check whether a substring is present within a string; it gives us either True or False depending on the substring's presence.

Finally, in the context of strings, we will go through an important function or method called the format method. The format method takes the given arguments, formats them, and places them in the string where the placeholders {} are specified. The code snippet below will help us understand string formatting much better.

#Code Snippet 3
text = "Apache {} is {}"
formatted_text = text.format("Spark","Super")
print(formatted_text)
new_text = "{y} is {x}"
new_formatted_text = new_text.format(x="Fun", y="Learning")
print(new_formatted_text)
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%203.ipynb

Apache Spark is Super
Learning is Fun

We can notice that in the first use we just used the placeholders without any specific order, while in the second use we gave the placeholders names and passed the arguments by those names. We can also use numbers inside the placeholders, starting from index 0, when the format arguments are passed positionally.
Example – text = "Apache {0} is {1}".format(x,y)

It can be noted that we can explicitly convert one data type to another by enclosing the variable or the value within the data type name itself. An example code snippet is shown below.

#Code Snippet 4
float_to_integer = int(2.8) #converting the float value 2.8 to an integer
print(type(float_to_integer)) #checking the updated type
string_to_float = float("4.2") #converting the string value "4.2" to a float
print(type(string_to_float)) #checking the updated type
integer_to_string = str(5) #converting the integer value 5 to a string
print(type(integer_to_string)) #checking the updated type
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%204.ipynb

<class 'int'>
<class 'float'>
<class 'str'>

Now we have a good idea of how to deal with numbers, strings, and variables, and also how to use their various methods. Next we shall learn about lists in Python.

3.1.2 Lists and Dictionary in Python

Lists in Python are more or less like arrays in other programming languages; a list is an ordered and changeable collection of various objects. Lists are created or written with square brackets.
Example: myfirstlist = ["Spark","is number",1]
The items in a list can be accessed by referring to their index numbers, which start from 0. Lists also support negative indexing, so we can easily access items at the end of the list.
Example:
print(myfirstlist[0]) #Output is Spark
print(myfirstlist[-1]) #Output is 1
The values of the items within a list can be easily modified using an assignment operation: we refer to the index number and provide the new value. Lists provide various methods which are very useful. A few of these methods are mentioned below:

- len() – determines how many items the list contains.
- append() – adds an item to the end of the list.
- insert() – adds an item at a specified index.
- remove() – removes a specified item.
- pop() – removes an item by specifying the item's index.

In order to understand the use of the above methods and to familiarize ourselves with lists, we can go through the code snippet below.

#Code Snippet 5
ourFirstList = [1,5.5,"Apache",["Spark",7]]
print(ourFirstList)
ourFirstList[1] = ourFirstList[1]+2
print(ourFirstList)
ourFirstList.append("Appended To Our List")
print(ourFirstList)
ourFirstList.insert(1,"I like to skip the queue")
print(ourFirstList)
print(len(ourFirstList)) #printing the length of the current list
ourFirstList.remove(1) #this removes the item 1 and not the item at index 1
print(ourFirstList)
print(len(ourFirstList)) #printing the length of the current list
ourFirstList.pop(1) #this removes the item at index 1, which is 7.5
print(ourFirstList)
print(ourFirstList[2][0]) #accessing the list value which is within our list
for x in ourFirstList:
    print(x) #traversing through our list
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%205.ipynb

The output of the above code snippet is:
[1, 5.5, 'Apache', ['Spark', 7]]
[1, 7.5, 'Apache', ['Spark', 7]]
[1, 7.5, 'Apache', ['Spark', 7], 'Appended To Our List']
[1, 'I like to skip the queue', 7.5, 'Apache', ['Spark', 7], 'Appended To Our List']
6
['I like to skip the queue', 7.5, 'Apache', ['Spark', 7], 'Appended To Our List']
5
['I like to skip the queue', 'Apache', ['Spark', 7], 'Appended To Our List']
Spark
I like to skip the queue
Apache
['Spark', 7]
Appended To Our List

This will surely give us a good understanding and a strong grip on lists in Python. Next we shall see another quite famous, widely used, and very powerful data type: the Python dictionary.

A dictionary is a collection of key and value pairs which is unordered, changeable, and indexed by keys. Dictionaries are generally represented using curly brackets.
Syntax – dictionary_name = { key:value }
Example: myFirstDictionary = { "Google":"Chrome", "Mozilla":"Firefox" }
We can easily access the elements of a dictionary by referring to the key name inside square brackets, or by using the get() method, which gives us the same result.
Example:
value_from_reference = myFirstDictionary["Google"]
value_from_get = myFirstDictionary.get("Google")
Both will fetch us the value, Chrome in this case. We can also change or modify any value by referring to its key name, as shown below.


Example: myFirstDictionary["Mozilla"] = "Firefox Beta"
Adding elements to the dictionary can easily be done by using a new index key and assigning a value to it.
Example: myFirstDictionary["Microsoft"] = "Explorer"
Our updated dictionary would be – { "Google":"Chrome", "Mozilla":"Firefox Beta", "Microsoft":"Explorer" }
Similar to lists, we can use the pop() method on a dictionary to remove elements; it removes the item with the specified key name, where we pass the key as the parameter to the pop() method.
Example – myFirstDictionary.pop("Mozilla")
A few frequently used methods with dictionaries are:

- get() – returns the value of the specified key.
- pop() – removes the element with the specified key.
- values() – returns a list of all the values in the dictionary.
- items() – returns a list containing a tuple for each key-value pair.

We can go through the code snippet below, which highlights some methods and features of the Python dictionary.

#Code Snippet 6
myFirstDictionary = {"Google":"Chrome", "Mozilla":"Firefox"}
print(myFirstDictionary)
value_from_reference = myFirstDictionary["Google"]
print(value_from_reference)
value_from_get = myFirstDictionary.get("Google")
print(value_from_get)
myFirstDictionary["Mozilla"] = "Firefox Beta"
print(myFirstDictionary)
myFirstDictionary["Microsoft"] = "Explorer"
print(myFirstDictionary)
myFirstDictionary.pop("Mozilla")
print(myFirstDictionary)
for x in myFirstDictionary.values(): #iterating dictionary using its values
    print(x)
for x, y in myFirstDictionary.items(): #iterating dictionary using keys & values
    print(x, y)
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%206.ipynb

{'Google': 'Chrome', 'Mozilla': 'Firefox'}
Chrome
Chrome
{'Google': 'Chrome', 'Mozilla': 'Firefox Beta'}
{'Google': 'Chrome', 'Mozilla': 'Firefox Beta', 'Microsoft': 'Explorer'}
{'Google': 'Chrome', 'Microsoft': 'Explorer'}
Chrome
Explorer
Google Chrome
Microsoft Explorer

3.1.3 Tuples and Sets in Python

A tuple is an ordered and unchangeable collection of elements or items. Tuples are written in round braces or parentheses, with the elements separated by commas. Tuple elements can be accessed in a similar way to list elements; that is, we can access tuple items by referring to the index number inside square brackets.
Example:
myFirstTuple = ( "Apache" , "Spark", "Hadoop", "ML")
print(myFirstTuple[1])
This results in the output Spark. Since tuples are unchangeable, or immutable, we cannot directly add or remove elements from them. So in order to add or remove elements from a tuple, we need to explicitly convert it into a list, make the required changes, and then convert it back to a tuple. One example of this sort of conversion is shown in the code snippet below. There are also a few built-in methods that we can use on tuples:

- index() – searches the tuple for the specified value and returns the position in the tuple where it was found.
- count() – returns the number of times a specified value occurs in the tuple.
- len() – returns the length of the tuple, i.e. the number of items in the tuple.

We shall go through the code snippet below to understand how to work with tuples better.

#Code Snippet 7
myFirstTuple = ("Apache","Spark","Hadoop","ML")
print(myFirstTuple)
print(myFirstTuple[1])
#Adding a value to the tuple using a list
temporary_list = list(myFirstTuple)
temporary_list.append("Spark")
modified_tuple = tuple(temporary_list)
print(modified_tuple)
print(len(modified_tuple)) #returns the number of elements in the tuple
print(modified_tuple.index("ML")) #returns the index of ML in the tuple
print(modified_tuple.count("Spark")) #returns the number of times the element Spark is present in the tuple
Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%207.ipynb

('Apache', 'Spark', 'Hadoop', 'ML')
Spark
('Apache', 'Spark', 'Hadoop', 'ML', 'Spark')
5
3
2

With this knowledge we will be able to understand and easily work with tuples. Next we shall quickly see how to work with sets in Python.

A set in Python is a collection of elements which is unordered and unindexed. Sets are represented and written in curly or flower brackets.
Example: myFirstSet = {1, 3, 5, 7}
We cannot access the elements of a set directly via an index, as a set is unordered and thus has no definite index. But we can check whether an element is present in a set by traversing the set using a 'for' loop, or by checking membership with the "in" operator; this is shown in the code snippet below. Also, we cannot change or modify the existing items of a set once it is created, but new values can be added to an existing set. To add items or elements to a set, we can use the add() method or the update() method: the add() method allows us to add only one element to the set, and the update() method allows us to add multiple items. A few important methods used in the context of sets are:

- add() – adds a single element to the set.
- update() – adds multiple elements to the set.
- len() – returns the number of elements in the set.
- discard() – removes the specified element from the set.

The below code snippet demonstrates the uses of these methods on sets#Python Snippet 8 myFirstSet = {1,3,5,7} print(myFirstSet) myFirstSet.add(9) #adding a single item 29

print(myFirstSet)
myFirstSet.update([11,13,15]) #adding multiple items
print(myFirstSet)
myFirstSet.discard(15) #discarding a specific item
print(myFirstSet)
print(len(myFirstSet)) #returning the length of the set

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%208.ipynb

{1, 3, 5, 7}
{1, 3, 5, 7, 9}
{1, 3, 5, 7, 9, 11, 13, 15}
{1, 3, 5, 7, 9, 11, 13}
7

This gives us a better understanding and the basic usage of sets.

3.2 Controlling the flow of execution in Python
In this section we will understand how to redirect the flow of control within a Python script using the various inbuilt flow control statements in Python, such as the "if else", "while" and "for" statements. Based on some condition(s) the flow is altered in various ways. We will also go through the various operators which are available in Python so as to perform complex operations on values and variables.
3.2.1 Operators in Python
Python offers us a variety of operators which are very useful for performing operations on variables and come in handy when writing complex logic. A few of the important operators in Python are

Arithmetic Operators – Arithmetic operators are basically used to perform mathematical operations on the numeric values of variables. The result is calculated based on the operator which is used. A few of the arithmetic operators are –
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulus (used to get the remainder)
** Exponent (used to raise a value to a power)
// Floor Division (used to get the result rounded down to the nearest whole number)
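As a quick illustration of these operators (a small sketch outside the book's numbered snippets, with the values x = 7 and y = 5 assumed purely for demonstration) –
x = 7
y = 5
print(x + y)   #12, addition
print(x - y)   #2, subtraction
print(x * y)   #35, multiplication
print(x / y)   #1.4, division
print(x % y)   #2, modulus, the remainder of 7 divided by 5
print(x ** y)  #16807, exponent, 7 raised to the power 5
print(x // y)  #1, floor division, 7 divided by 5 rounded down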




Assignment Operators – Assignment operators are used to assign values to variables. The basic assignment operator is "=", which assigns a value to a variable; for example, to assign the value 7 to x we write x = 7.
The assignment operator "=" can be combined with other operators like the arithmetic operators and the bitwise operators. That is, if an operation is performed on a variable and the result is assigned back to the same variable, we can use a shorthand version of the assignment operator. For example, instead of x = x + 3 we can simply write x += 3, which is equivalent. A few examples, where x = 7 initially –
*=   x = x * 5  or  x *= 5   (result = 35)
%=   x = x % 5  or  x %= 5   (result = 2)
/=   x = x / 5  or  x /= 5   (result = 1.4)
//=  x = x // 5 or  x //= 5  (result = 1)
**=  x = x ** 5 or  x **= 5  (result = 16807)
&=   x = x & 5  or  x &= 5   (result = 5)
|=   x = x | 5  or  x |= 5   (result = 7)
^=   x = x ^ 5  or  x ^= 5   (result = 2)
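As a quick illustration of the shorthand assignment operators (a small sketch outside the numbered snippets, with x reset to 7 before each operation so the results match the list above) –
x = 7
x += 3
print(x)   #10

x = 7
x *= 5
print(x)   #35

x = 7
x //= 5
print(x)   #1

x = 7
x &= 5
print(x)   #5, bitwise AND followed by assignment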



Comparison or Relational Operators – Comparison operators, as the name suggests, are used to compare two values, and the comparison results in a True or False value depending on the type of comparison operator which we have used. The various comparison operators are
==  Equal (x == y)  Checks if x and y are equal
!=  Not Equal (x != y)  Checks if x and y are not equal
>   Greater than (x > y)  Checks if x is greater than y
<   Less than (x < y)  Checks if x is less than y
>=  Greater than or equal to (x >= y)  Checks if x is greater than or equal to y
<=  Less than or equal to (x <= y)  Checks if x is less than or equal to y
Logical Operators – Logical operators are used to combine conditional statements.
and – It returns True only if both the statements or expressions are true. For example, (5>2) and (7==7) results in True because (5>2) as well as (7==7) are both True.
or – It returns True if at least one of the statements or expressions is true. For example, (10>100) or (2>1) results in True, because at least one statement, that is (2>1), is True even though (10>100) results in False.
not – It returns the opposite of the computed result from the expression. That is, "not" is used to directly reverse the result. For example, not((5>2) and (7==7)); as we have seen earlier the expression (5>2) and (7==7) results in True, but on passing this to the not operator the final output is False, as the "not" operator reversed the result.
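As a quick illustration of the comparison and logical operators (a small sketch outside the numbered snippets; the values are assumed purely for demonstration) –
x = 7
y = 5
print(x == y)                      #False
print(x != y)                      #True
print(x >= y)                      #True
print((5 > 2) and (7 == 7))        #True, both expressions are True
print((10 > 100) or (2 > 1))       #True, at least one expression is True
print(not ((5 > 2) and (7 == 7)))  #False, "not" reverses the result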

Bitwise Operators – Bitwise operators are used to compare or perform operations on values in their binary format. For example, if x = 7 and y = 5, then when a bitwise operator is applied, the value of x is taken as 111 and the value of y is taken as 101, which are nothing but their binary representations. A few important bitwise operators are
& (Bitwise AND) – It computes a bit as 1 only if both bits are 1. Example: with x = 7 and y = 5, z = x & y results in 5; it takes 7 as 111 and 5 as 101, performs the bitwise AND operation, and the resulting bits are 101, which is 5.
| (Bitwise OR) – It computes a bit as 1 if at least one of the bits is 1. Example: with x = 7 and y = 5, z = x | y results in 7; 111 | 101 gives 111, which is 7.
^ (Bitwise XOR) – It computes a bit as 1 only if exactly one of the two bits is 1. Example: with x = 7 and y = 5, z = x ^ y results in 2; 111 ^ 101 gives 010, which is 2.
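As a quick illustration of the bitwise operators (a small sketch outside the numbered snippets, using the same assumed values x = 7, which is 111 in binary, and y = 5, which is 101 in binary) –
x = 7
y = 5
print(x & y)  #5, since 111 & 101 = 101
print(x | y)  #7, since 111 | 101 = 111
print(x ^ y)  #2, since 111 ^ 101 = 010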



Membership Operators – The membership operators check for membership in a sequence, such as a string, a list, or a tuple.


They basically result in True if the specified member is found in the sequence, else they return False. The types of membership operators are
in – It returns True if it finds the specified value within the sequence.
not in – It returns True if it does not find the specified value within the sequence.
For example, consider a list prime_numbers = [1,3,5,7] and a variable x = 7; we can use the membership operator "in" to check if 7 is present in the list or not, as shown below –
z = x in prime_numbers
This results in the value of z being True, as the value of x is present in the list prime_numbers.
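As a quick illustration of the membership operators (a small sketch outside the numbered snippets, reusing the list from the explanation above) –
prime_numbers = [1, 3, 5, 7]
x = 7
print(x in prime_numbers)        #True, 7 is present in the list
print(4 not in prime_numbers)    #True, 4 is not present in the list
print("Spark" in "Apache Spark") #True, membership also works on strings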

3.2.2 if, if else and elif
The if statements are generally used to check a condition built with the comparison and logical operators, and the execution flow switches depending on whether the condition evaluates to True or False. Generally the if statements contain some logic or instructions to execute based on the True or False result of a condition, so they basically determine whether to execute a set of instructions or not depending on the condition itself. There are a few varieties of the if statement available and they are







if (simple if) – It is used to execute a set of instructions on a single condition check; if the condition is satisfied it executes the instructions, or else it goes past the "if" statement block and executes the next statement.
if else – It is used to execute a set of instructions on a single condition check; if the condition results to be True it executes the set of statements written in the if block, and if the condition turns out to be False it executes the statements written in the else block. It is basically an "if not this, then that" type of behavior.
elif – It is used to execute a set of instructions on multiple condition checks; multiple conditions can be specified using "elif" and the flow goes on until a condition is satisfied. Once a condition is satisfied, it executes the set of instructions within that elif block and the control returns from the entire elif ladder. If no condition is satisfied then the default else block is executed.
Nested if – We can have one or more if statements within another if statement; this type of statement is called a nested if statement.


#Code Snippet 9
x = 50
y = 70

if x < y: #Simple if
    print("Simple if Executed")

if x > y: #if else
    print("if block in if-else Not Executed")
else:
    print("else block in if-else Executed")

if x > y: #elif ladder
    print("if block in elif Not Executed")
elif x > 20 and y == 70:
    print("first elif Executed")
elif x == 50 and y > 50:
    print("second elif is Not reached")
else:
    print("if no elif condition true then Executed")

if y > x: #nested if
    print("I am in the first if")
    if x != 70 and (y+10) == 80:
        print("I am in the second if")
        if (x+y) == 120:
            print("Am I nested?, Oh Yes I am!")
        else:
            print("Nested if-else")
    else:
        print("Where am I?")
else:
    print("I am not here")

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%209.ipynb

Simple if Executed
else block in if-else Executed
first elif Executed
I am in the first if
I am in the second if
Am I nested?, Oh Yes I am!

3.2.3 Looping using While
While loops are used to execute a set of statements as long as the specified condition is True. When we need to execute the same set of statements again and again, and provided we have a definite stopping condition, we can without doubt go with the while loop. We can also alter the flow of the while loop using statements like break, continue, and else. The break statement is used to abruptly stop the iteration, move the control out of the loop, and execute the next line of statements.


The continue statement is used to skip the current iteration, but the control stays within the loop until the loop condition is no longer met. The else statement is used to execute a set of statements once the loop condition is no longer true.
#Code Snippet 10
print("--Use of Break in While-")
value = 1
while value < 5:
    print(value)
    if value == 2:
        break
    value += 1

print("--Use of Continue and Else in While-")
i = 0
while i < 6:
    i += 1
    if i in [1,2,5,6]:
        continue
    print(i)
else:
    print("While loop successfully executed")

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2010.ipynb

--Use of Break in While-
1
2
--Use of Continue and Else in While-
3
4
While loop successfully executed

3.2.4 Looping with For Loops
For loops are generally used for iterating over a sequence like a list, a dictionary, a string and so on. Using the "for" loop we can write statements or instructions which are executed for each and every element within that sequence. Supposing we don't have a sequence ready, we can quickly generate one and perform some "for" loop operation using the range() function. The range() function returns a sequence of numbers, by default starting from 0, ending just before the specified number (the stop value), and incrementing by 1. A note to keep in mind is that the specified ending number (stop value) itself is not included in the sequence. The range() function can also be made to start from a specified number (start) and end at a specified number (stop), and we can also increment the elements in the sequence by a number by giving that number as the third parameter.

Syntax – range(start, stop, increment_value)
Example:
range(3) results in – [0, 1, 2]
range(2,7,2) results in – [2, 4, 6]

As we have seen the use of the break, continue and else statements in the while loop section, we can use them in the context of the "for" loop as well. Another statement which comes in handy is the pass statement. The pass statement actually does nothing, but it can be used as a placeholder for smooth transitions and also helps us increase the readability of our script. One major use of the pass statement is in the context of "for" loops: the "for" loop's body cannot be left empty, and if it is, an error is thrown, so by using the pass statement we can avoid such errors. As a note, the pass statement can also be used with the "if" and "while" constructs to address the same issue.

#Code Snippet 11
out = []
for me in range(5,15,2): #using the range function
    out.append(me)
    print("---")
    for values in out: #nesting the for loop
        pass #Will do nothing
    else:
        print(out) #Will execute always after the for loop

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2011.ipynb

---
[5]
---
[5, 7]
---
[5, 7, 9]
---
[5, 7, 9, 11]
---
[5, 7, 9, 11, 13]

Now we are familiar with controlling and deciding the flow of execution using various constructs and statements. Next we shall see how to write user-defined functions in Python so that we can reuse our code, increase code readability, and bring modularity into our programs; this also makes our code easier to debug and boosts our code-writing skills and efficiency. In the next section we will also see how to use lambda functions, iterators, and some list comprehensions.


3.3 Functions, Map, Filter & Reduce and List Comprehensions in Python
In this section we will see the most important and very commonly used user-defined functions, and also go through how to use the map, filter and reduce operations and list comprehensions in our scripts.
3.3.1 User-defined Functions
A function is like a block of code that performs a certain operation or action. We can call it whenever it is required, and it can be called any number of times depending on our needs. This makes it reusable, so we need not write the same set of instructions again and again, which would make the code lengthy, difficult to debug, and less modular. We can also pass parameters, arguments, or data to functions. By passing parameters and making functions return values, we can make functions more dynamic and increase their efficiency.
We can create functions using the "def" keyword, that is, the "def" keyword followed by the function name, its parameters, and the function body.
Example:
def myFirstFunction():
    print("First Function")
The functions which we have written can be called very easily by just writing the function name, like myFirstFunction(), in the script. We can also pass arguments to our functions; arguments are specified after the function name, inside the parentheses. We can add as many arguments as we want by just separating them with commas. By default a function must be called with the correct number of arguments, that is, if we have specified 2 parameters in the function definition, we will have to pass 2 arguments when we call the function.
Supposing we don't know the number of arguments that will be passed into our function, we can use * or ** before the parameter name in the function definition. By prefixing a parameter with * or **, the function receives a tuple of positional arguments or a dictionary of keyword arguments respectively and can access the items accordingly. There is also the possibility to use a default parameter, that is, when we call the function without an argument, rather than giving an error it takes the default parameter value which we had provided in the function definition. The below code snippet will help us understand the working of functions much better.
#Code Snippet 12
#Simple function with default parameter
def myFirstFunction(name="Master"):
    '''Doc Strings can be passed using the triple quotes'''
    print("Welcome "+name)

myFirstFunction("Thanos")
myFirstFunction()

#Functions with an unknown number of arguments using * and **
def unknownParameters_1(*args): #args arrives as a tuple of positional arguments
    print("First Car {} and Last Car {}".format(args[0], args[-1])) #negative indexing to fetch the last element

unknownParameters_1("Pajero","Cayenne","Aventador")

def unknownParameters_2(**args): #args arrives as a dictionary of keyword arguments
    print("Manufacturer Name is {} and Car Name is {}".format(args["manufacturer"], args["car"]))

unknownParameters_2(manufacturer="Porsche", car="Cayenne", seating_capacity=5)

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2012.ipynb

Welcome Thanos
Welcome Master
First Car Pajero and Last Car Aventador
Manufacturer Name is Porsche and Car Name is Cayenne

3.3.2 Lambda Functions
A lambda function is like a normal function but without a name; it is more like an anonymous function. It can take any number of arguments but can only have a single expression.
Syntax of a lambda function – lambda arguments: expression
Lambda functions are very efficient, and their selling point is that they can easily be used inside another function, so it is basically like having an anonymous function within a function. We can go through the below code snippet to understand the use of lambda functions much better.
#Code Snippet 13
#Simple Lambda Function
multiply = lambda a, b : a * b
print(multiply(5, 7))

#Lambda Function within a normal function
def myAdditionFunctionWithLambda(add):
    return lambda val: val + add

square_result = myAdditionFunctionWithLambda(2)
print(square_result(5))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2013.ipynb

35
7


3.3.3 map(), filter() and reduce()
map() function – The map function is used to apply a function to every element of a sequence. That is, it takes a function as its first argument and a sequence such as a list as its second argument, and applies the function to each and every element of the sequence. We can also make use of lambda functions with map().
Syntax – resulting_map = map(function, sequence)
The map() function can also be applied to more than one list or tuple at the same time; in order to do that, the lists or tuples should have the same number of elements. The below code snippet shows the use of the map() function.
#Code Snippet 14
#converting Celsius to Fahrenheit
Celsius = [40, 33.2, 35.3, 41.8, 30.1]
Fahrenheit = map(lambda x: (float(9)/5)*x + 32, Celsius)
print(list(Fahrenheit))

#mapping multiple lists
list_1 = [5,10,15,20]
list_2 = [1,3,5,7]
list_3 = [2,4,6,8]
print(list(map(lambda x,y: x*y, list_1, list_2)))
print(list(map(lambda x,y,z: (x+z)*y, list_1, list_2, list_3)))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2014.ipynb

[104.0, 91.76, 95.53999999999999, 107.24, 86.18]
[5, 30, 75, 140]
[7, 42, 105, 196]

filter() function – The filter function, as the name suggests, is used to filter out the elements of a list for which the function returns True.
Syntax – filter(function, list)
The call filter(F, L) needs a function F as its first argument. The function F must return a Boolean value, that is, either True or False. This function is applied to each and every element of the list L, and only if F returns True is the element included in the result list. We can go through the below code to understand the use of filter() –
#Code Snippet 15
numbers = [0,0,1,2,3,5,6,9,11,23,34,45,50,77]

#obtaining the odd numbers from the numbers list
odd_numbers = filter(lambda x: x % 2, numbers)
print(list(odd_numbers))

#obtaining the even numbers from the numbers list
even_numbers = filter(lambda x: x % 2 == 0, numbers)
print(list(even_numbers))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2015.ipynb

[1, 3, 5, 9, 11, 23, 45, 77]
[0, 0, 2, 6, 34, 50]

reduce() function – The reduce() function is used to repeatedly apply a function to the elements of a list.
Syntax – reduce(function, list)
The reduce() function initially takes the first two elements of the list and applies the function to them; the result is then combined with the third element of the list by applying the function again, and this goes on until the last element of the list has been processed. Supposing we have a list L = [L1, L2, L3, ..., Ln], reduce(F, L) works in this manner –
o At first the first two elements of list L are applied to the function F, that is F(L1, L2). At this stage the list on which reduce() works can be thought of like this: [ F(L1, L2), L3, ..., Ln ]
o In the next step F is applied on the previous result and the third element of the list, that is F(F(L1, L2), L3). Now the list can be thought of like this: [ F(F(L1, L2), L3), ..., Ln ]
o Similarly it continues this process until all the elements in the list have been addressed, and it returns the final value of the function F as the result of reduce().

We can go through the code below and understand the working and use of the reduce() function.
#Code Snippet 16
from functools import reduce #importing the reduce function from the functools module

counting = reduce(lambda x, y: x+y, range(1,50))
print("Sum of numbers 1 to 49 is {}".format(counting)) #range(1,50) covers 1 to 49

multiplying = reduce(lambda x,y: x*y, [1,5,10])
print("Multiplied list result is {}".format(multiplying))

largest_in_list = lambda a,b: a if (a > b) else b
largest = reduce(largest_in_list, [100,101,155,122,210])
print("Largest element of list is {}".format(largest))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2016.ipynb

Sum of numbers 1 to 49 is 1225
Multiplied list result is 50
Largest element of list is 210


3.3.4 List Comprehensions
List comprehensions help us to create lists with ease. A list comprehension consists of brackets containing an expression followed by a "for" clause, then zero or more "for" or "if" clauses. The expression can be anything, meaning we can put any kind of object in the resulting list. The result is a new list produced by evaluating the expression in the context of the "for" and "if" clauses that follow it.
Syntax – [expression for element in list if condition]
The above syntax can be realized without a list comprehension as –
for element in list:
    if condition:
        expression
As we can see, the above form is quite lengthy, and we would have to define a list and then explicitly append the resulting elements to it. Whereas by using a list comprehension we can directly assign the one-liner comprehension to a variable and it implicitly becomes the resulting list, as list comprehensions return a list by default. Now let us see how to use list comprehensions in our scripts by going through the below code snippet.
#Code Snippet 17
#Get the squares of elements in a list
#using the typical method
squares = []
for sq in range(5):
    squares.append(sq**2)
print(squares)

#using the list comprehension
print("Squares using List Comprehensions - {}".format([sq**2 for sq in range(5)]))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2017.ipynb

[0, 1, 4, 9, 16]
Squares using List Comprehensions - [0, 1, 4, 9, 16]

And finally we are done with our Python essentials learning. We will now be able to easily understand and write quite complex scripts using Python. This knowledge of Python is sufficient to understand and write Machine Learning code snippets as well. We have gone through most of the important and commonly used objects and methods. Next, we shall deep dive into the world of Spark. In the next chapter we shall go through the basics of Spark and perform some hands-on work with Spark.


CHAPTER 4
Spark-ling Journey Begins
At last, we are here; we have done all the prerequisites for the kick-off of our Spark learning journey. We have seen how to set up our working environment to facilitate Spark and have also brushed up on our Python skills in order to grasp the scripts quickly and understand the Spark topics with ease. In this chapter we will familiarize ourselves with various Spark concepts and features and mainly focus on the working of Spark DataFrames. We will work extensively with Spark DataFrames in order to understand how to handle, organize, and process data, and we shall also see the various features and operations that are available to us when using Spark DataFrames.

4.1 Basics of Spark DataFrames
Spark DataFrames are basically used to store data. A DataFrame holds the data in rows and columns; each column represents a feature or a variable and each row represents an individual data point. Spark DataFrames are able to deal with various sources of data, which means they can input and output data from a wide variety of sources like csv, json and so on. We can also perform various transformations on the data and collect the results to visualize, record, or process further. In order to begin working with Spark DataFrames, we have to create a Spark session. A Spark session is like a unified entry point of a Spark application; it provides a way to interact with Spark's various functionalities. So let's see how we can create one.

#Importing the Spark Session
from pyspark.sql import SparkSession

#Creating the Spark Session
spark = SparkSession.builder.appName('myFirstSparkSession').getOrCreate()

In this manner we can create a Spark session and use the session variable 'spark' within our scripts.


Next let us work with some real data. In order to do that we first need to read a dataset, and for this we can use the read method of the Spark session. We can also select the type of dataset file we need to load; for this, the read method has various options available like csv, json and so on. We can load a dataset file which is present on our computer as –
dataFrame_name = spark_session_variable.read.csv('filename')
Example: We want to load a csv file called employee.csv and our session variable is spark. Then, in order to import this data file, which is in the same directory as our Jupyter notebook or IDE, we write –
df = spark.read.csv('employee.csv')
Additional parameters such as header and inferSchema can be passed to the read.csv() method. The header parameter takes either True or False as its value; on giving True it considers the first row of the dataset as the column titles or header, and if given False it does not consider it as the header but treats it like the rest of the data. Similarly the inferSchema parameter also takes either True or False; if given True it identifies and assigns the correct data types to the DataFrame columns based on the dataset's column values, and if not it generally considers all the DataFrame columns to be of the string data type.
Example: df = spark.read.csv('employee.csv', header=True, inferSchema=True)
If the file to be imported is in some directory other than the one where we started our Jupyter notebook or IDE, then we have to give the complete filename, that is, the absolute path of the file.
We can also see the contents of the created DataFrame and how the data looks by using the show() method on the DataFrame name. Example: df.show()
Supposing we want to know what features or columns are present in our dataset, we can easily get the list of columns by executing df.columns, and we can also use df.printSchema() to get the columns and their data types, where 'df' is the DataFrame which we have created. We can go through the code snippet below to understand the same; its output will help us relate to the above-mentioned methods.
#Code Snippet 18
from pyspark.sql import SparkSession #importing SparkSession
spark = SparkSession.builder.appName("MyFirstSparkSession").getOrCreate() #creating a Spark Session


df = spark.read.csv("employee.csv",header=True) #reading the data without the inferSchema parameter
df.show() #displaying the dataframe data
df.printSchema() #printing the schema without inferSchema

df = spark.read.csv("employee.csv",header=True,inferSchema=True) #reading the data with the inferSchema parameter
df.printSchema() #printing the schema with inferSchema
df.columns #getting the list of columns

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2018.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/employee.csv

+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age| location|hours|
+-----------+-------------+---+----------+-----+
| G001| Pichai| 47|California| 14|
| M002| Bill| 64|Washington| 10|
| A003| Jeff| 56|Washington| 11|
| A004| Cook| 59|California| 12|
+-----------+-------------+---+----------+-----+

root
 |-- employee_id: string (nullable = true)
 |-- employee_name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- location: string (nullable = true)
 |-- hours: string (nullable = true)

root
 |-- employee_id: string (nullable = true)
 |-- employee_name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- hours: integer (nullable = true)

['employee_id', 'employee_name', 'age', 'location', 'hours']

4.2 Working with Rows and Columns in a Spark DataFrame
In this section we will familiarize ourselves with the rows and columns of a Spark DataFrame. We will get at the row and column data and perform some operations on them. Let us consider the DataFrame that we used earlier, which contains the employee data.
4.2.1 Working with Columns in Spark
Getting the columns and their contents

Using the DataFrame to get the columns directly
In order to get a column via the DataFrame, we can directly pass the column name to the DataFrame, that is, df['column_name'], but on accessing a column like this we get the column object itself. Example: df['age']
Using the select method of the DataFrame
Instead of directly passing the column name to the DataFrame, we can use the select method, by which we get the required column as a DataFrame by itself. We also get a lot of flexibility using 'select' as it returns a DataFrame.
Syntax: df.select('column_name')
Example: df.select('age').show()
We can also select multiple columns by passing the columns needed in the form of a list to the select function as its parameter.
Syntax – df.select([column1, column2, column3])
Example – df.select(['employee_name', 'age', 'location']).show()
Adding new columns
We can add new columns using the withColumn() method. This function returns a new DataFrame by adding a new column, or by replacing an existing column if it has the same name as the new column which we specify.
Syntax – df.withColumn('new_column_name', column_itself)
So in order to create a new column with withColumn(), we pass the new column name as the first parameter and the column itself as the second parameter.
Example: df_new = df.withColumn('overtime_time', df['hours'])
We are also able to rename our columns using the withColumnRenamed() method. Supposing in our previous example we need to change the column name hours to something like working_hours, we can do that as shown below.
Example: df_renamed = df.withColumnRenamed('hours', 'working_hours')
This is achieved by just passing the old column name and the new preferred column name as the parameters to the withColumnRenamed() method.
Dropping a column
We can easily drop a column from our DataFrame by using the drop() function.
Syntax: df_new = df.drop('column_name')


4.2.2 Working with Rows in Spark
Getting the rows and their contents
We can get the row data from the DataFrame using various methods. One quick way is using the head() function. We can use the head() function to get the topmost records or rows as a list. The head() function takes one parameter, which determines how many records from the top should be selected or fetched.
Syntax – df.head(number_of_records_required)
Example – df.head(3)
In this case the first three, or the top three, records are selected and returned in the form of a list.

Getting to know the statistical details of the dataset
If we have some numerical features in our dataset we can quickly get a statistical report or summary of them, and we can easily get this by using the describe method, that is, df.describe().show().
#Code Snippet 19
from pyspark.sql import SparkSession #importing SparkSession
spark = SparkSession.builder.appName("Rows&Columns").getOrCreate() #creating a Spark Session

df = spark.read.csv("employee.csv",header=True,inferSchema=True) #reading csv data
df.show() #displaying the data

df["age"] #accessing the column object age via the dataframe

print("Single Column Data")
df.select("age").show() #accessing a single column using select, returns a dataframe

print("Multiple Column Data")
df.select(["employee_name","age"]).show() #accessing multiple columns using select, returns a dataframe

df_new = df.withColumn("productive_hours",df["hours"]-3) #Adding a new column from an existing one with some operation


df_renamed = df_new.withColumnRenamed("hours","working_hours") #Renamed hours to working_hours
df_final = df_renamed.drop("location") #dropping the column location

print("Data with the new column productive_hours, hours Column Renamed to working_hours and dropped the location column")
df_final.show()

print("First three Data points\n")
print(df_final.head(3)) #Fetching the list of the first 3 values in the dataframe

print("\nSummary of Age, Working hours and Productive hours")
summary_data = df_final.select(["age","working_hours","productive_hours"])
summary_data.describe().show()

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2019.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/employee.csv

+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age| location|hours|
+-----------+-------------+---+----------+-----+
| G001| Pichai| 47|California| 14|
| M002| Bill| 64|Washington| 10|
| A003| Jeff| 56|Washington| 11|
| A004| Cook| 59|California| 12|
+-----------+-------------+---+----------+-----+

Single Column Data
+---+
|age|
+---+
| 47|
| 64|
| 56|
| 59|
+---+

Multiple Column Data
+-------------+---+
|employee_name|age|
+-------------+---+
| Pichai| 47|
| Bill| 64|
| Jeff| 56|
| Cook| 59|
+-------------+---+

Data with the new column productive_hours, hours Column Renamed to working_hours and dropped the location column
+-----------+-------------+---+-------------+----------------+
|employee_id|employee_name|age|working_hours|productive_hours|
+-----------+-------------+---+-------------+----------------+
| G001| Pichai| 47| 14| 11|
| M002| Bill| 64| 10| 7|
| A003| Jeff| 56| 11| 8|
| A004| Cook| 59| 12| 9|
+-----------+-------------+---+-------------+----------------+

First three Data points

[Row(employee_id='G001', employee_name='Pichai', age=47, working_hours=14, productive_hours=11), Row(employee_id='M002', employee_name='Bill', age=64, working_hours=10, productive_hours=7), Row(employee_id='A003', employee_name='Jeff', age=56, working_hours=11, productive_hours=8)]

Summary of Age, Working hours and Productive hours
+-------+----------------+-----------------+-----------------+
|summary| age| working_hours| productive_hours|
+-------+----------------+-----------------+-----------------+
| count| 4| 4| 4|
| mean| 56.5| 11.75| 8.75|
| stddev|7.14142842854285|1.707825127659933|1.707825127659933|
| min| 47| 10| 7|
| max| 64| 14| 11|
+-------+----------------+-----------------+-----------------+

4.3 Using SQL in Spark DataFrames
Spark DataFrames give us the option to use SQL queries to directly work and interact with the DataFrame. If we are familiar with SQL, we can use SQL queries to easily play around with the data. In order to do this we have to first register our DataFrame as a SQL temporary view, and we can do that by using the createOrReplaceTempView() method and passing the preferred view name as the parameter, as shown below.
Example: df.createOrReplaceTempView('associates')
On executing this statement we have successfully registered the DataFrame as a SQL temporary view, and we are now allowed to run SQL queries directly on the newly registered view. We can do that by using the sql() method on the created Spark session.
Syntax: variable = spark_session_variable.sql('SQL QUERY')
Example: sql_result = spark.sql('SELECT * FROM associates')
In this way we can use complex SQL queries to get data from the dataset. The best thing about this method is that, if we are already familiar with SQL queries, we can simply use Spark's SQL feature to get the required data rather than using the DataFrame operations themselves.


#Code Snippet 20
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

df = spark.read.csv("employee.csv",header=True,inferSchema=True)
df.createOrReplaceTempView("associates") #Creating a view called associates

sql_result_1 = spark.sql("SELECT * FROM associates") #Applying a SQL query on associates
print("Showing the results of the select query")
sql_result_1.show()

sql_result_2 = spark.sql("SELECT * FROM associates WHERE age BETWEEN 45 AND 60 AND location='California'")
print("Showing the results after using Where clause in select query")
sql_result_2.show()

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2020.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/employee.csv

Showing the results of the select query
+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age| location|hours|
+-----------+-------------+---+----------+-----+
| G001| Pichai| 47|California| 14|
| M002| Bill| 64|Washington| 10|
| A003| Jeff| 56|Washington| 11|
| A004| Cook| 59|California| 12|
+-----------+-------------+---+----------+-----+

Showing the results after using Where clause in select query
+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age| location|hours|
+-----------+-------------+---+----------+-----+
| G001| Pichai| 47|California| 14|
| A004| Cook| 59|California| 12|
+-----------+-------------+---+----------+-----+

4.4 Operations and Functions with Spark DataFrames
Once we have the data with us, we may not use the entire dataset as a whole; for certain requirements we would rather need to filter out some data from the entire dataset so that it is in line with our needs. In order to do this we can use various operations on our Spark DataFrame, which we are going to discuss in detail in this section.


4.4.1 Filter Function
Filter, as the name suggests, lets us filter out some data from the dataset based on some conditions. To understand the working of the filter function in the context of DataFrame operations, let us consider a dataset of items bought. This dataset contains a date column which tells us when the items were purchased, the name of the item, the item price, the quantity, the total amount, and the tax charged for that item. In order to apply filters on our data we can simply use the filter() method on our DataFrame. We can use the filter() method in various ways.

filter() method using direct SQL-type syntax
We can directly key in SQL-like conditions within the filter method as a parameter and it does the filtering of data for us.
Example: Supposing we want to get all the items whose total amount is greater than 1000, we can directly pass in this condition in the format –
df.filter("total_amount > 1000").show()



filter() method using DataFrame objects
We can perform an operation similar to the above example by using the DataFrame objects instead of the SQL-like syntax. That is, we can fetch the columns using the DataFrame and, by using the comparison operators available in Python, get the same result.
df.filter(df["total_amount"] > 1000).show()



filter() method based on multiple conditions
We can perform the filter operation based on multiple conditions as well. In order to do that, we have to separate the multiple conditions using the ampersand (&) operator, with each condition enclosed within its own set of parentheses.
Example: Supposing we want to see all the items whose item price is greater than 250 and whose tax is less than 25, we can incorporate these multiple conditions within the filter method as shown –
df.filter((df["item_price"] > 250) & (df["tax_amount"] < 25)).show()
We can also gather the filtered rows as a list of Row objects using the collect() method, after which individual fields can be accessed by name or a row converted into a dictionary with asDict(); the code snippet below puts all of this together.
#Code Snippet 21
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("items_bought.csv",header=True,inferSchema=True)
df.show(4)

print("Result after filtering total_amount greater than 1500")
df.filter("total_amount > 1500").show()

print("Result after filtering based on item_price and tax_amount")
df.filter((df["item_price"]>1000)&(df["tax_amount"]>500)).show() #filtering based on multiple column conditions

result_data = df.filter((df["total_amount"]==1924.74)).collect() #collecting the record based on a condition
resulting_date = result_data[0]["date"] #fetching data from the collected result
print("The collected data point's date value is "+resulting_date)

result_data[0].asDict() #converting the result to a dictionary

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2021.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/items_bought.csv


+----------+---------+----------+--------+----------+------------+
| date|item_name|item_price|quantity|tax_amount|total_amount|
+----------+---------+----------+--------+----------+------------+
|11-10-2018| Beer| 110.5| 2| 53.04| 163.54|
|14-02-2018| Whisky| 1250.0| 1| 300.0| 1550.0|
|23-03-2020| Whisky| 1300.5| 2| 624.24| 1924.74|
|05-10-2018| Rum| 550.0| 2| 264.0| 814.0|
+----------+---------+----------+--------+----------+------------+
only showing top 4 rows

Result after filtering total_amount greater than 1500
+----------+---------+----------+--------+----------+------------+
| date|item_name|item_price|quantity|tax_amount|total_amount|
+----------+---------+----------+--------+----------+------------+
|14-02-2018| Whisky| 1250.0| 1| 300.0| 1550.0|
|23-03-2020| Whisky| 1300.5| 2| 624.24| 1924.74|
+----------+---------+----------+--------+----------+------------+

Result after filtering based on item_price and tax_amount
+----------+---------+----------+--------+----------+------------+
| date|item_name|item_price|quantity|tax_amount|total_amount|
+----------+---------+----------+--------+----------+------------+
|23-03-2020| Whisky| 1300.5| 2| 624.24| 1924.74|
+----------+---------+----------+--------+----------+------------+

The collected data point's date value is 23-03-2020

{'date': '23-03-2020', 'item_name': 'Whisky', 'item_price': 1300.5, 'quantity': 2, 'tax_amount': 624.24, 'total_amount': 1924.74}

4.4.2 groupBy Function
groupBy allows us to group rows together based on some common value. Generally, if there is a column whose values repeat, we can apply groupBy on that column to treat each distinct value as a group and then apply our logic or requirements to those groups. Supposing we have revenue or sales data that contains columns like the company name, products, and sales figures; in this scenario we can group by the repeating company name column to get an overview of each specific company's products and sales figures.
From the above sales or revenue data, let us apply groupBy to the company name column and use the sum function to get the total sales or revenue of each individual company. For this, we can write the following statement –
df.groupBy("company_name").sum().show()


There are various other methods that we can use with the groupBy function; they are listed below, followed by a small illustrative sketch.

count() – It returns the count of rows for each group.
mean() – It returns the mean of the values for each group.
max() – It returns the maximum of the values for each group.
min() – It returns the minimum of the values for each group.
sum() – It returns the total or sum of the values for each group.
avg() – It returns the average of the values for each group.
agg() – It can be used to calculate more than one aggregate at a time; we will see more about the agg() function in the next topic on the aggregate function and also learn how to use it with the groupBy function.
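As a quick preview (a small sketch outside the numbered snippets; the application name below is only illustrative, and company_product_revenue.csv is the same dataset used in Code Snippet 22 later in this section), a few of these methods can be exercised as follows –
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("GroupByPreview").getOrCreate()

df = spark.read.csv("company_product_revenue.csv",header=True,inferSchema=True)

df.groupBy("company_name").count().show() #number of rows (products) per company
df.groupBy("company_name").avg().show()   #average of every numeric column per company
df.groupBy("company_name").min().show()   #minimum of every numeric column per company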

4.4.3 Aggregate Function
The aggregate function, as the name suggests, is used to perform aggregation operations like sum, average and so on over a set of column values, and it outputs a consolidated, single-valued result. The aggregate function can be used to aggregate across all rows of the DataFrame, and it can also be used with the groupBy function to aggregate within the rows of each group. One important point to note is that the aggregate function takes a dictionary as its parameter; the key and the value of the dictionary are the column name and the operation name (like sum, mean and so on) respectively.
Now let us use the aggregate function on the entire DataFrame and later see how it can be combined with the groupBy function. Supposing we want the total revenue or sales of all the companies in the dataset and also want to know the maximum revenue or sales among all the companies, we can get these by using the agg() function on the DataFrame as shown below –
df.agg({"revenue_sales":"sum"}).show() #Total Revenue
df.agg({"revenue_sales":"max"}).show() #Maximum Revenue
Now let us see how to use the agg() function with the groupBy() function. Supposing we want the total revenue or sales of each company present in our DataFrame, we can first group by the company_name column and then apply the aggregate function on the result.
df.groupBy("company_name").agg({"revenue_sales":"sum"}).show()


4.4.4 orderBy Function
The orderBy function allows us to order and sort the data in our DataFrame. It is very useful when we need to quickly analyze and visualize our data. It is applied to one or more columns of the DataFrame and sorts the rows in ascending or descending order based on the chosen column(s). Supposing we want to order our "revenue_sales" column data in ascending order, we can easily achieve this by using the orderBy() function on the DataFrame and passing the column name to be sorted as the parameter to the function.
Ascending order sorting

Example: df.orderBy(“revenue_sales”).show()

We can also sort columns in descending order of the column values by passing the column object itself to the orderBy() function and calling the inbuilt desc() function on that column object.
Descending order sorting

Example: df.orderBy(df["revenue_sales"].desc()).show()

#Code Snippet 22
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkGroupBy&Agg').getOrCreate()

df = spark.read.csv("company_product_revenue.csv",header=True,inferSchema=True)
df.show(4)

print("Total revenue_sales per company")
df.groupBy("company_name").sum().show()

print("Total revenue_sales for the entire data")
df.agg({"revenue_sales":"sum"}).show()

print("Max revenue_sales value per company")
df.groupBy("company_name").agg({"revenue_sales":"max"}).show()

print("Ordering the data based on the revenue_sales in ascending order")
df.orderBy("revenue_sales").show(5)

print("Ordering the data based on the revenue_sales in descending order")
df.orderBy(df["revenue_sales"].desc()).show(5)

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2022.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/company_product_revenue.csv

+-------------+------------+-------------+
| company_name|product_name|revenue_sales|
+-------------+------------+-------------+
| Audi| A4| 450|
|Mercedes Benz| G Class| 1200|
| BMW| X1| 425|
| Mahindra| XUV 500| 850|
+-------------+------------+-------------+
only showing top 4 rows

Total revenue_sales per company
+-------------+------------------+
| company_name|sum(revenue_sales)|
+-------------+------------------+
| Kia| 1140|
| Audi| 2275|
| Mahindra| 1640|
| BMW| 1975|
|Mercedes Benz| 2570|
+-------------+------------------+

Total revenue_sales for the entire data
+------------------+
|sum(revenue_sales)|
+------------------+
| 9600|
+------------------+

Max revenue_sales value per company
+-------------+------------------+
| company_name|max(revenue_sales)|
+-------------+------------------+
| Kia| 690|
| Audi| 725|
| Mahindra| 850|
| BMW| 850|
|Mercedes Benz| 1200|
+-------------+------------------+

Ordering the data based on the revenue_sales in ascending order
+-------------+------------+-------------+
| company_name|product_name|revenue_sales|
+-------------+------------+-------------+
| BMW| X1| 425|
| Audi| A4| 450|
| Kia| Carnival| 450|
|Mercedes Benz| C Class| 470|
| Audi| Q7| 500|
+-------------+------------+-------------+
only showing top 5 rows

Ordering the data based on the revenue_sales in descending order
+-------------+------------+-------------+
| company_name|product_name|revenue_sales|
+-------------+------------+-------------+
|Mercedes Benz| G Class| 1200|
|Mercedes Benz| GLS| 900|
| Mahindra| XUV 500| 850|
| BMW| X5| 850|
| Mahindra| XUV 300| 790|
+-------------+------------+-------------+
only showing top 5 rows

4.4.5 Using Standard Functions in our Operations
There are various standard functions which are very useful in our logic building and scripting; they help us easily manipulate and play with data. There are simple as well as complex functions, like correlation, standard deviation and so on, which we can easily use. In order to do this, we have to import the standard functions which we want to use from pyspark.sql.functions, and it can be done with the statement –
from pyspark.sql.functions import function_name
Example: from pyspark.sql.functions import stddev, avg, format_number
Supposing we want to get the average of the revenue or sales column from our previous dataset and set the column name of the result to something of our choosing, we can use the standard avg function which we have imported and use the alias function to set the preferred column name as shown below –
df.select(avg("revenue_sales").alias("Average")).show()
By using an alias for the function result, we can define the header name of the column so that it makes more sense and can be inferred easily. We can also compute the standard deviation and see how to format the results based on our requirements. For this we can use the standard function stddev to calculate the standard deviation and the format_number function to format the result. Let us see how to put them together in the statements shown below.
std_dev_result = df.select(stddev("revenue_sales").alias("std"))
std_dev_result.select(format_number("std",3)).show()
Here the format_number() function takes two parameters, the first being the column name (in our case the calculated standard deviation result, which is present in the std_dev_result DataFrame) and the second being the number of decimal places to keep.
#Code Snippet 23
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkInbuiltFunctions").getOrCreate()

from pyspark.sql.functions import mean,avg,format_number #importing the inbuilt functions

df = spark.read.csv("company_product_revenue.csv",header=True,inferSchema=True)

df.select(mean("revenue_sales").alias("Mean Revenue Sales")).show() #using the mean function

result_avg = df.select(avg("revenue_sales").alias("Average Revenue Sales")) #using the average function
print("Average Revenue Sales value is {0}".format(result_avg.head()[0]))

result_avg.select(format_number("Average Revenue Sales",2).alias("Formatted Average")).show() #formatting the number to 2 decimal places

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2023.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/company_product_revenue.csv

+------------------+
|Mean Revenue Sales|
+------------------+
| 685.7142857142857|
+------------------+

Average Revenue Sales value is 685.7142857142857

+-----------------+
|Formatted Average|
+-----------------+
| 685.71|
+-----------------+

4.5 Dealing with Missing Data in Spark DataFrames
In practical, realistic scenarios the data we work with might have some empty or missing values, which are generally referred to as null values. The null values must be handled properly in our data in order to obtain meaningful and accurate results, especially if we are working on tasks based on critical and sensitive data like medical data, healthcare equipment data and so on. There are various ways to handle this missing data issue: we could either drop the data points which contain missing data or fill them in with some calculated or default value. It is generally recommended to provide a calculated result or a default value to replace the missing values rather than dropping the data points, as a way of preserving the data. Now let us go about learning the various ways in which we can handle this missing data in detail. In order to understand better, let us work with some real data which contains a few

null or missing values. We will be reading a csv file called product_description which contains features or columns like the product name, product manufacturer, and product price.
4.5.1 Dropping the rows or data points that contain null values
We can drop the missing values from our DataFrame using the drop() method, which is accessed through the "na" property of the DataFrame. The drop() function returns a new DataFrame omitting all the rows with null values.
Example: df.na.drop().show()
This drops all the rows which have null values present in the 'df' DataFrame. The drop() function has 3 important parameters and they are





how – The how parameter has two main options, 'any' and 'all'. By default how is equal to 'any', which means: drop a row if any of its values is null. The second option, 'all', means: drop only those rows in which all the values are null.
Example: df.na.drop(how='all').show()
This returns the DataFrame keeping the rows that have one or more null values in their columns, but dropping all those rows whose values are all null.
thresh – The thresh parameter can be assigned a natural number, and based on this value it keeps only those rows which have at least that many non-null values.
Example: df.na.drop(thresh=2).show()
This drops only those rows which have fewer non-null values (that is, values other than null) than the specified thresh value. By default thresh is set to None, in which case the how parameter decides which rows are dropped.
subset – The subset parameter can be used to specify the columns that we want to consider when checking or validating for null values.
Example: df.na.drop(subset=[column_name(s)]).show()
This drops the rows with the how condition 'any' applied only to those specific columns which are mentioned in the subset parameter. Null values present in other columns are still shown.

4.5.2 Filling the null values with another value
Until now we were just dropping the rows with null values. Instead of dropping the rows which have missing values, let us now see how to fill in the missing values and retain those rows or data points.


We can replace the null or missing values in our DataFrame with another value using the fill() method, which is accessed through the "na" property of the DataFrame.
Syntax: dataframe.na.fill(value, subset=None)
Generally the fill() function replaces the null values by matching the data type of the column values with the data type of the value given as the parameter to the fill() function.
Example: df.na.fill('Filling String Value').show()
If we apply the fill() function in the above manner, it will replace the null values in the columns which have a string data type with the value 'Filling String Value' given as the value parameter; similarly, if it receives an integer or double value as the value parameter, it will behave in the same fashion by replacing the null values in those columns which have the integer or double data type.
Practically we may not simply enter the value and let Spark infer which columns have missing values and fill them; rather we would specify the value which has to be filled and also specify the column(s) in which we want to fill these values in place of their missing or null values. We can use the subset parameter which we have seen in the drop() function to select the columns which we want to fill. Supposing we want to fill missing values in a column which is of string type, and our value is also a string, we can do it as shown in the statement below –
dataframe.na.fill("New Value", subset=[column_name]).show()
Also note that if we pass a value whose type does not match a column's type (for example a non-string value for a string column), that column will simply be ignored, that is, its missing values will not be filled.

4.5.3 Filling calculated values in place of missing or null values
We have seen how to fill arbitrary values in place of the missing or null values in the DataFrame, but most of the time filling arbitrary values might not be a very reliable method in terms of data integrity and correctness. Instead, we can perform some calculation on our existing DataFrame and derive a value which is meaningful and in accordance with the correctness of the data. That is, if we have some null values in the DataFrame, we could perform some custom or standard operations, like mean, correlation and so on, and the result can be used to replace the null values in the DataFrame.
Now let us see how we can use a standard function like mean to calculate the mean of a column where there is some missing data and assign this calculated mean to the missing values in that column or feature of the DataFrame. In order to do this we have to import the standard mean function from pyspark.sql.functions and use it to calculate the mean value of the column which has null values. After obtaining the result, we can fill this calculated mean value in place of the null values present in that column.

We can go through code snippet 24 to understand how this works.

4.5.4 Replacing specific value(s) in the DataFrame with another value
Replace, as the name suggests, is used to replace values in a DataFrame with another value that we define. It can be used to change or alter one or more values at a time in our DataFrame. We can use the replace() function as shown below –
Syntax: replace(to_replace, value, subset=None)
The replace function takes three parameters: the first is the value which has to be replaced, the second is the value which is filled in as the replacement, and the last is the subset, which specifies the column(s) in which the replacement should occur. Supposing we want to replace the product name x with y in the product_name column, we can do that by executing the following statement.
Example: df.na.replace("x","y","product_name").show()
Also note that the value to be replaced and the value to be filled can take the data types 'int', 'long', 'float', 'string' or 'list'.
#Code Snippet 24
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkMisingData").getOrCreate()

df = spark.read.csv("employee_data.csv",header=True,inferSchema=True)
df.show()

print("Data after dropping the rows having null values")
df.na.drop().show()

print("Data after keeping only the rows having at least 4 non-null values")
df.na.drop(thresh=4).show()

print("Data after dropping the rows having null values in hours column")
df.na.drop(subset="hours").show()

print("Data after filling the rows having null values in hours column")


df.na.fill(12,subset="hours").show()

from pyspark.sql.functions import mean #importing the mean function
mean_value = df.select(mean("hours")).collect()[0][0] #calculating mean value

print("Data after filling the rows having null values in hours column with calculated mean value")
df.na.fill(mean_value,subset="hours").show()

print("Data after replacing a specific rows value in employee_name column")
df.na.replace("Pichai","Sundar",subset="employee_name").show()

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2024.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/employee_data.csv

+-----------+-------------+----+----------+-----+
|employee_id|employee_name| age|  location|hours|
+-----------+-------------+----+----------+-----+
|       G001|       Pichai|  47|California|   14|
|       M002|         Bill|  64|Washington| null|
|       A003|         Jeff|  56|      null| null|
|       A004|         null|null|      null|   12|
+-----------+-------------+----+----------+-----+
Data after dropping the rows having null values
+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age|  location|hours|
+-----------+-------------+---+----------+-----+
|       G001|       Pichai| 47|California|   14|
+-----------+-------------+---+----------+-----+
Data after dropping the rows having atleast 4 non-null values
+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age|  location|hours|
+-----------+-------------+---+----------+-----+
|       G001|       Pichai| 47|California|   14|
|       M002|         Bill| 64|Washington| null|
+-----------+-------------+---+----------+-----+
Data after dropping the rows having null values in hours column
+-----------+-------------+----+----------+-----+
|employee_id|employee_name| age|  location|hours|
+-----------+-------------+----+----------+-----+
|       G001|       Pichai|  47|California|   14|
|       A004|         null|null|      null|   12|
+-----------+-------------+----+----------+-----+
Data after filling the rows having null values in hours column
+-----------+-------------+----+----------+-----+
|employee_id|employee_name| age|  location|hours|
+-----------+-------------+----+----------+-----+
|       G001|       Pichai|  47|California|   14|
|       M002|         Bill|  64|Washington|   12|


|       A003|         Jeff|  56|      null|   12|
|       A004|         null|null|      null|   12|
+-----------+-------------+----+----------+-----+
Data after filling the rows having null values in hours column with calculated mean value
+-----------+-------------+----+----------+-----+
|employee_id|employee_name| age|  location|hours|
+-----------+-------------+----+----------+-----+
|       G001|       Pichai|  47|California|   14|
|       M002|         Bill|  64|Washington|   13|
|       A003|         Jeff|  56|      null|   13|
|       A004|         null|null|      null|   12|
+-----------+-------------+----+----------+-----+
Data after replacing a specific rows value in employee_name column
+-----------+-------------+----+----------+-----+
|employee_id|employee_name| age|  location|hours|
+-----------+-------------+----+----------+-----+
|       G001|       Sundar|  47|California|   14|
|       M002|         Bill|  64|Washington| null|
|       A003|         Jeff|  56|      null| null|
|       A004|         null|null|      null|   12|
+-----------+-------------+----+----------+-----+

4.6 Working with Date & Time in Spark DataFrames
It is very important to understand how to deal with date and time using Spark DataFrames. Most data sets have a date-time based feature or column in their schema, and some requirements are also based on date-time criteria. In order to deal with such scenarios let us go through some of the inbuilt functions available in the pyspark library. We can use the items purchased dataset, which we have seen in the filter() function section, to understand the date functions: this dataset contains a date column that tells us when the items were purchased, along with the name of the item, the item price, quantity, total amount and the tax charged for that item. Let us now see how to extract data from this date column and work through some use cases with it. To make this happen we can leverage the inbuilt functions that pyspark has to offer. There are quite a number of date-time functions readily available for use; we shall go through the most commonly used ones.
A few of the most commonly used date-time functions:

dayofmonth() – It is used to extract the day of the month from a given date.

62


month() – It is used to extract the month from a given date.
year() – It is used to extract the year from a given date.
weekofyear() – It is used to extract the week number of the year from a given date.
date_format(date,format) – It is used to convert a date, timestamp, or string to a string value in the format specified by the date format given by the second argument.

We can go through the below code snippet to understand the use of these functions better and we will also do a use-case wherein we try to get the total amount of the items purchased in that particular year. We will also see how to convert the date column having string data type to the date data type in order to perform date-time operations on them.

#Code Snippet 25
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkDateTime").getOrCreate()
df = spark.read.csv("items_bought.csv",header=True,inferSchema=True)
df.show(3)
print("Schema with date as string datatype")
df.printSchema()

from pyspark.sql.functions import unix_timestamp, from_unixtime, to_date #importing necessary functions to convert date string to date type
updated_df = df.withColumn('formatted_date',to_date(unix_timestamp(df['date'],'dd-MM-yyyy').cast('timestamp')))
print("Schema with date column string datatype converted to date datatype")
updated_df.printSchema()

print("Data after dropping the date column which was of string type")
updated_df=updated_df.drop("date") #dropping the date column with string type
updated_df.show(2)

from pyspark.sql.functions import weekofyear, dayofmonth,month,year,date_format #extracting data from dates
print("Data Extraction from dates")
final_df = updated_df.select(updated_df["item_name"],
    weekofyear(updated_df["formatted_date"]).alias("week_number"),
    dayofmonth(updated_df["formatted_date"]).alias("day_number"),
    month(updated_df["formatted_date"]).alias("month"),
    year(updated_df["formatted_date"]).alias("year"))
final_df.show(3)

date_string_value = updated_df.select(df["item_name"],date_format(updated_df["formatted_date"],'MM/dd/yyyy')) #converting date type to a different date format string
date_string_value.show(2)

print("Usecase - Total amount of items purchased in that particular year")
final_format=final_df.groupBy("year").sum().select(["year","sum(year)"])
final_format.withColumnRenamed("sum(year)","Total Expenditure").show()

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2025.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/items_bought.csv

+----------+---------+----------+--------+----------+------------+
|      date|item_name|item_price|quantity|tax_amount|total_amount|
+----------+---------+----------+--------+----------+------------+
|11-10-2018|     Beer|     110.5|       2|     53.04|      163.54|
|14-02-2018|   Whisky|    1250.0|       1|     300.0|      1550.0|
|23-03-2020|   Whisky|    1300.5|       2|    624.24|     1924.74|
+----------+---------+----------+--------+----------+------------+
only showing top 3 rows

Schema with date as string datatype
root
 |-- date: string (nullable = true)
 |-- item_name: string (nullable = true)
 |-- item_price: double (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- tax_amount: double (nullable = true)
 |-- total_amount: double (nullable = true)

Schema with date column string datatype converted to date datatype
root


 |-- date: string (nullable = true)
 |-- item_name: string (nullable = true)
 |-- item_price: double (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- tax_amount: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- formatted_date: date (nullable = true)

Data after dropping the date column which was of string type +---------+----------+--------+----------+------------+--------------+ |item_name|item_price|quantity|tax_amount|total_amount|formatted_date| +---------+----------+--------+----------+------------+--------------+ | Beer| 110.5| 2| 53.04| 163.54| 2018-10-11| | Whisky| 1250.0| 1| 300.0| 1550.0| 2018-02-14| +---------+----------+--------+----------+------------+--------------+ only showing top 2 rows Data Extraction from dates +---------+-----------+----------+-----+----+ |item_name|week_number|day_number|month|year| +---------+-----------+----------+-----+----+ | Beer| 41| 11| 10|2018| | Whisky| 7| 14| 2|2018| | Whisky| 13| 23| 3|2020| +---------+-----------+----------+-----+----+ only showing top 3 rows +---------+---------------------------------------+ |item_name|date_format(formatted_date, MM/dd/yyyy)| +---------+---------------------------------------+ | Beer| 10/11/2018| | Whisky| 02/14/2018| +---------+---------------------------------------+ only showing top 2 rows Usecase - Total amount of items purchased in that particular year +----+-----------------+ |year|Total Expenditure| +----+-----------------+ |2018| 6054| |2019| 4038| |2020| 8080| +----+-----------------+

We are now familiar with the working of Spark DataFrames: we are able to manipulate and get a sense of the data, perform various operations on it and format the results. This knowledge of working with data and DataFrames comes in handy when performing complex calculations and also helps to bring out various insights from the data. Next we shall see how to apply Machine Learning techniques on our data with Spark, to make predictions and to gain deeper insights out of our data.


CHAPTER 5 Machine Learning with Spark
In this chapter we will go through various interesting topics, including the concepts of Machine Learning and how Spark's MLlib library supports Machine Learning. Firstly, let us build a bird's-eye view of what machine learning is before getting too deep into the technical aspects of it.
We come into contact with Machine Learning, or are exposed to it, multiple times every single day without even realizing it. Each time we do a web search on Google, DuckDuckGo or Bing, it works splendidly and most of the time we get the information we wanted in the first search result itself; how did that work? It is because their machine learning software has figured out how to rank pages based on our search keywords. When the Google Photos application or our phone's gallery software recognizes our friends or family members in our pictures, that is also machine learning. And without machine learning our e-mail inbox would be filled with lots and lots of emails, of which only around 5% we would really want to read, the rest being spam; the spam filter in our e-mail engine does this useful filtering for us, and it too works on machine learning. Machine learning is basically the science of getting computers to learn without being explicitly programmed.
The concept of Machine Learning (ML) existed earlier, but at that period of time the priority given to the problems which could be addressed by ML was low; the required hardware, processing power and, above all, the data were not in place to work with. As time progressed the supporting technologies grew exponentially and thus gave rise to ML. So ML was developed as a new capability for computers, and today it touches many segments of industry and basic science, and there are lots of problems that machine learning can deal with.
A few significant problems which can be solved by Machine Learning:
 Web search results
 Real-time ads on webpages and mobile interfaces
 Prediction of equipment failures
 Fraud detection
 Recommendation Engines
 Email spam filtering
 Credit scoring
 Medical Diagnosis
 Text sentiment analysis
The traditional definition, given by Arthur Samuel, was: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."


Arthur Samuel's claim goes back to the 1950s, when he wrote a checkers-playing program and had it play the game against itself. By watching which board positions tended to lead to wins and which tended to lead to losses, the checkers-playing program learned over time what good and bad board positions were, and it eventually learned to play checkers better than Arthur Samuel himself. This is because a computer has the patience to play tens of thousands of games against itself; no human has the patience to play that many games. By doing this, the computer gathered so much checkers-playing experience that it eventually became a better checkers player than Arthur himself. Arthur Samuel's definition is quite old, and one of the more recent definitions, by Tom Mitchell, says: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." When we relate this to the checkers game from Arthur Samuel's scenario, the experience E is the experience of having the program play tens of thousands of games against itself, the task T is the task of playing checkers, and the performance measure P is the probability that it wins the next game of checkers against some new opponent.
The Machine Learning Process Flow-

Figure 5.1: Machine Learning Process Flow

Basic Steps involved in the Machine Learning Process
1. Gathering Data or Data Acquisition
Machine Learning's fundamental ingredient is the data itself; in order to apply the various ML algorithms we need data to apply them on. The process of data gathering is project or task specific. The data set can be collected from various sources such as a file, a database, sensors and many other such sources, but the collected data cannot be used directly for any Machine Learning process, as there may be a lot of missing data, junk values and so on. Therefore, in order to solve this problem, data pre-processing or data cleaning is done.
2. Data Pre-processing or Data Cleaning
Data pre-processing is the process of cleaning the raw data: the data collected from the real world is converted into a clean data set. It is one of the most important steps in the machine learning process. If we give garbage to the model, we will get garbage in return, that is, the trained model will provide false or wrong predictions. This step is mainly responsible for building accurate machine learning models.
3. Splitting the Data for Training and Testing
In this step we randomly split our cleaned or pre-processed data into two chunks, a training set and a testing set. The training set is used to build our machine learning model, and once the model is trained, we use the testing set to test it. If we do not split our data and instead train and test the model on the same data, we might not get accurate predictions when we provide it with novel or real data, as the scope of learning and prediction will be restricted to this un-split data. Thus, separating the data and using the test set as the novel data, to get accurate feedback from the trained model and to further evaluate and correct it, is essential and highly recommended.
4. Algorithm/Model Selection and Training the Model
After we split our data, we have to think of an algorithm that is best suited to our problem. There are various algorithms that can be used to train the model, and each one of them has a specific purpose; we will learn about a few of the most common algorithms in detail in the coming sections. On deciding on an algorithm or technique, we get into the process of training the model, and for that we pass or fit our training data to the algorithm, which then gives us the trained model object.
5. Testing and Evaluating the Trained Model
The trained model from the previous step is used to perform predictions and evaluations on the test data set. That is, in order to evaluate how well our model performed, we compare its predictions with the test data, and we repeat this process iteratively until we figure out the best parameters for our ML model. Model evaluation is an integral part of the model development process; it helps us find the best model that represents our data and tells us how well the chosen model will work in the future.
6. Model Deployment and Generating Insights or Predictions Using Real Data
Finally, on the successful training and evaluation of our model, the next step is to take this model forward and use it with real data in order to obtain various insights and predictions for our problem.
A minimal end-to-end sketch of these steps, expressed with Spark's MLlib, is shown right after this list.
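The sketch below strings these steps together with Spark's MLlib purely as an illustration; the file name, the column names and the choice of LinearRegression are assumptions made for the sketch, not a prescription.

#Sketch of the process: load, clean, split, train, evaluate (assumed file and column names)
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("MLProcessSketch").getOrCreate()

raw = spark.read.csv("some_data.csv", header=True, inferSchema=True)   #1. data acquisition
clean = raw.na.drop()                                                  #2. a very crude clean-up
features = VectorAssembler(inputCols=["feature_1", "feature_2"],
                           outputCol="features").transform(clean)
data = features.select("features", "label")
train, test = data.randomSplit([0.7, 0.3])                             #3. train/test split
model = LinearRegression(featuresCol="features", labelCol="label").fit(train)   #4. training
print(model.evaluate(test).rootMeanSquaredError)                       #5. evaluation
#6. the fitted model can then be applied to new data with model.transform(...)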

Spark's MLlib library is mainly designed around two major types of machine learning techniques: Supervised and Unsupervised Learning.


5.1 Supervised Learning
Supervised Learning is probably the most common type of machine learning problem. Let us look at a scenario where supervised learning is used, taking the example of predicting house prices. We have collected and plotted data which gives the size of a house in square feet and the amount it was sold for, in a particular neighborhood, as shown in Figure 5.2 below. We can plot the same data in order to visualize it better, as shown in Figure 5.3: on the horizontal axis is the size of the different houses in square feet, and on the vertical axis the price of the different houses in thousands of dollars. Given this data, let's say we have a friend who owns a house of, say, 750 square feet in this neighborhood; he wants to sell his house and would like to know how much he can get for it.

Figure 5.2: House Price Data

Figure 5.3: House Price Data Plot

How can a machine learning algorithm help him? One thing the learning algorithm could do is try to fit a straight line through this data. Based on that, it looks like maybe his house can be sold for about $40,000, as seen in Figure 5.4. So, this is an example of a Supervised Learning algorithm. The term Supervised Learning refers to the fact that we gave the algorithm a data set in which the so-called "right answers" were given. That is, we gave it a data set of houses in which, for every example, we told it the actual price the house was sold for, and the task of the algorithm was to produce more of these right answers, so that we could determine the best price for which our friend could sell his house. To define a bit more terminology, this is also called a regression problem. By regression problem we mean that we are trying to predict a continuous-valued output, namely the price in this case.

Figure 5.4: Fitting a Straight Line to our Data


Another example would be to predict whether a piece of equipment will fail or not, where our dataset contains various parameters or features of the machine and a label that says whether it failed. This type of problem is also considered a supervised problem, as it has a label that gives us the "right answer". It is, however, a different kind of supervised problem, known as a classification problem, because unlike regression, where we were trying to predict a continuous value, here we are trying to predict a discrete-valued output, say zero or one, that is, whether the equipment will fail or not. Supervised Learning algorithms are trained using labeled examples, that is, inputs where the desired output is known, as seen in the above examples. The algorithm receives a set of inputs along with the corresponding correct outputs, and it learns by comparing its predicted output with the actual output from the data to find errors; it then corrects or modifies the model accordingly. Through methods like regression and classification, supervised learning uses these learned patterns to predict the values of the label on additional unlabeled data. It is commonly used in applications where historical data predicts likely future events or labels for novel data.

5.2 Unsupervised Learning
The second major type of machine learning problem is called Unsupervised Learning. Previously, in Supervised Learning, we were explicitly told what the so-called right answer was. In Unsupervised Learning, by contrast, we deal with data that does not have any labels or historical label data. The model is not told what the "right answer" is; the goal is to explore the data and find some structure within it. So we are given a data set and we are not told what to do with it or what each data point is; instead we are just told: here is the dataset, can you find some structure in the data? Given such a data set, an Unsupervised Learning algorithm must figure out some patterns and might decide, for instance, that the data lives in multiple clusters. This is the clustering technique that is common when using Unsupervised Learning.
An example use case of such a clustering technique is seen in the telecom industry. Say a telecom provider wants to provide quality service to its customers, and for this it intends to set up towers at locations where its users are facing network issues. To address this problem the provider takes a dataset containing the complaint logs of its users; the complaint log dataset has the user details, the complaint information and, more importantly, the location of the user who is facing the network issue. A chart can then be plotted with the locations as data points, and these can be fed to a clustering algorithm such as k-means, which clusters the locations and provides the centroids of those clusters. The telecom provider can then place its towers at those centroids, making the entire location or cluster free of network issues. Another such example is Google News, which looks at tens of thousands of news stories and automatically clusters them together, so that news stories about the same topic get displayed together. It turns out that clustering algorithms and Unsupervised Learning algorithms are used in many other problems as well.

Market segmentation is one of them. Many companies have huge databases of customer information. What they do is look at this customer dataset and automatically discover market segments, grouping the customers into different segments so that they can sell or market their products more efficiently. Again, this is Unsupervised Learning, because we have all this customer data but we do not know in advance what the market segments are; for the customers in our data set we do not know in advance who is in market segment one, who is in market segment two, and so on. We have to let the algorithm discover all of this just from the data.

Figure 5.5: Supervised Learning and Un-Supervised Learning
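As a hedged sketch of the clustering idea described in the telecom example above, here is what a k-means run over a handful of made-up (x, y) location points could look like with MLlib; the coordinates and the choice of k = 2 are invented purely for illustration.

#Sketch: clustering made-up location points with k-means (illustrative only)
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansSketch").getOrCreate()
points = spark.createDataFrame(
    [(1.0, 1.2), (1.1, 0.9), (8.0, 8.3), (7.9, 8.1), (8.2, 7.8)],
    ["x", "y"])
assembled = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(points)

kmeans = KMeans(k=2, seed=1)                 #two clusters, fixed seed for repeatability
model = kmeans.fit(assembled.select("features"))
print(model.clusterCenters())                #the centroids, i.e. candidate tower locations
model.transform(assembled).show()            #each point labelled with its cluster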

Now let us see what Spark's MLlib has to offer us in terms of algorithms and machine learning techniques belonging to the Supervised and Unsupervised types of learning.

5.3 Spark MLlib
Spark has its own MLlib library dedicated to Machine Learning. The MLlib library primarily uses the Spark DataFrame syntax which we have seen in the previous chapter. An important thing to keep in mind when dealing with MLlib is that we have to format our data so that we end up with only one or two columns. If we are performing supervised learning we should have just two columns, irrespective of the number of features: even though we may have a lot of features available, we have to condense them down into a single features column, with the labels as the second column. Similarly, for unsupervised learning we should format the data so that we have just a single column holding all the available features; since there are no labels in unsupervised learning, the algorithm deals with this single aggregated features column alone.
Supervised Learning – features and labels; Unsupervised Learning – features only.
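As a quick preview of what this condensing looks like in practice (it is covered properly in the next chapter), here is a small sketch; the column names are placeholders chosen for the example.

#Sketch: condensing several columns into the single features column MLlib expects
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("FeatureFormatSketch").getOrCreate()
df = spark.createDataFrame([(1490, 2, 60), (2500, 3, 95)],
                           ["house_size", "bedrooms", "price_sold"])

assembler = VectorAssembler(inputCols=["house_size", "bedrooms"], outputCol="features")
condensed = assembler.transform(df).select("features", "price_sold")
condensed.show()   #just two columns: features = [1490.0,2.0] / [2500.0,3.0], plus the label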

We will see in detail, in the upcoming chapter, how to aggregate or condense the multiple features or columns present in a dataset into the single features-column representation that is accepted by Spark. The various algorithms currently supported by Spark's MLlib library are:

Spark MLlib

Supervised Learning
 Regression: Linear Regression, Decision Tree Regressor, Random Forest Regressor, Gradient Boosted Tree Regressor, Survival Regression, Isotonic Regression
 Classification: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, Gradient Boosted Tree Classifier, Linear Support Vector Machines, Naïve Bayes

Unsupervised Learning
 Clustering: K-means, Bisecting k-means, Latent Dirichlet allocation (LDA), Gaussian Mixture Model (GMM)

Figure 5.6: Algorithms supported by Spark's MLlib

These are some of the algorithms that are available in the context of Spark’s MLlib library, and we will learn most of the commonly used and highly effective algorithms from this structure to understand the working of Machine Learning with Spark. With this introduction to Machine Learning, in the next chapter let us go about learning algorithms in detail and understand how these algorithms work and more importantly how we can go about implementing them using our Spark’s MLlib library.
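For orientation, here is a hedged sketch of where a few of these estimators live inside pyspark.ml; the class names are as they appear in recent PySpark releases, and the exact list available depends on the Spark version installed.

#Where a few of the listed algorithms live in pyspark.ml (version-dependent)
from pyspark.ml.regression import (LinearRegression, DecisionTreeRegressor,
                                   RandomForestRegressor, GBTRegressor,
                                   AFTSurvivalRegression, IsotonicRegression)
from pyspark.ml.classification import (LogisticRegression, DecisionTreeClassifier,
                                       RandomForestClassifier, GBTClassifier,
                                       LinearSVC, NaiveBayes)
from pyspark.ml.clustering import KMeans, BisectingKMeans, GaussianMixture, LDA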


CHAPTER 6 Supervised Learning with Spark
In this chapter we will learn about the supervised learning algorithms in detail. We will go through algorithms for both regression and classification. To easily understand the implementation of these algorithms with Spark's MLlib, it is good to first have a theoretical understanding of how they work. The two major supervised learning algorithms which we are going to learn and implement using MLlib are Linear Regression and Logistic Regression. The Linear Regression algorithm is primarily used to solve regression-based problems, while the Logistic Regression algorithm, despite the word "regression" in its name, is used to solve classification-based problems; we will discuss why this naming anomaly exists, as well as the overall concepts of these algorithms, in the coming sections.

6.1 Linear Regression with Spark
Regression is an important topic for building an intuition for Machine Learning. In this section we will mainly deal with linear regression in order to build a solid foundation for regression, and our focus will be on the practical implementation of Linear Regression, which will have a larger impact on our learning curve.
6.1.1 Linear Regression with a Single Variable
Linear regression predicts a real-valued output based on an input value. It is prominently used for finding the linear relationship between variables: a target variable and one or more predictor or supporting variables. We shall first deal with the single-variable scenario, where we try to predict the price of a house given its size. Figure 6.1 shows a snippet from the house price dataset, where the size of the house is given in square feet and the price in lakhs. On plotting a scatter chart, shown in Figure 6.2, and fitting a straight line that passes as closely as possible through the data points, we can predict the price of a house for any arbitrary house size.

Figure 6.1: House Price Dataset

Figure 6.2: House Price versus House Size plot


Suppose we fit a straight line through the data points; based on this plot with the straight line we can say how much a house would cost given its size. For example, if we want to determine the price of a house of size 2000 sq. ft., by looking at the graph we can say that it would be around 80 lakhs. So all we are trying to do when we calculate a regression is to fit a line that is as close as possible to every data point. Our goal with linear regression is to minimize the vertical distance between all the data points and our fitted line. So, in the process of determining the best line, we are attempting to minimize the distance between all of the data points and the line. There are a lot of different ways to measure and minimize this distance, such as the sum of squared errors or the sum of absolute errors, but all of these methods share the general goal of minimizing it, and one of the most popular is the least-squares method. Let us see how the least-squares method works by considering the points shown in the scatter plot in Figure 6.3 below, through which we try to fit a line using least squares.

Figure 6.3: Regression Line with Residuals Representation

The Least Squares Method fits the line by minimizing the sum of the squares of the residuals. The residual for a data point is the difference between that data point and the fitted line; the residuals can be seen as the orange lines in Figure 6.3 above.
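As a tiny worked sketch of what "sum of squared residuals" means, here are a few made-up points scored against an assumed candidate line h(x) = 10 + 0.03·x; both the points and the line are invented for illustration.

#Residuals and their squared sum for a handful of made-up points and an assumed line
points = [(1000, 42), (1500, 55), (2000, 71)]   #(house_size, price) examples

def predict(x):
    return 10 + 0.03 * x                        #the assumed candidate line

residuals = [y - predict(x) for x, y in points] #here: [2.0, 0.0, 1.0]
sse = sum(r ** 2 for r in residuals)            #sum of squared residuals = 5.0
print(residuals, sse)

The least-squares fit is simply the line whose parameters make this sum as small as possible.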

6.1.2 Mathematics behind Linear Regression
Let us quickly go through the mathematics behind linear regression, so that we have a thorough understanding of how linear regression, and machine learning in general, works. To keep our learning simple, let us introduce some notation which will be used throughout to reference the various elements. We shall use the following notation:
m to denote the number of training examples in our training set.
x to denote the input variables or features.
y to denote the output variable or the target variable.
(x,y) to denote a single training example, which is a row record from the data set.


(x(i),y(i)) to denote the ith training example in the dataset. Let us use our sample dataset of house prices to understand these notations better.

Figure 6.4: House Price and House Size Dataset

Here, m will be 4 as we have four training examples, x will refer to the house size as it is our input or feature variable, and y will refer to the price as it is our output or target variable, the one we are trying to predict. In order to refer to a single training example or record from the training set, (x(i),y(i)) is used; for example, to refer to the third-row house size we use x(3), which equals 1200, and the fourth-row house price is y(4), which equals 45, and so on in this manner. Now let us see how the supervised learning algorithm works for our house price prediction scenario. We feed our housing prices training set to our learning algorithm, and it is the job of the learning algorithm to then output a function; this function is called the hypothesis function and is by convention denoted by h. The job of the hypothesis function is to take an input, here the size of a house, maybe the size of the new house which we want to buy; it takes in the value of x and tries to output the estimated value of y for the corresponding house.

Training Set → Learning Algorithm → Hypothesis Function
Size of the House → Hypothesis Function → Estimated Price

Figure 6.5: Supervised Learning Flow with respect to the House Price Prediction Scenario

When designing a learning algorithm, the next thing we need to decide is how to represent this hypothesis function. The hypothesis function, which in this case fits a straight line, can be denoted as:

ℎ𝜃 (𝑥) = 𝜃0 + 𝜃1 𝑥


This model is called linear regression, or linear regression with one variable, the variable being x; we are predicting all the prices as a function of the single variable x. Another name for this model is univariate linear regression, which is just a fancy way of saying a single variable. Now we shall go about realizing and implementing this model, and for this we need to know about something called a cost function, which we discuss next. The cost function is used to fit the best possible straight line to our data, in order to give us the most accurate predictions. As we can see, any number of straight lines can be fit to our training data, but we are looking for the best-fit line, the one that passes as closely as possible through our data points so that the predictions are more accurate. Casting the straight line over the data points depends entirely on the hypothesis function which we saw earlier.

$h_\theta(x) = \theta_0 + \theta_1 x$
In this hypothesis function we see the θ (theta) terms, which are the parameters that directly determine how accurately the line fits; they are nothing but the coefficients of the linear regression line. Based on different values of θ, various lines can be drawn, as shown below.

When θ0 = 0 and θ1 = 0.5.

Where θ0 = 10 and θ1 = 0.5

So in order to determine the best-fit line we have to determine the values of θ0 and θ1. The idea is that we choose θ0 and θ1 so that hθ(x), the value we predict for an input x, is close to y for our training examples (x,y). Let us represent this in mathematical form; as we are dealing with the basics here, we shall not go over the complete derivation of this formula but rather understand what it means and how to use it. We determine the values of θ0 and θ1 so that one over 2m times the sum of the squared errors between our predictions on the training set and the actual house values in the training set is minimized. This is our overall objective function for linear regression.

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
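To make the formula concrete, here is a short sketch that evaluates J for a guessed pair of parameters on a few made-up (size, price) examples; the numbers are invented and are not the book's dataset.

#Evaluating the squared error cost J(theta0, theta1) on made-up training examples
data = [(1000, 42), (1500, 55), (2000, 71)]       #(x, y) pairs, m = 3

def cost(theta0, theta1):
    m = len(data)
    squared_errors = [((theta0 + theta1 * x) - y) ** 2 for x, y in data]
    return sum(squared_errors) / (2 * m)

print(cost(10, 0.03))    #J for one guess of the parameters
print(cost(0, 0.05))     #a different guess gives a different (here much larger) cost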


This cost function is also called the squared error function. There are multiple cost functions that can be used, but squared error cost functions are the ones generally used for linear regression problems; we shall learn about a few other cost functions in the coming chapters. Just to have an overview of this cost function J(θ0, θ1): on plotting J(θ0, θ1) for different values of θ0 and θ1, we get a graph as shown in Figure 6.6. We can observe that we get a bowl-shaped surface, and the lowest point of the graph corresponds to the values of θ0 and θ1 where the squared error is minimum.

Figure 6.6: Cost Function (Squared Error Cost Function)

Now what we really want is an efficient algorithm for automatically finding the values of θ0 and θ1 which minimize the cost function J. The algorithm which helps us in this process is called the Gradient Descent Algorithm.
6.1.3 Sneak peek at the Gradient Descent Algorithm
The Gradient Descent algorithm helps us minimize the cost function J by automatically finding those values of θ0 and θ1 for which J is minimum. Gradient Descent is a general algorithm; it is not only used for linear regression but plays a big role throughout Machine Learning. Let us first see how it works for an arbitrary function J(θ0, θ1), where we want to minimize J(θ0, θ1) with respect to θ0 and θ1. The idea, or the steps the algorithm follows, is as mentioned below:
 Start with some random θ0, θ1
 Keep changing θ0, θ1 to reduce J(θ0, θ1) until we end up at a minimum
The definition of gradient descent can be understood from the mathematical representation shown below.


Repeat until convergence:

$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{(for } j = 0 \text{ and } j = 1\text{)}$

Here we repeatedly update the parameter θj by subtracting α times the partial derivative term from θj. The symbol = (equal to) here represents an assignment operator, and the α in the formula is called the learning rate; it is a number which tells us how large a step we should take towards convergence. With this formula we have to update the values of θ0 and θ1 simultaneously after every loop, as the updated values are what get used in the next loop. The simultaneous update is done in the manner shown below:

$\text{temp}_0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$\text{temp}_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0 = \text{temp}_0$
$\theta_1 = \text{temp}_1$

In the formula we saw a partial derivative term; now let us understand its significance. The partial derivative gives the slope of the line that is tangent to the function. To make this easier to see, let us consider a function with a single parameter, J(θ1), represented in the two graphs shown below.

Graph 1: J(θ1) versus θ1, at a point where the slope is positive
Graph 2: J(θ1) versus θ1, at a point where the slope is negative

These graphs provide the basis for understanding the significance of the derivative term and the learning rate in the formula. As mentioned earlier, the derivative term gives us the value of the slope. In Graph 1 we can see that, on computing the derivative of the function J(θ1) at the current point, we get a positive slope, and on computing the updated value of θ1 we get a lower value than before, because the positive slope, multiplied by the learning rate, is subtracted from θ1.


In Graph 2, on the other hand, the value of the slope obtained is negative, and on applying this negative slope to the formula, since subtracting a negative value means adding, the value of θ1 is increased, moving it closer to the minimum. Here the learning rate α is a positive value that determines the size of the step the algorithm takes towards the minimum. If the learning rate α is small, the algorithm takes baby steps towards convergence, which takes a lot of time; but if we increase α too much, the algorithm takes very large steps and sometimes does not converge or find the minimum at all. So, yes, this is the Gradient Descent Algorithm, which we can use to minimize any cost function J. Now let us apply the Gradient Descent Algorithm to the linear regression cost function which we dealt with earlier.

6.1.4 Gradient Descent for our Linear Regression
We are now familiar with the Gradient Descent Algorithm, so let us take it and apply it to minimize the cost function of linear regression.
Cost function of linear regression:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

In order to apply the gradient descent algorithm we need to calculate the partial derivative terms. On working through the calculus we get the derivative terms as:

$j = 0: \quad \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$j = 1: \quad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$

Now that we have the values of the derivative terms, plugging them into the Gradient Descent Algorithm gives:

Repeat until convergence:

$\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$

(updating θ0 and θ1 simultaneously)
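The update rule above translates almost line for line into code. Here is a minimal, illustrative Python sketch of batch gradient descent for one variable, run on a few made-up points; the data, learning rate and iteration count are arbitrary choices made for the sketch.

#Batch gradient descent for univariate linear regression (illustrative sketch)
data = [(1000, 42), (1500, 55), (2000, 71)]   #made-up (x, y) training examples
m = len(data)
theta0, theta1 = 0.0, 0.0
alpha = 0.0000001                             #small learning rate since x is in the thousands

for _ in range(1000):
    errors = [(theta0 + theta1 * x) - y for x, y in data]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, (x, y) in zip(errors, data)) / m
    #simultaneous update: both gradients are computed before either theta changes
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)                         #the fitted intercept and slope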


The Gradient Descent Algorithm is, in general, susceptible to local minima or local optima, which means it can give different values of θ0 and θ1 when minimizing a cost function, depending on where we start or how we initialize the θ0 and θ1 values. But, interestingly, the cost function for linear regression is always going to be a bowl-shaped function, as shown in Figure 6.7.

Figure 6.7: Cost Function - Linear Regression

Such a function is technically called a convex function. It does not have any local optima other than the one global optimum, so on performing gradient descent on this kind of function, it will always converge to the global optimum. So, yes, we have successfully learned our first machine learning model, the linear regression model.

6.1.5 Evaluation Metrics for Linear Regression
Once we have trained our model it is always important to check how well it performs on novel, unlabeled data. In order to measure the credibility of our model we can use various evaluation metrics. Let us discuss a few of the common evaluation metrics used for regression problems, or for predicting continuous values in general.

 Mean Absolute Error (MAE)
It is basically just the average error, that is, the mean of the absolute value of the errors: the average of the absolute difference between the actual value and the predicted value.
$MAE = \frac{1}{m} \sum_{i=1}^{m} \left| y^{(i)} - h_\theta(x^{(i)}) \right|$

 Mean Squared Error (MSE)
It is basically the mean of the squared errors. Since we square the error term, large errors carry more weight than in the Mean Absolute Error, and outliers are more easily detected.
$MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2$

 Root Mean Square Error (RMSE)
Since the Mean Squared Error is expressed in squared units of the predicted value, we take the square root of the MSE in order to bring the error back to the units of the predicted value. It is basically the root of the mean of the squared errors.
$RMSE = \sqrt{MSE} = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2 }$

 R Squared Value
The R squared value is basically a statistical measure of our regression model, also known as the coefficient of determination. It measures how much of the variance in the data our model accounts for, on a scale of 0 to 1.
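Here is a small sketch that computes these metrics by hand for a handful of made-up actual/predicted pairs; Spark's own summaries, used in the code snippets that follow, report the same quantities.

#Computing MAE, MSE, RMSE and R-squared by hand for made-up predictions
actual    = [60, 95, 55, 48]
predicted = [58, 97, 54, 50]
m = len(actual)

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / m
mse  = sum(e ** 2 for e in errors) / m
rmse = mse ** 0.5

mean_actual = sum(actual) / m
ss_total    = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - sum(e ** 2 for e in errors) / ss_total      #fraction of variance explained

print(mae, mse, rmse, r2)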

Now let us see how to implement Linear Regression with a single variable using pyspark. We will take the example of predicting the house price for our implementation; we will train our model on a single variable or feature, the size of the house, and we shall also see how to convert the CSV data into a Spark accepted format.
Implementation Steps:
Step 1 – Importing the data and initial libraries
Step 2 – Data pre-processing and converting the data into spark accepted format
Step 3 – Training our Linear Regression model with a single variable
Step 4 – Evaluating our trained model
Step 5 – Performing Predictions with novel data

#Code Snippet 26
#Step 1 - Importing the data and essential libraries
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SingleVariableLinearReg').getOrCreate()
from pyspark.ml.regression import LinearRegression
data = spark.read.csv('single_variable_regression.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

#importing the VectorAssembler to convert the features into spark accepted format
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

#Step 2 - Data pre-processing and converting the data to spark accepted format
#converting the feature(s) into spark accepted data format
assembler_object = VectorAssembler(inputCols=['house_size'],outputCol='house_size_feature')
feature_vector_dataframe = assembler_object.transform(data)
print("Data after adding house_size column as a spark accepted feature")
feature_vector_dataframe.show(2)
feature_vector_dataframe.printSchema()
formatted_data = feature_vector_dataframe.select('house_size_feature','price_sold')
print("Consolidated Data with accepted features and labels")
formatted_data.show(3)

#Step 3 - Training our Linear Regression model with single variable
# Splitting the data into 70 and 30 percent
train_data, test_data = formatted_data.randomSplit([0.7,0.3])
#Defining our Linear Regression
lireg = LinearRegression(featuresCol='house_size_feature',labelCol='price_sold')
#Training our model with training data
lireg_model = lireg.fit(train_data)

#Step 4 - Evaluating the Trained Model
#Evaluating our model with testing data
test_results = lireg_model.evaluate(test_data)
print("Residuals info - distance between data points and fitted regression line")
test_results.residuals.show(4)
print("Root Mean Square Error {}".format(test_results.rootMeanSquaredError))
print("R square value {}".format(test_results.r2))

#Step 5 - Performing Predictions with novel data
#Creating unlabeled data from test data by removing the label in order to get predictions
unlabeled_data = test_data.select('house_size_feature')
predictions = lireg_model.transform(unlabeled_data)
print("\nPredictions for Novel Data")
predictions.show(4)

#Checking our model with new value manually
house_size_coeff=lireg_model.coefficients[0]
intercept = lireg_model.intercept
print("Coeffecient is {}".format(house_size_coeff))
print("Intercept is {}".format(intercept))
new_house_size = 950
#Mimicking the hypothesis function to get a prediction
price = (intercept) + (house_size_coeff)*new_house_size
print("\nPredicted house price for house size {} is {}".format(new_house_size,price))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2026.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/single_variable_regression.csv


Initial Data +----------+----------+ |house_size|price_sold| +----------+----------+ | 1490| 60| | 2500| 95| | 1200| 55| +----------+----------+ only showing top 3 rows Data after adding house_size column as a spark accepted feature +----------+----------+------------------+ |house_size|price_sold|house_size_feature| +----------+----------+------------------+ | 1490| 60| [1490.0]| | 2500| 95| [2500.0]| +----------+----------+------------------+ only showing top 2 rows root |-- house_size: integer (nullable = true) |-- price_sold: integer (nullable = true) |-- house_size_feature: vector (nullable = true) Consolidated Data with accepted features and labels +------------------+----------+ |house_size_feature|price_sold| +------------------+----------+ | [1490.0]| 60| | [2500.0]| 95| | [1200.0]| 55| +------------------+----------+ only showing top 3 rows Residuals info - distance between data points and fitted regression line +-------------------+ | residuals| +-------------------+ | -0.695754716981213| |-1.8378537735848823| |0.16332547169830036| |-0.5501179245279957| +-------------------+ only showing top 4 rows Root Mean Square Error 0.4910271657220829 R square value 0.9903556929009175 Predictions for Novel Data +------------------+-----------------+ |house_size_feature| prediction| +------------------+-----------------+ | [850.0]|43.69575471698121| | [1300.0]|57.83785377358488| | [2000.0]| 79.8366745283017| | [2500.0]| 95.550117924528| +------------------+-----------------+ only showing top 4 rows Coeffecient is 0.03146443431048867


Intercept is 16.559206221560643 Predicted house price for house size 950 is 46.642681929681125

6.1.6 Linear Regression with Multiple Variables
As we have already seen in our introduction to linear regression, we had a single variable or feature x, the size of the house, and we wanted to predict the price of the house. For this problem our hypothesis function was:

$h_\theta(x) = \theta_0 + \theta_1 x$

But let us imagine that we not only had the size of the house as a feature for predicting its price, but also knew the number of bedrooms, the number of floors, the age of the house in years and the area/location of the house. It seems like this would give us a lot more information with which to predict the price. To understand this with the help of mathematics, let us introduce some notation so that we know exactly what each symbol refers to. As we have the size of the house, the number of bedrooms, the number of floors, the age of the house and the area/location of the house, we have 5 features or variables supporting our prediction of the house price. These features are denoted by x: x1 – the size of the house, x2 – the number of bedrooms, x3 – the number of floors, x4 – the age of the house and x5 – the area of the house. In this scenario, then, we have five features; the number of features is denoted by n and the value to be predicted by y.

Figure 6.8: House Price Dataset in detail

Notations in brief:
n = number of features, that is, the number of support variables (x's) used to predict y.
m = number of records, that is, the number of rows in the dataset.
x(i) = the data record (entire row) of the ith training example. For example, x(3) addresses the data corresponding to the 3rd row.
xj(i) = the value of feature j in the ith training example. For example, x4(1) corresponds to the value 10, that is, row 1 and column 4 (the age of the house).


So now our new hypothesis with respect to the multiple features can be represented as:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

Here θ0, θ1, ..., θn are the parameter values and x1, x2, ..., xn are the feature values. This is the form of the hypothesis we use when we have multiple features. Just to give it another name, it is also called multivariate linear regression, the term multivariate being just a fancy way of saying that we have multiple features, or multiple variables, with which we try to predict the value y, or hθ(x). Now it is time to define the cost function for multivariate linear regression. It is similar to the cost function for linear regression with a single variable, except that instead of two parameters it has n + 1 parameters, θ0 through θn:

$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Since we now have the cost function ready, let us apply our gradient descent algorithm in order to determine the values of the parameters θ0 to θn which minimize it. The gradient descent algorithm for multivariate linear regression is shown below:

Repeat until convergence:

$\theta_j = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$

(simultaneously update θj for j = 0, ..., n)

Thus, the parameters for all of the features are updated using this formula until the algorithm converges to the global optimum. Now let us see how to implement multiple linear regression with Spark. It is similar to what we did for linear regression with a single variable, but the major difference is that for multiple-regression models we pass several columns to our model rather than a single variable. We will also see how to convert a feature of string data type into a Spark accepted feature. Linear regression generally works on numeric data, but supposing we have a string feature that has an impact on our prediction, we cannot simply ignore it; rather, what


we do is convert it into a numeric value which is accepted by Spark, using a StringIndexer. We will see how to do this in detail in Code Snippet 27. To understand the importance of the StringIndexer, let us take the same example of predicting house prices, but note that our new data set of house prices has a new feature called area in it. We know that depending on the area or locality where the house is located, the price of the house varies, so we need to include this area feature in our model; but since it is a string of locality names, we have to convert it into a Spark accepted format, and we do this using the StringIndexer, which represents each locality by a number, which in turn is accepted and passed to our model.

#Code Snippet 27
#Step 1 - Importing the data and essential libraries
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MultiVariableLinearReg').getOrCreate()
from pyspark.ml.regression import LinearRegression
data = spark.read.csv('multi_variable_regression.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

#importing the VectorAssembler to convert the features into spark accepted format
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

#Step 2 - Data pre-processing and converting any string data to spark accepted format
#importing the StringIndexer to convert the locality feature into spark accepted format
from pyspark.ml.feature import StringIndexer
#convert the locality feature of string type into spark accepted data format
string_index_object = StringIndexer(inputCol='area',outputCol='area_feature')
string_indexed_df_object = string_index_object.fit(data)
final_data = string_indexed_df_object.transform(data)
print("Data after converting the string column locality into spark accepted feature")
final_data.show(3)
print("Columns present in our Data and a sample row value\n")
print(final_data.columns)

#Step 3 - Data pre-processing and converting the numeric data to spark accepted format
#converting the feature(s) into spark accepted data format
#Passing multiple columns as the input columns
assembler_object = VectorAssembler(inputCols=['house_size', 'bedrooms', 'floors','house_age', 'area_feature'], outputCol='house_features')
feature_vector_dataframe = assembler_object.transform(final_data)
print(feature_vector_dataframe.head(1))
feature_vector_dataframe.printSchema()
formatted_data = feature_vector_dataframe.select('house_features','price_sold')
print("Consolidated Data with accepted features and labels")
formatted_data.show(3)

#Step 4 - Training our Linear Regression model with multiple variables
# Splitting the data into 60 and 40 percent
train_data, test_data = formatted_data.randomSplit([0.6,0.4])
#Defining our Linear regression
lireg = LinearRegression(featuresCol='house_features',labelCol='price_sold')
#Training our model with training data
lireg_model = lireg.fit(train_data)

#Step 5 - Evaluating the Trained Model
#Evaluating our model with testing data
test_results = lireg_model.evaluate(test_data)
print("Residuals info - distance between data points and fitted regression line")
test_results.residuals.show(4)
print("Root Mean Square Error {}".format(test_results.rootMeanSquaredError))
print("R square value {}".format(test_results.r2))

#Step 6 - Performing Predictions with novel data
#Creating unlabeled data from test data by removing the label in order to get predictions
unlabeled_data = test_data.select('house_features')
predictions = lireg_model.transform(unlabeled_data)
print("\nPredictions for Novel Data")
predictions.show(4)

#Checking our model with new value manually
print("Coeffecients are {}".format(lireg_model.coefficients))
print("\nIntercept is {}".format(lireg_model.intercept))
new_house_size = 1750
new_house_number_of_bedrooms = 3
new_house_number_of_floors = 2
new_house_age = 5
#Mimicking the hypothesis function to get a prediction
new_price = ((lireg_model.intercept) + (lireg_model.coefficients[0])*new_house_size + (lireg_model.coefficients[1])*new_house_number_of_bedrooms + (lireg_model.coefficients[2])*new_house_number_of_floors + (lireg_model.coefficients[3])*new_house_age)
print("\nPredicted house price for the house of size {}, having {} bedrooms ,{} floors and the age of the house being {} is {}".format(new_house_size,new_house_number_of_bedrooms,new_house_number_of_floors,new_house_age,new_price))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2027.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/multi_variable_regression.csv

Initial Data
+----------+--------+------+---------+----------+----------+
|house_size|bedrooms|floors|house_age|      area|price_sold|
+----------+--------+------+---------+----------+----------+
|      1490|       2|     2|       10|Ave Avenue|        60|
|      2500|       3|     2|       20|Ave Avenue|        95|
|      1200|       2|     1|        5|   MG Road|        55|


+----------+--------+------+---------+----------+----------+ only showing top 3 rows Data after converting the string column locality into spark accepted feature +----------+--------+------+---------+----------+----------+------------+ |house_size|bedrooms|floors|house_age| area|price_sold|area_feature| +----------+--------+------+---------+----------+----------+------------+ | 1490| 2| 2| 10|Ave Avenue| 60| 0.0| | 2500| 3| 2| 20|Ave Avenue| 95| 0.0| | 1200| 2| 1| 5| MG Road| 55| 1.0| +----------+--------+------+---------+----------+----------+------------+ only showing top 3 rows Columns present in our Data and a sample row value ['house_size', 'bedrooms', 'floors', 'house_age', 'area', 'price_sold', 'area_feature'] [Row(house_size=1490, bedrooms=2, floors=2, house_age=10, area='Ave Avenue', price_sold=60, area_feature=0.0, house_features=DenseVector([1490.0, 2.0, 2.0, 10.0, 0.0]))] root |-- house_size: integer (nullable = true) |-- bedrooms: integer (nullable = true) |-- floors: integer (nullable = true) |-- house_age: integer (nullable = true) |-- area: string (nullable = true) |-- price_sold: integer (nullable = true) |-- area_feature: double (nullable = false) |-- house_features: vector (nullable = true) Consolidated Data with accepted features and labels +--------------------+----------+ | house_features|price_sold| +--------------------+----------+ |[1490.0,2.0,2.0,1...| 60| |[2500.0,3.0,2.0,2...| 95| |[1200.0,2.0,1.0,5...| 55| +--------------------+----------+ only showing top 3 rows Residuals info - distance between data points and fitted regression line +-------------------+ | residuals| +-------------------+ |-2.4620166382101445| |-1.4706555019066911| | -7.487567406537167| | 3.192339375914486| +-------------------+ only showing top 4 rows Root Mean Square Error 1.1993400384070633 R square value 0.9949473806252802 Predictions for Novel Data +--------------------+------------------+ | house_features| prediction| +--------------------+------------------+ |[850.0,1.0,1.0,5....|42.462016638210144| |[900.0,2.0,2.0,15...| 46.47065550190669|


|[1350.0,2.0,2.0,5...| 72.48756740653717| |[2000.0,3.0,1.0,2...| 81.80766062408551| +--------------------+------------------+ only showing top 4 rows Coeffecients are [0.024579877560926094,10.731076623455326,1.1354105600235256,-0.14480761315512228,3.1145479300740484] Intercept is 4.819476940173717 Predicted house price for the house of size 1750, having 3 bedrooms ,2 floors and the age of the house being 5 is 77.0326333563377
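The same novel-house prediction can also be obtained from the fitted model itself rather than by hand. The short sketch below is only an illustration of that, assuming the `spark` session and `lireg_model` from Code Snippet 27 are still in scope; the fifth entry of the feature vector is the encoded area feature (0.0 corresponds to 'Ave Avenue' in the indexed data above), so it should reproduce essentially the same figure as the manual calculation.

#Illustrative sketch (not part of Code Snippet 27): scoring the novel house through the model itself
from pyspark.ml.linalg import Vectors

#One-row DataFrame whose single column matches the model's features column
novel_house = spark.createDataFrame(
    [(Vectors.dense([1750.0, 3.0, 2.0, 5.0, 0.0]),)],
    ['house_features'])

lireg_model.transform(novel_house).show()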

6.2 Logistic Regression with Spark
Now let us discuss classification problems, where the variable y that we want to predict is a discrete value. Not all labels are continuous; sometimes we need to predict categories as well, and this is called classification. To start with, we shall go about understanding and using the logistic regression algorithm, which is one of the most popular and widely used algorithms today. An example of such a problem is classifying a tumor as malignant or benign based on the tumor size. Here, the variable y that we are trying to predict can be thought of as taking on two possible values, either zero (0) or one (1) – benign or malignant with respect to our example.
y ϵ { 0, 1 } , [ where 0 = Negative Class & 1 = Positive Class ]
We shall first look into classification problems with two classes, the positive and negative classes. Later on, as we progress, we shall deal with multi-class problems where the variable y can take three or four values, say zero, one, two, and three; this is called a multiclass classification problem. As we have seen how the training set looks for regression problems, let us now see how it looks for a classification problem. We shall go about representing the training set for the tumor example – malignant or benign.

Figure 6.9: Tumor Training set – Classification Problem (y-axis: Malignant/Cancerous? – Yes (1) / No (0); x-axis: Tumor Size)


If we analyze this training set and try applying the linear regression algorithm, which we are now experts in, we may get a straight line that creates a threshold, say at the middle of the data points, as shown below in Figure 6.10. Here, taking the threshold on the classifier output hθ(x) at 0.5: if hθ(x) >= 0.5 we predict 1 (malignant), and if hθ(x) < 0.5 we predict 0 (benign). In Figure 6.10, linear regression does perform a good job: the fitted straight line gives a threshold that splits the data points correctly, so the green left-pointing arrow from the threshold predicts all the points in that region as benign, and the blue right-pointing arrow denotes all the points falling in that region as malignant.

Figure 6.10: Fitting a Linear Regression Line

Figure 6.11: Fitting a Linear Regression Line with outliers

Whereas, on adding a data point which has a larger tumor size and is malignant, as shown in Figure 6.11, the linear regression algorithm still fits a straight line, but the threshold is now shifted towards the right to accommodate the newly encountered data point. When this happens, some data points which are malignant are also treated as benign, as seen in Figure 6.11, giving us an inaccurate result. So, applying linear regression to a classification problem often isn't a great idea. In the first example, before we added this extra training example, linear regression was just getting lucky and gave us a hypothesis that worked well for that particular dataset; in general, applying linear regression to a classification dataset might get lucky, but often it isn't a good idea. We have also seen that the hypothesis value hθ(x) of linear regression can be much greater than 1 and much lower than 0, which is strange when we are dealing with a binary classification whose labels are only zero and one. Thus, we use the logistic regression algorithm, which has the property that its output predictions are always between zero and one, never greater than one or less than zero. One might wonder how the term logistic "regression" came to name a classification algorithm. Let us not get confused by this: it is simply the name it was given long before terms like supervised learning came into the picture. Logistic regression is actually a classification algorithm that we apply in settings where the label y is a discrete value, either zero or one.


6.2.1 Hypothesis Representation of Logistic Regression
In this section we shall see the function that we are going to use to represent our hypothesis when we encounter a classification problem. As we have seen earlier, we want our logistic regression classifier to output values between 0 and 1, that is: 0 ≤ hθ(x) ≤ 1. When we were using linear regression, the hypothesis we obtained was: hθ(x) = θ0 + θ1x. Let us modify the hypothesis function of linear regression and formulate a hypothesis for our logistic regression. For logistic regression we use a function g as shown below:

hθ(x) = g(θ0 + θ1x)
which can be mathematically written as hθ(x) = g(θTx), when x0 = 1. And here we are going to define the function g as:
g(z) = 1 / (1 + e^(−z))
This function g is called the sigmoid function or the logistic function. Now, on applying this sigmoid function, we get the representation below:
hθ(x) = 1 / (1 + e^(−(θ0 + θ1x)))   or   hθ(x) = 1 / (1 + e^(−θTx))

In order to understand this better, we can see in Figure 6.12 below what the sigmoid function looks like, and why it is relevant for logistic regression.

Figure 6.12: Function g(t) – Sigmoid Function

On observing Figure 6.12 above, we can see that the curve starts off near 0, then rises and passes through 0.5, and then flattens out again near 1. This is the graphical representation of the sigmoid function. Notice that the function asymptotes at 1 as well as at 0; that is, with respect to Figure 6.12, as t approaches positive infinity, g(t) approaches 1, and as t approaches negative infinity, g(t) approaches 0. So g(t) only outputs values between 0 and 1, which satisfies our initial condition that hθ(x) must lie between 0 and 1.
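A quick numeric check of this asymptotic behaviour can be done outside of Spark with plain NumPy; this is just an illustrative sketch, not part of the book's code snippets.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    print(z, round(float(sigmoid(z)), 4))
# Large negative z gives values near 0, z = 0 gives exactly 0.5,
# and large positive z gives values near 1.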

Now we have our hypothesis representation ready. Next, we will understand what a decision boundary is and then go about fitting the parameters θ to our data. So, given a training set, we need to pick values for the parameters θ, and this hypothesis will then let us make predictions.

6.2.2 Decision Boundary
Now that we have the hypothesis representation for logistic regression and have also seen what the sigmoid function looks like, let us try to understand how and when this hypothesis will make the prediction that y is either 0 or 1.
As we have seen earlier, hθ(x) = g(θTx) and g(t) = 1 / (1 + e^(−t)). And suppose we want to predict "y=1" if hθ(x) ≥ 0.5 and "y=0" if hθ(x) < 0.5. In Figure 6.13, we can see that g(t) ≥ 0.5 when t ≥ 0; that is, when t is positive, the sigmoid function g(t) is equal to or greater than 0.5. Hence the hypothesis function hθ(x) = g(θTx) ≥ 0.5 when θTx ≥ 0, as here θTx plays the role of t. So what we have seen is that we predict "y=1" whenever θTx ≥ 0.

Figure 6.13: Sigmoid Function

Now let us consider the other case when the hypothesis will predict “y=0” when ℎ𝜃 (𝑥) < 0.5. From Figure 6.13, we observe that 𝑔(𝑡) < 0.5, when the values of t are less than 0 or negative. In a similar way, we can say that, ℎ𝜃 (𝑥) = 𝑔(𝜃 𝑇 𝑥) < 0.5, when 𝜃 𝑇 𝑥 < 0. Let's use this to boost our understanding of how the hypothesis of logistic regression makes those predictions. In order to do that let us consider a training set as shown below in Figure 6.14.

Figure 6.14: Sample Training Set to Understand the Decision Boundary


And our hypothesis is hθ(x) = g(θ0 + θ1x1 + θ2x2). We have not yet seen how to fit the parameters for this model – we shall do that in the next section – but to understand this example, we shall assume the values of the parameters θ0, θ1 and θ2 to be -3, 1 and 1 respectively. For these values of the parameters, let us now see where the hypothesis would end up predicting "y=1" and where "y=0". Using the formula we discussed in this section, the hypothesis will predict "y=1" when θTx ≥ 0, that is: -3 + x1 + x2 ≥ 0. This says that for any example which satisfies this inequality, our hypothesis will consider "y=1" more likely and predict y=1. On rearranging this inequality we get x1 + x2 ≥ 3; in other words, our hypothesis will predict "y=1" whenever the sum of x1 and x2 is equal to or greater than 3. Depicting the same in Figure 6.14, the equation x1 + x2 = 3 resembles the equation of a straight line, and it gives a straight line that passes through the value 3 on both the x1 and x2 axes. From Figure 6.14 it is clear that any point above the line satisfies the condition x1 + x2 ≥ 3 and is therefore predicted as "y=1", and, on the contrary, the points below the line do not satisfy the condition x1 + x2 ≥ 3 and are predicted as "y=0". The line that splits the plane and helps us distinguish the predicted values is called a decision boundary, as it draws a boundary between the positive and negative classes. The decision boundary is a property of the hypothesis function and not of the training set; that is, it is independent of the data points themselves. We shall see how to fit the parameters of the hypothesis, wherein we will need to use our training set to determine the values of the parameters, and once we have these parameter values, they define the decision boundary. Now that we know what hθ(x) can represent, next we will learn how to automatically choose the parameters θ, so that given a training set we can automatically fit the parameters to our data. The small sketch below illustrates the decision rule for the parameter values assumed above.
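This is only an illustrative sketch in plain Python/NumPy (independent of Spark), with θ = (-3, 1, 1) hard-coded as assumed in the example above.

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])          # theta0, theta1, theta2 as assumed above

def predict(x1, x2):
    z = theta @ np.array([1.0, x1, x2])     # theta^T x, with x0 = 1
    return 1 if z >= 0 else 0               # equivalent to h_theta(x) >= 0.5

print(predict(1, 1))   # x1 + x2 < 3  -> predicted class 0 (below the decision boundary)
print(predict(2, 2))   # x1 + x2 >= 3 -> predicted class 1 (on or above the boundary)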

6.2.3 Cost Function
Now let us see the definition and working of the cost function for logistic regression. We know that on applying logistic regression to the m training examples – {(x(1),y(1)),(x(2),y(2)),…,(x(m),y(m))} – we get the prediction or output as 0 or 1, since it is a classification technique, that is y ∈ {0,1}. And the hypothesis function is:
hθ(x) = 1 / (1 + e^(−(θ0 + θ1x))) = 1 / (1 + e^(−θTx))
We now have to determine how to choose values that are a good fit for the parameters θ. Suppose we try to use the same cost function as for linear regression, but with the hypothesis hθ(x) being the hθ(x) of logistic regression, that is –
J(θ) = (1/2m) ∑ i=1..m (hθ(x(i)) − y(i))² , where hθ(x) = 1 / (1 + e^(−(θ0 + θ1x))) = 1 / (1 + e^(−θTx))


What we would get is a non-convex function, as shown below, and the problem with this type of non-convex function is that it has multiple local minima rather than a single global minimum. If a function has many local minima, it may not converge to the global minimum, and in that case the cost would not be minimized or calculated accurately. Thus, a convex function is always preferred.

Figure 6.15: Convex versus Non-Convex Function (J(θ) plotted against θ for a non-convex and a convex cost)

So, what we need to do is come up with a different cost function that is convex, so that we can apply our Gradient Descent algorithm and be sure that we end up at the global minimum.
CostFunction(h(θ),y) = −log(hθ(x))      when y = 1
CostFunction(h(θ),y) = −log(1 − hθ(x))  when y = 0

That is, the cost or penalty the algorithm has to pay for its output hθ(x) is −log(hθ(x)) if the labeled or actual value y is 1, and −log(1 − hθ(x)) if the value of y is 0. In order to have a better understanding of this, let us plot these functions.

Figure 6.16: Intuition of Cost Function for Logistic Regression (cost plotted against hθ(x) over [0, 1], separately for y = 1 and y = 0)

Scenario 1 – when y=1 and hθ(x) = 1, it means that our prediction is accurate and is in accordance with the actual or label value, and in that case the cost would be equal to 0. And


on the contrary, if our prediction is wrong, say hθ(x)=0 but the actual value y=1, then the cost is penalized to a large extent and tends to infinity. Scenario 2 – Similarly, when y=0 and hθ(x) = 0, our cost would also be equal to 0 as our prediction is accurate, but when y=0 and hθ(x) = 1, our cost will tend to infinity or a very large value, which means the model is punished very hard in such cases where our predictions are wrong.
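A tiny numeric check of the two penalty curves (plain NumPy, purely illustrative) makes this behaviour concrete.

import numpy as np

for h in [0.99, 0.5, 0.01]:
    print(h,
          "cost if y=1:", round(-np.log(h), 3),
          "cost if y=0:", round(-np.log(1 - h), 3))
# A confident, correct prediction costs almost nothing,
# while a confident, wrong one is penalized very heavily.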

6.2.4 Simplified Cost Function for Logistic Regression
We have seen the cost function for our logistic regression, that is –
CostFunction(h(θ),y) = −log(hθ(x))      when y = 1
CostFunction(h(θ),y) = −log(1 − hθ(x))  when y = 0

And we also know that for binary classification, that is our output variable y being either 0 or 1. We can simplify our cost function and represent the above-written cost function in a single line. CostFunction(h(θ),y) = −𝑦 log(ℎ𝜃 (𝑥)) − (1 − 𝑦) log(1 − ℎ𝜃 (𝑥)) Let us now see, how this cost function representation is similar to the cost function that we have written initially. We know that y is always equal to either 0 or 1, with this note in mind, we can apply the conditions when y is 0 and y is 1 and plug-in these y values to our new representation of the cost function and observe what we get. Scenario 1 – when y=1, applying the value of y in the CostFunction equation, we get – CostFunction(h(θ),y) = −(1) log(ℎ𝜃 (𝑥)) − (1 − 1) log(1 − ℎ𝜃 (𝑥)) Therefore, CostFunction(h(θ),y) = − log(ℎ𝜃 (𝑥))

when y=1

Scenario 2 – when y=0, applying the value of y in the CostFunction equation, we get – CostFunction(h(θ),y) = −(0) log(ℎ𝜃 (𝑥)) − (1 − 0) log(1 − ℎ𝜃 (𝑥)) Therefore, CostFunction(h(θ),y) = − log(1 − ℎ𝜃 (𝑥))

when y=0

From these scenarios we can observe that the single-line representation reduces to the original piecewise cost function in both cases. So now let us use our simplified cost function and plug it into our gradient descent algorithm to minimize the cost and obtain the correct θ values, the coefficients that fit our model. On simplification, our final cost function looks like –
J(θ) = (1/m) [ ∑ i=1..m −y(i) log(hθ(x(i))) − (1 − y(i)) log(1 − hθ(x(i))) ]

Given this cost function, in order to fit the parameters, what we are going to do then is, try to find the parameters θ that minimize J(θ). So if we try to minimize this, it would give us some set of parameters θ. Finally, if we're given a new example with some set of features x, we can then take the θ that we used to fit our training set and output our prediction.
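As an illustration of this formula (and not part of the book's Spark snippets), a minimal NumPy sketch of J(θ) could look like the following; here X is assumed to already contain a leading column of ones for x0.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * sum( -y*log(h) - (1-y)*log(1-h) )
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))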


6.2.5 Gradient Descent for Logistic Regression
In order to minimize the cost function J(θ) we can use the Gradient Descent technique. The Gradient Descent technique is defined as –
Repeat {
    θj = θj − α (∂/∂θj) J(θ)
} (simultaneously update all θj)
On computing the partial derivative of our cost function J(θ) and applying it in our Gradient Descent algorithm, we get –
Repeat {
    θj = θj − α (1/m) ∑ i=1..m (hθ(x(i)) − y(i)) xj(i)

} (simultaneously update all θj)
We notice that this update rule looks identical to the one we used for linear regression. Although the representation looks the same, the definition of the hypothesis function hθ(x) is different for the two algorithms, so the computations and the end results differ. A bare-bones sketch of this loop is shown below.
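The following is only an illustrative NumPy sketch of the update loop above (Spark's MLlib does all of this for us); X is assumed to include the bias column of ones, and the learning rate and iteration count are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iters):
        error = sigmoid(X @ theta) - y              # h_theta(x) - y for every example
        theta = theta - alpha * (X.T @ error) / m   # simultaneous update of all theta_j
    return theta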

6.2.6 Evaluating Logistic Regression
We can evaluate logistic regression, or for that matter any classification technique, using the confusion matrix. The confusion matrix is basically a comparison between the predicted condition – what we predicted the label value would be – and the true condition, what the actual label values were.

Figure 6.17: Confusion Matrix

Now let us go through and understand the terms used in the confusion matrix with the help of the above example of predicting if a person has cancer or not based on determining if the tumor is malignant or benign. There are mainly four types of outcomes –










• True Positive (TP) – Our model predicted positive and the actual label was also positive. Example – our model predicted the tumor to be malignant and the person has cancer.
• True Negative (TN) – Our model predicted negative and the actual label was also negative. Example – our model predicted the tumor to be benign and the person does not have cancer.
• False Positive (FP) – Our model predicted positive but the actual label was negative. Example – our model predicted the tumor to be malignant but the person does not have cancer.
• False Negative (FN) – Our model predicted negative but the actual label was positive. Example – our model predicted the tumor to be benign but the person has cancer.

Using the values of the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), we can derive and calculate various evaluation metrics. A few of the most common evaluation metrics are –
• Accuracy – Accuracy tells us how often our model is correct; the higher the value the better. It is calculated as the sum of true positives and true negatives divided by the total population, that is, how many we got correct out of the total population.
  Accuracy = (∑TP + ∑TN) / Total Population
• Recall or Sensitivity or True Positive Rate (TPR) – Out of all the actual positive cases, how many we predicted correctly as positive. It should be as high as possible. Recall is calculated as the true positives divided by the condition positives, i.e. the sum of true positives and false negatives.
  Recall or TPR = ∑TP / ∑condition positive = ∑TP / (∑TP + ∑FN)
• Precision – Out of all the cases we predicted as positive, how many are actually positive. Precision is calculated as the true positives divided by the predicted condition positives, i.e. the sum of true positives and false positives.
  Precision = ∑TP / ∑predicted condition positive = ∑TP / (∑TP + ∑FP)
• Specificity – Out of all the actual negative cases, how many we predicted correctly as negative. Specificity is calculated as the true negatives divided by the condition negatives, i.e. the sum of true negatives and false positives.
  Specificity = ∑TN / ∑condition negative = ∑TN / (∑TN + ∑FP)
• False Positive Rate (FPR) – It is calculated as the ratio of the false positives to the condition negatives, i.e. the sum of true negatives and false positives.
  FPR = ∑FP / ∑condition negative = ∑FP / (∑TN + ∑FP)
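As a purely illustrative sketch (the counts below are made up, not taken from any dataset in this book), these metrics can be computed directly from the four confusion-matrix counts:

# Hypothetical counts for illustration only
tp, tn, fp, fn = 40, 45, 5, 10

accuracy    = (tp + tn) / (tp + tn + fp + fn)
recall      = tp / (tp + fn)     # sensitivity / true positive rate
precision   = tp / (tp + fp)
specificity = tn / (tn + fp)
fpr         = fp / (fp + tn)     # equals 1 - specificity

print(accuracy, recall, precision, specificity, fpr)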

All these evaluation metrics are fundamental ways of comparing our predicted values with the true values.

6.2.7 ROC Curve based Evaluation for Binary Classification
The ROC – Receiver Operating Characteristic – curve is an evaluation metric that can be used for binary classification. It is basically a visualization of metrics derived from the confusion matrix. The ROC curve is a plot of the Sensitivity against 1 minus the Specificity, which is the true positive rate plotted against the false positive rate, as shown in Figure 6.18.

Figure 6.18: ROC Curve

In order to understand how the ROC curve can be evaluated and interpreted, we can go through the plot shown in Figure 6.19.

Figure 6.19: ROC plot


We can notice a diagonal line in red color passing through the center of the ROC space. This line represents a random guess – basically a 50% chance of getting it correct. If our evaluation results lie above that line, say at points x, y or z, it means our model is performing better than a random guess. On the contrary, if we are below that line, at points like a, b or c, it means our model is performing worse than a random guess. And if we had an absolutely perfect classifier, our model would reach 1. Thus, the closer the area under the ROC curve is to 1, the better the model.
We have now learned what logistic regression is, why we use it, and how we evaluate the results after performing logistic regression. Next we will go about implementing the logistic regression technique using Spark's MLlib, and we will also learn the concept of evaluators and how to use them. Along with the code example, we will also see how to deal with categorical values in a better way: we use a String Indexer to convert the categorical or string values into numerical categorical data, and then use a One Hot Encoder to split the column containing the numerical categorical data into as many columns as there are categories, each containing a "0" or "1" indicating the category that row belongs to.

#Code Snippet 28
#Step 1 - Importing the data and essential libraries
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkLogReg').getOrCreate()

data = spark.read.csv('brain_tumor_dataset.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

#Step 2 - Data pre-processing and converting any categorical data to spark accepted format
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer,OneHotEncoder

#Formatting the categorical column - sex
#Creating a String Indexer - To convert every string into a unique number
sex_string_indexer_direct = StringIndexer(inputCol='sex',outputCol='sexIndexer')
indexed_data = sex_string_indexer_direct.fit(data)


final_string_indexed_data = indexed_data.transform(data)
# Male - 1 and Female - 0 or vice versa

#Performing OneHotEncoding - convert this value into an array form
sex_encoder_direct = OneHotEncoder(inputCol='sexIndexer',outputCol='sexVector')
encoded_data = sex_encoder_direct.transform(final_string_indexed_data)
# Male - [1,0] and Female - [0,1] or vice versa
print("Data after OneHotEncoding")
encoded_data.show(4)

assembler_direct = VectorAssembler(inputCols=['age','sexVector','tumor_size'],outputCol='features')
assembler_data = assembler_direct.transform(encoded_data)

final_data_direct = assembler_data.select('features','cancerous')
print("Consolidated Data with accepted features and labels")
final_data_direct.show(3)

#Step 3 - Training our Logistic Regression model
from pyspark.ml.classification import LogisticRegression

logreg_direct = LogisticRegression(featuresCol='features',labelCol='cancerous')

train_data_direct,test_data_direct = final_data_direct.randomSplit([0.6,0.4])

logreg_model_direct = logreg_direct.fit(train_data_direct)

#Step 4 - Evaluating and performing Predictions on our model
#Evaluating our model with testing data
#Direct Evaluation using Trivial method
predictions_labels = logreg_model_direct.evaluate(test_data_direct)
print("Prediction Data")
predictions_labels.predictions.select(['features','cancerous','prediction']).show(3)


#Evaluation using BinaryClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

direct_evaluation = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='cancerous')
AUC_direct = direct_evaluation.evaluate(predictions_labels.predictions)
print("Area Under the Curve value is {}".format(AUC_direct))

print("\nCoeffecients are {}".format(logreg_model_direct.coefficients))
print("\nIntercept is {}".format(logreg_model_direct.intercept))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2028.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/brain_tumor_dataset.csv

Initial Data
+------+---+----+----------+---------+
|  name|age| sex|tumor_size|cancerous|
+------+---+----+----------+---------+
|Roland| 58|Male|       7.0|        1|
| Adolf| 65|Male|       9.0|        1|
| Klaus| 50|Male|       3.0|        0|
+------+---+----+----------+---------+
only showing top 3 rows

Data after OneHotEncoding
+------+---+------+----------+---------+----------+-------------+
|  name|age|   sex|tumor_size|cancerous|sexIndexer|    sexVector|
+------+---+------+----------+---------+----------+-------------+
|Roland| 58|  Male|       7.0|        1|       0.0|(1,[0],[1.0])|
| Adolf| 65|  Male|       9.0|        1|       0.0|(1,[0],[1.0])|
| Klaus| 50|  Male|       3.0|        0|       0.0|(1,[0],[1.0])|
|  Rosh| 26|Female|       2.0|        0|       1.0|    (1,[],[])|
+------+---+------+----------+---------+----------+-------------+
only showing top 4 rows

Consolidated Data with accepted features and labels
+--------------+---------+
|      features|cancerous|
+--------------+---------+
|[58.0,1.0,7.0]|        1|
|[65.0,1.0,9.0]|        1|
|[50.0,1.0,3.0]|        0|
+--------------+---------+
only showing top 3 rows

Prediction Data
+---------------+---------+----------+
|       features|cancerous|prediction|


+---------------+---------+----------+ | [27.0,0.0,7.2]| 1| 1.0| |[33.0,1.0,10.5]| 1| 1.0| | [39.0,0.0,9.0]| 1| 1.0| +---------------+---------+----------+ only showing top 3 rows Area Under the Curve value is 1.0 Coeffecients are [0.3045633881179126,-22.41096777736326,13.723184957381948] Intercept is -55.70055467873474
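To tie this back to the hypothesis representation from Section 6.2.1, the sketch below (illustrative only, assuming `logreg_model_direct` from Code Snippet 28 is still in scope) pushes θTx plus the intercept through the sigmoid for one made-up patient; the feature order – age, encoded sex, tumor size – follows the assembler above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical patient: age 58, sex encoded as 1.0, tumor size 7.0
x = np.array([58.0, 1.0, 7.0])
z = float(np.dot(logreg_model_direct.coefficients.toArray(), x)) + logreg_model_direct.intercept
print("Estimated probability of the tumor being cancerous:", sigmoid(z))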

6.2.8 Pipelines
Let us see the usage and implementation of pipelines, in order to set up stages and build models which can be reused easily. A Spark Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.

#Code Snippet 29
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkLogReg').getOrCreate()

data = spark.read.csv('brain_tumor_dataset.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer,OneHotEncoder

#Stage 1
sex_string_indexer = StringIndexer(inputCol='sex',outputCol='sexIndexer')

#Stage 2
sex_encoder = OneHotEncoder(inputCol='sexIndexer',outputCol='sexVector')

#Stage 3
assembler = VectorAssembler(inputCols=['age','sexVector','tumor_size'],outputCol='features')

from pyspark.ml.classification import LogisticRegression


from pyspark.ml import Pipeline

#Stage 4
logreg = LogisticRegression(featuresCol='features',labelCol='cancerous')

#passing the 4 stages directly into a pipeline object
pipeline_object = Pipeline(stages=[sex_string_indexer,sex_encoder,assembler,logreg])

train_data , test_data = data.randomSplit([0.6,0.4])

logreg_model = pipeline_object.fit(train_data)
model_results = logreg_model.transform(test_data)

print("Prediction Data")
model_results.select('cancerous','prediction').show(3)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluation_object = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='cancerous')
AUC = evaluation_object.evaluate(model_results)
print("Area Under the Curve value is {}".format(AUC))

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2029.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/brain_tumor_dataset.csv

Initial Data
+------+---+----+----------+---------+
|  name|age| sex|tumor_size|cancerous|
+------+---+----+----------+---------+
|Roland| 58|Male|       7.0|        1|
| Adolf| 65|Male|       9.0|        1|
| Klaus| 50|Male|       3.0|        0|
+------+---+----+----------+---------+
only showing top 3 rows

Prediction Data
+---------+----------+
|cancerous|prediction|
+---------+----------+
|        1|       1.0|
|        0|       0.0|
|        1|       1.0|
+---------+----------+
only showing top 3 rows

Area Under the Curve value is 1.0


CHAPTER 7
Tree Methods with Spark
In this chapter we will go through understanding and implementing tree-based methods for performing regression and classification using Spark. In tree-based methods there is a split in the flow of prediction based on the various features present in the data; that is, one or more features influence the decision flow together, leading to a combined, flow-based prediction. In the context of learning tree methods with Spark, we will quickly understand the basics of Decision Trees, Random Forests and Gradient Boosted Trees, and primarily focus on the implementation of these tree methods using Spark.

7.1 Decision Trees
A Decision Tree is a flowchart-like tree structure, where each internal node denotes a test or a condition on an attribute or feature, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions. To understand Decision Trees and their terminology better, let us go through an example where we try to predict whether our football teammates will show up for our Sunday game, based on the previous year's experience of various weather conditions and whether we played the game or not. That is, we have recorded the weather conditions, as well as whether the teammates actually showed up, for the previous year's Sundays.

Figure 7.1: Games Played Dataset

Figure 7.2: Decision Tree Representation


With this kind of data – where the label, played, depends on various factors or features, which are in turn interrelated and are well separated by some set of conditions – we can easily use the Decision Tree technique to fetch good results. In the decision tree that we have generated for the above scenario, we have nodes, edges, roots and leaves:
• Nodes – Nodes are the features that participate in the splitting process. Example – humidity, wind and so on.
• Edges – The outcome of a split, leading to the next node. Example – Weather split -> Humidity and Weather split -> Wind.
• Root Node – The node that performs the initial split. Example – weather.
• Leaf Node – Terminal nodes that predict the outcome. Example – played -> yes or no.

7.1.1 Splitting Process
One might wonder on what basis we perform the split, or how we decide which features to split on and in what order. For example, in the above scenario, how did we determine our root node to be the weather – that is, how did we figure out that splitting on the weather feature first made more sense than splitting on the humidity feature? We can understand this better by going through a simple scenario where we decide on the best split. For this, let us consider a simple dataset with three features A, B and C, which has two possible label classes, X or Y, as shown below –

Figure 7.3: Simple Dataset and its Ideal Decision Tree Representation

In the case of this example we can obviously see that we get a proper split if we split on feature B; that is, feature B clearly separates the class labels based on its value. So, it would be a good idea to split on feature B first.


Supposing we try to perform our first split on the other features instead, this is what we would end up with, as shown in the figure below –

Figure 7.4: First Split Intuition and Possibilities

It is clear that feature B gives a proper classification on its first split. But what we actually need is to figure out a technique or a method that decides what features to split on and how to split on them. Luckily we have a few mathematical methods like the Entropy and the Information Gain for choosing the best split.

7.1.2 Entropy and Information Gain
Entropy controls how a Decision Tree decides to split the data; it actually affects how a Decision Tree draws its boundaries. Entropy is nothing but a measure of disorder – we can also take it as a measure of impurity. The mathematical formula for Entropy is –
E(S) = ∑ i=1..c −pi log2(pi)

Where ‘𝑝𝑖 ’ is simply the frequentist probability of a label or class ‘i’ in our data. Supposing we have just two classes as in our previous examples, a positive class(X) and a negative class(Y). Therefore ‘i’ here could be either + or (-). So if we had a total of 100 data points in our dataset with 30 belonging to the positive class and 70 belonging to the negative class then ‘p+’ would be 3/10 and ‘p-’ would be 7/10. Now if we try to calculate our entropy for this case we would end up doing this-

Entropy = −(3/10)*log2(3/10) − (7/10)*log2(7/10) ≈ 0.88
The entropy here is approximately 0.88. This is considered a high entropy – a high level of disorder, which means a low level of purity. Entropy is measured between 0 and 1.


Now we know how to measure the disorder. Next we need a metric to measure the reduction of this disorder in our class label or target variable, given additional information like the features or independent variables about it. This is where the Information Gain comes in. It can be mathematically represented as –

InformationGain(Y,X) = Entropy(Y) – Entropy(Y|X) We simply subtract the entropy of Y given X from the entropy of Y alone to calculate the reduction of uncertainty about Y given an additional piece of information X about Y. This is called Information Gain. The greater the reduction in this uncertainty, the more information is gained about Y from X. In order to maintain simplicity and to understand the calculations better, we can represent Information Gain as shown below -

Information Gain = entropy (parent) – [weighted average]*entropy (children) We split on the feature with the highest information gain, let us take our previous example to calculate and understand how the decision tree splits using information gain.

Figure 7.5: Simple Dataset

Here in this dataset we have 4 records or observations in total; "A", "B" and "C" are the features, and "Class Label" is the label. Let us compute the Information Gain for all the features one by one and determine which one has the highest information gain. Firstly, let us calculate the entropy of the parent; for this we consider all the labels as the parent node, that is, "Class Label" is the parent node. In order to calculate the entropy of the parent node, we need to find out the fraction of examples of each class present in the parent node. There are 2 types of example (X and Y) present in the parent node, and the parent node contains a total of 4 examples.
pX = no. of X examples in parent node / total number of examples = 2/4 = 0.5
pY = no. of Y examples in parent node / total number of examples = 2/4 = 0.5


Then, the entropy of the parent node would compute to –

entropy(parent) = −[pX*log2(pX) + pY*log2(pY)] = −[0.5*log2(0.5) + 0.5*log2(0.5)] = −[−0.5 + (−0.5)] = 1
Thus, the entropy of the parent node is 1. Now let us work with the features. Firstly, let us check whether the parent node should be split by 'A' or not: if the information gain from feature 'A' is greater than that of all the other features, then the parent node can be split by 'A'. To find out the information gain of feature 'A', let us virtually split the parent node by the feature 'A'.

Figure 7.6: Splitting on node A

Now, we need to find out the entropy of both of these child nodes. The entropy of the right side child node (Y) is 0, because all of the examples in this node belong to the same class. Let us find out the entropy of the left side node XXY. In this node XXY there are two types of examples present, so we need to find out the fraction of X and Y examples separately for this node.
pX = 2/3 ≈ 0.667
pY = 1/3 ≈ 0.333
So, the entropy of XXY is –
entropy(XXY) = −[pX*log2(pX) + pY*log2(pY)] = −[0.667*log2(0.667) + 0.333*log2(0.333)] ≈ −[−0.39 + (−0.53)] ≈ 0.92, which we round to 0.9 for the remaining calculations.

Now, we need to find out the entropy (children) with weighted average.


The formula of entropy(children) with weighted average is –
[Weighted avg] entropy(children) = (no. of examples in left child node / total no. of examples in parent) * entropy of left node + (no. of examples in right child node / total no. of examples in parent) * entropy of right node
Total number of examples in parent node: 4
Total number of examples in left child node: 3
Total number of examples in right child node: 1
[Weighted avg] entropy(children) = (3/4) * 0.9 + (1/4) * 0 = 0.675

Now applying the values in our Information Gain Formula, we get – Information Gain = entropy (parent) – [weighted average]*entropy (children)

Information Gain(A) = 1 – 0.675 = 0.325 Information gain from feature ‘A’ is 0.325 Decision Tree Algorithm chooses the highest Information gain to split/construct a Decision Tree. So we need to check all the features in order to split the Tree.

Information gain from feature ‘C’-

Figure 7.7: Splitting on node C

The entropy of the left and right child nodes is the same because they contain the same mix of classes; entropy(XY) and entropy(YX) in the case of node 'C' are both equal to 1. So, the entropy(children) with weighted average for 'C' is:
[Weighted avg] entropy(children) = (2/4) * 1 + (2/4) * 1 = 1

Hence, the information gain –

Information gain(C) = 1 – 1 = 0


Finally let us go about calculating the information gain of our final remaining feature, ‘B’-

Figure 7.8: Splitting on node B

In this case, the entropy of both the left and the right side child nodes will be 0, because all of the examples in each node belong to the same class. Hence, the entropy(children) with weighted avg. for 'B' is –
[Weighted avg] entropy(children) = (2/4) * 0 + (2/4) * 0 = 0

Therefore, the information gain from the feature ‘B’ is-

Information gain(B) = 1 – 0 = 1
Since we have computed the information gain for all the features, let us see which feature has the highest information gain:

Information Gain(A) = 0.325
Information Gain(B) = 1
Information Gain(C) = 0
We know that the Decision Tree algorithm splits on the feature that has the highest information gain to construct the Decision Tree. In our case we can see that feature 'B' has the highest information gain, hence our first split will be based on this feature 'B'. And finally, our decision tree would split as shown below –

Figure 7.9: Final Decision Tree
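The whole calculation can be reproduced with a few lines of plain Python. This is only an illustrative sketch: the exact feature values of Figure 7.5 are not reproduced here, so the 0/1 values below are stand-ins chosen to give the same splits described above (note that exact arithmetic gives roughly 0.31 for feature A, versus the rounded 0.325 in the hand calculation).

from math import log2

rows = [  # (A, B, C, class label) - stand-in values reproducing the splits above
    (0, 0, 0, 'X'),
    (0, 0, 1, 'X'),
    (0, 1, 0, 'Y'),
    (1, 1, 1, 'Y'),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def information_gain(feature_index):
    parent = [r[-1] for r in rows]
    gain = entropy(parent)
    for value in set(r[feature_index] for r in rows):
        child = [r[-1] for r in rows if r[feature_index] == value]
        gain -= (len(child) / len(rows)) * entropy(child)   # subtract weighted child entropy
    return gain

for i, name in enumerate('ABC'):
    print(name, round(information_gain(i), 3))   # B comes out with the highest gain (1.0)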

All of these processes and calculations are taken care of by Spark when we use Decision Trees or any tree methods of that sort. Let us now go about learning how to implement Decision Trees using Spark. We will go through both the classification and the regression implementations of Decision Trees.


Decision Tree Classifier – In order to perform decision tree classification, let us consider the breast cancer dataset containing features like the clump thickness, cell size and cell shape, with the target or class label being benign or malignant, where benign and malignant correspond to the values 2 and 4 respectively in the dataset. We will also see which feature is mainly responsible for determining whether the class label is benign (2) or malignant (4), using the featureImportances attribute.

#Code Snippet 30
#Step 1 - Importing the data and essential libraries
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkTrees').getOrCreate()

data = spark.read.csv('breast-cancer.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

#Step 2 - Data pre-processing and converting data to spark accepted format
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['thickness','cell_size','cell_shape'],outputCol='features')
assembler_data = assembler.transform(data)

final_data = assembler_data.select('features','label')
print("Consolidated Data with features and labels")
final_data.show(3)

#Step 3 - Training our Decision Tree model
# Splitting the data into 80 and 20 percent
train_data,test_data = final_data.randomSplit([0.8,0.2])

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol='label',featuresCol='features')

dt_model = dt.fit(train_data)


dt_predictions = dt_model.transform(test_data)

#Step 4 - Evaluating our Trained Model
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator

eval_obj = BinaryClassificationEvaluator(labelCol='label')
print("Area Under the Curve value is {}".format(eval_obj.evaluate(dt_predictions)))

mul_eval_obj = MulticlassClassificationEvaluator(labelCol='label',metricName='accuracy')
print("\nAccuracy of Decision Tree is {}".format(mul_eval_obj.evaluate(dt_predictions)))

print("\nPrediction Data")
dt_predictions.show(3)

print("Detemining which feature played a major role in Decision Making\n")
print(dt_model.featureImportances)

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2030.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/breast-cancer.csv

Initial Data
+---------+---------+----------+-----+
|thickness|cell_size|cell_shape|label|
+---------+---------+----------+-----+
|        5|        1|         1|    2|
|        5|        4|         4|    2|
|        3|        1|         1|    2|
+---------+---------+----------+-----+
only showing top 3 rows

Consolidated Data with features and labels
+-------------+-----+
|     features|label|
+-------------+-----+
|[5.0,1.0,1.0]|    2|
|[5.0,4.0,4.0]|    2|
|[3.0,1.0,1.0]|    2|
+-------------+-----+
only showing top 3 rows

Area Under the Curve value is 1.0

Accuracy of Decision Tree is 0.9647887323943662

Prediction Data


+-------------+-----+--------------------+--------------------+----------+ | features|label| rawPrediction| probability|prediction| +-------------+-----+--------------------+--------------------+----------+ |[1.0,1.0,1.0]| 2|[0.0,0.0,334.0,0....|[0.0,0.0,0.985250...| 2.0| |[1.0,1.0,1.0]| 2|[0.0,0.0,334.0,0....|[0.0,0.0,0.985250...| 2.0| |[1.0,1.0,1.0]| 2|[0.0,0.0,334.0,0....|[0.0,0.0,0.985250...| 2.0| +-------------+-----+--------------------+--------------------+----------+ only showing top 3 rows Detemining which feature played a major role in Decision Making (3,[0,1,2],[0.06362893724973079,0.857071363800069,0.07929969895020007])

Here we can see that the featureImportances attribute gives us the result (3,[0,1,2],[0.06362893724973079,0.857071363800069,0.07929969895020007]). This can be interpreted as follows: the first element, 3, is the number of features present, the second element, [0,1,2], holds the indexes of these features, and the final element is none other than the importance value of each feature. It can be read and understood as –
Importance – [0.06362893724973079, 0.857071363800069, 0.07929969895020007]

Importance Value        Corresponding Index   Column name from our data / assembled data
0.06362893724973079     0 (1st Column)        thickness
0.857071363800069       1 (2nd Column)        cell_size
0.07929969895020007     2 (3rd Column)        cell_shape

The feature with the highest importance value plays the major role in decision making, and in our case we can see that the cell size feature has the highest importance and is mainly responsible for determining the target values and classifying our model. The short sketch below shows one way to pair these scores with the column names.
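This is only an illustrative sketch, assuming the `assembler` and `dt_model` objects from Code Snippet 30 are still in scope:

# Pair each assembled input column with its importance score
importances = dt_model.featureImportances
for idx, col in enumerate(assembler.getInputCols()):
    print(col, importances[idx])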

Decision Tree Regressor – In order to perform decision tree regression, let us consider the car prices dataset containing the car's dimension features like the wheel base, height, length and width, with the target label being the price of the car. Here we try to determine which dimension feature is mainly responsible for the price of the car, and also go through the Regression Evaluators to evaluate our model.

#Code Snippet 31
#Step 1 - Importing the data and essential libraries
from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('SparkDTRegression').getOrCreate()

data = spark.read.csv('car-dimension-price.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

#Step 2 - Data pre-processing and converting data to spark accepted format
data.columns
data = data.na.drop()

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['wheel-base','length','width','height'],outputCol='features')
assembler_data = assembler.transform(data)

final_data = assembler_data.select('features','price')
print("Consolidated Data with features and labels")
final_data.show(3)

#Step 3 - Training our Decision Tree model
# Splitting the data into 80 and 20 percent
train_data,test_data = final_data.randomSplit([0.8,0.2])

from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(labelCol='price',featuresCol='features')
dt_model = dt.fit(train_data)
dt_predictions = dt_model.transform(test_data)

#Step 4 - Evaluating our Trained Model
from pyspark.ml.evaluation import RegressionEvaluator

regression_evaluator_r2 = RegressionEvaluator(predictionCol='prediction',labelCol='price',metricName="r2")
R2 = regression_evaluator_r2.evaluate(dt_predictions)
print("The R Square value is {}".format(R2))


print("\nDetermining which feature played a major role in Decision Making")
print(dt_model.featureImportances)

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2031.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/car-dimension-price.csv

Initial Data
+----------+------+-----+------+-----+
|wheel-base|length|width|height|price|
+----------+------+-----+------+-----+
|      88.6| 168.8| 64.1|  48.8|13495|
|      88.6| 168.8| 64.1|  48.8|16500|
|      94.5| 171.2| 65.5|  52.4|16500|
+----------+------+-----+------+-----+
only showing top 3 rows

Consolidated Data with features and labels
+--------------------+-----+
|            features|price|
+--------------------+-----+
|[88.6,168.8,64.1,...|13495|
|[88.6,168.8,64.1,...|16500|
|[94.5,171.2,65.5,...|16500|
+--------------------+-----+
only showing top 3 rows

The R Square value is 0.935772634356986

Determining which feature played a major role in Decision Making
(4,[0,1,2,3],[0.2672879528217998,0.17794220087295273,0.5480700152167093,0.006699831088538142])

Importance – [0.2672879528217998, 0.17794220087295273, 0.5480700152167093, 0.006699831088538142]

Importance Value        Corresponding Index   Column name from our data / assembled data
0.2672879528217998      0 (1st Column)        wheel-base
0.17794220087295273     1 (2nd Column)        length
0.5480700152167093      2 (3rd Column)        width
0.006699831088538142    3 (4th Column)        height

In this case we can see that the width feature has the highest importance and is playing a major role in determining the price of the car.


7.2 Random Forests
A Random Forest, as its name suggests, consists of a large number of individual decision trees that operate as an ensemble. It basically uses many trees, with a random sample of features chosen for the split; a new random sample of features is chosen for every single tree at every single split. It can be used for both classification and regression problems: for a classification problem, each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model's prediction, and similarly, for a regression problem, it takes the average of the predicted values and sets that as its predicted continuous label.
One might wonder why there is a need to use a new random sample of features for every single tree at every single split. The answer is to avoid highly correlated trees. Supposing there is one very strong feature in the dataset, most of the trees would use that feature as their first or top split, and this would result in a group or ensemble of trees that are highly correlated. On averaging these highly correlated trees, the variance is not reduced in a significant manner. Therefore, by choosing the sample of features randomly for every single tree at every single split, Random Forests decorrelate the trees, resulting in reduced variance in our model when averaging.
Let us go about implementing a Random Forest Regressor which would predict the price of a car based on its performance and mileage. Our dataset consists of features like the horsepower, peak-rpm, city-mileage and highway-mileage, and the target label is the price of the car itself. We will also see which performance feature is mainly responsible for the price of the car.

#Code Snippet 32
#Step 1 - Importing the data and essential libraries
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkRFRegression').getOrCreate()

data = spark.read.csv('car-performance-price.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(3)

#Step 2 - Data pre-processing and converting data to spark accepted format
data = data.na.drop()

from pyspark.ml.feature import VectorAssembler


assembler = VectorAssembler(inputCols=['horsepower','peak-rpm','city-mileage','highway-mileage'],outputCol='features')
assembler_data = assembler.transform(data)

final_data = assembler_data.select('features','price')
print("Consolidated Data with features and labels")
final_data.show(3)

#Step 3 - Training our Random Forest model
# Splitting the data into 80 and 20 percent
train_data,test_data = final_data.randomSplit([0.8,0.2])

from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(labelCol='price',featuresCol='features',numTrees=120)
rf_model = rf.fit(train_data)
rf_predictions = rf_model.transform(test_data)

#Step 4 - Evaluating our Trained Model
from pyspark.ml.evaluation import RegressionEvaluator

regression_evaluator_r2 = RegressionEvaluator(predictionCol='prediction',labelCol='price',metricName="r2")
R2 = regression_evaluator_r2.evaluate(rf_predictions)
print("The R Square value is {}".format(R2))

print("\nDetemining which feature played a major role in Decision Making")
print(rf_model.featureImportances)

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2032.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/car-performance-price.csv

Initial Data
+----------+--------+------------+---------------+-----+
|horsepower|peak-rpm|city-mileage|highway-mileage|price|
+----------+--------+------------+---------------+-----+
|       111|    5000|          21|             27|13495|
|       111|    5000|          21|             27|16500|
|       154|    5000|          19|             26|16500|
+----------+--------+------------+---------------+-----+
only showing top 3 rows

Consolidated Data with features and labels


+--------------------+-----+
|            features|price|
+--------------------+-----+
|[111.0,5000.0,21....|13495|
|[111.0,5000.0,21....|16500|
|[154.0,5000.0,19....|16500|
+--------------------+-----+
only showing top 3 rows

The R Square value is 0.9419320689071943

Detemining which feature played a major role in Decision Making
(4,[0,1,2,3],[0.3985693584462228,0.06836379609827922,0.24938467044208645,0.28368217501341153])

Importance – [0.3985693584462228, 0.06836379609827922, 0.24938467044208645, 0.28368217501341153]

Importance Value        Corresponding Index   Column name from our data / assembled data
0.3985693584462228      0 (1st Column)        horsepower
0.06836379609827922     1 (2nd Column)        peak-rpm
0.24938467044208645     2 (3rd Column)        city-mileage
0.28368217501341153     3 (4th Column)        highway-mileage

From this feature Importance information, it is obvious that the horsepower feature is mainly responsible for the price of the car as it has the highest feature importance value when compared to the rest of the features.

7.3 Gradient Boosted Trees
The term Boosting refers to a method of converting weak learners into strong learners; in this case we intend to perform the boosting on trees. Gradient Boosted Trees involve three major entities –
• A cost/loss function which has to be optimized
The cost function quantifies the error between predicted values and expected values and presents it in the form of a single real number. It is basically used to determine how far off our predictions are from the actual label or target value.
• A weak learner to make predictions
The weak learner can be a classifier or predictor which performs just above average; its accuracy is above chance. No matter what the distribution over the training data is, it will always do better than chance when it tries to label the data; doing better than chance means we always have an error rate which is less than 0.5. Decision Trees are used as the weak learner in gradient boosting. We can also constrain various components of the weak learners, such as the number of nodes, the maximum number of layers, splits, or leaf nodes.
• An additive model to add weak learners to minimize the cost/loss function
In our additive model for Gradient Boosted Trees, the trees are added one at a time, and the existing trees in the model are retained unchanged. A gradient descent procedure is then used to minimize the cost function when adding the trees.

A General Simplistic Procedure for Gradient Boosted Trees
1. Firstly, we train a weak model using data samples drawn according to some sort of weight distribution, wherein each sample has a weight associated with it.
2. Next, we increase the weights of the samples that are misclassified by our model – basically punishing it on errors – and decrease the weights of the samples that are classified correctly by our model.
3. Finally, we train the next weak model using the samples drawn according to the updated weight distribution.
According to this procedure, the algorithm always trains the next model using the data samples that were difficult to learn in the previous cycle, and this results in an ensemble of models that are good at learning different parts of the training data – basically "boosting" the weights of those samples which were predicted incorrectly.
Now let us go about implementing the Gradient Boosted Tree technique and also compare the Gradient Boosted Trees, the Random Forest and the Decision Tree based on their accuracy. We will use the breast cancer dataset, which contains features like id_number, clump thickness, cell size, cell shape and so on, and the target label belonging to the class benign (2) or malignant (4). For this implementation we will also see how to import and use the 'libsvm' format dataset: the breast cancer dataset we will be using is in the libsvm format, already pre-processed and ready to use.

#Code Snippet 33
#Step 1 - Importing the data and all the necessary libraries
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkTreeComparisions').getOrCreate()

data = spark.read.format('libsvm').load('libsvm-breast-cancer.txt')
print("Libsvm format Data - Fully formatted and ready to use data")
data.show(3)


#Step 2 - Training our Tree models
# Splitting the data into 70 and 30 percent
train_data,test_data = data.randomSplit([0.7,0.3])

from pyspark.ml.classification import GBTClassifier,DecisionTreeClassifier,RandomForestClassifier

gbt = GBTClassifier()                       #Gradient Boosted Trees
rf = RandomForestClassifier(numTrees=150)   #Random Forest with 150 Trees
dt = DecisionTreeClassifier()               #Decision Trees

gbt_model = gbt.fit(train_data)
rf_model = rf.fit(train_data)
dt_model = dt.fit(train_data)

gbt_predictions = gbt_model.transform(test_data)
rf_predictions = rf_model.transform(test_data)
dt_predictions = dt_model.transform(test_data)

print("Gradient Boosted Tree Predictions")
gbt_predictions.show(3)

#Step 3 - Evaluating our Trained Models
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

mul_eval_obj = MulticlassClassificationEvaluator(metricName='accuracy')

print("Accuracy of Decision Tree is {}".format(mul_eval_obj.evaluate(dt_predictions)))
print("Feature Importances of Decision Tree {}\n".format(dt_model.featureImportances))

print("Accuracy of Random Forest is {}".format(mul_eval_obj.evaluate(rf_predictions)))
print("Feature Importances of Random Forest {}\n".format(rf_model.featureImportances))

print("Accuracy of GBT is {}".format(mul_eval_obj.evaluate(gbt_predictions)))
print("Feature Importances of GBT {}\n".format(gbt_model.featureImportances))


Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2033.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/libsvm-breast-cancer

Libsvm format Data - Fully formatted and ready to use data
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  2.0|(10,[0,1,2,3,4,5,...|
|  2.0|(10,[0,1,2,3,4,5,...|
|  2.0|(10,[0,1,2,3,4,5,...|
+-----+--------------------+
only showing top 3 rows

Gradient Boosted Tree Predictions
+-----+--------------------+----------+
|label|            features|prediction|
+-----+--------------------+----------+
|  2.0|(10,[0,1,2,3,4,5,...|       2.0|
|  2.0|(10,[0,1,2,3,4,5,...|       2.0|
|  2.0|(10,[0,1,2,3,4,5,...|       2.0|
+-----+--------------------+----------+
only showing top 3 rows

Accuracy of Decision Tree is 0.9541284403669725
Feature Importances of Decision Tree (10,[0,1,2,3,4,5,6,8],[0.013364966827984295,0.02968777933807465,0.8030579970505796,0.05718773576076717,0.0013226687818967816,0.019695740588608433,0.024668402607831135,0.05101470904425795])

Accuracy of Random Forest is 0.9766355140186916
Feature Importances of Random Forest (10,[0,1,2,3,4,5,6,7,8,9],[0.01260954816747222,0.03839613477742716,0.2772709823778001,0.2750452698251106,0.020164117502362154,0.1104786462665907,0.08420655421869473,0.10672561382201107,0.07220891408617425,0.0028942189563570513])

Accuracy of GBT is 0.9846153846153847
Feature Importances of GBT (10,[0,1,2,3,4,5,6,7,8,9],[0.01295218934061106,0.06400770689578573,0.29296863490871017,0.240072249179513,0.020182305762243048,0.04971570054470118,0.1316500601037912,0.12225750694073848,0.06088058194502535,0.005313064378880777])

From the above output we can observe that the accuracy of the Gradient Boosted Trees is the highest, followed by the Random Forest and then the Decision Tree, with respect to our breast cancer dataset. Also, if we take a look at the feature importances output from all three tree models, we can see that the 2nd index has the highest value compared with the rest, and this 2nd index corresponds to the 3rd column in the dataset, which is the cell size feature. We are now in a position to interpret, work with, and implement supervised learning models with ease, and we are also able to evaluate and make predictions with our trained models. Hence, we can proudly say that we are pundits in working with Supervised Learning in Spark – but that alone is not enough, therefore in the next chapter we will learn about the Unsupervised Learning techniques in Spark as well.


CHAPTER 8 Unsupervised Learning with Spark

Until now we were dealing with supervised machine learning models, where our data had some features and corresponding labels or target values. Because of this kind of setup we were able to train our models, that is, learn from this feature-and-label data and perform predictions on novel data: on providing a set of features to our trained model, it would give us a prediction of what the target label would be for that set of features. However, in practical scenarios we might not have data that has any labels or target values as such; it would have some set of features and we would be asked to find some patterns or insights from this data. This is when Unsupervised Learning comes alive. In an Unsupervised Learning technique we often try to find and create patterns or groups from the data rather than trying to predict or classify based on the features and target values, as the data itself might not contain any target values. So, in order to address this problem we apply something called a clustering technique; we can think of it as an attempt to create labels. We input some unlabelled data and our unsupervised algorithm returns the possible clusters of that data. That is, we have our data with only the features and we want to see if there are any patterns in the data that would allow us to create some groups or clusters.


Figure 8.1: Overview of Clustering Technique

A note to keep in mind while performing unsupervised learning: unlike supervised learning, where we were able to calculate the correctness or performance of our model by comparing the model's predictions with the actual labels, we will not be able to easily determine the correctness of our unsupervised models, as there is no provision to compare our predictions against any reference due to the absence of target or actual labels in our data.


We generally determine the correctness of our model based on domain knowledge and on understanding the problem in depth. Suppose we are dealing with the breast cancer dataset without the label column. With our domain expertise and the data we have in hand, we know that there are only two possible types, benign and malignant, so we can expect two clusters out of this model. Now, on applying the unsupervised learning technique, our model might split the data into two clusters, namely benign and malignant, which is what we expected as well. This kind of approach can be considered one measure of the correctness of our model. Let us go about learning the most common and very effective unsupervised clustering technique, known as K-means Clustering.

8.1 K-Means Clustering Algorithm
The K-means algorithm is an iterative algorithm that tries to partition the dataset into 'k' predefined, distinct, non-overlapping clusters or subgroups, where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different or as far apart as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster. Let us quickly understand how the K-means clustering algorithm works with some visualizations. Suppose we have a dataset as shown in Figure 8.2 and we try to cluster it into two groups. The first step is to randomly initialize two points, called the cluster centroids, which are denoted by the two crosses in Figure 8.3.

Figure 8.2: Scatter Plot of Dataset

Figure 8.3: Random Cluster Centroids

As we know, K-means is an iterative algorithm and performs two major operations: the first is the cluster assignment operation and the second is the move centroid operation.


The first of the two steps in the K-means loop is the cluster assignment. That is, it goes through all the data points, in other words each of the green dots shown, and depending on whether a point is closer to the red cluster centroid or the blue cluster centroid, it assigns each of these data points to one of the two cluster centroids, as shown in Figure 8.4. The second step in the K-means loop is to move the centroids. That is, we take each of the two cluster centroids and move them to the average of the similarly colored points. In our case we take the average of all the red data points, that is the mean of the locations of all the red points, and we move the red cluster centroid (red cross) to that location, and we do a similar computation with the blue data points and move the blue centroid accordingly. This step is shown in Figure 8.5.

Figure 8.4: Cluster Assignment (1)

Figure 8.5: Moving centroid (1)

Since it is in a loop, we again perform the cluster assignment operation: we go through the data points again and, depending on whether a point is closer to the red or the blue cluster centroid, we assign it to one of the two cluster centroids, either red or blue, as shown in Figure 8.6. Then we go about doing the second step, that is moving the centroids based on the mean locations of the red and blue data points, as shown in Figure 8.7.

Figure 8.6: Cluster Assignment (2)

Figure 8.7: Moving centroid (2)


And let us perform one more iteration on the same. On doing one more cluster assignment and move centroid operation we get the results as shown in Figure 8.8 and Figure 8.9.

Figure 8.8: Cluster Assignment (3)

Figure 8.9: Moving centroid (3)

So, finally we have successfully clustered our data points into two distinct clusters using the K-means clustering technique. Supposing we keep running additional iterations of K-means from this point, the cluster centroids and the colors of the points (the cluster assignments) will not change any further. Thus, we can confirm that at this point K-means has converged, and it has done a pretty good job finding our clusters.

8.1.1 Simplified Mathematical Representation of the K-means Clustering Algorithm
The K-means algorithm takes two inputs. One is the parameter K, which is the number of clusters we want to find in our data (we will go about learning how to choose the K value in the coming section). The second input is the unlabeled data itself.

K-Means Algorithm
Given training data – { x^(1), x^(2), x^(3), ..., x^(m) }
m – total number of training examples
K – total number of clusters
k – cluster index, a value between 1 and K

Randomly initialize K cluster centroids μ_1, μ_2, μ_3, ..., μ_K
Repeat {
    for i = 1 to m (cluster assignment operation)
        c^(i) = index (from 1 to K) of the cluster centroid closest to x^(i)
    for k = 1 to K (move centroid operation)
        μ_k = average (mean) of the points assigned to cluster k
}

In the first step, the cluster assignment operation, we find the cluster centroid closest to each data point and assign that centroid's index to c^(i). We do this by finding the value of k that minimizes the squared distance between the data point and the cluster centroid:

c^(i) = argmin_k ||x^(i) − μ_k||²

That is, we pick the cluster centroid with the smallest squared distance to our training example.

The second step is the move centroid operation. For each cluster centroid k from 1 to K (the total number of clusters), μ_k is set to the average of the data points assigned to cluster k.

For example, suppose we want to calculate the updated centroid μ_2, and say the data points x^(1), x^(2), x^(5), x^(7), x^(15) belong to μ_2. This also means that c^(1) = 2, c^(2) = 2, c^(5) = 2, c^(7) = 2, c^(15) = 2. Then,

μ_2 = (1/5) [x^(1) + x^(2) + x^(5) + x^(7) + x^(15)]

This step moves the cluster centroid to its new location, which is the mean of the locations of the data points assigned to that particular centroid.
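To make these two operations concrete, below is a minimal NumPy sketch of the K-means loop. This is plain Python rather than Spark code, and the toy points, the initial centroids and the choice of K = 2 are assumptions made purely for illustration.

import numpy as np

# Toy 2-D data points and two initial centroids (made-up values)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.2, 0.8], [8.5, 9.0]])
centroids = np.array([[1.0, 2.0], [8.0, 9.0]])

for iteration in range(10):
    # Cluster assignment operation: c(i) = index of the closest centroid to x(i)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    c = distances.argmin(axis=1)

    # Move centroid operation: mu_k = mean of the points assigned to cluster k
    new_centroids = np.array([X[c == k].mean(axis=0) for k in range(len(centroids))])

    if np.allclose(new_centroids, centroids):
        break   # the centroids stopped moving, so K-means has converged
    centroids = new_centroids

print("Cluster assignments:", c)
print("Final centroids:\n", centroids)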

8.1.2 Determining the Number of Clusters to be used for our K-means Algorithm
First we compute the sum of squared errors for different values of k, where the sum of squared errors is defined as the sum of the squared distances between each member of a cluster and its centroid. We can also consider this as our cost function.

Let us consider –
K = number of clusters
c^(i) = index of the cluster (1, 2, ..., K) to which example x^(i) is currently assigned
μ_k = cluster centroid k
μ_c(i) = cluster centroid of the cluster to which example x^(i) has been assigned
Example: if example x^(i) is assigned to cluster 7, then c^(i) = 7 and μ_c(i) = μ_7

Cost Function = J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) * Σ (from i = 1 to m) ||x^(i) − μ_c(i)||²

We can see that the cost function is in the form of the sum of squared errors function which we had seen in the previous chapter: what is being computed is the sum of the squared distances between each member of a cluster and its centroid, averaged over all the data points. Now, in order to choose the number of clusters to be used for our K-means algorithm, we can use a common method known as the Elbow method.
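As a quick illustration of this cost function, the short snippet below computes J with NumPy for a given set of assignments and centroids. The data points, assignments and centroids here are made-up values used only to show the arithmetic.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])   # data points (assumed)
c = np.array([0, 0, 1, 1])                                        # cluster index c(i) for each point
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])                  # cluster centroids mu_k

# J = (1/m) * sum over i of ||x(i) - mu_c(i)||^2
m = len(X)
J = np.sum(np.linalg.norm(X - centroids[c], axis=1) ** 2) / m
print("Cost J =", J)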


8.1.3 Elbow Method to determine the correct number of clusters
In the elbow method, we plot the values of K against the cost function or the sum of squared errors. We notice that the error decreases as K gets larger; this is because as the number of clusters increases, the distance between the data points and their centroids decreases, since there are more clusters available to accommodate the data points. The objective of the elbow method is to choose the value of K after which the decrease in the cost function becomes marginal, that is, the point where the curve stops dropping sharply and flattens out. This produces an "elbow" effect in the plot, which can be seen in Figure 8.10 below, and hence the name.

Figure 8.10: Elbow Method

From the above plot we can see that the cost function drops sharply until the number of clusters reaches 3 and flattens out after that, so we can choose 3 as our number of clusters when using the K-means algorithm.
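A hedged sketch of how we might generate the numbers behind such a plot with Spark is shown below. It assumes we already have an assembled DataFrame called final_assembled_data with a 'features' column (like the one built in Code Snippet 34 later in this chapter) and uses the KMeansModel computeCost method available in the Spark version used in this book; newer releases expose roughly the same quantity through the model's summary.trainingCost instead.

from pyspark.ml.clustering import KMeans

# Compute the within-set sum of squared errors for a range of k values
cost_per_k = {}
for k in range(2, 9):
    kmeans = KMeans(featuresCol='features', k=k, seed=1)
    model = kmeans.fit(final_assembled_data)
    cost_per_k[k] = model.computeCost(final_assembled_data)
    print("k = {} -> cost = {}".format(k, cost_per_k[k]))

# Plotting cost_per_k against k and looking for the "elbow" gives us the K value to use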

8.1.4 Silhouette Score to determine the correct number of clusters
The Silhouette Score is a better measure for deciding the number of clusters to be formed from the data. It is calculated for every data point and can be mathematically represented as –

Silhouette Score (Silhouette Coefficient) = (x − y) / max(x, y)

where,
y – mean distance to the other data points in the same cluster (mean intra-cluster distance)
x – mean distance to the data points of the next closest cluster (mean nearest-cluster distance)


The coefficient varies between -1 and 1. A value closer to 1 implies that the instance is close to its own cluster and has been assigned to the right cluster, whereas a value closer to -1 means that the instance has been assigned to the wrong cluster.
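To make the formula concrete, here is a tiny Python helper that computes the coefficient for a single data point given the two mean distances defined above; the numbers passed in are made up for illustration.

def silhouette_coefficient(x, y):
    # x - mean distance to the points of the next closest cluster
    # y - mean distance to the other points in the same cluster
    return (x - y) / max(x, y)

print(silhouette_coefficient(x=9.0, y=2.0))   # ~0.78 -> the point sits well inside its cluster
print(silhouette_coefficient(x=2.0, y=9.0))   # ~-0.78 -> the point is likely in the wrong cluster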


For example, refer to the plot shown in Figure 8.11, where the Silhouette Score is plotted against the number of clusters K.

Figure 8.11: Silhouette Method plot to determine the ideal number of clusters to be used

We can infer from the above plot that when k is equal to 3 our model attains a local optimum, whereas the number of clusters k should have been chosen as 5 in order to attain the global optimum and better results. This method is more efficient as it makes the decision regarding the optimal number of clusters more meaningful and clear. We can calculate the Silhouette Score using Spark's inbuilt ClusteringEvaluator.

8.1.5 Scaling of Data for the K-Means Algorithm
Realistically, datasets contain features that vary highly in magnitude, units, and range. It is always recommended to perform normalization on the dataset when the scale of a feature is irrelevant or misleading, and it can be ignored when the scale is meaningful. Algorithms which use a Euclidean distance measure are sensitive to magnitudes; here feature scaling helps to weigh all the features equally. That is, if a feature in the dataset is big in scale compared to the others, then in those algorithms where Euclidean distance is measured these largely scaled features become dominating, so they need to be normalized. Since K-means uses Euclidean distance to measure the distance between our data points and the cluster centroids in the cluster assignment operation, it is sensible and important to use feature scaling. We can perform the feature scaling very quickly and easily using the inbuilt StandardScaler present in Spark, which gives us the provision to scale by the standard deviation and optionally centre with the mean (its withStd and withMean parameters).

Let us go about implementing the K-means clustering algorithm with Spark. In order to understand the importance and possible uses of K-means, we shall take a practical scenario wherein we will be helping a telecom or mobile service provider to set up their network towers efficiently in order to facilitate their customers by providing improved connectivity. Suppose a mobile service provider, say Airtel, approached us saying they have introduced a new application called Open Network, wherein the company has asked its users to raise a request to initiate a complaint about the towers in their locality if they face any issues with their mobile network, and the company has collected a dataset of the users who raised complaints. The dataset provided by the company has the location details of the customers facing these network issues. That is, suppose a customer X has a network issue; he or she raises a complaint via the Open Network application, and on placing this complaint it is highly possible that the issue the customer is facing is because of low network coverage in that area or location. So, the company has recorded the location details, that is the latitude and longitude, at the time the complaint was placed. Our task is to process this location data and suggest a few locations to the company where setting up new towers, or moving the existing towers, would increase the coverage and facilitate all the users who had placed complaints.

#Code Snippet 34
#Step 1 - Importing the Data and Required Libraries
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('KMeansClustering').getOrCreate()

data = spark.read.csv('latitude_longitude.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(4)

#Step 2 - Assembling the features
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['latitude','longitude'],outputCol='features')
final_assembled_data = assembler.transform(data)


print("Consolidated Data with features") final_assembled_data.show(4) #Step 3 - Training our K-Means Model #Since our Initial Data is well scaled, we can pass it directly to our K-Means kmeans = KMeans(featuresCol='features',k=3) kmeans_model = kmeans.fit(final_assembled_data) #Step 4 - Displaying the predictions predictions = kmeans_model.transform(final_assembled_data) print("Prediction Data") predictions.show(4) centres = kmeans_model.clusterCenters() #Determining the centroids of the cluster print("The company can setup 3 of their towers at these locations- latitudes and longitudes for optimal network coverage") cluster_list=[] i=1 for centre in centres: cluster_list.append(centre) print("{} - {}".format(i,centre)) i=i+1 print("\nDetermining the number of users that belongs to each clusters") predictions.groupBy('prediction').count().show() #Step 4 -Evaluating our model from pyspark.ml.evaluation import ClusteringEvaluator evaluator_object = ClusteringEvaluator(predictionCol='prediction',featuresCol='fe atures') Silhouette_Score = evaluator_object.evaluate(predictions) print("The Silhouette Score when k=3 is {}".format(Silhouette_Score))


print("\nWithin set Sum of Square Error {}\n".format(kmeans_model.computeCost(final_assembled_data))) print(“-”*50) #Additional Info Step - Performing K-Means with Scaled Features # Example of Scaling the Data and performing K-Means from pyspark.ml.feature import StandardScaler scalar_object = StandardScaler(inputCol='features',outputCol='ScaledFeatures') scalar_model = scalar_object.fit(final_assembled_data) final_scaled_data = scalar_model.transform(final_assembled_data) print("\nConsolidated Data with Scaled Features") final_scaled_data.show(4) scaled_kmeans = KMeans(featuresCol='features',k=5) scaled_kmeans_model = scaled_kmeans.fit(final_scaled_data) scaled_predictions = scaled_kmeans_model.transform(final_scaled_data) print("Prediction Data") scaled_predictions.select('latitude','longitude','ScaledFeatur es','prediction').show(4) scaled_centres = scaled_kmeans_model.clusterCenters() print("Scaled Tower Locations {}".format(scaled_centres)) Scaled_Silhouette_Score = evaluator_object.evaluate(scaled_predictions) print("\nThe Silhouette Score when k=5 is {}".format(Scaled_Silhouette_Score)) print("\nWithin set Sum of Square Error {}".format(scaled_kmeans_model.computeCost(final_scaled_data)) ) print("\nDetermining the number of users that belongs to each clusters") scaled_predictions.groupBy('prediction').count().show() Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2034.ipynb Data – www.github.com/athul-dev/spark-with-python/blob/master/latitude_longitude.csv


Initial Data +-------------+--------------+ | latitude| longitude| +-------------+--------------+ | 0.0| 0.0| |32.8247811394|-116.870394352| | 45.326414382|-117.807811103| |39.4708861702|-119.659926097| +-------------+--------------+ only showing top 4 rows Consolidated Data with features +-------------+--------------+--------------------+ | latitude| longitude| features| +-------------+--------------+--------------------+ | 0.0| 0.0| (2,[],[])| |32.8247811394|-116.870394352|[32.8247811394,-1...| | 45.326414382|-117.807811103|[45.326414382,-11...| |39.4708861702|-119.659926097|[39.4708861702,-1...| +-------------+--------------+--------------------+ only showing top 4 rows Prediction Data +-------------+--------------+--------------------+----------+ | latitude| longitude| features|prediction| +-------------+--------------+--------------------+----------+ | 0.0| 0.0| (2,[],[])| 1| |32.8247811394|-116.870394352|[32.8247811394,-1...| 2| | 45.326414382|-117.807811103|[45.326414382,-11...| 0| |39.4708861702|-119.659926097|[39.4708861702,-1...| 0| +-------------+--------------+--------------------+----------+ only showing top 4 rows The company can setup 3 of their towers at these locations- latitudes and longitudes for optimal network coverage 1 - [ 39.5739463 -121.24864999] 2 - [0. 0.] 3 - [ 34.52887063 -116.34533272] Determining the number of users that belongs to each clusters +----------+------+ |prediction| count| +----------+------+ | 1| 27683| | 2|233916| | 0|197941| +----------+------+ The Silhouette Score when k=3 is 0.7182721127396141 Within set Sum of Square Error 4020452.566094318 -------------------------------------------------Consolidated Data with Scaled Features +-------------+--------------+--------------------+--------------------+ | latitude| longitude| features| ScaledFeatures| +-------------+--------------+--------------------+--------------------+ | 0.0| 0.0| (2,[],[])| (2,[],[])| |32.8247811394|-116.870394352|[32.8247811394,-1...|[3.52019771373825...| | 45.326414382|-117.807811103|[45.326414382,-11...|[4.86089883133903...|


|39.4708861702|-119.659926097|[39.4708861702,-1...|[4.23293982267510...| +-------------+--------------+--------------------+--------------------+ only showing top 4 rows Prediction Data +-------------+--------------+--------------------+----------+ | latitude| longitude| ScaledFeatures|prediction| +-------------+--------------+--------------------+----------+ | 0.0| 0.0| (2,[],[])| 1| |32.8247811394|-116.870394352|[3.52019771373825...| 2| | 45.326414382|-117.807811103|[4.86089883133903...| 4| |39.4708861702|-119.659926097|[4.23293982267510...| 0| +-------------+--------------+--------------------+----------+ only showing top 4 rows Scaled Tower Locations [array([ 37.90174966, -121.11197126]), array([0., 0.]), array([ 34.30144762, -116.23467335]), array([ 40.41046393, 116.42511872]), array([ 44.2344904 , -121.80042939])] The Silhouette Score when k=5 is 0.6839381225555261 Within set Sum of Square Error 2404274.6532758037 Determining the number of users that belongs to each clusters +----------+------+ |prediction| count| +----------+------+ | 1| 27683| | 3| 14087| | 4| 44817| | 2|215852| | 0|157101| +----------+------+

From the output of the code snippet, we can say that it would be ideal for the company to place 3 of their towers at the locations (39.5739463, -121.24864999), (0.0, 0.0) and (34.52887063, -116.34533272) to facilitate and fix the network issues of the users who had raised complaints. We can also see the number of users who belong to each cluster. From the output we can also verify the Silhouette Score for choosing the correct number of clusters: for this problem and the given dataset the correct number of clusters to use was 3, as the Silhouette Score is higher for k=3 compared to k=5. But this again depends on the domain knowledge and the requirements themselves; supposing the company was ready to set up or install only 3 towers, then in that case, irrespective of the Silhouette Score being higher for some other value of k, we are supposed to use 3 as the number of clusters. We have now learned the various types of machine learning techniques in Spark, and we have gone through various use cases and predictions on different kinds of data along the way. Next, let us put our machine learning knowledge to use and perform some Natural Language Processing with Spark.


CHAPTER 9 Natural Language Processing with Spark

Until now we were dealing with datasets that were mostly quantitative numeric measures or data that was structured in some form. But it would be interesting and exciting if our machine learning models were able to generate insights from texts, sentences, paragraphs, conversations and so on. This process of reading, understanding, and deriving meaning from human languages is called Natural Language Processing. Natural Language Processing is a very large field of machine learning with its own set of algorithms and features; in this chapter we will go through the basics of Natural Language Processing and mainly the implementation of Natural Language Processing with Spark. Some of the major uses and applications of Natural Language Processing include speech recognition, clustering news articles, analysing feedback, chatbots, spam email detection and so on.

The basic concrete steps or process for Natural Language Processing are –
 Create a corpus, that is, a collection of all the documents or texts
 Convert the words to a numeric representation or some sort of matrix
 Compare the features of the documents using the numeric data or matrix

We need a general way to convert these documents, or a corpus containing words and text, into something our machine learning model can understand, and since most of these algorithms take in purely numeric values, we have to convert the words into a matrix full of numbers. We can achieve this through the "TF-IDF" method; TF-IDF stands for Term Frequency – Inverse Document Frequency. Let us consider a scenario where we have 2 documents, "Early Humans" and "Modern Humans" respectively. In order to convert the words into a numeric representation we can quickly perform a word count on them. In that case, we can represent the same as shown below –
"Early Humans" – (Early, Modern, Humans) – (1,0,1)
"Modern Humans" – (Early, Modern, Humans) – (0,1,1)
This approach is known as the Bag of Words approach, wherein we first find all the possible words present in the corpus of documents (in our case we end up with Early, Modern, and Humans) and then perform a simple word count for each document against these possible words in order to obtain a vector representation of the words present in each document. So, in this way we can convert the words in a document into a vector of numbers.
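A minimal Python sketch of this word-count idea is shown below; the two tiny "documents" are just the titles from our example, and note that the column order here is alphabetical, so it differs slightly from the (Early, Modern, Humans) order written above.

from collections import Counter

documents = ["Early Humans", "Modern Humans"]

# Vocabulary: all the possible words present in the corpus
vocabulary = sorted({word for doc in documents for word in doc.split()})
print(vocabulary)                         # ['Early', 'Humans', 'Modern']

# Bag of Words: word-count vector for each document against the vocabulary
for doc in documents:
    counts = Counter(doc.split())
    vector = [counts[word] for word in vocabulary]
    print(doc, "->", vector)              # e.g. Early Humans -> [1, 1, 0]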


Now we have the Bag of Words, which is nothing but the words present in the documents represented as a vector of word counts. Since these are vectors in an N-dimensional space, we could also compare these vectors with cosine similarity:

similarity(X, Y) = cos(θ) = (X · Y) / (‖X‖ ‖Y‖)

We can now compare how similar these documents are based on the actual word counts.
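As a small illustration of this formula, the sketch below compares the two word-count vectors from our example using NumPy.

import numpy as np

# Word-count vectors over the vocabulary (Early, Modern, Humans) from the example above
X = np.array([1, 0, 1])   # "Early Humans"
Y = np.array([0, 1, 1])   # "Modern Humans"

cosine_similarity = X.dot(Y) / (np.linalg.norm(X) * np.linalg.norm(Y))
print(cosine_similarity)   # 0.5, since the two documents share one of their words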

The Bag of Words approach has a few challenges, like the absence of semantic meaning and context, stop words that add noise to our analysis, improper weighting of some words, and so on. We can improve on the Bag of Words by adjusting the word counts based on their frequency in the corpus, that is, by rescaling the frequency of words by how often they appear in all the documents, so that the scores for frequent words like "the" and "a" (stop words), which are also frequent across other documents, get penalized. This approach is called Term Frequency – Inverse Document Frequency (TF-IDF).

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a corpus (collection of documents). This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across the set of documents.

Term Frequency – The term frequency is a count of how many times a word occurs in a document. It basically determines the importance of the term within that document.
tf(x, y) = frequency of x in y, that is, the number of occurrences of the term x in the document y

Inverse Document Frequency – The inverse document frequency measures how common or rare a word is across the entire collection of documents; it basically determines the importance of the term in the corpus. The closer it is to 0, the more common the word is. This metric is calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.
idf(term) = log(total number of documents / number of documents containing the term)

Multiplying these two numbers gives the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document. The mathematical representation of TF-IDF for the term x within the document y is:

w(x, y) = tf(x, y) * log(N / df(x))

where,
tf(x, y) = frequency of x in y
df(x) = number of documents containing x
N = total number of documents

We can summarize the TF-IDF method as using a word count mechanism that takes into account how often a term shows up in a document, and then using the inverse document frequency to determine the importance of that term from the number of documents in the corpus in which it appears.
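To see these formulas in action, here is a tiny hand-rolled TF-IDF computation in plain Python. The three small documents are made up, and note that Spark's own IDF implementation applies a smoothed variant of the formula, so its exact numbers will differ slightly.

import math

documents = [
    "spark is awesome",
    "spark handles big data",
    "the data is big",
]

N = len(documents)                       # total number of documents
tokenized = [doc.split() for doc in documents]

# df(x): number of documents containing the term x
df = {}
for tokens in tokenized:
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

# w(x, y) = tf(x, y) * log(N / df(x)) for every term in the first document
for term in tokenized[0]:
    tf = tokenized[0].count(term)
    weight = tf * math.log(N / df[term])
    print("{:<10} tf={} df={} tf-idf={:.3f}".format(term, tf, df[term], weight))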

9.1 Basic Understanding of Regular Expressions for our Natural Language Processing (NLP)
A regular expression is a pattern or special text string that describes a search pattern over text. Regular expressions are used to perform pattern-matching and "search-and-replace" operations on text. Since we are dealing with Natural Language Processing, we will have to perform such operations quite often, and that is when our understanding of regular expressions comes into play. Since we are using Python for our implementation, let us quickly understand regular expressions with the help of Python's built-in RegEx package 're', which can be used to work with regular expressions. The 're' library has a few important functions, such as –

search – returns a match object if the search term is found in the string
The search function searches the string or text for a match and returns a match object if there is a match; we can then call various methods on the match object itself. If there is more than one match, only the first occurrence is returned. We can use various metacharacters and special sequences as the search term.
Example –
text = "Spark is Awesome"
match_object = re.search("\s",text) -> returns a match object for the first whitespace
match_object.start() -> returns the position of the first match (the first whitespace), that is 5 in our case
A few methods and attributes of the match object are –
o string – returns the string which was passed into the search() function
o span() – returns a tuple containing the start and end positions of the match
o group() – returns the part of the string where there was a match



split – returns a list where the string has been split at each match
The split function, as the name suggests, splits the text into a list of words on encountering the matched term.
Example –
text = "Spark is Awesome"
re.split("\s",text) -> here we are splitting our text on whitespaces, hence we get the list ["Spark","is","Awesome"]

findall – returns a list containing all the matches
The findall function, as the name implies, is used to find all the occurrences of the term within a string or text.
Example –
text = "Apache Spark"
re.findall("pa",text) -> results in a list of ["pa","pa"]
If there is no match, then an empty list is returned.



sub – replaces one or more matches with a string
The sub function is used to replace the matched term with a text of our choice.
Example –
text = "ideal filename format"
re.sub("\s","_",text) -> here we are trying to replace all whitespaces with an underscore, hence the output of this function is the string "ideal_filename_format"
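Putting these four functions together, here is a small runnable script using Python's re package; the sample sentence is made up for the demonstration.

import re

text = "Spark makes Big Data processing simple"

match_object = re.search(r"\s", text)          # match object for the first whitespace
print(match_object.start())                    # 5
print(match_object.span())                     # (5, 6)

print(re.split(r"\s", text))                   # ['Spark', 'makes', 'Big', 'Data', 'processing', 'simple']
print(re.findall("s", text))                   # every occurrence of the lowercase letter s
print(re.sub(r"\s", "_", text))                # Spark_makes_Big_Data_processing_simple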

9.1.1 Metacharacters and Special Sequences
Metacharacters in a regular expression are characters with a special meaning. They are generally considered the building blocks of regular expressions. A few of the most common and important metacharacters are shown below, using the sample text = "Apache Spark is for all".

. (dot) – Matches any single character.
    re.findall("Sp..k",text) returns -> ['Spark']
[] – Defines a set of characters.
    #Finding all the characters from lower case a to f in our text
    re.findall("[a-f]",text) returns -> ['a','c','e','a','f','a']
^ – Matches the search term only if it occurs at the start of the string.
    re.findall("^Spark",text) returns -> []
    re.findall("^Apache",text) returns -> ['Apache']
$ – Matches the search term only if it occurs at the end of the string.
    re.findall("all$",text) returns -> ['all']
    re.findall("Apache$",text) returns -> []
* – Matches the search term with 0 or more occurrences of its last character.
    re.findall("ac*",text) returns -> ['ac','a','a']
+ – Matches the search term with 1 or more occurrences of its last character.
    re.findall("ac+",text) returns -> ['ac']
{} – Matches the search term with the number of occurrences of its last character specified within the braces.
    re.findall("l{2}",text) returns -> ['ll']
| – Basically an OR operator; matches if either of the search terms separated by the "|" is present in the string.
    re.findall("Apache|Spark",text) returns -> ['Apache','Spark']

9.1.2 Special Sequences
A special sequence is a '\' followed by one of the characters in the chart below, and it has a special meaning and function. Sample text = "Apache Spark 2 _is_ the $best".

\s – Returns a list of all the whitespaces present in the text.
    re.findall("\s",text) returns -> [' ',' ',' ',' ',' ']
\S – Returns a list of all the characters in the text except the whitespaces.
    re.findall("\S",text) returns -> ['A','p','a','c','h','e','S','p','a','r','k','2','_','i','s','_','t','h','e','$','b','e','s','t']
\d – Returns a list of the digits from 0-9 present in the text.
    re.findall("\d",text) returns -> ['2']
\D – Returns a list of all the characters in the text except the digits from 0-9.
    re.findall("\D",text) returns -> ['A','p','a','c','h','e',' ','S','p','a','r','k',' ',' ','_','i','s','_',' ','t','h','e',' ','$','b','e','s','t']
\w – Returns a list of the alphanumeric characters, that is the letters, digits and '_' (underscore).
    re.findall("\w",text) returns -> ['A','p','a','c','h','e','S','p','a','r','k','2','_','i','s','_','t','h','e','b','e','s','t']
\W – Returns a list of the characters which are not alphanumeric (not letters, digits or '_').
    re.findall("\W",text) returns -> [' ',' ',' ',' ',' ','$']
\b – Returns the search term if it is present at the beginning or end of a word.
    #checking if "park" is present at the beginning of a word in the string text
    re.findall(r"\bpark",text) returns -> []
    #checking if "park" is present at the end of a word in the string text
    re.findall(r"park\b",text) returns -> ['park']
\B – Returns the search term if it is NOT present at the beginning or end of a word.
    #checking if "park" is not present at the beginning of a word in the string text
    re.findall(r"\Bpark",text) returns -> ['park']
    #checking if "park" is not present at the end of a word in the string text
    re.findall(r"park\B",text) returns -> []
\A – Returns the search term if it is present at the beginning of the string.
    re.findall("\AApac",text) returns -> ['Apac']
\Z – Returns the search term if it is present at the end of the string.
    re.findall("st\Z",text) returns -> ['st']

9.1.3 Sets
A set is a group of characters placed inside a pair of square brackets [] with a special meaning. We can key in letters, numbers, or any characters individually or as a range within the square brackets, and depending on the characters we key in, the set selects those characters from the text or document and returns a list. Sample text = "Apache Spark _v24".

[abc] – Matches the characters in the text which are among the alphabetic characters specified within the brackets.
    re.findall("[abc]",text) returns -> ['a','c','a']
[a-f] – Matches the lower case characters in the text which fall in the range a to f.
    re.findall("[a-f]",text) returns -> ['a','c','e','a']
[^abcd] – Matches the characters in the text which are NOT among the characters specified within the brackets.
    re.findall("[^abcd]",text) returns -> ['A','p','h','e',' ','S','p','r','k',' ','_','v','2','4']
[123] – Matches the characters in the text which are among the numeric characters specified within the brackets.
    re.findall("[123]",text) returns -> ['2']
[0-9] – Matches the digits in the text which fall in the range 0 to 9.
    re.findall("[0-9]",text) returns -> ['2','4']
[0-9][0-9] – Matches two-digit sequences in the text where each digit falls in the range 0 to 9.
    re.findall("[0-9][0-9]",text) returns -> ['24']
[a-zA-Z] – Matches both the lower case and upper case characters in the text which fall in the ranges a to z and A to Z.
    re.findall("[a-zA-Z]",text) returns -> ['A','p','a','c','h','e','S','p','a','r','k','v']
[$] or [special character] – Matches the given special character if it is present in the text.
    re.findall("[_]",text) returns -> ['_']

9.2 Spark Tools for Natural Language Processing
There are a variety of functions and methods inbuilt with Spark for working with text data. It is important for us to learn and understand these tools in order to work efficiently with NLP models.

9.2.1 Tokenization
Tokenization is the process of segmenting text into individual words; in other words, it takes a text or document and separates every word in that text or document into individual pieces called tokens.
Example – "Apache Spark is Awesome" ->> Tokenizer ->> ['Apache', 'Spark', 'is', 'Awesome']
Spark provides us with two types of Tokenizers: a simple Tokenizer class, which splits the words on whitespace by default, and a Regular Expression Tokenizer or RegexTokenizer, which helps us split the words based on various patterns with the help of the metacharacters, special sequences, and sets which we have discussed above. Let us go about and understand the implementation and working of both the simple Tokenizer and the RegexTokenizer in Spark. For this we will import a simple DataFrame that has some sentences or text and apply our Tokenizer classes to the text content present in the DataFrame.

#Code Snippet 35
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Tokenizer').getOrCreate()

from pyspark.ml.feature import Tokenizer,RegexTokenizer

data = spark.read.csv('to_tokenize.csv',header=True,inferSchema=True)
print("Initial Data")


data.show(truncate=False) #Applying Tokenizer class which splits text on whitespaces simple_tokenizer = Tokenizer(inputCol='text_content',outputCol='tokens_words') simple_tokens = simple_tokenizer.transform(data) print("Tokenizer Output - Splitting text on Whitespaces") simple_tokens.show(truncate=False) #Applying RegexTokenizer class which splits text on user defined patterns # Special sequence \W splits on non-alphanumeric characters (in our case it splits on '-') regex_tokenizer = RegexTokenizer(inputCol='text_content',outputCol='tokens_words ',pattern='\\W') regex_tokens = regex_tokenizer.transform(data) print("RegexTokenizer Output - Splitting text on special sequence \W") regex_tokens.show(truncate=False) Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2035.ipynb Data – www.github.com/athul-dev/spark-with-python/blob/master/to_tokenize.csv Initial Data +---------------------------+ |text_content | +---------------------------+ |Apache Spark is Awesome | |Natural-Language-Processing| +---------------------------+ Tokenizer Output - Splitting text on Whitespaces +---------------------------+-----------------------------+ |text_content |tokens_words | +---------------------------+-----------------------------+ |Apache Spark is Awesome |[apache, spark, is, awesome] | |Natural-Language-Processing|[natural-language-processing]| +---------------------------+-----------------------------+ RegexTokenizer Output - Splitting text on special sequence \W +---------------------------+-------------------------------+ |text_content |tokens_words | +---------------------------+-------------------------------+ |Apache Spark is Awesome |[apache, spark, is, awesome] | |Natural-Language-Processing|[natural, language, processing]| +---------------------------+-------------------------------+


9.2.2 Stop Words Removal
Stop words are those filler words and language articles that appear frequently in the corpus and do not carry much meaning, hence we need to remove them. These are words that can be removed without causing any negative consequences to our NLP model. On removing the stop words, the dataset size goes down, which gives us a performance boost while training our models. Moreover, techniques such as TF-IDF give more value to rare words than to frequently repeated tokens. Hence, it is a good convention to perform stop words removal prior to training our NLP model. Spark has an inbuilt StopWordsRemover class, and it also supports stop words removal for various languages. Let us quickly implement the StopWordsRemover class and understand how to use it.

#Code Snippet 36
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('StopWordsRemover').getOrCreate()

from pyspark.ml.feature import StopWordsRemover, Tokenizer

data = spark.read.csv('stopwords.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(truncate=False)

#Applying Tokenizer prior to StopWordsRemover as StopWordsRemover takes tokens as its input
simple_tokenizer = Tokenizer(inputCol='text_content',outputCol='tokens_words')
simple_tokens = simple_tokenizer.transform(data)
print("Tokenizer Output - Splitting text on Whitespaces")
simple_tokens.show(truncate=False)

#Applying StopWordsRemover class
stopWords = StopWordsRemover(inputCol='tokens_words',outputCol='stopWordsRemoved')
stopWords_tokens = stopWords.transform(simple_tokens)
print("Data after Stop Words Removal")
stopWords_tokens.select('tokens_words','stopWordsRemoved').show(truncate=False)


Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2036.ipynb Data – www.github.com/athul-dev/spark-with-python/blob/master/stopwords.csv Initial Data +---------------------------+ |text_content | +---------------------------+ |Spark is the best | |Oops an error | |The everest is just awesome| +---------------------------+ Tokenizer Output - Splitting text on Whitespaces +---------------------------+---------------------------------+ |text_content |tokens_words | +---------------------------+---------------------------------+ |Spark is the best |[spark, is, the, best] | |Oops an error |[oops, an, error] | |The everest is just awesome|[the, everest, is, just, awesome]| +---------------------------+---------------------------------+ Data after Stop Words Removal +---------------------------------+------------------+ |tokens_words |stopWordsRemoved | +---------------------------------+------------------+ |[spark, is, the, best] |[spark, best] | |[oops, an, error] |[oops, error] | |[the, everest, is, just, awesome]|[everest, awesome]| +---------------------------------+------------------+

9.2.3 TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF, as we have seen and understood earlier, is used to determine the importance of a term to a document in the corpus. Spark has inbuilt classes like HashingTF, CountVectorizer, and IDF which help us perform the TF-IDF method.

Performing TF-IDF with the HashingTF transformer
HashingTF – HashingTF is a Transformer that takes sets of terms and converts those sets into fixed-length feature vectors. The vector generated through HashingTF has a fixed size of 2^18, that is, the default feature dimension is 262,144. The terms are mapped to indices using a hash function, and the term frequencies are computed with respect to the mapped indices.
IDF – IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. It intuitively down-weights the features which appear frequently in the corpus.
Let us see how to implement TF-IDF using the HashingTF and IDF methods in Spark.

#Code Snippet 37
from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('TFIDF_HashTF').getOrCreate() from pyspark.ml.feature import Tokenizer,HashingTF,IDF data = spark.read.csv('reviews_tfidf.csv',header=True,inferSchema=True) print("Initial Data") data.show(truncate=False) #Applying Tokenizer class which splits text on whitespaces simple_tokenizer = Tokenizer(inputCol='reviews',outputCol='review_tokens') simple_tokens = simple_tokenizer.transform(data) print("Tokenizer Output - Splitting text on Whitespaces") simple_tokens.show(truncate=False) #Applying HashingTF hashingtf_vectors = HashingTF(inputCol='review_tokens',outputCol='hashVec') HashingTF_featurized_data = hashingtf_vectors.transform(simple_tokens) print("HashingTF Data") HashingTF_featurized_data.select('review_tokens','hashVec').sh ow(truncate=40) #Applying IDF on vectors of token count output from HashingTF idf = IDF(inputCol='hashVec',outputCol='features') idf_model = idf.fit(HashingTF_featurized_data) final_data = idf_model.transform(HashingTF_featurized_data) print("Final Spark accepted Data - NLP Formatted Data ready to pass into any Machine Learning Model") final_data.select('label','features').show(truncate=60) Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2037.ipynb Data – www.github.com/athul-dev/spark-with-python/blob/master/reviews_tf-idf.csv Initial Data +---------------------+-----+ |reviews |label| +---------------------+-----+ |It was just wonderful|1 | |not so good |0 |


|very very negative |0 | |super super duper |1 | |average quality |0 | +---------------------+-----+ Tokenizer Output - Splitting text on Whitespaces +---------------------+-----+--------------------------+ |reviews |label|review_tokens | +---------------------+-----+--------------------------+ |It was just wonderful|1 |[it, was, just, wonderful]| |not so good |0 |[not, so, good] | |very very negative |0 |[very, very, negative] | |super super duper |1 |[super, super, duper] | |average quality |0 |[average, quality] | +---------------------+-----+--------------------------+ HashingTF Data +--------------------------+----------------------------------------+ | review_tokens| hashVec| +--------------------------+----------------------------------------+ |[it, was, just, wonderful]|(262144,[25570,71225,86175,97171],[1....| | [not, so, good]|(262144,[113432,139098,188424],[1.0,1...| | [very, very, negative]| (262144,[210040,251061],[2.0,1.0])| | [super, super, duper]| (262144,[80042,226659],[1.0,2.0])| | [average, quality]| (262144,[1846,250865],[1.0,1.0])| +--------------------------+----------------------------------------+ Final Spark accepted Data - NLP Formatted Data ready to pass into any Machine Learning Model +-----+------------------------------------------------------------+ |label| features| +-----+------------------------------------------------------------+ | 1|(262144,[25570,71225,86175,97171],[1.0986122886681098,1.0...| | 0|(262144,[113432,139098,188424],[1.0986122886681098,1.0986...| | 0|(262144,[210040,251061],[2.1972245773362196,1.09861228866...| | 1|(262144,[80042,226659],[1.0986122886681098,2.197224577336...| | 0|(262144,[1846,250865],[1.0986122886681098,1.0986122886681...| +-----+------------------------------------------------------------+

9.2.4 Performing TF-IDF with the CountVectorizer transformer
CountVectorizer – CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. CountVectorizer can be used as an Estimator to extract the vocabulary. During the fitting process, CountVectorizer selects the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction, if < 1.0) of documents a term must appear in to be included in the vocabulary. Unlike HashingTF, the size of the vector generated through CountVectorizer depends on the training corpus and the documents. Let us understand the usage and interpretation of CountVectorizer by implementing it with Spark.


#Code Snippet 38
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TFIDF_CountVec').getOrCreate()

from pyspark.ml.feature import Tokenizer,CountVectorizer,IDF

data = spark.read.csv('reviews_tfidf.csv',header=True,inferSchema=True)
print("Initial Data")
data.show(truncate=False)

#Applying Tokenizer class which splits text on whitespaces
simple_tokenizer = Tokenizer(inputCol='reviews',outputCol='review_tokens')
simple_tokens = simple_tokenizer.transform(data)
print("Tokenizer Output - Splitting text on Whitespaces")
simple_tokens.show(truncate=False)

#Applying CountVectorizer to convert tokens to vectors of token counts
count_vectors = CountVectorizer(inputCol='review_tokens',outputCol='countVec')
count_vectors_model = count_vectors.fit(simple_tokens)
countVector_featurized_data = count_vectors_model.transform(simple_tokens)
print("CountVectorizer Data")
countVector_featurized_data.select('review_tokens','countVec').show(truncate=False)

#Applying IDF after CountVectorizer
idf = IDF(inputCol='countVec',outputCol='features')
idf_model = idf.fit(countVector_featurized_data)
final_data = idf_model.transform(countVector_featurized_data)
print("Final Spark accepted Data - NLP Formatted Data ready to pass into any Machine Learning Model")
final_data.select('label','features').show(truncate=60)

Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2038.ipynb
Data – www.github.com/athul-dev/spark-with-python/blob/master/reviews_tf-idf.csv


Initial Data +---------------------+-----+ |reviews |label| +---------------------+-----+ |It was just wonderful|1 | |not so good |0 | |very very negative |0 | |super super duper |1 | |average quality |0 | +---------------------+-----+ Tokenizer Output - Splitting text on Whitespaces +---------------------+-----+--------------------------+ |reviews |label|review_tokens | +---------------------+-----+--------------------------+ |It was just wonderful|1 |[it, was, just, wonderful]| |not so good |0 |[not, so, good] | |very very negative |0 |[very, very, negative] | |super super duper |1 |[super, super, duper] | |average quality |0 |[average, quality] | +---------------------+-----+--------------------------+ CountVectorizer Data +--------------------------+---------------------------------+ |review_tokens |countVec | +--------------------------+---------------------------------+ |[it, was, just, wonderful]|(13,[3,4,8,10],[1.0,1.0,1.0,1.0])| |[not, so, good] |(13,[2,7,9],[1.0,1.0,1.0]) | |[very, very, negative] |(13,[1,6],[2.0,1.0]) | |[super, super, duper] |(13,[0,11],[2.0,1.0]) | |[average, quality] |(13,[5,12],[1.0,1.0]) | +--------------------------+---------------------------------+ Final Spark accepted Data - NLP Formatted Data ready to pass into any Machine Learning Model +-----+------------------------------------------------------------+ |label| features| +-----+------------------------------------------------------------+ | 1|(13,[3,4,8,10],[1.0986122886681098,1.0986122886681098,1.0...| | 0|(13,[2,7,9],[1.0986122886681098,1.0986122886681098,1.0986...| | 0| (13,[1,6],[2.1972245773362196,1.0986122886681098])| | 1| (13,[0,11],[2.1972245773362196,1.0986122886681098])| | 0| (13,[5,12],[1.0986122886681098,1.0986122886681098])| +-----+------------------------------------------------------------+

With all of these tools available for Natural Language Processing in Spark, let us see how to implement an NLP model for predicting whether a message is a SPAM message or a HAM (legit) message. For this implementation we shall make use of a dataset containing two features: the category, that is whether the message is SPAM or HAM, and the content of the message itself. We will be reading the content, learning from it, and on passing novel messages our model will detect whether they are SPAM messages or not.


General Steps for implementing NLP using Spark –
1. Converting the label or target feature, if it is a string, to numeric data using StringIndexer
2. Tokenizing the text or content features using the Tokenizer class in Spark
3. Removing the stop words from the tokens using the StopWordsRemover class
4. Converting the tokens into a vector of token counts (numeric format) using the CountVectorizer or HashingTF present in Spark
5. Passing these vectors into the Inverse Document Frequency (IDF) class in Spark to perform the TF-IDF method
6. Applying a machine learning algorithm to the consolidated data obtained from the previous IDF operation
7. Performing predictions and evaluating our trained model

#Code Snippet 39
#Step 1 - Importing data and necessary libraries
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SpamHamNLP').getOrCreate()

data = spark.read.csv('spam_ham_nlp.csv',header=True,inferSchema=True,sep='\t')
print("Initial Data")
data.show(5)

#Step 2 - Data pre-processing
from pyspark.ml.feature import (StringIndexer,Tokenizer,StopWordsRemover,CountVectorizer,IDF,VectorAssembler)

#Converting our category into a numeric label
category_to_numeric = StringIndexer(inputCol='category',outputCol='label')

#Tokenizing our content
tokenizer = Tokenizer(inputCol='content',outputCol='tokens')

#Removing the stop words
stopWords_removed = StopWordsRemover(inputCol='tokens',outputCol='stpWrd_tokens')

#Converting tokens to vectors of token count
count_vectors = CountVectorizer(inputCol='stpWrd_tokens',outputCol='countVec')

#Performing IDF
idf = IDF(inputCol='countVec',outputCol='tf-idf')

#Consolidating the features
consolidated_data = VectorAssembler(inputCols=['tf-idf'],outputCol='features')

#Transforming and finalizing our data into the Spark accepted format
from pyspark.ml import Pipeline
pipeline_object = Pipeline(stages=[category_to_numeric,tokenizer,stopWords_removed,count_vectors,idf,consolidated_data])
pipeline_data_model = pipeline_object.fit(data)
final_data = pipeline_data_model.transform(data)
final_data.head(1)

final_data = final_data.select('features','label')
print('Final Data')
final_data.show(5)

#Step 3 - Applying Machine learning algorithm to our data
#Using Logistic Regression Classifier as our ML algorithm
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression()

# Splitting the data into 70 and 30 percent
train_data, test_data = final_data.randomSplit([0.7,0.3])

spam_detector = log_reg.fit(train_data)
predictions = spam_detector.transform(test_data)
print("Predictions")
predictions.show(5)

#Step 4 - Evaluating our Model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
eval_object = MulticlassClassificationEvaluator()
accuracy = eval_object.evaluate(predictions)
print("The Accuracy is {}".format(accuracy))


Code – www.github.com/athul-dev/spark-with-python/blob/master/Code%20Snippet%2039.ipynb Data – www.github.com/athul-dev/spark-with-python/blob/master/spam_ham_nlp.csv Initial Data +--------+--------------------+ |category| content| +--------+--------------------+ | ham|Meet me at Willys...| | ham|Let us go to the ...| | spam|Free entry in 2 a...| | ham|I have sent you t...| | ham|Lets meet at 7pm ...| +--------+--------------------+ only showing top 5 rows Final Data +--------------------+-----+ | features|label| +--------------------+-----+ |(13497,[73,82,940...| 0.0| |(13497,[7,85,127,...| 0.0| |(13497,[2,13,19,2...| 1.0| |(13497,[95,472,75...| 0.0| |(13497,[73,491,34...| 0.0| +--------------------+-----+ only showing top 5 rows Predictions +---------------+-----+---------------+---------------+----------+ | features|label| rawPrediction| probability|prediction| +---------------+-----+---------------+---------------+----------+ |(13497,[0,1,...| 1.0|[-15.4281980...|[1.993511019...| 1.0| |(13497,[0,1,...| 1.0|[-20.1022387...|[1.860838234...| 1.0| |(13497,[0,1,...| 1.0|[-11.9238161...|[6.630550049...| 1.0| |(13497,[0,1,...| 1.0|[-14.8251888...|[3.643360358...| 1.0| |(13497,[0,1,...| 1.0|[-11.6761333...|[8.494073493...| 1.0| |(13497,[0,1,...| 0.0|[15.97658587...|[0.999999884...| 0.0| |(13497,[0,1,...| 0.0|[30.57065447...|[0.999999999...| 0.0| |(13497,[0,1,...| 0.0|[38.59121111...|[1.0,1.73800...| 0.0| |(13497,[0,1,...| 0.0|[35.10498436...|[0.999999999...| 0.0| |(13497,[0,1,...| 1.0|[-19.9670737...|[2.130149238...| 1.0| +---------------+-----+---------------+---------------+----------+ only showing top 10 rows The Accuracy is 0.9715081861143268
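As a quick follow-up (not part of the original snippet), here is a hedged sketch of how we might score a brand new message with the fitted pipeline and model from Code Snippet 39. The example message is made up, and we pass a placeholder 'category' value only because the fitted pipeline's StringIndexer stage expects that column to be present.

# Scoring a novel message with the already fitted pipeline_data_model and spam_detector
new_message = spark.createDataFrame(
    [("ham", "Congratulations! You have won a free entry, claim your prize now")],
    ["category", "content"])

new_features = pipeline_data_model.transform(new_message).select("features")
result = spam_detector.transform(new_features)
result.select("prediction").show()   # 1.0 would indicate spam and 0.0 ham, as per our label indexing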

Our spam detector model is predicting whether a message is spam or not with an accuracy of about 97%, and that's just wonderful. We are now able to take up real-world NLP problems, or any machine learning problems for that matter, and implement them with ease using Spark. Finally, we have come to the end of our Spark learning journey with this book, but our learning does not stop here; we can still practice and work on the various problems that are out there in our society. Supposing we don't have the data to work with, we can always go to the UCI Machine Learning Repository and download datasets to use. We can also frequently check Apache Spark's documentation and programming guides to get more information on the latest updates, enhancements, and components in Spark itself.


Thank You

You can connect with me via LinkedIn: - www.linkedin.com/in/athul-dev/

Please provide your valuable feedback, it would mean a lot to me.
- www.amazon.com
- [email protected]