
*English*
*Pages 300 [288]*
*Year 2020*

*Table of contents:*

Front Matter
Copyright
Dedication
Preface
Acknowledgments: Ahmed Fawzy Gad; Fatima Ezzahra Jarmouni
Chapter 1. Preparing the development environment: Downloading and installing Python™ 3; Installing required libraries; Preparing Ubuntu® virtual machine for Kivy; Preparing Ubuntu® virtual machine for PyPy; Conclusion
Chapter 2. Introduction to artificial neural networks (ANN): Simplest model Y = X; Error calculation; Introducing weight; Weight as a constant; Weight as a variable; Optimizing the parameter; Introducing bias; Bias as a constant; Bias as a variable; Optimizing the weight and the bias; From mathematical to graphical form of a neuron; Neuron with multiple inputs; Sum of products; Activation function; Conclusion
Chapter 3. ANN with 1 input and 1 output: Network architecture; Forward pass; Forward pass math calculations; Backward pass; Chain rule; Backward pass math calculations; Python™ implementation; Necessary functions; Preparing inputs and outputs; Forward pass; Backward pass; Training network; Conclusion
Chapter 4. Working with any number of inputs: ANN with 2 inputs and 1 output; Math example; Python™ implementation; Code changes; Training ANN; ANN with 10 inputs and 1 output; Training ANN; ANN with any number of inputs; Inputs assignment; Weights initialization; Calculating the SOP; Calculating the SOP to weights derivatives; Calculating the weights gradients; Updating the weights; Conclusion
Chapter 5. Working with hidden layers: ANN with 1 hidden layer with 2 neurons; Forward pass; Forward pass math calculations; Backward pass; Output layer weights; Hidden layer weights; Backward pass math calculations; Output layer weights gradients; Hidden layer weights gradients; Updating weights; Python™ implementation; Forward pass; Backward pass; Complete code; Conclusion
Chapter 6. Using any number of hidden neurons: ANN with 1 hidden layer with 5 neurons; Forward pass; Backward pass; Hidden layer gradients; Python™ implementation; Forward pass; Backward pass; More iterations; Any number of hidden neurons in 1 layer; Weights initialization; Forward pass; Backward pass; ANN with 8 hidden neurons; Conclusion
Chapter 7. Working with 2 hidden layers: ANN with 2 hidden layers with 5 and 3 neurons; Editing Chapter 6 implementation to work with an additional layer; Preparing inputs, outputs, and weights; Forward pass; Backward pass; First hidden layer gradients; ANN with 2 hidden layers with 10 and 8 neurons; Conclusion
Chapter 8. ANN with 3 hidden layers: ANN with 3 hidden layers with 5, 3, and 2 neurons; Required changes in the forward pass; Required changes in the backward pass; Editing Chapter 7 implementation to work with 3 hidden layers; Preparing inputs, outputs, and weights; Forward pass; Working with any number of layers; Backward pass; Python™ implementation; ANN with 10 inputs and 3 hidden layers with 8, 5, and 3 neurons; Conclusion
Chapter 9. Working with any number of hidden layers: What to do for a generic gradient descent implementation?; Generic approach for gradients calculation; Output layer gradients; Hidden layer gradients; Calculations summary; Python™ implementation; backward_pass() method; Output layer; Hidden layers; Example: Training the network; Making predictions; Conclusion
Chapter 10. Generic ANN: Preparing initial weights for any number of outputs; Calculating gradients for all output neurons; Network with 2 outputs; Network with 3 outputs; Working with multiple training samples; Calculating the size of the inputs and the outputs; Iterating through the training samples; Calculating the network error; Implementing ReLU; New implementation for MLP class; Example for training network with multiple samples; Using bias; Initializing the network bias; Using bias in the forward pass; Updating bias using gradient descent; Complete implementation with bias; Stochastic and batch gradient descent; Example; Conclusion
Chapter 11. Running neural networks in Android: Building the first Kivy app; Getting started with KivyMD; MDTextField; MDCheckbox; MDDropdownMenu; MDFileManager; MDSpinner; Training network in a thread; Neural network KivyMD app; neural.kv; main.py; Use the app; Building the Android app; Conclusion
Index

Introduction to Deep Learning and Neural Networks with Python™: A Practical Guide

Ahmed Fawzy Gad University of Ottawa, Ottawa, ON, Canada

Fatima Ezzahra Jarmouni Ecole Nationale Superieure d’Informatique et d’Analyse des Systemes (ENSIAS), Rabat, Morocco

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN 978-0-323-90933-4

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Nikki Levy
Acquisitions Editor: Natalie Farra
Editorial Project Manager: Sara Pianavilla
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Miles Hitchen
Typeset by SPi Global, India

Dedication

This work is dedicated to Lebanon and all Lebanese people who suffered from the August 2020 Beirut explosion. It was a shock to the world that left hundreds dead and thousands injured, in addition to its economic impact. We hope that this beautiful country rises again soon.

Preface

Deep learning is the state-of-the-art technique for solving data-oriented real-world problems. It has proved its effectiveness in challenging applications such as sentiment analysis, text summarization, text translation, object detection and recognition, image classification, speech synthesis, and many more. To design a deep learning model that solves a problem, it is critical to know how things work behind the scenes rather than to use the model as a black box. Both the theoretical and the practical sides are important.

Introduction to Deep Learning and Neural Networks with Python™: A Practical Guide is an intensive step-by-step guide for absolute beginners to explore how artificial neural networks (ANNs), which are the basis of deep learning models, work. The book starts at the lowest possible level, with the linear model Y = X, and builds up to a complete and generic neural network that works regardless of the number of inputs, outputs, samples, hidden layers, and hidden neurons.

The book discusses the math behind each step in training the ANN and then shows how it is implemented in Python, so some knowledge of Python (and NumPy) is preferred. Because building an implementation of the ANN from zero to hero is not an easy task, the book takes a hierarchical approach that starts with some specific and simple networks and then generalizes the network to work with any architecture. Throughout the journey, many concepts are discussed, including train and test data, forward pass, backward pass, hidden layer, hidden neurons, activation function, learning rate, gradient descent, sum of products, and more.

Here is a description of the 11 chapters:

Chapter 1 prepares the development environment used throughout the book, where Python and the required libraries are installed. A virtual machine is created to run Ubuntu for building Android apps out of the Python apps.

Chapter 2 discusses the basics of neural networks with the nonparametric linear model Y = X. The weight and bias are introduced in the parametric linear model Y = wX + b. The chapter works through some math examples manually to explain how the training process works, from applying the inputs to calculating the error. Out of the linear model, the graphical representation of a single neuron is derived. The chapter also discusses how the sum of products at a neuron is calculated, in addition to the importance of activation functions.


Chapters 3–10 look closely at how neural networks work and at their Python implementation. Each chapter starts where the previous one ends. All of these chapters discuss the following:

1. Network architecture
2. Weights initialization
3. Forward pass
4. Backward pass

In Chapter 3, a network with a single input and a single output is created without using any hidden layers. The chapter explores how to make a prediction by generating the output from the input, calculate the loss of the network, derive the derivative of the sigmoid function, and calculate the gradient of the weights using the chain of derivatives.

Rather than using just a single input, Chapter 4 uses 10 inputs. The chains of derivatives of all weights are prepared to calculate the weights' gradients. The chapter ends by working with any number of inputs.

Chapter 5 adds a single hidden layer with two neurons to the network architecture. The math calculations of the hidden neurons in both the forward and backward passes are covered.

Chapter 6 starts by extending the number of hidden neurons to five and ends by building a hidden layer that uses any number of neurons.

Chapter 7 adds a new hidden layer to the network so that it works with two hidden layers. This chapter discusses two examples with a fixed number of hidden neurons and ends by working with any number of neurons in two hidden layers. It builds a generalized solution to weights initialization.

Chapter 8 adds a third hidden layer, where any number of neurons can be used in the three hidden layers. It also generalizes the Python implementation of the forward pass so that it works with any number of hidden layers.

Chapter 9 reaches a milestone by editing the latest code to work with any number of hidden layers in both the forward and backward passes. The process of calculating the gradients of all layers is automated without being restricted to any number of hidden layers or neurons. This chapter restructures the project so that a Python module named MLP.py is created to hold the backend operations while giving the user a simple interface to build and use the neural network.

Chapter 10 adds features to the project so that it works with any number of outputs and multiple training samples. It also supports the rectified linear unit activation function in both the forward and backward passes. The bias is introduced as an additional parameter alongside the weight. Finally, gradient descent can work in either the stochastic or the batch mode.

Chapter 11 uses the Kivy framework together with KivyMD to build a desktop application that builds and trains a neural network using custom data. An Android application is generated from the Kivy application using the Buildozer tool. No prior knowledge of Kivy is expected from the reader.

The source code of the book is available at this GitHub project: https://github.com/ahmedfgad/IntroDLPython. Each chapter has its code within a separate folder.

Acknowledgments

Thanks, Allah, for inspiring us to write this book. It was a dream that came true. We should clearly admit that our success is not only about being skillful in something but also about the will of Allah [And my success is not but through Allah. Upon him I have relied, and to Him I return. (Hud 88)]. We keep thanking Allah as David and Solomon previously did in Surat An Naml 15 [We gave (in the past) knowledge to David and Solomon: And they both said: “Praise be to Allah, Who has favoured us above many of his servants who believe!” (An Naml 15)].

Thanks to our parents, who did everything possible to help us in our life without expecting a reward. While you surround us at home, we are safe and feel that nothing can beat us. We hope to be grateful to you as said in Surat Luqman 14 [Be grateful to Me and to your parents; to Me is the [final] destination].

It is a wonderful experience to have our book published by Elsevier. Natalie Farra, an acquisition editor, is a friendly person who cares about the subtle details. I never had a question that was not answered accurately in adequate time. Sara Pianavilla, editorial project manager, supported us in a smooth process from submitting the content to preparing the cover. Thanks to Punitha Radjane for the information about the proofing system.

Motivated by this Hadith [AbuHurayrah reported that the Prophet Mohamed (ﷺ) said, “He who does not thank people does not thank Allah.”], we separately thank the people who motivated and helped us to complete this book.

Ahmed Fawzy Gad From time to time, new people arise in our life to help us proceed and follow the right path. The first person who helped me to prepare this book is my coauthor Fatima Ezzahra Jarmouni. You are an intelligent scientist with lots of inspiration to stir the stagnant water. A few words cannot express my appreciation for your kindness and willingness to help other people just for the sake of help. Thanks, Dr. Rasha Atwa, a staff member at King Abdulaziz University, for planting hope again in my heart. You are a kind student, teacher, mother, and human with a lot of experience. I am sure you are strong enough to help yourself, your children, family, and even other people like me. Whatever good you do will be rewarded by Allah [And whatever good you put forward for yourselves – you will find it with Allah. (Al Muzzammil 20)].


Life has many kind people, and Prof. Amiya Nayak, a lecturer at the University of Ottawa, is one of them. You are an initiator who spares no effort in pushing and assisting me to continue my career. Thanks to my childhood friend Ibrahim Elhagali for the continued support and enthusiasm to push me emotionally. Keep up the good work you are doing; I am confident that you will go beyond our expectations. Thanks to Paperspace, represented by its CEO Dillon Erb, COO Daniel Kobran, and community manager Rachel Rapp, for hosting my works. The same goes for the Fritz AI team led by CEO Dan Abdinoor, CTO Jameson Toole, and head of Heartbeat community Austin Kodra. Feeling that you appreciate my work was a key driver for writing this book. Thanks to my mother, father (may Allah have mercy on him), brother, sisters, and all family members for standing with me. You have relieved much of my load in recent years, and I hope I can do the same for you in the future.

Fatima Ezzahra Jarmouni

It is a wonderful experience to work on my first book, and it will remain an achievement in my life. I thank my coauthor Ahmed Gad for giving me the chance to participate in this book. You shared your experience and gave me enough support and advice. I liked working with you and I hope it is not the last time. This book would not have been possible without the support I received from my friend Batoul Badawi. You are very close to my heart, just as our homes are close. Every meter of the long distance to Mona Elmorshedy carries emotions and feelings. Despite the distance, you are so close to my heart, and I am looking forward to the moment I meet you and your daughter Deeda. The time I enjoy with my colleagues Fatima Walid, Kaoutar Abgar, and Fatima Ezzahra Elmortadi made a noticeable difference in adjusting my mood. The crazy moments we had together refresh me and renew my energy to study, work, and live. The presence of my mother, father, brothers, and sisters gives me more confidence and power. You always support me in all aspects of life, and I hope I can return a little bit of your assistance one day.

Chapter 1

Preparing the development environment

Chapter outline
Downloading and installing Python™ 3
Installing required libraries
Preparing Ubuntu® virtual machine for Kivy
Preparing Ubuntu® virtual machine for PyPy
Conclusion

ABSTRACT

This chapter prepares the development environment that is used in the following chapters. The book uses Python 3; thus, the chapter discusses how to download and install it. On top of Python, a number of libraries must be installed, including NumPy, Matplotlib, Cython, and Kivy. Building mobile applications out of Kivy requires a Linux system, so an Ubuntu virtual machine is prepared for this purpose. The chapter also gets the reader ready to use the alternative Python implementation called “PyPy” to see how it could boost the performance of Python scripts.

Downloading and installing Python™ 3

Python is one of the easiest languages to learn because it takes little overhead to write a program that runs. This helps beginners get started quickly without worrying about the details required by other languages, such as C or Java. Python is a cross-platform language: the same program is written once and runs on different operating systems. It is also a general-purpose language that targets different platforms such as desktop, mobile, web, and embedded systems. Thus, different types of applications can be built for different platforms without having to use different languages.

Python has two main release lines. The first one is Python 2, whose final version, 2.7.18, was released in April 2020 and only fixes some bugs in Python 2.7.17, released on October 19, 2019. Since January 1, 2020, Python 2 is no longer being developed further, as the Python team found serious issues that cannot be solved. Note that Python 2.7.18 was developed in the period from October 2019 to January 2020 but was only released after that date.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00002-4 © 2021 Elsevier Inc. All rights reserved.


After the Python 2 sunset date (January 1, 2020), Python 3 is the only supported version of Python. It was first released in 2008. There might still be some Python 2 users, but over time Python 2 will die out and only Python 3 will remain. For that reason, this book uses Python 3. The book source code is developed using Python 3.7. The latest Python 3 version is 3.8, which can still be used to run the book’s code without changes. At the time of writing this chapter, the latest release of Python 3 (3.8.5) was published on July 20, 2020.

The official Python website is python.org, and Python 3 can be downloaded from its downloads page. To download Python 3.8.5, use this link: python.org/downloads, which leads to a page like that in Fig. 1.1. To download Python 3.8.5 for Windows, use this link: https://www.python.org/ftp/python/3.8.5/python-3.8.5.exe. Its size is about 25 MB. On most Linux distributions, Python 3 comes already installed. To install a specific version on Ubuntu, use this terminal command:

sudo apt install python3.8

For Mac, with Homebrew installed, use this command:

brew install python3

After Python 3.8 is installed, one way of accessing it is through the terminal/command prompt. For Windows and Mac, issue the python command as in Fig. 1.2. For Linux, issue the python3 command.
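Once the interpreter starts, it can be useful to confirm which version is actually running. The following is a minimal sketch, not part of the book's code: the function name python_is_recent_enough and the (3, 7) threshold are assumptions chosen to match the Python 3.7 version mentioned above.

```python
import sys

# Assumption: the book's code is developed with Python 3.7, so that is
# used here as the minimum acceptable version.
MINIMUM_VERSION = (3, 7)


def python_is_recent_enough(minimum=MINIMUM_VERSION):
    """Return True when the running interpreter is at least `minimum`."""
    # sys.version_info[:2] is a (major, minor) tuple, compared element-wise.
    return sys.version_info[:2] >= minimum


print("Running Python", sys.version.split()[0])
print("Meets the minimum version:", python_is_recent_enough())
```

Running the script with python on Windows and Mac, or python3 on Linux, reports the version of the interpreter that the command resolves to.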

FIG. 1.1 Download Python 3.8 from python.org

FIG. 1.2 Activate Python.


The Python installed from python.org is the native Python distribution, which ships only the built-in libraries. There is, however, a wide range of libraries with features that many Python developers regard as essential. One of these libraries is Numerical Python (NumPy). Python provides a tool called “pip” to install new libraries. The next command installs a library, where <library_name> is replaced by the name of the library (e.g., numpy). Remember to use pip3 for Linux and Mac.

pip install <library_name>

For each new library to be installed, a pip command is issued. Some errors may occur while installing some libraries, and it takes time to solve them. To ease the handling of Python libraries, there are Python distributions that come with many libraries already installed. One of these distributions is Anaconda (anaconda.com). It comes with dozens of already installed Python libraries. Anaconda is offered in four editions: individual, team, enterprise, and professional. The individual edition is enough for this book. Fig. 1.3 shows that Anaconda supports Windows, Mac, and Linux. Visit anaconda.com/products/individual and choose the right installer for your platform. For example, the Python 3.8 download link for 64-bit Windows is repo.anaconda.com/archive/Anaconda3-2020.07-Windows-x86_64.exe. Compared to the installer of the native Python distribution, the Anaconda installer is hundreds of megabytes in size because many libraries are already packaged. Installing Anaconda installs not only Python but also many libraries, which are then available without having to install them separately.

FIG. 1.3 Anaconda installers.

Anaconda comes with an integrated development environment (IDE) called “Spyder,” which is a good fit for MATLAB users moving to Python. Another popular IDE is PyCharm, which can be downloaded from this page jetbrains.com/pycharm/download. The Jupyter project is also a favorable choice for data scientists, as it offers notebooks in which not only code but also text and images can be embedded. The Jupyter Notebook comes already installed in Anaconda. To start it, issue the next command from the terminal. It starts a server that handles web-based Jupyter Notebooks.

jupyter notebook

After the server runs, a new web page opens, as in Fig. 1.4, from which Jupyter Notebooks can be created. From the New button, select the version of Python used in the notebook, and it opens in a new browser tab as in Fig. 1.5.

Similar to the pip installer in the native Python distribution, Anaconda has its own installer called “conda.” One of the differences between the two installers is the cloud-based repository in which the installer searches for the libraries: pip searches in pypi.org, whereas conda searches in repo.anaconda.com. Note that pip can still be used to install libraries in the Anaconda environment. If conda is used to install a nonconda library (i.e., a library that is not available in the Anaconda package repository), an error occurs similar to that in Fig. 1.6.

FIG. 1.4 Activating Jupyter Notebooks.

FIG. 1.5 Creating a New Jupyter Notebook.


FIG. 1.6 Error installing a nonconda library using conda.
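When several Python installations coexist (for example, the native distribution and Anaconda), it is not always obvious which one a bare pip command belongs to. A common way to avoid the ambiguity is to run pip as a module of a specific interpreter. The sketch below only builds and prints such a command; the helper name pip_install_command is an assumption made for illustration, and nothing is installed unless the command is executed explicitly.

```python
import sys


def pip_install_command(library_name):
    """Build a pip command bound to the interpreter running this script."""
    # `python -m pip` runs the pip that belongs to that exact interpreter.
    return [sys.executable, "-m", "pip", "install", library_name]


command = pip_install_command("numpy")
print(" ".join(command))

# To actually run the installation, pass the list to subprocess, e.g.:
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list rather than a single string keeps paths that contain spaces intact when the list is handed to subprocess.run.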

After Python is installed, the next section discusses the libraries that will be used in the book.

Installing required libraries

Besides the native Python, some more libraries are required in this book. Here is a list of these libraries:

● NumPy
● Matplotlib
● Cython
● Virtualenv
● Kivy
● KivyMD
● Buildozer

If Anaconda is used, then NumPy, Matplotlib, Cython, and Virtualenv will be already installed. For the native Python distribution, all of these libraries must be installed using pip. For the complete list of the packages supported by default in different Anaconda distributions, check this link docs.anaconda.com/anaconda/packages/pkg-docs. For Python 3.8, the list is accessible from this page docs.anaconda.com/anaconda/packages/py3.8_win-64.

NumPy is a library that provides the ndarray data type to process numerical arrays in Python. It is a very popular library in data science. Matplotlib is used to create visualizations. Cython is a superset of Python that supports the use of C code within Python, which speeds up execution. Virtualenv creates virtual environments where Python and its libraries can be installed. Such virtual environments are isolated from the main Python installation. Kivy is a cross-platform library for building native user interfaces. KivyMD offers a number of widgets that are compatible with Kivy and approximate Google’s Material Design spec. Buildozer is a tool for building Kivy applications for mobile devices.

Kivy applications can be developed in Linux, Mac, and Windows. Kivy can be installed using pip. Use pip3 rather than pip for Linux and Mac.

pip install kivy

Similarly, install KivyMD:

pip install kivymd

For deploying Kivy applications to Android, which is the focus of this book, only the Linux and OSX platforms support Buildozer. For iOS, only OSX can be used. The next section prepares the Kivy environment in Linux for building Android applications.
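Before moving on, it may help to check which of the libraries listed in this section are already available to the interpreter. The following is a minimal sketch using only the standard library; the function name find_missing is an assumption, and the module names are the usual import names of the listed libraries (note the capital C in Cython).

```python
import importlib.util

# Import names of the libraries listed in this section.
REQUIRED_MODULES = [
    "numpy", "matplotlib", "Cython", "virtualenv",
    "kivy", "kivymd", "buildozer",
]


def find_missing(modules):
    """Return the module names that cannot be found by the import system."""
    # find_spec() returns None for a top-level module that is not installed,
    # without actually importing the module.
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any name reported as not installed can then be installed with pip as described above.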

Preparing Ubuntu® virtual machine for Kivy

This section assumes that the user is running an operating system (OS) other than Linux and thus prepares a virtual machine (VM) in which Ubuntu is installed. If Linux is available on your machine (either as the main operating system or in a VM), then skip this section. Otherwise, please read the entire section. In our case, the main OS is Windows, and Ubuntu Linux is installed in a virtual machine.

The first thing to do is to download a Linux distribution (e.g., Ubuntu) to run on a desktop PC. Visit this page ubuntu.com/download/desktop as in Fig. 1.7 and download Ubuntu. The download link for Ubuntu 20.04, the latest version at the time of writing this chapter, is releases.ubuntu.com/20.04/ubuntu-20.04-desktop-amd64.iso. The downloaded file is an ISO image of the OS that is ready to be installed in virtualization software. Remember the location to which the image is downloaded to use it later.

FIG. 1.7 Download Ubuntu for desktop.


FIG. 1.8 Download VMware Workstation Pro.

Use the virtualization software of your choice. Examples include VMware and VirtualBox. VMware is paid software with a 30-day trial, whereas VirtualBox is free. VMware Workstation Pro 15.5.1 is used in the book. Download the software from this page vmware.com/mena/products/workstation-pro/workstation-pro-evaluation.html as shown in Fig. 1.8. Install VMware and open it. Fig. 1.9 shows what its home screen looks like. Click on the Create a New Virtual Machine button to create a VM. The same thing is achieved from the New Virtual Machine option in the File menu. A new window appears as in Fig. 1.10. The steps to create the VM are not complicated, but it does not hurt to mention them. There is no need to customize the configuration; thus, the Typical option is selected.

FIG. 1.9 VMware Workstation Pro Main Screen.


FIG. 1.10 Creating a New VM.

Click Next to move to the window in Fig. 1.11, where the directory of the downloaded Ubuntu ISO image is specified. Click Next to open a new window, as in Fig. 1.12, to create a new user by entering the username and password. Remember them to log in to the OS after it is installed. Clicking Next opens the window in Fig. 1.13 to enter the VM name and the location where all VM files are saved. Later, this location may be copied to create a backup of the VM. Click Next to move to the window in Fig. 1.14 to specify the disk size of the VM. It defaults to 20 GB, but you can increase it as much as your storage allows to make room for the files and libraries downloaded later. If possible, set the size to 40 GB. After clicking Next, a final window appears summarizing the settings of the VM, as in Fig. 1.15. By default, the machine uses 2 GB of RAM, one processor, and one core per processor. These settings are not the best. Click the Customize Hardware button to customize these settings as in Fig. 1.16. Increase the RAM assigned to the VM according to your physical RAM size. In our case, 6 GB out of 16 GB is assigned to the VM. From the Processors tab, increase the number of cores assigned to the VM to nearly half of the cores available on the host.
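To pick a concrete number for the Processors tab, the host's core count can be queried from Python. This is only a sketch of the rule of thumb above: the function name suggested_vm_cores is an assumption, and "nearly half" is interpreted here as integer division by two with a minimum of one core.

```python
import os


def suggested_vm_cores(host_cores):
    """Return roughly half of the host's cores, but never fewer than one."""
    return max(1, host_cores // 2)


# os.cpu_count() reports logical cores and may return None on some systems.
host_cores = os.cpu_count() or 1
print("Host cores:", host_cores)
print("Suggested cores for the VM:", suggested_vm_cores(host_cores))
```

The value is a starting point only; the host still needs enough cores left to stay responsive while the VM is running.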


FIG. 1.11 Specify the Ubuntu ISO Image Directory.

FIG. 1.12 VM username and password.


FIG. 1.13 VM name and location.

FIG. 1.14 VM disk size.


FIG. 1.15 Create the VM.

FIG. 1.16 Change the VM hardware settings.


FIG. 1.17 Power on the VM.

FIG. 1.18 Login to Ubuntu.

After editing the settings, click Close and then Finish to create the VM. The VM will be listed as in Fig. 1.17. Click on the Power on this virtual machine button to start it. After the machine starts as in Fig. 1.18, log in as the created user by entering the password set previously. Press Ctrl + Alt + Enter to enter the full-screen mode of the VM. Fig. 1.19 shows the full screen of the OS. Now, everything is ready to start developing Android apps in Kivy. In the terminal, install Kivy using pip3. With this installation, Kivy apps can be developed.


FIG. 1.19 Ubuntu running.

pip3 install kivy

For exporting the Kivy app for mobile devices, it is preferred to prepare Kivy and its requirements in a Python virtual environment. This environment is built using the virtualenv library. If not installed, please install it.

pip3 install virtualenv

For the instructions of installing Kivy and its requirements (e.g., Cython and Buildozer) in a virtual environment, follow this tutorial Python for Android: Start Building Kivy Cross-Platform Applications:

● https://www.linkedin.com/pulse/python-android-start-building-kivy-cross-platform-applications-gad
● https://towardsdatascience.com/python-for-android-start-building-kivy-cross-platform-applications-6cf867d44612

Preparing Ubuntu® virtual machine for PyPy

Python, the programming language itself, is implemented in different languages. The C implementation of Python is called CPython. The Java implementation is called Jython. For a complete list of the various Python implementations, check this page python.org/download/alternatives. PyPy is a Python implementation written in the Restricted Python (RPython) language. The motivation behind PyPy is speed. At the time of writing, PyPy implements Python 2.7.13 and Python 3.6.9. With just some limitations, any Python code can run using PyPy. Check its main website pypy.org. PyPy is not always faster than CPython; its main strength shows in long-running, repetitive operations inside loops. PyPy is not helpful if most of the time is devoted to running C code, not Python code. As a result, PyPy will not speed up operations on NumPy arrays. In this case, Cython can be helpful.

To download PyPy, check this page pypy.org/download.html. Note that PyPy may support only some specific versions of the platforms, so it is important to check the platform supported by each PyPy distribution. The PyPy distribution used in this book supports Ubuntu 18.04, which is the platform used here to run PyPy. Download Ubuntu 18.04 from this link releases.ubuntu.com/18.04.4/ubuntu-18.04.4-desktop-amd64.iso. An easy way to get it up and running is to create a new virtual machine in VMware as discussed previously. The PyPy 3.6 distribution compatible with Ubuntu 18.04 is accessible from this page bitbucket.org/pypy/pypy/downloads/pypy3.6-v7.3.1-aarch64.tar.bz2. Inside the bin directory, there is an executable file named pypy3. This is what runs PyPy. Just open a terminal and issue the ./pypy3 command as shown in Fig. 1.20. After PyPy is started, use Python commands as with CPython.

FIG. 1.20 Running PyPy 3 in Ubuntu 18.04.
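The kind of workload where PyPy tends to help is a long-running loop written in pure Python. The following is a minimal sketch, not taken from the book: the function name sum_of_squares and the loop size are assumptions chosen only for illustration. Saving it to a file and running it once with python3 and once with ./pypy3 lets you compare the reported times on your own machine.

```python
import time


def sum_of_squares(n):
    """Add up i*i for i in 0..n-1 using a plain Python loop (no NumPy)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    n = 5000000  # illustrative loop size (an assumption); adjust as needed
    start = time.perf_counter()
    result = sum_of_squares(n)
    elapsed = time.perf_counter() - start
    print("result:", result)
    print("elapsed seconds:", round(elapsed, 3))
```

The measured times depend on the machine and the interpreter version, so no expected timing is given here.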

Conclusion

This chapter prepared the development environment that will be used throughout the book. It started by downloading Python from python.org for users interested in installing the libraries using pip. The chapter also mentioned Anaconda, in which many libraries come already installed, which saves the user’s time. To build Android applications using Kivy, the Ubuntu Linux distribution was installed on a virtual machine because it supports the build tool Buildozer. Finally, PyPy was downloaded and executed in Ubuntu 18.04.

Chapter 2

Introduction to artificial neural networks (ANN)

Chapter outline
Simplest model Y = X
Error calculation
Introducing weight
Weight as a constant
Weight as a variable
Optimizing the parameter
Introducing bias
Bias as a constant
Bias as a variable
Optimizing the weight and the bias
From mathematical to graphical form of a neuron
Neuron with multiple inputs
Sum of products
Activation function
Conclusion

ABSTRACT This chapter gives an introduction that covers some important mathematical concepts behind ANN. It discusses the meaning of weight and bias, and how they are useful in neural networks. By starting with the simplest model Y = X, which has no parameters at all, some parameters will be added gradually until we build a single neuron. This neuron will be made to accept one or more inputs. The math behind building the neuron will then be mapped to a graphical representation. By connecting multiple neurons together, a complete ANN can be created.

Simplest model Y = X

The building blocks of machine learning are actually quite simple. Even absolute beginners can build a basic machine learning model. The goal of supervised machine learning is to find (i.e., learn) a function that maps between the set of inputs and the set of outputs. After the learning is complete, the function should return the correct output for each given input. Let us discuss one of the simplest tasks according to the data given in Table 2.1. There are 4 samples. Each sample has a single input and a single output. We need to prepare a function that returns the correct output for each given input with the least possible error. Looking at the data, we notice that the output Y is identical to the input X. If X equals 2, then Y also equals 2. If X is 4, then Y is also 4.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00007-3 © 2021 Elsevier Inc. All rights reserved.


TABLE 2.1 Data for testing the Y = X model.

| Sample ID | X (input) | Y (output) |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 3 | 3 |
| 3 | 4 | 4 |
| 4 | 5 | 5 |

So, what we need is a function that accepts a single input X and returns a single output that is identical to the input. Without doubt, the function is:

$$f(X) = X$$

For simplicity, we can replace f(X) by Y. So, the function will be:

$$Y = X$$

Do you wonder whether Y = X is a machine learning model? It is indeed a model, but one with no parameters. Generally, the goal of machine learning is to build a model that maps between a set of inputs and their associated outputs. The model is actually a mathematical function, that is, a relation that maps inputs to outputs, and it models the relationship between them. Some functions might be complex and others might be simple. The one we started with, Y = X, is the simplest possible function to create, as it has just the minimum entities, which are the inputs and the outputs. Being simple does not mean it is not a function or a machine learning model. Some other functions might be very complex and will have more parameters in addition to the inputs and the outputs. To make things simple for the reader, the book starts with Y = X and builds on it to show how more complex models can be created.

Error calculation

After finding a suitable machine learning model (i.e., function), we need to test it to find whether it predicts the outputs correctly or whether there is an error. We can use a simple error function that calculates the absolute difference between the correct output and the predicted output according to the next equation.

$$\text{error} = \sum_{i=1}^{N} \left| Y_{\text{correct}}^{i} - Y_{\text{predict}}^{i} \right|$$

It loops through the data samples, calculates the absolute difference between the correct and the predicted outputs for each sample, and finally sums all absolute differences and returns the sum in the error variable. The absolute operator is used because we are not interested in whether the correct output is higher or lower than the predicted output. The symbol N used in the summation operator represents the number of samples. The details of the calculations are given in Table 2.2. According to that table, the function predicted all outputs correctly, and thus the total error is 0.

TABLE 2.2 Calculating the prediction error for data in Table 2.1.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 2 | 2 | 0 |
| 2 | 3 | 3 | 0 |
| 3 | 4 | 4 | 0 |
| 4 | 5 | 5 | 0 |
| Total error | | | 0 |

Great. But do not forget that we are working with the simplest problems for absolute beginners. Before changing the problem to make it a bit harder, I need to ask a question. In building a machine learning model, there are two main steps, which are learning (i.e., training) and testing. We have seen a basic testing step. But where is the learning step? Did we do any learning in the previous model? The answer is no. Learning means that there are some parameters within the model to be learned from the data in the training step. The function of the previous model, Y = X, has no parameters at all to learn. The function just equates the input X and the output Y with no intermediate parameters to balance the relationship. In this case, there is no learning step because the model is non-parametric.

Non-parametric means that the model has no parameters to learn from the data to map the inputs to the outputs. A popular example of a non-parametric machine learning model is K-nearest neighbors (KNN). Note that KNN has a parameter K, but it is still non-parametric. Why? The answer is that the parameter K is not learned from the data. It is called a hyperparameter, and the data scientist selects the most suitable value for the problem being solved, perhaps based on experience.
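The error function and the test of Y = X on the data in Table 2.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the book's own implementation; the names predict and total_abs_error are assumptions introduced here.

```python
def predict(x):
    """The parameter-free model Y = X: the output equals the input."""
    return x


def total_abs_error(correct, predicted):
    """Sum of absolute differences between correct and predicted outputs."""
    return sum(abs(c - p) for c, p in zip(correct, predicted))


# Data from Table 2.1.
X = [2, 3, 4, 5]
Y = [2, 3, 4, 5]

Y_predict = [predict(x) for x in X]
print(total_abs_error(Y, Y_predict))  # 0
```

Notice that nothing in this code is fitted to the data: predict has no value that could be adjusted, which is why there is no learning step.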

Introducing weight

After clarifying that there is no parameter to learn, we can make a simple modification to the data used. The new data is given in Table 2.3. The modification is quite simple: each output Y is no longer equal to the input X. It is now double the input, i.e., 2X. Without making changes to the function, we can use Y = X to predict the outputs, calculate the total error, and see whether this model still fits the data.


TABLE 2.3 Doubling the outputs.

| Sample ID | X (input) | Y (output) |
|---|---|---|
| 1 | 2 | 4 |
| 2 | 3 | 6 |
| 3 | 4 | 8 |
| 4 | 5 | 10 |

TABLE 2.4 Error calculation for data in Table 2.3.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 4 | 2 | 2 |
| 2 | 6 | 3 | 3 |
| 3 | 8 | 4 | 4 |
| 4 | 10 | 5 | 5 |
| Total error | | | 14 |

The details of the error calculations are given in Table 2.4. The total error, in this case, is not 0 as in the previous example; its value is 14. The existence of an error means that the model function cannot map the inputs to the outputs correctly. In order to reduce the error, we have to modify that function. The question is, what within the function Y = X can be modified to reduce the prediction error? The function has just two variables, X and Y. One represents the input and the other represents the output. We cannot change either of them. In conclusion, the function is non-parametric, so there is no way to change it in order to reduce the error. But all is not lost! If the function currently does not have a parameter, why not add one or more parameters? Do not hesitate to design your machine learning model in ways that reduce the error. If you find that adding something to the function fixes the problem, do it at once.

Weight as a constant

In the new data, the output Y is double the input X. But the function has not been changed to reflect this, and we still use Y = X. We can change the function by making the output Y equal 2X rather than X. The new function will be:

$$Y = 2X$$


TABLE 2.5 Error calculation for data in Table 2.3 after using a weight.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 4 | 2 (2) = 4 | 0 |
| 2 | 6 | 3 (2) = 6 | 0 |
| 3 | 8 | 4 (2) = 8 | 0 |
| 4 | 10 | 5 (2) = 10 | 0 |
| Total error | | | 0 |

After using this function, the total prediction error is calculated according to Table 2.5. The total error is now 0 again. Nice. After adding the factor 2 to the function, does our model become parametric? No. A parametric model learns the values of some parameters based on the data. Here, the model multiplies X by 2, but the value 2 was chosen by us independently of the data rather than learned from it. As a result, the model is still non-parametric.

Let us change the previous data according to Table 2.6. Because there is no learning step, we can go straight to the testing step, which calculates the prediction error after calculating the predicted outputs based on the last function Y = 2X. The total error is calculated according to Table 2.7. The total error is no longer 0; it is now 14. Why did that happen? The model used for solving this problem was created when the output Y was double the input X (i.e., 2X). Because this model has no parameters to learn, it cannot adapt itself to new data and can only be used with the previous data. Now, the output Y is no longer equal to 2X but 3X. So, an increase in the error is expected. In order to eliminate this error, we have to modify the model function by using 3 rather than 2. The new function will be:

$$Y = 3X$$

TABLE 2.6 Output is equal to the triple of the input.

| Sample ID | X (input) | Y (output) |
|---|---|---|
| 1 | 2 | 6 |
| 2 | 3 | 9 |
| 3 | 4 | 12 |
| 4 | 5 | 15 |


TABLE 2.7 Error calculation for data in Table 2.6.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 6 | 2 (2) = 4 | 2 |
| 2 | 9 | 3 (2) = 6 | 3 |
| 3 | 12 | 4 (2) = 8 | 4 |
| 4 | 15 | 5 (2) = 10 | 5 |
| Total error | | | 14 |

This function makes the total error for the new data 0. But when working with the previous data, in which Y is equal to 2X, there will be an error. So, with the new data, we have to use 3X to get a total error of 0, and with the previous data, we have to change it back to 2X. It seems that we have to change the model ourselves each time the data changes, which is tiresome. But there is a solution. We can avoid using constants in the function and replace them with variables. This is the idea behind algebra, which works with variables rather than constants.
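The problem with a hard-coded weight can be seen by testing both constants on both data sets. The snippet below is a small sketch under the assumption that the helper names total_abs_error and predict_fixed are ours, not the book's.

```python
def total_abs_error(correct, predicted):
    """Sum of absolute differences between correct and predicted outputs."""
    return sum(abs(c - p) for c, p in zip(correct, predicted))


def predict_fixed(inputs, weight):
    """Model Y = weight * X, where the weight is a constant we picked by hand."""
    return [weight * x for x in inputs]


X = [2, 3, 4, 5]
Y_double = [4, 6, 8, 10]   # Table 2.3, Y = 2X
Y_triple = [6, 9, 12, 15]  # Table 2.6, Y = 3X

results = {}
for weight in (2, 3):
    err_double = total_abs_error(Y_double, predict_fixed(X, weight))
    err_triple = total_abs_error(Y_triple, predict_fixed(X, weight))
    results[weight] = (err_double, err_triple)
    print("weight", weight, "-> error on 2X data:", err_double,
          "| error on 3X data:", err_triple)
```

Each constant gives an error of 0 on the data it was chosen for and an error of 14 on the other data set, which is the motivation for replacing the constant with a variable.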

Weight as a variable

Rather than using 2 in Y = 2X or 3 in Y = 3X, we can use a variable w, and the equation will be:

$$Y = wX$$

The value of the variable w is calculated based on the data. Because the model now includes a variable whose value is calculated based on the data, the model is now parametric. Because the model is parametric, there will now be a learning step in which the value of this variable (parameter) is calculated. This parameter is the weight of a neuron in an ANN. Note that the variable parts of machine learning models are usually called parameters, but they are given special names in ANN. So, you can call w either a parameter (the general term) or a weight (the special term for ANN).

Let us see how the model learns the value 2 of the parameter w when using the data given in Table 2.3, in which Y equals 2X. The process starts by initializing the parameter w to an initial value that is usually selected randomly. For each parameter value tried, the total error is calculated. By comparing the errors for several values of the parameter, we can decide the direction in which the error decreases, that is, whether to increase or decrease the value of the parameter in order to reduce the error. This helps to select the best (optimal) value of the parameter. The process of finding the most suitable values for the model parameters is called optimization. Optimizing the single parameter w is discussed in the next section.

Optimizing the parameter

Assuming that we selected an initial value of 1.2 for w, our current function is:

$$Y = 1.2X$$

We can calculate the total error based on this function according to Table 2.8. The error is 11.2. Because there is still an error, we can change the value of the parameter w to reduce it. But we do not know the direction in which we should change the value of w. Should we increase or decrease it? Because we do not currently know, we can choose any value, either greater or smaller than the current value 1.2. Note that there are systematic algorithms that help us know which direction to follow. The one that will be covered in this book is gradient descent. Assuming that the new value of the parameter w is now 0.5, the new function is:

$$Y = 0.5X$$

We can calculate the total error based on this function according to Table 2.9. When w = 0.5, the error is 21. Compared to the total error of 11.2 reached with the previous value w = 1.2, the error increased. This is an indication that we might be moving in the wrong direction (i.e., the direction of reducing the parameter value). We can go in the opposite direction by changing the value of w to another value greater than 1.2 and see whether things get better. If the new value is w = 2.5, the new function will be:

$$Y = 2.5X$$

TABLE 2.8 Error calculation when w = 1.2.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 4 | 2 (1.2) = 2.4 | 1.6 |
| 2 | 6 | 3 (1.2) = 3.6 | 2.4 |
| 3 | 8 | 4 (1.2) = 4.8 | 3.2 |
| 4 | 10 | 5 (1.2) = 6 | 4 |
| Total error | | | 11.2 |


TABLE 2.9 Error calculation when w = 0.5.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 4 | 2 (0.5) = 1 | 3 |
| 2 | 6 | 3 (0.5) = 1.5 | 4.5 |
| 3 | 8 | 4 (0.5) = 2 | 6 |
| 4 | 10 | 5 (0.5) = 2.5 | 7.5 |
| Total error | | | 21 |

TABLE 2.10 Error calculation when w = 2.5.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 4 | 2 (2.5) = 5 | 1 |
| 2 | 6 | 3 (2.5) = 7.5 | 1.5 |
| 3 | 8 | 4 (2.5) = 10 | 2 |
| 4 | 10 | 5 (2.5) = 12.5 | 2.5 |
| Total error | | | 7 |

Based on this function, the total error is calculated according to Table 2.10. The error is now 7, which is better than the previous two trials, when the parameter w was equal to 0.5 and 1.2. We can continue increasing the value of w. Assuming that the new value for w is 3, the new function will be:

$$Y = 3X$$

The total error is calculated based on this function according to Table 2.11. The error is now 14, which is larger than before. To have a better view of the situation, we can summarize the previously selected values of the parameter w and their corresponding errors in Table 2.12. Because the error dropped when w increased from 1.2 to 2.5 and rose again at w = 3, the value of w that minimizes the error lies between 1.2 and 3. We can choose values within this range and see how things work. The process continues testing more values until finally concluding that 2 is the best value, as it reaches the least possible error, which is 0. Finally, the function will be:

$$Y = wX, \text{ where } w = 2$$


TABLE 2.11 Error calculation when w = 3.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 4 | 2 (3) = 6 | 2 |
| 2 | 6 | 3 (3) = 9 | 3 |
| 3 | 8 | 4 (3) = 12 | 4 |
| 4 | 10 | 5 (3) = 15 | 5 |
| Total error | | | 14 |

TABLE 2.12 Summary of the prediction errors for different values for w.

| w | Error |
|---|---|
| 0.5 | 21 |
| 1.2 | 11.2 |
| 2.5 | 7 |
| 3 | 14 |

This is for the data in which Y is equal to 2X. When Y is equal to 3X, the process repeats itself until we find that the best value for the parameter w is 3. Up to this point, the purpose of using the weight in an ANN is now clear. The next section discusses the second parameter used in ANN, which is the bias.
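The trial-and-error search over w can be reproduced with a short script. It is only a sketch of the manual process described above, not gradient descent; the helper name error_for_weight and the list trial_weights are assumptions made for this illustration.

```python
def total_abs_error(correct, predicted):
    """Sum of absolute differences between correct and predicted outputs."""
    return sum(abs(c - p) for c, p in zip(correct, predicted))


# Data from Table 2.3, where Y = 2X.
X = [2, 3, 4, 5]
Y = [4, 6, 8, 10]


def error_for_weight(w):
    """Total error of the model Y = wX on the data above."""
    return total_abs_error(Y, [w * x for x in X])


# The four values tried in the text, followed by the value finally reached.
trial_weights = [0.5, 1.2, 2.5, 3, 2]
errors = {w: error_for_weight(w) for w in trial_weights}

for w, e in errors.items():
    print("w =", w, "error =", round(e, 2))

best_w = min(errors, key=errors.get)
print("best w among the trials:", best_w)
```

Rounding is applied only when printing, because floating-point arithmetic may produce values that differ from the table entries by a tiny amount.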

Introducing bias

To understand why a bias is needed, we modify the data according to Table 2.13. This data is identical to that in Table 2.3, where Y = 2X, except that 1 is added to each Y value. As with the weight, we start by selecting a constant value for the bias and then replace it with a variable.

TABLE 2.13 Adding 1 to the outputs in Table 2.3 to use bias.

| Sample ID | X (input) | Y (output) |
|---|---|---|
| 1 | 2 | 5 |
| 2 | 3 | 7 |
| 3 | 4 | 9 |
| 4 | 5 | 11 |

Bias as a constant

We can test the previous function Y = wX, where w = 2, and calculate the total error according to Table 2.14. There is a total error of 4.

TABLE 2.14 Error calculation for data in Table 2.13.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 5 | 2 (2) = 4 | 1 |
| 2 | 7 | 3 (2) = 6 | 1 |
| 3 | 9 | 4 (2) = 8 | 1 |
| 4 | 11 | 5 (2) = 10 | 1 |
| Total error | | | 4 |

According to our previous discussion, an error of 4 means that the value of w is not the best and we have to change it until reaching an error of 0. However, there are some cases in which using only the weight will not reach an error of 0, because no value of the weight makes the error 0. This example is one of them. Using just the weight w, we can only get close to the correct outputs, and there will still be an error.

Let us discuss this matter in a bit more detail to show that there is no value for the weight w that makes the error 0. For a given sample, what is the best value for w in the equation Y = wX that predicts its output correctly and returns an error of 0? It is simple. We have an equation with three variables, and we know the values of two of them, Y and X. This leaves a single unknown w, which can be calculated easily according to this equation:

$$w = \frac{Y}{X}$$

For the first sample, Y equals 5 and X equals 2, and thus w will be calculated as follows:

$$w = \frac{Y}{X} = \frac{5}{2} = 2.5$$


So, the optimal value for w that predicts the output of the first sample correctly is 2.5. We can repeat the same for the second sample. For the second sample, Y = 7 and X = 3, and thus w is:

$$w = \frac{Y}{X} = \frac{7}{3} \approx 2.33$$

So, the optimal value for w that predicts the output of the second sample correctly is 2.33. This value is different from the optimal value of w that works with the first sample, which is 2.5. According to the two values of w for the first and second samples, we cannot find a single value for w that predicts their outputs correctly. Using w = 2.5 will leave an error in the second sample, and using w = 2.33 will leave an error for the first sample. In conclusion, using just the weight, we cannot reach an error of 0. In order to fix this situation, we have to use a bias. An error of 0 can be reached by adding a value of 1 to the result of multiplication between w and X. So, the new function is given in the next equation where w = 2.

$$Y = wX + 1, \text{ where } w = 2$$

Table 2.15 gives the total error, which is now 0.
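The argument above can be checked numerically. The following sketch computes the weight each sample would need on its own and then tests the function Y = 2X + 1; the variable names per_sample_w and single_weight_fits are assumptions introduced for illustration.

```python
def total_abs_error(correct, predicted):
    """Sum of absolute differences between correct and predicted outputs."""
    return sum(abs(c - p) for c, p in zip(correct, predicted))


# Data from Table 2.13.
X = [2, 3, 4, 5]
Y = [5, 7, 9, 11]

# Weight that each sample would need, on its own, for Y = wX to be exact.
per_sample_w = [y / x for x, y in zip(X, Y)]
print([round(w, 2) for w in per_sample_w])  # [2.5, 2.33, 2.25, 2.2]

# A single weight can fit every sample only if all of these values agree.
single_weight_fits = len(set(per_sample_w)) == 1
print(single_weight_fits)  # False

# Adding the constant 1 to 2X removes the error completely.
error_with_bias = total_abs_error(Y, [2 * x + 1 for x in X])
print(error_with_bias)  # 0
```

Since every sample asks for a different weight, no choice of w alone can bring the total error to 0, while the added constant does.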

TABLE 2.15 The error after using bias of value 1.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 5 | 2 (2) + 1 = 5 | 0 |
| 2 | 7 | 3 (2) + 1 = 7 | 0 |
| 3 | 9 | 4 (2) + 1 = 9 | 0 |
| 4 | 11 | 5 (2) + 1 = 11 | 0 |
| Total error | | | 0 |

Bias as a variable

We are still using a constant value of 1 to be added to wX to return the predicted output Y. According to our previous discussion, using a constant value within the function ties this value to a specific problem and makes the function not generic. As a result, rather than using a constant of 1, we can use a variable such as b. Thus, the new function is as follows:

$$Y = wX + b, \text{ where } w \text{ is the weight and } b \text{ is the bias}$$

The variable (parameter) b represents the bias in an ANN. Note that both w and b could be called parameters, but they have special names in an ANN: w is called the weight and b is called the bias.


When solving (i.e., optimizing) a problem, we now have two parameters, w and b, whose best values must be found. This makes the problem a bit harder compared to optimizing a problem with just a single parameter. Rather than finding the best value for just the weight w, we are asked to optimize two parameters, w (weight) and b (bias). This takes much more time than before. The time could be reduced by, for example, limiting the range of acceptable values for each parameter. For example, if w has just three values to choose from, which are 1, 2, and 3, and there are four values for b, which are 0, 1, 2, and 3, then there is a total of just 3 × 4 = 12 different combinations to try. Let us take an example, discussed in the next section, of optimizing the two parameters.
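Trying every combination of a small set of candidate values can be sketched as follows. Applying the 3 × 4 candidates to the data in Table 2.13 is our assumption for the sake of a concrete example, and the name total_abs_error_for is hypothetical.

```python
from itertools import product

# Data from Table 2.13 (used here as an assumed example data set).
X = [2, 3, 4, 5]
Y = [5, 7, 9, 11]


def total_abs_error_for(w, b):
    """Total absolute error of the model Y = wX + b on the data above."""
    return sum(abs(y - (w * x + b)) for x, y in zip(X, Y))


candidate_w = [1, 2, 3]
candidate_b = [0, 1, 2, 3]

# Every (w, b) pair: 3 values for w times 4 values for b.
combos = list(product(candidate_w, candidate_b))
print(len(combos))  # 12

# Pick the pair with the smallest total error.
best_w, best_b = min(combos, key=lambda wb: total_abs_error_for(*wb))
print(best_w, best_b, total_abs_error_for(best_w, best_b))  # 2 1 0
```

Exhaustive search like this is only practical when the candidate sets are small, because the number of combinations grows multiplicatively with every added parameter.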

Optimizing the weight and the bias

Assuming that the initial value of the weight w is 3 and that of the bias b is 2, the equation of the neuron becomes:

$$Y = wX + b = 3X + 2$$

Based on these values, Table 2.16 calculates the predicted outputs and the total error. The total error is 18. Because we do not know whether increasing or decreasing the values of the two parameters reduces the error, we can try selecting other values and see how things go. If the selected value for w changes to 2 and b changes to 3, the equation becomes:

$$Y = wX + b = 2X + 3$$

Table 2.17 calculates the predicted outputs and the total error. The total error is 8, compared to 18 for the previous values of the two parameters, w = 3 and b = 2. Because reducing the value of w and increasing the value of b reduced the error, we can continue by reducing w to 1 and increasing b to 4. The neuron equation is then as follows:

$$Y = wX + b = X + 4$$

TABLE 2.16 Total error when w = 3 and b = 2.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 5 | 2 (3) + 2 = 8 | 3 |
| 2 | 7 | 3 (3) + 2 = 11 | 4 |
| 3 | 9 | 4 (3) + 2 = 14 | 5 |
| 4 | 11 | 5 (3) + 2 = 17 | 6 |
| Total error | | | 18 |


TABLE 2.17 Total error when w = 2 and b = 3.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 5 | 2 (2) + 3 = 7 | 2 |
| 2 | 7 | 3 (2) + 3 = 9 | 2 |
| 3 | 9 | 4 (2) + 3 = 11 | 2 |
| 4 | 11 | 5 (2) + 3 = 13 | 2 |
| Total error | | | 8 |

TABLE 2.18 Total error when w = 1 and b = 4.

| Sample ID | Y (correct) | Y (predict) | Error |
|---|---|---|---|
| 1 | 5 | 2 (1) + 4 = 6 | 1 |
| 2 | 7 | 3 (1) + 4 = 7 | 0 |
| 3 | 9 | 4 (1) + 4 = 8 | 1 |
| 4 | 11 | 5 (1) + 4 = 9 | 2 |
| Total error | | | 4 |

Table 2.18 calculates the predicted outputs and the total error. Based on these new values, the total error is reduced from 8 to 4. By continuing to try new values for the two parameters, we can reach the best values that make the error equal to 0, which are w = 2 and b = 1. It is important to note that the approach used here for optimizing the model is not systematic; it is based on trying different values and selecting the best. There exist many systematic approaches for optimizing machine learning models. The one we focus on throughout this book is gradient descent. At this point, we have deduced a function with two parameters, which is written as given in the next equation:

$$Y = wX + b$$

The first parameter is w, representing the weight, and the second is b, representing the bias. This function is the mathematical representation of a neuron in an ANN that accepts a single input. The input is X with a weight equal to w. The neuron also has a bias b. By multiplying the weight w by the input X and adding the bias b to the result, we get Y, which is the output of the neuron and is regarded as the input to the other neurons connected to it.


The neuron can also be represented graphically, which is discussed in the next section.
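The single-input neuron Y = wX + b and the sequence of trials from this section can be sketched in Python. The function names neuron and total_abs_error are assumptions for illustration, and the last pair in trials is the optimum stated in the text.

```python
# Data from Table 2.13.
X = [2, 3, 4, 5]
Y = [5, 7, 9, 11]


def neuron(x, w, b):
    """Single-input neuron: multiply the input by the weight, then add the bias."""
    return w * x + b


def total_abs_error(w, b):
    """Total absolute error of the neuron on the data above."""
    return sum(abs(y - neuron(x, w, b)) for x, y in zip(X, Y))


# (w, b) pairs tried in the text, followed by the optimum w = 2, b = 1.
trials = [(3, 2), (2, 3), (1, 4), (2, 1)]
trial_errors = [total_abs_error(w, b) for w, b in trials]

for (w, b), e in zip(trials, trial_errors):
    print("w =", w, "b =", b, "error =", e)
```

The printed errors are 18, 8, 4, and 0, matching the totals of Tables 2.16 to 2.18 and the final result.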

From mathematical to graphical form of a neuron

The mathematical representation of the neuron can be drawn graphically in a way that summarizes the inputs, outputs, and parameters of the neuron. The mapping between the parameters in the mathematical and graphical forms of the neuron is illustrated in Fig. 2.1. There is one thing to note: the bias b is regarded as a weight for an input of value +1. This makes it easy to treat the bias like a regular input. The neuron in this example accepts only a single input X. The next section discusses using multiple inputs.

Neuron with multiple inputs

Up to this point, the purpose of the weight and the bias is clear, and we are also able to represent the neuron in both mathematical and graphical forms. But the neuron still accepts a single input. How do we allow it to support multiple inputs? This is also fairly simple. Just add whatever inputs you need to the equation and assign a weight to each of them. If there are three inputs, then the mathematical form will be as follows:

$$Y = w_1 X_1 + w_2 X_2 + w_3 X_3 + b$$

Regarding the graphical form, just create a new connection for each input, then place the input and the weight on the connection as illustrated in Fig. 2.2. By connecting multiple neurons of this form, we can create a complete ANN. Remember that the starting point was just Y = X.
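The three-input form can be written directly as a function. This is a minimal sketch; the function name neuron_3_inputs and the numeric values are illustrative assumptions and do not come from the book.

```python
def neuron_3_inputs(x1, x2, x3, w1, w2, w3, b):
    """Neuron with three inputs: one weight per input, plus a bias."""
    return w1 * x1 + w2 * x2 + w3 * x3 + b


# Illustrative values only (assumed for this example).
y = neuron_3_inputs(1.0, 2.0, 3.0, w1=0.5, w2=-1.0, w3=2.0, b=1.0)
print(y)  # 0.5 - 2.0 + 6.0 + 1.0 = 5.5
```

Writing one argument per input does not scale well, which is one reason the summation form in the next section is more convenient.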

FIG. 2.1 Graphical representation of a neuron with single input.


FIG. 2.2 The graphical representation of a neuron with multiple inputs.

Sum of products

In the mathematical form, we notice that the same kind of term is repeated: each term multiplies an input by its corresponding weight. We can summarize all of these products within a summation operator. This operator is responsible for returning the sum of products (SOP) between each input and its corresponding weight. The new mathematical form for the neuron is given in the next equation. Note that the summation starts from 0, not 1. This means there will be a weight w with index 0, and likewise an input X with index 0. The weight with index 0 refers to the bias b, and its input is always assigned +1.

$$Y = \sum_{i=0}^{N} w_i X_i$$

Calculating the SOP is not the end and there is another step that is feeding the SOP to a function called the activation function, which is discussed in the next section.
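The SOP with the bias folded in as the weight of index 0 can be sketched as below. The function name sum_of_products and the sample values are assumptions made for this illustration.

```python
def sum_of_products(inputs, weights, bias):
    """Return the sum of w_i * X_i for i = 0..N, with X_0 = +1 and w_0 = bias."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    xs = [1.0] + list(inputs)   # X_0 is always +1
    ws = [bias] + list(weights)  # w_0 is the bias
    return sum(w * x for w, x in zip(ws, xs))


# Illustrative values only (assumed for this example).
sop = sum_of_products([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=1.0)
print(sop)  # 5.5
```

Because the bias is treated as just another weight, the same loop handles any number of inputs without special cases.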

Activation function

The mathematical representation of a neuron given by the previous equation only works for modeling a linear relationship between the inputs and the outputs, given appropriate values for the parameters w and b. Let us take an example. Assume that the data to be modeled is represented graphically as given in Fig. 2.3. Based on the data distribution, it seems that a linear model could fit the purpose well, as shown in Fig. 2.4. Note that the line does not pass through all data points, and there is a prediction error for the samples that lie off the line, but this model could be acceptable. If a model cannot guarantee an error of 0, then the error should at least be made as small as possible.


FIG. 2.3 Linear data.

FIG. 2.4 Fitting a linear model over the data in Fig. 2.3.

A linear model, or function, can only represent the relationship between the inputs and the outputs as a line, which is a simple way of modeling data. Unfortunately, the data for many problems cannot be modeled linearly; thus, a nonlinear model has to be created. The data distribution in Fig. 2.5 is an example.


FIG. 2.5 Data to be modeled nonlinearly.

FIG. 2.6 A nonlinear model fits the data in Fig. 2.5.

Because there is no line that could fit the data well, a nonlinear model has to be created to reduce the error, as shown in Fig. 2.6. This leaves a question: how do we model the relationship nonlinearly? The answer is to use an activation function. There are different types of activation functions, such as the sigmoid. The mathematical formula for the sigmoid function is as follows:

$$\text{sigmoid}(\text{SOP}) = \frac{1}{1 + e^{-\text{SOP}}}$$

The activation function accepts the sum of products between the inputs and the weights (added to the bias) and then returns a single value that is able to nonlinearly map between the inputs and the output(s). If there is a neuron that accepts a single input, then we can decompose the SOP as follows:

$$\text{SOP} = wX + b$$

As a result, the sigmoid function will be as follows:

$$\text{sigmoid}(\text{SOP}) = \frac{1}{1 + e^{-\text{SOP}}} = \frac{1}{1 + e^{-(wX + b)}}$$

Another popular example of the activation functions is the rectified linear unit (ReLU). If the SOP is greater than 0, then ReLU returns the same value for the SOP. Otherwise, it returns 0. Throughout the implementation of ANN in this book, both sigmoid and ReLU will be applied.
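Both activation functions are short enough to write out. The sketch below follows the definitions given above; the parameter values are illustrative assumptions, and this simple sigmoid does not guard against overflow for very large negative inputs.

```python
import math


def sigmoid(sop):
    """Sigmoid activation: 1 / (1 + e^(-SOP)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-sop))


def relu(sop):
    """ReLU activation: the SOP itself if it is greater than 0, otherwise 0."""
    return sop if sop > 0 else 0.0


# Single-input neuron with illustrative (assumed) parameter values.
w, b, x = 2.0, 1.0, 0.5
sop = w * x + b  # 2.0

print(round(sigmoid(sop), 4))  # 0.8808
print(relu(sop))               # 2.0
print(relu(-3.0))              # 0.0
```

The sigmoid squashes any SOP into the range between 0 and 1, while ReLU leaves positive values unchanged and cuts negative values to 0.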

Conclusion

This chapter explained, in detail, how to build up to a complete ANN starting from a very simple function, Y = X. The chapter explored the purpose of both the weights and the bias until reaching the function Y = wX + b. Moreover, the chapter mapped between the mathematical form and the graphical form of a neuron. For modeling nonlinear data distributions, activation functions are used, which accept the SOP and return a single value. This book expects that the reader has beginner-level knowledge of ANN. For example, you should be familiar with the fact that an ANN is trained in two passes, forward and backward. For more information and a detailed discussion of how an ANN works, you can read this book: Ahmed Fawzy Gad, Practical Computer Vision Applications Using Deep Learning with CNNs, Apress, 9781484241660, 2018. In Chapter 3, we are going to discuss the Python implementation of an ANN with just a single input and a single output in both the forward and backward passes.

Chapter 3

ANN with 1 input and 1 output

Chapter outline
Network architecture
Forward pass
Forward pass math calculations
Backward pass
Chain rule
Backward pass math calculations
Python™ implementation
Necessary functions
Preparing inputs and outputs
Forward pass
Backward pass
Training network
Conclusion

ABSTRACT

In this and the next chapters, we are going to discuss building a generic implementation of artificial neural networks (ANNs) using Python. Generic means it can work with any network architecture. The implementation will cover both the forward and the backward passes. The algorithm implemented in this book is gradient descent. We will start from the ground up by implementing the simplest network with just 1 input, 1 output, and 1 training sample. Each successive chapter contributes to the implementation until finally reaching the generic implementation that can work with any number of inputs, any number of outputs, any network architecture, and any number of training samples. In this chapter, we are going to make a warm start by implementing an ANN for just some specific, not generic, architectures. The first architecture is the most basic one with just 1 input and 1 output. In the subsequent examples, the number of input neurons will be increased from 1 to 10. Through these examples, we can deduce some generic rules for implementing an ANN for any architecture in both the forward and backward passes. For simplicity, no bias will be used at the beginning. Once the idea of working with the weights in the forward and backward passes is clear, the bias can be added easily. The implementation assumes that the problem being solved is a regression problem. Later, we will discuss adapting the implementation to work for classification.

Network architecture

The first step toward the generic implementation of an ANN optimized using the gradient descent (GD) algorithm is to implement it for just the simplest architecture, given in Fig. 3.1. The architecture has only 1 input, 1 output, and no hidden layers at all. Before thinking of using the GD algorithm in the backward pass, let’s start with the forward pass and see how to move from the input to calculating the error.

FIG. 3.1 ANN architecture with 1 input and 1 output.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00003-6 © 2021 Elsevier Inc. All rights reserved.

Forward pass

According to Fig. 3.1, the input X1 is multiplied by its weight W to return the result X1*W. In the forward pass, each input is multiplied by its associated weight. The products between all inputs and their weights are then summed to get the SOP. Please refer to Chapter 2 for more information. In this example, there is only 1 input and thus only 1 product in the SOP, according to the next equation.

$$SOP = X_1 W$$

After calculating the SOP, the next step is to feed it to an activation function in the output layer neuron. At the beginning of the implementation, the sigmoid function will be the only function used. Later, the ReLU function will be implemented and the user will be able to choose which function to use. As discussed in Chapter 2, the formula of the sigmoid function is as follows:

$$\text{sigmoid}(SOP) = \frac{1}{1 + e^{-SOP}}$$

Assuming that the data outputs range from 0 to 1, the result returned from the sigmoid can be directly regarded as the predicted output. If the outputs are outside this range, then simply scale them to the desired range. This example is a regression example, but it could easily be converted into a classification example by mapping the score returned by the sigmoid to a class label. For example, if there are only 2 classes (0 and 1), then a sample with a score of 0.5 or higher could be associated with the first class (1). Otherwise, it is associated with the second class (0).

$$\text{class}(score) = \begin{cases} 1 & \text{if } score \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
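The mapping from a sigmoid score to a class label can be sketched as follows. The helper name to_class and the sample input and weight are assumptions made for illustration; only the 0.5 threshold comes from the text.

```python
import numpy as np

def sigmoid(sop):
    return 1.0 / (1.0 + np.exp(-sop))

def to_class(score, threshold=0.5):
    """Map a sigmoid score to class 1 if score >= threshold, else class 0."""
    return 1 if score >= threshold else 0

x1, w = 2.0, 0.7          # illustrative input and weight
score = sigmoid(x1 * w)   # SOP = X1 * W, then the sigmoid
label = to_class(score)
print(score, label)
```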

After calculating the predicted output (i.e., the output of the activation function), the next step is to measure the prediction error. Let’s use the square error function.

$$error = (predicted - target)^2$$

By calculating the prediction error, the forward pass is complete. Let’s apply these steps to a numerical example.

Forward pass math calculations

This section discusses the mathematical calculations in the forward pass for the network being discussed, using just a single sample with 1 input and 1 output. Table 3.1 shows the values of the input X, output Y, and initial weight W. The first thing to do is to calculate the SOP according to the next equation. For an input equal to 2 and an initial weight of 0.7, the SOP is 1.4.

$$SOP = X_1 W = 2(0.7) = 1.4$$

The SOP is fed as input to the sigmoid activation function and the result is calculated according to the next equation.

$$\text{sigmoid}(SOP) = \frac{1}{1 + e^{-SOP}} = \frac{1}{1 + e^{-1.4}} = \frac{1}{1.2466} \approx 0.8$$

In this example, the output of the activation function is regarded as the predicted output. Thus, the predicted output is 0.8. After the predicted output is calculated, the next step is to calculate the error between the predicted and target outputs. Based on the square error function, the error is calculated as given in the next equation.

$$error = (predicted - target)^2 = (0.8 - 1.0)^2 = (-0.2)^2 = 0.04$$

Based on the calculated error, we can start the backward pass of training the neural network and calculate the weight gradient, which is used for updating the current weight.
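As a quick check, the forward pass for this sample can be reproduced numerically. The sketch below uses the values from Table 3.1; the variable names are assumptions. Because the code does not round the predicted output to 0.8, its results differ slightly from the hand calculation.

```python
import numpy as np

x1, target, w = 2.0, 1.0, 0.7   # values from Table 3.1

sop = x1 * w                               # 1.4
predicted = 1.0 / (1.0 + np.exp(-sop))     # about 0.8022 (0.8 after rounding)
err = (predicted - target) ** 2            # about 0.0391 (0.04 with predicted = 0.8)

print(f"SOP={sop:.4f} predicted={predicted:.4f} error={err:.4f}")
```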

TABLE 3.1 Values of the input X, output Y, and the initial weight W.

X (input)    Y (output)    Initial W (weight)
2            1             0.7

Backward pass

In the backward pass, the goal is to know how the error changes when the network weights change. If such a relationship is found, then the network weights can be changed to reduce the error as much as possible. To relate the error and the weights, think of an equation that includes both the error and the weight. Based on this equation, it will be possible to know how changing one affects the other. How can that equation be built? According to the error function, the error is calculated using 2 terms:

1. predicted: The predicted output.
2. target: The desired (target) output.

$$error = (predicted - target)^2$$

This error function includes the predicted and target outputs but not the weight. Remember that the variable predicted is calculated as the output of the sigmoid function. Thus, predicted can be replaced by the equation of the sigmoid function. The new error function is given in the next equation. Up to this point, the weight still does not appear in the equation.

$$error = (predicted - target)^2 = \left(\frac{1}{1 + e^{-SOP}} - target\right)^2$$

Remember that the variable SOP is calculated as the product of the input X1 and its weight W. Thus, we can replace SOP in the previous equation by its equivalent, X1*W, as given in the next equation.

$$error = \left(\frac{1}{1 + e^{-SOP}} - target\right)^2 = \left(\frac{1}{1 + e^{-X_1 W}} - target\right)^2$$

After building an equation that relates the error and the weight, we can start calculating the gradient of the error with respect to the weight to see how changing the weight affects the error. The gradient is calculated according to the next equation.

$$\frac{dError}{dW} = \frac{d}{dW}\left(\frac{1}{1 + e^{-X_1 W}} - target\right)^2$$

Differentiating an equation like the one given above directly can become complex, especially when more inputs and weights exist. As an alternative, we can use the chain rule, which simplifies the calculations. The chain rule is discussed in the next section.
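One way to see that the error really is a function of the weight alone is to estimate the gradient numerically. The sketch below is not part of the book's implementation: it uses a central-difference approximation (a technique swapped in here for illustration) with the sample values X1 = 2 and target = 1, and the function names are assumptions.

```python
import numpy as np

def error_of_w(w, x1=2.0, target=1.0):
    """Error as a function of the weight only: (sigmoid(x1 * w) - target)^2."""
    predicted = 1.0 / (1.0 + np.exp(-x1 * w))
    return (predicted - target) ** 2

def numerical_grad(w, eps=1e-6):
    """Central-difference estimate of dError/dW."""
    return (error_of_w(w + eps) - error_of_w(w - eps)) / (2.0 * eps)

grad = numerical_grad(0.7)
print(grad)  # negative: increasing W slightly lowers the error
```

Such an estimate is handy for checking an analytical gradient, but it needs two forward passes per weight, which is one reason the chain rule is preferred.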

Chain rule

When the two participants of the gradient, error and W in this example, are not related directly by a single equation, follow a chain of derivatives that starts from error and ends at W. Looking back at the error function, predicted is the link between error and W. As a result, the first derivative in the chain is the derivative of error with respect to predicted, as calculated in the next equation.

$$\frac{dError}{dPredicted} = 2(predicted - target)$$

Back to the chain, the next derivative is the derivative of predicted with respect to SOP. In this example, the predicted output is the output of the sigmoid function. As a result, the variable predicted is calculated using the next equation.

$$predicted = \text{sigmoid}(SOP) = \frac{1}{1 + e^{-SOP}}$$

Thus, the derivative of predicted with respect to SOP is simply the derivative of the sigmoid function, as given in the next equation.

$$\frac{dPredicted}{dSOP} = \frac{1}{1 + e^{-SOP}}\left(1 - \frac{1}{1 + e^{-SOP}}\right)$$

Finally, the last derivative in the chain is the derivative of SOP with respect to the weight W, which is calculated in the next equation.

$$\frac{dSOP}{dW} = X_1$$

After going through all three derivatives in the chain from the error to the weight, the derivative of the error with respect to the weight is calculated by multiplying the individual derivatives, as in the next equation.

$$\frac{dError}{dW} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW}$$

After the weight gradient is calculated, the next step is to update the weight according to the next equation.

$$W_{new} = W_{old} - learning\_rate \times grad$$

The variable named learning_rate is the learning rate, which defines the amount of change introduced to the old weight based on the value of the gradient. It is a positive value, typically chosen between 0 and 1. For more information about how the learning rate affects the learning process, you can read the tutorial titled Is Learning Rate Useful in Artificial Neural Networks? in which a simple network is built using Python to show how changing the learning rate helps in learning the parameters correctly. The next section continues the mathematical example started previously to do the backward pass calculations.
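For reference, the three derivatives in the chain can be combined into one expression. The form below is a restatement, not an equation quoted from the book: it uses the fact that the sigmoid derivative can be written in terms of its own output, predicted(1 - predicted).

```latex
\frac{dError}{dW}
  = \underbrace{2\,(predicted - target)}_{dError/dPredicted}
    \cdot \underbrace{predicted\,(1 - predicted)}_{dPredicted/dSOP}
    \cdot \underbrace{X_1}_{dSOP/dW}
```

Written this way, the gradient only needs values that were already computed in the forward pass.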

Backward pass math calculations

The first derivative to be calculated in the backward pass is the derivative of the error with respect to the predicted output, as calculated in the next equation. Substituting the values of the predicted (0.8) and target (1.0) outputs, the derivative is -0.4.

$$\frac{dError}{dPredicted} = 2(predicted - target) = 2(0.8 - 1.0) = 2(-0.2) = -0.4$$

The next derivative is that of predicted with respect to SOP, calculated according to the next equation. Substituting the value of the sum of products (1.4), the derivative is 0.16.

$$\frac{dPredicted}{dSOP} = \frac{1}{1 + e^{-SOP}}\left(1 - \frac{1}{1 + e^{-SOP}}\right) = \frac{1}{1 + e^{-1.4}}\left(1 - \frac{1}{1 + e^{-1.4}}\right) = 0.8(0.2) = 0.16$$

The last one is the derivative of SOP with respect to W, as calculated in the next equation, which yields 2.

$$\frac{dSOP}{dW} = X_1 = 2$$

After calculating all three derivatives in the chain from the error to W, the next step is to calculate the gradient of W by multiplying these 3 derivatives (-0.4, 0.16, and 2) according to the next equation. The gradient is -0.128.

$$\frac{dError}{dW} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW} = -0.4(0.16)(2) = -0.128$$

The sign of the gradient carries useful information. When the gradient is positive, there is a direct relationship between W and the error: increasing one of them increases the other. If the gradient is negative, then there is an inverse relationship between W and the error: increasing W reduces the error and reducing W increases the error. Assuming that the learning rate is 0.5, the new weight is calculated as given in the next equation.

$$W_{new} = W_{old} - learning\_rate \times grad = 0.7 - 0.5(-0.128) = 0.764$$

Note that the amount of change in the current value of the weight is calculated by multiplying the learning rate by the gradient. Subtracting this negative value raises the weight by 0.064.

$$w\_change = learning\_rate \times grad = 0.5(-0.128) = -0.064$$

The new value of the weight is now 0.764, which is larger than the previous value 0.7. This means a higher value for W is preferred to reduce the error. Repeating the calculations, the new W value gives a predicted output of 0.8217 and thus the error is 0.0318, which is smaller than the previous error 0.04. Repeating the backward pass with these values gives a next value for W of approximately 0.816, which is higher than the previous one, 0.764. By repeating these calculations more and more, the prediction moves progressively closer to the desired output 1.0. After understanding how the process works theoretically, we can apply it easily in Python as discussed in the next section.
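The hand calculation can be replayed in code for a few iterations. The sketch below is an illustration under stated assumptions: the function name step and its defaults are hypothetical, and the values come from Table 3.1 with a learning rate of 0.5. Because the code keeps full precision instead of rounding the predicted output to 0.8, its numbers differ slightly from those in the text (for example, the first updated weight is about 0.763 rather than 0.764).

```python
import numpy as np

def step(w, x1=2.0, target=1.0, learning_rate=0.5):
    """One forward and backward pass; returns (predicted, error, new_w)."""
    predicted = 1.0 / (1.0 + np.exp(-x1 * w))
    err = (predicted - target) ** 2
    g1 = 2.0 * (predicted - target)       # dError/dPredicted
    g2 = predicted * (1.0 - predicted)    # dPredicted/dSOP
    g3 = x1                               # dSOP/dW
    grad = g1 * g2 * g3
    return predicted, err, w - learning_rate * grad

w = 0.7
for i in range(3):
    predicted, err, w_next = step(w)
    print(f"iter {i}: w={w:.4f} predicted={predicted:.4f} "
          f"error={err:.4f} -> new w={w_next:.4f}")
    w = w_next
```

Each iteration should print a larger weight and a smaller error than the one before it.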

Python™ implementation

In this section, the ANN discussed in this chapter will be implemented in Python. The only library used for this task is numerical Python (NumPy), and it is the main one used in this book. We are going to list the final code that builds the network and then discuss it in detail. The next code goes through the steps discussed previously to implement an ANN with 1 input and 1 output, trained using a single training sample, in both the forward and the backward passes.

The code consists of four parts that do the following:

1. Defining and implementing the necessary functions.
2. Defining the input, output, weight, and learning rate.
3. Implementing the forward pass.
4. Implementing the backward pass.

Each of these parts is discussed in the next subsections.

Necessary functions

There are six necessary functions to be implemented for building the ANN successfully in both the forward and the backward passes. The names of the six functions are as follows:

1. sigmoid(): Sigmoid function.
2. error(): Error function.
3. error_predicted_deriv(): Calculates the derivative of the error with respect to the predicted output.
4. activation_sop_deriv(): Calculates the derivative of the activation function (i.e., predicted) with respect to the SOP.
5. sop_w_deriv(): Calculates the derivative of the SOP with respect to the weight w.
6. update_w(): Updates the weight w based on the calculated gradient (i.e., the derivative chain product).

The first two functions, sigmoid() and error(), serve the forward pass and the remaining four functions serve the backward pass. The next listing gives the code for implementing these six functions. The implementation of each function is just a mapping from the mathematical equations discussed previously. Note that within the implementation of the sigmoid function, numpy.exp() is used to calculate the exponential of the SOP.
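A minimal sketch of what these six functions might look like is given below. The function names are those listed above; the parameter names and exact bodies are assumptions derived from the equations in this chapter and may differ from the book's own listing.

```python
import numpy as np

def sigmoid(sop):
    # Forward pass: activation applied to the SOP.
    return 1.0 / (1.0 + np.exp(-sop))

def error(predicted, target):
    # Forward pass: square error.
    return np.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    # dError/dPredicted = 2 * (predicted - target)
    return 2 * (predicted - target)

def activation_sop_deriv(sop):
    # dPredicted/dSOP = sigmoid(SOP) * (1 - sigmoid(SOP))
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    # dSOP/dW = X
    return x

def update_w(w, grad, learning_rate):
    # W_new = W_old - learning_rate * grad
    return w - learning_rate * grad
```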

Preparing inputs and outputs

After these 6 functions are built, the next part of the code prepares the input, output, weight, and learning rate. The value of the input X1 is 0.1 and the target (i.e., Y) is 0.3. The weight w is initialized randomly using numpy.random.rand(), which returns a number in the range from 0 (inclusive) to 1 (exclusive). The learning rate is set to 0.01.

Forward pass

The next part of the code goes through the forward pass by propagating the input and the weight through the network until calculating the predicted output followed by the error. At first, the SOP between the input and the weight is calculated. The SOP is then passed to the sigmoid() function. Remember that the output from the sigmoid() function is regarded as the predicted output.

After calculating the predicted output, the final step is to calculate the error using the error() function. By doing that, the forward pass is complete.
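Preparing the values and running the forward pass might look like the following sketch. The input, target, and learning rate are the ones stated in the text; the variable names are assumptions, and the sigmoid and error calculations are written inline so that the block runs on its own.

```python
import numpy as np

x1 = 0.1                     # input
target = 0.3                 # desired output (Y)
learning_rate = 0.01         # prepared here, used later in the backward pass
w = np.random.rand()         # random initial weight in [0, 1)

# Forward pass: SOP -> sigmoid -> square error
sop = x1 * w
predicted = 1.0 / (1.0 + np.exp(-sop))
err = (predicted - target) ** 2
print(predicted, err)
```

With an input of 0.1 and a weight below 1, the SOP is close to 0, so the first prediction lands just above 0.5.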

Backward pass

The backward pass starts by calculating the derivative of the error with respect to the predicted output using the error_predicted_deriv() function. The result is stored in the variable g1. The next derivative, which is the derivative of the predicted (activation) output with respect to the SOP, is calculated using the activation_sop_deriv() function. The result is stored in the variable g2. Finally, the last derivative, between the SOP and the weight w, is calculated using the sop_w_deriv() function and the result is stored in the variable g3.

After calculating all derivatives in the chain and storing their values in the three variables g1, g2, and g3, the next step is to calculate the derivative of the error with respect to the weight by multiplying these derivatives. This returns the gradient by which the weight value is updated. The weight is updated using the update_w() function, which accepts three arguments:

1. w: The weight.
2. grad: The weight’s gradient.
3. learning_rate: The learning rate.

Calling the update_w() function returns the new value of the weight w, which replaces the old value. Based on the new value of w, the calculations in the forward and backward passes should be repeated.
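A single weight update following these steps could be sketched as below. The variable names g1, g2, g3 and the update_w() arguments follow the text; everything else (helper bodies, the name w_new) is an assumption made so that the block is self-contained.

```python
import numpy as np

def sigmoid(sop):
    return 1.0 / (1.0 + np.exp(-sop))

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

x1, target, learning_rate = 0.1, 0.3, 0.01
w = np.random.rand()

# Forward pass
sop = x1 * w
predicted = sigmoid(sop)

# Backward pass: the three derivatives in the chain
g1 = 2 * (predicted - target)                 # dError/dPredicted
g2 = sigmoid(sop) * (1.0 - sigmoid(sop))      # dPredicted/dSOP
g3 = x1                                       # dSOP/dW

grad = g3 * g2 * g1                           # dError/dW
w_new = update_w(w, grad, learning_rate)
print(w, grad, w_new)
```

Because the first prediction is above the target of 0.3, the gradient is positive here and the update lowers the weight.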

Training network

The previous code does not retrain the network using the updated weight. An ANN is usually trained by updating the network weights over a number of iterations. This helps to reach a better value for the weight and thus reduces the prediction error. The next code uses a for loop to repeat the calculations in the forward and backward passes for 80,000 iterations. You can change the learning rate and the number of iterations until the network makes good predictions.

After the code completes, Fig. 3.2 shows how the network prediction changes over the iterations. The network is able to reach the desired output after 50,000 iterations. Note that you can reach the desired output using fewer iterations by changing the learning rate.
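A training loop along these lines can be sketched as follows. The function name train, its parameters, and the choice to return the list of predictions are assumptions for illustration; the input (0.1), target (0.3), and iteration count (80,000) come from the text. The call below uses a learning rate of 0.5, one of the two rates discussed in this section; how many iterations are needed in practice depends on the learning rate and on the random initial weight.

```python
import numpy as np

def train(x1=0.1, target=0.3, learning_rate=0.01, iterations=80000, w=None):
    """Repeat the forward and backward passes; return final w and predictions."""
    if w is None:
        w = np.random.rand()
    predictions = []
    for _ in range(iterations):
        # Forward pass
        sop = x1 * w
        predicted = 1.0 / (1.0 + np.exp(-sop))
        predictions.append(predicted)
        # Backward pass
        g1 = 2 * (predicted - target)
        g2 = predicted * (1.0 - predicted)
        g3 = x1
        w = w - learning_rate * (g1 * g2 * g3)
    return w, predictions

w, predictions = train(learning_rate=0.5)
print(predictions[0], predictions[-1])
```

Plotting the returned predictions against the iteration number gives a curve of the kind described for the figures in this section.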

FIG. 3.2 Network prediction vs. iteration where the learning rate is 0.01. (Axes: Prediction vs. Iteration Number.)

FIG. 3.3 Network prediction vs. iteration where the learning rate is 0.5. (Axes: Prediction vs. Iteration Number.)

FIG. 3.4 Network error vs. iteration. (Axes: Error vs. Iteration Number.)

When the learning rate is 0.5, Fig. 3.3 shows how the predicted output changes by iteration. The network reached the desired output after only 10,000 iterations. Fig. 3.4 shows how the network error changes by iteration when the learning rate is 0.5.

Conclusion

This chapter discussed the math for working with an ANN with just 1 input and 1 output. It is the most basic type of network architecture. No hidden layers were used and just a single sample was used for training. After discussing the theory behind how the ANN works in both the forward and the backward passes, the chapter discussed the implementation of the example in Python in detail. Chapter 4 builds on this implementation and works with two architectures. The first one has 2 inputs rather than 1, which is not far from the one discussed in this chapter. The second architecture increases the number of inputs to 10. To avoid modifying the code when the number of inputs changes, Chapter 4 builds a Python implementation that can work with any number of inputs.

Chapter 4

Working with any number of inputs

Chapter outline
ANN with 2 inputs and 1 output
Math example
Python™ implementation
Code changes
Training ANN
ANN with 10 inputs and 1 output
Training ANN
ANN with any number of inputs
Inputs assignment
Weights initialization
Calculating the SOP
Calculating the SOP to weights derivatives
Calculating the weights gradients
Updating the weights
Conclusion

ABSTRACT

Chapter 3 implemented a basic neural network with just 1 input and 1 output, trained using a single sample. This chapter extends the previous implementation until building a network that works with any number of inputs. The first architecture discussed in this chapter has 2 inputs and 1 output. The second architecture has 10 neurons rather than 2 in the input layer. Through these examples, some generic rules are deduced for implementing a network that works with any number of inputs.

ANN with 2 inputs and 1 output

This section extends the ANN implementation in Chapter 3 to work with an input layer with 2 inputs rather than just 1. The diagram of the ANN is given in Fig. 4.1. Each input has a different weight: W1 for the first input X1 and W2 for the second input X2. How do we work with these 2 parameters in both the forward and backward passes? Regarding the forward pass, things are straightforward. All we need to do is to calculate the SOP based on these inputs. For 2 inputs, the SOP is calculated according to the next equation.

$$SOP = W_1 X_1 + W_2 X_2$$

FIG. 4.1 ANN architecture with 2 inputs and 1 output.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00009-7 © 2021 Elsevier Inc. All rights reserved.

After the SOP is calculated, the result is applied to the activation function to return the predicted output.

$$\text{sigmoid}(SOP) = \frac{1}{1 + e^{-SOP}}$$

Finally, the error is calculated as usual.

$$error = (predicted - target)^2$$

Thus, all that is needed in the forward pass to work with multiple inputs is to calculate the SOP properly by multiplying each input by its corresponding weight and summing the products. Regarding the backward pass, there is some more work to do. To make things simpler, the next equations list the chain of derivatives of the error with respect to the 2 weights W1 and W2. What is the difference between these 2 chains?

$$\frac{dError}{dW_1} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_1}$$

$$\frac{dError}{dW_2} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_2}$$

The difference is in the last derivative, between the SOP and the weight. The first 2 derivatives are identical for all weights. The next section applies the theory using a simple mathematical example.

Math example

The sample used for training the ANN, in addition to the initial weights, is given in Table 4.1. The values of the 2 inputs X1 and X2 are 2 and 3, respectively. The output Y is 1. The initial values of the 2 weights W1 and W2 are 0.7 and 0.2. Based on the inputs and their weights, the SOP is calculated according to the next equation. Its value is 2.

TABLE 4.1 Values of the inputs, output, and the initial weights.

X1 (input)    X2 (input)    Y (output)    W1 (initial)    W2 (initial)
2             3             1             0.7             0.2

$$SOP = W_1 X_1 + W_2 X_2 = 0.7(2) + 0.2(3) = 2$$

The SOP is then applied to the sigmoid function according to the next equation. The result is 0.881. Note that the output of the sigmoid function is regarded as the predicted output.

$$\text{sigmoid}(SOP) = \frac{1}{1 + e^{-SOP}} = \frac{1}{1 + e^{-2}} = 0.881$$

Based on the predicted and target outputs, the next equation calculates the error, which equals 0.0142.

$$error = (predicted - target)^2 = (0.881 - 1.0)^2 = 0.0142$$

The forward pass ends by calculating the error, and the backward pass starts. In the backward pass, there are 3 derivatives in the chain from the error to each of the 2 weights. For the first weight W1, here is the derivative chain.

$$\frac{dError}{dW_1} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_1}$$

Starting from left to right, the derivative of the error with respect to the predicted output is calculated according to the next equation. It returns the value -0.238.

$$\frac{dError}{dPredicted} = 2(predicted - target) = 2(0.881 - 1.0) = -0.238$$

The next equation calculates the derivative of the predicted output with respect to the SOP. The result is 0.105.

$$\frac{dPredicted}{dSOP} = \frac{1}{1 + e^{-SOP}}\left(1 - \frac{1}{1 + e^{-SOP}}\right) = \frac{1}{1 + e^{-2}}\left(1 - \frac{1}{1 + e^{-2}}\right) = 0.881(0.119) = 0.105$$

The next equation calculates the derivative of the SOP with respect to the weight W1.

$$\frac{dSOP}{dW_1} = X_1 = 2$$

After calculating all the derivatives in the chain between the error and W1, these derivatives are multiplied to return the gradient by which W1 is updated.

$$\frac{dError}{dW_1} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_1} = -0.238(0.105)(2) = -0.050$$

Based on the gradient of W1, its new value is 0.725 given that the learning rate is 0.5.

$$W_{1new} = W_{1old} - learning\_rate \times grad = 0.7 - 0.5(-0.050) = 0.725$$

For the second weight W2, its derivative chain is given in the next equation.

$$\frac{dError}{dW_2} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_2}$$

While calculating the gradient of W1, the first 2 derivatives in the chain between the error and W2 were already calculated; their values are -0.238 and 0.105. The remaining derivative, of the SOP with respect to W2, is calculated in the next equation. Its value is 3.

$$\frac{dSOP}{dW_2} = X_2 = 3$$

Based on the 3 derivatives in the chain between the error and W2, the gradient of W2 is calculated in the next equation. The gradient is -0.075.

$$\frac{dError}{dW_2} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_2} = -0.238(0.105)(3) = -0.075$$

The new value of W2 is calculated according to the next equation, which returns 0.2375 given that the learning rate is 0.5.

$$W_{2new} = W_{2old} - learning\_rate \times grad = 0.2 - 0.5(-0.075) = 0.2375$$

At this time, the new values for the 2 weights W1 and W2 are 0.725 and 0.2375, respectively. After discussing the calculations in the forward and backward passes, the next section builds the Python implementation of the network.
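The arithmetic of this example can be checked numerically. The sketch below uses the values from Table 4.1 and a learning rate of 0.5; holding the inputs and weights in NumPy arrays is a convenience assumed here, not something the chapter has introduced at this point. Evaluated this way, the sigmoid derivative at SOP = 2 comes out at roughly 0.105.

```python
import numpy as np

x = np.array([2.0, 3.0])      # X1, X2 from Table 4.1
w = np.array([0.7, 0.2])      # initial W1, W2
target, learning_rate = 1.0, 0.5

# Forward pass
sop = np.sum(w * x)                          # 2.0
predicted = 1.0 / (1.0 + np.exp(-sop))       # about 0.881
err = (predicted - target) ** 2              # about 0.0142

# Backward pass
g1 = 2 * (predicted - target)                # about -0.238
g2 = predicted * (1.0 - predicted)           # about 0.105
g3 = x                                       # dSOP/dW1 = 2, dSOP/dW2 = 3

grads = g1 * g2 * g3                         # about [-0.050, -0.075]
w_new = w - learning_rate * grads            # about [0.725, 0.238]
print(sop, predicted, err, grads, w_new)
```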

Python™ implementation

The next code gives the Python implementation that builds an ANN in both the forward and backward passes for working with 2 inputs and 1 output. The changes in this code compared to the code developed in Chapter 3 are discussed in the next subsection.

Code changes

There are 3 major differences compared to the implementation in Chapter 3, and they can be summarized as follows:

1. Inputs and weights initialization
2. SOP calculation
3. Derivatives calculation

The example discussed in Chapter 3 uses just a single input and thus there was a single variable named x1 for that input, with the weight w1 associated with it. The example in this chapter has 2 inputs rather than just one and thus 2 variables, x1 and x2, are created to hold the 2 inputs. There are also 2 variables, w1 and w2, to hold the weights for the 2 inputs. Remember that the weights are initialized randomly using numpy.random.rand(). For working with 2 weights, this function is called 2 times. The next code lists the part of the implementation necessary for preparing the inputs and the weights. Note that there are no necessary changes for the target and the learning rate.

The second change was already discussed but it does not hurt to mention it again. In the forward pass, the SOP is calculated as the sum of products between each input and its associated weight. Because there are 2 inputs and 2 weights, it is now calculated according to the next equation.

$$SOP = X_1 W_1 + X_2 W_2$$

The part of the code used for building the forward pass is as follows.

The third change is about the backward pass, namely calculating the derivative of the SOP with respect to each of the 2 weights. In Chapter 3, there was just a single weight and thus a single derivative was calculated, which is the derivative of the SOP with respect to that weight. In this example, there are 2 weights and thus 2 derivatives are calculated, 1 for each weight. To follow up, here is the code that implements the backward pass.

The variable g3w1 holds the derivative of the SOP with respect to W1 and the variable g3w2 holds the derivative of the SOP with respect to W2. The gradient by which each weight is updated is calculated in the 2 variables gradw1 and gradw2. Finally, 2 calls to the update_w() function are necessary for updating the 2 weights. At this point, the changes needed to make the implementation in Chapter 3 work for 2 inputs have been discussed. The next section trains the ANN by entering a loop that updates the weights multiple times until reaching a good state.
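Putting the three changes together, one iteration with 2 inputs might be sketched as follows. The variable names g3w1, g3w2, gradw1, gradw2, and update_w() follow the text. The input values 0.1 and 0.4 are hypothetical, since the text does not state them, and the names w1_new and w2_new are assumptions used to keep the old weights visible.

```python
import numpy as np

def sigmoid(sop):
    return 1.0 / (1.0 + np.exp(-sop))

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

# Change 1: two inputs and two randomly initialized weights
x1, x2 = 0.1, 0.4            # hypothetical input values
target = 0.3
learning_rate = 0.01
w1 = np.random.rand()
w2 = np.random.rand()

# Change 2: the SOP sums both products
sop = x1 * w1 + x2 * w2
predicted = sigmoid(sop)
err = (predicted - target) ** 2

# Change 3: one SOP-to-weight derivative per weight
g1 = 2 * (predicted - target)          # shared by both chains
g2 = predicted * (1.0 - predicted)     # shared by both chains
g3w1 = x1
g3w2 = x2
gradw1 = g3w1 * g2 * g1
gradw2 = g3w2 * g2 * g1

w1_new = update_w(w1, gradw1, learning_rate)
w2_new = update_w(w2, gradw2, learning_rate)
print(err, w1_new, w2_new)
```

The two gradients differ only by the ratio of the inputs, which reflects the point that the first 2 derivatives are shared across the chains.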

Training ANN

The previous code just works for 1 iteration. Using a loop, the code is repeated through several iterations in which the weights can be updated to better values that reduce the error. Here is the new code.

Fig. 4.2 shows how the prediction of the ANN changes until reaching the desired output, which is 0.3. After around 5000 iterations, the network is able to make the correct prediction. Remember that changing the learning rate might increase or decrease the number of iterations required to reach the desired output. Fig. 4.3 shows how the error changes with the iteration number. The error is approximately 0.0 after 5000 iterations. Up to this point, an ANN has been implemented in both the forward and backward (gradient descent) passes for working with 2 inputs. In the next section, the previous implementation will be extended to allow the algorithm to work with 10 inputs. Later, an ANN is built that works with any number of inputs.

FIG. 4.2 Network prediction vs. iteration for the ANN architecture in Fig. 4.1. (Axes: Prediction vs. Iteration Number.)

FIG. 4.3 Network error vs. iteration for the ANN architecture in Fig. 4.1. (Axes: Error vs. Iteration Number.)

ANN with 10 inputs and 1 output

The network architecture with 10 inputs and 1 output is represented graphically in Fig. 4.4. There are 10 inputs, X1 to X10, and 10 weights, W1 to W10. Each input Xi has its weight Wi. Building the ANN for such an architecture is similar to the previous example but uses 10 inputs rather than 2. Similar to the 3 changes applied to implement an ANN with 2 inputs, there will be some changes to make it work with 10 inputs. The first change is preparing the initial weights. Because there are 10 inputs, there will be 10 weights, one for each input. The implementation is just a matter of repeating the lines of code. The second change is calculating the SOP in the forward pass. Because there are 10 inputs and 10 weights, the SOP is the sum of 10 products between each input Xi and its corresponding weight Wi. The SOP is calculated according to the next equation.

$$SOP = W_1 X_1 + W_2 X_2 + W_3 X_3 + W_4 X_4 + W_5 X_5 + W_6 X_6 + W_7 X_7 + W_8 X_8 + W_9 X_9 + W_{10} X_{10}$$

FIG. 4.4 ANN architecture with 10 inputs and 1 output. (Input layer X1 to X10 with weights W1 to W10, feeding a single SOP and activation neuron in the output layer, followed by the error.)

Regarding the backward pass, there will be 10 derivative chains, one for generating the gradient of each of the 10 weights. The necessary derivative chains are summarized in the next equations. Their implementation is nothing more than repeating some lines of code. It is clear that the first 2 derivatives are fixed across all weights, but the last derivative (SOP to weight) changes for each weight.

$$\begin{aligned}
\frac{dError}{dW_1} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_1} \\
\frac{dError}{dW_2} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_2} \\
\frac{dError}{dW_3} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_3} \\
\frac{dError}{dW_4} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_4} \\
\frac{dError}{dW_5} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_5} \\
\frac{dError}{dW_6} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_6} \\
\frac{dError}{dW_7} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_7} \\
\frac{dError}{dW_8} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_8} \\
\frac{dError}{dW_9} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_9} \\
\frac{dError}{dW_{10}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_{10}}
\end{aligned}$$

The generic derivative chain that works for any weight is given in the next equation. Just replace i by the index of the weight.

$$\frac{dError}{dW_i} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_i}$$

In the example with just 2 inputs, there were just 2 weights and thus the following was needed:

1. 2 lines of code for specifying the value of each of the 2 inputs.
2. 2 lines of code for initializing the 2 weights.
3. Calculating the SOP by summing the products between the 2 inputs and the 2 weights.
4. 2 lines of code for calculating the derivatives of the SOP with respect to the 2 weights.
5. 2 lines of code for calculating the gradients for the 2 weights.
6. 2 lines of code for updating the 2 weights.

In this example, 10 lines replace the 2 lines and thus the following is needed:

1. 10 lines of code for specifying the values of the 10 inputs.
2. 10 lines of code for initializing the 10 weights.
3. Calculating the SOP by summing the products between the 10 inputs and the 10 weights.
4. 10 lines of code for calculating the derivatives of the SOP with respect to the 10 weights.
5. 10 lines of code for calculating the gradients for the 10 weights.
6. 10 lines of code for updating the 10 weights.

The code for implementing the ANN with 10 inputs in both the forward and backward passes is listed as follows:
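Because the generic chain differs between weights only in its last factor, the 10 gradients can also be computed in one expression when the inputs and weights are held in NumPy arrays. The sketch below is an alternative to the line-per-weight style described above, not the chapter's listing: the input values are hypothetical random numbers, the array names are assumptions, and the target and learning rate reuse 0.3 and 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10)             # hypothetical values for X1..X10
w = rng.random(10)             # random initial weights W1..W10
target, learning_rate = 0.3, 0.01

# Forward pass: the SOP sums the 10 products
sop = np.sum(w * x)
predicted = 1.0 / (1.0 + np.exp(-sop))

# Backward pass: g1 and g2 are shared, g3 holds dSOP/dWi = Xi for every i
g1 = 2 * (predicted - target)
g2 = predicted * (1.0 - predicted)
g3 = x
grads = g1 * g2 * g3           # 10 gradients at once
w_new = w - learning_rate * grads

# The same result, one weight at a time, as in the generic chain
for i in range(10):
    assert np.isclose(grads[i], g1 * g2 * x[i])
print(grads)
```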


The part responsible for preparing the 10 inputs is as follows. No changes are necessary for the output and the learning rate; their values can be left as they are or replaced by others of your choice. Note that a new variable is added for every new input. This becomes tiresome, especially if there are hundreds or even thousands of inputs. Could you imagine how to organize things better? This is discussed in a later section.


The code block that prepares the 10 weights is listed as follows. It is nothing more than calling the numpy.random.rand() function 10 times. Similar to the inputs, a new line is used for each new weight.

The forward pass is implemented according to the next code. The only change is preparing the SOP by multiplying each input by its corresponding weight.

Finally, the backward pass is implemented in the next code. In the derivative chains of all weights, the first 2 derivatives are the same; only the last derivative differs.

$$\frac{dError}{dW_i} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP} \cdot \frac{dSOP}{dW_i}$$

The first 2 derivatives are stored in the variables g1 and g2. The third derivative varies for each weight and is thus calculated separately for each one; the variables g3w1 to g3w10 hold these derivatives. After all derivatives are prepared, they are multiplied to return the gradients of the 10 weights, which are stored in the variables gradw1 to gradw10.


Finally, the update_w() function is called 10 times, one for each weight, to return the updated weights.
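The steps described above can be put together in one place. The following is a minimal sketch of a single iteration, not the book's exact listing: the helper names sigmoid(), error(), error_predicted_deriv(), and sigmoid_sop_deriv(), as well as the input values, the target, and the learning rate, are assumptions chosen for illustration. Only sop_w_deriv(), update_w(), numpy.random.rand(), and the variable names g1, g2, g3w1 to g3w10, and gradw1 to gradw10 come from the text.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-1 * sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    # dSOP/dWi equals the input Xi that Wi multiplies.
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

# Illustrative inputs (assumed values): one variable per input.
x1 = 0.1
x2 = 0.4
x3 = 1.1
x4 = 1.3
x5 = 1.8
x6 = 2.0
x7 = 0.01
x8 = 0.9
x9 = 0.8
x10 = 1.6
target = 0.7
learning_rate = 0.001

# One numpy.random.rand() call per weight.
w1 = numpy.random.rand()
w2 = numpy.random.rand()
w3 = numpy.random.rand()
w4 = numpy.random.rand()
w5 = numpy.random.rand()
w6 = numpy.random.rand()
w7 = numpy.random.rand()
w8 = numpy.random.rand()
w9 = numpy.random.rand()
w10 = numpy.random.rand()

# Forward pass: the SOP has one product term per input.
sop = (x1*w1 + x2*w2 + x3*w3 + x4*w4 + x5*w5
       + x6*w6 + x7*w7 + x8*w8 + x9*w9 + x10*w10)
predicted = sigmoid(sop)
err = error(predicted, target)

# Backward pass: g1 and g2 are shared by all 10 chains.
g1 = error_predicted_deriv(predicted, target)
g2 = sigmoid_sop_deriv(sop)

# The third derivative differs per weight.
g3w1 = sop_w_deriv(x1)
g3w2 = sop_w_deriv(x2)
g3w3 = sop_w_deriv(x3)
g3w4 = sop_w_deriv(x4)
g3w5 = sop_w_deriv(x5)
g3w6 = sop_w_deriv(x6)
g3w7 = sop_w_deriv(x7)
g3w8 = sop_w_deriv(x8)
g3w9 = sop_w_deriv(x9)
g3w10 = sop_w_deriv(x10)

gradw1 = g3w1 * g2 * g1
gradw2 = g3w2 * g2 * g1
gradw3 = g3w3 * g2 * g1
gradw4 = g3w4 * g2 * g1
gradw5 = g3w5 * g2 * g1
gradw6 = g3w6 * g2 * g1
gradw7 = g3w7 * g2 * g1
gradw8 = g3w8 * g2 * g1
gradw9 = g3w9 * g2 * g1
gradw10 = g3w10 * g2 * g1

# update_w() is called once per weight.
w1 = update_w(w1, gradw1, learning_rate)
w2 = update_w(w2, gradw2, learning_rate)
w3 = update_w(w3, gradw3, learning_rate)
w4 = update_w(w4, gradw4, learning_rate)
w5 = update_w(w5, gradw5, learning_rate)
w6 = update_w(w6, gradw6, learning_rate)
w7 = update_w(w7, gradw7, learning_rate)
w8 = update_w(w8, gradw8, learning_rate)
w9 = update_w(w9, gradw9, learning_rate)
w10 = update_w(w10, gradw10, learning_rate)
```

The sheer amount of repetition in this sketch is the point: every additional input costs five more lines.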

At this time, all the changes required to work with 10 weights are done. The next section lists the code that trains the network.

Training ANN

As usual, the previous code goes through just a single iteration. Using a loop, the network can be trained for a number of iterations. The next code goes through 80,000 iterations.


Fig. 4.5 shows how the predicted output changes with the iteration number. After around 30,000 iterations, the network is able to make the correct prediction. Fig. 4.6 shows the relationship between the error and the iteration number. The network is able to reach an error of approximately 0.0. Up to this point, the implementation of the ANN for working with 10 inputs is complete. The expected question is: what if there are more than 10 inputs? Do we have to add more lines for each input and weight? With the current implementation, unfortunately, the lines must be duplicated. But this is not the only way to do it.

FIG. 4.5 Network prediction vs. iteration for the ANN architecture in Fig. 4.4. (Plot: Prediction, from 0.70 to 1.00, against Iteration Number, from 0 to 80,000.)

FIG. 4.6 Prediction error vs. iteration for the ANN architecture in Fig. 4.4. (Plot: Error, from 0.00 to 0.08, against Iteration Number, from 0 to 80,000.)

The previous code will be refined in the next section so that there is no need to duplicate lines of code at all for working with any number of inputs.

ANN with any number of inputs

The strategy followed so far for implementing the GD algorithm is to duplicate some lines of code for each new input. Although this is an old-fashioned way of doing things, it helps in understanding how each little step works. In this section, the previous implementation is edited so that it works regardless of whether the number of inputs increases or decreases. The strategy is to inspect the previous implementation for the lines that are repeated for each input. After that, these lines are replaced by just a single line that works for all inputs. Looking at the previous code, there are 6 parts to be refined:
1. Inputs assignment.
2. Weights initialization.
3. Calculating the SOP.
4. Calculating the SOP to weights derivatives.
5. Calculating the weights gradients.
6. Updating the weights.

The next sections address each of these 6 points by looking at the relevant code part and then suggesting the change that makes it work dynamically with any number of inputs.

Inputs assignment

The first point to be addressed is assigning values to the inputs. The part of the previous code used for specifying the values of all inputs is listed as follows:

According to the current strategy, if more inputs are to be added, then more lines must be written.


A better approach is to replace all of these lines by just the next single line. A NumPy array that holds all of the inputs is created using numpy.array(). Each individual input is then retrieved by indexing. For example, to retrieve the first input, index 0 is used for indexing the array x.
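As a minimal sketch of this idea, the snippet below stores 10 inputs in one array and retrieves two of them by index. The input values and the names first_input and last_input are illustrative assumptions.

```python
import numpy

# Illustrative input values; any 10 numbers would do.
x = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8, 2.0, 0.01, 0.9, 0.8, 1.6])

first_input = x[0]  # plays the role of the former x1 variable
last_input = x[9]   # plays the role of the former x10 variable

print(x.shape[0], "inputs stored in a single array")
```

Adding or removing an input now means editing the list passed to numpy.array() rather than adding or deleting a variable.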

Weights initialization

The second point to be addressed is initializing the weights. The part of the previous code used for initializing the weights is listed as follows:

If more weights are to be initialized, more lines are written. Rather than adding a separate line for initializing each weight, all of these lines could be replaced by the next line. If the numpy.random.rand() function is called without arguments, it returns a single value. If it is called with an integer, it returns a vector of random numbers whose length equals that integer. The next line returns a NumPy array with 10 values, one for each weight. Again, the individual weights are retrieved by indexing.
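The two calling styles of numpy.random.rand() can be compared side by side. This is a small sketch; the names single_value and first_weight are illustrative.

```python
import numpy

# Without arguments: a single random float drawn from [0, 1).
single_value = numpy.random.rand()

# With an integer argument: a 1-D array holding that many random values.
w = numpy.random.rand(10)

first_weight = w[0]  # plays the role of the former w1 variable
print(w.shape)
```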

Calculating the SOP

The third change is how to calculate the SOP. The SOP in the previous code was calculated according to the next line. For each input, a new term is added for multiplying it by its weight.

Rather than multiplying each input by its weight this way, a better approach could be used. Remember that the SOP is calculated by multiplying each input by its weight and summing the results. Also remember that both x and w are now NumPy arrays with 10 values each, where the weight at index i in array w corresponds to the input at index i in array x.


To return the output, each value in the vector x is multiplied by its corresponding value in the vector w. The good news is that NumPy supports multiplying arrays value by value. Just w*x returns a new array with 10 values representing the product of each input by its weight. After summing all of these products according to the next line, the SOP is returned.
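A minimal sketch of the element-wise product followed by the sum is given below. The input values are illustrative, and the intermediate name products is introduced here only to make the two steps visible.

```python
import numpy

x = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8, 2.0, 0.01, 0.9, 0.8, 1.6])
w = numpy.random.rand(10)

# Element-wise product: products[i] == w[i] * x[i].
products = w * x

# Summing the 10 products gives the SOP.
sop = numpy.sum(products)

print(products.shape, sop)
```

For 1-D arrays, numpy.dot(w, x) computes the same quantity in a single call; the two-step form is kept here because it mirrors the description in the text.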

Calculating the SOP to weights derivatives

The next code to be edited is responsible for calculating the derivatives of the SOP with respect to the weights.

Rather than calling the sop_w_deriv() function for calculating the derivative for each individual weight, a better way is simply passing the array x to this function according to the next line.

When a NumPy array is passed to the function, the function is called only once and returns a new NumPy array of the same size as the passed array. The effect is the same as calling the sop_w_deriv() function 10 times, once for each value in the array x, because NumPy operations work on arrays element by element. The output of this function, g3, is a NumPy array of the same shape as the input array x.
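The sketch below assumes that sop_w_deriv() simply returns its argument, which follows from the fact that the derivative of the SOP with respect to a weight is the input that weight multiplies. The exact function body is an assumption, as are the input values.

```python
import numpy

def sop_w_deriv(x):
    # dSOP/dWi = Xi, so the derivative is the input itself.
    return x

x = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8, 2.0, 0.01, 0.9, 0.8, 1.6])

# A single call covers all 10 weights.
g3 = sop_w_deriv(x)

print(g3.shape)
```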

Calculating the weights gradients

The fifth point to be edited in the code to allow it to work with any number of inputs is the part responsible for calculating the weights gradients as given in the next code.


Rather than adding a new line for multiplying all derivatives in the chain for each weight, simply use the next line.

Remember that g1 and g2 each hold a single value, whereas g3 is a NumPy array holding 10 values. The product can therefore be regarded as multiplying an array by a scalar value: NumPy multiplies every element of g3 by g2 and by g1. The returned output stored in the grad variable is a NumPy array with 10 values, one for each weight.
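The array-by-scalar multiplication can be illustrated with placeholder numbers. In this sketch the values of g1 and g2 are arbitrary stand-ins rather than derivatives computed from a network.

```python
import numpy

g1 = 0.25  # placeholder for dError/dPredicted
g2 = 0.2   # placeholder for dPredicted/dSOP
g3 = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8, 2.0, 0.01, 0.9, 0.8, 1.6])

# Each element of g3 is scaled by the two shared derivatives.
grad = g3 * g2 * g1

print(grad.shape)
```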

Updating the weights

The final code part to be edited is responsible for updating the weights, which is listed in the next block.

Rather than calling the update_w() function for each weight, simply pass the grad array calculated in the previous step along with the weights array w to this function, as given in the next line. Inside the function, the weights update equation is applied to each pair of a weight and its gradient. It returns a new array of 10 values representing the new weights that will be used in the next iteration.

After making all the necessary edits to make the code work with any number of inputs, here is the final optimized code. It shows the previous code but commented as a way of mapping each part with its edit.
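A minimal sketch of such a listing for a single iteration is given below. The commented lines indicate the duplicated statements that each array-based line replaces. The helper bodies, the input values, the target, and the learning rate are assumptions; the names sop_w_deriv(), update_w(), g1, g2, g3, and grad follow the text.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-1 * sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

# x1 = 0.1, x2 = 0.4, ..., x10 = 1.6 (one line per input)
x = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8, 2.0, 0.01, 0.9, 0.8, 1.6])
target = 0.7
learning_rate = 0.001

# w1 = numpy.random.rand(), ..., w10 = numpy.random.rand()
w = numpy.random.rand(10)

# sop = x1*w1 + x2*w2 + ... + x10*w10
sop = numpy.sum(w * x)
predicted = sigmoid(sop)
err = error(predicted, target)

g1 = error_predicted_deriv(predicted, target)
g2 = sigmoid_sop_deriv(sop)

# g3w1 = sop_w_deriv(x1), ..., g3w10 = sop_w_deriv(x10)
g3 = sop_w_deriv(x)

# gradw1 = g3w1*g2*g1, ..., gradw10 = g3w10*g2*g1
grad = g3 * g2 * g1

# w1 = update_w(w1, gradw1, learning_rate), ..., one call per weight
w = update_w(w, grad, learning_rate)
```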


To make the code clearer, all comments are removed and here is the new code.
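The following sketch shows a comment-light version wrapped in a training loop. It is an illustration under assumptions, not the book's exact listing: the helper bodies, input values, target, and learning rate are chosen for the example, the 80,000 iterations mirror the count used for the 10-input network earlier, and the errors list is added here only to record the error per iteration.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-1 * sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

x = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8, 2.0, 0.01, 0.9, 0.8, 1.6])
target = 0.7
learning_rate = 0.001
w = numpy.random.rand(10)

errors = []
for k in range(80000):
    # Forward pass
    sop = numpy.sum(w * x)
    predicted = sigmoid(sop)
    err = error(predicted, target)
    errors.append(err)

    # Backward pass
    g1 = error_predicted_deriv(predicted, target)
    g2 = sigmoid_sop_deriv(sop)
    g3 = sop_w_deriv(x)
    grad = g3 * g2 * g1
    w = update_w(w, grad, learning_rate)

print("final prediction:", predicted, "final error:", err)
```

Nothing inside the loop refers to the number of inputs, which is what makes the code independent of the size of x.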


If a network with 5 inputs is to be created, then just 2 changes are made to the previous code:
1. Preparing an input array x to just have 5 values.
2. Replacing 10 by 5 inside numpy.random.rand() for returning the weights array.
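The two changes can be sketched as follows, with illustrative input values. The commented alternative, which sizes the weights array from x.shape[0], is a suggestion that goes beyond the original text; it would reduce the two changes to one.

```python
import numpy

# Change 1: the input array holds 5 values.
x = numpy.array([0.1, 0.4, 1.1, 1.3, 1.8])

# Change 2: 5 replaces 10 inside numpy.random.rand().
w = numpy.random.rand(5)

# Possible alternative (not from the text): derive the size from x.
# w = numpy.random.rand(x.shape[0])

# The rest of the code is unchanged, for example the SOP.
sop = numpy.sum(w * x)
print(sop)
```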


Conclusion

This chapter implemented 2 ANN architectures. The first one has just 2 inputs and the second one has 10 inputs. Based on these 2 examples, the chapter discussed optimizing the code to avoid line duplication when the number of inputs increases or decreases. The final code of this chapter is able to work with any number of inputs, but it still works with just a single output and cannot work with any hidden layers. In Chapter 5, this implementation will be extended to work with hidden layers.

Chapter 5

Working with hidden layers

Chapter outline
ANN with 1 hidden layer with 2 neurons
Forward pass
Forward pass math calculations
Backward pass
Output layer weights
Hidden layer weights
Backward pass math calculations
Output layer weights gradients
Hidden layer weights gradients
Updating weights
Python™ implementation
Forward pass
Backward pass
Complete code
Conclusion

ABSTRACT

The latest neural network Python implementation built in Chapter 4 supports working with any number of inputs but without hidden layers. This chapter extends the implementation to work with a single hidden layer with just 2 hidden neurons. In later chapters, more hidden layers and neurons will be supported.

ANN with 1 hidden layer with 2 neurons

The input layer was first built with a fixed number of inputs and then generalized to work with any number of inputs. Following the same strategy, this chapter starts by building a hidden layer with just 2 neurons. This section discusses the network architecture shown in Fig. 5.1. The network has 3 inputs, 1 hidden layer with 2 neurons, and 1 output neuron. Each of the 3 inputs is connected to the 2 hidden neurons. Thus, there are 3 × 2 = 6 connections. For each of the 6 connections, there is a different weight. The weights between the input and hidden layers are labeled as Wzy, where z refers to the input layer neuron index and y refers to the index of the hidden neuron. Note that the weights between the layers with indices K-1 and K can be directly called the weights of layer K.

The weight for the connection between the first input X1 and the first hidden neuron is W11. The weight W12 is for the connection between X1 and the second hidden neuron. Regarding X2, the weights W21 and W22 are for the connections to the first and second hidden neurons, respectively. Similarly, X3 has 2 weights, W31 and W32.

Introduction to Deep Learning and Neural Networks with Python™. https://doi.org/10.1016/B978-0-323-90933-4.00004-8 © 2021 Elsevier Inc. All rights reserved.


FIG. 5.1 ANN architecture with 3 inputs, 1 output, and 1 hidden layer with 2 neurons.

In addition to the weights between the input and hidden layers, there are 2 weights connecting the 2 hidden neurons to the output neuron which are W41 and W42. How does the gradient descent algorithm work with these parameters? The answer will be clear after discussing the theory of the forward and backward passes. The next section discusses the theory of the forward pass.

Forward pass

In the forward pass, the neurons in the hidden layer accept the inputs from the input layer in addition to their weights. For each hidden neuron, the sum of products (SOP) between the inputs and their weights is calculated. The first hidden neuron accepts the 3 inputs X1, X2, and X3 in addition to their weights W11, W21, and W31. The SOP for this neuron is calculated by summing the products between each input and its weight, as in the next equation.

$$SOP_1 = X_1W_{11} + X_2W_{21} + X_3W_{31}$$

For reference, the SOP for the first hidden neuron is labeled SOP1 in Fig. 5.1. For the second hidden neuron, its SOP, which is labeled SOP2, is calculated in the next equation.

$$SOP_2 = X_1W_{12} + X_2W_{22} + X_3W_{32}$$

After calculating the SOP for the 2 hidden neurons, next is to feed the SOP of each neuron to an activation function. Remember that the function used up to this time is the sigmoid function, which is calculated as given in the next equation.

$$sigmoid(SOP) = \frac{1}{1 + e^{-SOP}}$$

By feeding SOP1 to the sigmoid function, the result is Activ1 as calculated by the next equation.

$$Activ_1 = \frac{1}{1 + e^{-SOP_1}}$$


For the second hidden neuron, its activation function output is Activ2 as calculated by the next equation:

$$Activ_2 = \frac{1}{1 + e^{-SOP_2}}$$

Remember that in the forward pass, the outputs of a layer are regarded as the inputs to the next layer. That is, the outputs of the hidden layer, which are Activ1 and Activ2 as labeled in Fig. 5.1, are regarded as the inputs to the output layer. The process repeats for calculating the SOP in the output layer neuron. Each input to the output neuron has a weight. For the first input Activ1, its weight is W41. For the second input Activ2, its weight is W42. The SOP for the output neuron is labeled SOP3 and calculated as follows:

$$SOP_3 = Activ_1W_{41} + Activ_2W_{42}$$

SOP3 is fed to the sigmoid function to return Activ3 as given in the next equation.

$$Predicted = Activ_3 = \frac{1}{1 + e^{-SOP_3}}$$

Note that the output of the activation function Activ3 is regarded as the predicted output of the network. After the network makes its prediction, next is to calculate the error using the squared error function.

$$error = (Predicted - Target)^2$$

To have better understanding, the next section discusses an example to go through the math calculations behind the forward pass.

Forward pass math calculations

According to the architecture in Fig. 5.1, there are 3 inputs (X1, X2, and X3) and 1 output Y. The values of the 3 inputs and the output of a single sample are listed in Table 5.1. Regarding the network weights, Table 5.2 lists the weights for the first neuron in the hidden layer. The weights for the second hidden neuron are listed in Table 5.3. The final weights are the ones connected to the output neuron, which are given in Table 5.4.

TABLE 5.1 Sample input for architecture in Fig. 5.1.

| X1  | X2  | X3  | Y   |
|-----|-----|-----|-----|
| 0.1 | 0.4 | 4.1 | 0.2 |

TABLE 5.2 Initial weights for the first neuron in the hidden layer.

| W11   | W21   | W31   |
|-------|-------|-------|
| 0.481 | 0.299 | 0.192 |

TABLE 5.3 Initial weights for the second neuron in the hidden layer.

| W12   | W22   | W32   |
|-------|-------|-------|
| 0.329 | 0.548 | 0.214 |

TABLE 5.4 Initial weights for the output neuron.

| W41   | W42   |
|-------|-------|
| 0.882 | 0.567 |

For the first neuron in the hidden layer, the next equation calculates its SOP (SOP1). The result is SOP1 = 0.9549.

$$SOP_1 = X_1W_{11} + X_2W_{21} + X_3W_{31} = 0.1(0.481) + 0.4(0.299) + 4.1(0.192) = 0.9549$$

The next equation calculates the SOP for the second hidden neuron, which is SOP2 = 1.1295.

$$SOP_2 = X_1W_{12} + X_2W_{22} + X_3W_{32} = 0.1(0.329) + 0.4(0.548) + 4.1(0.214) = 1.1295$$

After feeding SOP1 and SOP2 to the sigmoid function, the result is calculated according to the next equations.

$$Activ_1 = \frac{1}{1 + e^{-SOP_1}} = \frac{1}{1 + e^{-0.9549}} = 0.722$$

$$Activ_2 = \frac{1}{1 + e^{-SOP_2}} = \frac{1}{1 + e^{-1.1295}} = 0.756$$

The outputs of the hidden layer, Activ1 = 0.722 and Activ2 = 0.756, are regarded as the inputs to the next layer, which is the output layer. As a result, the values of the output layer neuron's inputs are 0.722 and 0.756. The next equation calculates the SOP for this neuron, which is SOP3 = 1.066.

$$SOP_3 = Activ_1W_{41} + Activ_2W_{42} = 0.722(0.882) + 0.756(0.567) = 1.066$$


SOP3 is fed to the sigmoid function to return the predicted output as calculated in the next equation. The predicted output is 0.744.

$$Predicted = Activ_3 = \frac{1}{1 + e^{-SOP_3}} = \frac{1}{1 + e^{-1.066}} = 0.744$$

After the predicted output is calculated, next is to calculate the prediction error according to the next equation, which results in an error equal to 0.296.

$$error = (Predicted - Target)^2 = (0.744 - 0.2)^2 = 0.296$$
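The forward pass calculations can be reproduced with a few lines of NumPy. This is a minimal sketch using the values of Tables 5.1 to 5.4; the variable names w_hidden1, w_hidden2, and w_output are illustrative. Because the code carries unrounded intermediate values, its results may differ from the hand calculations in the third decimal place (for example, SOP3 comes out close to 1.065).

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-sop))

x = numpy.array([0.1, 0.4, 4.1])                # X1, X2, X3 (Table 5.1)
target = 0.2                                    # Y (Table 5.1)
w_hidden1 = numpy.array([0.481, 0.299, 0.192])  # W11, W21, W31 (Table 5.2)
w_hidden2 = numpy.array([0.329, 0.548, 0.214])  # W12, W22, W32 (Table 5.3)
w_output = numpy.array([0.882, 0.567])          # W41, W42 (Table 5.4)

# Hidden layer
sop1 = numpy.sum(w_hidden1 * x)
sop2 = numpy.sum(w_hidden2 * x)
activ1 = sigmoid(sop1)
activ2 = sigmoid(sop2)

# Output layer
sop3 = activ1 * w_output[0] + activ2 * w_output[1]
predicted = sigmoid(sop3)
err = (predicted - target) ** 2

print("SOP1, SOP2:", sop1, sop2)
print("Activ1, Activ2:", activ1, activ2)
print("SOP3, Predicted, error:", sop3, predicted, err)
```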

Calculating the prediction error of the network signals the end of the forward pass. The next section discusses the theory of the backward pass.

Backward pass

In the backward pass, the goal is to calculate the gradients that update the network weights. Because we start from where the forward pass ended, the gradients of the last layer are calculated first, and the calculation then moves backward until reaching the input layer. Let's start by calculating the gradients of the weights between the hidden layer and the output layer.

Output layer weights

Because there is no explicit equation relating both the error and the output layer's weights (W41 and W42), it is preferred to use the chain rule. What are the derivatives in the chain from the error to the output layer's weights? Starting with the first weight, we need to find the derivative of the error with respect to W41. The error function is used for this purpose.

$$error = (Predicted - Target)^2$$

The error function has 2 terms, which are:
1. Predicted
2. Target

Because the error function does not include the weight W41 as an explicit term, one of these terms should lead to that weight. Of the 2 terms in the error function, which one leads to the weight W41? It is Predicted, because the other term, Target, is constant. The first derivative to calculate is the derivative of the error with respect to the predicted output, as calculated in the next equation.

$$\frac{dError}{dPredicted} = 2(Predicted - Target)$$


Next is to calculate the derivative of Predicted to SOP3 by substituting SOP3 into the derivative of the sigmoid function, as given in the next equation.

$$\frac{dPredicted}{dSOP_3} = \frac{1}{1 + e^{-SOP_3}}\left(1 - \frac{1}{1 + e^{-SOP_3}}\right)$$

The next derivative is the SOP3 to W41 derivative. To follow up, here is the equation that relates both SOP3 and W41.

$$SOP_3 = Activ_1W_{41} + Activ_2W_{42}$$

The derivative of SOP3 to W41 is given in the next equation.

$$\frac{dSOP_3}{dW_{41}} = Activ_1$$

By calculating all derivatives in the chain from the error to W41, the W41 gradient is calculated by multiplying all of these derivatives, as given in the next equation.

$$\frac{dError}{dW_{41}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dW_{41}}$$

Similar to calculating the error to W41 derivative, the error to W42 derivative is easily calculated. The only term that changes from the previous equation is the last one. Rather than calculating the SOP3 to W41 derivative, now the SOP3 to W42 derivative is calculated, which is given in the next equation.

$$\frac{dSOP_3}{dW_{42}} = Activ_2$$

Finally, the error to W42 gradient is calculated according to the next equation.

$$\frac{dError}{dW_{42}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dW_{42}}$$

At this point, the gradients for all weights between the hidden layer and the output layer are successfully calculated. Next is to calculate the gradients for the weights between the input layer and the hidden layer.

Hidden layer weights

The generic chain of derivatives from the error to any of the weights of the hidden layer is given in the next equation, where Wzy means the weight connecting the input neuron with index z to the hidden neuron with index y.

$$\frac{dError}{dW_{zy}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_y} \cdot \frac{dActiv_y}{dSOP_y} \cdot \frac{dSOP_y}{dW_{zy}}$$


Of the derivatives in the chain, the first 2 derivatives are the first 2 ones used in the previous chain, which are:
1. Error to Predicted derivative.
2. Predicted to SOP3 derivative.

The next derivative in the chain is the derivative of SOP3 with respect to Activ1 and Activ2. The derivative of SOP3 to Activ1 helps to calculate the gradients of the weights connected to the first hidden neuron, which are W11, W21, and W31. The derivative of SOP3 to Activ2 helps to calculate the gradients of the weights connected to the second hidden neuron, which are W12, W22, and W32. Starting with Activ1, here is the equation relating SOP3 to Activ1.

$$SOP_3 = Activ_1W_{41} + Activ_2W_{42}$$

The SOP3 to Activ1 derivative is calculated as given in the next equation.

$$\frac{dSOP_3}{dActiv_1} = W_{41}$$

Similarly, the SOP3 to Activ2 derivative is calculated as given in the next equation.

$$\frac{dSOP_3}{dActiv_2} = W_{42}$$

After calculating the derivatives of SOP3 to both Activ1 and Activ2, the next derivatives in the chain to be calculated are:
1. The derivative of Activ1 to SOP1.
2. The derivative of Activ2 to SOP2.

The derivative of Activ1 to SOP1 is calculated by substituting SOP1 into the sigmoid function's derivative, as given in the next equation. The resulting derivative will be used for updating the weights of the first hidden neuron, which are W11, W21, and W31.

$$\frac{dActiv_1}{dSOP_1} = \frac{1}{1 + e^{-SOP_1}}\left(1 - \frac{1}{1 + e^{-SOP_1}}\right)$$

Similarly, the Activ2 to SOP2 derivative is calculated according to the next equation. This will be used for updating the weights of the second hidden neuron, which are W12, W22, and W32.

$$\frac{dActiv_2}{dSOP_2} = \frac{1}{1 + e^{-SOP_2}}\left(1 - \frac{1}{1 + e^{-SOP_2}}\right)$$

In order to update the first hidden neuron's weights W11, W21, and W31, the last derivative to calculate is the derivative of SOP1 with respect to each of these weights. Here is the equation relating SOP1 to all of these weights.

$$SOP_1 = X_1W_{11} + X_2W_{21} + X_3W_{31}$$


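The derivatives of SOP1 to all of these 3 weights are given in the next equations.

$$\frac{dSOP_1}{dW_{11}} = X_1 \qquad \frac{dSOP_1}{dW_{21}} = X_2 \qquad \frac{dSOP_1}{dW_{31}} = X_3$$

Here is the equation relating SOP2 to the second hidden neuron's weights W12, W22, and W32.

$$SOP_2 = X_1W_{12} + X_2W_{22} + X_3W_{32}$$

The derivatives of SOP2 to W12, W22, and W32 are given in the next equations.

$$\frac{dSOP_2}{dW_{12}} = X_1 \qquad \frac{dSOP_2}{dW_{22}} = X_2 \qquad \frac{dSOP_2}{dW_{32}} = X_3$$

After calculating all derivatives in the chains from the error to all hidden weights, next is to multiply them for calculating the gradient of each weight. For the weights connected to the first hidden neuron (W11, W21, and W31), their gradients are calculated using the chains given in the next equations. Note that all of these chains share all derivatives except for the last one.

$$\begin{aligned}
\frac{dError}{dW_{11}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{11}} \\
\frac{dError}{dW_{21}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{21}} \\
\frac{dError}{dW_{31}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{31}}
\end{aligned}$$

For the weights connected to the second hidden neuron (W12, W22, and W32), their gradients are calculated using the chains given in the next equations. Note that all of these chains share all derivatives except for the last derivative.

$$\begin{aligned}
\frac{dError}{dW_{12}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{12}} \\
\frac{dError}{dW_{22}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{22}} \\
\frac{dError}{dW_{32}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{32}}
\end{aligned}$$

At this time, the chains for calculating the gradients for all weights in the entire network are successfully prepared. The next equations summarize these chains.

$$\begin{aligned}
\frac{dError}{dW_{41}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dW_{41}} \\
\frac{dError}{dW_{42}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dW_{42}} \\
\frac{dError}{dW_{11}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{11}} \\
\frac{dError}{dW_{21}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{21}} \\
\frac{dError}{dW_{31}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{31}} \\
\frac{dError}{dW_{12}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{12}} \\
\frac{dError}{dW_{22}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{22}} \\
\frac{dError}{dW_{32}} &= \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{32}}
\end{aligned}$$

After calculating all gradients, next is to update the weights according to the next equation.

$$W_{new} = W_{old} - learning\_rate \cdot grad$$

Having discussed the steps of calculating the gradients and updating the weights, the next section continues the math example started previously to do the backward pass calculations.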


Backward pass math calculations

In this section, the values for all derivatives are calculated, followed by calculating the weights' gradients. Of all derivatives in the chains, the first 2 derivatives are shared across all the chains. Given the values of the predicted and target outputs, the first derivative in all chains is calculated in the next equation.

$$\frac{dError}{dPredicted} = 2(Predicted - Target) = 2(0.744 - 0.2) = 1.088$$

The second derivative in all chains is between Predicted and SOP3, which is calculated according to the next equation.

$$\frac{dPredicted}{dSOP_3} = \frac{1}{1 + e^{-SOP_3}}\left(1 - \frac{1}{1 + e^{-SOP_3}}\right) = \frac{1}{1 + e^{-1.066}}\left(1 - \frac{1}{1 + e^{-1.066}}\right) = 0.191$$

Besides the first 2 derivatives, the others change for some chains. The next subsection calculates the derivatives for the output layer. The subsequent subsection works on the derivatives of the hidden layer.

Output layer weights gradients

For calculating the gradients of the 2 output layer's weights W41 and W42, there are 2 remaining derivatives in the chain, which are:
1. The derivative of SOP3 to W41.
2. The derivative of SOP3 to W42.

These 2 derivatives are calculated in the next equations.

$$\frac{dSOP_3}{dW_{41}} = Activ_1 = 0.722$$

$$\frac{dSOP_3}{dW_{42}} = Activ_2 = 0.756$$

Once all derivatives in the chain connecting the error to the 2 output layer's weights W41 and W42 are prepared, the gradients can be calculated as in the next equations. The gradients are 0.15 and 0.157.

$$\frac{dError}{dW_{41}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dW_{41}} = 1.088(0.191)(0.722) = 0.15$$

$$\frac{dError}{dW_{42}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dW_{42}} = 1.088(0.191)(0.756) = 0.157$$

After the gradients for W41 and W42 are calculated, the next section works on calculating the gradients of the hidden neurons.
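The two output layer gradients can be checked numerically. The sketch below computes them with the chain rule and, as an additional sanity check that is not part of the original text, compares them against central finite differences of the error. The names w_hidden1, w_hidden2, w_output, network_error, and numeric are illustrative, and small differences from the hand-calculated values are expected because the code does not round intermediate results.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-sop))

x = numpy.array([0.1, 0.4, 4.1])
target = 0.2
w_hidden1 = numpy.array([0.481, 0.299, 0.192])
w_hidden2 = numpy.array([0.329, 0.548, 0.214])
w_output = numpy.array([0.882, 0.567])  # W41, W42

# Hidden layer outputs (Activ1, Activ2) do not depend on W41 or W42.
activ = numpy.array([sigmoid(numpy.sum(w_hidden1 * x)),
                     sigmoid(numpy.sum(w_hidden2 * x))])

def network_error(w_out):
    # Error as a function of the output layer weights only.
    predicted = sigmoid(numpy.sum(w_out * activ))
    return (predicted - target) ** 2

# Chain rule
predicted = sigmoid(numpy.sum(w_output * activ))
g1 = 2 * (predicted - target)     # dError/dPredicted
g2 = predicted * (1 - predicted)  # dPredicted/dSOP3
g3 = activ                        # dSOP3/dW41 and dSOP3/dW42
grad_output = g3 * g2 * g1

# Central finite differences for comparison
eps = 1e-6
numeric = numpy.zeros(2)
for i in range(2):
    shift = numpy.zeros(2)
    shift[i] = eps
    numeric[i] = (network_error(w_output + shift)
                  - network_error(w_output - shift)) / (2 * eps)

print("chain rule:", grad_output)
print("numerical: ", numeric)
```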


Hidden layer weights gradients

According to the chains of derivatives of the hidden neurons, the next 2 derivatives to be calculated are:
1. The derivative of SOP3 to Activ1.
2. The derivative of SOP3 to Activ2.

These 2 derivatives are calculated in the next equations.

$$\frac{dSOP_3}{dActiv_1} = W_{41} = 0.882$$

$$\frac{dSOP_3}{dActiv_2} = W_{42} = 0.567$$

The next 2 derivatives are:
1. The derivative of Activ1 to SOP1.
2. The derivative of Activ2 to SOP2.

These derivatives are calculated according to the next equations.

$$\frac{dActiv_1}{dSOP_1} = \frac{1}{1 + e^{-SOP_1}}\left(1 - \frac{1}{1 + e^{-SOP_1}}\right) = \frac{1}{1 + e^{-0.9549}}\left(1 - \frac{1}{1 + e^{-0.9549}}\right) = 0.2$$

$$\frac{dActiv_2}{dSOP_2} = \frac{1}{1 + e^{-SOP_2}}\left(1 - \frac{1}{1 + e^{-SOP_2}}\right) = \frac{1}{1 + e^{-1.1295}}\left(1 - \frac{1}{1 + e^{-1.1295}}\right) = 0.185$$

Before calculating the gradients for the weights of the first hidden neuron, there are 3 derivatives to be calculated, which are:
1. The derivative of SOP1 to W11.
2. The derivative of SOP1 to W21.
3. The derivative of SOP1 to W31.

Their calculations are given in the next equations.

$$\frac{dSOP_1}{dW_{11}} = X_1 = 0.1 \qquad \frac{dSOP_1}{dW_{21}} = X_2 = 0.4 \qquad \frac{dSOP_1}{dW_{31}} = X_3 = 4.1$$


By multiplying the derivatives in the chain from the error to each of the 3 weights of the first hidden neuron (W11, W21, and W31), their gradients are calculated according to the next equations. The gradients are 0.004, 0.015, and 0.15.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{11}} = 1.088(0.191)(0.882)(0.2)(0.1) = 0.004$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{21}} = 1.088(0.191)(0.882)(0.2)(0.4) = 0.015$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_1} \cdot \frac{dActiv_1}{dSOP_1} \cdot \frac{dSOP_1}{dW_{31}} = 1.088(0.191)(0.882)(0.2)(4.1) = 0.15$$

For the 3 weights of the second hidden neuron (W12, W22, and W32), there are 3 remaining derivatives to be calculated, which are:
1. The derivative of SOP2 to W12.
2. The derivative of SOP2 to W22.
3. The derivative of SOP2 to W32.

These derivatives are calculated according to the next equations.

$$\frac{dSOP_2}{dW_{12}} = X_1 = 0.1 \qquad \frac{dSOP_2}{dW_{22}} = X_2 = 0.4 \qquad \frac{dSOP_2}{dW_{32}} = X_3 = 4.1$$

By multiplying the derivatives in the chain from the error to each of the 3 weights of the second hidden neuron (W12, W22, and W32), their gradients are calculated according to the next equations. The gradients are 0.002, 0.009, and 0.089.

$$\frac{dError}{dW_{12}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{12}} = 1.088(0.191)(0.567)(0.185)(0.1) = 0.002$$

$$\frac{dError}{dW_{22}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{22}} = 1.088(0.191)(0.567)(0.185)(0.4) = 0.009$$

$$\frac{dError}{dW_{32}} = \frac{dError}{dPredicted} \cdot \frac{dPredicted}{dSOP_3} \cdot \frac{dSOP_3}{dActiv_2} \cdot \frac{dActiv_2}{dSOP_2} \cdot \frac{dSOP_2}{dW_{32}} = 1.088(0.191)(0.567)(0.185)(4.1) = 0.089$$
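The six hidden layer gradients can be reproduced with the same chain of multiplications. This is a minimal sketch with descriptive, illustrative variable names; because it does not round intermediate values, its results agree with the hand calculations only to about three decimal places.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-sop))

x = numpy.array([0.1, 0.4, 4.1])
target = 0.2
w_hidden1 = numpy.array([0.481, 0.299, 0.192])  # W11, W21, W31
w_hidden2 = numpy.array([0.329, 0.548, 0.214])  # W12, W22, W32
w_output = numpy.array([0.882, 0.567])          # W41, W42

# Forward pass
activ1 = sigmoid(numpy.sum(w_hidden1 * x))
activ2 = sigmoid(numpy.sum(w_hidden2 * x))
predicted = sigmoid(w_output[0] * activ1 + w_output[1] * activ2)

# Derivatives shared by every chain
d_error_predicted = 2 * (predicted - target)
d_predicted_sop3 = predicted * (1 - predicted)
shared = d_error_predicted * d_predicted_sop3

# First hidden neuron: dSOP3/dActiv1 = W41, dSOP1/dWz1 = Xz
d_activ1_sop1 = activ1 * (1 - activ1)
grad_hidden1 = shared * w_output[0] * d_activ1_sop1 * x

# Second hidden neuron: dSOP3/dActiv2 = W42, dSOP2/dWz2 = Xz
d_activ2_sop2 = activ2 * (1 - activ2)
grad_hidden2 = shared * w_output[1] * d_activ2_sop2 * x

print("gradients for W11, W21, W31:", grad_hidden1)
print("gradients for W12, W22, W32:", grad_hidden2)
```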

By calculating the gradients for all weights in the network, the next subsection updates the weights.

Updating weights

After calculating the gradients for all weights in the network, the next equations update all network weights, assuming that the learning_rate is 0.001.

$$\begin{aligned}
W_{11new} &= W_{11old} - learning\_rate \cdot \frac{dError}{dW_{11}} = 0.481 - 0.001(0.004) = 0.480996 \\
W_{21new} &= W_{21old} - learning\_rate \cdot \frac{dError}{dW_{21}} = 0.299 - 0.001(0.015) = 0.298985 \\
W_{31new} &= W_{31old} - learning\_rate \cdot \frac{dError}{dW_{31}} = 0.192 - 0.001(0.15) = 0.19185 \\
W_{12new} &= W_{12old} - learning\_rate \cdot \frac{dError}{dW_{12}} = 0.329 - 0.001(0.002) = 0.328998 \\
W_{22new} &= W_{22old} - learning\_rate \cdot \frac{dError}{dW_{22}} = 0.548 - 0.001(0.009) = 0.547991 \\
W_{32new} &= W_{32old} - learning\_rate \cdot \frac{dError}{dW_{32}} = 0.214 - 0.001(0.089) = 0.213911 \\
W_{41new} &= W_{41old} - learning\_rate \cdot \frac{dError}{dW_{41}} = 0.882 - 0.001(0.15) = 0.88185 \\
W_{42new} &= W_{42old} - learning\_rate \cdot \frac{dError}{dW_{42}} = 0.567 - 0.001(0.157) = 0.566843
\end{aligned}$$

At this time, the network weights have been updated in only 1 iteration. The forward and backward pass calculations could be repeated for a number of iterations until reaching the desired output. If the calculations are repeated only once, the error will be reduced from 0.296 to 0.29543095. That is, the error reduction is only 0.000569049. Note that setting the learning rate to a value higher than 0.001 may help increase the speed of error reduction.
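The update equations can be applied to all weights of a neuron at once. The sketch below plugs the rounded gradients from the hand calculations into an update_w() helper; the function body and the array names are assumptions made for illustration.

```python
import numpy

def update_w(w, grad, learning_rate):
    # W_new = W_old - learning_rate * grad, applied element by element.
    return w - learning_rate * grad

learning_rate = 0.001

w_hidden1 = numpy.array([0.481, 0.299, 0.192])    # W11, W21, W31
w_hidden2 = numpy.array([0.329, 0.548, 0.214])    # W12, W22, W32
w_output = numpy.array([0.882, 0.567])            # W41, W42

grad_hidden1 = numpy.array([0.004, 0.015, 0.15])  # rounded gradients
grad_hidden2 = numpy.array([0.002, 0.009, 0.089])
grad_output = numpy.array([0.15, 0.157])

w_hidden1_new = update_w(w_hidden1, grad_hidden1, learning_rate)
w_hidden2_new = update_w(w_hidden2, grad_hidden2, learning_rate)
w_output_new = update_w(w_output, grad_output, learning_rate)

print(w_hidden1_new, w_hidden2_new, w_output_new)
```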


After understanding the theory behind how the ANN architecture of this chapter works in both the forward and backward passes, the next section starts its Python implementation. Note that the implementation is highly dependent on the implementations developed previously in Chapters 3 and 4. Hence, it is very important to have a solid understanding of how the previous implementations work before building over them.

Python™ implementation

The complete code that implements an ANN with 3 inputs, 1 hidden layer with 2 neurons, and 1 output neuron, and that optimizes it using the gradient descent algorithm, is listed below.


At first, the inputs and the output are prepared using these 2 lines.

The network weights are prepared according to the next lines, which define the following 3 variables:
1. w1_3: An array holding the 3 weights connecting the 3 inputs to the first hidden neuron (W11, W21, and W31).
2. w2_3: An array holding the 3 weights connecting the 3 inputs to the second hidden neuron (W12, W22, and W32).
3. w3_2: An array with 2 weights for the connections between the hidden layer neurons and the output neuron (W41 and W42).

After preparing the inputs and the weights, the next section works through the forward pass.

Forward pass The code of the forward pass is listed in the next block. It starts by calculating the sum of products for the 2 hidden neurons and saving them into the variables sop1 and sop2. These 2 variables are passed to the sigmoid() function and the results are saved in the variables sig1 and sig2. These 2 variables are multiplied by the 2 weights connected to the output neuron to return sop3. sop3 is also applied as input to the sigmoid() function to return the predicted output. Finally, the error is calculated.
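The forward pass described above can be sketched as follows. The variable names (w1_3, w2_3, w3_2, sop1, sop2, sig1, sig2, sop3) and the sigmoid() function come from the text, and the weights are the ones used in the numerical example. The input values in x are an assumption made for illustration; the target of 0.2 is the desired output mentioned later in this chapter. The error() function is assumed to be the squared error.

```python
import numpy

def sigmoid(sop):
    # Logistic activation: 1 / (1 + e^(-SOP)).
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    # Squared error between the prediction and the target.
    return numpy.power(predicted - target, 2)

# Assumed training sample (the input values are illustrative).
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])

# Weights from the numerical example.
w1_3 = numpy.array([0.481, 0.299, 0.192])  # W11, W21, W31
w2_3 = numpy.array([0.329, 0.548, 0.214])  # W12, W22, W32
w3_2 = numpy.array([0.882, 0.567])         # W41, W42

# Sum of products and activation for the 2 hidden neurons.
sop1 = numpy.sum(w1_3 * x)
sop2 = numpy.sum(w2_3 * x)
sig1 = sigmoid(sop1)
sig2 = sigmoid(sop2)

# Sum of products and activation for the output neuron.
sop3 = numpy.sum(w3_2 * numpy.array([sig1, sig2]))
predicted = sigmoid(sop3)

err = error(predicted, target)
print(predicted, err)
```

With these assumed inputs, the computed error is consistent with the value of 0.296 quoted in the worked example.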

After the forward pass is complete, next is to go through the backward pass.

Backward pass
The part of the code responsible for updating the weights between the hidden and output layers is given in the next code. The derivative of the error to the predicted output is calculated and saved in the variable g1. g2 holds the predicted output to SOP3 derivative. The derivatives of SOP3 to both W41 and W42 are calculated and saved in the vector g3. Note that g1 and g2 will be used again while calculating the gradients of the hidden neurons. After calculating all derivatives required for the gradients of the weights W41 and W42, the gradients are calculated and saved in the grad_hidden_output vector. Finally, these 2 weights are updated using the update_w() function by passing the old weights, gradients, and learning rate.
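A minimal sketch of this step is given below. The names g1, g2, g3, grad_hidden_output, and update_w() follow the text; the helper names error_predicted_deriv(), sigmoid_sop_deriv(), and sop_w_deriv() are assumptions introduced here for illustration, as are the input values in x.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error_predicted_deriv(predicted, target):
    # dError/dPredicted for the squared error.
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    # dActiv/dSOP = sigmoid(SOP) * (1 - sigmoid(SOP)).
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    # dSOP/dW is the input attached to that weight.
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

# Assumed sample and the example weights.
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])
learning_rate = 0.001
w1_3 = numpy.array([0.481, 0.299, 0.192])
w2_3 = numpy.array([0.329, 0.548, 0.214])
w3_2 = numpy.array([0.882, 0.567])

# Forward pass.
sig1 = sigmoid(numpy.sum(w1_3 * x))
sig2 = sigmoid(numpy.sum(w2_3 * x))
sop3 = numpy.sum(w3_2 * numpy.array([sig1, sig2]))
predicted = sigmoid(sop3)

# Backward pass for W41 and W42.
g1 = error_predicted_deriv(predicted, target)  # dError/dPredicted
g2 = sigmoid_sop_deriv(sop3)                   # dPredicted/dSOP3
g3 = sop_w_deriv(numpy.array([sig1, sig2]))    # dSOP3/dW41, dSOP3/dW42
grad_hidden_output = g3 * g2 * g1
w3_2 = update_w(w3_2, grad_hidden_output, learning_rate)
print(grad_hidden_output, w3_2)
```

Under these assumptions, the gradients round to the 0.15 and 0.157 used for W41 and W42 in the weight update equations.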

After updating the weights between the hidden and output layers, next is to work on the weights between the input and hidden layers. The next code updates the weights connected to the first hidden neuron. Here, g3 is reused to hold the SOP3 to Activ1 derivative. Because this derivative is calculated using the old values of the weights, the old weights are saved into the w3_2_old variable for use in this step. g4 represents the Activ1 to SOP1 derivative. Finally, g5 represents the SOP1 to weights (W11, W21, and W31) derivatives.

Based on the derivatives saved in g3, g4, and g5, the gradients of the first hidden neuron’s weights are calculated by multiplying the variables g1 to g5. Based on the calculated gradients, the weights are updated. Similar to the 3 weights connected to the first hidden neuron, the other 3 weights connected to the second hidden neuron are updated according to the next code.
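The updates of the 6 input-to-hidden weights can be sketched as below. The names g1 to g5, w3_2_old, and update_w() follow the text; grad_hidden1_input and grad_hidden2_input, the derivative helper names, and the input values in x are assumptions for illustration. The hidden-to-output update itself is left out so that the block stays focused on the hidden neurons.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
target = numpy.array([0.2])
learning_rate = 0.001
w1_3 = numpy.array([0.481, 0.299, 0.192])
w2_3 = numpy.array([0.329, 0.548, 0.214])
w3_2 = numpy.array([0.882, 0.567])
w3_2_old = w3_2  # values of W41 and W42 before their own update

# Forward pass.
sop1 = numpy.sum(w1_3 * x)
sop2 = numpy.sum(w2_3 * x)
sig1 = sigmoid(sop1)
sig2 = sigmoid(sop2)
sop3 = numpy.sum(w3_2 * numpy.array([sig1, sig2]))
predicted = sigmoid(sop3)

# Derivatives shared by all chains.
g1 = error_predicted_deriv(predicted, target)
g2 = sigmoid_sop_deriv(sop3)

# First hidden neuron: W11, W21, W31.
g3 = sop_w_deriv(w3_2_old[0])  # dSOP3/dActiv1 = W41
g4 = sigmoid_sop_deriv(sop1)   # dActiv1/dSOP1
g5 = sop_w_deriv(x)            # dSOP1/dW11, dSOP1/dW21, dSOP1/dW31
grad_hidden1_input = g5 * g4 * g3 * g2 * g1
w1_3 = update_w(w1_3, grad_hidden1_input, learning_rate)

# Second hidden neuron: W12, W22, W32.
g3 = sop_w_deriv(w3_2_old[1])  # dSOP3/dActiv2 = W42
g4 = sigmoid_sop_deriv(sop2)   # dActiv2/dSOP2
g5 = sop_w_deriv(x)
grad_hidden2_input = g5 * g4 * g3 * g2 * g1
w2_3 = update_w(w2_3, grad_hidden2_input, learning_rate)
print(w1_3, w2_3)
```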

At the end of the code, the w3_2_old variable is set equal to w3_2. By reaching this step, the entire code for implementing the neural network in Fig. 5.1 is complete. The next subsection lists the code that trains the network in a number of iterations.

Complete code The previously discussed code just trains the network for a single iteration. The next code uses a loop for going through a number of iterations in which the weights are updated.
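A sketch of such a training loop is shown below, wrapped in a hypothetical train() function so that the number of iterations is easy to change. The input values, the helper names for the derivatives, and the lists predictions and errors are assumptions for illustration. The call at the bottom uses 2000 iterations to keep the run short; the plots in this chapter cover 80,000 iterations.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

def train(num_iterations, learning_rate=0.001):
    x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
    target = numpy.array([0.2])
    w1_3 = numpy.array([0.481, 0.299, 0.192])
    w2_3 = numpy.array([0.329, 0.548, 0.214])
    w3_2 = numpy.array([0.882, 0.567])
    w3_2_old = w3_2
    predictions, errors = [], []

    for _ in range(num_iterations):
        # Forward pass.
        sop1 = numpy.sum(w1_3 * x)
        sop2 = numpy.sum(w2_3 * x)
        sig1 = sigmoid(sop1)
        sig2 = sigmoid(sop2)
        sop3 = numpy.sum(w3_2 * numpy.array([sig1, sig2]))
        predicted = sigmoid(sop3)
        predictions.append(predicted)
        errors.append(error(predicted, target)[0])

        # Backward pass: hidden-to-output weights.
        g1 = error_predicted_deriv(predicted, target)
        g2 = sigmoid_sop_deriv(sop3)
        g3 = sop_w_deriv(numpy.array([sig1, sig2]))
        w3_2 = update_w(w3_2, g3 * g2 * g1, learning_rate)

        # Backward pass: weights of the 2 hidden neurons (old W41, W42).
        g5 = sop_w_deriv(x)
        g3 = sop_w_deriv(w3_2_old[0])
        g4 = sigmoid_sop_deriv(sop1)
        w1_3 = update_w(w1_3, g5 * g4 * g3 * g2 * g1, learning_rate)
        g3 = sop_w_deriv(w3_2_old[1])
        g4 = sigmoid_sop_deriv(sop2)
        w2_3 = update_w(w2_3, g5 * g4 * g3 * g2 * g1, learning_rate)

        # Refresh the backup for the next iteration.
        w3_2_old = w3_2

    return predictions, errors

predictions, errors = train(num_iterations=2000)
print(errors[0], errors[-1])
```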

After the iterations complete, Fig. 5.2 shows how the predicted output changes for each iteration. The network is able to reach the desired output (0.2) successfully. Fig. 5.3 shows how the error changes for each iteration.

FIG. 5.2 Network prediction vs. iteration for the ANN architecture in Fig. 5.1. (Plot: Prediction, with axis ticks from 0.20 to 0.60, against Iteration Number, from 0 to 80000.)

FIG. 5.3 Network error vs. iteration for the ANN architecture in Fig. 5.1. (Plot: Error, with axis ticks from 0.00 to 0.40, against Iteration Number, from 0 to 80000.)

Conclusion Continuing the implementation of the ANN started in Chapters 3 and 4, this chapter implemented an ANN with a hidden layer that has just 2 hidden neurons. This chapter discussed the theory of how an ANN with 3 inputs, 1 hidden layer with 2 hidden neurons, and 1 output works. Based on a numerical example, all steps in the forward and backward passes are covered. Finally, the Python implementation is discussed. In Chapter 6, the implementation will be extended to use any number of hidden neurons within a single hidden layer.

Chapter 6

Using any number of hidden neurons

Chapter outline
ANN with 1 hidden layer with 5 neurons
Forward pass
Backward pass
Hidden layer gradients
Python™ implementation
Forward pass
Backward pass
More iterations
Any number of hidden neurons in 1 layer
Weights initialization
Forward pass
Backward pass
ANN with 8 hidden neurons
Conclusion

ABSTRACT
By the end of Chapter 4, the implementation was able to work with any number of inputs. Chapter 5 extended this implementation to work with a hidden layer with just 2 neurons. In this chapter, the implementation is extended to use any number of neurons in a single hidden layer. There are 2 examples discussed. The first one uses 5 neurons in the hidden layer rather than just 2 as in Chapter 5. The second example extends this implementation to work with any number of hidden neurons within a single hidden layer.

ANN with 1 hidden layer with 5 neurons The architecture discussed in this section has an input layer with 3 neurons and a hidden layer with 5 neurons as shown in Fig. 6.1. Compared to the architecture discussed in Chapter 5, there are 5 rather than 2 hidden neurons. As a result, the code implementing the extra 3 neurons is just a repetition of some lines of code. Note that the labels of the weights between the input layer and the hidden layer are omitted to leave the figure clear. According to the flow of discussion followed in the previous chapters, this chapter starts by exploring the calculations in the forward and the backward passes.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00008-5 © 2021 Elsevier Inc. All rights reserved.

FIG. 6.1 ANN Architecture with 3 inputs, 1 output, and 1 hidden layer with 5 neurons.

Forward pass
The architecture in Fig. 6.1 has 3 inputs: X1, X2, and X3. These inputs are connected to 5 neurons in the hidden layer. For each hidden neuron, there is a SOP between the inputs and their weights. Assuming that the weights of the first hidden neuron are W11, W21, and W31, the SOP for this neuron, SOP1, is calculated according to the next equation.

$$SOP_1 = X_1 W_{11} + X_2 W_{21} + X_3 W_{31}$$

If the weights for the second hidden neuron are W12, W22, and W32, then the SOP for this neuron, SOP2, is as follows.

$$SOP_2 = X_1 W_{12} + X_2 W_{22} + X_3 W_{32}$$

Following the same concept, the SOPs for the remaining 3 neurons, SOP3, SOP4, and SOP5, can be calculated as given in the next equations.

$$SOP_3 = X_1 W_{13} + X_2 W_{23} + X_3 W_{33}$$

$$SOP_4 = X_1 W_{14} + X_2 W_{24} + X_3 W_{34}$$

$$SOP_5 = X_1 W_{15} + X_2 W_{25} + X_3 W_{35}$$

The SOPs for the 5 hidden neurons are fed to the activation function (sigmoid) to return their outputs. The outputs from the activation functions of the 5 neurons, Activ1, Activ2, Activ3, Activ4, and Activ5, are calculated as given in the next equations.

$$Activ_1 = \frac{1}{1 + e^{-SOP_1}}$$

$$Activ_2 = \frac{1}{1 + e^{-SOP_2}}$$

$$Activ_3 = \frac{1}{1 + e^{-SOP_3}}$$

$$Activ_4 = \frac{1}{1 + e^{-SOP_4}}$$

$$Activ_5 = \frac{1}{1 + e^{-SOP_5}}$$

Because the outputs from one layer are regarded as the inputs to the next layer, the 5 outputs from the hidden layer are the inputs to the next layer (i.e., the output layer). Because there is just a single neuron in the output layer, there is just 1 SOP, which is calculated according to the next equation. In order, the weights connecting the 5 hidden neurons to the output neuron are W41, W42, W43, W44, and W45.

$$SOP_6 = Activ_1 W_{41} + Activ_2 W_{42} + Activ_3 W_{43} + Activ_4 W_{44} + Activ_5 W_{45}$$

SOP6, which is the SOP of the output neuron, is then applied to the sigmoid function as given in the next equation. Note that the output of the activation function is regarded, in this example, as the predicted output.

$$Predicted = Activ_6 = \frac{1}{1 + e^{-SOP_6}}$$

After the predicted output is calculated, the next step is to measure the prediction error.

$$error = (Predicted - Target)^2$$

Making predictions and calculating the error means the forward pass is over and it is time to discuss the backward pass.

Backward pass
In the backward pass, the gradients of the weights are calculated using the chain rule. For the 5 weights between the hidden and the output layers, the generic chain rule is given in the next equation, where z is replaced by the neuron index from 1 to 5. Note that the gradients of these 5 weights are calculated as discussed in Chapter 5 without any change.

$$\frac{dError}{dW_{4z}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dW_{4z}}$$

The first derivative is the error to predicted output derivative, as given in the next equation.

$$\frac{dError}{dPredicted} = 2 (Predicted - Target)$$

The second derivative is the predicted output to SOP6 derivative.

$$\frac{dPredicted}{dSOP_6} = \frac{1}{1 + e^{-SOP_6}} \left(1 - \frac{1}{1 + e^{-SOP_6}}\right)$$

The third and last derivative is the derivative of SOP6 with respect to the 5 weights between the hidden and output layers. Because there are 5 weights, one for each connection between the hidden neurons and the output neuron, there are 5 derivatives, one for each weight. Remember that SOP6 is calculated according to the next equation.

$$SOP_6 = Activ_1 W_{41} + Activ_2 W_{42} + Activ_3 W_{43} + Activ_4 W_{44} + Activ_5 W_{45}$$

The 5 derivatives of SOP6 with respect to the 5 weights (W41 to W45) are given in the next equations.

$$\frac{dSOP_6}{dW_{41}} = Activ_1$$

$$\frac{dSOP_6}{dW_{42}} = Activ_2$$

$$\frac{dSOP_6}{dW_{43}} = Activ_3$$

$$\frac{dSOP_6}{dW_{44}} = Activ_4$$

$$\frac{dSOP_6}{dW_{45}} = Activ_5$$

After preparing all the derivative chains for the 5 weights, the next equations calculate the derivative of the error with respect to these 5 weights. Note that all of these chains share the first 2 derivatives and only the last derivative differs.

$$\frac{dError}{dW_{41}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dW_{41}}$$

$$\frac{dError}{dW_{42}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dW_{42}}$$

$$\frac{dError}{dW_{43}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dW_{43}}$$

$$\frac{dError}{dW_{44}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dW_{44}}$$

$$\frac{dError}{dW_{45}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dW_{45}}$$

After calculating the gradients for the weights between the hidden layer and the output layer, next is to calculate the gradients for the weights between the input and hidden layers.

Hidden layer gradients
The generic chain of derivatives from the error to any of the weights of the hidden layer is given in the next equation, where Wzy means the weight connecting the input neuron with index z to the hidden neuron with index y.

$$\frac{dError}{dW_{zy}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_y} \frac{dActiv_y}{dSOP_y} \frac{dSOP_y}{dW_{zy}}$$

Because the architecture in Fig. 6.1 has 3 inputs and 5 hidden neurons, there is a total of 3 × 5 = 15 weights between the input and the hidden layers. As a result, there are 15 derivative chains. Each chain starts with the first 2 derivatives calculated in the previous section, which are:
1. Error to predicted output derivative.
2. Predicted output to SOP6 derivative.

The third derivative in the chain is between SOP6 and the outputs of the sigmoid function, which are Activ1 to Activ5. Because there are 5 hidden neurons, there are 5 derivatives. Before calculating these derivatives, it is important to keep in mind the equation that relates SOP6 to the 5 sigmoid outputs (Activ1 to Activ5).

$$SOP_6 = Activ_1 W_{41} + Activ_2 W_{42} + Activ_3 W_{43} + Activ_4 W_{44} + Activ_5 W_{45}$$

Based on that equation, the 5 derivatives are given in the next equations.

$$\frac{dSOP_6}{dActiv_1} = W_{41}$$

$$\frac{dSOP_6}{dActiv_2} = W_{42}$$

$$\frac{dSOP_6}{dActiv_3} = W_{43}$$

$$\frac{dSOP_6}{dActiv_4} = W_{44}$$

$$\frac{dSOP_6}{dActiv_5} = W_{45}$$

The next derivative in the chain is the derivative of the sigmoid function’s output with respect to the SOP of each hidden neuron. Remember that each neuron’s sigmoid output is related to its SOP according to the next equation.

$$Activ = \frac{1}{1 + e^{-SOP}}$$

Also remember that the derivative of this equation is calculated as in the next equation.

$$\frac{dActiv}{dSOP} = \frac{1}{1 + e^{-SOP}} \left(1 - \frac{1}{1 + e^{-SOP}}\right)$$

Because there are 5 hidden neurons, there are 5 derivatives according to the next equations.

$$\frac{dActiv_1}{dSOP_1} = \frac{1}{1 + e^{-SOP_1}} \left(1 - \frac{1}{1 + e^{-SOP_1}}\right)$$

$$\frac{dActiv_2}{dSOP_2} = \frac{1}{1 + e^{-SOP_2}} \left(1 - \frac{1}{1 + e^{-SOP_2}}\right)$$

$$\frac{dActiv_3}{dSOP_3} = \frac{1}{1 + e^{-SOP_3}} \left(1 - \frac{1}{1 + e^{-SOP_3}}\right)$$

$$\frac{dActiv_4}{dSOP_4} = \frac{1}{1 + e^{-SOP_4}} \left(1 - \frac{1}{1 + e^{-SOP_4}}\right)$$

$$\frac{dActiv_5}{dSOP_5} = \frac{1}{1 + e^{-SOP_5}} \left(1 - \frac{1}{1 + e^{-SOP_5}}\right)$$

The last derivative in the chain is the derivative of the SOP at each hidden neuron with respect to the 3 weights connected to it. For simplicity, Fig. 6.2 shows the ANN architecture with just the connections between the input neurons and the first hidden neuron. In order to calculate the derivative of SOP1 to its 3 weights (W11, W21, and W31), keep in mind the equation that relates them together.

$$SOP_1 = X_1 W_{11} + X_2 W_{21} + X_3 W_{31}$$

FIG. 6.2 Connections between the Input Neurons and First Hidden Neuron.

According to this equation, the derivatives between SOP1 and the 3 weights are given in the next equations.

$$\frac{dSOP_1}{dW_{11}} = X_1 \qquad \frac{dSOP_1}{dW_{21}} = X_2 \qquad \frac{dSOP_1}{dW_{31}} = X_3$$

If the weights connecting the input neurons to the second hidden neuron are W12, W22, and W32, then SOP2 is calculated according to the next equation.

$$SOP_2 = X_1 W_{12} + X_2 W_{22} + X_3 W_{32}$$

As a result, the derivatives between SOP2 and the 3 weights are calculated according to the next equations.

$$\frac{dSOP_2}{dW_{12}} = X_1 \qquad \frac{dSOP_2}{dW_{22}} = X_2 \qquad \frac{dSOP_2}{dW_{32}} = X_3$$

If the connections for the third hidden neuron have the weights W13, W23, and W33, then the SOP of this neuron, SOP3, is calculated according to the next equation.

$$SOP_3 = X_1 W_{13} + X_2 W_{23} + X_3 W_{33}$$

Based on this equation, the derivatives between SOP3 and the 3 weights are calculated according to the next equations.

$$\frac{dSOP_3}{dW_{13}} = X_1 \qquad \frac{dSOP_3}{dW_{23}} = X_2 \qquad \frac{dSOP_3}{dW_{33}} = X_3$$

The same process is repeated for the remaining 2 hidden neurons. Note that the derivatives of the SOP of any hidden neuron with respect to its 3 weights are X1, X2, and X3. After calculating all derivatives in the chain from the error to the input layer’s weights, the gradients can be calculated by multiplying all derivatives in the chain. The 3 gradients of the 3 weights connected to the first hidden neuron, W11, W21, and W31, are calculated according to the next equations. Note that all chains share the same derivatives except for the final derivative.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_1} \frac{dActiv_1}{dSOP_1} \frac{dSOP_1}{dW_{11}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_1} \frac{dActiv_1}{dSOP_1} \frac{dSOP_1}{dW_{21}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_1} \frac{dActiv_1}{dSOP_1} \frac{dSOP_1}{dW_{31}}$$

Regarding the weights of the second hidden neuron, the gradients for its 3 weights W12, W22, and W32 are calculated according to the derivative chains given in the next equations. Compared to the previous 3 chains, here are the changes:
1. Each Activ1 is replaced by Activ2.
2. Each SOP1 is replaced by SOP2.
Note also that all derivatives in these chains are identical except for the last one, which changes according to the weight being used.

$$\frac{dError}{dW_{12}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_2} \frac{dActiv_2}{dSOP_2} \frac{dSOP_2}{dW_{12}}$$

$$\frac{dError}{dW_{22}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_2} \frac{dActiv_2}{dSOP_2} \frac{dSOP_2}{dW_{22}}$$

$$\frac{dError}{dW_{32}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_2} \frac{dActiv_2}{dSOP_2} \frac{dSOP_2}{dW_{32}}$$

The derivative chains for calculating the gradients for the weights connected to the third hidden neuron, which are W13, W23, and W33, are given in the next equations.

Compared to the previous 3 chains, here are the changes:
1. Each Activ2 is replaced by Activ3.
2. Each SOP2 is replaced by SOP3.
All derivatives in the 3 chains are identical except for the last one.

$$\frac{dError}{dW_{13}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_3} \frac{dActiv_3}{dSOP_3} \frac{dSOP_3}{dW_{13}}$$

$$\frac{dError}{dW_{23}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_3} \frac{dActiv_3}{dSOP_3} \frac{dSOP_3}{dW_{23}}$$

$$\frac{dError}{dW_{33}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_3} \frac{dActiv_3}{dSOP_3} \frac{dSOP_3}{dW_{33}}$$

The process used to prepare the derivative chains and calculate the gradients for the first 3 hidden neurons also applies to the remaining 2 hidden neurons. All derivative chains for all the weights between the input layer and the hidden layer are listed in the next equations. Because the input layer has 3 neurons and the hidden layer has 5 neurons, there is a total of 3 × 5 = 15 weights. In other words, there are 15 derivative chains.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_1} \frac{dActiv_1}{dSOP_1} \frac{dSOP_1}{dW_{11}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_1} \frac{dActiv_1}{dSOP_1} \frac{dSOP_1}{dW_{21}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_1} \frac{dActiv_1}{dSOP_1} \frac{dSOP_1}{dW_{31}}$$

$$\frac{dError}{dW_{12}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_2} \frac{dActiv_2}{dSOP_2} \frac{dSOP_2}{dW_{12}}$$

$$\frac{dError}{dW_{22}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_2} \frac{dActiv_2}{dSOP_2} \frac{dSOP_2}{dW_{22}}$$

$$\frac{dError}{dW_{32}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_2} \frac{dActiv_2}{dSOP_2} \frac{dSOP_2}{dW_{32}}$$

$$\frac{dError}{dW_{13}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_3} \frac{dActiv_3}{dSOP_3} \frac{dSOP_3}{dW_{13}}$$

$$\frac{dError}{dW_{23}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_3} \frac{dActiv_3}{dSOP_3} \frac{dSOP_3}{dW_{23}}$$

$$\frac{dError}{dW_{33}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_3} \frac{dActiv_3}{dSOP_3} \frac{dSOP_3}{dW_{33}}$$

$$\frac{dError}{dW_{14}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_4} \frac{dActiv_4}{dSOP_4} \frac{dSOP_4}{dW_{14}}$$

$$\frac{dError}{dW_{24}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_4} \frac{dActiv_4}{dSOP_4} \frac{dSOP_4}{dW_{24}}$$

$$\frac{dError}{dW_{34}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_4} \frac{dActiv_4}{dSOP_4} \frac{dSOP_4}{dW_{34}}$$

$$\frac{dError}{dW_{15}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_5} \frac{dActiv_5}{dSOP_5} \frac{dSOP_5}{dW_{15}}$$

$$\frac{dError}{dW_{25}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_5} \frac{dActiv_5}{dSOP_5} \frac{dSOP_5}{dW_{25}}$$

$$\frac{dError}{dW_{35}} = \frac{dError}{dPredicted} \frac{dPredicted}{dSOP_6} \frac{dSOP_6}{dActiv_5} \frac{dActiv_5}{dSOP_5} \frac{dSOP_5}{dW_{35}}$$

After successfully preparing all derivative chains for all weights in the network, the next section starts the Python implementation of the forward and backward passes of the architecture discussed in this chapter.
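One way to gain confidence in the 15 derivative chains is to compare them with numerical derivatives. The sketch below is not part of the book's implementation: it stores the input-to-hidden weights in a hypothetical 5 × 3 array w_hidden (row y holds W1y, W2y, W3y), stores W41 to W45 in w_out, uses assumed input values and random weights, and checks the chain-rule gradients against central finite differences.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def forward(x, w_hidden, w_out):
    sop_hidden = numpy.matmul(w_hidden, x)    # SOP1..SOP5
    activ_hidden = sigmoid(sop_hidden)        # Activ1..Activ5
    sop6 = numpy.sum(w_out * activ_hidden)    # SOP6
    return sop_hidden, activ_hidden, sigmoid(sop6)

def loss(x, target, w_hidden, w_out):
    return (forward(x, w_hidden, w_out)[2] - target) ** 2

rng = numpy.random.default_rng(0)
x = numpy.array([0.1, 0.4, 4.1])   # assumed inputs
target = 0.2                       # assumed target
w_hidden = rng.random((5, 3))      # illustrative random weights
w_out = rng.random(5)

sop_hidden, activ_hidden, predicted = forward(x, w_hidden, w_out)

g1 = 2 * (predicted - target)               # dError/dPredicted
g2 = predicted * (1 - predicted)            # dPredicted/dSOP6
grad_out = g1 * g2 * activ_hidden           # dError/dW41..dW45
g3 = w_out                                  # dSOP6/dActiv_y
g4 = activ_hidden * (1 - activ_hidden)      # dActiv_y/dSOP_y
# Row y, column z holds dError/dW_zy = g1 * g2 * g3[y] * g4[y] * x[z].
grad_hidden = numpy.outer(g1 * g2 * g3 * g4, x)

# Central finite differences for the same 15 weights.
eps = 1e-6
numeric = numpy.zeros_like(w_hidden)
for y in range(5):
    for z in range(3):
        w_plus = w_hidden.copy()
        w_minus = w_hidden.copy()
        w_plus[y, z] += eps
        w_minus[y, z] -= eps
        numeric[y, z] = (loss(x, target, w_plus, w_out)
                         - loss(x, target, w_minus, w_out)) / (2 * eps)

max_diff = numpy.max(numpy.abs(numeric - grad_hidden))
print(max_diff)
```

The outer product makes the shared structure of the chains visible: every chain for hidden neuron y has the same first 4 factors, and only the final factor (X1, X2, or X3) differs.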

Python™ implementation The Python script for implementing the network in Fig. 6.1 with 3 inputs, 1 hidden layer with 5 neurons, and a single output is given in the next code.

Preparing the inputs and their output is the first thing done in this code according to the next 2 lines. Because the input layer has 3 inputs, a NumPy array is created with 3 values. The target is specified as a single value.

Next is to prepare the network weights according to the next 6 lines. The weights of each hidden neuron are created in a separate variable. The weights of the first hidden neuron are stored into the w1_3 variable, the weights for the second hidden neuron are stored in the w2_3 variable, and so on. The variable w6_5 holds the 5 weights connecting the 5 hidden neurons to the output neuron.

The variable w6_5_old holds the weights in the w6_5 variable as a backup for use when calculating the derivatives of SOP6 with respect to Activ1-Activ5. After preparing the inputs, outputs, and weights, next is to start the forward pass.

Forward pass The first task in the forward pass is to calculate the SOP for each hidden neuron as given in the next code. This is by multiplying the 3 inputs by the 3 weights.

After that, the sigmoid function is applied to all of these sums of products.

The outputs of the sigmoid function are regarded as the inputs to the output neuron. The SOP for this neuron is calculated using the next line.

The SOP of the output neuron is fed to the sigmoid function to return the predicted output. After the predicted output is calculated, the error is calculated using the error() function. Error calculation is the final step in the forward pass, after which the backward pass begins.
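The preparation and forward pass steps described in this section can be sketched as follows. The variable names w1_3, w2_3, w6_5, and w6_5_old come from the text; w3_3, w4_3, and w5_3 extend the same naming pattern and are assumptions, as are the input values, the target, the random initialization, and the names sop_output and sig_hidden.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

# Assumed training sample.
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])

# One weights array per hidden neuron (random initial values assumed).
w1_3 = numpy.random.rand(3)
w2_3 = numpy.random.rand(3)
w3_3 = numpy.random.rand(3)
w4_3 = numpy.random.rand(3)
w5_3 = numpy.random.rand(3)

# Weights from the 5 hidden neurons to the output neuron, plus a backup.
w6_5 = numpy.random.rand(5)
w6_5_old = w6_5

# SOP of each hidden neuron.
sop1 = numpy.sum(w1_3 * x)
sop2 = numpy.sum(w2_3 * x)
sop3 = numpy.sum(w3_3 * x)
sop4 = numpy.sum(w4_3 * x)
sop5 = numpy.sum(w5_3 * x)

# Sigmoid outputs of the hidden neurons.
sig_hidden = sigmoid(numpy.array([sop1, sop2, sop3, sop4, sop5]))

# SOP and prediction of the output neuron, then the error.
sop_output = numpy.sum(w6_5 * sig_hidden)
predicted = sigmoid(sop_output)
err = error(predicted, target)
print(predicted, err)
```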

Backward pass In the backward pass, the first derivative calculated is the error to the predicted output derivative according to the next line. The result is saved in the variable g1 for later use. The next derivative is the predicted output to SOP6 derivative which is calculated according to the next line. The result is saved in the variable g2 for later use. In order to calculate the gradients of the weights between the hidden and output layers, the derivatives of SOP6 with respect to the 5 weights W41-W45 are calculated. All of these derivatives are saved in a NumPy array named g3 according to the next lines.

After preparing all derivatives required for calculating the gradients for the weights W41 to W45, next is to calculate the gradients using the next line.

After that, these 5 weights can be updated using the update_w() function according to the next line. It accepts the old weights, the gradients, and the learning rate, and it returns the new weights.

After updating the weights between the hidden and output layers, next is to calculate the gradients for the weights between the input and hidden layers. In this discussion, the hidden neurons are handled one at a time. For the first hidden neuron, the required calculations for preparing the gradients for its weights are given in the next code, in which:
1. The derivative of SOP6 to Activ1 is calculated and saved in the variable g3.
2. The derivative of Activ1 to SOP1 is calculated and saved in g4.
3. The last derivatives (SOP1 to W11-W31) are saved in the g5 variable.
Note that g5 has 3 derivatives, one for each weight, while g4 and g3 have just one derivative each.

After calculating all derivatives in the chain, next is to calculate the gradient for updating the 3 weights connecting the 3 input neurons to the first hidden neuron. This is done by multiplying the variables g1, g2, g3, g4, and g5. The result is saved in the grad_hidden1_input variable. Finally, the 3 weights are updated using the update_w() function. Working on the other hidden neurons is very similar to the previous code. Of the previous 5 lines, only the first 2 need to change. Regarding the second hidden neuron, here are the changes:
1. The value of g3 is obtained by indexing the w6_5_old array with index 1.
2. For calculating g4, use sop2 rather than sop1.
The next code is responsible for updating the weights of the second hidden neuron.
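The updates for the first 2 hidden neurons can be sketched as below. The names g1 to g5, w6_5_old, grad_hidden1_input, and update_w() follow the text; grad_hidden2_input, the derivative helper names, the input values, and the random weights are assumptions for illustration. Only the weights of the first 2 hidden neurons are updated here; the remaining 3 neurons follow the same pattern with indices 2, 3, and 4.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
target = numpy.array([0.2])
learning_rate = 0.001

# Random initial weights (assumed); one array per hidden neuron.
w1_3, w2_3, w3_3, w4_3, w5_3 = (numpy.random.rand(3) for _ in range(5))
w6_5 = numpy.random.rand(5)
w6_5_old = w6_5

# Forward pass.
sop1, sop2, sop3, sop4, sop5 = (numpy.sum(w * x)
                                for w in (w1_3, w2_3, w3_3, w4_3, w5_3))
sig_hidden = sigmoid(numpy.array([sop1, sop2, sop3, sop4, sop5]))
sop_output = numpy.sum(w6_5 * sig_hidden)
predicted = sigmoid(sop_output)

# Derivatives shared by every chain.
g1 = error_predicted_deriv(predicted, target)
g2 = sigmoid_sop_deriv(sop_output)

# First hidden neuron.
g3 = sop_w_deriv(w6_5_old[0])  # dSOP6/dActiv1
g4 = sigmoid_sop_deriv(sop1)   # dActiv1/dSOP1
g5 = sop_w_deriv(x)            # dSOP1/dW11, dW21, dW31
grad_hidden1_input = g5 * g4 * g3 * g2 * g1
w1_3 = update_w(w1_3, grad_hidden1_input, learning_rate)

# Second hidden neuron: only the first 2 lines change.
g3 = sop_w_deriv(w6_5_old[1])  # dSOP6/dActiv2
g4 = sigmoid_sop_deriv(sop2)   # dActiv2/dSOP2
g5 = sop_w_deriv(x)
grad_hidden2_input = g5 * g4 * g3 * g2 * g1
w2_3 = update_w(w2_3, grad_hidden2_input, learning_rate)
print(grad_hidden1_input, grad_hidden2_input)
```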

For working with the third hidden neuron, the changes are: 1. g3 is calculated by using the index 2 with the w6_5_old array. 2. g4 is calculated using sop3.

For the fourth hidden neuron, the changes are: 1. g3 is calculated by using the index 3 with the w6_5_old array. 2. g4 is calculated using sop4.

For the fifth hidden neuron, the changes are: 1. g3 is calculated by using the index 4 with the w6_5_old array. 2. g4 is calculated using sop5.

At this point, the gradients for all network weights are calculated and the weights are updated. Just remember to set the w6_5_old variable to the new w6_5 at the end of the code.

More iterations After implementing the network architecture in Fig. 6.1, a for loop is used to train the network for a number of iterations. This is implemented in the next code.
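A training loop of this kind might look like the sketch below, which repeats the per-neuron lines explicitly, as this section describes. It is wrapped in a hypothetical train() function; the input values, the target, the random initialization, the learning rate of 0.001, and the list names are assumptions. The call at the bottom uses 1000 iterations to keep the run short, while the plots in this chapter cover 80,000 iterations.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

def train(num_iterations, learning_rate=0.001):
    x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
    target = numpy.array([0.2])
    w1_3, w2_3, w3_3, w4_3, w5_3 = (numpy.random.rand(3) for _ in range(5))
    w6_5 = numpy.random.rand(5)
    w6_5_old = w6_5
    predicted_output, network_error = [], []

    for _ in range(num_iterations):
        # Forward pass.
        sop1 = numpy.sum(w1_3 * x)
        sop2 = numpy.sum(w2_3 * x)
        sop3 = numpy.sum(w3_3 * x)
        sop4 = numpy.sum(w4_3 * x)
        sop5 = numpy.sum(w5_3 * x)
        sig_hidden = sigmoid(numpy.array([sop1, sop2, sop3, sop4, sop5]))
        sop_output = numpy.sum(w6_5 * sig_hidden)
        predicted = sigmoid(sop_output)
        predicted_output.append(predicted)
        network_error.append(error(predicted, target)[0])

        # Backward pass: hidden-to-output weights W41..W45.
        g1 = error_predicted_deriv(predicted, target)
        g2 = sigmoid_sop_deriv(sop_output)
        g3 = sop_w_deriv(sig_hidden)
        w6_5 = update_w(w6_5, g3 * g2 * g1, learning_rate)

        # Backward pass: one block per hidden neuron (old W4y values).
        g5 = sop_w_deriv(x)
        g3, g4 = sop_w_deriv(w6_5_old[0]), sigmoid_sop_deriv(sop1)
        w1_3 = update_w(w1_3, g5 * g4 * g3 * g2 * g1, learning_rate)
        g3, g4 = sop_w_deriv(w6_5_old[1]), sigmoid_sop_deriv(sop2)
        w2_3 = update_w(w2_3, g5 * g4 * g3 * g2 * g1, learning_rate)
        g3, g4 = sop_w_deriv(w6_5_old[2]), sigmoid_sop_deriv(sop3)
        w3_3 = update_w(w3_3, g5 * g4 * g3 * g2 * g1, learning_rate)
        g3, g4 = sop_w_deriv(w6_5_old[3]), sigmoid_sop_deriv(sop4)
        w4_3 = update_w(w4_3, g5 * g4 * g3 * g2 * g1, learning_rate)
        g3, g4 = sop_w_deriv(w6_5_old[4]), sigmoid_sop_deriv(sop5)
        w5_3 = update_w(w5_3, g5 * g4 * g3 * g2 * g1, learning_rate)

        # Refresh the backup for the next iteration.
        w6_5_old = w6_5

    return predicted_output, network_error

predicted_output, network_error = train(num_iterations=1000)
print(network_error[0], network_error[-1])
```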

FIG. 6.3 Prediction vs. iteration for the network in Fig. 6.1. (Plot: Predicted, with axis ticks from 0.2 to 0.9, against Iteration Number, from 0 to 80000.)

Fig. 6.3 shows a plot relating the predicted output to each iteration which proves that the ANN is able to make the correct prediction. The relation between the error and the iteration is given in Fig. 6.4. At this time, the ANN is able to work with just a specific number of neurons in a single hidden layer. Seeking to generalize the implementation, the next section continues editing the previous code so that it can work with any number of neurons within a single hidden layer.

FIG. 6.4 Error vs. Iteration for the Network in Fig. 6.1. (Plot: Error, with axis ticks from 0.0 to 0.5, against Iteration Number, from 0 to 80000.)

Any number of hidden neurons in 1 layer
In the previous implementation, some lines are repeated for each hidden neuron. To avoid repeating such lines, the code is optimized to use for loops. The new code is given in the next listing. The summary of changes in this code is as follows:
1. Defining the network architecture in a variable named network_architecture.
2. Initializing the network weights using for loops that work regardless of the network architecture.
3. Doing the forward pass calculations using a single line with the help of the numpy.matmul() function.
4. Updating the weights of all hidden neurons in the backward pass regardless of their number.
Within this section, the above changes are discussed.

For the first change, a new variable named network_architecture holds the ANN architecture. For the architecture in use, the number of inputs equals x.shape[0], which is 3 in this example. The number of hidden neurons is 5 and the number of output neurons is 1.

Based on the network_architecture variable, the network weights are initialized as discussed in the next section.

Weights initialization Previously, the weights were initialized using the next code. For each hidden or output neuron, there is an associated variable to hold its weights.

A better way for initializing the weights is to use a for loop that goes through each layer according to the network architecture specified in the variable network_architecture. Doing the initialization this way, a single array named w is returned holding all network weights according to the next code.

For this example, the shape of the array w is (2,) which means there are just 2 elements within it. The shape of the first element is (5, 3) which holds the weights between the input layer, which has 3 inputs, and hidden layer, which has 5 neurons. The shape of the second element in the array w is (1, 5) which holds the weights between the hidden layer, which has 5 neurons, and the output layer, which has just a single neuron.
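The loop-based initialization can be sketched as follows. The names network_architecture and w come from the text; the loop counters, the temporary list w_temp, the input values, and the random initialization are assumptions. The outer array is created with dtype=object and filled element by element, which is one way to hold 2 matrices of different shapes in a single NumPy array of shape (2,).

```python
import numpy

x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs

# Inputs, hidden neurons, outputs.
network_architecture = numpy.array([x.shape[0], 5, 1])

# One weights matrix per pair of consecutive layers.
num_weight_layers = network_architecture.shape[0] - 1
w = numpy.empty(num_weight_layers, dtype=object)

for layer_counter in range(num_weight_layers):
    w_temp = []
    # One row of weights for each neuron in the next layer.
    for neuron_counter in range(network_architecture[layer_counter + 1]):
        w_temp.append(numpy.random.rand(network_architecture[layer_counter]))
    w[layer_counter] = numpy.array(w_temp)

print(w.shape, w[0].shape, w[1].shape)
```

Changing the middle value of network_architecture is then enough to change the number of hidden neurons.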

Forward pass
Preparing the network weights in a single array facilitates working on both the forward and backward passes. All sums of products are calculated in a single line using the numpy.matmul() function. Note that w[0] means the weights between the input and hidden layers.

Similarly, the sigmoid function is called once to be applied to all the sums of products.

The sum of products between the hidden and output layers is calculated according to the next line. Note that w[1] returns the weights between these 2 layers.

As usual, the predicted output and the error are calculated.
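The vectorized forward pass can be sketched as below. The use of numpy.matmul(), w[0], and w[1] follows the text; the names sop_hidden, sig_hidden, and sop_output, the input values, and the random weights are assumptions. For brevity, w is held in a plain Python list of 2 matrices here.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
target = numpy.array([0.2])

# w[0]: input-to-hidden weights (5, 3); w[1]: hidden-to-output weights (1, 5).
w = [numpy.random.rand(5, 3), numpy.random.rand(1, 5)]

# All 5 hidden SOPs in one matrix-vector product.
sop_hidden = numpy.matmul(w[0], x)

# One sigmoid call for all hidden neurons.
sig_hidden = sigmoid(sop_hidden)

# SOP of the single output neuron.
sop_output = numpy.sum(w[1][0] * sig_hidden)

predicted = sigmoid(sop_output)
err = error(predicted, target)
print(predicted, err)
```

Each entry of sop_hidden equals the per-neuron sum numpy.sum(w[0][i] * x) used in the earlier, non-vectorized code.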

Backward pass In the backward pass, because there is just a single neuron in the output layer, its weights will be updated in the same way used previously.

Rather than repeating the lines that update each hidden neuron, a for loop can be used according to the next code. It loops through each hidden neuron and passes the appropriate inputs to the functions.
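A sketch of one training iteration with the loop-based backward pass is given below. It is wrapped in a hypothetical train_step() function; the names w_old, neuron_idx, and grad_hidden_input, the derivative helper names, the input values, and the random weights are assumptions. As before, w_old keeps the hidden-to-output weights from before their update so that the hidden neurons use the old values.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

def train_step(x, target, w, w_old, learning_rate):
    # Forward pass.
    sop_hidden = numpy.matmul(w[0], x)
    sig_hidden = sigmoid(sop_hidden)
    sop_output = numpy.sum(w[1][0] * sig_hidden)
    predicted = sigmoid(sop_output)
    err = error(predicted, target)

    # Backward pass: hidden-to-output weights.
    g1 = error_predicted_deriv(predicted, target)
    g2 = sigmoid_sop_deriv(sop_output)
    g3 = sop_w_deriv(sig_hidden)
    grad_hidden_output = g3 * g2 * g1
    w[1][0] = update_w(w[1][0], grad_hidden_output, learning_rate)

    # Backward pass: loop over the hidden neurons, whatever their number.
    g5 = sop_w_deriv(x)
    for neuron_idx in range(w[0].shape[0]):
        g3 = sop_w_deriv(w_old[1][0][neuron_idx])
        g4 = sigmoid_sop_deriv(sop_hidden[neuron_idx])
        grad_hidden_input = g5 * g4 * g3 * g2 * g1
        w[0][neuron_idx] = update_w(w[0][neuron_idx], grad_hidden_input,
                                    learning_rate)

    # Refresh the backup of the hidden-to-output weights.
    w_old[1][:] = w[1]
    return err

x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
target = numpy.array([0.2])
w = [numpy.random.rand(5, 3), numpy.random.rand(1, 5)]
w_old = [layer.copy() for layer in w]

err_first = train_step(x, target, w, w_old, learning_rate=0.001)
err_second = train_step(x, target, w, w_old, learning_rate=0.001)
print(err_first, err_second)
```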

By doing so, the code is successfully minimized and generalized to work with any number of hidden neurons within a single hidden layer. The next section builds a network with 8 hidden neurons.

ANN with 8 hidden neurons This section tests the code to use a different number of hidden neurons. The only change required is to specify the desired number of hidden neurons in the network_architecture variable according to the next line.

The implementation of an ANN that works with a single hidden layer with 8 hidden neurons is given in the next code.
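A compact sketch of the generalized implementation is shown below, with the number of hidden neurons passed to a hypothetical train() function. The input values, the target, the random initialization, the learning rate, and the helper names for the derivatives are assumptions, and w is held in a plain Python list for brevity. The call at the bottom uses 8 hidden neurons and 1000 iterations to keep the run short.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

def train(num_hidden_neurons, num_iterations, learning_rate=0.001):
    x = numpy.array([0.1, 0.4, 4.1])  # assumed inputs
    target = numpy.array([0.2])
    network_architecture = numpy.array([x.shape[0], num_hidden_neurons, 1])

    # One weights matrix per pair of consecutive layers.
    w = [numpy.random.rand(network_architecture[i + 1], network_architecture[i])
         for i in range(network_architecture.shape[0] - 1)]
    w_old = [layer.copy() for layer in w]
    network_error = []

    for _ in range(num_iterations):
        # Forward pass.
        sop_hidden = numpy.matmul(w[0], x)
        sig_hidden = sigmoid(sop_hidden)
        sop_output = numpy.sum(w[1][0] * sig_hidden)
        predicted = sigmoid(sop_output)
        network_error.append(error(predicted, target)[0])

        # Backward pass: hidden-to-output weights.
        g1 = error_predicted_deriv(predicted, target)
        g2 = sigmoid_sop_deriv(sop_output)
        g3 = sop_w_deriv(sig_hidden)
        w[1][0] = update_w(w[1][0], g3 * g2 * g1, learning_rate)

        # Backward pass: every hidden neuron.
        g5 = sop_w_deriv(x)
        for neuron_idx in range(w[0].shape[0]):
            g3 = sop_w_deriv(w_old[1][0][neuron_idx])
            g4 = sigmoid_sop_deriv(sop_hidden[neuron_idx])
            w[0][neuron_idx] = update_w(w[0][neuron_idx],
                                        g5 * g4 * g3 * g2 * g1, learning_rate)

        w_old = [layer.copy() for layer in w]

    return w, network_error

w, network_error = train(num_hidden_neurons=8, num_iterations=1000)
print(w[0].shape, w[1].shape, network_error[0], network_error[-1])
```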

FIG. 6.5 Prediction vs. Iteration for a Network with 8 Hidden Neurons. (Plot: Predicted, with axis ticks from 0.2 to 0.9, against Iteration Number, from 0 to 80000.)

FIG. 6.6 Error vs. Iteration for a Network with 8 Hidden Neurons. (Plot: Error, with axis ticks from 0.0 to 0.5, against Iteration Number.)

Fig. 6.5 shows the relationship between the predicted output and the iteration number which proves that the ANN is trained successfully. The relationship between the error and the iteration number is given in Fig. 6.6.

Conclusion
By the end of this chapter, an ANN is implemented that works with a variable number of hidden neurons within a single hidden layer, in addition to any number of inputs. The code is also optimized to hold all the network weights in a single variable and to do the forward and backward pass calculations using less code. In Chapter 7, the implementation will be extended to work with 2 hidden layers while using any number of neurons in these layers.

Chapter 7

Working with 2 hidden layers

Chapter outline
ANN with 2 hidden layers with 5 and 3 neurons
Editing Chapter 6 implementation to work with an additional layer
Preparing inputs, outputs, and weights
Forward pass
Backward pass
ANN with 2 hidden layers with 10 and 8 neurons
Conclusion

ABSTRACT There were 2 milestones in the implementation built in Chapter 6 as the network can use: 1. Any number of inputs. 2. Any number of neurons within a single hidden layer. In this chapter, the implementation will be generalized to allow the neural network to work with 2 hidden layers with any number of hidden neurons. The chapter starts by discussing the implementation of a network with an input layer with 3 neurons, 2 hidden layers with 5 and 3 hidden neurons, respectively, and an output layer with a single neuron. Based on this architecture, some rules are deduced to generalize the implementation to work with any number of neurons in 2 hidden layers.

ANN with 2 hidden layers with 5 and 3 neurons
The first architecture discussed is shown in Fig. 7.1, in which there is an input layer with 3 neurons, a first hidden layer with 5 neurons, a second hidden layer with 3 neurons, and an output layer with a single neuron. To keep the figure clear, the labels of the hidden layers’ weights are omitted. Chapters 3–6 start by discussing how the forward and backward passes work in detail from scratch. It is expected that the reader has a solid understanding of how these passes work, so there is no need to start from scratch in this chapter. This chapter makes use of the implementation built in Chapter 6, because it is the latest implementation, and builds on it. The architecture diagram discussed in Fig. 6.1 is given again in Fig. 7.2. It has 3 neurons in the input layer, a single hidden layer with 5 neurons, and a single output neuron. The change introduced in this chapter is just adding a second hidden layer with 3 neurons.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00011-5 © 2021 Elsevier Inc. All rights reserved.

FIG. 7.1 ANN Architecture with 2 Hidden Layers.

FIG. 7.2 Replication of Fig. 6.1.

To understand how the 2 architectures in Fig. 7.1 and Fig. 7.2 are similar, Fig. 7.3 maps one architecture onto the other. From this mapping, the last 3 layers in the architecture of this chapter (first hidden, second hidden, and output layers) are mapped to the 3 layers of the architecture in Chapter 6 (input, hidden, and output layers). Thus, the implementation of Chapter 6 can be used for working with the last 3 layers in the architecture in Fig. 7.1. There is just a single layer not considered in that implementation, which is the first (input) layer. So, rather than starting from scratch, only a single layer is prepended to the implementation of the architecture in Fig. 6.1. This chapter discusses how to work with this new layer in both the forward and backward passes.


FIG. 7.3 Mapping between the 2 architectures in Figs. 7.1 and 7.2. (Figure content: the 3-layer network of Fig. 7.2, with Input Layer X1 to X3, Hidden Layer SOP1 to SOP5, and Output Layer SOP6 | Activ6 (Predicted), mapped onto the 4-layer network of Fig. 7.1, with Input Layer X1 to X3, Hidden Layer 1 SOP1 to SOP5, Hidden Layer 2 SOP6 to SOP8, and Output Layer SOP9 | Activ9 (Predicted) reached through the weights W91, W92, and W93.)

Editing Chapter 6 implementation to work with an additional layer

As discussed in the previous chapters, the implementation is usually divided into 3 parts:
1. Preparing inputs, outputs, and weights.
2. Forward pass.
3. Backward pass.
Let’s start editing the 3 parts one by one.

Preparing inputs, outputs, and weights

The next code from Chapter 6 prepares the inputs, outputs, and weights. The inputs and the output are defined in the arrays x and target, respectively. The network architecture is specified in the network_architecture array, which is then used for initializing the network weights in the array w. How should this code be edited to work with the architecture used in this chapter?


This chapter uses 3 inputs for the training sample, like the previous code, so there is no need to edit the input array x. The same holds for the output array target. Regarding the weights, the previous chapter uses a single hidden layer but this chapter uses 2 hidden layers. Thus, the network_architecture array has to be changed to reflect the new hidden layer. The new value for the array is assigned in the next line, where x.shape[0] is the number of inputs (3), the value 5 is the number of neurons in the first hidden layer, 3 is the number of neurons in the second hidden layer, and the last value 1 is for the single output.

Remember that the weights array is created automatically using the nested for loops. These loops use the architecture defined in the network_architecture variable to determine the shape of the weights array. Editing the network_architecture array is therefore enough to update the weights array. The next block lists the updated code that works with the architecture in Fig. 7.1.
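A minimal sketch of this preparation step is given here as an illustration, not as the chapter's own listing. The sample values in x and target, the seeded random generator, and keeping w as a plain Python list with one matrix per pair of layers (one row per neuron of the later layer) are all assumptions of the sketch.

```python
import numpy

# Assumed sample values; only the sizes (3 inputs, 1 output) come from the text.
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])

# Inputs, first hidden layer, second hidden layer, output.
network_architecture = numpy.array([x.shape[0], 5, 3, 1])

# Seeded only so that repeated runs give the same weights (an assumption).
rng = numpy.random.default_rng(seed=0)

# Nested loops: the outer loop visits each pair of successive layers,
# the inner loop creates the weights of one neuron in the later layer.
w = []
for layer_counter in range(network_architecture.shape[0] - 1):
    w_temp = []
    for neuron_counter in range(network_architecture[layer_counter + 1]):
        # One weight per connection coming from the earlier layer.
        w_temp.append(rng.random(network_architecture[layer_counter]))
    w.append(numpy.array(w_temp))

print([layer_w.shape for layer_w in w])
```

Only the network_architecture line mentions the layer sizes, so changing that array is enough to change every weight matrix created by the loops.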


Assuming that there are 2 successive layers with N and M neurons, there are N×M connections between these layers. Because each connection has its own weight, there is a total of N×M weights. For the output layer, there is only a single neuron. Assuming that the hidden layer directly preceding it has N neurons, there is a total of N×1 connections between the 2 layers. The weights array w creates a subarray for each layer to hold all of its weights. The shape of the weights array for the single output neuron is (N,1) where N refers to the number of connections (weights) to that neuron. When indexing the weights array w to return the weights of the output neuron, a fixed index with value 0 is always specified, as in w[x][0]. Because index 0 does not change, there is no need to specify it at all. This is why the next line is added at the end of the previous code to convert the weights of the output layer from an array to just a vector. Later, in Chapter 10, when supporting multiple outputs, this line will be removed. After preparing the inputs, outputs, and weights, the next subsection works on the forward pass.
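The effect of dropping the fixed index can be sketched as follows. This is an illustration under assumptions: the output weights are stored here with one row for the single output neuron, so that indexing with 0 returns the vector of N weights, and the random values stand in for real weights and activations.

```python
import numpy

rng = numpy.random.default_rng(seed=0)  # seed is an assumption, for repeatability

n_hidden = 3                              # N: neurons feeding the output neuron
output_w = rng.random((1, n_hidden))      # one row holding the N output weights
activations = rng.random(n_hidden)        # stand-in outputs of the last hidden layer

# With the 2-D array, the SOP comes back as a 1-element array.
sop_matrix = numpy.matmul(output_w, activations)

# Dropping the fixed index 0 leaves a plain vector of N weights.
output_w_vector = output_w[0]
sop_vector = numpy.matmul(output_w_vector, activations)

print(sop_matrix.shape, numpy.ndim(sop_vector))
```

Both forms give the same sum of products. The vector form simply returns it as a scalar, which is convenient while the network has a single output.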

Forward pass

In Chapter 6, the code responsible for implementing the forward pass is given in the next block. How should this code be edited to work with the architecture in Fig. 7.1?

According to Fig. 7.3, the only difference between the architecture in Chapter 6 and the one in this chapter is the added second hidden layer. Because the outputs of one layer are the inputs of the next layer, the outputs from the first hidden layer are the inputs to the second hidden layer. According to the previous code, the variable sig_hidden holds the outputs from the hidden layer. Thus, this variable is the input to the second hidden layer.


The edited code that attaches the new hidden layer to the forward pass is listed in the next block. Now, the second hidden layer accepts the outputs of the first hidden layer, stored in the sig_hidden1 variable, in addition to its weights returned by indexing the array w by index 1. The sum of products (SOPs) for the second hidden layer are calculated using the numpy.matmul() function and the result is stored in the sop_hidden2 variable. The sigmoid function is then applied to this variable and the result is saved into the sig_hidden2 variable. This variable represents the outputs from the new hidden layer. The sig_hidden2 variable is then fed to the output neuron to calculate the predicted output. Finally, the error is calculated. Once the error is calculated, this indicates the end of the forward pass.

Note that the indices used for retrieving the weights of each layer from the array w are hardcoded. Here is a summary of the 3 indices used:
1. Index 0: Returns the weights between the input layer and the first hidden layer.
2. Index 1: Returns the weights between the 2 hidden layers.
3. Index 2: Returns the weights between the second hidden layer and the output layer.
Working this way requires tracing the indices to return the correct weights for each layer. A better way is to avoid hardcoding the indices by using a variable that starts from 0 and increases by 1 for each layer. The next code includes such edits. The variable layer_idx represents the index of the current layer. It starts from 0 and is then incremented until reaching the value 2. Later, a loop will be used in which the layer_idx variable is incremented automatically.
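The forward pass with a running layer index can be sketched as below. The variable names sop_hidden1, sig_hidden1, sop_hidden2, sig_hidden2, and layer_idx follow the text. The helper functions, the sample data, the seeded random weights, and the squared-error function are assumptions made so the sketch runs on its own.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    # Squared error, assumed here as the error function.
    return numpy.power(predicted - target, 2)

# Assumed sample data and weights; only the layer sizes come from the text.
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])
rng = numpy.random.default_rng(seed=1)
w = [rng.random((5, 3)),   # index 0: input layer -> first hidden layer
     rng.random((3, 5)),   # index 1: first hidden layer -> second hidden layer
     rng.random(3)]        # index 2: second hidden layer -> output neuron (vector)

layer_idx = 0
sop_hidden1 = numpy.matmul(w[layer_idx], x)
sig_hidden1 = sigmoid(sop_hidden1)

layer_idx = layer_idx + 1
# The outputs of the first hidden layer are the inputs of the second one.
sop_hidden2 = numpy.matmul(w[layer_idx], sig_hidden1)
sig_hidden2 = sigmoid(sop_hidden2)

layer_idx = layer_idx + 1
sop_output = numpy.matmul(w[layer_idx], sig_hidden2)
predicted = sigmoid(sop_output)

# Error calculation ends the forward pass.
err = error(predicted, target)
print(predicted, err)
```

Because every step has the same shape (index the weights, multiply, apply the sigmoid), the three steps are natural candidates for a loop over layer_idx.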


After completing the forward pass, the next subsection works on the backward pass.

Backward pass

The implementation of the backward pass in the architecture in Fig. 6.1 is listed in the next code. The code calculates the gradients for the network layers (output and hidden) and then updates the weights.

Referring to Fig. 7.3, which maps the architectures in Fig. 7.1 and Fig. 6.1, the previous code can, with some modifications, calculate the gradients for:
1. Weights between the second hidden layer and the output layer.
2. Weights between the first hidden layer and the second hidden layer.
The modified code is listed in the next listing. This leaves the weights between the input layer and the first hidden layer untouched.

In the forward pass, the layer_idx variable was incremented from 0 to 2, but in the backward pass it is decremented from 2 back to 0. The next step is to discuss how to calculate the gradients and update the weights between the input layer and the first hidden layer.
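The two weight groups covered so far can be handled as in the following sketch, with layer_idx counting down from 2. It is an illustration under assumptions, not the chapter's listing: the helper functions, the squared-error derivative, the sample data, and the learning rate are chosen for the example, and w_old holds the weights as they were before any update.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def error_predicted_deriv(predicted, target):
    # Derivative of the assumed squared error with respect to the prediction.
    return 2.0 * (predicted - target)

def forward(w, x):
    sop_hidden1 = numpy.matmul(w[0], x)
    sig_hidden1 = sigmoid(sop_hidden1)
    sop_hidden2 = numpy.matmul(w[1], sig_hidden1)
    sig_hidden2 = sigmoid(sop_hidden2)
    sop_output = numpy.matmul(w[2], sig_hidden2)
    predicted = sigmoid(sop_output)
    return sop_hidden1, sig_hidden1, sop_hidden2, sig_hidden2, sop_output, predicted

# Assumed sample data, weights, and learning rate.
x = numpy.array([0.1, 0.4, 4.1])
target = 0.2
learning_rate = 0.001
rng = numpy.random.default_rng(seed=2)
w = [rng.random((5, 3)), rng.random((3, 5)), rng.random(3)]
w_old = [layer_w.copy() for layer_w in w]

(sop_hidden1, sig_hidden1, sop_hidden2,
 sig_hidden2, sop_output, predicted) = forward(w_old, x)

# Shared start of every chain: dError/dPredicted * dPredicted/dSOP of the output.
g_output = error_predicted_deriv(predicted, target) * sigmoid_sop_deriv(sop_output)

layer_idx = 2
# dSOP/dW of the output neuron is the activation carried by each connection.
grad_output_w = g_output * sig_hidden2
w[layer_idx] = w_old[layer_idx] - learning_rate * grad_output_w

layer_idx = layer_idx - 1
grad_hidden2_w = numpy.zeros_like(w_old[layer_idx])
for neuron_idx in range(w_old[layer_idx].shape[0]):
    # Continue the chain through the old output weight and this neuron's sigmoid.
    g_neuron = (g_output * w_old[2][neuron_idx]
                * sigmoid_sop_deriv(sop_hidden2[neuron_idx]))
    grad_hidden2_w[neuron_idx] = g_neuron * sig_hidden1
w[layer_idx] = w_old[layer_idx] - learning_rate * grad_hidden2_w
```

The weights at index 0 are deliberately left unchanged here. They need the longer chains discussed next.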

First hidden layer gradients

To make things simpler, the architecture diagram in Fig. 7.1 is edited as shown in Fig. 7.4 to show fewer connections. The connections available in this figure are:
1. Connections between the first neuron in the first hidden layer and all input neurons. The weight labels for these connections are also shown to make it easier to follow how the derivative chain is calculated. The 3 weights are labeled W11, W21, and W31.
2. Connections between the first neuron in the first hidden layer and all neurons in the second hidden layer.
3. Connections between the second hidden layer and the output neuron.

FIG. 7.4 Architecture in Fig. 7.1 with Fewer Connections.

The next discussion focuses only on how to calculate the gradients for the 3 weights between the input neurons and the first neuron in the first hidden layer. The labels of these weights are W11, W21, and W31. After understanding how the gradients are calculated for these 3 weights, the idea is easily generalized to calculate the gradients for all weights between the input layer and the first hidden layer.

Following the chain from the error to W11, there are 3 paths connecting them. Each path visits a different neuron in the second hidden layer. Because there are 3 neurons in the second hidden layer, there are 3 paths. The first path visits the first neuron in the second hidden layer. The chain of derivatives for this path is given in the next equation. There are 7 derivatives in the chain.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_6}\frac{dActiv_6}{dSOP_6}\frac{dSOP_6}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{11}}$$

The second path visits the second neuron in the second hidden layer. The chain of derivatives for this path is given in the next equation.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_7}\frac{dActiv_7}{dSOP_7}\frac{dSOP_7}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{11}}$$

Finally, the third path visits the third neuron in the second hidden layer, and the chain of derivatives for this path is given in the next equation.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_8}\frac{dActiv_8}{dSOP_8}\frac{dSOP_8}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{11}}$$

As a summary, all 3 chains connecting the error to W11 are listed together in the next 3 equations. Out of the 7 derivatives in each chain, 4 are identical for all paths and the remaining 3 change. These 3 derivatives are the ones at the center of the chain. Counting from either the left or the right, the derivatives that change are the third, fourth, and fifth ones.

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_6}\frac{dActiv_6}{dSOP_6}\frac{dSOP_6}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{11}}$$

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_7}\frac{dActiv_7}{dSOP_7}\frac{dSOP_7}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{11}}$$

$$\frac{dError}{dW_{11}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_8}\frac{dActiv_8}{dSOP_8}\frac{dSOP_8}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{11}}$$

Each of the 3 chains results in a gradient. Because there are 3 chains, there are 3 gradients to update the same weight. How is a single weight updated with 3 gradients? Simply, the weight could be updated by the first gradient to return an updated value of the weight. This updated weight is again updated by the next gradient. The process continues until all gradients are used to update the weight. Let’s have an example. Assume that the value of W11 is 0.9 and its 3 gradients are −0.1, 0.2, and 0.3. If the learning rate is 0.1, then the calculations are done according to the next formula.

$$W_{11new} = W_{11} - gradient \times learning\_rate$$

Here are the calculations:
1. Update the current value of W11 (0.9) using the first gradient (−0.1). The result is 0.9 − (−0.1)(0.1) = 0.91. Thus, the new value for W11 is 0.91.
2. Update the latest weight 0.91 using the second gradient (0.2). The result is 0.91 − (0.2)(0.1) = 0.89. Thus, the new value for W11 is 0.89.
3. Update the latest weight 0.89 using the third gradient (0.3). The result is 0.89 − (0.3)(0.1) = 0.86. Thus, the new value for W11 is 0.86.
After updating W11 by all the gradients, its final value is 0.86. This is one way of updating a weight when multiple gradients exist. Another way is to sum all gradients and then update the weight once. The sum of the 3 gradients is −0.1 + 0.2 + 0.3 = 0.4. Using the value 0.4, the updated weight is 0.9 − (0.4)(0.1) = 0.86.
The same value is returned by the 2 ways. Thus, either way can be used. From the previous discussion, there are 3 gradients for updating the weight W11 of the connection between the first input neuron and the first neuron in the first hidden layer.
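The arithmetic of the 2 update schemes can be checked with a few lines. The numbers are the ones used in the example (weight 0.9, gradients −0.1, 0.2, and 0.3, learning rate 0.1). The variable names are assumptions, and the sketch assumes that all 3 gradients were calculated from the same old weights before any update is applied, which is what makes the two schemes agree.

```python
learning_rate = 0.1
w11 = 0.9
gradients = [-0.1, 0.2, 0.3]

# Way 1: apply the gradients one after another.
w11_sequential = w11
for gradient in gradients:
    w11_sequential = w11_sequential - gradient * learning_rate

# Way 2: sum the gradients, then update once.
w11_summed = w11 - sum(gradients) * learning_rate

# Rounded only for display, to hide floating-point noise.
print(round(w11_sequential, 6), round(w11_summed, 6))
```

The update is linear in the gradient, so subtracting the gradients one by one or subtracting their sum gives the same weight up to floating-point rounding.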


Similarly, there are another 3 gradients for updating the weight W21 connecting this hidden neuron to the second input. These 3 gradients are calculated according to the 3 derivative chains given in the next equations. All derivatives in the 3 chains are identical except for the middle 3. Comparing these chains to those for the weight W11, all the derivatives are identical except for the last one.

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_6}\frac{dActiv_6}{dSOP_6}\frac{dSOP_6}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{21}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_7}\frac{dActiv_7}{dSOP_7}\frac{dSOP_7}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{21}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_8}\frac{dActiv_8}{dSOP_8}\frac{dSOP_8}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{21}}$$

Similar to the chains used for the weights W11 and W21, there are 3 chains for the weight W31, as given in the next equations. As usual, only the middle 3 derivatives change from one chain to another. Note that these chains are identical to the chains of W11 and W21 except for the last derivative.

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_6}\frac{dActiv_6}{dSOP_6}\frac{dSOP_6}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_7}\frac{dActiv_7}{dSOP_7}\frac{dSOP_7}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\frac{dPredicted}{dSOP_9}\frac{dSOP_9}{dActiv_8}\frac{dActiv_8}{dSOP_8}\frac{dSOP_8}{dActiv_1}\frac{dActiv_1}{dSOP_1}\frac{dSOP_1}{dW_{31}}$$

After understanding how the derivative chains and the gradients are calculated for the 3 weights connecting the input neurons to the first hidden neuron, the next step is to extend the backward pass code to calculate the gradients of all weights between the input layer and the first hidden layer. For each weight, the number of derivative chains is equal to the number of neurons in the second hidden layer. Because this layer has 3 neurons, the total number of chains is 3, where each chain returns a gradient. Generally, the total number of derivative chains (i.e., gradients) for each weight at a given layer equals the product of the numbers of neurons in all layers following that layer.

According to the network architecture in Fig. 7.1, the number of chains required for updating the weights in the first hidden layer equals the product of the numbers of neurons in all layers following that layer. There are 2 layers following this layer:
1. A second hidden layer with 3 neurons.
2. An output layer with 1 neuron.
The product of the numbers of neurons in these 2 layers is 3*1 = 3. Thus, the number of chains used for updating each weight of the first hidden layer is 3. If the output layer had 4 neurons, the total number of chains would be 3*4 = 12. In other words, 12 gradients would be calculated to update a weight connected to a single neuron in the first hidden layer.

Back to the Python implementation, there are 3 neurons in the second hidden layer. These 3 neurons are connected to all 5 neurons in the first hidden layer. For updating all weights of the first hidden layer, 2 loops can be used as given in the next code. One loop iterates through the 3 neurons in the second hidden layer and the other loop iterates through the 5 neurons in the first hidden layer. Within these loops, the gradient is calculated based on the derivative chain and the weights are updated.
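The 2 loops can be sketched as follows. This is an illustration under assumptions rather than the chapter's listing: the helper functions, the sample data, the seeded random weights, the learning rate, and the choice to sum the chain gradients before a single update are all made for the example.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def forward(w, x):
    sop_hidden1 = numpy.matmul(w[0], x)
    sig_hidden1 = sigmoid(sop_hidden1)
    sop_hidden2 = numpy.matmul(w[1], sig_hidden1)
    sig_hidden2 = sigmoid(sop_hidden2)
    sop_output = numpy.matmul(w[2], sig_hidden2)
    predicted = sigmoid(sop_output)
    return sop_hidden1, sig_hidden1, sop_hidden2, sig_hidden2, sop_output, predicted

# Assumed sample data and weights (old values, before any update).
x = numpy.array([0.1, 0.4, 4.1])
target = 0.2
learning_rate = 0.001
rng = numpy.random.default_rng(seed=3)
w_old = [rng.random((5, 3)), rng.random((3, 5)), rng.random(3)]
network_architecture = numpy.array([3, 5, 3, 1])

# Chains per first-layer weight: product of the sizes of the layers
# that follow the first hidden layer (3 * 1 = 3).
num_chains = int(numpy.prod(network_architecture[2:]))

(sop_hidden1, sig_hidden1, sop_hidden2,
 sig_hidden2, sop_output, predicted) = forward(w_old, x)

# dError/dPredicted * dPredicted/dSOP of the output (squared error assumed).
g_output = 2.0 * (predicted - target) * sigmoid_sop_deriv(sop_output)

grad_hidden1_w = numpy.zeros_like(w_old[0])
for hidden2_idx in range(w_old[1].shape[0]):        # 3 neurons, one chain each
    # Chain up to the SOP of this neuron in the second hidden layer.
    g_hidden2 = (g_output * w_old[2][hidden2_idx]
                 * sigmoid_sop_deriv(sop_hidden2[hidden2_idx]))
    for hidden1_idx in range(w_old[0].shape[0]):    # 5 neurons
        g_hidden1 = (g_hidden2 * w_old[1][hidden2_idx, hidden1_idx]
                     * sigmoid_sop_deriv(sop_hidden1[hidden1_idx]))
        # dSOP/dW equals the input, so this adds one chain's gradient
        # for all 3 weights entering the current first-layer neuron.
        grad_hidden1_w[hidden1_idx] += g_hidden1 * x

# Single update with the summed gradients.
w_new_hidden1 = w_old[0] - learning_rate * grad_hidden1_w
```

Summing inside the loops and updating once gives the same result as applying the 3 chain gradients one after another, as long as every chain is evaluated with the old weights.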

At this point, all weights of the entire network are successfully updated, and here is the complete code for building the example in Fig. 7.1.


The previous code just works for a single iteration. Using a loop, the weights are updated for some iterations as in the next code. Remember to update the w_old variable after each iteration.
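A training loop of this kind can be sketched as below. It is an assumption-laden illustration: the function name train, the learning rate, the number of iterations, and the sample data are hypothetical, and the backward pass is written in a compact matrix form that adds up the same chains discussed above rather than looping over neurons.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def train(x, target, w, learning_rate, num_iterations):
    """Repeat the forward and backward passes; return weights and error history."""
    errors = []
    for iteration in range(num_iterations):
        # Refresh w_old so every gradient in this iteration uses the same weights.
        w_old = [layer_w.copy() for layer_w in w]

        # Forward pass.
        sop_hidden1 = numpy.matmul(w_old[0], x)
        sig_hidden1 = sigmoid(sop_hidden1)
        sop_hidden2 = numpy.matmul(w_old[1], sig_hidden1)
        sig_hidden2 = sigmoid(sop_hidden2)
        sop_output = numpy.matmul(w_old[2], sig_hidden2)
        predicted = sigmoid(sop_output)
        errors.append((predicted - target) ** 2)

        # Backward pass (summed chains, written with matrix operations).
        g_output = 2.0 * (predicted - target) * sigmoid_sop_deriv(sop_output)
        g_hidden2 = g_output * w_old[2] * sigmoid_sop_deriv(sop_hidden2)
        g_hidden1 = (numpy.matmul(w_old[1].T, g_hidden2)
                     * sigmoid_sop_deriv(sop_hidden1))

        w[2] = w_old[2] - learning_rate * g_output * sig_hidden2
        w[1] = w_old[1] - learning_rate * numpy.outer(g_hidden2, sig_hidden1)
        w[0] = w_old[0] - learning_rate * numpy.outer(g_hidden1, x)
    return w, errors

# Assumed sample data and starting weights.
x = numpy.array([0.1, 0.4, 4.1])
target = 0.2
rng = numpy.random.default_rng(seed=4)
w = [rng.random((5, 3)), rng.random((3, 5)), rng.random(3)]

w, errors = train(x, target, w, learning_rate=0.01, num_iterations=2000)
print(errors[0], errors[-1])
```

Copying the weights into w_old at the top of each iteration is what keeps the gradients of earlier layers from being calculated with weights that were already updated in the same iteration.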


Fig. 7.5 shows how the predicted output changes by iteration. Note that using another value for the learning rate might speed up the process, so that a smaller error could be reached in fewer iterations. Fig. 7.6 shows the relation between the error and the iteration. Based on the previous implementation, it is worth mentioning that it can work with any number of inputs and any number of hidden neurons, but only in 2 hidden layers. The next section builds another network with a different number of neurons.

ANN with 2 hidden layers with 10 and 8 neurons

To make sure the implementation works with any number of neurons in the input and the 2 hidden layers, this section builds a network with 10 inputs, 8 neurons in the first hidden layer, and 4 neurons in the second hidden layer.


FIG. 7.5 Prediction vs. Iteration for the Network in Fig. 7.1. (Plot axes: Iteration Number, 0 to 80000, on the horizontal axis; Predicted, 0.2 to 0.8, on the vertical axis.)

FIG. 7.6 Error vs. Iteration for the Network in Fig. 7.1. (Plot axes: Iteration Number on the horizontal axis; Error, 0.00 to 0.35, on the vertical axis.)


There are just 2 changes in the code: 1. The x array is updated to include 10 inputs. 2. The network_architecture array is updated to reflect the new network architecture. The new network is implemented in the next code.
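The 2 changes can be sketched as follows. The random input values, the seed, and the list-comprehension form of the weight initialization are assumptions for illustration. The layer sizes (10 inputs, 8 and 4 hidden neurons, 1 output) come from the text.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

rng = numpy.random.default_rng(seed=5)  # seed is an assumption

# Change 1: the input array now holds 10 values (random stand-ins here).
x = rng.random(10)
target = 0.2

# Change 2: the architecture reflects the new layer sizes.
network_architecture = numpy.array([x.shape[0], 8, 4, 1])

# Weight creation is unchanged: one matrix per pair of successive layers.
w = [rng.random((network_architecture[i + 1], network_architecture[i]))
     for i in range(network_architecture.shape[0] - 1)]
w[-1] = w[-1][0]  # single output neuron: keep its weights as a vector

# The forward pass code is the same as for the smaller network.
sig_hidden1 = sigmoid(numpy.matmul(w[0], x))
sig_hidden2 = sigmoid(numpy.matmul(w[1], sig_hidden1))
predicted = sigmoid(numpy.matmul(w[2], sig_hidden2))

print([layer_w.shape for layer_w in w], predicted)
```

Nothing after the network_architecture line refers to a specific layer size, which is why no other edit is needed.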


FIG. 7.7 Prediction vs. Iteration for a Network with 2 Hidden Layers with 10 and 8 Neurons. (Plot axes: Iteration Number, 0 to 80000, on the horizontal axis; Predicted, 0.2 to 0.8, on the vertical axis.)


FIG. 7.8 Error vs. Iteration for a Network with 2 Hidden Layers with 10 and 8 Neurons. (Plot axes: Iteration Number, 0 to 80000, on the horizontal axis; Error, 0.00 to 0.35, on the vertical axis.)

Fig. 7.7 shows how the predicted output changes by iteration for the new architecture. Fig. 7.8 gives the relation between the error and the iteration.

Conclusion

In this chapter, the neural network implementation is made more generic, as it can work with any number of inputs and any number of hidden neurons within just 2 hidden layers. To use the code, just change the arrays x and network_architecture. In Chapter 8, the current implementation will be extended to include an additional hidden layer so that the network can work with 3 hidden layers.

Chapter 8

ANN with 3 hidden layers

Chapter outline
ANN with 3 hidden layers with 5, 3, and 2 neurons
Required changes in the forward pass
Required changes in the backward pass
Editing Chapter 7 implementation to work with 3 hidden layers
Preparing inputs, outputs, and weights
Forward pass
Backward pass
Python™ implementation
ANN with 10 inputs and 3 hidden layers with 8, 5, and 3 neurons
Conclusion

ABSTRACT

The implementation built in Chapter 7 works with a network of any number of inputs and any number of hidden neurons within just 2 hidden layers. In this chapter, there are 2 major changes:
1. Introducing a third hidden layer where any number of neurons can be used.
2. Using a generic way to calculate the sum of products (SOP) and activation function outputs in the forward pass.
The chapter discusses an architecture with 3 inputs, 1 output, and 3 hidden layers with 5, 3, and 2 hidden neurons, respectively.

ANN with 3 hidden layers with 5, 3, and 2 neurons

The architecture discussed in this chapter has 3 inputs, 1 output, and 3 hidden layers with 5, 3, and 2 neurons, respectively, as shown in Fig. 8.1. As in Chapter 7, there is no need to start the implementation from scratch. The implementation built in Chapter 7 can be used after making some changes to work with the third hidden layer in both the forward and backward passes.

Required changes in the forward pass

The network architecture in Fig. 7.1 is given again in Fig. 8.2, in which there is a total of 4 layers (1 input, 2 hidden, and 1 output). In this network, the second hidden layer is connected to the output layer. Thus, the outputs of the second hidden layer (Activ6 to Activ8) were fed as inputs to the output neuron. In this chapter, the second hidden layer is connected to a third hidden layer, not the output layer. As a result, there are 2 changes in the forward pass:
1. The outputs from the second hidden layer are fed as inputs to the third hidden layer.
2. The outputs from the third hidden layer are fed as inputs to the output layer.
This is everything that is needed in the forward pass. The next subsection discusses the changes in the backward pass.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00006-1 © 2021 Elsevier Inc. All rights reserved.

FIG. 8.1 ANN architecture with 3 hidden layers with 5, 3, and 2 neurons. (Figure content: Input Layer X1 to X3; Hidden Layer 1 SOP1 to SOP5; Hidden Layer 2 SOP6 to SOP8; Hidden Layer 3 SOP9 and SOP10; Output Layer SOP11 | Activ11 (Predicted); Error.)

FIG. 8.2 Replication of Fig. 7.1. (Figure content: Input Layer X1 to X3; Hidden Layer 1 SOP1 to SOP5; Hidden Layer 2 SOP6 to SOP8; Output Layer SOP9 | Activ9 (Predicted); Error.)


Required changes in the backward pass

To better understand the necessary edits in the backward pass, the 2 architectures in Figs. 8.1 and 8.2 are mapped together in Fig. 8.3. The 4 layers in Fig. 8.2 are mapped to the last 4 layers in the architecture used in this chapter, which are:
1. Output
2. Third hidden
3. Second hidden
4. First hidden
As a result, the backward pass calculations for these layers are identical, without changes, to those discussed in Chapter 7. In other words, the derivative chains for the weights between the following pairs of layers are calculated as discussed in Chapter 7:
1. (Third hidden, Output)
2. (Second hidden, Third hidden)
3. (First hidden, Second hidden)

FIG. 8.3 Mapping between the 2 architectures in Figs. 8.1 and 8.2. (Figure content: the 4-layer network of Fig. 8.2, with Input Layer, Hidden Layer 1, Hidden Layer 2, and Output Layer, mapped onto the 5-layer network of Fig. 8.1, with Input Layer, Hidden Layer 1, Hidden Layer 2, Hidden Layer 3, and Output Layer.)


The only weights not handled are those between the input layer and the first hidden layer. This chapter discusses how the derivative chains for those weights are calculated. Note that some layers in the 2 architectures have different numbers of neurons but it does not matter because the implementation does not depend on a specific number of neurons in either the hidden or the input layers. The next section edits the implementation built in Chapter 7 to do the necessary changes in both the forward and backward passes to build the architecture in Fig. 8.1.

Editing Chapter 7 implementation to work with 3 hidden layers

The implementation is usually divided into 3 parts:
1. Preparing Inputs, Outputs, and Weights.
2. Forward Pass.
3. Backward Pass.
Let’s start editing the 3 parts one by one.

Preparing inputs, outputs, and weights

In Chapter 7, the inputs, outputs, and weights were prepared using the next code. Think about the required changes to make this code work for the architecture of this chapter.

For this chapter, the same number of inputs and outputs are used. The only change is to edit the network_architecture variable to include an extra hidden layer with 2 neurons. Here is the new code.


These edits are everything required for preparing the inputs, outputs, and weights for the architecture of this chapter. Let’s move to editing the forward pass code.

Forward pass

The implementation of the forward pass in Chapter 7 is given in the next code. As stated previously, the activation function outputs of the second hidden layer (Activ6 to Activ8), stored in the sig_hidden2 variable, are fed to a third hidden layer, not to the output layer. Let’s make the necessary code changes.


For the architecture of this chapter, the code is listed in the next block. The SOPs of the third hidden layer are calculated by multiplying the weights matrix of this layer by the outputs from the second hidden layer. The result is saved in the sop_hidden3 variable. The sop_hidden3 variable is fed to the sigmoid function to return the outputs from the third hidden layer which are saved in the activ_hidden3 variable. This variable is then fed to the output neuron to return the predicted output. Finally, the error is calculated.
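A sketch of the forward pass with the extra hidden layer is shown below. The names sop_hidden3 and activ_hidden3 follow the text. The remaining names, the helper functions, the sample data, and the seeded weights are assumptions chosen so the sketch runs on its own.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    # Squared error, assumed here as the error function.
    return numpy.power(predicted - target, 2)

# Assumed sample data; layer sizes 3, 5, 3, 2, 1 come from the text.
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])
rng = numpy.random.default_rng(seed=6)
w = [rng.random((5, 3)),   # input -> first hidden layer
     rng.random((3, 5)),   # first hidden -> second hidden layer
     rng.random((2, 3)),   # second hidden -> third hidden layer
     rng.random(2)]        # third hidden -> output neuron (vector)

sop_hidden1 = numpy.matmul(w[0], x)
activ_hidden1 = sigmoid(sop_hidden1)

sop_hidden2 = numpy.matmul(w[1], activ_hidden1)
activ_hidden2 = sigmoid(sop_hidden2)

# New in this chapter: the second hidden layer feeds a third hidden layer.
sop_hidden3 = numpy.matmul(w[2], activ_hidden2)
activ_hidden3 = sigmoid(sop_hidden3)

# The third hidden layer, not the second, now feeds the output neuron.
sop_output = numpy.matmul(w[3], activ_hidden3)
predicted = sigmoid(sop_output)

err = error(predicted, target)
print(predicted, err)
```

Each added hidden layer costs one more pair of lines of exactly the same form, which motivates replacing the repeated pairs with a loop.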

According to the previous code, extra lines are added for handling the new hidden layer. For each added hidden layer, more lines must be added. The next subsection works on making the implementation generic so that the forward pass can use any number of hidden layers without changing the code.

Working with any number of layers

For each layer, the forward pass calculations can be summarized in 2 operations:
1. Calculating the sum of products.
2. Feeding the sum of products to the activation function.
These 2 operations are coded using the next 2 lines, which are repeated for each hidden/output layer. The first line calculates the SOPs and the second one applies the activation function (e.g., sigmoid) to these SOPs. Using a for loop, these 2 lines can be applied to all layers.

The generic implementation of the forward pass is listed in the next code. Before entering the loop, the sop_activ_mul list is created to hold the forward pass calculations of all layers. For each layer, there is a single row in the sop_activ_mul list. Each row holds 2 sublists:
1. The first sublist, named sop_temp, holds the SOPs for the layer.
2. The second sublist, named activ_temp, holds the outputs of the layer (activation function outputs).
The 2 sublists are temporarily saved in a variable named sop_activ_mul_temp, which is then appended to the main list sop_activ_mul.

For calculating the SOPs of a layer, the weights are multiplied by the input of that layer. For the unification of all layers, the current layer’s inputs are saved into a new variable named curr_multiplicand. It is initialized by the input array x. For each iteration, this variable is multiplied by the weights of each layer using the numpy.matmul() function. Because the outputs of the current layer are the inputs to the next layer, then the curr_multiplicand variable is updated in each iteration to hold the outputs of the current layer.


After the for loop completes, the sop_activ_mul array is converted from a list to a NumPy array to better handle it. For the architecture in Fig. 8.1, the shape of this array is (4, 2). Despite the 5 layers, the SOPs and the activation functions are calculated only for the pairs of successive layers. This leaves only 4 pairs of layers and thus the sop_activ_mul array has 4 rows. The first row holds the calculations between the input and first hidden layer, the second row holds the calculations between the first and second hidden layers, and so on. Table 8.1 summarizes the contents of the sop_activ_mul array. After calculating the SOPs and activation function outputs of all layers, the final step outside the loop is to calculate the error. The error is calculated by calling the error() function, which accepts the output of the last layer (predicted output) and the target output. The last layer can be indexed from the end using −1. Using index −1 returns both the SOPs and the predicted output of the output layer. To return only the predicted output, another index with value 1 is used. Error calculation is the last thing to do in the forward pass, after which the backward pass starts.
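The generic forward pass can be sketched as follows. The names sop_activ_mul, sop_temp, activ_temp, sop_activ_mul_temp, and curr_multiplicand follow the text. The helper functions, the sample data, and the seeded weights are assumptions. One deliberate difference is also an assumption of the sketch: sop_activ_mul is kept as a plain Python list with 4 rows of 2 entries instead of being converted to a NumPy array, because the rows hold arrays of different lengths.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def error(predicted, target):
    # Squared error, assumed here as the error function.
    return numpy.power(predicted - target, 2)

# Assumed sample data; the architecture 3, 5, 3, 2, 1 comes from the text.
x = numpy.array([0.1, 0.4, 4.1])
target = numpy.array([0.2])
network_architecture = numpy.array([x.shape[0], 5, 3, 2, 1])
rng = numpy.random.default_rng(seed=6)
w = [rng.random((network_architecture[i + 1], network_architecture[i]))
     for i in range(network_architecture.shape[0] - 1)]

sop_activ_mul = []
curr_multiplicand = x  # inputs of the current layer, starting with the sample
for layer_idx in range(len(w)):
    sop_temp = numpy.matmul(w[layer_idx], curr_multiplicand)
    activ_temp = sigmoid(sop_temp)

    sop_activ_mul_temp = [sop_temp, activ_temp]
    sop_activ_mul.append(sop_activ_mul_temp)

    # The outputs of this layer are the inputs of the next one.
    curr_multiplicand = activ_temp

# Row -1 is the output layer; column 1 holds its activation (the prediction).
predicted = sop_activ_mul[-1][1]
err = error(predicted, target)
print(predicted, err)
```

Adding or removing hidden layers now only requires editing network_architecture. The loop body stays the same.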

Backward pass

The next code implements the backward pass discussed in Chapter 7. Remember that this code works with the last 4 layers of the architecture of this chapter, but it does not work for the weights between the input layer and the first hidden layer.

TABLE 8.1 Contents of the sop_activ_mul array for the architecture in Fig. 8.1 (sop_activ_mul shape = (4, 2)).

Layers               | Row IDX | Column 0 (SOPs) | Column 1 (Outputs)
Input – Hidden 1     | 0       | SOP             | Activation
Hidden 1 – Hidden 2  | 1       | SOP             | Activation
Hidden 2 – Hidden 3  | 2       | SOP             | Activation
Hidden 3 – Output    | 3       | SOP             | Activation


The previous implementation is edited to work with this chapter, as listed next. This code has only minor changes compared to the previous code, such as using the sop_activ_mul array. It does not handle the weights between the input layer and the first hidden layer. The next discussion prepares the derivative chains necessary for calculating the gradients for these weights. Later, this discussion will be converted into code.


FIG. 8.4 Architecture in Fig. 8.1 with fewer connections. (Figure content: the inputs X1, X2, and X3 connect to the first neuron of Hidden Layer 1, SOP1 | Activ1, through the weights W11, W21, and W31; the remaining layers are Hidden Layer 2 with SOP6 to SOP8, Hidden Layer 3 with SOP9 and SOP10, and the Output Layer with SOP11 | Activ11 (Predicted) and the Error.)

For simplicity, the architecture diagram is edited, as given in Fig. 8.4, to focus on the connections between the input neurons and the first neuron in the first hidden layer. The 3 inputs X1, X2, and X3 are connected to that neuron with weights W11, W21, and W31, respectively. Remember from Chapter 7 that each weight connected to the first hidden layer had 3 chains. Each chain results in a gradient and thus the 3 chains return 3 gradients. The final gradient for the weight was calculated as the sum of the 3 gradients. Regarding the example of this chapter, how many chains are necessary for calculating the gradient of the weights W11, W21, and W31? From Chapter 7, the number of chains at a given layer equals the product of the number of neurons in the layers that follow the current hidden layer. After the current hidden layer, which is the first hidden layer, there are 3 layers (2 hidden layers and 1 output layer). The number of neurons in these 3 layers are 3, 2, and 1 and their product is 3*2*1 = 6. As a result, there are 6 chains necessary for calculating the gradient of each weight between the input layer and the first hidden layer. In other words, there are 6 paths connecting W11, W21, and W31 to the error. The 6 chains of the weight W11 are listed in the next 6 equations. Each chain has 9 derivatives. The first 2 derivatives in all chains are identical. dError dError dPrediced dSOP11 dActiv9 dSOP9 dActiv6 = dW11 dPredicted dSOP11 dActiv9 dSOP9 dActiv6 dSOP6 dSOP6 dActiv1 dSOP1 dActiv1 dSOP1 dW11

160 Introduction to deep learning and neural networks with Python™

dError dError dPrediced dSOP11 dActiv9 dSOP9 dActiv7 = dW11 dPredicted dSOP11 dActiv9 dSOP9 dActiv7 dSOP7 dSOP7 dActiv1 dSOP1 dActiv1 dSOP1 dW11 dError dError dPrediced dSOP11 dActiv9 dSOP9 = dW11 dPredicted dSOP11 dActiv9 dSOP9 dActiv8 dActiv8 dSOP8 dActiv1 dSOP1 dSOP8 dActiv1 dSOP1 dW11 dError dError dPrediced dSOP11 dActiv10 dSOP10 = dW11 dPredicted dSOP11 dActiv10 dSOP10 dActiv6 dActiv6 dSOP6 dActiv1 dSOP1 dSOP6 dActiv1 dSOP1 dW11 dError dError dPrediced dSOP11 dActiv10 dSOP10 = dW11 dPredicted dSOP11 dActiv10 dSOP10 dActiv7 dActiv7 dSOP7 dActiv1 dSOP1 dSOP7 dActiv1 dSOP1 dW11 dError dError dPrediced dSOP11 dActiv10 dSOP10 = dW11 dPredicted dSOP11 dActiv10 dSOP10 dActiv8 dActiv8 dSOP8 dActiv1 dSOP1 dSOP8 dActiv1 dSOP1 dW11 The first 3 chains visit the first neuron in the third hidden layer. The last 3 chains visit the second (last) neuron in the third hidden layer. If there are more than 2 neurons in the third hidden layer, then more derivative chains were to be created. The best way to implement that is by using a for loop with several iterations equal to the number of neurons in the third hidden layer. Each neuron in the third hidden layer has 3 chains to reach the input layer. The reason is that the layer proceeding the third hidden layer, which is the second hidden layer, has 3 neurons. Each chain uses a different neuron of these 3 neurons. If there are more than 3 neurons in the second hidden layer, then more chains were to be created. The best way for implementing that is also using a for loop with several iterations equal to the number of second hidden layer neurons. According to the previous 2 paragraphs, there are 2 for loops as in Fig. 8.5: 1. The first loop iterates through the third hidden layer neurons. This is the outer loop.


FIG. 8.5 Nested loops for working with the last 2 hidden layers in Fig. 8.1.

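2. The second loop iterates through the second hidden layer neurons. This is the inner loop.

As discussed in Chapter 7, each chain returns a gradient. Thus, there are 6 gradients. To update W11, start by applying the first gradient. The result is then updated by the second gradient, and so on. It is also possible to sum all 6 gradients and update the weight W11 once.

Similar to the chains of the weight W11, the chains used with the weight W21 are listed in the next equations. Compared to the chains of W11, each of which has 9 derivatives, only the last derivative changes.

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{9}}\,\frac{dActiv_{9}}{dSOP_{9}}\,\frac{dSOP_{9}}{dActiv_{6}}\,\frac{dActiv_{6}}{dSOP_{6}}\,\frac{dSOP_{6}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{21}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{9}}\,\frac{dActiv_{9}}{dSOP_{9}}\,\frac{dSOP_{9}}{dActiv_{7}}\,\frac{dActiv_{7}}{dSOP_{7}}\,\frac{dSOP_{7}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{21}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{9}}\,\frac{dActiv_{9}}{dSOP_{9}}\,\frac{dSOP_{9}}{dActiv_{8}}\,\frac{dActiv_{8}}{dSOP_{8}}\,\frac{dSOP_{8}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{21}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{10}}\,\frac{dActiv_{10}}{dSOP_{10}}\,\frac{dSOP_{10}}{dActiv_{6}}\,\frac{dActiv_{6}}{dSOP_{6}}\,\frac{dSOP_{6}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{21}}$$

$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{10}}\,\frac{dActiv_{10}}{dSOP_{10}}\,\frac{dSOP_{10}}{dActiv_{7}}\,\frac{dActiv_{7}}{dSOP_{7}}\,\frac{dSOP_{7}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{21}}$$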


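$$\frac{dError}{dW_{21}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{10}}\,\frac{dActiv_{10}}{dSOP_{10}}\,\frac{dSOP_{10}}{dActiv_{8}}\,\frac{dActiv_{8}}{dSOP_{8}}\,\frac{dSOP_{8}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{21}}$$

The chains used for calculating the gradient of the weight W31 are listed in the next equations. Compared to the chains of W11 and W21, each of which has 9 derivatives, only the last derivative changes.

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{9}}\,\frac{dActiv_{9}}{dSOP_{9}}\,\frac{dSOP_{9}}{dActiv_{6}}\,\frac{dActiv_{6}}{dSOP_{6}}\,\frac{dSOP_{6}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{9}}\,\frac{dActiv_{9}}{dSOP_{9}}\,\frac{dSOP_{9}}{dActiv_{7}}\,\frac{dActiv_{7}}{dSOP_{7}}\,\frac{dSOP_{7}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{9}}\,\frac{dActiv_{9}}{dSOP_{9}}\,\frac{dSOP_{9}}{dActiv_{8}}\,\frac{dActiv_{8}}{dSOP_{8}}\,\frac{dSOP_{8}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{10}}\,\frac{dActiv_{10}}{dSOP_{10}}\,\frac{dSOP_{10}}{dActiv_{6}}\,\frac{dActiv_{6}}{dSOP_{6}}\,\frac{dSOP_{6}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{10}}\,\frac{dActiv_{10}}{dSOP_{10}}\,\frac{dSOP_{10}}{dActiv_{7}}\,\frac{dActiv_{7}}{dSOP_{7}}\,\frac{dSOP_{7}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{31}}$$

$$\frac{dError}{dW_{31}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{11}}\,\frac{dSOP_{11}}{dActiv_{10}}\,\frac{dActiv_{10}}{dSOP_{10}}\,\frac{dSOP_{10}}{dActiv_{8}}\,\frac{dActiv_{8}}{dSOP_{8}}\,\frac{dSOP_{8}}{dActiv_{1}}\,\frac{dActiv_{1}}{dSOP_{1}}\,\frac{dSOP_{1}}{dW_{31}}$$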
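The chains listed above follow a regular pattern, so they can also be generated rather than written by hand. The following is a minimal sketch, not code from the book: the function name chains_for_weight and the neuron numbering (6 to 8 for the second hidden layer, 9 and 10 for the third, 11 for the output) are assumptions taken from the labels used in the equations.

```python
import itertools
import math

# Neuron labels as they appear in the chain equations (assumed numbering).
second_hidden = [6, 7, 8]
third_hidden = [9, 10]
output_layer = [11]

def chains_for_weight(weight_name, first_hidden_neuron=1):
    """Return every derivative chain for one weight as a list of strings."""
    chains = []
    # Outer choice: third hidden layer neuron; inner choice: second hidden layer neuron.
    for n3, n2 in itertools.product(third_hidden, second_hidden):
        chains.append([
            "dError/dPredicted",
            "dPredicted/dSOP11",
            f"dSOP11/dActiv{n3}",
            f"dActiv{n3}/dSOP{n3}",
            f"dSOP{n3}/dActiv{n2}",
            f"dActiv{n2}/dSOP{n2}",
            f"dSOP{n2}/dActiv{first_hidden_neuron}",
            f"dActiv{first_hidden_neuron}/dSOP{first_hidden_neuron}",
            f"dSOP{first_hidden_neuron}/d{weight_name}",
        ])
    return chains

# Number of chains = product of the neuron counts in the layers that follow.
layers_after_first_hidden = [len(second_hidden), len(third_hidden), len(output_layer)]
num_chains = math.prod(layers_after_first_hidden)

all_chains = {w: chains_for_weight(w) for w in ("W11", "W21", "W31")}
for chain in all_chains["W11"]:
    print(" * ".join(chain))
```

Each generated chain has 9 factors, and only the last factor differs between the three weights, which matches the pattern of the equations.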


FIG. 8.6 Nested loops for working with all the hidden layers in Fig. 8.1.

Note that the previous 18 chains are just for updating the weights between the input neurons and the first neuron in the first hidden layer. Because there are 5 neurons in the first hidden layer, the previous calculations are repeated for each of them. The calculations can be elegantly implemented using a for loop whose number of iterations equals the number of neurons in the first hidden layer. This adds a third for loop to the 2 for loops shown in Fig. 8.5. Fig. 8.6 shows how the 3 for loops are nested.

After understanding the theory behind implementing the gradient descent algorithm in the backward pass for updating the weights between the input layer and the first hidden layer, the next code gives its implementation.

The first for loop iterates from 0 to the number of neurons in the third hidden layer. This number is returned using g3.shape[0], which returns 2. This means the outer for loop goes through each of the 2 neurons in the third hidden layer. The second for loop iterates from 0 to the number of neurons in the second hidden layer. This number is returned using g5.shape[0], which returns 3. The third and last for loop iterates from 0 to the number of neurons in the first hidden layer. This number is returned by g7.shape[0], which returns 5, so the loop iterates through each of the 5 neurons in the first hidden layer.


FIG. 8.7 Derivative chain for calculating the gradient for W11.

The grad_hidden_input array, inside the third for loop, is calculated by multiplying the 9 variables (g9 to g1). Each one of these 9 variables corresponds to one of the 9 derivatives in the derivative chains. This is illustrated in Fig. 8.7 for the chain of weight W11. Think of g1 and g2 as constants across all chains because their corresponding derivatives do not change.

Because there are 6 derivative chains associated with each weight between the input layer and the first hidden layer, the array grad_hidden_input holds the result of one of the 6 derivative chains in each iteration of the innermost loop. As this loop is executed 6 times, all 6 derivative chains are processed and each weight is updated 6 times.

After discussing the implementation of the gradient descent algorithm for a network with 3 hidden layers, it is worth mentioning that any number of neurons can be used in these 3 hidden layers. This is in addition to using any number of inputs in the input layer. At this point, one big limitation remains: only a single output neuron is supported. The next section lists the complete code of the implementation.
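As an illustration of the 3 nested loops described above, the following is a minimal sketch rather than the book's listing. It assumes sigmoid activations in every layer, a squared error, no bias, and randomly initialized placeholder weights and inputs; the scalar names g1 to g9 mirror the 9 derivatives, but here the gradients of all chains are summed instead of applied one at a time. A finite-difference estimate is computed at the end as a sanity check.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def forward(x, weights):
    """Return the activation outputs of every layer, input included."""
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w @ activations[-1]))
    return activations

rng = numpy.random.default_rng(0)
x = numpy.array([0.1, 0.4, 4.1])   # placeholder inputs
target = 0.45
sizes = [3, 5, 3, 2, 1]            # inputs, 3 hidden layers, output
# weights[k] has shape (neurons in layer k + 1, neurons in layer k)
weights = [rng.uniform(-0.5, 0.5, (sizes[i + 1], sizes[i])) for i in range(4)]
w1, w2, w3, w_out = weights

a0, a1, a2, a3, a4 = forward(x, weights)
predicted = a4[0]

g1 = 2.0 * (predicted - target)        # dError/dPredicted (squared error assumed)
g2 = predicted * (1.0 - predicted)     # dPredicted/dSOP of the output neuron

grad_w1 = numpy.zeros_like(w1)
for n3 in range(w3.shape[0]):          # outer loop: third hidden layer
    g3 = w_out[0, n3]                  # dSOP_out/dActiv of neuron n3
    g4 = a3[n3] * (1.0 - a3[n3])       # dActiv/dSOP of neuron n3
    for n2 in range(w2.shape[0]):      # middle loop: second hidden layer
        g5 = w3[n3, n2]
        g6 = a2[n2] * (1.0 - a2[n2])
        for n1 in range(w1.shape[0]):  # inner loop: first hidden layer
            g7 = w2[n2, n1]
            g8 = a1[n1] * (1.0 - a1[n1])
            g9 = x                     # dSOP/dW: one value per input
            # One chain per (n3, n2) pair; the gradients of all chains are summed.
            grad_w1[n1] += g1 * g2 * g3 * g4 * g5 * g6 * g7 * g8 * g9

# Sanity check against a central finite-difference estimate of the gradient.
def error_for(w1_candidate):
    out = forward(x, [w1_candidate, w2, w3, w_out])[-1][0]
    return (out - target) ** 2

eps = 1e-6
numeric = numpy.zeros_like(w1)
for idx in numpy.ndindex(w1.shape):
    plus, minus = w1.copy(), w1.copy()
    plus[idx] += eps
    minus[idx] -= eps
    numeric[idx] = (error_for(plus) - error_for(minus)) / (2 * eps)

max_diff = numpy.max(numpy.abs(grad_w1 - numeric))
print("largest difference from the numerical gradient:", max_diff)
```

Summing the 6 chain gradients per weight and applying a single update is the second option mentioned earlier in the text; updating after every chain would instead change the weight 6 times per iteration.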

Python™ implementation

The complete code that implements a network with 3 hidden layers is listed in the next block. Note that it just goes through a single iteration.


Fig. 8.8 shows how the predicted output changes with the iteration number; the network makes the correct prediction (0.45) after 30,000 iterations. The error versus the iteration number is plotted in Fig. 8.9. After completing the example in which there is an architecture with 3 inputs and 3 hidden layers with 5, 3, and 2 neurons, respectively, the next section builds another architecture.

FIG. 8.8 Prediction vs. iteration for the network in Fig. 8.1. (Horizontal axis: iteration number; vertical axis: predicted output.)

FIG. 8.9 Error vs. iteration for the network in Fig. 8.1. (Horizontal axis: iteration number; vertical axis: error.)


ANN with 10 inputs and 3 hidden layers with 8, 5, and 3 neurons

The new architecture to be created in this section has 10 inputs and 3 hidden layers with 8, 5, and 3 neurons, respectively. Its implementation is given in the next code. Compared to the previous code, there are just 2 variables to be changed:
1. The input array x, to include 10 values.
2. The array network_architecture, to specify the number of hidden neurons in the 3 hidden layers.
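To make the effect of these 2 changes concrete, the following minimal sketch shows how the weight shapes follow from x and network_architecture. It is not the book's listing: the input values are random placeholders, the single output neuron and the uniform initialization range are assumptions, and the names layer_sizes and weight_shapes are introduced only for this illustration.

```python
import numpy

rng = numpy.random.default_rng(1)

# Change 1: the input array now holds 10 values (placeholders here).
x = rng.uniform(0.1, 1.0, size=10)

# Change 2: neurons in the 3 hidden layers.
network_architecture = numpy.array([8, 5, 3])

# Sizes of all layers: inputs, hidden layers, and a single output neuron.
layer_sizes = [x.shape[0], *network_architecture.tolist(), 1]

# One weight matrix per pair of consecutive layers,
# shaped (neurons in the next layer, neurons in the current layer).
weights = [
    rng.uniform(-0.1, 0.1, size=(layer_sizes[i + 1], layer_sizes[i]))
    for i in range(len(layer_sizes) - 1)
]

weight_shapes = [w.shape for w in weights]
print(weight_shapes)
```

Nothing else needs to change when the weights and the forward pass are written in terms of the layer sizes, which is the point of the generalization carried out in the earlier chapters.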


Fig. 8.10 shows how the predicted output changes with the iteration number. After 20,000 iterations, the network made the correct prediction. The error versus the iteration number is plotted in Fig. 8.11.

Conclusion

In this chapter, the neural network is implemented to support 3 hidden layers with the ability to use any number of neurons in the input and hidden layers. The forward and backward pass calculations for the new third hidden layer are discussed in detail. The chapter also generalized the forward pass calculations to be independent of the network architecture. This is done by calculating the SOPs and activation function outputs for all layers using for loops. In Chapter 9, the implementation will be extended to allow any number of hidden layers and any number of neurons within such layers.

FIG. 8.10 Prediction vs. iteration for a network with 3 hidden layers with 8, 5, and 3 neurons. (Horizontal axis: iteration number; vertical axis: predicted output.)

FIG. 8.11 Error vs. iteration for a network with 3 hidden layers with 8, 5, and 3 neurons. (Horizontal axis: iteration number; vertical axis: error.)

Chapter 9

Working with any number of hidden layers

Chapter outline
What to do for a generic gradient descent implementation?
Generic approach for gradients calculation
Output layer gradients
Hidden layer gradients
Python™ implementation
backward_pass() method
Output layer
Hidden layers
Example: Training the network
Making predictions
Conclusion

ABSTRACT

In the latest implementation built in Chapter 8, the neural network supports:
● Any number of inputs.
● Up to 3 hidden layers.
● Any number of neurons in the hidden layers.

Additionally, Chapter 8 introduced a general way of doing the forward pass calculations regardless of the network architecture. This is by calculating the sum of products (SOPs) and activation function outputs for all layers using loops. In this chapter, a milestone in the implementation is reached by building a generic implementation that uses any number of hidden layers.

What to do for a generic gradient descent implementation?

In the previous chapters, a specific number of hidden layers was used. For example, Chapter 8 used 3 hidden layers, Chapter 7 used 2 hidden layers, and Chapters 5 and 6 used just 1 hidden layer. Throughout these chapters, all steps were worked out manually before implementation. That is, the network architecture was graphed, and all steps required to implement the neural network, such as calculating the sum of products in the forward pass and the derivative chains in the backward pass, were discussed in detail before starting the Python implementation.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00005-X © 2021 Elsevier Inc. All rights reserved.


In this chapter, the implementation is generic and works with architectures of any number of layers, compared to the fixed architectures used in the previous chapters. Thus, the previous strategy of manually solving all steps before their implementation will not work. This section discusses a general rule for implementing the neural network regardless of the number of hidden layers.

As mentioned in Chapters 7 and 8, the 3 changes to make in the implementation to work with different architectures are:
1. Preparing inputs, outputs, and weights.
2. Forward pass.
3. Backward pass.

Throughout the previous chapters, preparing the inputs and weights and doing the forward pass calculations were generalized, and there is nothing more to do. Here is a summary of the generalized operations:
● In Chapter 4, the network works with any number of inputs.
● In Chapter 6, the entire network weights were prepared regardless of the network architecture.
● In Chapter 8, the forward pass is implemented regardless of the network architecture.

The previous chapters just focused on specific network architectures to work on their backward pass. The remaining challenge towards the generic neural network implementation is building a generic gradient descent algorithm in the backward pass. The implementation should prepare all derivative chains for all network weights and calculate their gradients regardless of the network architecture. This is discussed in this chapter.

Generic approach for gradients calculation

For building a generic implementation, a generic rather than specific network architecture is used, as in Fig. 9.1. It shows a network with N inputs and K hidden layers. The first hidden layer is given the label R and has A neurons, hidden layer S has B neurons, hidden layer T has C neurons, and so on until reaching the last hidden layer, which is layer K with D neurons. The architecture still has a single output. Supporting multiple outputs is easy and will be discussed in Chapter 10.

For the implementations in the previous chapters, what is the first derivative to calculate? It is the error to predicted output derivative. What is the next derivative? It is the derivative of the predicted output with respect to SOPout. SOPout is the label given to the sum of products at the output neuron. There are now 2 derivatives in hand. The next subsection calculates the derivatives of the weights between the hidden layer K and the output layer.

FIG. 9.1 Generic ANN architecture. (Input layer: X1, X2, X3, ..., XN. Hidden layer R: SOPR1 | ActivR1 to SOPRA | ActivRA. Hidden layer S: SOPS1 | ActivS1 to SOPSB | ActivSB. Hidden layer T: SOPT1 | ActivT1 to SOPTC | ActivTC. Hidden layer K: SOPK1 | ActivK1 to SOPKD | ActivKD. Output layer: SOPout | Activout (Predicted), followed by the error.)


Output layer gradients

For the chain from the error to the output layer’s weights, there is an additional derivative to calculate besides the previous 2 derivatives. This is the derivative of SOPout to the weights between the last hidden layer (layer K) and the output layer. The generic derivative chain is given in the next equation.

$$\frac{dError}{dW_{Od}} = \frac{dError}{dPredicted}\,\frac{dPredicted}{dSOP_{out}}\,\frac{dSOP_{out}}{dW_{Od}}$$

Because there are D neurons at layer K, there are D connections to the output neuron. In the previous equation, the subscript Od refers to the weight connecting the neuron with index d at layer K to the output neuron. Because each connection has a weight, there are D weights. Moreover, there is a derivative calculated for each weight, and thus there are D derivatives, one for each weight.

Note that working with the weights between the last hidden layer and the output layer is very simple because the architecture uses just a single output neuron. The weights of the hidden layers are more complex.
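The output layer chain above can be sketched in a few lines. This is an illustration under stated assumptions, not the book's code: the error is taken to be the squared difference between the predicted and target outputs, the output neuron uses the sigmoid function, D is set to 4, and the activation and weight values are placeholders.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

# Placeholder values: activation outputs of the D = 4 neurons at layer K
# and the D weights connecting them to the single output neuron.
activ_k = numpy.array([0.2, 0.7, 0.5, 0.9])
w_out = numpy.array([0.3, -0.1, 0.25, 0.05])
target = 0.45

sop_out = numpy.sum(w_out * activ_k)
predicted = sigmoid(sop_out)

# The 3 derivatives of the chain.
d_error_d_predicted = 2.0 * (predicted - target)        # squared error assumed
d_predicted_d_sop_out = predicted * (1.0 - predicted)   # sigmoid derivative
d_sop_out_d_w = activ_k                                 # one derivative per weight

# D gradients, one for each weight between layer K and the output neuron.
output_layer_grads = d_error_d_predicted * d_predicted_d_sop_out * d_sop_out_d_w
print(output_layer_grads)
```

Because SOPout is a sum of weight times activation terms, its derivative with respect to each weight is simply the activation that the weight multiplies, which is why the third factor is the activation vector of layer K.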

Hidden layer gradients

According to Fig. 9.1, the weights to be considered are the ones between the 2 hidden layers numbered K and T. Generally speaking, there are 3 arrays of derivatives to be calculated for a layer numbered Z:
1. Array 1: Derivatives of the SOPs of layer Z + 1 to the activation outputs of layer Z.
2. Array 2: Derivatives of the activation outputs of layer Z to the SOPs of layer Z.
3. Array 3: Derivatives of the SOPs of layer Z to the weights of layer Z.

The word array is deliberately used because the current and next discussions work with all weights together and not with individual ones. Note that the weights of layer Z are the weights connecting layer Z − 1 to layer Z. The numbers of elements in the 3 arrays are:
1. Array 1: (Number of Neurons in Layer Z + 1) * (Number of Neurons in Layer Z).
2. Array 2: Number of Neurons in Layer Z.
3. Array 3: (Number of Neurons in Layer Z) * (Number of Neurons in Layer Z − 1).

Note that these arrays of derivatives are just for layer Z. To prepare the derivative chains for updating the weights at layer Z, these derivatives must be multiplied by all the derivatives calculated in all previous layers.
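The sizes of the 3 arrays can be checked with a small numerical sketch. It is not taken from the book: sigmoid activations are assumed, the neuron counts and all values are placeholders, and the variable names are introduced only for this illustration.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

rng = numpy.random.default_rng(2)
n_prev, n_z, n_next = 4, 3, 2      # neurons in layers Z - 1, Z, and Z + 1

activ_prev = rng.uniform(0.0, 1.0, n_prev)          # activation outputs of layer Z - 1
w_z = rng.uniform(-0.5, 0.5, (n_z, n_prev))         # weights of layer Z (Z - 1 to Z)
w_next = rng.uniform(-0.5, 0.5, (n_next, n_z))      # weights of layer Z + 1 (Z to Z + 1)

sop_z = w_z @ activ_prev
activ_z = sigmoid(sop_z)

# Array 1: dSOP(Z + 1)/dActiv(Z). Each SOP is a sum of weight * activation,
# so the derivative is the weight itself.
array1 = w_next.copy()

# Array 2: dActiv(Z)/dSOP(Z), the sigmoid derivative at each neuron of layer Z.
array2 = activ_z * (1.0 - activ_z)

# Array 3: dSOP(Z)/dW(Z). The derivative for each weight is the activation it
# multiplies, so every neuron of layer Z gets the same row of layer Z - 1 outputs.
array3 = numpy.tile(activ_prev, (n_z, 1))

print(array1.shape, array2.shape, array3.shape)
```

The element counts are n_next * n_z, n_z, and n_z * n_prev, in line with the 3 formulas listed above.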


Assuming that the current layer is numbered 2 and there are 5 layers in the network, the derivative chains for updating the weights at layer 2 are built by multiplying its derivatives by those of layers 3, 4, and 5. The arrays used from layers 3, 4, and 5 are only array 1 and array 2. Derivatives in array 3 are exclusive to the current layer and are not used in the derivative chains of other layers.

It is worth remembering how many derivative chains are available for each weight. The number of chains at layer Z equals the product of the number of neurons at the layers following this layer. For example, if layers 3, 4, and 5 have 10, 8, and 6 neurons, then there will be a total of 10*8*6 = 480 chains for updating a single weight at layer 2.

Let’s apply the discussion according to Fig. 9.2, which makes it easy to know the required derivatives to be calculated. It shows the first 3 hidden layers in Fig. 9.1. The discussion is completely general and applicable to any example. There are 3 arrows at the top of the figure, one for each of the 3 arrays of derivatives listed previously:
1. The rightmost arrow for array 1 of derivatives.
2. The middle arrow for array 2 of derivatives.
3. The leftmost arrow for array 3 of derivatives.

Each arrow starts and ends at the 2 participants of the derivative. For the layer numbered S, array 1 holds the derivatives of the SOPs of layer S + 1, which is layer T, to the activation outputs of layer S. According to Fig. 9.2, the rightmost arrow starts from the SOPs of layer S + 1 (layer T) and ends at the activation outputs of layer S. Because layer S has B neurons, its activation outputs range from ActivS1 to ActivSB. Because layer T has C neurons, its SOPs range from SOPT1 to SOPTC.

FIG. 9.2 3 hidden layers with undefined number of neurons.


Because there are B neurons at layer S and C neurons at layer S + 1 (layer T), array 1 has C*B = CB derivatives: C derivatives for each of the B neurons at layer S, because each neuron at layer S is connected to all C neurons at layer T.

Array 2 holds the derivatives of the activation outputs of layer S to the SOPs of layer S. The labels of the SOPs of layer S range from SOPS1 to SOPSB because this layer has B neurons. How many derivatives are returned in this array? Because layer S has B neurons, there are B derivatives, 1 derivative for each of the B neurons at layer S. This array is represented in Fig. 9.2 by the middle arrow, which starts from the activation outputs of layer S and ends at the SOPs of layer S.

Array 3 of derivatives is represented in Fig. 9.2 by the leftmost arrow. It holds the derivatives of the SOPs of layer S to the weights of layer S. Note that the weights of layer S are the weights between layer S and layer S − 1, which is layer R in Fig. 9.2. How many derivatives are returned in this array? Because layer S has B neurons and layer S − 1 (layer R) has A neurons, each of the B neurons in layer S is connected to all A neurons in layer R. As a result, there will be a total of A*B = AB derivatives, A derivatives for each of the B neurons at layer S.

It is very important to note that array 3 works with each neuron in layer S independently of the other neurons. That is, A derivatives are calculated for each of the B neurons, returning an array of AB derivatives. Array 3 is identical for all neurons in layer S. The reason is that the derivatives of all neurons in layer S are calculated using the activation outputs of layer R, which are ActivR1 to ActivRA. Each neuron in layer S uses the same values for calculating the derivatives. As a result, array 3 is calculated only once for all neurons. Now, array 3 just has a single row but B columns, 1 for each neuron.

After going through the three arrays of derivatives at layer S, let’s summarize the sizes of the three arrays:
1. Array 1: Has C*B = CB derivatives. Each of the B neurons at layer S has C derivatives.
2. Array 2: Has B derivatives. Each of the B neurons at layer S uses the same B derivatives.
3. Array 3: Has B derivatives. Each of the B neurons at layer S uses the same B derivatives.

Fig. 9.3 shows how the three arrays are multiplied together. Array 1 has C rows and each row has B columns. The shape of a single row in array 1 is (1, B). Array 2 has just 1 row with B columns, so the shape of array 2 is (1, B). Array 3 also has just 1 row with B columns, so the shape of array 3 is (1, B).

The multiplication is very simple. It is about multiplying each row in array 1 of shape (1, B) by the single rows in array 2 and array 3. In other words, each row in array 1 is multiplied by the single row in array 2. The result is a single row of shape (1, B). This row is then multiplied by the single row in array 3 of shape (1, B). The result is a row of shape (1, B). The result of multiplying the three arrays has the same shape as array 1.

FIG. 9.3 Multiplications between the three arrays of derivatives.

Assume that layer S has 3 neurons (that is, B = 3) and layer S + 1 (layer T) has 2 neurons. The shape of array 1 is (2, 3). Here is an example of this array.

[[0.3, 0.1, 0.32]
 [0.02, 0.9, 0.14]]

Assume that the single row in array 2 is:

[0.05, 0.8, 0.013]

Also, assume the single row in array 3 is:

[0.21, 0.04, 0.13]

Regarding the first row in array 1, the multiplication is done as follows:
1. Multiply that row by the single row in array 2.
   [0.3, 0.1, 0.32] * [0.05, 0.8, 0.013] = [0.015, 0.08, 0.00416]
2. Multiply the result by the single row in array 3.
   [0.015, 0.08, 0.00416] * [0.21, 0.04, 0.13] = [0.00315, 0.0032, 0.0005408]

Regarding the second row in array 1, the multiplication is done as follows:
1. Multiply that row by the single row in array 2.
   [0.02, 0.9, 0.14] * [0.05, 0.8, 0.013] = [0.001, 0.72, 0.00182]
2. Multiply the result by the single row in array 3.
   [0.001, 0.72, 0.00182] * [0.21, 0.04, 0.13] = [0.00021, 0.0288, 0.0002366]

The result after multiplying all rows in array 1 by array 2 and array 3 is a new array of shape (2, 3).

[[0.00315, 0.0032, 0.0005408]
 [0.00021, 0.0288, 0.0002366]]

These derivatives need to be multiplied by the derivative chains of all other layers to produce the gradients by which the weights of layer S are updated.
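The row-by-row multiplication in the worked example maps directly onto NumPy broadcasting. The following short sketch reuses the numbers from the example; only the variable names are introduced here.

```python
import numpy

# Values from the worked example: array 1 has shape (2, 3),
# arrays 2 and 3 each hold a single row of 3 values.
array1 = numpy.array([[0.3, 0.1, 0.32],
                      [0.02, 0.9, 0.14]])
array2 = numpy.array([0.05, 0.8, 0.013])
array3 = numpy.array([0.21, 0.04, 0.13])

# Broadcasting multiplies every row of array 1 by the single row of array 2,
# and then every resulting row by the single row of array 3.
step1 = array1 * array2
result = step1 * array3

print(result)
```

Broadcasting removes the need for an explicit loop over the rows of array 1, and the result keeps the shape of array 1, as stated in the text.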


Calculations summary

Fig. 9.4 summarizes the calculations the gradient descent algorithm needs to build the derivative chains of a network with 3 hidden layers and an output layer. The products between arrays 1 and 2 of all layers are calculated and saved into the variables temp1 to temp4. Each temp variable is then multiplied by the derivative chains of the directly preceding hidden layer to get the derivative chains of the current layer. All chains can be produced by generating all possible combinations between the current array of derivatives at layer S and the derivative chains of the previous layer. Each different combination is regarded as a chain.

Note that the derivative chains are produced by multiplying array 1 and array 2 only. Array 3 is only multiplied by the derivative chains of its own layer, as in Fig. 9.5. For each chain, multiply its derivatives to return the gradient. If a weight has more than 1 chain, then sum the gradients returned from all chains and update the weight according to that sum.

After updating the weights at layer S, it is time to move to the next hidden layer, which is layer R, and repeat the previous steps to update its weights. After discussing the theory behind generalizing the implementation of the gradient descent algorithm, the next section starts the Python implementation.

Python™ implementation

In the previous chapters, a single Python script held the entire code. As the project becomes bigger, some organization is preferable. Thus, the internal operations of the project are moved into a script named MLP.py, which has a class named MLP, standing for Multi-Layer Perceptron. The script can be imported to create an instance of the MLP class. The content of the script is listed in the next code. The user is given a simple interface to call the methods in the MLP class to specify the network architecture, feed the data, train the neural network, and so on. Check the code at the book’s GitHub project: https://github.com/ahmedfgad/IntroDLPython/tree/master/Generic-ANN-GD/Ch10.

The methods within the MLP class are as follows:
● train(): The main interface that users call, passing the training inputs and network parameters.
● initialize_weights(): Initializes the network weights.
● forward_path(): Implements the forward pass.
● sigmoid(): Implements the sigmoid function.
● error(): Calculates the network error after the predicted output is produced.
● backward_pass(): Implements the backward pass.
● error_predicted_deriv(): Calculates the error to predicted output derivative.
● sigmoid_sop_deriv(): Calculates the activation output to SOP derivative.
● sop_w_deriv(): Calculates the SOP to activation output or weight derivative.
● deriv_chain_prod(): Generates all derivative chains.
● update_weights(): Updates the weights according to the gradient calculated from the chains.
● predict(): Makes predictions after the network is trained.

FIG. 9.4 Summary of the gradient descent calculations to work with 3 hidden layers.

FIG. 9.5 Calculating the gradients using array 3 of derivatives.


The train() method is the main and most important method of the project. It accepts all network parameters from the user and trains the network. It accepts the following parameters:
● x: Training inputs.
● y: Training output.
● net_arch: Network architecture.
● max_iter = 5000: Maximum number of iterations, which defaults to 5,000.
● tolerance = 0.0000001: Acceptable tolerance; training stops if it is reached.
● learning_rate = 0.001: Learning rate.
● activation = "sigmoid": Activation function to be used.
● debug = False: If True, some informational messages are printed while the network is being trained.

The train() method does 4 main tasks:
1. Initializes the weights by calling the initialize_weights() method.
2. Loops through the iterations and calls the forward_pass() method to produce the predicted output.
3. Calls the backward_pass() method after each prediction the network makes to update the weights.
4. After the network is trained, returns the trained model as a dictionary.

After the network is trained, all parameters of the network are saved in a dictionary which is returned by the train() method. This dictionary holds the following items:
● w: Trained network weights.
● max_iter: Maximum number of iterations.
● elapsed_iter: Number of iterations used for training the network.
● tolerance: Network tolerance.
● activation: Activation function name. The only function available at the current time is the sigmoid function.
● learning_rate: Learning rate.
● initial_w: Initial weights.
● net_arch: Network architecture.
● num_hidden_layers: Number of hidden layers.
● in_size: Number of inputs.
● out_size: Number of outputs.
● training_time_sec: Training time in seconds.
● network_error: Network error after being trained.
● derivative_chain: All derivative chains for the weights in all network layers. This is the chain calculated as the product of all combinations of array 1 and array 2 of derivatives across all layers.
● weights_derivatives: All derivatives for the weights in all network layers. This is array 3 of derivatives.
● weights_gradients: Gradients for the entire network weights.

Because all parameters necessary for restoring the network are saved (weights, activation function, and learning rate), the network can be easily restored and used for making predictions. This is discussed in the Making Predictions section of this chapter. In the MLP class, the reader is expected to be familiar with the entire implementation except for how the backward_pass() and deriv_chain_prod() methods work. This is why they are discussed in the next section.

backward_pass() method

The backward_pass() method implements the backward pass of training the network, in which the gradients for all network weights are calculated and then the weights are updated. This method accepts the following parameters:
● x: Inputs.
● y: Output.
● w: Network weights.
● net_arch: Network architecture.
● sop_activ_mul: An array holding all calculations in the forward pass.
● prediction: Predicted output in the last forward pass.
● learning_rate: Learning rate.

Using the backward_pass() method, the 3 arrays of derivatives (array 1, array 2, and array 3) are calculated at first for all network layers. After that, the network weights are updated.

Output layer

Working with the weights between the last hidden layer and the output layer is a special case because there is just a single neuron in the output layer. This is why such weights are updated immediately after calculating the 3 arrays, according to the next code. Updating such weights does not follow the approach used for the hidden layers’ weights.

The variable g1 holds the error to the predicted output derivative. It represents array 1 of derivatives. The variable g2 holds the predicted output to SOPout (SOP of the output neuron) derivative. It represents array 2 of derivatives. The variable output_layer_derivs holds the derivative chain for the output layer weights.

There are 2 empty lists defined:
1. layer_weights_derivs: A list holding the results of array 3 of derivatives.
2. layer_weights_grads: A list holding the gradients of all network weights, where each layer has a separate sublist.

The derivatives of the weights between the last hidden layer and the output layer are calculated and saved at the beginning of the layer_weights_derivs list. The same applies to the gradients, which are saved at the beginning of the layer_weights_grads list. Based on these 2 lists, the weights are updated.


If the network does not have any hidden layers, that is, the input layer is directly connected to the output layer, then the backward_pass() method returns after updating the output layer weights. This is done inside the if statement of the next code. The if statement checks whether the net_arch array has just 2 elements. These 2 elements are the number of inputs and number of outputs. If indeed there are just 2 elements, then this indicates that no hidden layers exist in the network. Thus, the execution of the backward_pass() method stops after executing the next code.
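For the special case described above, where net_arch has just 2 elements, training reduces to repeatedly updating the output layer weights. The following is a minimal sketch of that case only. It is not the book's train() or backward_pass(): the function name train_no_hidden, the squared error, the initialization range, and the input values are assumptions, while the parameter names and dictionary keys are borrowed from the lists given earlier in this chapter.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def train_no_hidden(x, y, net_arch, max_iter=5000, tolerance=1e-7, learning_rate=0.001):
    """Hypothetical helper: trains a network whose inputs feed the output neuron directly."""
    if len(net_arch) != 2:
        raise ValueError("this sketch only covers net_arch = [inputs, outputs]")
    rng = numpy.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=net_arch[0])   # assumed initialization
    error = numpy.inf
    elapsed_iter = 0
    while elapsed_iter < max_iter and error > tolerance:
        # Forward pass: a single sum of products followed by the sigmoid.
        predicted = sigmoid(numpy.sum(w * x))
        error = (predicted - y) ** 2
        # Backward pass: only the output layer chain exists, so the update
        # is applied and there is nothing further to propagate.
        g1 = 2.0 * (predicted - y)              # dError/dPredicted
        g2 = predicted * (1.0 - predicted)      # dPredicted/dSOP_out
        w = w - learning_rate * g1 * g2 * x     # dSOP_out/dW is the input itself
        elapsed_iter += 1
    return {"w": w, "elapsed_iter": elapsed_iter, "network_error": error}

model = train_no_hidden(numpy.array([0.1, 0.4, 4.1]), 0.45,
                        net_arch=[3, 1], max_iter=20000, learning_rate=0.1)
print(model["elapsed_iter"], model["network_error"])
```

The loop stops either when the error drops to the tolerance or when max_iter is reached, which mirrors the role these 2 parameters play in the train() method.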

This is everything required for working with the weights between the last hidden layer and the output layer. The next subsection discusses how to work with the hidden layers.

Hidden layers

If the network has hidden layers, the strategy for updating their weights is as follows:
1. Calculate the 3 arrays of derivatives for each layer.
2. Calculate the product between array 1 and array 2 for each layer. This returns a new array representing the derivatives of this layer.
3. Prepare the derivative chains for each layer by creating all the possible combinations between this array from the current layer and those of all previous layers.
4. Create the deriv_chain_final list, which has a sublist for each hidden layer holding the multiplication of all derivatives in all derivative chains for each weight. Note that this derivative chain is built using just array 1 and array 2.


Because the first derivative chain was calculated in the previous code [for the weights between the last hidden layer and the output layer], it is appended to the list.

deriv_chain_final.append(output_layer_derivs)

To calculate the derivatives for all the hidden layers, the nested for loops in the next code are used.

For a given layer with index curr_lay_idx, the derivatives of the SOPs of the layer numbered curr_lay_idx + 1 to the activation outputs of the current layer numbered curr_lay_idx are calculated and saved in the SOPs_deriv variable. This is array 1 of derivatives. The derivatives of the activation outputs of the layer numbered curr_lay_idx to the SOPs of the same layer are calculated and the result is saved in the ACTIVs_deriv variable. This is array 2 of derivatives.

After preparing array 1 and array 2 of derivatives, the next step is to multiply them. This is done using the temp list. It is named temp because it changes for each layer. Using a for loop, each row from array 1 [stored in the SOPs_deriv variable] is returned and multiplied by the single row in array 2 [stored in the ACTIVs_deriv variable]. After that, the derivative chains of the weights in the current layer are calculated by building all possible combinations between the results stored in the temp list and the derivatives of the previous layers stored in the deriv_chain_final list.
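As a rough illustration of the multiplication of array 1 by array 2, the following sketch builds both arrays for one hidden layer and multiplies them row by row. It assumes a sigmoid activation and uses made-up values; the names next_layer_weights and curr_layer_sops are hypothetical and are not taken from MLP.py.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

# Hypothetical values: the current layer has 3 neurons, the next layer has 2.
next_layer_weights = numpy.array([[0.1, 0.2, 0.3],
                                  [0.4, 0.5, 0.6]])   # shape (2, 3)
curr_layer_sops = numpy.array([[0.5, -1.0, 2.0]])      # shape (1, 3)

# Array 1: derivatives of the SOPs of layer curr_lay_idx + 1 with respect to
# the activation outputs of layer curr_lay_idx. Because each SOP is a weighted
# sum of those outputs, the derivatives are the weights of the next layer.
SOPs_deriv = next_layer_weights

# Array 2: derivatives of the sigmoid outputs with respect to their own SOPs.
activs = sigmoid(curr_layer_sops)
ACTIVs_deriv = activs * (1.0 - activs)                 # single row, shape (1, 3)

# Multiply each row of array 1 by the single row of array 2.
temp = []
for row in SOPs_deriv:
    temp.append(row * ACTIVs_deriv[0])
temp = numpy.array(temp)                               # shape (2, 3)
```

Each row of temp belongs to one neuron of the next layer and each column to one neuron of the current layer.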


To return all possible combinations, the deriv_chain_prod() method is used. It accepts 2 parameters:
1. layer_derivs: Derivatives of the current layer, which are calculated as the product between array 1 and array 2 of derivatives.
2. previous_lay_derivs_chains: Derivative chains of the previous layer. They are returned by using index -1 with the deriv_chain_final list.
The deriv_chain_prod() method uses the itertools.product() function to return all possible combinations of multiplications between the 2 terms layer_derivs and previous_lay_derivs_chains. After the derivative chains of the current layer are calculated, they are appended to the deriv_chain_final list.

After calculating array 1 and array 2 and building the derivative chain [saved in the deriv_chain_final list], the remaining step is to calculate array 3 of derivatives, which represents the derivatives of the SOPs of the layer numbered curr_lay_idx to the weights of the same layer. This is implemented in the if-else statement in the next code.
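To illustrate the combination step performed by deriv_chain_prod(), here is a simplified sketch that pairs every derivative of the current layer with every chain of the previous layer using itertools.product(). It is written as a standalone function over flat 1-D lists, which is an assumption made for brevity; the method in MLP.py may handle nested arrays differently. The input values are made up.

```python
import itertools
import numpy

def deriv_chain_prod(layer_derivs, previous_lay_derivs_chains):
    # Pair every derivative of the current layer with every derivative chain
    # of the previous layer, then multiply the two members of each pair.
    pairs = itertools.product(layer_derivs, previous_lay_derivs_chains)
    return numpy.array([curr * prev for curr, prev in pairs])

layer_derivs = [0.5, 2.0]                      # hypothetical current-layer derivatives
previous_lay_derivs_chains = [1.0, 3.0, 4.0]   # hypothetical previous-layer chains

chains = deriv_chain_prod(layer_derivs, previous_lay_derivs_chains)
print(chains)  # [0.5 1.5 2.  2.  6.  8. ]
```

With 2 current-layer derivatives and 3 previous chains, the result holds 2 x 3 = 6 products.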

After discussing the content of the MLP class, the next section gives an example in which the train() method is used.

Example: Training the network

To use the train() method of the MLP class, prepare the training data inputs, outputs, and network parameters, and then call the train() method to train the network. The next code gives an example of how to use the train() method.


The first line imports the MLP module. The inputs and the output are defined in the arrays x and y, respectively. The network architecture is defined in the network_architecture list, which creates a network with 2 hidden layers where the first layer has 7 neurons and the second layer has 5 neurons. After that, the train() method is called with the appropriate parameters. It returns the trained model as a dictionary holding information about the trained model, including the trained weights, training time, number of training iterations, derivative chains, and more. This dictionary can be used for restoring the model and using it for making predictions or resuming the training at another time.

After the network trains successfully, Fig. 9.6 shows how the predicted output changes for each iteration. The max_iter argument is set to 500 but the network reached an acceptable state after just 85 iterations. This is the benefit of using tolerance. Its default value is 0.0000001 (1/10,000,000). When the difference between the target and the predicted outputs is below the tolerance, the network stops training because it reached its desired state in fewer iterations than defined in the max_iter parameter. Fig. 9.7 shows the relationship between the network error and each iteration.

The derivative chains can be returned from the dictionary using the key derivative_chain according to the next line.


FIG. 9.6 Prediction vs. iteration for a network with 2 hidden layers.

FIG. 9.7 Error vs. iteration for a network with 2 hidden layers.


The returned array has 3 subarrays, one array for each layer. According to the previous example, there are 2 hidden layers with 7 and 5 neurons, respectively. This is why the first subarray has 7 elements and the second subarray has 5 elements. In the end, there is a subarray with just 1 element. This is the product of the derivative chain for the output neuron. The next two lines print the training time and the number of iterations used for training the network.

Here are the outputs of the print statements. The network took 0.13 seconds to be trained. It reached a stable state after just 85 iterations.
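The early stop caused by the tolerance can be sketched with a much smaller model than the MLP class. The example below trains a single sigmoid neuron with one weight and leaves the loop as soon as the prediction is within the tolerance of the target. The function name train_single_neuron, the dictionary key elapsed_iter, the initial weight, and the squared-error function are all assumptions made for illustration; they are not taken from MLP.py.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def train_single_neuron(x, y, learning_rate=0.5, max_iter=500, tolerance=1e-7):
    w = 0.1            # hypothetical initial weight
    iteration = 0
    pred = sigmoid(w * x)
    while iteration < max_iter:
        pred = sigmoid(w * x)
        if abs(y - pred) < tolerance:
            break      # desired state reached before max_iter
        # Gradient of the squared error (y - pred)**2 with respect to w.
        grad = -2.0 * (y - pred) * pred * (1.0 - pred) * x
        w = w - learning_rate * grad
        iteration += 1
    return {"w": w, "elapsed_iter": iteration, "pred": pred}

result = train_single_neuron(x=1.0, y=0.7, max_iter=500, tolerance=1e-3)
print(result["elapsed_iter"], result["pred"])
```

A looser tolerance ends training earlier, while a tighter one lets the loop run closer to max_iter.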

After training the network, the predict() method can be used for making predictions as discussed in the next section.

Making predictions

The MLP class has a method named predict() for making predictions. It is a simple method. Its implementation is given in the next code. The predict() method accepts two parameters:
1. trained_ann: Trained network. This is the model (i.e. dictionary) returned by the train() method.
2. x: A sample to predict its output.


Within the method, input size validation is applied to make sure the input size matches the input size that the network was trained with. If there is a mismatch, the method returns. If the input size matches the input size that the network expects, then the following items are retrieved from the trained_ann dictionary:
1. w: Trained weights.
2. activation: Activation function used for training the network. Up to this point, sigmoid is the only supported function. More functions could be added easily after preparing how they work in both the forward and backward passes.
After returning the weights and the activation function, the forward_pass() method is called after passing its three arguments (input, weights, and activation function). As usual, this method returns a NumPy array with all calculations in the forward pass. By indexing this array, the predicted output is accessed and returned by the predict() method. After discussing the predict() method, here is an example of the prediction it made for a new sample.
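A minimal sketch of the flow described for predict() follows. It is not the MLP.py implementation: the dictionary key in_size, the choice to store the activation as a callable, and the simplified forward_pass() that returns a list of layer outputs are assumptions made so the example runs on its own. Only the keys w and activation are named in the text.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def forward_pass(x, w, activation):
    # Simplified forward pass: returns the outputs of every layer,
    # where the last entry is the network prediction.
    layer_outputs = []
    layer_input = x
    for layer_weights in w:
        sop = numpy.matmul(layer_weights, layer_input)
        layer_input = activation(sop)
        layer_outputs.append(layer_input)
    return layer_outputs

def predict(trained_ann, x):
    # Input size validation ("in_size" is a hypothetical key).
    if x.shape[0] != trained_ann["in_size"]:
        print("Input size mismatch.")
        return None
    w = trained_ann["w"]
    activation = trained_ann["activation"]
    results = forward_pass(x, w, activation)
    return results[-1]

# Hypothetical trained network with 2 inputs, 2 hidden neurons, and 1 output.
trained_ann = {"w": [numpy.array([[0.1, 0.2], [0.3, 0.4]]),
                     numpy.array([[0.5, 0.6]])],
               "activation": sigmoid,
               "in_size": 2}
prediction = predict(trained_ann, numpy.array([1.0, 2.0]))
```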

Conclusion

This chapter achieved a milestone in the neural network Python implementation, as it now works with any number of hidden layers with any number of neurons. The chapter discussed a general strategy to prepare the derivative chains of all weights and calculate their gradients. The strategy depends on preparing 3 arrays holding the individual derivatives of the entire network. These arrays are given the names array 1, array 2, and array 3.

The Python implementation is organized so that the core functions and methods are moved into a script named MLP.py in which a class named MLP is created. The user is given a simple interface to use the code. All the user needs to do for creating and training a network is to create an instance of the MLP class and call its train() method after passing the appropriate parameters. The train() method returns the trained network as a dictionary that holds all the information about the network. The predict() method accepts this dictionary to make predictions for new samples.

In the next chapter, there are four achievements as the implementation will:
1. Work with any number of outputs.
2. Use more than 1 sample for training.
3. Support the bias in both forward and backward passes.
4. Support the batch gradient descent.

Chapter 10

Generic ANN

Chapter outline
Preparing initial weights for any number of outputs
Calculating gradients for all output neurons
Network with 2 outputs
Network with 3 outputs
Working with multiple training samples
Calculating the size of the inputs and the outputs
Iterating through the training samples
Calculating the network error
Implementing ReLU
New implementation for MLP class
Example for training network with multiple samples
Using bias
Initializing the network bias
Using bias in the forward pass
Updating bias using gradient descent
Complete implementation with bias
Stochastic and batch gradient descent
Example
Conclusion

ABSTRACT

In Chapter 9, a milestone was reached by building a generic neural network that works independently of the number of inputs, the number of hidden layers, and the number of hidden neurons within such layers. There were 2 main limitations, which are as follows:
1. The output layer can have only a single neuron. As a result, each input sample must have only a single output.
2. A single training sample is used.
This chapter overcomes the limitations described in Chapter 9 in addition to adding new features. There are 4 milestones in this chapter, which are as follows:
1. Work with any number of outputs and not only limited to a single output.
2. Use more than 1 training sample.
3. Introduce the bias in both forward and backward passes.
4. Support the batch gradient descent.

Preparing initial weights for any number of outputs

In the implementation built in Chapter 9, there was a script named “MLP.py,” which has a class named “MLP” in which all the operations for building, training, and testing the network exist. The main method in this class is the train() method.

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00010-3 © 2021 Elsevier Inc. All rights reserved.


One of the parameters accepted by the train() method is named “net_arch,” which accepts the network architecture. This parameter is passed to the initialize_weights() method to create an array of the initial network weights. In this section, the initialize_weights() method is edited to work with multiple outputs. The initialize_weights() method was designed to use only a single output. According to the implementation from Chapter 9 given in the next code, the “if” statement assumes that the output layer has only a single neuron.

Here are the changes in the initialize_weights() method to support multiple outputs:
1. The “if” statement at the beginning of the code is commented out.
2. The last 2 lines are commented out.

According to the edited initialize_weights() method, the weights array is prepared to work with more than 1 output.
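One way to prepare initial weights for any number of outputs is sketched below. It is a vectorized stand-in rather than the edited initialize_weights() from MLP.py: the low and high bounds are hypothetical parameters, and each layer is created in one call instead of neuron by neuron. The point it shows is that the output layer is treated like any other layer, so its weight matrix gets one row per output neuron.

```python
import numpy

def initialize_weights(net_arch, low=-0.1, high=0.1):
    # net_arch lists the layer sizes: inputs, hidden layers (if any), outputs.
    w = []
    for layer_idx in range(1, len(net_arch)):
        # One row per neuron in this layer and one column per neuron
        # (or input) in the previous layer.
        layer_weights = numpy.random.uniform(
            low=low, high=high,
            size=(net_arch[layer_idx], net_arch[layer_idx - 1]))
        w.append(layer_weights)
    return w

# 3 inputs, 1 hidden layer with 5 neurons, and 2 outputs.
w = initialize_weights([3, 5, 2])
print([layer.shape for layer in w])  # [(5, 3), (2, 5)]
```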


Note that there are no required changes in the forward pass. The next change is within the backward_pass() method as discussed in the next section.

Calculating gradients for all output neurons

In Chapter 9, the network used only a single output. As a result, just a single gradient was calculated in the output layer. The gradient was calculated according to the next line inside the backward_pass() method. To work with multiple outputs, the major change is to loop through the output neurons and calculate their gradients according to the next code.
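The idea of looping through the output neurons can be sketched as follows. This is an illustrative function, not the code of backward_pass(): the name output_layer_grads, the squared-error function, and the sigmoid activation are assumptions. For each output neuron it multiplies the 3 derivatives of the chain (error to prediction, prediction to SOP, and SOP to weights).

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def output_layer_grads(last_layer_inputs, w_out, target):
    # Forward step for the output layer only.
    pred = sigmoid(numpy.matmul(w_out, last_layer_inputs))
    grads = numpy.zeros_like(w_out)
    # One pass of the loop per output neuron.
    for neuron_idx in range(w_out.shape[0]):
        # Assumed error function: E = (pred - target)**2 per output.
        g1 = 2.0 * (pred[neuron_idx] - target[neuron_idx])  # error to prediction
        g2 = pred[neuron_idx] * (1.0 - pred[neuron_idx])     # prediction to SOP
        g3 = last_layer_inputs                                # SOP to weights
        grads[neuron_idx] = g1 * g2 * g3
    return grads

# Hypothetical values: 3 inputs to the output layer and 2 output neurons.
last_layer_inputs = numpy.array([0.2, 0.7, 0.1])
w_out = numpy.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])
target = numpy.array([0.05, 0.2])
grads = output_layer_grads(last_layer_inputs, w_out, target)
```

The result has the same shape as the output layer weights, with one row of gradients per output neuron.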


The MLP.py script is listed in the next block after making the required changes to work with multiple outputs.


After editing the code to work with multiple outputs, 2 examples are presented in the next 2 subsections that build networks with 2 and 3 outputs.

Network with 2 outputs

The minimum code required for building a neural network is as follows. The code imports the MLP module. It prepares the single training sample's inputs and outputs in addition to the network architecture. Note that the array y has 2 values, which means there are 2 outputs.

The train() method is called from the MLP class by feeding the training inputs, outputs, and network architecture. Note that there are more parameters to be received by this method as explained in Chapter 9. After going through the number of training iterations, which is 5000 by default, the network error is 0.0029 for the first output and 0.000000962 for the second output. Thus, the mean network error is 0.0014 for the 2 outputs. Fig. 10.1 shows how the network predictions change for the 2 outputs.

FIG. 10.1 Prediction vs. iterations for a network with 2 outputs.

Note that the network errors might be reduced by making some changes such as increasing the number of iterations, changing the learning rate value, or adding hidden layers. The next example sets the number of iterations to 20,000 and the learning rate value to 0.9.

These changes made a great impact on the network performance. The network errors for the 2 outputs are 9.93e-15 and 1.21e-19. These are very small errors. The network converged after just 10,995 iterations. For testing, the trained network is made to predict the outputs of the training sample using the predict() method as in the next code. The predicted outputs are 0.0500001 and 0.2. The second output matches its target exactly, and the first output is almost identical to its target of 0.05.

Fig. 10.2 shows how network predictions change by iteration.

FIG. 10.2 Prediction vs. iteration when the learning rate is 0.9.

Network with 3 outputs

The second example builds a network with 3 outputs as in the next code. The code uses a single hidden layer with 5 neurons.

FIG. 10.3 Prediction vs. iteration for a network with 3 outputs.

The errors of the trained network for the 3 outputs are 6.33307916e-15, 1.98006390e-16, and 4.02468881e-17. The predictions that the network made after being fed the training sample are 0.05000008, 0.19999999, and 0.90000001. They are very close to the desired outputs. Fig. 10.3 shows how the network predictions change by iteration. At this point, the implementation can successfully work with multiple outputs. The next change to consider is using multiple training samples.

Working with multiple training samples

The implementation built in the previous section works with only a single training sample. There are 3 main changes in the MLP class to support multiple training samples:
1. Calculating the size of the inputs and the outputs.
2. Using a loop for iterating through the training samples.
3. Calculating the network error as the sum of individual errors.
Apart from these 3 changes, everything works exactly as before. Each of these changes is discussed in the next 3 subsections.

Calculating the size of the inputs and the outputs

The first method called within the MLP class is the train() method. In its first few lines, the sizes of the inputs and the outputs are calculated according to these 2 lines. When there is just a single sample, the sizes of the x and y arrays are returned using shape[0]. What is the required change when multiple samples are used?

The next 2 lines prepare the x and y arrays for a single sample. Using the shape property, the shape of x is (3,) and the shape of y is (1,). The returned shapes are tuples of a single value. To return this value, the tuple is indexed with the value 0, which gives the size of the inputs, which is 3, and the size of the outputs, which is 1.

When there are multiple samples, index 0 will not return the size of the inputs or the outputs but the number of training samples. The next code prepares the x and y arrays for using multiple samples.

The shape of x is (2, 3) and the shape of y is (2, 1). If shape[0] is used, then the returned size of the inputs is 2 but it should be 3. Thus, shape[1] is used rather than shape[0]. This also holds for the output array.
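The difference between the two cases can be checked directly with NumPy. The array values below are made up for illustration; only the shapes match the ones discussed in the text.

```python
import numpy

# Single sample: 1-D arrays, so shape[0] is the number of inputs or outputs.
x_single = numpy.array([0.1, 0.4, 4.1])
y_single = numpy.array([0.2])
print(x_single.shape, y_single.shape)   # (3,) (1,)

# Multiple samples: one row per sample, so shape[0] is the number of samples.
x_multi = numpy.array([[0.1, 0.4, 4.1],
                       [4.3, 1.8, 2.0]])
y_multi = numpy.array([[0.2],
                       [0.7]])
print(x_multi.shape, y_multi.shape)     # (2, 3) (2, 1)

num_samples = x_multi.shape[0]   # 2
in_size = x_multi.shape[1]       # 3
out_size = y_multi.shape[1]      # 1
```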

Iterating through the training samples

The current implementation works with only a single sample. The part of the code that trains the network using a single sample is listed in the next block. This code assumes that the x and y arrays only have 1 sample. The code uses a while loop to train the network through a number of iterations. Within each iteration, the single sample is propagated through the forward and backward passes.


To work with multiple samples, a for loop is added within the while loop to fetch sample by sample from the data and feed it to the network. The new training loop is given in the next code. The while loop goes through the iterations. For each iteration, the for loop iterates through the samples. The inputs and outputs of each sample are returned in the curr_sample_x and curr_sample_y variables, respectively.

Calculating the network error

When one sample is used, the network error is calculated simply according to the error of predicting that single sample. When more samples are used, the error is the sum of the errors for the individual samples. Therefore, the previous code increments the network_error variable according to the error of each sample.

Note that the network may stop learning once the difference between the network prediction and the target is less than the tolerance. When just a single sample is used, the tolerance is compared to the difference between the predicted and target outputs of this sample. When multiple samples are used, the tolerance is compared to the sum of differences between the predicted and target outputs of all samples. This is why the pred_target_diff variable is incremented after the network makes a prediction.

At this point, all necessary changes for making the network work with multiple training samples and multiple outputs are accommodated. The next section adds a new activation function to the implementation, which is the rectified linear unit (ReLU).
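The structure of the multi-sample training loop described in this section can be sketched as below. To keep it short, the sketch trains a network without hidden layers and without bias, uses a squared error, and fixes the random seed; these choices and the default argument values are assumptions, not details of MLP.py. The variable names curr_sample_x, curr_sample_y, network_error, and pred_target_diff follow the text.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def train(x, y, learning_rate=0.5, max_iter=5000, tolerance=1e-7):
    rng = numpy.random.default_rng(0)
    # shape[1] gives the number of inputs and outputs per sample.
    w = rng.uniform(-0.1, 0.1, size=(y.shape[1], x.shape[1]))
    iteration = 0
    network_error = numpy.inf
    pred_target_diff = numpy.inf
    # The while loop goes through the iterations.
    while iteration < max_iter and pred_target_diff > tolerance:
        network_error = 0.0
        pred_target_diff = 0.0
        # The for loop goes through the samples.
        for sample_idx in range(x.shape[0]):
            curr_sample_x = x[sample_idx, :]
            curr_sample_y = y[sample_idx, :]
            pred = sigmoid(numpy.matmul(w, curr_sample_x))
            # Both totals are sums over all samples.
            network_error += numpy.sum((pred - curr_sample_y) ** 2)
            pred_target_diff += numpy.sum(numpy.abs(pred - curr_sample_y))
            # Update the weights after each sample.
            delta = 2.0 * (pred - curr_sample_y) * pred * (1.0 - pred)
            w = w - learning_rate * numpy.outer(delta, curr_sample_x)
        iteration += 1
    return w, network_error, iteration

# Hypothetical training data: 2 samples with 2 inputs and 1 output each.
x = numpy.array([[0.1, 0.4], [0.8, 0.3]])
y = numpy.array([[0.2], [0.7]])
w, network_error, iteration = train(x, y, max_iter=200)
```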


Implementing ReLU

Previously, just the sigmoid activation function was supported. This section builds the ReLU function in both the forward and backward passes. The implementation of the ReLU function is in the next code. This is used in the forward pass.

For the backward pass, the derivative of the ReLU function is calculated according to the next code.

Note that the train() method accepts an argument named “activation” that specifies which activation function is used. If it is set to sigmoid, then the sigmoid function is used. If it is set to relu, then the ReLU function is used. Here is the “if” statement that decides which function is used.
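A common way to write ReLU, its derivative, and the selection by name is sketched below. The function names relu_derivative and select_activation are hypothetical, and the derivative at exactly 0 is set to 0 by convention; the code in MLP.py may differ in both respects.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def relu(sop):
    # Forward pass: negative SOPs become 0, other values pass unchanged.
    return numpy.maximum(sop, 0)

def relu_derivative(sop):
    # Backward pass: the slope is 1 for positive SOPs and 0 otherwise.
    return numpy.where(sop > 0, 1.0, 0.0)

def select_activation(activation):
    # Maps the value of the activation argument to a function.
    if activation == "relu":
        return relu
    elif activation == "sigmoid":
        return sigmoid
    raise ValueError("Unsupported activation function: " + str(activation))

sop = numpy.array([-1.0, 0.0, 2.0])
print(relu(sop))             # [0. 0. 2.]
print(relu_derivative(sop))  # [0. 0. 1.]
```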

The next section lists the new implementation of the MLP class, which works with multiple training samples and multiple outputs and also supports the ReLU function.


New implementation for MLP class

The next code gives the most recent implementation of the MLP.py script. This code supports the following:
● Work with any number of inputs.
● Work with any number of outputs.
● Use of multiple training samples.
● Work with any number of hidden layers.
● Support for sigmoid and ReLU activation functions.


The next section gives an example of training a network that uses multiple training samples.

Example for training network with multiple samples

After completing the code that trains the network with multiple samples, this section gives an example as given in the next code. The code prepares the training data inputs and outputs in the x and y arrays, respectively. Note that the outputs of each sample are saved in a separate list within the array. The network architecture is defined in the list network_architecture, which indicates that a single hidden layer with 2 neurons is used.


The train() method is called for building and training the network. The activation function used is ReLU as specified by the activation argument of the train() method. In the end, the network is made to predict the outputs of the training samples. The number of iterations is set to 30,000 but the code reached a total error of 3.56e-15 after 17,486 iterations, which is acceptable. Fig. 10.4 shows how the total error of the network changes by iteration. The predictions of the network for the used samples are as follows. They are very close to the desired outputs.

At this point, the implementation works with one or more samples where each sample can have one or more outputs. Remember that the implementation uses the weights as the only parameters in building the network. The next section introduces the other type of parameter, the bias, into the network.

Using bias

This section discusses extending the latest implementation to use bias. There are 3 changes to be accommodated:
1. Initializing network bias.
2. Using bias in the forward pass for calculating the predicted outputs.
3. Updating the bias using the gradient descent algorithm in the backward pass.
The next 3 subsections handle each of these points.

FIG. 10.4 Error vs. iteration for a network trained with 3 samples.

Initializing the network bias

The method initialize_weights() is responsible for initializing the network weights. To also initialize the network bias, the method will be edited. The implementation of this method is listed in the next code. The code uses 2 for loops for initializing the entire network weights. The first for loop iterates through the network layers and the second for loop iterates through the neurons in each layer. The reason for looping through the neurons is that each neuron can have more than 1 weight.


As for the network bias, each neuron has only 1 bias. As a result, the network bias for the neurons inside a layer can be initialized at once inside the first for loop. The next code modifies the initialize_weights() method to initialize and return the network bias. An empty list named “b” holds the network bias. Using the numpy.random.uniform() function, the bias of an entire layer is returned in the rand_array variable and then appended into the list b. The list b is returned at the end.
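The two-loop structure with the added bias initialization can be sketched as follows. This is an approximation of the idea, not the edited method from MLP.py: the low and high bounds are hypothetical parameters, and the sketch returns only the weights and the bias. The names b and rand_array follow the text.

```python
import numpy

def initialize_weights(net_arch, low=-0.1, high=0.1):
    w = []
    b = []
    # First loop: go through the layers after the input layer.
    for layer_idx in range(1, len(net_arch)):
        num_neurons = net_arch[layer_idx]
        num_inputs = net_arch[layer_idx - 1]
        layer_weights = []
        # Second loop: each neuron has one weight per input.
        for neuron_idx in range(num_neurons):
            layer_weights.append(
                numpy.random.uniform(low=low, high=high, size=num_inputs))
        w.append(numpy.array(layer_weights))
        # Each neuron has a single bias, so the bias of the whole layer
        # is created at once inside the first loop.
        rand_array = numpy.random.uniform(low=low, high=high, size=num_neurons)
        b.append(rand_array)
    return w, b

# 3 inputs, 1 hidden layer with 4 neurons, and 2 outputs.
w_initial, b_initial = initialize_weights([3, 4, 2])
```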

After the initialize_weights() method is modified, here is the modified code that calls it inside the train() method. Now, the method returns 3 variables including the b_initial variable, which holds the initial network bias.

After initializing the network bias, next is to use it in the forward pass for calculating the predicted outputs.

Using bias in the forward pass

When the weights were the only parameters supported, the predicted outputs were calculated in the forward pass by calling the forward_path() method as given in the next line.


The implementation of the forward_path() method is listed in the next code. The weights, in the array w, are passed to the numpy.matmul() function to calculate the sum of products (SOPs) for all neurons in one layer.

For the bias, according to the discussion in Chapter 2, it is added to the SOP of each neuron. In other words, the bias is added to the result of the numpy.matmul() function. The modified forward_path() method that uses the bias is listed in the next code. The method accepts a new argument b representing the bias. The bias of each layer, returned by b[layer_num], is added to the result of the numpy.matmul() function.
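Adding the bias to the result of numpy.matmul() can be sketched as below. The function is a simplified stand-in for the modified forward_path(): it is a plain function that returns only the final prediction, and the default sigmoid activation is an assumption. The input values are made up.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def forward_path(x, w, b, activation=sigmoid):
    layer_input = x
    for layer_num in range(len(w)):
        # SOP of all neurons in the layer, plus one bias per neuron.
        sop = numpy.matmul(w[layer_num], layer_input) + b[layer_num]
        layer_input = activation(sop)
    return layer_input

# Hypothetical network with 2 inputs and 1 output neuron.
x = numpy.array([0.0, 0.0])
w = [numpy.array([[1.0, 1.0]])]

without_bias = forward_path(x, w, [numpy.array([0.0])])
with_bias = forward_path(x, w, [numpy.array([1.0])])
print(without_bias, with_bias)
```

With all inputs at 0, the SOP is 0 and the prediction is fixed at sigmoid(0) = 0.5 regardless of the weights; the bias is what lets the neuron move away from that value.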

By editing the forward_path() method, the bias can be successfully used in the forward pass to calculate the predicted outputs. The new method is called according to this line where the bias is passed as an argument.


The last change in the implementation to fully support using the bias is to edit the backward pass.

Updating bias using gradient descent

After the forward pass completes, it is time to update the network parameters in the backward pass using the gradient descent algorithm. The method backward_pass() implements the gradient descent algorithm. When just the weights were supported, it was called according to the next line.

The weights were updated in this method in 3 positions. The first position is inside the “if” statement given in the next code. This “if” statement is found at the beginning of the backward_pass() method. It updates the network weights when no hidden layers are used.

In the previous code, a for loop iterates through the neurons in the output layer and updates their weights according to the next line.

For updating the bias, the next line is used instead to calculate the derivative chain for the bias in the same way as for the weights.


The edited “if” statement that updates the network parameters (weight and bias) is listed in the next code.

The second position in which the weights are updated is inside the for loop given in the next code. This loop is executed when there are hidden layers in use. It is found in the middle of the backward_pass() method.

The edited loop that also updates the bias is listed in the next code.

The third and last position in which the weights are updated is inside the for loop given in the next code. This code is at the end of the backward_pass() method. The weights of each layer are updated by calling the update_weights() method.

The implementation of the update_weights() method is given in the next code. It loops through the neurons available in a given layer and updates their weights using the 2 arrays layer_weights_grads and deriv_chain_final.


When updating the network bias, there is no need for the gradients of the weights available in the layer_weights_grads array, and only the deriv_chain_final array is used. As a result, a new method named “update_bias()” is implemented to update the bias of a given layer. Its implementation is given in the next code. The method just accepts the bias b, the derivatives chain array deriv_chain_final, and the learning rate learning_rate.
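The bias update needs only the derivative chain because the derivative of a neuron's SOP with respect to its own bias is 1, so no extra factor is multiplied in. A minimal sketch of such an update is given below. It assumes that deriv_chain_final has already been reduced to one value per neuron of the layer, which may not match the array layout used in MLP.py; the input values are made up.

```python
import numpy

def update_bias(b, deriv_chain_final, learning_rate):
    # Gradient descent step: move each bias against its derivative chain.
    return b - learning_rate * deriv_chain_final

b = numpy.array([0.5, -0.2])                  # hypothetical layer bias
deriv_chain_final = numpy.array([0.1, -0.4])  # hypothetical chain per neuron
new_b = update_bias(b, deriv_chain_final, learning_rate=0.5)
```

A positive derivative lowers the bias and a negative one raises it, scaled by the learning rate.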

Complete implementation with bias

After addressing the necessary changes to use the bias, here is the complete code. Note that there is a new item with key b added to the trained_ann dictionary for storing the bias. This is to be restored inside the predict() method when the trained network is tested with new samples.


After the bias is embedded successfully into the code, the next and last edit in this chapter is to allow the gradient descent algorithm to work either in stochastic or batch modes. This is discussed in the next section.

Stochastic and batch gradient descent

The gradient descent algorithm implemented previously is actually called “stochastic gradient descent,” which updates the network parameters per sample. That is, a single sample is fed to the forward pass to calculate its predicted output(s). The gradients for the network are calculated based on the error of this single sample and then the parameters are updated. Besides the stochastic gradient descent, there are 2 other types, which are minibatch gradient descent and batch gradient descent.


The batch gradient descent updates the network parameters once per epoch (i.e. when all samples are fed to the network). Here are the steps of the batch gradient descent:
1. Feed sample by sample to the network.
2. Use the current network parameters w for making predictions.
3. Calculate the prediction error for the sample.
4. Update the parameters and save them into Wi.
5. Once all samples are fed to the network, calculate the average of the temporary parameters Wi to update the parameters w.
The network parameters are updated only after the epoch ends by averaging the parameters.

For the minibatch gradient descent, the weights are updated after a batch of samples is fed to the network. The number of samples in the batch is greater than 1 and less than the total number of training samples. In this type, the weights are updated more than once per epoch.

Here is a summary of the 3 variations of the gradient descent algorithm:
1. Stochastic gradient descent updates the weights after each sample.
2. Batch gradient descent updates the weights once per epoch.
3. Minibatch gradient descent updates the weights multiple times per epoch based on the batch size.
To allow the user to select either stochastic or batch gradient descent, a new optional argument is passed to the train() method, which is named “GD_type.” The new method is given in the next code. The accepted values for this argument are stochastic and batch.


For the batch gradient descent, there are 2 new lists to be created before the while loop as given in the next code. The list named “w_temp” stores the weights calculated for each sample. The other list b_temp is for the bias. Using a for loop, these 2 lists are initialized to zeros with the same size as the w and b lists.

The w_temp and b_temp lists are used to sum the parameters generated from all samples according to the next code. When the GD_type argument is set equal to batch, the weight and bias returned for each sample are temporarily saved in the 2 variables w_sample and b_sample. A for loop then adds the values in these 2 variables to the 2 lists named “w_temp” and “b_temp.” By doing this, the parameters from all training samples are summed into these 2 lists.

After calculating the sum of the parameters from all samples, next is to average them according to the next code. It starts with a for loop that divides the parameters by the number of samples. After that, the values inside the lists w_temp and b_temp are copied into the w and b lists. The lists w and b are used in the next iteration for making predictions for all samples. Finally, the w_temp and b_temp lists are reset back to zeros to start accepting the parameters calculated from each sample.
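The summing and averaging steps of the batch mode can be sketched for a single layer as follows. This is a reduced illustration under several assumptions: the network has no hidden layers, the activation is sigmoid, the error is the squared error, and the helper names sample_update and batch_epoch are hypothetical. In MLP.py the parameters are lists with one entry per layer, whereas here w and b are single arrays.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def sample_update(w, b, sample_x, sample_y, learning_rate):
    # One gradient descent step computed from a single sample.
    pred = sigmoid(numpy.matmul(w, sample_x) + b)
    delta = 2.0 * (pred - sample_y) * pred * (1.0 - pred)
    w_sample = w - learning_rate * numpy.outer(delta, sample_x)
    b_sample = b - learning_rate * delta
    return w_sample, b_sample

def batch_epoch(w, b, x, y, learning_rate=0.5):
    # Temporary parameters start at zero for every epoch.
    w_temp = numpy.zeros_like(w)
    b_temp = numpy.zeros_like(b)
    for sample_idx in range(x.shape[0]):
        # Every sample uses the same parameters w and b for its prediction.
        w_sample, b_sample = sample_update(
            w, b, x[sample_idx], y[sample_idx], learning_rate)
        # Sum the parameters generated from all samples.
        w_temp += w_sample
        b_temp += b_sample
    # Average the sums; the results replace w and b once per epoch.
    return w_temp / x.shape[0], b_temp / x.shape[0]

# Hypothetical data: 2 samples with 2 inputs and 1 output each.
x = numpy.array([[0.1, 0.4], [0.8, 0.3]])
y = numpy.array([[0.2], [0.7]])
w0 = numpy.array([[0.1, -0.2]])
b0 = numpy.array([0.05])
w1, b1 = batch_epoch(w0, b0, x, y)
```

Because every sample starts from the same w and b, averaging the per-sample parameters gives the same result as taking one step along the average of the per-sample gradients.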

Example

After updating the code for working with either batch or stochastic gradient descent, the next code gives an example that uses the batch gradient descent. The GD_type argument of the train() method is set to the word batch.


Conclusion

This chapter extended the implementation into a generic gradient descent algorithm that trains a neural network with any number of training samples, where each sample can have one or more outputs. The implementation also uses the bias in addition to the weights in both the forward and backward passes, and the gradient descent algorithm is updated to calculate the gradients of the bias. The gradient descent algorithm can also work in both the stochastic and batch modes. This chapter reached a solid state of building a generic artificial neural network that is trained using the gradient descent algorithm. The implementation might look complete, but there is more work to be done. For example, the learning rate is fixed across all iterations. The implementation can be edited so that the learning rate changes during training, which helps when the network no longer learns or learns very slowly.

Chapter 11

Running neural networks in Android

Chapter outline
Building the first Kivy app
Getting started with KivyMD
MDTextField
MDCheckbox
MDDropdownMenu
MDFileManager
MDSpinner
Training network in a thread
Neural network KivyMD app
neural.kv
main.py
Use the app
Building the Android app
Conclusion

ABSTRACT

The previous chapters developed a Python project for building a generic neural network that works on desktop computers. This chapter builds a graphical user interface (GUI) for the project using a library called “Kivy” so that the user can build the network through a number of clicks. Kivy is a cross-platform Python library for building natural user interfaces. Thanks to the python-for-android project, Kivy applications can also be made available on mobile devices (e.g. Android). This chapter makes the project available for Android devices. Please refer to Chapter 1 for the instructions for installing Kivy. Note that this chapter does not discuss many details about Kivy, and the reader may need to access other resources to learn more about it.

Building the first Kivy app

Kivy is a simple library to use for building a GUI that works on all platforms. A Python developer with just some knowledge of object-oriented programming can build Kivy applications easily. The official website of Kivy is kivy.org. A basic Kivy app without any widget is listed in the next code. A widget in Kivy is an element that appears on the screen. To build a Kivy app, the first thing is to create a new class that inherits the kivy.app.App class. The selected name for the child class is “NeuralApp.” Even if the child class is empty, it is a valid Kivy app but without any widgets.
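A minimal sketch of such an empty app is shown below, assuming Kivy is installed as described in Chapter 1. Only the class name NeuralApp and the parent class kivy.app.App come from the text; the rest is the standard way of starting a Kivy app.

```python
import kivy.app


class NeuralApp(kivy.app.App):
    # An empty child class is still a valid app; it shows a window without widgets.
    pass


if __name__ == "__main__":
    # run() opens the window and starts the event loop.
    NeuralApp().run()
```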

Introduction to Deep Learning and Neural Networks with Python™ https://doi.org/10.1016/B978-0-323-90933-4.00001-2 © 2021 Elsevier Inc. All rights reserved.


After running the app, the window in Fig. 11.1 appears. A naming convention for the child class is to end its name with the word App. The text before it is used as the app window title. This is why the window is given the title Neural.

FIG. 11.1 A basic Kivy app with no widgets.

Kivy has many widgets to place on the window. Generally, Kivy widgets are grouped into five categories (kivy.org/doc/stable/api-kivy.uix.html). The frequently used widgets are UX widgets and layouts.
1. UX widgets: The atomic widgets that render on the screen like buttons and labels.
2. Layouts: These are nonrendered widgets to group other widgets. Two of the most used layouts are box and grid.
3. Complex UX widgets: Nonatomic widgets that combine multiple atomic widgets. Some examples include file chooser and drop-down lists.
4. Behaviors widgets: Offer some graphics effects.
5. Screen manager: Creates multiple screens within the same app.

The next example builds a Kivy app that has just a single button. To create a button widget, an instance of the kivy.uix.button.Button class is created. Its constructor accepts an argument named text, which is assigned the text to be displayed on the button. In the kivy.app.App class, there is a method called build(), which returns the widget displayed on the screen.
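A single-button app along these lines could look like the following sketch. The button text "Press Me" is an arbitrary choice for the illustration.

```python
import kivy.app
import kivy.uix.button


class NeuralApp(kivy.app.App):
    def build(self):
        # The widget returned from build() becomes the root widget of the window.
        return kivy.uix.button.Button(text="Press Me")


if __name__ == "__main__":
    NeuralApp().run()
```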

Fig. 11.2 shows the result after running the previous code. It is possible to display multiple widgets on the window by grouping them into a layout widget. The next code creates a box layout, which groups 2 widgets: button and label. The button and label widgets are created in the button and label variables, respectively. The label is an instance of the kivy.uix.label.Label class. The box layout is created as an instance of the kivy.uix.boxlayout.BoxLayout class, which is saved into the box_layout variable. The class's constructor has a parameter named orientation, which defines the order of adding the widgets into the layout. It can be either vertical or horizontal. The box layout is a grouping widget in which other widgets are added. The add_widget() method is used to add a widget inside another. Finally, the box layout is returned from the build() method.

FIG. 11.2 A Kivy app with a single widget.


By running the previous code, the window in Fig. 11.3 appears. Note how the label and the button are placed vertically, where the button is at the top because it is the first widget added to the layout. Note that pressing the button does nothing because the button press event is not handled. Here is a line that handles the button press action to make it print a statement when pressed.
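The following sketch puts the box layout and the press handler together in one app. The widget texts and the method name button_press() are assumptions for the illustration; bind() and add_widget() are the standard Kivy calls for attaching an event handler and for nesting widgets.

```python
import kivy.app
import kivy.uix.boxlayout
import kivy.uix.button
import kivy.uix.label


class NeuralApp(kivy.app.App):
    def build(self):
        button = kivy.uix.button.Button(text="Press Me")
        # This line handles the press event; Kivy passes the pressed widget to the callback.
        button.bind(on_press=self.button_press)
        label = kivy.uix.label.Label(text="Label")

        box_layout = kivy.uix.boxlayout.BoxLayout(orientation="vertical")
        box_layout.add_widget(button)  # added first, so it appears at the top
        box_layout.add_widget(label)
        return box_layout

    def button_press(self, button):
        print("Button Pressed")


if __name__ == "__main__":
    NeuralApp().run()
```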

FIG. 11.3 A Kivy app with 2 widgets grouped into a box layout.


The previous discussion made a quick introduction to Kivy. The next section introduces KivyMD, an open-source Python library that offers more Kivy widgets and themes. Before starting, make sure KivyMD is installed.

Getting started with KivyMD

KivyMD is a library collecting Kivy widgets that are compliant with Google's material design. KivyMD supports the use of the original Kivy widgets in addition to the newly designed widgets. It has more themes than the default black theme in the Kivy apps. The way KivyMD is used is very close to using Kivy. Many Kivy widgets are accessible in KivyMD by adding MD at the beginning of their names. For example, the App class in Kivy is now MDApp in KivyMD, the Kivy Label widget is MDLabel in KivyMD, and so on. The next code builds the same app in Fig. 11.3 but using KivyMD widgets. The layout is an instance of the MDBoxLayout class with a vertical orientation. A button is created using the MDRectangleFlatButton widget, which is rendered with a border. By default, the button is given the minimum width and height to hold its text. The size_hint property is assigned the value [1, 1] to stretch the button so that it fills its area. The button press action is handled so that the label's text changes to Button Pressed. Finally, a label is created using the MDLabel widget. The halign property is assigned the value center to center the label's text horizontally.
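A sketch of such a KivyMD app is given below. It assumes a KivyMD release that still provides the MDRectangleFlatButton widget; the initial widget texts and the method name button_press() are chosen for the illustration.

```python
import kivymd.app
import kivymd.uix.boxlayout
import kivymd.uix.button
import kivymd.uix.label


class NeuralApp(kivymd.app.MDApp):
    def build(self):
        box_layout = kivymd.uix.boxlayout.MDBoxLayout(orientation="vertical")

        # size_hint=[1, 1] stretches the button so that it fills its area.
        button = kivymd.uix.button.MDRectangleFlatButton(text="Press Me",
                                                         size_hint=[1, 1])
        button.bind(on_press=self.button_press)

        # halign="center" centers the label text horizontally.
        self.label = kivymd.uix.label.MDLabel(text="Label", halign="center")

        box_layout.add_widget(button)
        box_layout.add_widget(self.label)
        return box_layout

    def button_press(self, button):
        self.label.text = "Button Pressed"


if __name__ == "__main__":
    NeuralApp().run()
```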

By running the previous code, the window in Fig. 11.4 appears. The first change compared to the regular Kivy app is the white background of the window. Another change is the button's text and border. Another type of button offered by KivyMD is MDRaisedButton, which has a background. Replacing MDRectangleFlatButton with MDRaisedButton in the previous code results in the window in Fig. 11.5.


FIG. 11.4 KivyMD app with button and label.

FIG. 11.5 MDRaisedButton button with a background.

The previous app builds the GUI inside the Python file. It is a good practice to separate the Python logic from the widget tree. This is done by creating the widget tree inside a KV file according to the next code. The properties of each widget are indented under it. The id of each widget defines the name by which it is accessed from the Python code. The MDRaisedButton and MDLabel widgets are children of the root widget MDBoxLayout as they are nested inside it.


After moving the widget tree into a KV file, here is the content of the Python file. Note how the MDLabel widget is accessed by its ID. Inside the build() method, the KV file is loaded by passing its name to the kivy.lang.Builder.load_file() function. If the name of the KV file is set to the lowercase form of the text before the word App in the child class name, which is neural in this case, then there is no need to load it explicitly. Kivy searches for a file with that name, neural.kv, and loads it automatically.
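The separation can be sketched as below. To keep the example in a single file, the KV text is held in a Python string and loaded with Builder.load_string(); in a real project the same text would be saved in a KV file and loaded with Builder.load_file() or discovered automatically. The ids, widget texts, and the button_press() method are assumptions for the illustration.

```python
import kivy.lang
import kivymd.app

# Widget tree in the KV language; this text would normally live in its own KV file.
KV = """
MDBoxLayout:
    orientation: "vertical"

    MDRaisedButton:
        id: button
        text: "Press Me"
        size_hint: [1, 1]
        on_press: app.button_press()

    MDLabel:
        id: label
        text: "Label"
        halign: "center"
"""


class NeuralApp(kivymd.app.MDApp):
    def build(self):
        # The root widget described by the KV text is returned to Kivy.
        return kivy.lang.Builder.load_string(KV)

    def button_press(self, *args):
        # Widgets that declare an id in KV are reachable through root.ids.
        self.root.ids.label.text = "Button Pressed"


if __name__ == "__main__":
    NeuralApp().run()
```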

There are many KivyMD widgets and each widget might have different variants (kivymd.readthedocs.io/en/latest/components/index.html). For example, there are 12 different button types including MDIconButton, MDFloatingActionButton, and more. To learn more about KivyMD, check its official documentation, which offers examples for each widget: kivymd.readthedocs.io.


In addition to MDRectangleFlatButton and MDRaisedButton, only the following widgets are discussed, which are used to build the Android app of this chapter:
● MDTextField
● MDCheckbox
● MDDropdownMenu
● MDFileManager
● MDSpinner

MDTextField

The KivyMD MDTextField widget accepts text input from the user. Some of the properties in this widget are:
● text: The text entered.
● hint_text: Static text at the top-right corner.
● helper_text: Helper text at the bottom of the text.
● helper_text_mode: Mode of the helper text. If set to on_focus, then it appears only when the widget is focused.

For the purpose of building a neural network, the MDTextField widget is used to accept the values of the following 4 parameters in the MLP.train() function:
1. learning_rate
2. tolerance
3. iterations
4. net_arch
The next KV code creates 4 MDTextField widgets for the 4 parameters.


FIG. 11.6 Adding 2 MDTextField to the widget tree.

Fig. 11.6 shows the window of the previous code. Note how the hint text is displayed. To access the values of these parameters inside the Python code, the next method fetches and validates the values entered in the 4 MDTextField widgets using their text attribute. The 4 values are saved in the 4 instance attributes named learning_rate, tolerance, max_iter, and net_arch. The method ensures that the learning rate is from 0.0 to 1.0, the values of the tolerance and number of iterations are integers, and the network architecture is a comma-separated list of integers.
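The validation rules listed above can be sketched as a plain function that takes the 4 strings read from the text fields. The function name parse_training_params() and the choice to raise ValueError are assumptions for the illustration; in the app, the strings would come from the text attribute of each MDTextField and an error message would be shown to the user instead.

```python
def parse_training_params(learning_rate_text, tolerance_text,
                          iterations_text, net_arch_text):
    """Convert the 4 text-field strings into typed values or raise ValueError."""
    learning_rate = float(learning_rate_text)
    if not 0.0 <= learning_rate <= 1.0:
        raise ValueError("The learning rate must be between 0.0 and 1.0.")

    # The text describes both values as integers, so int() is used here.
    tolerance = int(tolerance_text)
    max_iter = int(iterations_text)

    # "5, 3" becomes [5, 3]; int() ignores the spaces around each number.
    net_arch = [int(item) for item in net_arch_text.split(",")]
    return learning_rate, tolerance, max_iter, net_arch


params = parse_training_params("0.01", "1", "500", "5, 3")
```

Because float() and int() already raise ValueError for malformed text, a single try/except around the call is enough to catch every invalid field.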


MDCheckbox

The MDCheckbox widget creates a check box by which the user can activate or deactivate something. For building the neural network, it is used to decide the value of the debug parameter in the MLP.train() function. The next KV code updates the MDGridLayout widget created previously to add a box layout in which a label and a check box are created. The active property of the check box represents whether it is checked or not. The callback Python method debug_switch() is called when its state changes.


Fig. 11.7 shows the app window after the check box is used. Here is the implementation of the debug_switch() method. The active attribute of the MDCheckbox widget is True when the box is checked. Otherwise, it is False.

FIG. 11.7 Checkbox to control the debug parameter of the MLP.train() function.

MDDropdownMenu

The MDDropdownMenu widget in KivyMD displays a menu with multiple options from which the user can select. For the MLP.train() function, this widget is used to determine the values of 2 parameters:
1. activation: The activation function, which can be sigmoid or relu.
2. GD_type: The gradient descent type, which can be stochastic or batch.

The steps to create a working menu are as follows:
1. Create a button in the KV file that opens the menu when pressed/released.
2. Create an instance of the MDDropdownMenu widget in the Python file.
3. Implement a callback method that retrieves the selected menu item.

For the first step, here is the KV code that creates 2 buttons. The first button has the ID activation_menu, which opens a menu named “activation_menu” when pressed. This menu is created in the Python file. Similarly, the second button with the ID gdtype_menu opens the menu named “gdtype_menu.”

The second step is to create 2 instances of the MDDropdownMenu widget as in the next code. The 2 menus are created within the on_start() callback method of the KivyMD app. The following 4 parameters of the MDDropdownMenu class's constructor are used:
1. caller: The button that opens the menu.
2. items: The items listed on the menu. Each item is a dictionary with a key named text where its value is displayed in the menu. All the dictionaries are grouped into a list.
3. callback: The callback method/function that is called when a menu item is selected.
4. width_mult: The width of the menu item. This number is multiplied by 56dp for mobile and 64dp for desktop.


The last step is implementing the callback methods that are called when a menu item is selected. According to the next code, the activation_menu_callback() method is called when an item in the activation_menu menu is selected. The selection is saved into the activation attribute. The gdtype_menu_callback() method is associated with the gdtype_menu menu, where the selection is saved in the GD_type attribute.

After the app runs, the window in Fig. 11.8 appears, in which the 2 new buttons that call the 2 menus are added.

FIG. 11.8 Adding 2 buttons that call 2 dropdown menus.


FIG. 11.9 Dropdown menu opened.

When a button is clicked, a menu is opened to show its items. Fig. 11.9 shows the opened menu from which the activation function is selected.

MDFileManager

KivyMD has a widget called “MDFileManager” that opens a file chooser from which the user can select directories and files. This widget is used to select the paths of the 2 files holding the NumPy arrays of the training data inputs and outputs. The steps to create a working file chooser are as follows:
1. Create a button in the KV file.
2. Create an instance of the MDFileManager class in the Python file.
3. Implement a Python callback method that opens the MDFileManager widget when the button is pressed/released.
4. Implement callback methods to be called when a file/directory is selected and when the file manager is closed.

For the first step, the 2 buttons that open the 2 file managers are created in the next code. The first button has the ID input_data_file_button and calls the input_file_manager_open() method when pressed, which opens the file manager from which the path of the NumPy file holding the input data is selected. Similarly, the button with the ID output_data_file_button calls output_file_manager_open() to open the file manager for selecting the path of the NumPy file holding the output data.


The second step creates 2 instances of the MDFileManager class as in the next code. The first file manager is named “input_data_file_manager,” which is assigned 2 callback methods. The first one is input_file_select_path(), which is called when a file/directory is selected. The second one is input_file_exit_manager(), which is called when the file manager closes. Similarly, the second file manager is created and saved in the variable output_data_file_manager. Note that the ext attribute restricts the listed files to those with extension .npy.

The third step implements the callback methods that open the file manager as in the next code. The home directory of the file manager is passed to the show() method. For Android, it is important to change this directory to /storage because just / is not recognized by Android.


The next code implements the callback methods that are called when a file/directory is selected or when the manager is closed. This is the fourth and last step of building a file chooser. Note that it is possible to select a directory rather than a file or not to select anything at all. The input_file_select_path() and output_file_select_path() methods ensure that a file with the extension .npy is selected. The NumPy files are loaded, and the inputs and outputs are saved in the x and y attributes, respectively. Note that the arrays must have 2 dimensions. The first dimension represents the number of samples and the second one represents the sample's input (x array) or output (y array).

After running the app, Fig. 11.10 shows how it looks after adding the 2 buttons that open the file managers. When a button is pressed, a file manager opens from which the user can select a file with extension .npy as in Fig. 11.11.

FIG. 11.10 The caller buttons to the file manager.

FIG. 11.11 Opening a file manager to select a file with extension .npy.

At this time, the user can specify all the required parameters in the MLP.train() function through various Kivy widgets. The input received from each widget is saved into an attribute. Here is a list of the attributes and their associated widgets. Each attribute has a parameter in the MLP.train() function with the same name.
● x: MDFileManager
● y: MDFileManager
● max_iter: MDTextField
● tolerance: MDTextField
● learning_rate: MDTextField
● net_arch: MDTextField
● activation: MDDropdownMenu
● GD_type: MDDropdownMenu
● debug: MDCheckbox

Before building and training a neural network, the next section discusses a widget called “MDSpinner.”


MDSpinner

The MDSpinner widget offers a circular progress indicator that informs the user that an operation is in progress. The spinner has a boolean attribute named “active.” When True, the spinner rotates. When False, it disappears. The next code builds the spinner in the KV file.

When the spinner's active attribute is True, it looks as in Fig. 11.12.

FIG. 11.12 An active spinner.

Training network in a thread

Training a network can take some time; thus, it is preferred to do the processing in a background thread. To create a thread in Python, there is a class named “Thread” in the threading module. To create a thread, just create a new class that extends the Thread class as in the next code. In the class NeuralThread, the constructor accepts all the parameters of the MLP.train() function. An additional parameter called “app” refers to the Kivy app. It is used to access the widgets. The constructor creates instance attributes to hold all the passed arguments.

There is a method in the Thread class named “run(),” which is called once the thread starts. The previous code implements this method to call the MLP.train() function. Some information about the trained network is displayed on the label. The next code creates an instance of the NeuralThread class and calls its start() method.
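The thread pattern can be sketched without Kivy as follows. This is a simplified illustration: fake_train() is a hypothetical stand-in for MLP.train(), only 3 of the training parameters are passed, and a callback named on_done replaces the app parameter that the chapter uses to reach the label widget.

```python
import threading


def fake_train(x, y, learning_rate):
    """Hypothetical stand-in for MLP.train(); returns a dummy summary."""
    return {"num_samples": len(x), "learning_rate": learning_rate}


class NeuralThread(threading.Thread):
    def __init__(self, x, y, learning_rate, on_done):
        super().__init__()
        # Instance attributes hold the passed arguments until run() is called.
        self.x = x
        self.y = y
        self.learning_rate = learning_rate
        self.on_done = on_done  # in the app, this would update the label text
        self.result = None

    def run(self):
        # Called in the background thread once start() is invoked.
        self.result = fake_train(self.x, self.y, self.learning_rate)
        self.on_done(self.result)


messages = []
neural_thread = NeuralThread(x=[[0.1], [0.2]], y=[[1], [0]],
                             learning_rate=0.01, on_done=messages.append)
neural_thread.start()
neural_thread.join()  # only for this demo; a GUI app would not block here
```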


Now, everything in the Kivy app is developed. The next section lists the complete code of both the KV and Python files.

Neural network KivyMD app

The Kivy app consists of 2 files, which are the KV and Python files. Because the child class in the Kivy app is named “NeuralApp,” the KV file should be named “neural.kv” so that it is discovered and loaded automatically. For the Python file, there are no restrictions on its name at this time. To build the Android app, however, the Python file must be named “main.py.” This will be the main activity. If no Python file with that name exists, an error occurs while building the Android app. So, it is a good idea to set main.py as the Python file name. The next 2 sections list the contents of the neural.kv and main.py files.

neural.kv

The content of the neural.kv file is listed in the next code.


main.py

The next code lists the main.py file. Note that the home directory of the file manager is set to /storage, which works for Android. For a desktop computer, change this directory according to your preference.

import kivymd.app
import kivymd.uix.menu
import kivymd.uix.button
import kivymd.uix.filemanager
import numpy
import threading
import MLP

class NeuralApp(kivymd.app.MDApp):

    def on_start(self):
        self.x = None
        self.y = None
        self.net_arch = None
        self.max_iter = None
        self.tolerance = None
        self.learning_rate = None
        self.activation = None
        self.GD_type = None
        self.debug = False

        self.activation_menu = kivymd.uix.menu.MDDropdownMenu(caller=self.root.ids.activation_menu,
                                                              items=[{"text": "sigmoid"}, {"text": "relu"}],
                                                              callback=self.activation_menu_callback,
                                                              width_mult=4)

        self.gdtype_menu = kivymd.uix.menu.MDDropdownMenu(caller=self.root.ids.gdtype_menu,
                                                          items=[{"text": "stochastic"}, {"text": "batch"}],
                                                          callback=self.gdtype_menu_callback,
                                                          width_mult=4)

        self.input_data_file_manager = kivymd.uix.filemanager.MDFileManager(select_path=self.input_file_select_path,
                                                                           exit_manager=self.input_file_exit_manager)
        self.input_data_file_manager.ext = [".npy"]

        self.output_data_file_manager = kivymd.uix.filemanager.MDFileManager(select_path=self.output_file_select_path,
                                                                            exit_manager=self.output_file_exit_manager)
        self.output_data_file_manager.ext = [".npy"]

    def activation_menu_callback(self, activation_menu):
        self.root.ids.activation_menu.text = activation_menu.text
        self.activation = activation_menu.text
        self.activation_menu.dismiss()

    def gdtype_menu_callback(self, gdtype_menu):
        self.root.ids.gdtype_menu.text = gdtype_menu.text
        self.GD_type = gdtype_menu.text
        self.gdtype_menu.dismiss()

    def input_file_manager_open(self):
        self.input_data_file_manager.show('/storage')

    def input_file_select_path(self, path):
        self.input_file_exit_manager()
        if path[-4:] == ".npy":
            # kivymd.toast.toast(path)
            self.root.ids.input_data_file_button.text = path
            self.x = numpy.load(path)
            if self.x.ndim != 2:
                self.root.ids.label.text = "Error: The input array must have 2 dimensions."
                self.x = None
        else:
            self.x = None

    def input_file_exit_manager(self, *args):
        self.input_data_file_manager.close()

    def output_file_manager_open(self):
        self.output_data_file_manager.show('/storage')

    def output_file_select_path(self, path):
        self.output_file_exit_manager()
        if path[-4:] == ".npy":
            # kivymd.toast.toast(path)
            self.root.ids.output_data_file_button.text = path
            self.y = numpy.load(path)
            if self.y.ndim != 2:
                self.root.ids.label.text = "Error: The output array must have 2 dimensions."
                self.y = None
        else:
            self.y = None

    def output_file_exit_manager(self, *args):
        self.output_data_file_manager.close()

    def debug_switch(self, *args):
        self.debug = self.root.ids.debug.active

    def button_press(self, *args):
        self.learning_rate = self.root.ids.learning_rate.text
        try:
            self.learning_rate = float(self.learning_rate)
            if self.learning_rate >= 0.0 and self.learning_rate