Life of AI: A Complete Guide

Table of Contents:
Overview of Artificial Intelligence
What is AI?
Importance of AI
Speech Recognition
Understanding Natural Language
Computer Vision
Expert Systems
Heuristic Classification
Early work in AI
Logic/Mathematics
Computation
Psychology / Cognitive Science
Biology / Neuroscience
Evolution
Natural Language Processing
Common Techniques
AI and related fields
Searching And-Or graphs
Constraint Satisfaction Search
Forward Checking
Most-Constrained Variables
Heuristic Repair
Local Search and Metaheuristics
Exchanging Heuristics
Iterated Local Search
Tabu Search
Translating between English and Logic Notation
Truth Tables
Complex Truth Tables
Tautology
Equivalence
Propositional Logic
Predicate Calculus
First-Order Predicate Logic
Soundness
Completeness
Reasoning in Modal Logic
Possible world representations
Dempster-Shafer Theory
Fuzzy Set Theory
Fuzzy Set
Fuzzy Membership
Fuzzy Operations
Fuzzy Properties
Fuzzy Relations
3.1 Definition of Fuzzy Relation
Forming Fuzzy Relations
Max-Min and Min-Max Composition
Fuzzy Systems
Fuzzy Logic
Classical Logic
Fuzzification
Fuzzy Inference
Fuzzy Rule Based System
Defuzzification
Centroid method
Probabilistic Reasoning
Bayesian probabilistic inference
Definition and importance of knowledge
Knowledge Based Systems
Representation of knowledge
Knowledge Organization
Knowledge Manipulation
Matching techniques:
Measure for Matching
Distance Metrics
Matching like Patterns
The RETE matching algorithm
Natural Language Processing :
Overview of linguistics
Morphological Analysis
BNF
Basic parsing techniques
Augmented Transition Networks
Chart Parsing
Semantic Analysis
Rules for Knowledge Representation
Conflict Resolution
Rule-Based Expert Systems
Architecture of an Expert System
The Expert System Shell
Knowledge Engineering
CLIPS (C Language Integrated Production System)
Backward Chaining in Rule-Based Expert Systems
CYC
AI Deep Learning Frameworks for DS
How AI DL systems work
AI Main deep learning frameworks
AI Main deep learning programming languages
How to leverage DL frameworks
ETL processes for DL
AI Deploying data models
AI Assessing a deep learning framework
Interpretability
Model maintenance
AI Building a DL Network Using MXNet
Core components
Datasets description
Classification for mxnet
Creating checkpoints for models developed in MXNet
Artificial Intelligence Building an Optimizer Based on the Particle Swarm Optimization Algorithm
PSO algorithm for AI
Firefly optimizer PSO
PSO versus other optimization methods
AI Maximizing an exponential expression
AI Building an Advanced Deep Learning System
Standard Genetic Algorithm
GAs in action for AI
AI Advanced Building Deep Learning System
AI CNN components
Data flow and functionality
CNN Training process
AI Visualization of a CNN model
Recurrent Neural Networks
AI Alternative Frameworks in DS
AI Extreme Learning Machines (ELMs)
AI Motivation behind ELMs
AI Architectures of ELMs
AI Capsule Networks
AI Motivations behind CapsNets
AI Fully connected layer
AI Dynamic routing between capsules
Fuzzy sets
Python
Big data
Hadoop
Apache Spark
Machine Learning for AI
AI Perceptron & Neural Networks
Artificial Intelligence Decision Trees
AI Support Vector Machines
AI Probabilistic Models
AI Dynamic Programming and Reinforcement Learning
AI Evolutionary Algorithms
AI Time Series Models
Artificial Intelligence The Nature of Language
Mathematics for AI
Algebraic Structures for AI
Linear Algebra for AI
Internet of Things (IOT)
What is the IOT?
IOT Programming Connected Devices
IOT Digital Switches
IOT User Defined Functions
Ardos IOT
IOT Programming Raspberry Pi with C and Python
IOT Python Hellothere.py
IOT Python Functions
IOT Installation of Vim
IOT Programming in C
IOT Installing Wiring Pi
IOT Raspberry Pi with Raspbian Operating System
How to Setup the Raspbian Operating System
File System Layout IOT
IOT Programming in Raspberry
Galileo, Windows, and the IOT
IOT Creating the Server Applications
IOT Temperature Controller
IOT Creation of Tables and Controllers
IOT Seeding the Database
IOT Custom APIs
Conclusion
Python for Artificial Intelligence
Agents and Control for AI
AI Representing Search Problems
AI Reasoning with Constraints
Deep Learning for AI
Introduction
1.2 Historical Trends in Deep Learning
1.2.1 The Many Names and Changing Fortunes of Neural Networks
1.2.2 Increasing Dataset Sizes
1.2.3 Increasing Model Sizes
1.2.4 Increasing Accuracy, Complexity and Real-World Impact
Applied Math and Machine
Learning Basics
Linear Algebra
2.1 Scalars, Vectors, Matrices and Tensors
2.3 Identity and Inverse Matrices
2.4 Linear Dependence and Span
2.5 Norms
2.6 Special Kinds of Matrices and Vectors
2.7 Eigendecomposition
2.8 Singular Value Decomposition
2.9 The Moore-Penrose Pseudoinverse
2.10 The Trace Operator
2.11 The Determinant
2.12 Example: Principal Components Analysis
Probability and Information Theory
3.1 Why Probability?
3.2 Random Variables
3.3 Probability Distributions
3.3.1 Discrete Variables and Probability Mass Functions
3.4 Marginal Probability
3.5 Conditional Probability
3.6 The Chain Rule of Conditional Probabilities
3.7 Independence and Conditional Independence
3.8 Expectation, Variance and Covariance
3.9 Common Probability Distributions
3.9.1 Bernoulli Distribution
3.9.2 Multinoulli Distribution
3.9.3 Gaussian Distribution
3.9.4 Exponential and Laplace Distributions
3.9.5 The Dirac Distribution and Empirical Distribution
3.9.6 Mixtures of Distributions
3.10 Useful Properties of Common Functions
3.11 Bayes’ Rule
3.12 Technical Details of Continuous Variables
3.13 Information Theory
Reinforcement Learning
1. Introduction
1.1 Reinforcement Learning
1.2 Examples
1.3 Elements of Reinforcement Learning
1.4 An Extended Example: Tic-Tac-Toe
1.5 Summary
1.6 History of Reinforcement Learning
2. Evaluative Feedback
2.1 An n-Armed Bandit Problem
2.2 Action-Value Methods
2.3 Softmax Action Selection
2.4 Evaluation Versus Instruction
2.5 Incremental Implementation
2.6 Tracking a Nonstationary Problem
2.7 Optimistic Initial Values
2.8 Reinforcement Comparison
2.9 Pursuit Methods
2.10 Associative Search
2.11 Conclusions
3. The Reinforcement Learning Problem
3.1 The Agent-Environment Interface
3.2 Goals and Rewards
3.3 Returns
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 The Markov Property
3.6 Markov Decision Processes
3.7 Value Functions
3.8 Optimal Value Functions
3.10 Summary
Reinforcement Learning algorithms — an intuitive overview
Terminologies
Model-Free vs Model-Based Reinforcement Learning
I. Model-free RL
I.1. Policy optimization or policy-iteration methods
I.1.1. Policy Gradient (PG)
I.1.2. Asynchronous Advantage Actor-Critic (A3C)
I.1.3. Trust Region Policy Optimization (TRPO)
I.1.4. Proximal Policy Optimization (PPO)
I.2. Q-learning or value-iteration methods
I.2.1 Deep Q Neural Network (DQN)
I.2.2 C51
I.2.3 Distributional Reinforcement Learning with Quantile Regression (QR-DQN)
I.2.4 Hindsight Experience Replay (HER)
I.3 Hybrid
II.1. Learn the Model
Asynchronous Advantage Actor Critic (A3C) algorithm
Advantages:
Role of AI in Autonomous Driving
Virtual Assistants in Desktop Environments
Virtual Assistants in Mobile Contexts
Virtual Assistants and the Internet of Things
Virtual Assistants as a type of (Disembodied) Robot
Virtual Assistants as Social Robots
The Place for AI in Autonomous Driving:
AI in “safety-related” Autonomous Driving
Functionalities Based on Standards
Various approaches and products
Communication Displays
In-Car Virtual Assistants
Gateway to IoT (e.g. home control)
Customised Infotainment
Personal Health and Well-being
Recent Strategic Developments
Highlights on AI and Self-Driving Cars
Highlights on AI and Self-Driving Cars
Highlights on AI and Self-Driving Cars
Highlights on AI and Self-Driving Cars
Highlights on AI and Self-Driving Cars
Social Robots, AI and Self-Driving Cars
Social Robots, AI and Self-Driving Cars
AI & Industry 4.0
Some Recent Developments
ARTIFICIAL INTELLIGENCE AND ROBOTICS

LIFE OF AI ARTIFICIAL INTELLIGENCE BY WILLIAM KRYSTAL

Copyright © 2020 by William Krystal, Inc. All rights reserved. Published by William Krystal Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, William Krystal, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.williamkrystalepublisher.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.


Overview of Artificial Intelligence

What is AI?

Artificial Intelligence (AI) is a branch of science which deals with helping machines find solutions to complex problems in a more human-like fashion. This generally involves borrowing characteristics from human intelligence and applying them as algorithms in a computer-friendly way. A more or less flexible or efficient approach can be taken depending on the requirements established, which influences how artificial the intelligent behaviour appears. Artificial intelligence can be viewed from a variety of perspectives.

From the perspective of intelligence, artificial intelligence is making machines "intelligent" -- acting as we would expect people to act.
- The inability to distinguish computer responses from human responses is called the Turing test.
- Intelligence requires knowledge.
- Expert problem solving means restricting the domain so that significant relevant knowledge can be included.

From a business perspective, AI is a set of very powerful tools, and methodologies for using those tools to solve business problems.

From a programming perspective, AI includes the study of symbolic programming, problem solving, and search.
- Typically AI programs focus on symbols rather than numeric processing.
- Problem solving means achieving goals.
- Search: a solution can seldom be accessed directly; search may include a variety of techniques.
- AI programming languages include:
  - LISP, developed in the 1950s, is the early programming language most strongly associated with AI. LISP is a functional programming language with procedural extensions. LISP (LISt Processor) was specifically designed for processing heterogeneous lists -- typically lists of symbols. Features of LISP are run-time type checking, higher-order functions (functions that have other functions as parameters), automatic memory management (garbage collection), and an interactive environment.
  - PROLOG, the second language strongly associated with AI, was developed in the 1970s. PROLOG is based on first-order logic, is declarative in nature, and has facilities for explicitly limiting the search space.
  - Object-oriented languages are a class of languages more recently used for AI programming. Important features include the concepts of objects and messages: objects bundle data with the methods for manipulating that data; the sender specifies what is to be done while the receiver decides how to do it; and inheritance gives an object hierarchy in which objects inherit the attributes of more general classes of objects. Examples of object-oriented languages are Smalltalk, Objective-C, and C++. Object-oriented extensions to LISP (CLOS, the Common LISP Object System) and PROLOG (L&O, Logic & Objects) are also used.

Other common characterisations of artificial intelligence include:
- An electronic machine that stores large amounts of information and processes it at very high speed.
- A machine that passes the Turing test: the computer is interrogated by a human via a teletype, and it passes if the human cannot tell whether there is a computer or a human at the other end.
- The ability to solve problems.
- The science and engineering of making intelligent machines, especially intelligent computer programs; this is related to the similar task of using computers to understand human intelligence.
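The LISP features described above (heterogeneous lists of symbols, higher-order functions) can be sketched in any language; here is a minimal illustration in Python, with the `find_all` helper and the `knowledge` list invented for this example:

```python
# A sketch of symbol-oriented processing: heterogeneous lists of symbols
# and a higher-order function (a function taking a function as parameter).

def find_all(predicate, items):
    """Higher-order helper: return every item satisfying `predicate`."""
    return [item for item in items if predicate(item)]

# A heterogeneous list: symbols (strings) mixed with nested lists.
knowledge = ["bird", ["penguin", "cannot-fly"], "fish", ["sparrow", "can-fly"]]

# Select only the nested facts, then only the fliers among them.
facts = find_all(lambda x: isinstance(x, list), knowledge)
fliers = find_all(lambda f: "can-fly" in f, facts)
print(fliers)  # [['sparrow', 'can-fly']]
```

The same symbolic style (filtering lists of symbols with function arguments) is what LISP's list primitives and higher-order functions make natural.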

Importance of AI

Game Playing
You can buy machines that can play master-level chess for a few hundred dollars. There is some AI in them, but they play well against people mainly through brute-force computation, looking at hundreds of thousands of positions. To beat a world champion by brute force and known reliable heuristics requires being able to look at 200 million positions per second.
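The brute-force search behind such machines can be sketched as plain minimax over a tiny hand-made game tree (a toy example, not a chess engine; the tree values are invented):

```python
# Minimax: each player assumes the opponent plays optimally. Leaves hold
# scores from the maximizing player's point of view; internal nodes are
# lists of child positions.

def minimax(node, maximizing):
    if not isinstance(node, list):        # leaf: return its score
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# A depth-2 tree: we choose a move, then the opponent (minimizer) replies.
tree = [[3, 12], [2, 4], [14, 1]]
print(minimax(tree, True))  # 3: the best guaranteed outcome (max of mins)
```

Real chess programs add depth limits, evaluation heuristics, and alpha-beta pruning, but the underlying exhaustive look-ahead is this.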

Speech Recognition
In the 1990s, computer speech recognition reached a practical level for limited purposes. Thus United Airlines has replaced its keyboard tree for flight information by a system using speech recognition of flight numbers and city names. It is quite convenient. On the other hand, while it is possible to instruct some computers using speech, most users have gone back to the keyboard and the mouse as still more convenient.

Understanding Natural Language
Just getting a sequence of words into a computer is not enough. Parsing sentences is not enough either. The computer has to be provided with an understanding of the domain the text is about, and this is presently possible only for very limited domains.

Computer Vision
The world is composed of three-dimensional objects, but the inputs to the human eye and computers' TV cameras are two-dimensional. Some useful programs can work solely in two dimensions, but full computer vision requires partial three-dimensional information that is not just a set of two-dimensional views. At present there are only limited ways of representing three-dimensional information directly, and they are not as good as what humans evidently use.

Expert Systems
A "knowledge engineer" interviews experts in a certain domain and tries to embody their knowledge in a computer program for carrying out some task. How well this works depends on whether the intellectual mechanisms required for the task are within the present state of AI; when this turned out not to be so, there were many disappointing results. One of the first expert systems was MYCIN in 1974, which diagnosed bacterial infections of the blood and suggested treatments. It did better than medical students or practicing doctors, provided its limitations were observed. Namely, its ontology included bacteria, symptoms, and treatments, and did not include patients, doctors, hospitals, death, recovery, or events occurring in time. Its interactions depended on a single patient being considered. Since the experts consulted by the knowledge engineers knew about patients, doctors, death, recovery, etc., it is clear that the knowledge engineers forced what the experts told them into a predetermined framework. The usefulness of current expert systems depends on their users having common sense.

Heuristic Classification
One of the most feasible kinds of expert system, given the present knowledge of AI, is to put some information in one of a fixed set of categories using several sources of information. An example is advising whether to accept a proposed credit card purchase. Information is available about the owner of the credit card, his record of payment, and also about the item he is buying and about the establishment from which he is buying it (e.g., about whether there have been previous credit card frauds at this establishment).
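The credit-card example above amounts to hand-written rules mapping several information sources to a fixed set of categories. A toy sketch (all rule thresholds and category names here are invented for illustration):

```python
# Heuristic classification: combine several sources of information about
# a proposed purchase into one of a fixed set of categories via rules.

def classify_purchase(amount, good_payment_record, frauds_at_merchant):
    """Return 'accept', 'refer', or 'reject' for a proposed purchase."""
    if frauds_at_merchant > 2:
        return "reject"                   # risky establishment
    if amount > 1000 and not good_payment_record:
        return "refer"                    # large purchase, weak history
    return "accept"

print(classify_purchase(50, True, 0))     # accept
print(classify_purchase(5000, False, 0))  # refer
print(classify_purchase(200, True, 5))    # reject
```

The fixed category set and the hand-engineered rules are what make this style of expert system feasible: no open-ended reasoning is required.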

The applications of AI are shown in Fig 1.1.

Consumer Marketing
- Have you ever used any kind of credit/ATM/store card while shopping? If so, you have very likely been "input" to an AI algorithm: all of this information is recorded digitally.
- Companies like Nielsen gather this information weekly and search for patterns: general changes in consumer behaviour, tracking responses to new products, and identifying customer segments for targeted marketing (e.g., they find out that consumers with sports cars who buy textbooks respond well to offers of new credit cards).
- Algorithms ("data mining") search the data for patterns based on mathematical theories of learning.

Identification Technologies
- ID cards (e.g., ATM cards) can be a nuisance and a security risk: cards can be lost or stolen, passwords forgotten, etc.
- Biometric identification: walk up to a locked door equipped with a camera, fingerprint device, or microphone, and the computer uses a biometric signature (face, eyes, fingerprints, voice pattern) for identification.
- This works by comparing data from the person at the door with a stored library. Learning algorithms can learn the matching process by analysing a large library database off-line, and can improve their performance.

Intrusion Detection
- Computer security: each of us has specific patterns of computer use (times of day, lengths of sessions, commands used, sequences of commands, etc.). We would like to learn the "signature" of each authorized user, so that non-authorized users can be identified.
- How can the program automatically identify users? Record each user's commands and time intervals, characterize the patterns for each user, model the variability in these patterns, and classify (online) any new user by similarity to the stored patterns.
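The classify-by-similarity step can be sketched very simply: store one pattern vector per authorized user and compare a new session against them. The two features and the per-user values below are invented for illustration:

```python
# Intrusion detection sketch: classify a new session by similarity
# (smallest Euclidean distance) to stored per-user usage patterns.

import math

signatures = {                      # hypothetical per-user patterns
    "alice": (45.0, 9.0),           # (avg session minutes, login hour)
    "bob": (120.0, 22.0),
}

def closest_user(session):
    """Return (best_matching_user, distance) for an observed session."""
    def dist(sig):
        return math.dist(sig, session)
    user = min(signatures, key=lambda u: dist(signatures[u]))
    return user, dist(signatures[user])

user, d = closest_user((110.0, 23.0))
print(user)  # bob; a large distance to every signature would be suspicious
```

A real system would model variability (e.g. per-feature variance) rather than use raw distance, and would flag sessions whose distance to every stored signature exceeds a threshold.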

Machine Translation
- Language problems arise in international business: for example, at a meeting of Japanese, Korean, Vietnamese, and Swedish investors there may be no common language. If you are shipping your software manuals to 127 countries, the usual solution is to hire translators; it would be much cheaper if a machine could do this.
- How hard is automated translation? Very difficult! For example, translating English to Russian requires translating not only the words but their meaning as well.

Fig 1.1: Application areas of AI

Early work in AI
"Artificial Intelligence (AI) is the part of computer science concerned with designing intelligent computer systems, that is, systems that exhibit characteristics we associate with intelligence in human behaviour -- understanding language, learning, reasoning, solving problems, and so on."

Scientific Goal: To determine which ideas about knowledge representation, learning, rule systems, search, and so on, explain various sorts of real intelligence.

Engineering Goal: To solve real world problems using AI techniques such as knowledge representation, learning, rule systems, search, and so on.

Traditionally, computer scientists and engineers have been more interested in the engineering goal, while psychologists, philosophers and cognitive scientists have been more interested in the scientific goal.

The Roots
Artificial Intelligence has identifiable roots in a number of older disciplines, particularly: Philosophy, Logic/Mathematics, Computation, Psychology/Cognitive Science, Biology/Neuroscience, and Evolution. There is inevitably much overlap, e.g. between philosophy and logic, or between mathematics and computation. By looking at each of these in turn, we can gain a better understanding of their role in AI, and how these underlying disciplines have developed to play that role.

Philosophy
~400 BC: Socrates asks for an algorithm to distinguish piety from non-piety.
~350 BC: Aristotle formulated different styles of deductive reasoning, which could mechanically generate conclusions from initial premises. An example is Modus Ponens: if A implies B, and A is true, then B is true. For instance: given "when it's raining you get wet" and "it's raining", conclude "you get wet".
1596 – 1650: René Descartes proposed the idea of mind-body dualism -- part of the mind is exempt from physical laws.
1646 – 1716: Wilhelm Leibnitz was one of the first to take the materialist position, which holds that the mind operates by ordinary physical processes; this has the implication that mental processes can potentially be carried out by machines.
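The "mechanical generation of conclusions" that Modus Ponens permits can be sketched as a tiny forward-chaining loop (the rain example comes from the text; the code itself is an illustrative toy):

```python
# Forward chaining by Modus Ponens: facts are symbols, rules are (A, B)
# pairs meaning "A implies B". Repeatedly conclude B whenever A holds,
# until nothing new can be derived.

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in rules:
            if a in facts and b not in facts:
                facts.add(b)             # Modus Ponens: from A and A->B, get B
                changed = True
    return facts

rules = [("raining", "wet"), ("wet", "cold")]
print(sorted(forward_chain({"raining"}, rules)))  # ['cold', 'raining', 'wet']
```

This purely mechanical procedure is exactly what Aristotle's observation promises: conclusions follow from premises by form alone, with no understanding required.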

Logic/Mathematics
Earl Stanhope's Logic Demonstrator was a machine that was able to solve syllogisms, numerical problems in a logical form, and elementary questions of probability.
1815 – 1864: George Boole introduced his formal language for making logical inference in 1847: Boolean algebra.
1848 – 1925: Gottlob Frege produced a logic that is essentially the first-order logic that today forms the most basic knowledge representation system.
1906 – 1978: Kurt Gödel showed in 1931 that there are limits to what logic can do. His Incompleteness Theorem showed that in any formal logic powerful enough to describe the properties of natural numbers, there are true statements whose truth cannot be established by any algorithm.
1995: Roger Penrose tries to prove the human mind has non-computable capabilities.

Computation
1869: William Jevons's Logic Machine could handle Boolean algebra and Venn diagrams, and was able to solve logical problems faster than human beings.
1912 – 1954: Alan Turing tried to characterise exactly which functions are capable of being computed. Unfortunately it is difficult to give the notion of computation a formal definition. However, the Church-Turing thesis, which states that a Turing machine is capable of computing any computable function, is generally accepted as providing a sufficient definition. Turing also showed that there are some functions which no Turing machine can compute (e.g. the Halting Problem).
1903 – 1957: John von Neumann proposed the von Neumann architecture, which allows a description of computation that is independent of the particular realisation of the computer.
1960s: Two important concepts emerged: intractability (when solution time grows at least exponentially) and reduction (to 'easier' problems).

Psychology / Cognitive Science
Modern Psychology / Cognitive Psychology / Cognitive Science is the science which studies how the mind operates, how we behave, and how our brains process information. Language is an important part of human intelligence. Much of the early work on knowledge representation was tied to language and informed by research into linguistics. It is natural for us to try to use our understanding of how human (and other animal) brains lead to intelligent behaviour in our quest to build artificially intelligent systems. Conversely, it makes sense to explore the properties of artificial systems (computer models/simulations) to test our hypotheses concerning human systems.

Many sub-fields of AI are simultaneously building models of how the human system operates, and artificial systems for solving real world problems, and are allowing useful ideas to transfer between them.

Biology / Neuroscience
Our brains (which give rise to our intelligence) are made up of tens of billions of neurons, each connected to hundreds or thousands of other neurons. Each neuron is a simple processing device (e.g. just firing or not firing depending on the total amount of activity feeding into it). However, large networks of neurons are extremely powerful computational devices that can learn how best to operate. The field of Connectionism or Neural Networks attempts to build artificial systems based on simplified networks of simplified artificial neurons. The aim is to build powerful AI systems, as well as models of various human abilities. Neural networks work at a sub-symbolic level, whereas much of conscious human reasoning appears to operate at a symbolic level. Artificial neural networks perform well at many simple tasks, and provide good models of many human abilities. However, there are many tasks that they are not so good at, and other approaches seem more promising in those areas.
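The "simple processing device" view of a neuron described above can be sketched as a threshold unit: fire (output 1) when the weighted activity feeding in crosses a threshold. The weights and threshold below are chosen by hand, purely for illustration, to make the unit behave like logical AND:

```python
# A McCulloch-Pitts-style artificial neuron: weighted sum plus threshold.

def neuron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0   # fire or not fire

# Two inputs, both needed to reach the threshold: behaves like AND.
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, neuron(pair, weights=[1.0, 1.0], threshold=1.5))
# only (1, 1) fires
```

Connectionist systems get their power not from any single unit like this, but from large networks of them whose weights are learned rather than set by hand.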

Evolution One advantage humans have over current machines/computers is that they have a long evolutionary history. Charles Darwin (1809 – 1882) is famous for his work on evolution by natural selection. The idea is that fitter individuals will naturally tend to live longer and produce more children, and hence after many generations a population will automatically emerge with good innate properties. This has resulted in brains that have much structure, or even knowledge, built in at birth. This gives them an advantage over simple artificial neural network systems that have to learn everything. Computers are finally becoming powerful enough that we can simulate evolution and evolve good AI systems. We can now even evolve systems (e.g. neural networks) so that they are good at learning. A related field called genetic programming has had some success in evolving programs, rather than programming them by hand.

Sub-fields of Artificial Intelligence
Neural Networks – e.g. brain modelling, time series prediction, classification
Evolutionary Computation – e.g. genetic algorithms, genetic programming
Vision – e.g. object recognition, image understanding
Robotics – e.g. intelligent control, autonomous exploration
Expert Systems – e.g. decision support systems, teaching systems
Speech Processing – e.g. speech recognition and production
Natural Language Processing – e.g. machine translation
Planning – e.g. scheduling, game playing
Machine Learning – e.g. decision tree learning, version space learning

Speech Processing

As well as trying to understand human systems, there are also numerous real world applications: speech recognition for dictation systems and voice activated control; speech production for automated announcements and computer interfaces.

How do we get from sound waves to text streams and vice-versa?

Natural Language Processing For example, machine understanding and translation of simple sentences. Planning Planning refers to the process of choosing/computing the correct sequence of steps to solve a given problem. To do this we need some convenient representation of the problem domain. We can define states in some formal language, such as a subset of predicate logic, or a series of rules. A plan can then be seen as a sequence of operations that transforms the initial state into the goal state, i.e. the problem solution. Typically we will use some kind of search algorithm to find a good plan.

Common Techniques Even apparently radically different AI systems (such as rule-based expert systems and neural networks) share many common techniques. Four important ones are:

o Knowledge Representation: Knowledge needs to be represented somehow – perhaps as a series of if-then rules, as a frame-based system, as a semantic network, or in the connection weights of an artificial neural network.

o Learning: Automatically building up knowledge from the environment – such as acquiring the rules for a rule-based expert system, or determining the appropriate connection weights in an artificial neural network.

o Rule Systems: These could be explicitly built into an expert system by a knowledge engineer, or implicit in the connection weights learnt by a neural network.

o Search: This can take many forms – perhaps searching for a sequence of states that leads quickly to a problem solution, or searching for a good set of connection weights for a neural network by minimizing a fitness function.

AI and related fields Logical AI What a program knows about the world in general, the facts of the specific situation in which it must act, and its goals are all represented by sentences of some mathematical logical language. The program decides what to do by inferring that certain actions are appropriate for achieving its goals. Search AI programs often examine large numbers of possibilities, e.g. moves in a chess game or inferences by a theorem proving program. Discoveries are continually made about how to do this more efficiently in various domains. Pattern Recognition When a program makes observations of some kind, it is often programmed to compare what it sees with a pattern. For example, a vision program may try to match a pattern of eyes and a nose in a scene in order to find a face. More complex patterns, e.g. in a natural language text, in a chess position, or in the history of some event are also studied.

Representation Facts about the world have to be represented in some way. Usually languages of mathematical logic are used. Inference From some facts, others can be inferred. Mathematical logical deduction is adequate for some purposes, but new methods of non-monotonic inference have been added to logic since the 1970s. The simplest kind of non-monotonic reasoning is default reasoning, in which a conclusion is inferred by default, but the conclusion can be withdrawn if there is evidence to the contrary. For example, when we hear of a bird, we may infer that it can fly, but this conclusion can be reversed when we hear that it is a penguin. It is the possibility that a conclusion may have to be withdrawn that constitutes the non-monotonic character of the reasoning. Ordinary logical reasoning is monotonic in that the set of conclusions that can be drawn from a set of premises is a monotonically increasing function of the premises.
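The bird/penguin example can be sketched as a tiny default-reasoning procedure. The rule set and fact names here are invented for illustration; real non-monotonic logics are far richer:

```python
# A minimal sketch of default (non-monotonic) reasoning: the hypothetical
# rule "birds fly by default" yields a conclusion that is withdrawn when
# contrary evidence (being a penguin) is added to the set of facts.
def can_fly(facts):
    """Infer flight ability with one default rule and one exception."""
    if "penguin" in facts:          # contrary evidence defeats the default
        return False
    if "bird" in facts:             # default rule: birds fly
        return True
    return False

print(can_fly({"bird"}))             # True  (default conclusion)
print(can_fly({"bird", "penguin"}))  # False (conclusion withdrawn)
```

Adding the fact "penguin" shrinks the set of conclusions, which is exactly what monotonic deduction cannot do.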

Common sense knowledge and reasoning This is the area in which AI is farthest from human-level, in spite of the fact that it has been an active research area since the 1950s. While there has been considerable progress, e.g. in developing systems of non-monotonic reasoning and theories of action, more new ideas are still needed. Learning from experience Programs can learn from experience. The approaches to AI based on connectionism and neural nets specialize in this. There is also learning of laws expressed in logic. Programs can only learn what facts or behaviours their formalisms can represent, and unfortunately learning systems are almost all based on very limited abilities to represent information. Planning Planning programs start with general facts about the world (especially facts about the effects of actions), facts about the particular situation, and a statement of a goal. From these, they generate a strategy for achieving the goal. In the most common cases, the strategy is just a sequence of actions. Epistemology

This is a study of the kinds of knowledge that are required for solving problems in the world. Ontology Ontology is the study of the kinds of things that exist. In AI, the programs and sentences deal with various kinds of objects, and we study what these kinds are and what their basic properties are. Emphasis on ontology began in the 1990s. Heuristics A heuristic is a way of trying to discover something, or an idea embedded in a program. The term is used variously in AI. Heuristic functions are used in some approaches to search to measure how far a node in a search tree seems to be from a goal. Heuristic predicates that compare two nodes in a search tree to see if one is better than the other, i.e. constitutes an advance toward the goal, may be more useful.

Genetic Programming Genetic programming is a technique for getting programs to solve a task by mating random Lisp programs and selecting the fittest over millions of generations. Search and Control Strategies: Problem solving is an important aspect of Artificial Intelligence. A problem can be considered to consist of a goal and a set of actions that can be taken to lead to the goal. At any given time, we consider the state of the search space to represent where we have reached as a result of the actions we have applied so far. For example, consider the problem of looking for a contact lens on a football field. The initial state is how we start out, which is to say we know that the lens is somewhere on the field, but we don't know where. If we use the representation where we examine the field in units of one square foot, then our first action might be to examine the square in the top-left corner of the field. If we do not find the lens there, we could consider the state now to be that we have examined the top-left square and have not found the lens. After a number of actions, the state might be that we have examined 500 squares, and we have now just found the lens in the last square we examined. This is a goal state because it satisfies the goal that we had of finding a contact lens. Search is a method that can be used by computers to examine a problem space like this in order to find a goal. Often, we want to find the goal as quickly as possible or without using too many resources. A problem space can also be considered to be a search space

because in order to solve the problem, we will search the space for a goal state. We will continue to use the term search space to describe this concept. In this chapter, we will look at a number of methods for examining a search space. These methods are called search methods. The Importance of Search in AI It has already become clear that many of the tasks underlying AI can be phrased in terms of a search for the solution to the problem at hand. Many goal based agents are essentially problem solving agents which must decide what to do by searching for a sequence of actions that lead to their solutions. For production systems, we have seen the need to search for a sequence of rule applications that lead to the required fact or action. For neural network systems, we need to search for the set of connection weights that will result in the required input to output mapping. Which search algorithm one should use will generally depend on the problem domain. There are four important factors to consider: Completeness – Is a solution guaranteed to be found if at least one solution exists? Optimality – Is the solution found guaranteed to be the best (or lowest cost) solution if there exists more than one solution? Time Complexity – The upper bound on the time required to find a solution, as a function of the complexity of the problem. Space Complexity – The upper bound on the storage space (memory) required at any point during the search, as a function of the complexity of the problem.

Preliminary concepts Two varieties of space-for-time algorithms:
Input enhancement — preprocess the input (or part of it) to store some information to be used later in solving the problem
o Counting for sorting
o String searching algorithms
Prestructuring — preprocess the input to make accessing its elements easier
o Hashing
o Indexing schemes (e.g., B-trees)

State Space Representations: The state space is simply the space of all possible states, or configurations, that our system may be in. Generally, of course, we prefer to work with some convenient representation of that search space.

There are two components to the representation of state spaces: Static States

Transitions between States

State Space Graphs: If the number of possible states of the system is small enough, we can represent all of them, along with the transitions between them, in a state space graph, e.g.
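One common concrete representation of such a graph is a mapping from each state to its outgoing labeled transitions. The state names and labels below are made up purely for illustration:

```python
# A small state space graph as an adjacency structure: each state maps to
# its outgoing (transition_label, next_state) pairs. "S" is the initial
# state and "G" the goal state in this invented example.
state_space = {
    "S": [("a", "A"), ("b", "B")],
    "A": [("c", "G")],
    "B": [("d", "G")],
    "G": [],  # goal state: no outgoing transitions
}

def successors(state):
    """Return the transitions available from a given state."""
    return state_space[state]

print(successors("S"))  # [('a', 'A'), ('b', 'B')]
```

A route through the state space is then a sequence of labels, e.g. ("a", "c") takes S to G via A.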

Routes through State Space: Our general aim is to search for a route, or sequence of transitions, through the state space graph from our initial state to a goal state.

Sometimes there will be more than one possible goal state. We define a goal test to determine if a goal state has been achieved. The solution can be represented as a sequence of link labels (or transitions) on the state space graph. Note that the labels depend on the direction moved along the link.

Sometimes there may be more than one path to a goal state, and we may want to find the optimal (best possible) path. We can define link costs and path costs for measuring the cost of going along a particular path, e.g. the path cost may just equal the number of links, or could be the sum of individual link costs.

For most realistic problems, the state space graph will be too large for us to hold all of it explicitly in memory at any one time. Search Trees: It is helpful to think of the search process as building up a search tree of routes through the state space graph. The root of the search tree is the search node corresponding to the initial state. The leaf nodes correspond either to states that have not yet been expanded, or to states that generated no further nodes when expanded.

At each step, the search algorithm chooses a new unexpanded leaf node to expand. The different search strategies essentially correspond to the different algorithms one can use to select which is the next node to be expanded at each stage.

Examples of search problems Traveling Salesman Problem: Given n cities with known distances between each pair, find the shortest tour that passes through all the cities exactly once before returning to the starting city. A lower bound on the length l of any tour can be computed as follows: For each city i, 1 ≤ i ≤ n, find the sum si of the distances from city i to the two nearest cities.

Compute the sum s of these n numbers. Divide the result by 2 and round up the result to the nearest integer

lb = ⌈s / 2⌉

The lower bound for the graph shown in Fig 5.1 can be computed as follows:

lb = [(1 + 3) + (3 + 6) + (1 + 2) + (3 + 4) + (2 + 3)] / 2 = 14.
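The bound can be computed mechanically. The distance matrix below is an assumption, chosen so that the per-city "two nearest" sums reproduce the terms (1 + 3), (3 + 6), (1 + 2), (3 + 4), (2 + 3) used above; the actual Fig 5.1 graph may differ:

```python
import math

# Sketch of the TSP lower bound: for each city, sum the distances to its
# two nearest neighbours; halve the total and round up.
# The distances below are an assumption made for illustration.
dist = {
    ("a", "b"): 3, ("a", "c"): 1, ("a", "d"): 5, ("a", "e"): 8,
    ("b", "c"): 6, ("b", "d"): 7, ("b", "e"): 9,
    ("c", "d"): 4, ("c", "e"): 2,
    ("d", "e"): 3,
}

def d(i, j):
    """Symmetric lookup into the undirected distance table."""
    return dist[(i, j)] if (i, j) in dist else dist[(j, i)]

def tsp_lower_bound(cities):
    s = 0
    for i in cities:
        nearest = sorted(d(i, j) for j in cities if j != i)
        s += nearest[0] + nearest[1]   # two nearest neighbours of city i
    return math.ceil(s / 2)

print(tsp_lower_bound(["a", "b", "c", "d", "e"]))  # 14
```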

For any subset of tours that must include particular edges of a given graph, the lower bound can be modified accordingly. E.g.: For all the Hamiltonian circuits of the graph that must include edge (a, d), the lower bound can be computed as follows:

lb = [(1 + 5) + (3 + 6) + (1 + 2) + (3 + 5) + (2 + 3)] / 2 = 16. Applying the branch-and-bound algorithm, with the bounding function lb = ⌈s / 2⌉, to find the shortest Hamiltonian circuit for the given graph, we obtain the state-space tree as shown below. To reduce the amount of potential work, we take advantage of the following two observations: We can consider only tours that start with a. Since the graph is undirected, we can generate only tours in which b is visited before c. In addition, after visiting n – 1 cities, a tour has no choice but to visit the remaining unvisited city and return to the starting one, as shown in Fig 5.2.

Root node includes only the starting vertex a with a lower bound of lb = [(1 + 3) + (3 + 6) + (1 + 2) + (3 + 4) + (2 + 3)] / 2 = 14.

Node 1 represents the inclusion of edge (a, b)

lb = [(1 + 3) + (3 + 6) + (1 + 2) + (3 + 4) + (2 + 3)] / 2 = 14. Node 2 represents the inclusion of edge (a, c). Since b is not visited before c, this node is terminated. Node 3 represents the inclusion of edge (a, d), with lb = [(1 + 5) + (3 + 6) + (1 + 2) + (3 + 5) + (2 + 3)] / 2 = 16. Node 4 represents the inclusion of edge (a, e), with lb = [(1 + 8) + (3 + 6) + (1 + 2) + (3 + 4) + (2 + 8)] / 2 = 19. Among all four live nodes of the root, node 1 has the best lower bound. Hence we branch from node 1. Node 5 represents the inclusion of edge (b, c), with lb = [(1 + 3) + (3 + 6) + (1 + 6) + (3 + 4) + (2 + 3)] / 2 = 16. Node 6 represents the inclusion of edge (b, d), with lb = [(1 + 3) + (3 + 7) + (1 + 2) + (3 + 7) + (2 + 3)] / 2 = 16. Node 7 represents the inclusion of edge (b, e), with lb = [(1 + 3) + (3 + 9) + (1 + 2) + (3 + 4) + (2 + 9)] / 2 = 19.

Since nodes 5 and 6 both have the same lower bound, we branch out from each of them. Node 8 represents the inclusion of the edges (c, d), (d, e) and (e, a). Hence, the length of the tour is l = 3 + 6 + 4 + 3 + 8 = 24. Node 9 represents the inclusion of the edges (c, e), (e, d) and (d, a). Hence, the length of the tour is l = 3 + 6 + 2 + 3 + 5 = 19. Node 10 represents the inclusion of the edges (d, c), (c, e) and (e, a). Hence, the length of the tour is l = 3 + 7 + 4 + 2 + 8 = 24. Node 11 represents the inclusion of the edges (d, e), (e, c) and (c, a). Hence, the length of the tour is l = 3 + 7 + 3 + 2 + 1 = 16. Node 11 represents an optimal tour, since its tour length is better than or equal to those of the other live nodes 8, 9, 10, 3 and 4. The optimal tour is a → b → d → e → c → a, with a tour length of 16.

Uninformed or Blind search Breadth First Search (BFS): BFS expands the leaf node with the lowest path cost so far, and keeps going until a goal node is generated. If the path cost simply equals the number of links, we can implement this as a simple queue ("first in, first out").

1

This is guaranteed to find an optimal path to a goal state. It is memory intensive if the state space is large. If the typical branching factor is b, and the depth of the shallowest goal state is d – the space complexity is O(b^d), and the time complexity is O(b^d).
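BFS can also be sketched as a short runnable procedure over an explicit graph. The graph here is a made-up example, not the bridge-components problem:

```python
from collections import deque

# Runnable sketch of breadth-first search: a FIFO queue of paths, with a
# visited set to prune repeated states. The graph below is invented.
def breadth_first_search(graph, start, goal):
    """Return a shortest path (by number of links) from start to goal."""
    frontier = deque([[start]])           # queue of paths: first in, first out
    visited = {start}
    while frontier:
        path = frontier.popleft()         # expand the shallowest node
        state = path[-1]
        if state == goal:
            return path
        for succ in graph[state]:
            if succ not in visited:       # prune repeated states
                visited.add(succ)
                frontier.append(path + [succ])
    return None                           # no path exists

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["G"], "G": []}
print(breadth_first_search(graph, "A", "G"))  # ['A', 'B', 'D', 'G']
```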

BFS is an easy search technique to understand. The algorithm is presented below.

breadth_first_search ()
{
    store initial state in queue Q ;
    set state at the front of Q as current state ;
    while (goal state is not reached AND Q is not empty)
    {
        apply rule to generate a new state from the current state ;
        if (new state is goal state) quit ;
        else if (all states generated from current state are exhausted)
        {
            delete the current state from Q ;
            set front element of Q as the current state ;
        }
        else continue ;
    }
}

The algorithm is illustrated using the bridge components configuration problem. The initial state is PDFG, which is not a goal state; hence set it as the current state. Generate another state DPFG (by swapping 1st and 2nd position values) and add it to the list. That is not a goal state; hence, generate the next successor state, which is FDPG (by swapping 1st and 3rd position values). This is also not a goal state; hence add it to the list and generate the next successor state GDFP. Only three states can be generated from the initial state. Now the queue Q will have three elements in it, viz., DPFG, FDPG and GDFP. Now take DPFG (the first state in the list) as the current state and continue the process, until all the states generated from it are evaluated. Continue this process until the goal state DGPF is reached.

The 14th evaluation gives the goal state. It may be noted that all the states at one level in the tree are evaluated before the states in the next level are taken up; i.e., the evaluations are carried out breadth-wise. Hence, the search strategy is called breadth-first search.

Depth First Search (DFS): DFS expands the deepest leaf node (the one with the highest path cost so far), and keeps going until a goal node is generated. If the path cost simply equals the number of links, we can implement this as a simple stack ("last in, first out").

This is not guaranteed to find any path to a goal state. It is memory efficient even if the state space is large. If the typical branching factor is b, and the maximum depth of the tree is m – the space complexity is O(bm), and the time complexity is O(b^m).
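The depth-first strategy can likewise be sketched with an explicit stack, again on a made-up example graph:

```python
# Runnable sketch of depth-first search: a LIFO stack of paths, so the
# most recently generated (deepest) node is always expanded next.
def depth_first_search(graph, start, goal):
    frontier = [[start]]                  # stack of paths: last in, first out
    visited = set()
    while frontier:
        path = frontier.pop()             # expand the deepest node
        state = path[-1]
        if state == goal:
            return path
        if state in visited:
            continue
        visited.add(state)
        for succ in graph[state]:
            if succ not in visited:
                frontier.append(path + [succ])
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["G"], "G": []}
print(depth_first_search(graph, "A", "G"))  # ['A', 'C', 'D', 'G']
```

Note that DFS follows one route ("C" here, the last successor pushed) all the way down before considering alternatives, so the path found need not be the shortest.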

In DFS, instead of generating all the states below the current level, only the first state below the current level is generated and evaluated recursively. The search continues till a further successor cannot be generated. Then it goes back to the parent and explores the next successor. The algorithm is given below.

depth_first_search (current_state)
{
    if (current state is goal state) return ;
    while (a successor for current state exists)
    {
        generate a successor of the current state ;
        depth_first_search (successor) ;
        if (goal state is achieved) return ;
        else continue ;
    }
    return ;
}

Since DFS stores only the states in the current path, it uses much less memory during the search compared to BFS. The probability of arriving at a goal state with fewer evaluations is higher with DFS than with BFS. This is because, in BFS, all the states in a level have to be evaluated before states in the lower level are considered. DFS is very efficient when many acceptable solutions exist, so that the search can be terminated once the first acceptable solution is obtained. BFS is advantageous in cases where the tree is very deep. An ideal search mechanism is to combine the advantages of BFS and DFS. Depth Limited Search (DLS): DLS is a variation of DFS. If we put a limit l on how deep a depth first search can go, we can guarantee that the search will terminate (either in success or failure).

If there is at least one goal state at a depth less than l, this algorithm is guaranteed to find a goal state, but it is not guaranteed to find an optimal path. The space complexity is O(bl), and the time complexity is O(b^l). Depth First Iterative Deepening Search (DFIDS): DFIDS is a variation of DLS. If the lowest depth of a goal state is not known, we can always find the best limit l for DLS by trying all possible depths l = 0, 1, 2, 3, … in turn, and stopping once we have achieved a goal state. This appears wasteful because all the DLS runs for l less than the goal level are useless, and many states are expanded many times. However, in practice, most of the time is spent at the deepest part of the search tree, so the algorithm actually combines the benefits of DFS and BFS. Because all the nodes at each level are expanded, the algorithm is complete and optimal like BFS, but has the modest memory requirements of DFS. Exercise: if we had plenty of memory, could/should we avoid expanding the top level states many times? The space complexity is O(bd) as in DLS with l = d, which is better than BFS. The time complexity is O(b^d) as in BFS, which is better than DFS. Bi-Directional Search (BDS): The idea behind bi-directional search is to search simultaneously both forward from the initial state and backwards from the goal state, and stop when the two BFS searches meet in the middle.

This is not always going to be possible, but is likely to be feasible if the state transitions are reversible. The algorithm is complete and optimal, and since the two search depths are ~d/2, it has space complexity O(b^(d/2)) and time complexity O(b^(d/2)). However, if there is more than one possible goal state, this must be factored into the complexity.
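The depth-limited and iterative deepening procedures described above can be sketched as runnable code. The graph is a made-up example:

```python
# Sketch of depth-first iterative deepening: repeated depth-limited
# searches with limits l = 0, 1, 2, ... until a goal state is found.
def depth_limited_search(graph, path, goal, limit):
    state = path[-1]
    if state == goal:
        return path
    if limit == 0:
        return None                       # cut off the search at this depth
    for succ in graph[state]:
        if succ not in path:              # avoid cycles along the current path
            result = depth_limited_search(graph, path + [succ], goal, limit - 1)
            if result is not None:
                return result
    return None

def iterative_deepening_search(graph, start, goal, max_depth=20):
    for limit in range(max_depth + 1):    # l = 0, 1, 2, ...
        result = depth_limited_search(graph, [start], goal, limit)
        if result is not None:
            return result
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["G"], "G": []}
print(iterative_deepening_search(graph, "A", "G"))  # ['A', 'B', 'D', 'G']
```

Because the first limit at which the goal appears is its shallowest depth, the path returned is optimal in the number of links, like BFS, while each pass uses only DFS-sized memory.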

Repeated States: In the above discussion we have ignored an important complication that often arises in search processes – the possibility that we will waste time by expanding states that have already been expanded before somewhere else on the search tree. For some problems this possibility can never arise, because each state can only be reached in one way. For many problems, however, repeated states are unavoidable. This will include all problems where the transitions are reversible, e.g.

The search trees for these problems are infinite, but if we can prune out the repeated states, we can cut the search tree down to a finite size. We effectively only generate the portion of the search tree that matches the state space graph. Avoiding Repeated States: There are three principal approaches for dealing with repeated states: Ø Never return to the state you have just come from The node expansion function must be prevented from generating any node successor that is the same state as the node's parent. Ø Never create search paths with cycles in them The node expansion function must be prevented from generating any node successor that is the same state as any of the node's ancestors. Ø Never generate states that have already been generated before This requires that every state ever generated is remembered, potentially resulting in a space complexity of O(b^d).

Comparing the Uninformed Search Algorithms: We can now summarize the properties of our five uninformed search strategies:

Simple BFS and BDS are complete and optimal but expensive with respect to space and time. DFS requires much less memory if the maximum tree depth is limited, but has no guarantee of finding any solution, let alone an optimal one. DLS offers an improvement over DFS if we have some idea how deep the goal is. The best overall is DFIDS, which is complete, optimal and has low memory requirements, but still exponential time. Informed search Informed search uses some kind of evaluation function to tell us how far each expanded state is from a goal state, and/or some kind of heuristic function to help us decide which state is likely to be the best one to expand next. The hard part is to come up with good evaluation and/or heuristic functions. Often there is a natural evaluation function, such as distance in miles or the number of objects in the wrong position. Sometimes we can learn heuristic functions by analyzing what has worked well in similar previous searches. The simplest idea, known as greedy best first search, is to expand the node that is already closest to the goal, as that is most likely to lead quickly to a solution. This is like DFS in that it attempts to follow a single route to the goal, only attempting to try a different route

when it reaches a dead end. As with DFS, it is not complete, not optimal, and has time and space complexity of O(b^m). However, with good heuristics, the time complexity can be reduced substantially. Branch and Bound: An enhancement of backtracking. Applicable to optimization problems.

For each node (partial solution) of a state-space tree, branch and bound computes a bound on the value of the objective function for all descendants of the node (extensions of the partial solution). It uses the bound for: ruling out certain nodes as "nonpromising", to prune the tree – if a node's bound is not better than the best solution seen so far; and guiding the search through state-space. The search path at the current node in a state-space tree can be terminated for any one of the following three reasons: The value of the node's bound is not better than the value of the best solution seen so far. The node represents no feasible solutions because the constraints of the problem are already violated. The subset of feasible solutions represented by the node consists of a single point, and hence we compare the value of the objective function for this feasible solution with that of the best solution seen so far, and update the latter with the former if the new solution is better. Best-First branch-and-bound: A variation of backtracking. Among all the nonterminated leaves, called the live nodes, in the current tree, generate all the children of the most promising node, instead of generating a single child of the last promising node as is done in backtracking. Consider the node with the best bound as the most promising node. A* Search: Suppose that, for each node n in a search tree, an evaluation function f(n) is defined as the sum of the cost g(n) to reach that node from the start state, plus an estimated cost h(n) to get from that state to the goal state. Then f(n) is the estimated cost of the cheapest solution through n. A* search, which is the most popular form of best-first search, repeatedly picks the node with the lowest f(n) to expand next. It turns out that if the heuristic function h(n) satisfies certain conditions, then this strategy is both complete and optimal.

In particular, if h(n) is an admissible heuristic, i.e. is always optimistic and never overestimates the cost to reach the goal, then A* is optimal.

The classic example is finding the route by road between two cities given the straight line distances from each road intersection to the goal city. In this case, the nodes are the intersections, and we can simply use the straight line distances as h(n).
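The f(n) = g(n) + h(n) scheme can be sketched as follows. The graph, edge costs, and heuristic values below are invented for illustration, with h chosen to be admissible (it never overestimates the remaining cost):

```python
import heapq

# Minimal A* sketch: expand the node with the lowest f(n) = g(n) + h(n).
# The graph maps each state to (successor, edge_cost) pairs; h gives the
# heuristic estimate to the goal. All values here are made up.
def a_star(graph, h, start, goal):
    frontier = [(h[start], 0, start, [start])]   # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, g
        for succ, cost in graph[state]:
            g2 = g + cost
            if g2 < best_g.get(succ, float("inf")):   # found a cheaper route
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + h[succ], g2, succ, path + [succ]))
    return None, float("inf")

graph = {
    "S": [("A", 1), ("B", 4)],
    "A": [("B", 2), ("G", 12)],
    "B": [("G", 5)],
    "G": [],
}
h = {"S": 7, "A": 6, "B": 4, "G": 0}   # admissible: true costs are 8, 7, 5, 0
path, cost = a_star(graph, h, "S", "G")
print(path, cost)  # ['S', 'A', 'B', 'G'] 8
```

Greedy best first search would use h alone; including g is what makes A* optimal under an admissible heuristic.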

Hill Climbing / Gradient Descent: The basic idea of hill climbing is simple: at each current state we select a transition, evaluate the resulting state, and if the resulting state is an improvement we move there, otherwise we try a new transition from where we are. We repeat this until we reach a goal state, or have no more transitions to try. The transitions explored can be selected at random, or according to some problem specific heuristics. In some cases, it is possible to define evaluation functions such that we can compute the gradients with respect to the possible transitions, and thus compute which transition direction to take to produce the best improvement in the evaluation function. Following the evaluation gradients in this way is known as gradient descent. In neural networks, for example, we can define the total error of the output activations as a function of the connection weights, and compute the gradients of how the error changes as we change the weights. By changing the weights in small steps against

those gradients, we systematically minimize the network’s output errors.
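Hill climbing on a simple one-dimensional evaluation function can be sketched as below; the function f and step size are invented for illustration:

```python
import random

# Minimal hill-climbing sketch: try a small random step and keep it only
# if the evaluation improves; otherwise stay put and try another step.
# The evaluation function here is the made-up f(x) = -(x - 3)**2,
# maximised at x = 3.
def hill_climb(f, x, step=0.1, max_tries=10000, seed=0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = x + rng.choice([-step, step])
        if f(candidate) > f(x):           # move only if the new state is better
            x = candidate
    return x

f = lambda x: -(x - 3) ** 2
best = hill_climb(f, 0.0)
print(round(best, 1))  # 3.0
```

On this smooth single-peak function hill climbing finds the maximum; on functions with many local optima it can get stuck, which is what motivates the metaheuristics discussed elsewhere in this guide.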

Searching And-Or graphs The DFS and BFS strategies for Or trees and graphs can be adapted for And-Or trees. The main difference lies in the way termination conditions are determined, since all goals following an And node must be realized, whereas a single goal node following an Or node will do. A more general optimal strategy is the AO* (O for ordered) algorithm. As in the case of the A* algorithm, we use the open list to hold nodes that have been generated but not expanded, and the closed list to hold nodes that have been expanded. The algorithm is a variation of the original given by Nilsson. It requires that nodes traversed in the tree be labeled as solved or unsolved in the solution process, to account for And node solutions which require solutions to all successor nodes. A solution is found when the start node is labeled as solved.

The AO* algorithm
Step 1: Place the start node s on open.
Step 2: Using the search tree constructed thus far, compute the most promising solution tree T0.
Step 3: Select a node n that is both on open and a part of T0. Remove n from open and place it on closed.
Step 4: If n is a terminal goal node, label n as solved. If the solution of n results in any of n's ancestors being solved, label all the ancestors as solved. If the start node s is solved, exit with success, where T0 is the solution tree. Remove from open all nodes with a solved ancestor.
Step 5: If n is not a solvable node, label n as unsolvable. If the start node is labeled as unsolvable, exit with failure. If any of n's ancestors become unsolvable because n is, label them unsolvable as well. Remove from open all nodes with unsolvable ancestors.
Step 6: Otherwise, expand node n, generating all of its successors. For each such successor node that contains more than one subproblem, generate their successors to give individual subproblems. Attach to each newly generated node a back pointer to its predecessor. Compute the cost estimate h* for each newly generated node and place all such nodes that do not yet have descendants on open. Next, recompute the values of h* at n and at each ancestor of n.
Step 7: Return to Step 2.
It can be shown that AO* will always find a minimum-cost solution tree if one exists, provided only that h*(n) ≤ h(n) and all arc costs are positive. Efficiency depends on how closely h* approximates h.

Constraint Satisfaction Search

Search can be used to solve problems that are limited by constraints, such as the eight-queens problem. Such problems are often known as Constraint Satisfaction Problems, or CSPs.

In this problem, eight queens must be placed on a chess board in such a way that no two queens are on the same diagonal, row, or column. If we use traditional chess board notation, we mark the columns with letters from a to h and the rows with numbers from 1 to 8. So, a square can be referred to by a letter and a number, such as a4 or g7. This kind of problem is known as a constraint satisfaction problem (CSP) because a solution must be found that satisfies the constraints. In the case of the eight-queens problem, a search tree can be built that represents the possible positions of queens on the board. One way to represent this is to have a tree that is 8-ply deep, with a branching factor of 64 for the first level, 63 for the next level, and so on, down to 57 for the eighth level. A goal node in this tree is one that satisfies the constraints that no two queens can be on the same diagonal, row, or column. An extremely simplistic approach to solving this problem would be to analyze every possible configuration until one was found that matched the constraints. A more suitable approach to solving the eight-queens problem would be to use depth-first search on a search tree that represents the problem in the following manner:

The first branch from the root node would represent the first choice of a square for a queen. The next branch from these nodes would represent choices of where to place the second queen. The first level would have a branching factor of 64 because there are 64 possible squares on which to place the first queen. The next level would have a somewhat lower branching factor because once a queen has been placed, the constraints can be used to determine possible squares upon which the next queen can be placed. The branching factor will decrease as the algorithm searches down the tree. At some point, the tree will terminate because the path being followed will lead to a position where no more queens can be placed on legal squares on the board, and there are still some queens remaining.

In fact, because each row and each column must contain exactly one queen, the branching factor can be significantly reduced by assuming that the first queen must be placed in row 1, the second in row 2, and so on. In this way, the first level will have a branching factor of 8 (a choice of eight squares on which the first queen can be placed), the next 7, the next 6, and so on. The search tree can be further simplified as each queen placed on the board “uses up” a diagonal, meaning that the branching factor is only 5 or 6 after the first choice has been made, depending on whether the first queen is placed on an edge of the board (columns a or h) or not. The next level will have a branching factor of about 4, and the next may have a branching factor of just 2, as shown in Fig 6.1. The arrows in Fig 6.1 show the squares to which each queen can move. Note that no queen can move to a square that is already occupied by another queen.

In Fig 6.1, the first queen was placed in column a of row 8, leaving six choices for the next row. The second queen was placed in column d of row 7, leaving four choices for row 6. The third queen was placed in column f in row 6, leaving just two choices (column c or column h) for row 5.

Using knowledge like this about the problem that is being solved can help to significantly reduce the size of the search tree and thus improve the efficiency of the search solution. A solution will be found when the algorithm reaches depth 8 and successfully places the final queen on a legal square on the board. A goal node would be a path containing eight squares such that no two squares shared a diagonal, row, or column. One solution to the eight-queens problem is shown in Fig 6.2. Note that in this solution, if we start by placing queens on squares e8, c7, h6, and then d5, once the fourth queen has been placed, there are only two choices for placing the fifth queen (b4 or g4). If b4 is chosen, then this leaves no squares that could be chosen for the final three queens to satisfy the constraints. If g4 is chosen for the fifth queen, as has been done in Fig 6.2, only one square is available for the sixth queen (a3), and the final two choices are similarly constrained. So, it can be seen that by applying the constraints appropriately, the search tree can be significantly reduced for this problem. Using chronological backtracking in solving the eight-queens problem might not be the most efficient way to identify a solution because it will backtrack over moves that did not necessarily directly lead to an error, as well as ones that did. In this case, non-chronological backtracking, or dependency-directed backtracking, could be more useful because it could identify the steps earlier in the search tree that caused the problem further down the tree.
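The depth-first, row-by-row search with chronological backtracking described above can be sketched in Python. The representation (one queen per row, so only a column must be chosen at each level) follows the discussion; the function name is ours:

```python
# A sketch of depth-first search with chronological backtracking for
# n-queens, using the row-per-queen representation described above:
# queen i always goes in row i, so only a column must be chosen for it.
def count_solutions(n, placed=()):
    if len(placed) == n:
        return 1                       # all n queens legally placed
    row = len(placed)
    total = 0
    for col in range(n):
        # the new queen must not share a column or diagonal with any
        # queen already placed (rows are distinct by construction)
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(placed)):
            total += count_solutions(n, placed + (col,))
    return total

print(count_solutions(8))  # → 92
```

Counting all solutions rather than stopping at the first shows how far the constraints prune the tree: only 92 of the vast number of raw configurations survive.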

Forward Checking

In fact, backtracking can be augmented in solving problems like the eight-queens problem by using a method called forward checking. As each queen is placed on the board, a forward-checking mechanism is used to delete from the set of possible future choices any that have been rendered impossible by placing the queen on that square. For example, if a queen is placed on square a1, forward checking will remove all squares in row 1, all squares in column a, and also squares b2, c3, d4, e5, f6, g7, and h8. In this way, if placing a queen on the board results in removing all remaining squares, the system can immediately backtrack, without having to attempt to place any more queens. This can often significantly improve the performance of solutions for CSPs such as the eight-queens problem.
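The deletion step just described can be sketched as a small Python function; squares here are (column, row) pairs with 0-based indices, so a1 is (0, 0), and the helper name is ours:

```python
# A sketch of the forward-checking mechanism described above. Placing a
# queen deletes every attacked square from the set of future choices.
def remove_attacked(free_squares, queen):
    qc, qr = queen
    return {(c, r) for (c, r) in free_squares
            if not (c == qc or r == qr or abs(c - qc) == abs(r - qr))}

free = {(c, r) for c in range(8) for r in range(8)}   # all 64 squares
free = remove_attacked(free, (0, 0))                  # place a queen on a1
print(len(free))  # → 42 (row 1, column a, and the a1-h8 diagonal removed)
```

If a placement ever leaves the free set empty while queens remain, the search can backtrack immediately.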

Most-Constrained Variables

A further improvement in performance can be achieved by using the most-constrained variable heuristic. At each stage of the search, this heuristic involves working with the variable that has the least possible number of valid choices.

In the case of the eight-queens problem, this might be achieved by considering the problem to be one of assigning a value to eight variables, a through h. Assigning value 1 to variable a means placing a queen in square a1. To use the most-constrained variable heuristic with this representation means that at each move we assign a value to the variable that has the least choices available to it. Hence, after assigning a = 1, b = 3, and c = 5, this leaves three choices for d, three choices for e, one choice for f, three choices for g, and four choices for h. Hence, our next move is to place a queen in column f. This heuristic is perhaps more clearly understood in relation to the map-coloring problem. It makes sense that, in a situation where a particular country can be given only one color due to the colors that have been assigned to its neighbors, that country be colored next. The most-constraining variable heuristic is similar in that it involves assigning a value next to the variable that places the greatest number of constraints on future variables.
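The counting argument above can be checked mechanically. This sketch (the helper name is ours) computes, for each unassigned column, how many rows remain legal after a = 1, b = 3, c = 5, and then picks the most constrained variable:

```python
# A small check of the most-constrained variable heuristic for the
# eight-queens representation above: variables are columns 1..8, values
# are rows 1..8, and a queen attacks its row and both diagonals.
def free_rows(placed, col):
    """Rows in 1..8 still legal for a queen in the given column."""
    rows = set(range(1, 9))
    for c, r in placed:
        rows.discard(r)                   # same row
        rows.discard(r + abs(col - c))    # one diagonal
        rows.discard(r - abs(col - c))    # other diagonal
    return rows

placed = [(1, 1), (2, 3), (3, 5)]         # a = 1, b = 3, c = 5
choices = {col: len(free_rows(placed, col)) for col in range(4, 9)}
print(choices)             # → {4: 3, 5: 3, 6: 1, 7: 3, 8: 4}
print(min(choices, key=choices.get))      # → 6, i.e. column f
```

Column f (6) has only one legal row left, so the heuristic assigns it next.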

The least-constraining value heuristic is perhaps more intuitive than the two already presented in this section. This heuristic involves assigning a value to a variable that leaves the greatest number of choices for other variables. This heuristic can be used to make n-queens problems with extremely large values of n quite solvable.

Example: Cryptographic Problems

The constraint satisfaction procedure is also a useful way to solve problems such as cryptographic problems. For example:

  FORTY
+   TEN
+   TEN
  -----
  SIXTY

Solution:

  29786
+   850
+   850
  -----
  31486

This cryptographic problem can be solved by using a Generate and Test method, applying the following constraints:

Each letter represents exactly one number.
No two letters represent the same number.

Generate and Test is a brute-force method, which in this case involves cycling through all possible assignments of numbers to letters until a set is found that meets the constraints and solves the problem. Without using constraints, the method would first start by attempting to assign 0 to all letters, resulting in the following sum:

  00000
+   000
+   000
  -----
  00000

Although this may appear to be a valid solution to the problem, it does not meet the constraints laid down that specify that each letter can be assigned only one number, and each number can be assigned only to one letter. Hence, constraints are necessary simply to find the correct solution to the problem. They also enable us to reduce the size of the search tree. In this case, for example, it is not necessary to examine possible solutions where two letters have been assigned the same number, which dramatically reduces the possible solutions to be examined.
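The Generate and Test procedure just described can be sketched directly in Python. The function name is ours; distinctness of digits comes for free from using permutations, and the leading-letter checks encode the usual no-leading-zero convention:

```python
from itertools import permutations

# A brute-force Generate and Test sketch for FORTY + TEN + TEN = SIXTY.
# Each of the ten letters F,O,R,T,Y,E,N,S,I,X gets a distinct digit.
def solve_forty_ten_ten():
    for f, o, r, t, y, e, n, s, i, x in permutations(range(10)):
        if f == 0 or t == 0 or s == 0:          # no leading zeros
            continue
        forty = (((f * 10 + o) * 10 + r) * 10 + t) * 10 + y
        ten = (t * 10 + e) * 10 + n
        sixty = (((s * 10 + i) * 10 + x) * 10 + t) * 10 + y
        if forty + ten + ten == sixty:
            return forty, ten, sixty
    return None

print(solve_forty_ten_ten())
```

Even with the distinctness constraint built in, this still examines a large share of the 10! candidate assignments, which is exactly why the heuristics discussed in this chapter matter for bigger problems.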

Heuristic Repair

Heuristics can be used to improve performance of solutions to constraint satisfaction problems. One way to do this is to use a heuristic repair method, which involves generating a possible solution (randomly, or using a heuristic to generate a position that is close to a solution) and then making changes that reduce the distance of the state from the goal. In the case of the eight-queens problem, this could be done using the min-conflicts heuristic. To move from one state to another state that is likely to be closer to a solution using the min-conflicts heuristic, select one queen that conflicts with another queen (in other words, it is on the same row, column, or diagonal as another queen).

Now move that queen to a square where it conflicts with as few queens as possible. Continue with another queen. To see how this method would work, consider the starting position shown in Fig 6.3.


This starting position has been generated by placing the queens such that there are no conflicts on rows or columns. The only conflict here is that the queen in column c (on c7) is on a diagonal with the queen in column h (on h2). To move toward a solution, we choose to move the queen that is on column h. We will only ever apply a move that keeps a queen on the same column because we already know that we need to have one queen on each column. Each square in column h has been marked with a number to show how many other queens that square conflicts with. Our first move will be to move the queen on column h up to row 6, where it will conflict only with one queen. Then we arrive at the position shown in the figure below. Because we have created a new conflict with the queen on row 6 (on f6), our next move must be to move this queen. In fact, we can move it to a square where it has zero conflicts. This means the problem has been solved, and there are no remaining conflicts. This method can be used not only to solve the eight-queens problem but has also been successfully applied to the n-queens problem for extremely large values of n. It has been shown that, using this method, the 1,000,000-queens problem can be solved in an average of around 50 steps. Solving the 1,000,000-queens problem using traditional search techniques would be impossible because it would involve searching a tree with a branching factor of 10^12.
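The min-conflicts repair loop described above can be sketched as follows; the representation (one queen per column, moved only within its column) follows the discussion, while the function names and step limit are ours:

```python
import random

# A sketch of the min-conflicts heuristic repair method for n-queens.
# State: queens[col] = row of the queen in column col.
def conflicts(queens, col, row):
    """How many other queens attack square (col, row)."""
    return sum(1 for c, r in enumerate(queens)
               if c != col and (r == row or abs(r - row) == abs(c - col)))

def min_conflicts(n, max_steps=100000):
    queens = [random.randrange(n) for _ in range(n)]   # random start state
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(queens, c, queens[c])]
        if not conflicted:
            return queens                  # no conflicts remain: solved
        col = random.choice(conflicted)    # pick one conflicting queen
        # move it, within its column, to a row with as few conflicts as possible
        best = min(conflicts(queens, col, r) for r in range(n))
        queens[col] = random.choice(
            [r for r in range(n) if conflicts(queens, col, r) == best])
    return None

print(min_conflicts(8))
```

Breaking ties at random keeps the repair loop from cycling between the same few states, which is what makes the method scale to very large n.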

Local Search and Metaheuristics

Local search methods work by starting from some initial configuration (usually random) and making small changes to the configuration until a state is reached from which no better state can be achieved. Hill climbing is a good example of a local search technique. Local search techniques, used in this way, suffer from the same problems as hill climbing and, in particular, are prone to finding local maxima that are not the best solution possible. The methods used by local search techniques are known as metaheuristics. Examples of metaheuristics include simulated annealing, tabu search, genetic algorithms, ant colony optimization, and neural networks. This kind of search method is also known as local optimization because it is attempting to optimize a set of values but will often find local maxima rather than a global maximum. A local search technique applied to the problem of allocating teachers to classrooms would start from a random position and make small changes until a configuration was reached where no inappropriate allocations were made.

Exchanging Heuristics

The simplest form of local search is to use an exchanging heuristic.

An exchanging heuristic moves from one state to another by exchanging one or more variables by giving them different values. We saw this in solving the eight-queens problem as heuristic repair. A k-exchange is considered to be a method where k variables have their values changed at each step. The heuristic repair method we applied to the eight-queens problem was 2-exchange. A k-exchange can be used to solve the traveling salesman problem. A tour (a route through the cities that visits each city once, and returns to the start) is generated at random. Then, if we use 2-exchange, we remove two edges from the tour and substitute them for two other edges. If this produces a valid tour that is shorter than the previous one, we move on from here. Otherwise, we go back to the previous tour and try a different set of substitutions.

In fact, using k = 2 does not work well for the traveling salesman problem, whereas using k = 3 produces good results. Using larger numbers of k will give better and better results but will also require more and more iterations. Using k = 3 gives reasonable results and can be implemented efficiently. It does, of course, risk finding local maxima, as is often the case with local search methods.
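The 2-exchange move for the traveling salesman problem can be sketched as follows (this is the common 2-opt formulation, where removing two edges and reconnecting amounts to reversing the segment between them; the distance matrix and function names are illustrative):

```python
import random

# A sketch of 2-exchange (2-opt) local search for the traveling
# salesman problem: remove two edges, reconnect by reversing the
# segment between them, keep the change only if the tour shortens.
def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def two_opt(dist, seed=0):
    random.seed(seed)
    tour = list(range(len(dist)))
    random.shuffle(tour)                 # start from a random tour
    improved = True
    while improved:
        improved = False
        for i in range(len(tour) - 1):
            for j in range(i + 2, len(tour)):
                # reversing tour[i+1..j] exchanges two edges for two others
                candidate = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour

# Illustrative distance matrix for four cities arranged in a square
dist = [[0, 1, 2, 1],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [1, 2, 1, 0]]
tour = two_opt(dist)
print(tour_length(tour, dist))  # → 4 (the optimal tour round the square)
```

The loop stops at a local optimum: no single 2-exchange improves the tour, which is exactly the local-maxima risk the text describes.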

Iterated Local Search

Iterated local search techniques attempt to overcome the problem of local maxima by running the optimization procedure repeatedly, from different initial states. If used with sufficient iterations, this kind of method will almost always find a global maximum. The aim, of course, in running methods like this is to provide a very good solution without needing to exhaustively search the entire problem space. In problems such as the traveling salesman problem, where the search space grows extremely quickly as the number of cities increases, results can be generated that are good enough (i.e., a local maximum) without using many iterations, where a perfect solution would be impossible to find (or at least it would be impossible to guarantee a perfect solution; even one iteration of local search may happen upon the global maximum).

Tabu Search

Tabu search is a metaheuristic that uses a list of states that have already been visited to attempt to avoid repeating paths. The tabu search metaheuristic is used in combination with another heuristic and operates on the principle that it is worth going down a path that appears to be poor if it avoids following a path that has already been visited.
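The principle just described can be sketched with a toy minimisation; the objective f(x) = (x - 3)^2, the unit step moves, and the list length are illustrative assumptions, not from the text:

```python
from collections import deque

# A sketch of tabu search: greedy descent on f(x) = (x - 3)^2 over the
# integers, with a fixed-length tabu list of recently visited states.
# When the best neighbour is tabu, the search takes an apparently poor
# step rather than revisit a recent state.
def tabu_search(start=20, tabu_size=5, iterations=50):
    f = lambda x: (x - 3) ** 2
    current = best = start
    tabu = deque([start], maxlen=tabu_size)    # recently visited states
    for _ in range(iterations):
        neighbours = [n for n in (current - 1, current + 1) if n not in tabu]
        if not neighbours:
            break
        current = min(neighbours, key=f)       # best non-tabu neighbour
        tabu.append(current)
        if f(current) < f(best):
            best = current                     # remember the best state seen
    return best

print(tabu_search())  # → 3
```

Note that after reaching the minimum the search is pushed onward through worse states by the tabu list; keeping a separate record of the best state seen is what preserves the answer.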

In this way, tabu search is able to avoid local maxima.

Simulated Annealing

Annealing is a process of producing very strong glass or metal, which involves heating the material to a very high temperature and then allowing it to cool very slowly. In this way, the atoms are able to form the most stable structures, giving the material great strength. Simulated annealing is a local search metaheuristic based on this method and is an extension of a process called Metropolis Monte Carlo simulation. Simulated annealing is applied to a multi-value combinatorial problem where values need to be chosen for many variables to produce a particular value for some global function, dependent on all the variables in the system. This value is thought of as the energy of the system, and in general the aim of simulated annealing is to find a minimum energy for a system. Simple Monte Carlo simulation is a method of learning information about the shape of a search space. The process involves randomly selecting points within the search space. An example of its use is as follows: A square is partially contained within a circle. Simple Monte Carlo simulation can be used to identify what proportion of the square is within the circle and what proportion is outside the circle. This is done by randomly sampling points within the square and checking which ones are within the circle and which are not. Metropolis Monte Carlo simulation extends this simple method as follows: Rather than selecting new states from the search space at random, a new state is chosen by making a small change to the current state.

If the new state means that the system as a whole has a lower energy than it did in the previous state, then it is accepted. If the energy is higher than for the previous state, then a probability is applied to determine whether the new state is accepted or not. This probability is called a Boltzmann acceptance criterion and is calculated as follows:

e^(−dE/T)

where T is the current temperature of the system, and dE is the increase in energy that has been produced by moving from the previous state to the new state. The temperature in this context refers to the percentage of steps that can be taken that lead to a rise in energy: At a higher temperature, more steps will be accepted that lead to a rise in energy than at a low temperature. To determine whether to move to a higher energy state or not, the probability e^(−dE/T) is calculated, and a random number is generated between 0 and 1. If this random number is lower than the probability function, the new state is accepted. In cases where the increase in energy is very high, or the temperature is very low, very few states will be accepted that involve an increase in energy, as e^(−dE/T) approaches zero. The fact that some steps are allowed that increase the energy of the system enables the process to escape from local minima, which means that simulated annealing often can be an extremely powerful method for solving complex problems with many local maxima. Some systems use e^(−dE/kT) as the probability that the search will progress to a state with a higher energy, where k is Boltzmann's constant (approximately 1.3807 × 10^−23 Joules per Kelvin). Simulated annealing uses Monte Carlo simulation to identify the most stable state (the state with the lowest energy) for a system. This is done by running successive iterations of Metropolis Monte Carlo simulation, using progressively lower temperatures. Hence, in successive iterations, fewer and fewer steps are allowed that lead to an overall increase in energy for the system.
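The simple Monte Carlo sampling mentioned earlier (estimating what proportion of a square lies inside a circle) can be sketched as follows; the quarter-circle setup, sample count, and function name are illustrative assumptions:

```python
import random

# A sketch of simple Monte Carlo simulation: estimate what proportion
# of the unit square lies inside a quarter circle of radius 1 centred
# at the origin. The true proportion is pi/4, roughly 0.785.
def proportion_inside(samples=100000, seed=1):
    random.seed(seed)
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()   # random point in the square
        if x * x + y * y <= 1.0:                  # point falls within the circle
            inside += 1
    return inside / samples

print(proportion_inside())
```

The estimate converges on the true proportion as the number of random samples grows, which is all the method needs to know about the shape of the space.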

A cooling schedule (or annealing schedule) is applied, which determines the manner in which the temperature will be lowered for successive iterations. Two popular cooling schedules are as follows:

Tnew = Told − dT
Tnew = C × Told (where C < 1.0)
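Putting the pieces together, here is a minimal simulated annealing sketch using the geometric cooling schedule Tnew = C × Told; the toy energy function f(x) = x^2, the starting state, and all constants are illustrative assumptions:

```python
import math
import random

# A sketch of simulated annealing: minimise the "energy" f(x) = x^2
# over integer states by small random moves, cooling geometrically.
def simulated_annealing(start=40, t0=10.0, cooling=0.95,
                        steps_per_t=100, seed=2):
    random.seed(seed)
    energy = lambda x: x * x
    state, t = start, t0
    while t > 0.01:                        # stop when the system is nearly frozen
        for _ in range(steps_per_t):
            new = state + random.choice([-1, 1])   # small change to current state
            d_e = energy(new) - energy(state)
            # Boltzmann acceptance criterion: always accept downhill moves,
            # accept uphill moves with probability e^(-dE/T)
            if d_e <= 0 or random.random() < math.exp(-d_e / t):
                state = new
        t *= cooling                       # geometric cooling schedule
    return state

print(simulated_annealing())
```

At high temperature the walk accepts many uphill moves and roams widely; as T shrinks, uphill acceptance vanishes and the state settles into a minimum.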

The cooling schedule is extremely important, as is the choice of the number of steps of Metropolis Monte Carlo simulation that are applied in each iteration. These help to determine whether the system will be trapped by local minima (known as quenching). The number of times the Metropolis Monte Carlo simulation is applied per iteration is usually increased for later iterations. Also important in determining the success of simulated annealing are the choice of the initial temperature of the system and the amount by which the temperature is decreased for each iteration. These values need to be chosen carefully according to the nature of the problem being solved. When the temperature, T, has reached zero, the system is frozen, and if the simulated annealing process has been successful, it will have identified a minimum for the total energy of the system. Simulated annealing has a number of practical applications in solving problems with large numbers of interdependent variables, such as circuit design. It has also been successfully applied to the traveling salesman problem.

Uses of Simulated Annealing

Simulated annealing was invented in 1983 by Kirkpatrick, Gelatt, and Vecchi.

It was first used for placing VLSI components on a circuit board. Simulated annealing has also been used to solve the traveling salesman problem, although this approach has proved to be less efficient than using heuristic methods that know more about the problem. It has been used much more successfully in scheduling problems and other large combinatorial problems where values need to be assigned to a large number of variables to maximize (or minimize) some function of those variables.

Real-Time A*

Real-time A* is a variation of A*. Search continues on the basis of choosing paths that have minimum values of f(node) = g(node) + h(node). However, g(node) is the distance of the node from the current node, rather than from the root node.

Hence, the algorithm will backtrack if the cost of doing so plus the estimated cost of solving the problem from the new node is less than the estimated cost of solving the problem from the current node.

Implementing real-time A* means maintaining a hash table of previously visited states with their h(node) values.

Iterative-Deepening A* (IDA*)

By combining iterative-deepening with A*, we produce an algorithm that is optimal and complete (like A*) and that has the low memory requirements of depth-first search. IDA* is a form of iterative-deepening search where successive iterations impose a greater limit on f(node) rather than on the depth of a node. IDA* performs well in problems where the heuristic value f(node) has relatively few possible values. For example, using the Manhattan distance as a heuristic in solving the eight-queens problem, the value of f(node) can only have values 1, 2, 3, or 4. In this case, the IDA* algorithm only needs to run through a maximum of four iterations, and it has a time complexity not dissimilar from that of A*, but with a significantly improved space complexity because it is effectively running depth-first search. In cases such as the traveling salesman problem where the value of f(node) is different for every state, the IDA* method has to expand 1 + 2 + 3 + . . . + n = O(n^2) nodes, where A* would expand n nodes.

Propositional and Predicate Logic

Logic is concerned with reasoning and the validity of arguments. In general, in logic, we are not concerned with the truth of statements, but rather with their validity. That is to say, although the following argument is clearly logical, it is not something that we would consider to be true:

All lemons are blue
Mary is a lemon
Therefore, Mary is blue

This set of statements is considered to be valid because the conclusion (Mary is blue) follows logically from the other two statements, which we often call the premises. The reason that validity and truth can be separated in this way is simple: a piece of reasoning is considered

to be valid if its conclusion is true in cases where its premises are also true. Hence, a valid set of statements such as the ones above can give a false conclusion, provided one or more of the premises are also false. We can say: a piece of reasoning is valid if it leads to a true conclusion in every situation where the premises are true. Logic is concerned with truth values. The possible truth values are true and false. These can be considered to be the fundamental units of logic, and almost all logic is ultimately concerned with these truth values.

Logic is widely used in computer science, and particularly in Artificial Intelligence. Logic is widely used as a representational method for Artificial Intelligence. Unlike some other representations, logic allows us to easily reason about negatives (such as, “this book is not red”) and disjunctions (“or”—such as, “He’s either a soldier or a sailor”). Logic is also often used as a representational method for communicating concepts and theories within the Artificial Intelligence community. In addition, logic is used to represent language in systems that are able to understand and analyze human language.

As we will see, one of the main weaknesses of traditional logic is its inability to deal with uncertainty. Logical statements must be expressed in terms of truth or falsehood—it is not possible to reason, in classical logic, about possibilities. We will see different versions of logic such as modal logics that provide some ability to reason about possibilities, and also probabilistic methods and fuzzy logic that provide much more rigorous ways to reason in uncertain situations.

Logical Operators

In reasoning about truth values, we need to use a number of operators, which can be applied to truth values. We are familiar with several of these operators from everyday language:

I like apples and oranges.
You can have an ice cream or a cake.

If you come from France, then you speak French.

Here we see the four most basic logical operators being used in everyday language. The operators are: and, or, not, and if . . . then . . . (usually called implies). One important point to note is that or is slightly different from the way we usually use it. In the sentence, “You can have an ice cream or a cake,” the mother is usually suggesting to her child that he can only have one of the items, but not both. This is referred to as an exclusive-or in logic because the case where both are allowed is excluded. The version of or that is used in logic is called inclusive-or and allows the case with both options. The operators are usually written using the following symbols, although other symbols are sometimes used, according to the context:

and ∧
or ∨
not ¬
implies →
iff ↔

Iff is an abbreviation that is commonly used to mean “if and only if.” We see later that this is a stronger form of implies that holds true if one thing implies another, and also the second thing implies the first. For example, “you can have an ice cream if and only if you eat your dinner.” It may not be immediately apparent why this is different from “you can have an ice cream if you eat your dinner.” This is because most mothers really mean iff when they use if in this way.
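The five operators can be sketched as Python functions over truth values (the function names are ours; `land`, `lor`, and `lnot` stand in for the reserved words `and`, `or`, and `not`):

```python
# The basic logical operators as Python functions. Note that implies is
# false only when the antecedent is true and the consequent false, and
# iff is implication in both directions.
def land(a, b):     # and, ∧
    return a and b

def lor(a, b):      # inclusive-or, ∨
    return a or b

def lnot(a):        # not, ¬
    return not a

def implies(a, b):  # →
    return (not a) or b

def iff(a, b):      # ↔
    return implies(a, b) and implies(b, a)

# Enumerate the truth table for implies and iff
for a in (False, True):
    for b in (False, True):
        print(a, b, implies(a, b), iff(a, b))
```

Enumerating both inputs like this is exactly how the truth tables in the next section are built.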

Translating between English and Logic Notation

To use logic, it is first necessary to convert facts and rules about the real world into logical expressions using the logical operators described above. Without a reasonable amount of experience at this translation, it can seem quite a daunting task in some cases. Let us examine some examples. First, we will consider the simple operators, ∧, ∨, and ¬. Sentences that use the word and in English to express more than one concept, all of which is true at once, can be easily translated into logic using the AND operator, ∧.

For example: “It is raining and it is Tuesday” might be expressed as R ∧ T, where R means “it is raining” and T means “it is Tuesday.” If it is not necessary to discuss where it is raining, R is probably enough. If we need to write expressions such as “it is raining in New York” or “it is raining heavily” or even “it rained for 30 minutes on Thursday,” then R will probably not suffice. To express more complex concepts like these, we usually use predicates. Hence, for example, we might translate “it is raining in New York” as N(R). We might equally well choose to write it as R(N). This depends on whether we consider the rain to be a property of New York, or vice versa. In other words, when we write N(R), we are saying that a property of the rain is that it is in New York, whereas with R(N) we are saying that a property of New York is that it is raining. Which we use depends on the problem we are solving. It is likely that if we are solving a problem about New York, we would use R(N), whereas if we are solving a problem about the location of various types of weather, we might use N(R). Let us return now to the logical operators. The expression “it is raining in New York, and I’m either getting sick or just very tired” can be expressed as follows: R(N) ∧ (S(I) ∨ T(I)). Here we have used both the ∧ operator and the ∨ operator to express a collection of statements. The statement can be broken down into two sections, which is indicated by the use of parentheses.

The section in the parentheses is S(I) ∨ T(I), which means “I’m either getting sick OR I’m very tired.” This expression is “AND’ed” with the part outside the parentheses, which is R(N).



Finally, the ¬ operator is applied exactly as you would expect—to express negation. For example, “It is not raining in New York” might be expressed as ¬R(N).





It is important to get the ¬ in the right place. For example: “I’m either not well or just very tired” would be translated as ¬W(I) ∨ T(I). The position of the ¬ here indicates that it is bound to W(I) and does not play any role in affecting T(I). Now let us see how the → operator is used. Often when dealing with logic we are discussing rules, which express concepts such as “if it is raining then I will get wet.”

This sentence might be translated into logic as R → W(I). This is read “R implies W(I)” or “IF R THEN W(I).” By replacing the symbols R and W(I) with their respective English language equivalents, we can see that this sentence can be read as “raining implies I’ll get wet” or “IF it’s raining THEN I’ll get wet.”

Implication can be used to express much more complex concepts than this. For example, “Whenever he eats sandwiches that have pickles in them, he ends up either asleep at his desk or singing loud songs” might be translated as:

S(y) ∧ E(x, y) ∧ P(y) → A(x) ∨ (S(x, z) ∧ L(z))

Here we have used the following symbol translations:

S(y) means that y is a sandwich.
E(x, y) means that x (the man) eats y (the sandwich).
P(y) means that y (the sandwich) has pickles in it.
A(x) means that x ends up asleep at his desk.
S(x, z) means that x (the man) sings z (songs).
L(z) means that z (the songs) are loud.

The important thing to realize is that the choice of variables and predicates is important, but that you can choose any variables and predicates that map well to your problem and that help you to solve the problem. For example, in the example we have just looked at, we could perfectly well have used instead S → A ∨ L, where S means “he eats a sandwich which has pickles in it,” A means “he ends up asleep at his desk,” and L means “he sings loud songs.” The choice of granularity is important, but there is no right or wrong way to make this choice. In this simpler logical expression, we have chosen to express a simple relationship between three variables, which makes sense if those variables are all that we care about—in other words, we don’t need to know anything else about the sandwich, or the songs, or the man, and the facts we examine are simply whether or not he eats a sandwich with pickles, sleeps at his desk, and sings loud songs.

The first translation we gave is more appropriate if we need to examine these concepts in more detail and reason more deeply about the entities involved. Note that we have thus far tended to use single letters to represent logical variables. It is also perfectly acceptable to use longer variable names, and thus to write expressions

such as the following: Fish(x) ∧ living(x) → has_scales(x). This kind of notation is obviously more useful when writing logical expressions that are intended to be read by humans; the longer names add no value when the expressions are manipulated by a computer.

Truth Tables

We can use variables to represent possible truth values, in much the same way that variables are used in algebra to represent possible numerical values. We can then apply logical operators to these variables and can reason about the way in which they behave. It is usual to represent the behavior of these logical operators using truth tables. A truth table shows the possible values that can be generated by applying an operator to truth values.

Not

First of all, we will look at the truth table for not, ¬. Not is a unary operator, which means it is applied only to one variable. Its behavior is very simple:

¬true is equal to false
¬false is equal to true

If variable A has value true, then ¬A has value false. If variable B has value false, then ¬B has value true. These can be represented by a truth table:

A      ¬A
true   false
false  true

And

Now, let us examine the truth table for our first binary operator—one which acts on two variables. ∧ is also called the conjunctive operator. A ∧ B is the conjunction of A and B:

A      B      A ∧ B
false  false  false
false  true   false
true   false  false
true   true   true

You can see that the only entry in the truth table for which A ∧ B is true is the one where A is true and B is true. If A is false, or if B is false, then A ∧ B is false. What do A and B mean? They can represent any statement, or proposition, that can take on a truth value.

For example, A might represent “It’s sunny,” and B might represent “It’s warm outside.” In this case, A ∧ B would mean “It is sunny and it’s warm outside,” which clearly is true only if the two component parts are true (i.e., if it is true that it is sunny and it is true that it is warm outside).

Or

The truth table for the or operator, ∨, is as follows. ∨ is also called the disjunctive operator. A ∨ B is the disjunction of A and B:

A      B      A ∨ B
false  false  false
false  true   true
true   false  true
true   true   true

Clearly A ∨ B is true for any situation except when both A and B are false. If A is true, or if B is true, or if both A and B are true, A ∨ B is true. This table represents the inclusive-or operator. A table to represent exclusive-or would have false in the final row. In other words, while A ∨ B is true if A and B are both true, A EOR B (A exclusive-or B) is false if A and B are both true. You may also notice a pleasing symmetry between the truth tables for ∧ and ∨. This will become useful later, as will a number of other symmetrical relationships.

Implies The truth table for implies (→) is a little less intuitive:

A     B     A → B
false false true
false true  true
true  false false
true  true  true

This form of implication is also known as material implication. In the statement A → B, A is the antecedent, and B is the consequent. The bottom two lines of the table should be obvious. If A is true and B is true, then A → B seems to be a reasonable thing to believe. For example, if A means "you live in France" and B means "you speak French," then A → B corresponds to the statement "if you live in France, then you speak French." Clearly, this statement is true (A → B is true) if I live in France and I speak French (A is true and B is true). Similarly, if I live in France, but I don't speak French (A is true, but B is false), then it is clear that A → B is not true. The situations where A is false are a little less clear. If I do not live in France (A is not true), then the truth table tells us that regardless of whether I speak French or not (the value of B), the statement A → B is true. A → B is usually read as "A implies B" but can also be read as "If A then B" or "If A is true then B is true." Hence, if A is false, the statement is not really saying anything about the value of B, so B is free to take on any value (as long as it is true or false, of course!).

All of the following statements are valid:

5² = 25 → 4 = 4 (true → true)
9 × 9 = 123 → 8 > 3 (false → true)
52 = 25 → 0 = 2 (false → false)

In fact, in the second and third examples, the consequent could be given any meaning, and the statement would still be true. For example, the following statement is valid: 52 = 25 → Logic is weird. Notice that when looking at simple logical statements like these, there does not need to be any real-world relationship between the antecedent and the consequent. For logic to be useful, though, we tend to want the relationships being expressed to be meaningful as well as being logically true.
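Material implication can be modeled as ¬A ∨ B. The sketch below (my own illustration; the function name `implies` is an assumption) checks that the only falsifying row is the one with a true antecedent and a false consequent:

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: A -> B is false only when A is true and B is false."""
    return (not a) or b

# Print the full truth table for ->.
for a, b in product([False, True], repeat=2):
    print(a, b, implies(a, b))

# The single falsifying row:
assert implies(True, False) is False
# A false antecedent makes the implication true regardless of B:
assert implies(False, True) and implies(False, False)
```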

Iff The truth table for iff (if and only if, ↔) is as follows:

A     B     A ↔ B
false false true
false true  false
true  false false
true  true  true

It can be seen that A ↔ B is true as long as A and B have the same value. In other words, if one is true and the other false, then A ↔ B is false. Otherwise, if A and B have the same value, A ↔ B is true.

Complex Truth Tables Truth tables are not limited to showing the values for single operators. For example, a truth table can be used to display the possible values for A ∧ (B ∨ C):

A     B     C     A ∧ (B ∨ C)
false false false false
false false true  false
false true  false false
false true  true  false
true  false false false
true  false true  true
true  true  false true
true  true  true  true

Note that for two variables, the truth table has four lines, and for three variables, it has eight. In general, a truth table for n variables will have 2^n lines.

The use of brackets in this expression is important. A ∧ (B ∨ C) is not the same as (A ∧ B) ∨ C. To avoid ambiguity, the logical operators are assigned precedence, as with mathematical operators. The order of precedence that is used is, from highest to lowest:

¬, ∧, ∨, →, ↔

Hence, in a statement such as ¬A ∨ ¬B ∧ C, the ¬ operator has the greatest precedence, meaning that it is most closely tied to its symbols. ∧ has a greater precedence than ∨, which means that the sentence above can be expressed as (¬A) ∨ ((¬B) ∧ C). Similarly, when we write ¬A ∨ B, this is the same as (¬A) ∨ B rather than ¬(A ∨ B). In general, it is a good idea to use brackets whenever an expression might otherwise be ambiguous.
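The importance of brackets can be confirmed by enumeration. This small sketch (illustrative only) lists the rows on which A ∧ (B ∨ C) and (A ∧ B) ∨ C disagree:

```python
from itertools import product

# Compare the two bracketings on every assignment of A, B, C.
differing_rows = [
    (a, b, c)
    for a, b, c in product([False, True], repeat=3)
    if (a and (b or c)) != ((a and b) or c)
]
print(differing_rows)
```

The two expressions disagree exactly on the rows where A is false but C is true, so the bracketing genuinely changes the meaning.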

Tautology Consider the following truth table:

A     ¬A    A ∨ ¬A
false true  true
true  false true

This truth table has a property that we have not seen before: the value of the expression A ∨ ¬A is true regardless of the value of A. An expression like this that is always true is called a tautology. If A is a tautology, we write |= A. A logical expression that is a tautology is often described as being valid. A valid expression is defined as being one that is true under any interpretation. In other words, no matter what meanings and values we assign to the variables in a valid expression, it will still be true. For example, the following sentences are all valid:

If wibble is true, then wibble is true.
Either wibble is true, or wibble is not true.

In the language of logic, we can replace wibble with the symbol A, in which case these two statements can be rewritten as

A → A
A ∨ ¬A

If an expression is false in every interpretation, it is described as being contradictory. The following expressions are contradictory:

A ∧ ¬A
(A ∨ ¬A) → (A ∧ ¬A)
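Validity and contradiction can both be tested by exhaustive enumeration of truth values. A minimal sketch (the function names are my own):

```python
from itertools import product

def is_tautology(expr, n_vars):
    """An expression is a tautology if it is true under every assignment."""
    return all(expr(*vals) for vals in product([False, True], repeat=n_vars))

def is_contradictory(expr, n_vars):
    """An expression is contradictory if it is false under every assignment."""
    return not any(expr(*vals) for vals in product([False, True], repeat=n_vars))

assert is_tautology(lambda a: a or not a, 1)        # A or not A
assert is_tautology(lambda a: (not a) or a, 1)      # A -> A, encoded as not A or A
assert is_contradictory(lambda a: a and not a, 1)   # A and not A
# (A or not A) -> (A and not A), encoded via material implication:
assert is_contradictory(lambda a: (not (a or not a)) or (a and not a), 1)
```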

Equivalence Consider the following two expressions: A ∧ B and B ∧ A. It should be fairly clear that these two expressions will always have the same value for a given pair of values for A and B. In other words, we say that the first expression is logically equivalent to the second expression. We write this as A ∧ B ≡ B ∧ A. This means that the ∧ operator is commutative. Note that this is not the same as implication: A ∧ B → B ∧ A, although this second statement is also true. The difference is that for two expressions e1 and e2, if e1 ≡ e2, then e1 will always have the same value as e2 for a given set of variables. On the other hand, as we have seen, e1 → e2 is true if e1 is false and e2 is true. There are a number of logical equivalences that are extremely useful.

The following is a list of a few of the most common:

A ∨ A ≡ A
A ∧ A ≡ A
A ∧ (B ∧ C) ≡ (A ∧ B) ∧ C (∧ is associative)
A ∨ (B ∨ C) ≡ (A ∨ B) ∨ C (∨ is associative)
A ∧ (B ∨ C) ≡ (A ∧ B) ∨ (A ∧ C) (∧ is distributive over ∨)
A ∧ (A ∨ B) ≡ A
A ∨ (A ∧ B) ≡ A
A ∧ true ≡ A
A ∧ false ≡ false
A ∨ true ≡ true
A ∨ false ≡ A

All of these equivalences can be proved by drawing up the truth tables for each side of the equivalence and seeing if the two tables are the same. The following is a very important equivalence:

A → B ≡ ¬A ∨ B

We do not need to use the → symbol at all: we can replace it with a combination of ¬ and ∨.
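Logical equivalence can likewise be checked by comparing truth tables row by row. In this sketch (names are my own), the → operator is encoded directly from its truth table so that the equivalence A → B ≡ ¬A ∨ B is verified rather than assumed:

```python
from itertools import product

def equivalent(e1, e2, n_vars):
    """Two expressions are equivalent if they agree on every truth assignment."""
    return all(e1(*v) == e2(*v) for v in product([False, True], repeat=n_vars))

# Encode -> from its truth table rather than assuming the equivalence.
IMPLIES = {(False, False): True, (False, True): True,
           (True, False): False, (True, True): True}

# A -> B  is equivalent to  (not A) or B
assert equivalent(lambda a, b: IMPLIES[(a, b)], lambda a, b: (not a) or b, 2)
# Absorption: A and (A or B)  is equivalent to  A
assert equivalent(lambda a, b: a and (a or b), lambda a, b: a, 2)
# Distributivity: A and (B or C)  is equivalent to  (A and B) or (A and C)
assert equivalent(lambda a, b, c: a and (b or c),
                  lambda a, b, c: (a and b) or (a and c), 3)
print("all equivalences hold")
```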



Similarly, the following equivalences mean we do not need to use ∧ or ↔:

A ∧ B ≡ ¬(¬A ∨ ¬B)
A ↔ B ≡ ¬(¬(¬A ∨ B) ∨ ¬(¬B ∨ A))

In fact, any binary logical operator can be expressed using ¬ and ∨. This is a fact that is employed in electronic circuits, where nor gates, based on an operator called nor, are used. Nor is represented by ↓, and is defined as follows:

A ↓ B ≡ ¬(A ∨ B)

Finally, the following equivalences are known as DeMorgan's Laws:

A ∧ B ≡ ¬(¬A ∨ ¬B)
A ∨ B ≡ ¬(¬A ∧ ¬B)

By using these and other equivalences, logical expressions can be simplified. For example, (C ∧ D) ∨ ((C ∧ D) ∧ E) can be simplified using the following rule: A ∨ (A ∧ B) ≡ A. Hence,

(C ∧ D) ∨ ((C ∧ D) ∧ E) ≡ C ∧ D

In this way, it is possible to eliminate subexpressions that do not contribute to the overall value of the expression.
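DeMorgan's Laws and the simplification above can be verified the same way, by comparing the truth tables of both sides (an illustrative sketch; the helper name is my own):

```python
from itertools import product

def equivalent(e1, e2, n_vars):
    """Compare two boolean expressions on every assignment of n_vars variables."""
    return all(e1(*v) == e2(*v) for v in product([False, True], repeat=n_vars))

# DeMorgan's Laws:
assert equivalent(lambda a, b: a and b, lambda a, b: not ((not a) or (not b)), 2)
assert equivalent(lambda a, b: a or b,  lambda a, b: not ((not a) and (not b)), 2)

# The simplification (C and D) or ((C and D) and E)  ==  C and D
assert equivalent(lambda c, d, e: (c and d) or ((c and d) and e),
                  lambda c, d, e: c and d, 3)
print("DeMorgan and simplification verified")
```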

Propositional Logic There are a number of possible systems of logic. The system we have been examining so far is called propositional logic. The language that is used to express propositional logic is called the propositional calculus. A logical system can be defined in terms of its syntax (the alphabet of symbols and how they can be combined), its semantics (what the symbols mean), and a set of rules of deduction that enable us to derive one expression from a set of other expressions and thus make arguments and proofs.

Syntax We have already examined the syntax of propositional calculus. The alphabet of symbols, ∑, is defined as follows:

∑ = {true, false, ¬, →, (, ), ∧, ∨, ↔, p1, p2, p3, . . . , pn, . . . }

Here we have used set notation to define the possible values that are contained within the alphabet ∑. Note that we allow an infinite number of proposition letters, or propositional symbols, p1, p2, p3, . . . , and so on. More usually, we will represent these by capital letters P, Q, R, and so on. If we need to represent a very large number of them, we will use the subscript notation (e.g., p1). An expression is referred to as a well-formed formula (often abbreviated as wff) or a sentence if it is constructed correctly, according to the rules of the syntax of propositional calculus, which are defined as follows. In these rules, we use A, B, C to represent sentences. In other words, we define a sentence recursively, in terms of other sentences. The following are well-formed sentences:

P, Q, R, . . .
true, false
(A)
¬A
A ∧ B
A ∨ B
A → B
A ↔ B

Hence, we can see that the following is an example of a wff:

P ∧ Q ∨ (B ∧ C) → A ∧ B ∨ D ∧ (¬E)

Semantics





The semantics of the operators of propositional calculus can be defined in terms of truth tables. The meaning of P ∧ Q is defined as "true when P is true and Q is also true." The meaning of symbols such as P and Q is arbitrary and could be ignored altogether if we were reasoning about pure logic. In other words, reasoning about sentences such as P ∨ Q ∧ ¬R is possible without considering what P, Q, and R mean.

Because we are using logic as a representational method for artificial intelligence, however, it is often the case that when using propositional logic, the meanings of these symbols are very important. The beauty of this representation is that it is possible for a computer to reason about them in a very general way, without needing to know much about the real world. In other words, if we tell a computer, “I like ice cream, and I like chocolate,” it might represent this statement as A ∧ B, which it could then use to reason with, and, as we will see, it can use this to make deductions.

Predicate Calculus Syntax Predicate calculus allows us to reason about properties of objects and relationships between objects. In propositional calculus, we could express the English statement "I like cheese" by A. This enables us to create constructs such as ¬A, which means "I do not like cheese," but it does not allow us to extract any information about the cheese, or me, or other things that I like. In predicate calculus, we use predicates to express properties of objects. So the sentence "I like cheese" might be expressed as L(me, cheese), where L is a predicate that represents the idea of "liking." Note that as well as expressing a property of me, this statement also expresses a relationship between me and cheese. This can be useful, as we will see, in describing environments for robots and other agents. For example, a simple agent may be concerned with the location of various blocks, and a statement about the world might be T(A,B), which could mean: Block A is on top of Block B. It is also possible to make more general statements using the predicate calculus. For example, to express the idea that everyone likes cheese, we might say

(∀x)(P(x) → L(x, C))

The symbol ∀ is read "for all," so the statement above could be read as "for every x it is true that if property P holds for x, then the relationship L holds between x and C," or in plainer English: "every x that is a person likes cheese." (Here we are interpreting P(x) as meaning "x is a person" or, more precisely, "x has property P.") Note that we have used brackets rather carefully in the statement above. This statement can also be written with fewer brackets: ∀x P(x) → L(x, C). ∀ is called the universal quantifier. The quantifier ∃ can be used to express the notion that some values do have a certain property, but not necessarily all of them: (∃x)(L(x,C)). This statement can be read "there exists an x such that x likes cheese."

This does not make any claims about the possible values of x, so x could be a person, or a dog, or an item of furniture. When we use the existential quantifier ∃ in this way, we are simply saying that there is at least one value of x for which L(x,C) holds. The following is true:

(∀x)(L(x,C)) → (∃x)(L(x,C))

but the following is not:

(∃x)(L(x,C)) → (∀x)(L(x,C))

Relationships between ∀ and ∃ It is also possible to combine the universal and existential quantifiers, such as in the following statement: (∀x)(∃y)(L(x,y)). This statement can be read "for all x, there exists a y such that L holds for x and y," which we might interpret as "everyone likes something." A useful relationship exists between ∀ and ∃. Consider the statement "not everyone likes cheese." We could write this as

¬(∀x)(P(x) → L(x,C)) ------------- (1)

As we have already seen, A → B is equivalent to ¬A ∨ B. Using DeMorgan's laws, we can see that this is equivalent to ¬(A ∧ ¬B). Hence, statement (1) above can be rewritten:

¬(∀x)¬(P(x) ∧ ¬L(x,C)) ------------- (2)

This can be read as "It is not true that for all x the following is not true: x is a person and x does not like cheese." If you examine this rather convoluted sentence carefully, you will see that it is in fact the same as "there exists an x such that x is a person and x does not like cheese." Hence we can rewrite it as

(∃x)(P(x) ∧ ¬L(x,C)) ------------- (3)

In making this transition from statement (2) to statement (3), we have utilized the following equivalence:

(∃x) ≡ ¬(∀x)¬

In an expression of the form (∀x)(P(x, y)), the variable x is said to be bound, whereas y is said to be free. This can be understood as meaning that the variable y could be replaced by any other variable because it is free, and the expression would still have the same meaning, whereas if the variable x were to be replaced by some other variable in P(x,y), then the meaning of the expression would be changed: (∀x)(P(y, z)) is not equivalent to (∀x)(P(x, y)), whereas (∀x)(P(x, z)) is. Note that a variable can occur both bound and free in an expression, as in

(∀x)(P(x,y,z) → (∃y)(Q(y,z)))

In this expression, x is bound throughout, and z is free throughout; y is free in its first occurrence but is bound in (∃y)(Q(y,z)). (Note that both occurrences of y in (∃y)(Q(y,z)) are bound here.) Making this kind of change is known as substitution. Substitution is allowed of any free variable for another free variable.

Functions In much the same way that functions can be used in mathematics, we can express an object that relates to another object in a specific way using functions. For example, to represent the statement "my mother likes cheese," we might use L(m(me), cheese), where the function m(x) means the mother of x. Functions can take more than one argument, and in general a function with n arguments is represented as f(x1, x2, x3, . . . , xn).

First-Order Predicate Logic The type of predicate calculus that we have been referring to is also called first-order predicate logic (FOPL). A first-order logic is one in which the quantifiers ∀ and ∃ can be applied to objects or terms, but not to predicates or functions. So we can define the syntax of FOPL as follows. First, we define a term:

A constant is a term.
A variable is a term.
f(x1, x2, x3, . . . , xn) is a term if x1, x2, x3, . . . , xn are all terms.

Anything that does not meet the above description cannot be a term. For example, the following is not a term: ∀x P(x). This kind of construction we call a sentence or a well-formed formula (wff), which is defined as follows. In these definitions, P is a predicate, x1, x2, x3, . . . , xn are terms, and A, B are wffs. The following are the acceptable forms for wffs:

P(x1, x2, x3, . . . , xn)
¬A
A ∧ B
A ∨ B
A → B
A ↔ B
(∀x)A
(∃x)A

An atomic formula is a wff of the form P(x1, x2, x3, . . . , xn). Higher-order logics exist in which quantifiers can be applied to predicates and functions, and in which the following expression is an example of a wff: (∀P)(∀x)P(x)

Soundness We have seen that a logical system such as propositional logic consists of a syntax, a semantics, and a set of rules of deduction. A logical system also has a set of fundamental truths, which are known as axioms. The axioms are the basic rules that are known to be true and from which all other theorems within the system can be proved. An axiom of propositional logic, for example, is A → (B → A). A theorem of a logical system is a statement that can be proved by applying the rules of deduction to the axioms in the system. If A is a theorem, then we write ⊢ A. A logical system is described as being sound if every theorem is logically valid, or a tautology. It can be proved by induction that both propositional logic and FOPL are sound.

Completeness A logical system is complete if every tautology is a theorem; in other words, if every valid statement in the logic can be proved by applying the rules of deduction to the axioms. Both propositional logic and FOPL are complete.

Decidability A logical system is decidable if it is possible to produce an algorithm that will determine whether any wff is a theorem. In other words, if a logical system is decidable, then a computer can be used to determine whether logical expressions in that system are valid or not. We can prove that propositional logic is decidable by using the fact that it is complete. We can prove that a wff A is a theorem by showing that it is a tautology. To show if a wff is a tautology, we simply need to draw up a truth table for that wff and show that all the lines have true as the result. This can clearly be done algorithmically because we know that a truth table for n variables has 2^n lines and is therefore finite, for a finite number of variables. FOPL, on the other hand, is not decidable. This is due to the fact that it is not possible to develop an algorithm that will determine whether an arbitrary wff in FOPL is logically valid.

Monotonicity A logical system is described as being monotonic if a valid proof in the system cannot be made invalid by adding additional premises or assumptions. In other words, if we find that we can prove a conclusion C by applying rules of deduction to a premise B with assumptions A, then adding additional assumptions A′ and B′ will not stop us from being able to deduce C. Monotonicity of a logical system can be expressed as follows: if we can prove {A, B} ⊢ C, then we can also prove {A, B, A′, B′} ⊢ C. In other words, even adding contradictory assumptions does not stop us from making the proof in a monotonic system. In fact, it turns out that adding contradictory assumptions allows us to prove anything, including invalid conclusions. This makes sense if we recall the line in the truth table for →, which shows that false → true is true. By adding a contradictory assumption, we make our assumptions false and can thus prove any conclusion.

Modal Logics and Possible Worlds The forms of logic that we have dealt with so far deal with facts and properties of objects that are either true or false. In these classical logics, we do not consider the possibility that things change or that things might not always be as they are now. Modal logics are an extension of classical logic that allow us to reason about possibilities and certainties. In other words, using a modal logic, we can express ideas such as "although the sky is usually blue, it isn't always" (for example, at night). In this way, we can reason about possible worlds. A possible world is a universe or scenario that could logically come about. The following statements may not be true in our world, but they are possible, in the sense that they are not illogical, and could be true in a possible world:

Trees are all blue.
Dogs can fly.
People have no legs.

It is possible that some of these statements will become true in the future, or even that they were true in the past. It is also possible to imagine an alternative universe in which these statements are true now. The following statements, on the other hand, cannot be true in any possible world:

A ∧ ¬A
(x > y) ∧ (y > z) ∧ (z > x)

The first of these illustrates the law of the excluded middle, which simply states that a fact must be either true or false: it cannot be both true and false. It also cannot be the case that a fact is neither true nor false. This is a law of classical logic; it is possible to have a logical system without the law of the excluded middle, in which a fact can be both true and false.

The second statement cannot be true by the laws of mathematics. We are not interested in possible worlds in which the laws of logic and mathematics do not hold. A statement that may be true or false, depending on the situation, is called contingent. A statement that must always have the same truth value, regardless of which possible world we consider, is noncontingent. Hence, the following statements are contingent:

A ∧ B
A ∨ B
I like ice cream.
The sky is blue.

The following statements are noncontingent:

A ∨ ¬A
A ∧ ¬A
If you like all ice cream, then you like this ice cream.

Clearly, a noncontingent statement can be either true or false, but the fact that it is noncontingent means it will always have that same truth value. If a statement A is contingent, then we say that A is possibly true, which is written ◊A. If A is noncontingent, then it is necessarily true, which is written □A.

Reasoning in Modal Logic It is not possible to draw up a truth table for the operators ◊ and □. The following rules are examples of the axioms that can be used to reason in this kind of modal logic:

□A → ◊A
¬□A → ◊¬A
¬◊A → □¬A

Although truth tables cannot be drawn up to prove these rules, you should be able to reason about them using your understanding of the meaning of the ◊ and □ operators.

Possible world representations This section describes a method proposed by Nilsson, which generalizes first-order logic in the modeling of uncertain beliefs. The method assigns truth values ranging from 0 to 1 to sets of possible worlds. Each set of possible worlds corresponds to a different interpretation of sentences contained in a knowledge base, denoted KB. Consider the simple case where a KB contains only the single sentence S. S may be either true or false. We envision S as being true in one set of possible worlds W1 and false in another set W2. The actual world, the one we are in, must be in one of the two sets, but we are uncertain which one. Uncertainty is expressed by assigning a probability P to W1 and 1 − P to W2. We can say then that the probability of S being true is P.

When the KB contains L sentences, S1, . . . , SL, more sets of possible worlds are required to represent all consistent truth value assignments. There are 2^L possible truth assignments for L sentences.

Truth value assignments for the set {P, P→Q, Q}:

Consistent:
P     Q     P→Q
True  True  True
True  False False
False True  True
False False True

Inconsistent:
P     Q     P→Q
True  True  False
True  False True
False True  False
False False False

The consistent assignments are based on the probability constraints 0 ≤ pi ≤ 1 for each i, and ∑i pi = 1.

The consistent probability assignments are bounded by the hyperplanes of a certain convex hull
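The split into consistent and inconsistent assignments can be reproduced by enumerating all 2³ truth assignments to the three sentences and keeping those in which the value assigned to P → Q agrees with the value computed from P and Q (a sketch of mine, not from the text):

```python
from itertools import product

# A truth assignment to the sentences {P, P -> Q, Q} is consistent when
# the value assigned to "P -> Q" matches the value computed from P and Q.
consistent, inconsistent = [], []
for p, q, p_implies_q in product([True, False], repeat=3):
    if p_implies_q == ((not p) or q):
        consistent.append((p, q, p_implies_q))
    else:
        inconsistent.append((p, q, p_implies_q))

print(len(consistent), len(inconsistent))  # 4 consistent, 4 inconsistent
```

Each of the four consistent assignments corresponds to one set of possible worlds, and a probability distribution over those sets must satisfy the constraints above.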

Dempster-Shafer theory The Dempster-Shafer theory, also known as the theory of belief functions, is a generalization of the Bayesian theory of subjective probability. Whereas the Bayesian theory requires probabilities for each question of interest, belief functions allow us to base degrees of belief for one question on probabilities for a related question. These degrees of belief may or may not have the mathematical properties of probabilities. The Dempster-Shafer theory owes its name to work by A. P. Dempster (1968) and Glenn Shafer (1976), but the theory came to the attention of AI researchers in the early 1980s, when they were trying to adapt probability theory to expert systems.

Dempster-Shafer degrees of belief resemble the certainty factors in MYCIN, and this resemblance suggested that they might combine the rigor of probability theory with the flexibility of rule-based systems. The Dempster-Shafer theory remains attractive because of its relative flexibility. The Dempster-Shafer theory is based on two ideas: the idea of obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. To illustrate the idea of obtaining degrees of belief for one question from subjective probabilities for another, suppose I have subjective probabilities for the reliability of my friend Betty. My probability that she is reliable is 0.9, and my probability that she is unreliable is 0.1. Suppose she tells me a limb fell on my car. This statement, which must be true if she is reliable, is not necessarily false if she is unreliable. So her testimony alone justifies a 0.9 degree of belief that a limb fell on my car, but only a zero degree of belief (not a 0.1 degree of belief) that no limb fell on my car. This zero does not mean that I am sure that no limb fell on my car, as a zero probability would; it merely means that Betty's testimony gives me no reason to believe that no limb fell on my car. The 0.9 and the zero together constitute a belief function.

To illustrate Dempster's rule for combining degrees of belief, suppose I also have a 0.9 subjective probability for the reliability of Sally, and suppose she too testifies, independently of Betty, that a limb fell on my car. The event that Betty is reliable is independent of the event that Sally is reliable, and we may multiply the probabilities of these events: the probability that both are reliable is 0.9 × 0.9 = 0.81, the probability that neither is reliable is 0.1 × 0.1 = 0.01, and the probability that at least one is reliable is 1 − 0.01 = 0.99. Since they both said that a limb fell on my car, at least one of them being reliable implies that a limb did fall on my car, and hence I may assign this event a degree of belief of 0.99. Suppose, on the other hand, that Betty and Sally contradict each other: Betty says that a limb fell on my car, and Sally says no limb fell on my car. In this case, they cannot both be right and hence cannot both be reliable; only one is reliable, or neither is reliable. The prior probabilities that only Betty is reliable, only Sally is reliable, and that neither is reliable are 0.09, 0.09, and 0.01, respectively, and the posterior probabilities (given

that not both are reliable) are 9/19, 9/19, and 1/19, respectively. Hence we have a 9/19 degree of belief that a limb did fall on my car (because Betty is reliable) and a 9/19 degree of belief that no limb fell on my car (because Sally is reliable).
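The arithmetic in the Betty and Sally example can be reproduced with a small sketch of Dempster's rule of combination over the two-element frame {limb, no_limb}. The function and variable names are my own; masses are represented as dictionaries from subsets of the frame to numbers, with each witness's residual 0.1 of mass assigned to the whole frame:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: multiply masses, discard conflicting mass, renormalize.
    Masses are dicts mapping frozensets (subsets of the frame) to numbers."""
    raw = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    return {s: w / (1 - conflict) for s, w in raw.items()}

FRAME = frozenset({"limb", "no_limb"})
LIMB, NO_LIMB = frozenset({"limb"}), frozenset({"no_limb"})

# Betty (0.9 reliable) says a limb fell; 0.1 of her mass stays on the whole frame.
betty = {LIMB: 0.9, FRAME: 0.1}
# Sally agrees, with the same reliability.
sally = {LIMB: 0.9, FRAME: 0.1}
print(combine(betty, sally)[LIMB])   # combined belief in "limb": 0.99

# If Sally instead contradicts Betty:
sally2 = {NO_LIMB: 0.9, FRAME: 0.1}
combined = combine(betty, sally2)
print(combined[LIMB])                # 9/19, about 0.4737
```

With agreeing witnesses the combined belief in "limb" is 0.99, as in the text; with contradicting witnesses, normalizing away the 0.81 of conflicting mass leaves 0.09/0.19 = 9/19 for each claim.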

In summary, we obtain degrees of belief for one question (Did a limb fall on my car?) from probabilities for another question (Is the witness reliable?). Dempster's rule begins with the assumption that the questions for which we have probabilities are independent with respect to our subjective probability judgments, but this independence is only a priori; it disappears when conflict is discerned between the different items of evidence. Implementing the Dempster-Shafer theory in a specific problem generally involves solving two related problems. First, we must sort the uncertainties in the problem into a priori independent items of evidence. Second, we must carry out Dempster's rule computationally. These two problems and their solutions are closely related. Sorting the uncertainties into independent items leads to a structure involving items of evidence that bear on different but related questions, and this structure can be used to make computations. This can be regarded as a more general approach to representing uncertainty than the Bayesian approach. The basic idea in representing uncertainty in this model is:

Set up a confidence interval, an interval of probabilities within which the true probability lies with a certain confidence, based on the belief B and plausibility Pl provided by some evidence E for a proposition P. The belief brings together all the evidence that would lead us to believe in P with some certainty. The plausibility brings together the evidence that is compatible with P and is not inconsistent with it. This method allows for further additions to the set of knowledge and does not assume disjoint outcomes.

NOTE: This deals with set theory terminology that will be dealt with in a tutorial shortly. Also see the exercises to get experience of problem solving in this important subject matter.

If Θ is the set of possible outcomes, then a mass probability, M, is defined for each member of the power set of Θ and takes values in the range [0, 1]. The null set, ∅, is also a member of the power set. M is a probability density function defined not just for Θ but for all of its subsets. So if Θ is the set {Flu (F), Cold (C), Pneumonia (P)}, then the power set of Θ is {∅, {F}, {C}, {P}, {F, C}, {F, P}, {C, P}, {F, C, P}}.

The confidence interval is defined as [B(E), Pl(E)], where B(E) is the sum of the masses M(A) over all subsets A of E, i.e., all the evidence that makes us believe in the correctness of P, and Pl(E) = 1 − B(¬E), i.e., one minus all the evidence that contradicts P.

Let X be the universal set: the set of all states under consideration. The power set, 2^X, is the set of all possible subsets of X, including the empty set ∅. For example, if X = {a, b}, then 2^X = {∅, {a}, {b}, X}. The elements of the power set can be taken to represent propositions that one might be interested in, by containing all and only the states in which the proposition is true.

The theory of evidence assigns a belief mass to each element of the power set. Formally, a function m: 2^X → [0, 1] is called a basic belief assignment (BBA) when it has two properties. First, the mass of the empty set is zero: m(∅) = 0. Second, the masses of the remaining members of the power set add up to a total of 1: ∑ m(A) = 1, where the sum is taken over all A ∈ 2^X. The mass m(A) of a given member of the power set, A, expresses the proportion of all relevant and available evidence that supports the claim that the actual state belongs to A but to no particular subset of A. The value of m(A) pertains only to the set A and makes no additional claims about any subsets of A, each of which has, by definition, its own mass.

From the mass assignments, the upper and lower bounds of a probability interval can be defined. This interval contains the precise probability of a set of interest (in the classical sense), and is bounded by two non-additive continuous measures called belief (or support) and plausibility: bel(A) ≤ P(A) ≤ pl(A)
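The bounds bel(A) ≤ P(A) ≤ pl(A) can be computed directly from a mass assignment. In this sketch (the mass values are invented for illustration), belief sums the masses of all non-empty subsets of A, and plausibility sums the masses of all sets that intersect A:

```python
def bel(m, a):
    """Belief in A: total mass committed to non-empty subsets of A."""
    return sum(w for s, w in m.items() if s and s <= a)

def pl(m, a):
    """Plausibility of A: total mass of every set that intersects A."""
    return sum(w for s, w in m.items() if s & a)

# Invented masses over the frame {Flu (F), Cold (C), Pneumonia (P)}:
m = {frozenset({"F"}): 0.3,
     frozenset({"F", "C"}): 0.4,
     frozenset({"F", "C", "P"}): 0.3}
a = frozenset({"F", "C"})
print(bel(m, a), pl(m, a))  # belief 0.7, plausibility 1.0
```

Here bel({F, C}) = 0.3 + 0.4 = 0.7 (only {F} and {F, C} are subsets of A), while every focal set intersects {F, C}, so pl({F, C}) = 1.0, and the true probability lies somewhere in [0.7, 1.0].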


Benefits of Dempster-Shafer Theory:

• Allows a proper distinction between reasoning and decision taking
• No modeling restrictions (e.g., DAGs)
• It properly represents partial and total ignorance
• Ignorance is quantified:
o a low degree of ignorance means high confidence in results and enough information available for taking decisions
o a high degree of ignorance means low confidence in results; gather more information (if possible) before taking decisions
• Conflict is quantified:
o low conflict indicates the presence of confirming information sources
o high conflict indicates the presence of contradicting sources
• Simplicity: Dempster's rule of combination covers
o combination of evidence
o Bayes' rule
o Bayesian updating (conditioning)
o belief revision (results from non-monotonicity)

DS theory has not been very successful because:

• Inference is less efficient than Bayesian inference
• Pearl is a better speaker than Dempster (and Shafer, Kohlas, etc.)
• Microsoft supports Bayesian networks
• The UAI community does not like "outsiders"

Fuzzy Set Theory What is a Fuzzy Set?

• The word "fuzzy" means "vagueness." Fuzziness occurs when the boundary of a piece of information is not clear-cut.

• Fuzzy sets were introduced by Lotfi A. Zadeh (1965) as an extension of the classical notion of a set.

• Classical set theory allows the membership of elements in a set only in binary terms, a bivalent condition: an element either belongs or does not belong to the set. Fuzzy set theory permits the gradual assessment of the membership of elements in a set, described with the aid of a membership function valued in the real unit interval [0, 1].

• Example: Words like young, tall, good, or high are fuzzy. There is no single quantitative value which defines the term young. For some people, age 25 is young, and for others, age 35 is young. The concept young has no clean boundary. Age 1 is definitely young and age 100 is definitely not young; age 35 has some possibility of being young, which usually depends on the context in which it is being considered.

Introduction In the real world, there exists much fuzzy knowledge: knowledge that is vague, imprecise, uncertain, ambiguous, inexact, or probabilistic in nature. Human thinking and reasoning frequently involve fuzzy information, originating from inherently inexact human concepts. Humans can give satisfactory answers, which are probably true. However, our systems are unable to answer many questions. The reason is that most systems are designed based upon classical set theory and two-valued logic, which is unable to cope with unreliable and incomplete information and to give expert opinions.

• Classical Set Theory
A set is any well-defined collection of objects. An object in a set is called an element or member of that set. Sets are defined by a simple statement describing whether a particular element having a certain property belongs to that particular set.



Classical set theory enumerates all the elements of a set using
A = { a1 , a2 , a3 , . . . , an }.
If the elements ai (i = 1, 2, 3, . . . , n) of a set A are a subset of the universal set X, then set A can be represented for all elements x ∈ X by its characteristic function

µA (x) = 1 if x ∈ A, 0 otherwise

A set A is well described by a function called the characteristic function. This function, defined on the universal space X, assumes: a value of 1 for those elements x that belong to set A, and a value of 0 for those elements x that do not belong to set A. The notation used to express this mathematically is

A : X → [0, 1]

A(x) = 1 , x is a member of A        Eq.(1)
A(x) = 0 , x is not a member of A

Alternatively, the set A can be represented for all elements x ∈ X by its characteristic function µA (x), defined as

µA (x) = 1 if x ∈ A, 0 otherwise        Eq.(2)

Thus in classical set theory µA (x) has only the values 0 ('false') and 1 ('true'). Such sets are called crisp sets.
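As a minimal sketch (not part of the original notation), the characteristic function of Eq.(2) can be written in a few lines of Python; the set A below is a hypothetical crisp subset chosen only for illustration.

```python
# Characteristic function of a crisp set: returns only 0 or 1, as in Eq.(2).
A = {2, 3, 5, 7, 11}             # a hypothetical crisp subset of X

def mu_A(x):
    """1 if x belongs to A, 0 otherwise."""
    return 1 if x in A else 0

X = range(1, 13)                 # universal space X = {1, ..., 12}
print([mu_A(x) for x in X])      # [0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
```

Only the two values 0 and 1 ever appear, which is exactly what makes the set crisp.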

• Fuzzy Set Theory
Fuzzy set theory is an extension of classical set theory where elements have varying degrees of membership. A logic based on the two truth values, True and False, is sometimes inadequate when describing human reasoning. Fuzzy logic uses the whole interval between 0 (false) and 1 (true) to describe human reasoning.
− A Fuzzy Set is any set that allows its members to have different degrees of membership, given by a membership function, in the interval [0, 1].
− The degree of membership or truth is not the same as probability; fuzzy truth is not the likelihood of some event or condition; fuzzy truth represents membership in vaguely defined sets.
− Fuzzy logic is derived from fuzzy set theory, dealing with reasoning that is approximate rather than precisely deduced from classical predicate logic.
− Fuzzy logic is capable of handling inherently imprecise concepts.
− Fuzzy logic allows set membership values in linguistic form for imprecise concepts like "slightly", "quite" and "very".
− Fuzzy set theory defines Fuzzy Operators on Fuzzy Sets.

• Crisp and Non-Crisp Sets
− As said before, in classical set theory the characteristic function µA(x) of Eq.(2) has only the values 0 ('false') and 1 ('true'). Such sets are crisp sets.
− For non-crisp sets the characteristic function µA(x) can take values between 0 and 1. The characteristic function µA(x) of Eq.(2) for the crisp set is generalized for non-crisp sets; this generalized characteristic function is called the membership function. Such non-crisp sets are called Fuzzy Sets.
− Crisp set theory is not capable of representing descriptions and classifications in many cases; in fact, crisp sets do not provide an adequate representation for most cases.

• Representation of Crisp and Non-Crisp Sets
Example: Classify students for a basketball team. This example explains the grade of truth value.
− tall students qualify and not tall students do not qualify
− if students 1.8 m tall are to be qualified, then should we exclude a student who is 1/10" less? Or should we exclude a student who is 1" shorter?

■ Non-Crisp Representation to represent the notion of a tall person.

Fig. 1 Set Representation – Degree or grade of truth: in crisp logic the membership jumps from Not Tall to Tall at height x = 1.8 m; in non-crisp logic the degree of truth changes gradually around 1.8 m.

A student of height 1.79 m would belong to both the tall and the not tall sets, each with a particular degree of membership. As the height increases, the membership grade within the tall set would increase whilst the membership grade within the not-tall set would decrease.

• Capturing Uncertainty
Instead of avoiding or ignoring uncertainty, Lotfi Zadeh introduced Fuzzy Set theory that captures uncertainty.

■ A fuzzy set is described by a membership function µA (x) of A. This membership function associates with each element xσ ∈ X a number µA (xσ) in the closed unit interval [0, 1]. The number µA (xσ) represents the degree of membership of xσ in A.



The notation used for the membership function µA (x) of a fuzzy set A is
A : X → [0, 1]

Each membership function maps elements of a given universal base set X, which is itself a crisp set, into real numbers in [0, 1].

■ Example

Fig. 2 Membership function of a Crisp set C and a Fuzzy set F: µC (x) takes only the values 0 and 1, while µF (x) also takes intermediate values such as 0.5.

■ In the case of crisp sets, the members of a set are either out of the set, with membership of degree 0, or in the set, with membership of degree 1. Therefore,

Crisp Sets ⊆ Fuzzy Sets

In other words, crisp sets are special cases of fuzzy sets.

• Examples of Crisp and Non-Crisp Sets
Example 1 : Set of prime numbers (a crisp set)
If we consider the space X consisting of natural numbers ≤ 12, i.e. X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, then the set of prime numbers can be described as
PRIME = {x contained in X | x is a prime number} = {2, 3, 5, 7, 11}

Example 2 : Set SMALL of small numbers (a non-crisp set)

The set SMALL of small numbers cannot be described unambiguously; for example, 1 is a member of SMALL and 12 is not a member of SMALL. The set A, as SMALL, has un-sharp boundaries and can be characterized by a function that assigns a real number from the closed interval from 0 to 1 to each element x in the set X.

Fuzzy Set
A Fuzzy Set is any set that allows its members to have different degrees of membership, given by a membership function, in the interval [0, 1].

• Definition of Fuzzy Set
A fuzzy set A, defined in the universal space X, is a function defined in X which assumes values in the range [0, 1]. A fuzzy set A is written as a set of pairs {x, A(x)}:

A = {{x , A(x)}} , x in the set X

where x is an element of the universal space X, and A(x) is the value of the function A for this element. The value A(x) is the membership grade of the element x in the fuzzy set A.

Example : Set SMALL in the set X consisting of the natural numbers 1 to 12.
Assume: SMALL(1) = 1, SMALL(2) = 1, SMALL(3) = 0.9, SMALL(4) = 0.6, SMALL(5) = 0.4, SMALL(6) = 0.3, SMALL(7) = 0.2, SMALL(8) = 0.1, SMALL(u) = 0 for u ≥ 9.

Then, following the notation described in the definition above :
Set SMALL = {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

Note that a fuzzy set can be defined precisely by associating with each x its grade of membership in SMALL.
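As a minimal sketch (not part of the original notation), the set SMALL above can be stored as a Python dictionary mapping each element of the universal space to its membership grade; `membership` is a hypothetical helper name.

```python
# Fuzzy set SMALL: each element x of X maps to its membership grade A(x).
SMALL = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
         7: 0.2, 8: 0.1, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}

def membership(fuzzy_set, x):
    """Membership grade of x; 0 for elements outside the listed support."""
    return fuzzy_set.get(x, 0.0)

print(membership(SMALL, 3))   # 0.9
print(membership(SMALL, 9))   # 0.0
```

Each pair {x, A(x)} of the definition becomes one key-value entry of the dictionary.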

• Definition of Universal Space
Originally, the universal space for fuzzy sets in fuzzy logic was defined only on the integers. Now, the universal space for fuzzy sets and fuzzy relations is defined with three numbers: the first two numbers specify the start and end of the universal space, and the third argument specifies the increment between elements. This gives the user more flexibility in choosing the universal space.

Example : The fuzzy set of numbers, defined in the universal space X = { xi } = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} is presented as SetOption [FuzzySet, UniversalSpace → {1, 12, 1}]

Fuzzy Membership
A fuzzy set A defined in the universal space X is a function defined in X which assumes values in the range [0, 1]. A fuzzy set is written as a set of pairs {x, A(x)}:

A = {{x , A(x)}} , x in the set X

where x is an element of the universal space X, and A(x) is the value of the function A for this element. The value A(x) is the degree of membership of the element x in the fuzzy set A.

The graphic interpretation of fuzzy membership for the fuzzy sets SMALL, PRIME Numbers, UNIVERSALSPACE, Finite and Infinite Universal Space, and EMPTY is illustrated in the sections that follow.

• Graphic Interpretation of Fuzzy Sets SMALL
The fuzzy set SMALL of small numbers, defined in the universal space X = { xi } = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, is presented as
SetOption [FuzzySet, UniversalSpace → {1, 12, 1}]

The Set SMALL in set X is :
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

Therefore SetSmall is represented as
SetSmall = FuzzySet [{{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}, UniversalSpace → {1, 12, 1}]

FuzzyPlot [SMALL, AxesLabel → {"X", "SMALL"}]

Fig Graphic Interpretation of Fuzzy Set SMALL

• Graphic Interpretation of Fuzzy Sets PRIME Numbers
The fuzzy set PRIME of prime numbers, defined in the universal space X = { xi } = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, is presented as
SetOption [FuzzySet, UniversalSpace → {1, 12, 1}]

The Set PRIME in set X is :
PRIME = FuzzySet {{1, 0}, {2, 1}, {3, 1}, {4, 0}, {5, 1}, {6, 0}, {7, 1}, {8, 0}, {9, 0}, {10, 0}, {11, 1}, {12, 0}}

Therefore SetPrime is represented as
SetPrime = FuzzySet [{{1, 0}, {2, 1}, {3, 1}, {4, 0}, {5, 1}, {6, 0}, {7, 1}, {8, 0}, {9, 0}, {10, 0}, {11, 1}, {12, 0}}, UniversalSpace → {1, 12, 1}]

FuzzyPlot [PRIME, AxesLabel → {"X", "PRIME"}]

Fig Graphic Interpretation of Fuzzy Set PRIME

• Graphic Interpretation of Fuzzy Sets UNIVERSALSPACE
In any application of set or fuzzy set theory, all sets are subsets of a fixed set called the universal space or universe of discourse, denoted by X. The universal space X as a fuzzy set is a function equal to 1 for all elements.

The fuzzy set UNIVERSALSPACE, defined in the universal space X = { xi } = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, is presented as
SetOption [FuzzySet, UniversalSpace → {1, 12, 1}]

The Set UNIVERSALSPACE in set X is :
UNIVERSALSPACE = FuzzySet {{1, 1}, {2, 1}, {3, 1}, {4, 1}, {5, 1}, {6, 1}, {7, 1}, {8, 1}, {9, 1}, {10, 1}, {11, 1}, {12, 1}}

Therefore SetUniversal is represented as
SetUniversal = FuzzySet [{{1, 1}, {2, 1}, {3, 1}, {4, 1}, {5, 1}, {6, 1}, {7, 1}, {8, 1}, {9, 1}, {10, 1}, {11, 1}, {12, 1}}, UniversalSpace → {1, 12, 1}]

FuzzyPlot [UNIVERSALSPACE, AxesLabel → {"X", "UNIVERSAL SPACE"}]

Fig Graphic Interpretation of Fuzzy Set UNIVERSALSPACE



Finite and Infinite Universal Space
Universal sets can be finite or infinite. A universal set is finite if it consists of a specific number of different elements, that is, if in counting the different elements of the set the counting can come to an end; otherwise the set is infinite.
Examples :
1. Let N be the universal space of the days of the week: N = {Mo, Tu, We, Th, Fr, Sa, Su}. N is finite.
2. Let M = {1, 3, 5, 7, 9, . . .}. M is infinite.
3. Let L = {u | u is a lake in a city}. L is finite. (Although it may be difficult to count the number of lakes in a city, L is still a finite universal set.)



Graphic Interpretation of Fuzzy Sets EMPTY
An empty set is a set that contains only elements with a grade of membership equal to 0. Example: Let EMPTY be the set of people in Minnesota older than 120. The empty set is also called the null set.

The fuzzy set EMPTY, defined in the universal space X = { xi } = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, is presented as
SetOption [FuzzySet, UniversalSpace → {1, 12, 1}]

The Set EMPTY in set X is :
EMPTY = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0}, {5, 0}, {6, 0}, {7, 0}, {8, 0}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

Therefore SetEmpty is represented as
SetEmpty = FuzzySet [{{1, 0}, {2, 0}, {3, 0}, {4, 0}, {5, 0}, {6, 0}, {7, 0}, {8, 0}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}, UniversalSpace → {1, 12, 1}]

FuzzyPlot [EMPTY, AxesLabel → {"X", "EMPTY"}]

Fig Graphic Interpretation of Fuzzy Set EMPTY

Fuzzy Operations
Fuzzy set operations are the operations on fuzzy sets; they are generalizations of the crisp set operations. Zadeh [1965] formulated fuzzy set theory in terms of the following standard operations : Complement, Union, Intersection and Difference.

In this section, the graphical interpretation of the following fuzzy set terms and fuzzy logic operations is illustrated :

Inclusion :    FuzzyInclude [VERYSMALL, SMALL]
Equality :     FuzzyEQUALITY [SMALL, STILLSMALL]
Complement :   FuzzyNOTSMALL = FuzzyComplement [SMALL]
Union :        FuzzyUNION = [SMALL ∪ MEDIUM]
Intersection : FUZZYINTERSECTION = [SMALL ∩ MEDIUM]

Inclusion
Let A and B be fuzzy sets defined in the same universal space X. The fuzzy set A is included in the fuzzy set B if and only if, for every x in the set X, A(x) ≤ B(x).

Example :
The universal space X = { xi } = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} is presented as SetOption [FuzzySet, UniversalSpace → {1, 12, 1}].

The fuzzy set B SMALL
The Set SMALL in set X is :
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}
Therefore SetSmall is represented as
SetSmall = FuzzySet [{{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}, UniversalSpace → {1, 12, 1}]

The fuzzy set A VERYSMALL
The Set VERYSMALL in set X is :
VERYSMALL = FuzzySet {{1, 1}, {2, 0.8}, {3, 0.7}, {4, 0.4}, {5, 0.2}, {6, 0.1}, {7, 0}, {8, 0}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}
Therefore SetVerySmall is represented as
SetVerySmall = FuzzySet [{{1, 1}, {2, 0.8}, {3, 0.7}, {4, 0.4}, {5, 0.2}, {6, 0.1}, {7, 0}, {8, 0}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}, UniversalSpace → {1, 12, 1}]

The Fuzzy Operation : Inclusion
Include [VERYSMALL, SMALL]

Fig Graphic Interpretation of Fuzzy Inclusion (membership grades of B = SMALL and A = VERYSMALL)
FuzzyPlot [SMALL, VERYSMALL]
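The inclusion test A(x) ≤ B(x) for every x can be sketched directly in Python; the grades below are the SMALL and VERYSMALL tables from the text, and `included` is a hypothetical helper name.

```python
# Fuzzy inclusion: A is included in B iff A(x) <= B(x) for every x in X.
SMALL     = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
             7: 0.2, 8: 0.1, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}
VERYSMALL = {1: 1.0, 2: 0.8, 3: 0.7, 4: 0.4, 5: 0.2, 6: 0.1,
             7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}

def included(a, b):
    """True iff fuzzy set a is included in fuzzy set b."""
    return all(a[x] <= b[x] for x in a)

print(included(VERYSMALL, SMALL))   # True
print(included(SMALL, VERYSMALL))   # False
```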

• Comparability
Two fuzzy sets A and B are comparable if the condition A ⊆ B or B ⊆ A holds, i.e., if one of the fuzzy sets is a subset of the other set. Two fuzzy sets A and B are incomparable if neither A ⊆ B nor B ⊆ A holds.

Example 1 :
Let A = {{a, 1}, {b, 1}, {c, 0}} and B = {{a, 1}, {b, 1}, {c, 1}}.
Then A is comparable to B, since A is a subset of B.

Example 2 :
Let C = {{a, 1}, {b, 1}, {c, 0.5}} and D = {{a, 1}, {b, 0.9}, {c, 0.6}}.
Then C and D are not comparable, since C is not a subset of D and D is not a subset of C.

Property Related to Inclusion :
for all x in the set X, if A(x) ≤ B(x) and B(x) ≤ C(x), then A ⊆ C; that is, inclusion is transitive.

• Equality
Let A and B be fuzzy sets defined in the same space X. Then A and B are equal, denoted A = B, if and only if for all x in the set X, A(x) = B(x).

Example :
The fuzzy set B SMALL
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

The fuzzy set A STILLSMALL
STILLSMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

The Fuzzy Operation : Equality
Equality [SMALL, STILLSMALL]

Fig Graphic Interpretation of Fuzzy Equality
FuzzyPlot [SMALL, STILLSMALL]

Note : If the equality A(x) = B(x) is not satisfied even for one element x in the set X, then we say that A is not equal to B.

• Complement
Let A be a fuzzy set defined in the space X. Then the fuzzy set B is a complement of the fuzzy set A if and only if, for all x in the set X, B(x) = 1 − A(x). The complement of the fuzzy set A is often denoted by A' or Ac.

Fuzzy Complement : Ac(x) = 1 − A(x)

Example 1.
The fuzzy set A SMALL
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

The fuzzy set Ac NOTSMALL
NOTSMALL = FuzzySet {{1, 0}, {2, 0}, {3, 0.1}, {4, 0.4}, {5, 0.6}, {6, 0.7}, {7, 0.8}, {8, 0.9}, {9, 1}, {10, 1}, {11, 1}, {12, 1}}

The Fuzzy Operation : Complement
NOTSMALL = Complement [SMALL]

Fig Graphic Interpretation of Fuzzy Complement
FuzzyPlot [SMALL, NOTSMALL]
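The complement rule B(x) = 1 − A(x) translates directly to code; a small sketch, with `complement` a hypothetical helper name and rounding used only to guard against floating-point noise.

```python
# Fuzzy complement: A'(x) = 1 - A(x) for every x in the universal space.
SMALL = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
         7: 0.2, 8: 0.1, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}

def complement(a):
    """Return the fuzzy complement of a (rounded to avoid float noise)."""
    return {x: round(1 - g, 10) for x, g in a.items()}

NOTSMALL = complement(SMALL)
print(NOTSMALL[5])   # 0.6
print(NOTSMALL[9])   # 1.0
```

The result matches the NOTSMALL table given in the text.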

Example 2.
The empty set Φ and the universal set X, as fuzzy sets, are complements of one another :
Φ' = X , X' = Φ

The fuzzy set B EMPTY
EMPTY = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0}, {5, 0}, {6, 0}, {7, 0}, {8, 0}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

The fuzzy set A UNIVERSAL
UNIVERSAL = FuzzySet {{1, 1}, {2, 1}, {3, 1}, {4, 1}, {5, 1}, {6, 1}, {7, 1}, {8, 1}, {9, 1}, {10, 1}, {11, 1}, {12, 1}}

The fuzzy operation : Complement
EMPTY = Complement [UNIVERSALSPACE]

Fig Graphic Interpretation of Fuzzy Complement
FuzzyPlot [EMPTY, UNIVERSALSPACE]



Union
Let A and B be fuzzy sets defined in the space X. The union is defined as the smallest fuzzy set that contains both A and B. The union of A and B is denoted by A ∪ B. The following relation must be satisfied for the union operation :
for all x in the set X, (A ∪ B)(x) = max (A(x), B(x)).

Fuzzy Union : (A ∪ B)(x) = max [A(x), B(x)] for all x ∈ X

Example 1 : Union of fuzzy sets A and B
If A(x) = 0.6 and B(x) = 0.4, then (A ∪ B)(x) = max [0.6, 0.4] = 0.6.

Example 2 : Union of SMALL and MEDIUM

The fuzzy set A SMALL
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

The fuzzy set B MEDIUM
MEDIUM = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.4}, {11, 0.1}, {12, 0}}

The fuzzy operation : Union
FUZZYUNION = [SMALL ∪ MEDIUM]
SetSmallUNIONMedium = FuzzySet [{{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.4}, {11, 0.1}, {12, 0}}, UniversalSpace → {1, 12, 1}]

Fig Graphic Interpretation of Fuzzy Union
FuzzyPlot [UNION]

The notion of union is closely related to that of the connective "or". Let A be a class of "Young" men and B a class of "Bald" men. If "David is Young" or "David is Bald", then David is associated with the union of A and B; this implies that David is a member of A ∪ B.
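The union rule is a pointwise max; a sketch over the SMALL and MEDIUM grades from the text (`fuzzy_union` is a hypothetical helper name).

```python
# Fuzzy union: (A ∪ B)(x) = max(A(x), B(x)) for every x in the universal space.
SMALL  = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
          7: 0.2, 8: 0.1, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}
MEDIUM = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.2, 5: 0.5, 6: 0.8,
          7: 1.0, 8: 1.0, 9: 0.7, 10: 0.4, 11: 0.1, 12: 0.0}

def fuzzy_union(a, b):
    """Pointwise max over the shared universal space."""
    return {x: max(a[x], b[x]) for x in a}

U = fuzzy_union(SMALL, MEDIUM)
print(U[5], U[9])   # 0.5 0.7
```

These grades agree with the SetSmallUNIONMedium entries {5, 0.5} and {9, 0.7}.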



Intersection
Let A and B be fuzzy sets defined in the space X. The intersection is defined as the greatest fuzzy set that is contained in both A and B. The intersection of A and B is denoted by A ∩ B. The following relation must be satisfied for the intersection operation :
for all x in the set X, (A ∩ B)(x) = min (A(x), B(x)).

Fuzzy Intersection : (A ∩ B)(x) = min [A(x), B(x)] for all x ∈ X

Example 1 : Intersection of fuzzy sets A and B
If A(x) = 0.6 and B(x) = 0.4, then (A ∩ B)(x) = min [0.6, 0.4] = 0.4.

Example 2 : Intersection of SMALL and MEDIUM

The fuzzy set A SMALL
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

The fuzzy set B MEDIUM
MEDIUM = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.4}, {11, 0.1}, {12, 0}}

The fuzzy operation : Intersection
FUZZYINTERSECTION = [SMALL ∩ MEDIUM]
SetSmallINTERSECTIONMedium = FuzzySet [{{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}, UniversalSpace → {1, 12, 1}]

Fig Graphic Interpretation of Fuzzy Intersection
FuzzyPlot [INTERSECTION]
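The intersection is the dual pointwise min; a sketch over the same SMALL and MEDIUM grades (`fuzzy_intersection` is a hypothetical helper name).

```python
# Fuzzy intersection: (A ∩ B)(x) = min(A(x), B(x)) for every x.
SMALL  = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
          7: 0.2, 8: 0.1, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}
MEDIUM = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.2, 5: 0.5, 6: 0.8,
          7: 1.0, 8: 1.0, 9: 0.7, 10: 0.4, 11: 0.1, 12: 0.0}

def fuzzy_intersection(a, b):
    """Pointwise min over the shared universal space."""
    return {x: min(a[x], b[x]) for x in a}

I = fuzzy_intersection(SMALL, MEDIUM)
print(I[4], I[5])   # 0.2 0.4
```

These grades agree with the SetSmallINTERSECTIONMedium entries {4, 0.2} and {5, 0.4}.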

• Difference
Let A and B be fuzzy sets defined in the space X. The difference of A and B is defined as A ∩ B'.
Fuzzy Difference : (A − B)(x) = min [A(x), 1 − B(x)] for all x ∈ X

Example : Difference of MEDIUM and SMALL

The fuzzy set A MEDIUM
MEDIUM = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.4}, {11, 0.1}, {12, 0}}

The fuzzy set B SMALL
SMALL = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0}, {10, 0}, {11, 0}, {12, 0}}

Fuzzy Complement : Bc(x) = 1 − B(x)
The fuzzy set Bc NOTSMALL
NOTSMALL = FuzzySet {{1, 0}, {2, 0}, {3, 0.1}, {4, 0.4}, {5, 0.6}, {6, 0.7}, {7, 0.8}, {8, 0.9}, {9, 1}, {10, 1}, {11, 1}, {12, 1}}

The fuzzy operation : Difference, by the definition of difference
FUZZYDIFFERENCE = [MEDIUM ∩ SMALL']
SetMediumDIFFERENCESmall = FuzzySet [{{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.5}, {6, 0.7}, {7, 0.8}, {8, 0.9}, {9, 0.7}, {10, 0.4}, {11, 0.1}, {12, 0}}, UniversalSpace → {1, 12, 1}]

Fig Graphic Interpretation of Fuzzy Difference
FuzzyPlot [DIFFERENCE]
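The difference combines the two previous rules, min with the complement; a sketch over the MEDIUM and SMALL grades from the text (`fuzzy_difference` is a hypothetical helper name).

```python
# Fuzzy difference: (A - B)(x) = min(A(x), 1 - B(x)), i.e. A ∩ B'.
MEDIUM = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.2, 5: 0.5, 6: 0.8,
          7: 1.0, 8: 1.0, 9: 0.7, 10: 0.4, 11: 0.1, 12: 0.0}
SMALL  = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
          7: 0.2, 8: 0.1, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0}

def fuzzy_difference(a, b):
    """Pointwise min of a and the complement of b."""
    return {x: min(a[x], round(1 - b[x], 10)) for x in a}

D = fuzzy_difference(MEDIUM, SMALL)
print(D[6], D[7])   # 0.7 0.8
```

The grades agree with the MEDIUM ∩ SMALL' entries {6, 0.7} and {7, 0.8}.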

Fuzzy Properties
The properties related to Union, Intersection and Difference are illustrated below.

• Properties Related to Union
The properties related to union are : Identity, Idempotence, Commutativity and Associativity.

■ Identity : A ∪ Φ = A
input = Equality [SMALL ∪ EMPTY, SMALL]
output = True

A ∪ X = X
input = Equality [SMALL ∪ UniversalSpace, UniversalSpace]
output = True

■ Idempotence : A ∪ A = A
input = Equality [SMALL ∪ SMALL, SMALL]
output = True

■ Commutativity : A ∪ B = B ∪ A
input = Equality [SMALL ∪ MEDIUM, MEDIUM ∪ SMALL]
output = True

■ Associativity : A ∪ (B ∪ C) = (A ∪ B) ∪ C
input = Equality [Small ∪ (Medium ∪ Big), (Small ∪ Medium) ∪ Big]
output = True

Fuzzy Sets Small, Medium, Big :
Small  = FuzzySet {{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.4}, {6, 0.3}, {7, 0.2}, {8, 0.1}, {9, 0.7}, {10, 0.4}, {11, 0}, {12, 0}}
Medium = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0}, {10, 0}, {11, 0.1}, {12, 0}}
Big    = FuzzySet {{1, 0}, {2, 0}, {3, 0}, {4, 0}, {5, 0}, {6, 0.1}, {7, 0.2}, {8, 0.4}, {9, 0.6}, {10, 0.8}, {11, 1}, {12, 1}}

Calculate the fuzzy relations :
(1) Medium ∪ Big = FuzzySet [{1, 0}, {2, 0}, {3, 0}, {4, 0.2}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.6}, {10, 0.8}, {11, 1}, {12, 1}]
(2) Small ∪ Medium = FuzzySet [{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.4}, {11, 0.1}, {12, 0}]
(3) Small ∪ (Medium ∪ Big) = FuzzySet [{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.8}, {11, 1}, {12, 1}]
(4) (Small ∪ Medium) ∪ Big = FuzzySet [{1, 1}, {2, 1}, {3, 0.9}, {4, 0.6}, {5, 0.5}, {6, 0.8}, {7, 1}, {8, 1}, {9, 0.7}, {10, 0.8}, {11, 1}, {12, 1}]

Fuzzy sets (3) and (4) demonstrate the associativity relation.



Properties Related to Intersection
The properties related to intersection are : Absorption, Identity, Idempotence, Commutativity and Associativity.

■ Absorption by Empty Set : A ∩ Φ = Φ
input = Equality [Small ∩ Empty, Empty]
output = True

■ Identity : A ∩ X = A
input = Equality [Small ∩ UniversalSpace, Small]
output = True

■ Idempotence : A ∩ A = A
input = Equality [Small ∩ Small, Small]
output = True

■ Commutativity : A ∩ B = B ∩ A
input = Equality [Small ∩ Big, Big ∩ Small]
output = True

■ Associativity : A ∩ (B ∩ C) = (A ∩ B) ∩ C
input = Equality [Small ∩ (Medium ∩ Big), (Small ∩ Medium) ∩ Big]
output = True

• Additional Properties Related to Intersection and Union

■ Distributivity : A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
input = Equality [Small ∩ (Medium ∪ Big), (Small ∩ Medium) ∪ (Small ∩ Big)]
output = True

■ Distributivity : A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
input = Equality [Small ∪ (Medium ∩ Big), (Small ∪ Medium) ∩ (Small ∪ Big)]
output = True
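These union/intersection identities can be checked mechanically in plain Python over the universal space {1, ..., 12}, using the Small, Medium and Big grades from this section; the helper names `union` and `inter` are hypothetical.

```python
# Grades of Small, Medium and Big over the universal space {1, ..., 12}.
Small  = {1: 1.0, 2: 1.0, 3: 0.9, 4: 0.6, 5: 0.4, 6: 0.3,
          7: 0.2, 8: 0.1, 9: 0.7, 10: 0.4, 11: 0.0, 12: 0.0}
Medium = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.2, 5: 0.5, 6: 0.8,
          7: 1.0, 8: 1.0, 9: 0.0, 10: 0.0, 11: 0.1, 12: 0.0}
Big    = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.1,
          7: 0.2, 8: 0.4, 9: 0.6, 10: 0.8, 11: 1.0, 12: 1.0}

union = lambda a, b: {x: max(a[x], b[x]) for x in a}
inter = lambda a, b: {x: min(a[x], b[x]) for x in a}

# Associativity of union: A ∪ (B ∪ C) = (A ∪ B) ∪ C
print(union(Small, union(Medium, Big)) == union(union(Small, Medium), Big))  # True
# Distributivity: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
print(inter(Small, union(Medium, Big)) ==
      union(inter(Small, Medium), inter(Small, Big)))                        # True
```

Because max and min form a lattice, these identities hold pointwise for any membership grades.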

■ Law of excluded middle : A ∪ A' = X
■ Law of contradiction : A ∩ A' = Φ
Note : Under the standard max and min operators these two laws hold only for crisp sets. For a genuinely fuzzy set such as SMALL, where SMALL(5) = 0.4 and NOTSMALL(5) = 0.6, we get (SMALL ∪ NOTSMALL)(5) = 0.6 ≠ 1 and (SMALL ∩ NOTSMALL)(5) = 0.4 ≠ 0, so Equality [Small ∪ NotSmall, UniversalSpace] and Equality [Small ∩ NotSmall, EmptySpace] both return False.

• Cartesian Product Of Two Fuzzy Sets

■ Cartesian Product of two Crisp Sets
Let A and B be two crisp sets in the universes of discourse X and Y. The Cartesian product of A and B is denoted by A x B and defined as
A x B = { (a, b) | a ∈ A, b ∈ B }
Note : Generally A x B ≠ B x A.

Example :
Let A = {a, b, c} and B = {1, 2}; then
A x B = { (a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2) }

Fig Graphic representation of A x B

■ Cartesian Product of two Fuzzy Sets
Let A and B be two fuzzy sets in the universes of discourse X and Y. The Cartesian product of A and B is denoted by A x B and defined by their membership functions µA (x) and µB (y) as
µ A x B (x, y) = min [µA (x), µB (y)] for all x ∈ X and y ∈ Y
or
µ A x B (x, y) = µA (x) µB (y) for all x ∈ X and y ∈ Y

Thus the Cartesian product A x B is a fuzzy set of ordered pairs (x, y) for all x ∈ X and y ∈ Y, with the grade of membership of (x, y) in X x Y given by the above equations. In a sense, the Cartesian product of two fuzzy sets is a fuzzy relation.

Fuzzy Relations
Fuzzy relations describe the degree of association of elements; example : "x is approximately equal to y".
− Fuzzy relations offer the capability to capture the uncertainty and vagueness in relations between sets and elements of a set.
− Fuzzy relations make the description of such imprecise concepts possible.
− Fuzzy relations were introduced to supersede classical crisp relations, which can describe only the total presence or absence of an association of elements.

In this section, the fuzzy relation is first defined, and fuzzy relations are then expressed in terms of matrices and graphical visualizations. Later, the properties of fuzzy relations and the operations that can be performed with fuzzy relations are illustrated.

3.1 Definition of Fuzzy Relation
A fuzzy relation is a generalization of the definition of a fuzzy set from 2-D space to 3-D space.

• Fuzzy relation definition
Consider the Cartesian product A x B = { (x, y) | x ∈ A, y ∈ B } where A and B are subsets of the universal sets U1 and U2. A fuzzy relation on A x B, denoted by R or R(x, y), is defined as the set
R = { ((x, y), µR (x, y)) | (x, y) ∈ A x B, µR (x, y) ∈ [0, 1] }
where µR (x, y) is a function in two variables called the membership function.
− It gives the degree of membership of the ordered pair (x, y) in R, associating with each pair (x, y) in A x B a real number in the interval [0, 1].
− The degree of membership indicates the degree to which x is in relation to y.

• Example of Fuzzy Relation
R = { ((x1, y1), 0), ((x1, y2), 0.1), ((x1, y3), 0.2),
((x2, y1), 0.7), ((x2, y2), 0.2), ((x2, y3), 0.3),
((x3, y1), 1), ((x3, y2), 0.6), ((x3, y3), 0.2) }

The relation can be written in matrix form as

R      y1    y2    y3
x1     0     0.1   0.2
x2     0.7   0.2   0.3
x3     1     0.6   0.2

where the values in the matrix are the values of the membership function :

µR (x1, y1) = 0     µR (x1, y2) = 0.1   µR (x1, y3) = 0.2
µR (x2, y1) = 0.7   µR (x2, y2) = 0.2   µR (x2, y3) = 0.3
µR (x3, y1) = 1     µR (x3, y2) = 0.6   µR (x3, y3) = 0.2

Assuming x1 = 1, x2 = 2, x3 = 3 and y1 = 1, y2 = 2, y3 = 3, the relation can be represented graphically by points in 3-D space (X, Y, µ).

Fig Fuzzy Relation R describing x greater than y

Note : Since the values of the membership function in the direction of x below the major diagonal (0.7, 1, 0.6) are greater than those in the direction of y above the major diagonal (0.1, 0.2, 0.3), we say that the relation R describes "x is greater than y".

Forming Fuzzy Relations
Assume that V and W are two collections of objects. A fuzzy relation is characterized in the same way as a fuzzy set.
− The first item is a list containing element and membership grade pairs,
{{{v1, w1}, R11}, {{v1, w2}, R12}, . . . , {{vn, wm}, Rnm}}
where {v1, w1}, {v1, w2}, . . . , {vn, wm} are the elements of the relation, defined as ordered pairs, and R11, R12, . . . , Rnm are the membership grades of the elements of the relation, which range from 0 to 1, inclusive.
− The second item is the universal space; for relations, the universal space consists of a pair of triples,
{{Vmin, Vmax, C1}, {Wmin, Wmax, C2}}
where the first triple defines the universal space for the first set and the second triple defines the universal space for the second set.

Example showing how fuzzy relations are represented :
Let V = {1, 2, 3} and W = {1, 2, 3, 4}. A fuzzy relation R is a function defined in the space V x W which takes values from the interval [0, 1], expressed as R : V x W → [0, 1].

R = FuzzyRelation [{{{1, 1}, 1}, {{1, 2}, 0.2}, {{1, 3}, 0.7}, {{1, 4}, 0}, {{2, 1}, 0.7}, {{2, 2}, 1}, {{2, 3}, 0.4}, {{2, 4}, 0.8}, {{3, 1}, 0}, {{3, 2}, 0.6}, {{3, 3}, 0.3}, {{3, 4}, 0.5}}, UniversalSpace → {{1, 3, 1}, {1, 4, 1}}]

This relation can be represented in the two forms shown below.

Membership matrix form :

R     w1    w2    w3    w4
v1    1     0.2   0.7   0
v2    0.7   1     0.4   0.8
v3    0     0.6   0.3   0.5

Graph form : the membership grades are drawn as vertical lines over the (v, w) plane.

Elements of the fuzzy relation are ordered pairs {vi, wj}, where vi is the first and wj the second element. The membership grades of the elements are represented by the heights of the vertical lines.
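The membership-matrix form can be sketched as a plain nested list, with rows indexed by V = {1, 2, 3} and columns by W = {1, 2, 3, 4}; the values are those of the example relation R.

```python
# Membership matrix of the relation R on V x W from the example.
R = [[1.0, 0.2, 0.7, 0.0],
     [0.7, 1.0, 0.4, 0.8],
     [0.0, 0.6, 0.3, 0.5]]

# Grade of the pair (v2, w4): row index 1, column index 3 (0-based).
print(R[1][3])   # 0.8
```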

Projections of Fuzzy Relations
Definition : A fuzzy relation on A x B is denoted by R or R(x, y) and is defined as the set
R = { ((x, y), µR (x, y)) | (x, y) ∈ A x B, µR (x, y) ∈ [0, 1] }
where µR (x, y) is a function in two variables called the membership function. The first, the second and the total projections of fuzzy relations are stated below.

• First Projection of R : defined as
R(1) = {(x), µR(1) (x, y)} = {(x), max_Y µR (x, y) | (x, y) ∈ A x B}

• Second Projection of R : defined as
R(2) = {(y), µR(2) (x, y)} = {(y), max_X µR (x, y) | (x, y) ∈ A x B}

• Total Projection of R : defined as
R(T) = max_X max_Y {µR (x, y) | (x, y) ∈ A x B}

Note : In all three expressions, max_Y means max with respect to y while x is considered fixed, and max_X means max with respect to x while y is considered fixed. The Total Projection is also known as the Global projection.

• Example : Fuzzy Projections
The fuzzy relation R together with its first, second and total projections is shown below.

R       y1    y2    y3    y4    y5    R(1)
x1      0.1   0.3   1     0.5   0.3   1
x2      0.2   0.5   0.7   0.9   0.6   0.9
x3      0.3   0.6   1     0.8   0.2   1
R(2)    0.3   0.6   1     0.9   0.6   R(T) = 1

Note : For R(1), select the max with respect to y while x is considered fixed; for R(2), select the max with respect to x while y is considered fixed; for R(T), select the max over R(1) and R(2).

Fig Fuzzy plot of 1st projection R(1) and of 2nd projection R(2)
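The three projections can be sketched on the example relation R (rows x1..x3, columns y1..y5) in plain Python; the variable names are hypothetical.

```python
# Projections of the example relation R.
R = [[0.1, 0.3, 1.0, 0.5, 0.3],
     [0.2, 0.5, 0.7, 0.9, 0.6],
     [0.3, 0.6, 1.0, 0.8, 0.2]]

first = [max(row) for row in R]                    # R(1): max over y, x fixed
second = [max(R[i][j] for i in range(len(R)))      # R(2): max over x, y fixed
          for j in range(len(R[0]))]
total = max(first)                                 # R(T): global maximum

print(first)    # [1.0, 0.9, 1.0]
print(second)   # [0.3, 0.6, 1.0, 0.9, 0.6]
print(total)    # 1.0
```

The outputs match the R(1) column, the R(2) row and R(T) of the table above.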

Max-Min and Min-Max Composition
The composition operation combines fuzzy relations in different variables, say (x, y) and (y, z), with x ∈ A, y ∈ B, z ∈ C. Consider the relations :
R1(x, y) = { ((x, y), µR1 (x, y)) | (x, y) ∈ A x B }
R2(y, z) = { ((y, z), µR2 (y, z)) | (y, z) ∈ B x C }
The domain of R1 is A x B and the domain of R2 is B x C.

• Max-Min Composition
Definition : The max-min composition, denoted by R1 ο R2, with membership function µ R1 ο R2, is defined as
R1 ο R2 = { ((x, z), max_y (min (µR1 (x, y), µR2 (y, z)))) }, (x, z) ∈ A x C, y ∈ B
Thus R1 ο R2 is a relation in the domain A x C. An example of the composition is shown below.

• Example : Max-Min Composition

Consider the relations R1(x , y) and R2(y , z) given by the tables :

        y1    y2    y3                   z1    z2    z3
R1                               R2
x1      0.1   0.3   0            y1      0.8   0.2   0
x2      0.8   1     0.3          y2      0.2   1     0.6
                                 y3      0.5   0     0.4

Note : The number of columns in the first table equals the number of rows in the second table.

Compute the max-min composition denoted by R1 ο R2 :

Step 1 : Compute the min operation (definition in previous slide). Consider row x1 and column z1 , i.e. the pair (x1 , z1) , for all yj , j = 1, 2, 3 , and perform the min operation :

min (µR1 (x1 , y1) , µR2 (y1 , z1)) = min (0.1, 0.8) = 0.1,
min (µR1 (x1 , y2) , µR2 (y2 , z1)) = min (0.3, 0.2) = 0.2,
min (µR1 (x1 , y3) , µR2 (y3 , z1)) = min (0, 0.5) = 0.

Step 2 : Compute the max operation (definition in previous slide). For x = x1 , z = z1 , y = yj , j = 1, 2, 3, calculate the grade membership of the pair (x1 , z1) as

{ (x1 , z1) , max (min (0.1, 0.8), min (0.3, 0.2), min (0, 0.5)) }
i.e. { (x1 , z1) , max (0.1, 0.2, 0) }
i.e. { (x1 , z1) , 0.2 }

Hence the grade membership of the pair (x1 , z1) is 0.2.

Similarly, find the grade memberships of the pairs (x1 , z2) , (x1 , z3) , (x2 , z1) , (x2 , z2) , (x2 , z3).

The final result is

                 z1    z2    z3
R1 ο R2 =   x1   0.2   0.3   0.3
            x2   0.8   1     0.6

Note : If the tables R1 and R2 are considered as matrices, the operation composition resembles matrix multiplication, linking rows with columns; each cell is occupied by the max-min value (the product is replaced by min, the sum is replaced by max).
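The max-min composition can be checked mechanically. The sketch below (plain Python; the function name and list layout are my own) composes the two example relations:

```python
# Max-min composition of two fuzzy relations given as nested lists.
R1 = [[0.1, 0.3, 0.0],   # rows: x1, x2 ; columns: y1..y3
      [0.8, 1.0, 0.3]]
R2 = [[0.8, 0.2, 0.0],   # rows: y1..y3 ; columns: z1..z3
      [0.2, 1.0, 0.6],
      [0.5, 0.0, 0.4]]

def max_min(R1, R2):
    # cell (x, z) = max over y of min(R1[x][y], R2[y][z])
    return [[max(min(R1[x][y], R2[y][z]) for y in range(len(R2)))
             for z in range(len(R2[0]))]
            for x in range(len(R1))]

print(max_min(R1, R2))   # → [[0.2, 0.3, 0.3], [0.8, 1.0, 0.6]]
```

Like matrix multiplication with min for the product and max for the sum, each output cell sweeps once over the shared variable y.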



• Example : Min-Max Composition

The min-max composition is similar to the max-min composition, with the difference that the roles of max and min are interchanged.

Definition : The min-max composition, denoted by R1 □ R2, with membership function µ R1 □ R2, is defined by

R1 □ R2 = { ((x , z) , min_y (max (µR1 (x , y) , µR2 (y , z)))) } , (x , z) ∈ A x C , y ∈ B

Thus R1 □ R2 is a relation in the domain A x C.

Consider the relations R1(x , y) and R2(y , z) of the previous example of max-min composition, that is

        y1    y2    y3                   z1    z2    z3
R1                               R2
x1      0.1   0.3   0            y1      0.8   0.2   0
x2      0.8   1     0.3          y2      0.2   1     0.6
                                 y3      0.5   0     0.4

After computation in a similar way as done in the case of max-min composition, the final result is

                 z1    z2    z3
R1 □ R2 =   x1   0.3   0     0.1
            x2   0.5   0.3   0.4

Relation between Max-Min and Min-Max Compositions

The max-min and min-max compositions are related by complementation : the complement of the max-min composition equals the min-max composition of the complements,

¬(R1 ο R2) = (¬R1) □ (¬R2)
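Swapping the roles of max and min in the previous sketch gives the min-max composition of the same two example relations (again plain Python, names chosen for illustration):

```python
# Min-max composition: min over y of max(R1[x][y], R2[y][z]).
R1 = [[0.1, 0.3, 0.0],   # rows: x1, x2 ; columns: y1..y3
      [0.8, 1.0, 0.3]]
R2 = [[0.8, 0.2, 0.0],   # rows: y1..y3 ; columns: z1..z3
      [0.2, 1.0, 0.6],
      [0.5, 0.0, 0.4]]

def min_max(R1, R2):
    # cell (x, z) = min over y of max(R1[x][y], R2[y][z])
    return [[min(max(R1[x][y], R2[y][z]) for y in range(len(R2)))
             for z in range(len(R2[0]))]
            for x in range(len(R1))]

print(min_max(R1, R2))   # → [[0.3, 0.0, 0.1], [0.5, 0.3, 0.4]]
```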

Fuzzy Systems What are Fuzzy Systems ?



Fuzzy Systems include Fuzzy Logic and Fuzzy Set Theory.



Knowledge exists in two distinct forms : −

the Objective knowledge that exists in mathematical form is used in engineering problems; and



the Subjective knowledge that exists in linguistic form, usually impossible to quantify.

Fuzzy Logic can coordinate these two forms of knowledge in a logical way.



Fuzzy Systems can handle simultaneously the numerical data and linguistic knowledge.

• Fuzzy Systems provide opportunities for modeling conditions which are inherently imprecisely defined. Many real-world problems have been modeled, simulated, and replicated with the help of fuzzy systems.



The applications of Fuzzy Systems are many like : Information retrieval systems, Navigation system, and Robot vision.



Expert system design has become easier because such domains are inherently fuzzy and can now be handled better; examples : decision-support systems, financial planners, diagnostic systems, and meteorological systems.

Introduction

Any system that uses fuzzy mathematics may be viewed as a fuzzy system. The fuzzy set theory (membership function, operations, properties and relations) has been described in previous lectures; these are the prerequisites for understanding fuzzy systems. The application of fuzzy set theory is fuzzy logic, which is covered in this section. Here the emphasis is on the design of a fuzzy system and a fuzzy controller in a closed loop. The specific topics of interest are :
− Fuzzification of input information,
− Fuzzy inferencing using fuzzy sets,
− Defuzzification of results from the reasoning process, and
− Fuzzy controller in a closed loop.

Fuzzy inferencing is the core constituent of a fuzzy system. Fuzzy inferencing combines the facts obtained from the fuzzification with the fuzzy rule base and conducts the fuzzy reasoning process.

• Fuzzy System

A block schematic of a fuzzy system : input variables X1 , X2 , . . . Xn → Fuzzification → Fuzzy Inferencing → Defuzzification → output variables Y1 , Y2 , . . . Ym ; the Fuzzy Rule Base and the Membership Functions feed the Fuzzy Inferencing block.

Fig. Elements of a Fuzzy System

Fuzzy System elements

− Input Vector : X = [x1 , x2 , . . . xn ]T are crisp values, which are transformed into fuzzy sets in the fuzzification block.
− Output Vector : Y = [y1 , y2 , . . . ym ]T comes out of the defuzzification block, which transforms an output fuzzy set back to a crisp value.
− Fuzzification : a process of transforming crisp values into grades of membership for linguistic terms, such as "far", "near", "small", of fuzzy sets.
− Fuzzy Rule base : a collection of propositions containing linguistic variables; the rules are expressed in the form
  If (x is A) AND (y is B) . . . THEN (z is C)
  where x, y and z represent variables (e.g. distance, size) and A, B and C are linguistic values (e.g. 'far', 'near', 'small').
− Membership function : provides a measure of the degree of similarity of elements in the universe of discourse U to a fuzzy set.
− Fuzzy Inferencing : combines the facts obtained from the fuzzification with the rule base and conducts the fuzzy reasoning process.
− Defuzzification : translates the results back to real-world values.

Fuzzy Logic

A simple form of logic, called two-valued logic, is the study of "truth tables" and logic circuits. Here the possible values are true as 1, and false as 0. This simple two-valued logic is generalized and called fuzzy logic, which treats "truth" as a continuous quantity ranging from 0 to 1.

Definition : Fuzzy logic (FL) is derived from fuzzy set theory, dealing with reasoning that is approximate rather than precisely deduced from classical two-valued logic.
− FL is the application of fuzzy set theory.
− FL allows set membership values to range (inclusively) between 0 and 1.
− FL is capable of handling inherently imprecise concepts.
− FL allows, in linguistic form, set membership values applied to imprecise concepts like "slightly", "quite" and "very".

Classical Logic

Logic is used to represent simple facts. Logic defines the ways of putting symbols together to form sentences that represent facts. Sentences that are either true or false, but not both, are called propositions.

Examples :

Sentence                  Truth value   Is it a Proposition ?
"Grass is green"          "true"        Yes
"2 + 5 = 5"               "false"       Yes
"Close the door"          -             No
"Is it hot outside ?"     -             No
"x > 2"                   -             No (since x is not defined)
"x = x"                   -             No (we don't know what "x" and "=" mean; "3 = 3", or say "air is equal to air" or "water is equal to water", has no meaning)



Propositional Logic (PL)

A proposition is a statement, which in English is a declarative sentence; logic defines the ways of putting symbols together to form sentences that represent facts. Every proposition is either true or false. Propositional logic is also called Boolean algebra.

Examples : (a) The sky is blue. (b) Snow is cold. (c) 12 * 12 = 144

Propositional logic is fundamental to all logic.
‡ Propositions are "sentences", either true or false but not both.
‡ A sentence is the smallest unit in propositional logic.
‡ If a proposition is true, then its truth value is "true"; else it is "false".
‡ Example : Sentence "Grass is green"; Truth value "true"; Proposition "yes".



Statement, Variables and Symbols

Statement : A simple statement is one that does not contain any other statement as a part. A compound statement is one that has two or more simple statements as parts, called components.

Operator or connective : joins simple statements into compounds, and joins compounds into larger compounds.

Symbols for connectives :

assertion      P                             "p is true"
negation       ¬p    ~    !    NOT           "p is false"
conjunction    p ∧ q    ·    &&    &    AND  "both p and q are true"
disjunction    p v q    ||    |    OR        "either p is true, or q is true, or both"
implication    p → q    if . . then         "if p is true, then q is true" ; "p implies q"
equivalence    p ↔ q    if and only if      "p and q are either both true or both false"

■ Truth Value

The truth value of a statement is its truth or falsity : p is either true or false, ~p is either true or false, p v q is either true or false, and so on. "T" or "1" means "true", and "F" or "0" means "false".

A truth table is a convenient way of showing relationships between several propositions. The truth tables for negation, conjunction, disjunction, implication and equivalence are shown below.

p   q   ¬p  ¬q  p ∧ q  p v q  p → q  p ↔ q  q → p
T   T   F   F   T      T      T      T      T
T   F   F   T   F      T      F      F      T
F   T   T   F   F      T      T      F      F
F   F   T   T   F      F      T      T      T

■ Tautology

A tautology is a proposition formed by combining other propositions (p, q, r, . . .) which is true regardless of the truth or falsehood of p, q, r, . . . .

The important tautologies are :

(p → q) ↔ ¬ [p ∧ (¬q)]    and    (p → q) ↔ (¬p) v q

A proof of these tautologies, using truth tables, is given below.

Table 1 : Proof of Tautologies

p   q   p → q  ¬q  p ∧ (¬q)  ¬ [p ∧ (¬q)]  ¬p  (¬p) v q
T   T   T      F   F          T             F   T
T   F   F      T   T          F             F   F
F   T   T      F   F          T             T   T
F   F   T      T   F          T             T   T

Note :
1. The entries of the two columns p → q and ¬ [p ∧ (¬q)] are identical, which proves the first tautology. Similarly, the entries of the two columns p → q and (¬p) v q are identical, which proves the other tautology.
2. The importance of these tautologies is that they express the membership function for p → q in terms of the membership functions of either the propositions p and ¬q, or ¬p and q.

■ Equivalences
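The two tautologies can be verified exhaustively in a few lines; this sketch simply walks the four truth assignments of Table 1:

```python
from itertools import product

# Exhaustively verify the two tautologies:
#   (p -> q) <-> not(p and (not q))   and   (p -> q) <-> (not p) or q
implies = lambda p, q: (not p) or q

for p, q in product([True, False], repeat=2):
    assert implies(p, q) == (not (p and (not q)))
    assert implies(p, q) == ((not p) or q)

print("both tautologies hold for all four truth assignments")
```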

Between Logic, Set theory and Boolean algebra : some mathematical equivalences between logic and set theory, and the correspondence between logic and Boolean algebra (0, 1), are given below.

Logic        T    F    ∧    v    ¬ (negation)      ↔
Set theory   -    -    ∩    U    ― (complement)    =
Boolean      1    0    x    +    ′ (complement)    =

Propositions p, q, r correspond to Boolean variables a, b, c.

■ Membership Functions obtained from facts

Consider the facts (the two tautologies)

(p → q) ↔ ¬ [p ∧ (¬q)]    and    (p → q) ↔ (¬p) v q

Using these facts and the equivalence between logic and set theory, we can obtain membership functions for µ p→q (x , y).

From the 1st fact :
µ p→q (x , y) = 1 − µ p ∩ ¬q (x , y)
             = 1 − min [µ p (x) , 1 − µ q (y)]      Eq (1)

From the 2nd fact :
µ p→q (x , y) = µ ¬p U q (x , y)
             = max [1 − µ p (x) , µ q (y)]          Eq (2)

The Boolean truth table below validates these membership functions.

Table 2 : Validation of Eq (1) and Eq (2)

µ p(x)  µ q(y)  1 − µ p(x)  1 − µ q(y)  max [1 − µ p(x) , µ q(y)]  1 − min [µ p(x) , 1 − µ q(y)]
1       1       0           0           1                          1
1       0       0           1           0                          0
0       1       1           0           1                          1
0       0       1           1           1                          1

Note :
1. The entries in the last two columns of Table 2 agree with the entries in Table 1 for p → q (read T as 1 and F as 0).
2. The implication membership functions of Eq (1) and Eq (2) are not the only ones that give agreement with p → q. Others are :

µ p→q (x , y) = 1 − µ p (x) (1 − µ q (y))           Eq (3)
µ p→q (x , y) = min [1 , 1 − µ p (x) + µ q (y)]     Eq (4)
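A small sketch makes the point of Note 2 concrete: on crisp values (0 or 1) all four implication functions agree with the classical table for p → q, while on intermediate grades they generally differ. The lambda names are my own.

```python
# The four implication membership functions, Eqs. (1)-(4) above.
f1 = lambda p, q: 1 - min(p, 1 - q)      # Eq (1)
f2 = lambda p, q: max(1 - p, q)          # Eq (2)
f3 = lambda p, q: 1 - p * (1 - q)        # Eq (3)
f4 = lambda p, q: min(1, 1 - p + q)      # Eq (4)

# On crisp truth values they all reproduce the classical table for p -> q.
for p in (0, 1):
    for q in (0, 1):
        assert f1(p, q) == f2(p, q) == f3(p, q) == f4(p, q) == max(1 - p, q)

# On intermediate grades they generally differ:
print([round(f(0.8, 0.6), 2) for f in (f1, f2, f3, f4)])   # → [0.6, 0.6, 0.68, 0.8]
```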



Modus Ponens and Modus Tollens

In traditional propositional logic there are two important inference rules, Modus Ponens and Modus Tollens.

Modus Ponens
Premise 1 : " x is A "
Premise 2 : " if x is A then y is B " ;
Consequence : " y is B "

Modus Ponens is associated with the implication " A implies B " [A → B]. In terms of propositions p and q, Modus Ponens is expressed as

(p ∧ (p → q)) → q

Modus Tollens
Premise 1 : " y is not B "
Premise 2 : " if x is A then y is B " ;
Consequence : " x is not A "

In terms of propositions p and q, Modus Tollens is expressed as

(¬q ∧ (p → q)) → ¬p

Fuzzy Logic

Like the extension of crisp set theory to fuzzy set theory, the extension of crisp logic is made by replacing the bivalent membership functions of crisp logic with fuzzy membership functions. In crisp logic, the truth values acquired by propositions are 2-valued, namely true as 1 and false as 0. In fuzzy logic, the truth values are multi-valued (absolutely true, partially true, absolutely false, etc.), represented numerically as a real value between 0 and 1.

Note : The fuzzy variables in fuzzy sets, fuzzy propositions, fuzzy relations, etc. are usually represented using the symbol ~, as ~P ; but for ease of writing they are represented simply as P.

• Recaps

01 Membership function µ A (x) describes the membership of the elements x of the base set X in the fuzzy set A.

02 Fuzzy intersection operator ∩ (AND connective) applied to two fuzzy sets A and B with the membership functions µ A (x) and µ B (x), based on min/max operations, is
µ A ∩ B = min [µ A (x) , µ B (x)] , x ∈ X    (Eq. 01)

03 Fuzzy intersection operator ∩ (AND connective) applied to two fuzzy sets A and B with the membership functions µ A (x) and µ B (x), based on the algebraic product, is
µ A ∩ B = µ A (x) µ B (x) , x ∈ X    (Eq. 02)

04 Fuzzy union operator U (OR connective) applied to two fuzzy sets A and B with the membership functions µ A (x) and µ B (x), based on min/max operations, is
µ A U B = max [µ A (x) , µ B (x)] , x ∈ X    (Eq. 03)

05 Fuzzy union operator U (OR connective) applied to two fuzzy sets A and B with the membership functions µ A (x) and µ B (x), based on the algebraic sum, is
µ A U B = µ A (x) + µ B (x) − µ A (x) µ B (x) , x ∈ X    (Eq. 04)

06 Fuzzy complement operator ( ― ) (NOT operation) applied to the fuzzy set A with the membership function µ A (x) is
µ Ā = 1 − µ A (x) , x ∈ X    (Eq. 05)

07 Fuzzy relations combining two fuzzy sets by the connective "min operation" is an operation by Cartesian product, R : X x Y → [0 , 1] :
µ R (x , y) = min [µ A (x) , µ B (y)]    (Eq. 06)    or
µ R (x , y) = µ A (x) µ B (y)    (Eq. 07)

Example : Relation R between fruit colour x and maturity grade y, characterized by the base sets
linguistic colour set X = {green, yellow, red} ,
maturity grade set Y = {verdant, half-mature, mature} :

R         verdant   half-mature   mature
green     1         0.5           0.0
yellow    0.3       1             0.4
red       0         0.2           1

08 Max-Min Composition combines the fuzzy relations in different variables, say (x , y) and (y , z) ; x ∈ A , y ∈ B , z ∈ C. Consider the relations :

R1(x , y) = { ((x , y) , µR1 (x , y)) | (x , y) ∈ A x B }
R2(y , z) = { ((y , z) , µR2 (y , z)) | (y , z) ∈ B x C }

The domain of R1 is A x B and the domain of R2 is B x C. The max-min composition, denoted by R1 ο R2, with membership function µ R1 ο R2, is

R1 ο R2 = { ((x , z) , max_y (min (µR1 (x , y) , µR2 (y , z)))) } , (x , z) ∈ A x C , y ∈ B    (Eq. 08)

Thus R1 ο R2 is a relation in the domain A x C.

• Fuzzy Propositions

A fuzzy proposition is a statement P which acquires a fuzzy truth value T(P).

Example :
P : Ram is honest
T(P) = 0.8 means P is partially true.
T(P) = 1 means P is absolutely true.



• Fuzzy Connectives

Fuzzy logic is similar to crisp logic supported by connectives. The table below illustrates the definitions of the fuzzy connectives.

Table : Fuzzy Connectives

Connective     Symbol   Usage    Definition
Negation       ¬        ¬P       1 − T(P)
Disjunction    v        P v Q    max [T(P) , T(Q)]
Conjunction    ∧        P ∧ Q    min [T(P) , T(Q)]
Implication    ⇒        P ⇒ Q    ¬P v Q = max (1 − T(P) , T(Q))

Here P , Q are fuzzy propositions and T(P) , T(Q) are their truth values.
− In P ⇒ Q, the P and Q related by the ⇒ operator are known as the antecedent and the consequent, respectively.
− As in crisp logic, here in fuzzy logic also the operator ⇒ represents the IF-THEN statement. The statement
  IF x is A THEN y is B is equivalent to the relation R = (A x B) U (¬A x Y)
  and the membership function of R is given by
  µR (x , y) = max [min (µA (x) , µB (y)) , 1 − µA (x)]
− For the compound implication statement
  IF x is A THEN y is B, ELSE y is C, the equivalent relation is R = (A x B) U (¬A x C)
  and the membership function of R is given by
  µR (x , y) = max [min (µA (x) , µB (y)) , min (1 − µA (x) , µC (y))]

Example 1 : (Ref : previous slide)

P : Mary is efficient , T(P) = 0.8
Q : Ram is efficient , T(Q) = 0.65

¬P : Mary is not efficient ,
T(¬P) = 1 − T(P) = 1 − 0.8 = 0.2

P ∧ Q : Mary is efficient and so is Ram , i.e.
T(P ∧ Q) = min (T(P), T(Q)) = min (0.8, 0.65) = 0.65

P v Q : Either Mary or Ram is efficient , i.e.
T(P v Q) = max (T(P), T(Q)) = max (0.8, 0.65) = 0.8

P ⇒ Q : If Mary is efficient then so is Ram , i.e.
T(P ⇒ Q) = max (1 − T(P), T(Q)) = max (0.2, 0.65) = 0.65
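The connective definitions from the table reduce Example 1 to four one-liners; the variable names below are my own:

```python
# Truth values for the fuzzy connectives of Example 1.
T_P, T_Q = 0.8, 0.65            # T(Mary is efficient), T(Ram is efficient)

T_not_P   = 1 - T_P             # negation:    1 - T(P)
T_P_and_Q = min(T_P, T_Q)       # conjunction: min
T_P_or_Q  = max(T_P, T_Q)       # disjunction: max
T_P_imp_Q = max(1 - T_P, T_Q)   # implication: max(1 - T(P), T(Q))

print(round(T_not_P, 2), T_P_and_Q, T_P_or_Q, T_P_imp_Q)   # → 0.2 0.65 0.8 0.65
```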

Example 2 : (Ref : previous slide on fuzzy connectives)

Let X = {a, b, c, d} , Y = {1, 2, 3, 4} and

A = {(a, 0) (b, 0.8) (c, 0.6) (d, 1)}
B = {(1, 0.2) (2, 1) (3, 0.8) (4, 0)}
C = {(1, 0) (2, 0.4) (3, 1) (4, 0.8)}

The universe of discourse could be viewed as { (1, 1) (2, 1) (3, 1) (4, 1) }, i.e. a fuzzy set all of whose elements y have µ(y) = 1.

Determine the implication relations
(i) IF x is A THEN y is B
(ii) IF x is A THEN y is B ELSE y is C

Solution

To determine implication relation (i), compute : the operator ⇒ represents the IF-THEN statement, so IF x is A THEN y is B is equivalent to R = (A x B) U (¬A x Y), and the membership function of R is given by

µR (x , y) = max [min (µA (x) , µB (y)) , 1 − µA (x)]

The fuzzy intersection A x B is defined, for all x in the set X and y in the set Y, as
(A x B)(x , y) = min [µA (x) , µB (y)] ;
the fuzzy intersection ¬A x Y is defined as
(¬A x Y)(x , y) = min [µ¬A (x) , µY (y)].

             1     2     3     4                     1     2     3     4
A x B =  a   0     0     0     0      ¬A x Y =  a    1     1     1     1
         b   0.2   0.8   0.8   0                b    0.2   0.2   0.2   0.2
         c   0.2   0.6   0.6   0                c    0.4   0.4   0.4   0.4
         d   0.2   1     0.8   0                d    0     0     0     0

The fuzzy union R = (A x B) U (¬A x Y) is obtained as max [(A x B)(x , y) , (¬A x Y)(x , y)] for all x ∈ X , y ∈ Y. Therefore

             1     2     3     4
R =      a   1     1     1     1
         b   0.2   0.8   0.8   0.2
         c   0.4   0.6   0.6   0.4
         d   0.2   1     0.8   0

This represents IF x is A THEN y is B.

To determine implication relation (ii), given (from the previous slide)
X = {a, b, c, d} ,
A = {(a, 0) (b, 0.8) (c, 0.6) (d, 1)} ,
B = {(1, 0.2) (2, 1) (3, 0.8) (4, 0)} ,
C = {(1, 0) (2, 0.4) (3, 1) (4, 0.8)} ,
here the operator ⇒ represents the IF-THEN-ELSE statement : IF x is A THEN y is B ELSE y is C is equivalent to R = (A x B) U (¬A x C), and the membership function of R is given by

µR (x , y) = max [min (µA (x) , µB (y)) , min (1 − µA (x) , µC (y))]

The fuzzy intersection A x B is as before; the fuzzy intersection ¬A x C is defined as (¬A x C)(x , y) = min [µ¬A (x) , µC (y)] for all x in X and y in Y.

             1     2     3     4                     1     2     3     4
A x B =  a   0     0     0     0      ¬A x C =  a    0     0.4   1     0.8
         b   0.2   0.8   0.8   0                b    0     0.2   0.2   0.2
         c   0.2   0.6   0.6   0                c    0     0.4   0.4   0.4
         d   0.2   1     0.8   0                d    0     0     0     0

The fuzzy union R = (A x B) U (¬A x C) is obtained as max [(A x B)(x , y) , (¬A x C)(x , y)] for all x ∈ X , y ∈ Y. Therefore

             1     2     3     4
R =      a   0     0.4   1     0.8
         b   0.2   0.8   0.8   0.2
         c   0.2   0.6   0.6   0.4
         d   0.2   1     0.8   0

This represents IF x is A THEN y is B ELSE y is C.
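Both implication relations of Example 2 can be computed with one helper. The dictionary representation and function name below are my own choices:

```python
# Implication relations of Example 2:
#   R = (A x B) U (notA x Y)   for IF-THEN
#   R = (A x B) U (notA x C)   for IF-THEN-ELSE
A = {'a': 0.0, 'b': 0.8, 'c': 0.6, 'd': 1.0}
B = {1: 0.2, 2: 1.0, 3: 0.8, 4: 0.0}
C = {1: 0.0, 2: 0.4, 3: 1.0, 4: 0.8}
Y = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}   # whole universe of discourse

def implication(A, B, Else):
    # mu_R(x, y) = max( min(A(x), B(y)), min(1 - A(x), Else(y)) )
    return {(x, y): max(min(A[x], B[y]), min(1 - A[x], Else[y]))
            for x in A for y in B}

R1 = implication(A, B, Y)   # IF x is A THEN y is B
R2 = implication(A, B, C)   # IF x is A THEN y is B ELSE y is C
print(R1[('b', 2)], R2[('a', 3)])   # → 0.8 1.0
```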



• Fuzzy Quantifiers

In crisp logic, the predicates are quantified by quantifiers. Similarly, in fuzzy logic the propositions are quantified by quantifiers. There are two classes of fuzzy quantifiers :
− Absolute quantifiers, and
− Relative quantifiers

Examples :

Absolute quantifiers     Relative quantifiers
round about 250          almost
much greater than 6      about
somewhere around 20      most

Fuzzification

Fuzzification is a process of transforming crisp values into grades of membership for the linguistic terms of fuzzy sets. The purpose is to allow a fuzzy condition in a rule to be interpreted.



Fuzzification of the car speed

Example 1 : Speed X0 = 70 km/h

Fig. Characterizing two grades, low (µA) and medium (µB) speed fuzzy sets, over a speed axis of 0 to 140 km/h.

Given the car speed value X0 = 70 km/h : grade µA(x0) = 0.75 belongs to fuzzy low, and grade µB(x0) = 0.25 belongs to fuzzy medium.

Example 2 : Speed X0 = 40 km/h

Fig. Characterizing five grades : very low, low, medium, high and very high speed fuzzy sets, over a speed axis of 0 to 100 km/h.

Given the car speed value X0 = 40 km/h : grade µA(x0) = 0.6 belongs to fuzzy low, and grade µB(x0) = 0.4 belongs to fuzzy medium.
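A common way to implement fuzzification is with triangular membership functions. The shapes below are my own assumptions (the text gives only the figures, not the exact breakpoints), so the grades they produce are illustrative rather than a reproduction of the examples above:

```python
# A sketch of fuzzification: map a crisp speed to membership grades.
# The triangular shapes are assumed (low peaking at 40 km/h, medium at
# 80 km/h); they do not reproduce the exact grades of the figures.
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(speed):
    return {'low':    tri(speed, 0, 40, 80),
            'medium': tri(speed, 40, 80, 120)}

print(fuzzify(70))   # → {'low': 0.25, 'medium': 0.75}
```

Each crisp input thus activates every linguistic term to some degree; the rule base then works with these grades rather than with the raw speed.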

Fuzzy Inference

Fuzzy inferencing is the core element of a fuzzy system. Fuzzy inferencing combines the facts obtained from the fuzzification with the rule base, and then conducts the fuzzy reasoning process. Fuzzy inference is also known as approximate reasoning. Fuzzy inference comprises computational procedures used for evaluating linguistic descriptions. Two important inferring procedures are :
− Generalized Modus Ponens (GMP)
− Generalized Modus Tollens (GMT)

• Generalized Modus Ponens (GMP)

This is formally stated as

If x is A THEN y is B
x is A′
y is B′

where A , B , A′ , B′ are fuzzy terms. Every fuzzy linguistic statement above the line is analytically known, and what is below the line is analytically unknown.

To compute the membership function of B′ , the max-min composition of the fuzzy set A′ with R(x , y) , the known implication relation (IF-THEN), is used :

B′ = A′ ο R(x , y)

In terms of membership functions,

µ B′ (y) = max_x (min (µ A′ (x) , µR (x , y)))

where µ A′ (x) is the membership function of A′ , µR (x , y) is the membership function of the implication relation, and µ B′ (y) is the membership function of B′.

• Generalized Modus Tollens (GMT)

This is formally stated as

If x is A THEN y is B
y is B′
x is A′

where A , B , A′ , B′ are fuzzy terms. Every fuzzy linguistic statement above the line is analytically known, and what is below the line is analytically unknown.

To compute the membership function of A′ , the max-min composition of the fuzzy set B′ with R(x , y) , the known implication relation (IF-THEN), is used :

A′ = B′ ο R(x , y)

In terms of membership functions,

µ A′ (x) = max_y (min (µ B′ (y) , µR (x , y)))

where µ B′ (y) is the membership function of B′ , µR (x , y) is the membership function of the implication relation, and µ A′ (x) is the membership function of A′.

Example : Apply the fuzzy Modus Ponens rule to deduce "Rotation is quite slow", given :
(i) If the temperature is high then the rotation is slow.
(ii) The temperature is very high.

Let H (High) , VH (Very High) , S (Slow) and QS (Quite Slow) denote the associated fuzzy sets. Let the set of temperatures be X = {30, 40, 50, 60, 70, 80, 90, 100} and the set of rotations per minute be Y = {10, 20, 30, 40, 50, 60} , with

H = {(70, 1) (80, 1) (90, 0.3)}
VH = {(90, 0.9) (100, 1)}
QS = {(10, 1) (20, 0.8)}
S = {(30, 0.8) (40, 1) (50, 0.6)}

To derive R(x , y) representing the implication relation (i) above, compute

R(x , y) = max (H x S , ¬H x Y)

            10    20    30    40    50    60                10    20    30    40    50    60
        30   0     0     0     0     0     0            30   1     1     1     1     1     1
        40   0     0     0     0     0     0            40   1     1     1     1     1     1
H x S = 50   0     0     0     0     0     0   ¬H x Y = 50   1     1     1     1     1     1
        60   0     0     0     0     0     0            60   1     1     1     1     1     1
        70   0     0     0.8   1     0.6   0            70   0     0     0     0     0     0
        80   0     0     0.8   1     0.6   0            80   0     0     0     0     0     0
        90   0     0     0.3   0.3   0.3   0            90   0.7   0.7   0.7   0.7   0.7   0.7
       100   0     0     0     0     0     0           100   1     1     1     1     1     1

Therefore

             10    20    30    40    50    60
         30   1     1     1     1     1     1
         40   1     1     1     1     1     1
R(x,y) = 50   1     1     1     1     1     1
         60   1     1     1     1     1     1
         70   0     0     0.8   1     0.6   0
         80   0     0     0.8   1     0.6   0
         90   0.7   0.7   0.7   0.7   0.7   0.7
        100   1     1     1     1     1     1

To deduce "Rotation is quite slow", we make use of the composition rule

QS = VH ο R(x , y)
   = [0  0  0  0  0  0  0.9  1] ο R(x , y)
   = [1  1  1  1  1  1]

Here VH is written as a row vector over X = {30, . . . , 100} , and the composition takes, for each y, the max over x of min (µVH (x) , µR (x , y)).
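The whole deduction, building R from H, S and the universe Y and then composing VH with it, can be sketched in a few lines. The helper names and dictionary encoding are my own:

```python
# Deducing "rotation is quite slow" (GMP): QS = VH o R(x, y).
X = [30, 40, 50, 60, 70, 80, 90, 100]     # temperatures
Y = [10, 20, 30, 40, 50, 60]              # rotations per minute

H  = {70: 1.0, 80: 1.0, 90: 0.3}          # High
S  = {30: 0.8, 40: 1.0, 50: 0.6}          # Slow
VH = {90: 0.9, 100: 1.0}                  # Very High

g = lambda f, v: f.get(v, 0.0)            # grade, 0 outside the support

# R(x, y) = max( min(H(x), S(y)), 1 - H(x) )  since Y has grade 1 everywhere
R = {(x, y): max(min(g(H, x), g(S, y)), 1 - g(H, x)) for x in X for y in Y}

# QS(y) = max over x of min(VH(x), R(x, y))   (max-min composition)
QS = [max(min(g(VH, x), R[(x, y)]) for x in X) for y in Y]
print(QS)   # → [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

The all-ones result is driven by VH(100) = 1 paired with ¬H(100) = 1 in the implication relation.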

Fuzzy Rule Based System

Fuzzy linguistic descriptions are formal representations of systems made through fuzzy IF-THEN rules. They encode knowledge about a system in statements of the form :

IF (a set of conditions) is satisfied THEN (a set of consequents) can be inferred.

IF (x1 is A1 , x2 is A2 , . . . , xn is An) THEN (y1 is B1 , y2 is B2 , . . . , ym is Bm)

where the linguistic variables xi , yj take the values of the fuzzy sets Ai and Bj respectively.

Example :
IF there is "heavy" rain and "strong" winds
THEN there must be "severe" flood warnings.

Here heavy , strong , and severe are fuzzy sets qualifying the variables rain, wind, and flood warnings respectively.

A collection of rules referring to a particular system is known as a fuzzy rule base. If the conclusion C to be drawn from a rule base R is the conjunction of all the individual consequents Ci of each rule, then

C = C1 ∩ C2 ∩ . . . ∩ Cn

where µc (y) = min (µc1 (y) , µc2 (y) , . . . , µcn (y)) , y ∈ Y, and Y is the universe of discourse.

On the other hand, if the conclusion C to be drawn from a rule base R is the disjunction of the individual consequents of each rule, then

C = C1 U C2 U . . . U Cn

where µc (y) = max (µc1 (y) , µc2 (y) , . . . , µcn (y)) , y ∈ Y, and Y is the universe of discourse.

Defuzzification

In many situations, for a system whose output is fuzzy, it is easier to take a crisp decision if the output is represented as a single quantity. This conversion of a fuzzy set to a single crisp value is called defuzzification. Defuzzification is the reverse process of fuzzification. Typical defuzzification methods are :
− Centroid method,
− Center of sums,
− Mean of maxima.

Centroid method

It is also known as the "center of gravity" or "center of area" method. It obtains the centre x* of the area occupied by the fuzzy set. For a discrete membership function, it is given by

x* = ( Σ i=1..n  xi µ(xi) ) / ( Σ i=1..n  µ(xi) )

where n represents the number of elements in the sample, xi are the elements, and µ(xi) is the membership function.
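The centroid formula is a weighted average of the sample elements, with the membership grades as weights. A minimal sketch (the sample values and grades below are made up for illustration):

```python
# Centroid (centre of gravity) defuzzification over a discrete sample.
# The output values and grades are illustrative, not from the text.
xs     = [10, 20, 30, 40, 50]        # sample elements xi
grades = [0.0, 0.5, 1.0, 0.5, 0.0]   # membership grades mu(xi)

x_star = sum(x * m for x, m in zip(xs, grades)) / sum(grades)
print(x_star)   # → 30.0 (the membership is symmetric about 30)
```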

Probabilistic Reasoning

Probability theory is used to discuss events, categories, and hypotheses about which there is not 100% certainty. We might write A → B, which means that if A is true, then B is true. If we are unsure whether A is true, then we cannot make use of this expression. In many real-world situations, it is very useful to be able to talk about things that lack certainty. For example, what will the weather be like tomorrow? We might formulate a very simple hypothesis based on general observation, such as "it is sunny only 10% of the time, and rainy 70% of the time". We can use a notation similar to that used for predicate calculus to express such statements :

P(S) = 0.1
P(R) = 0.7

The first of these statements says that the probability of S ("it is sunny") is 0.1. The second says that the probability of R is 0.7. Probabilities are always expressed as real numbers between 0 and 1. A probability of 0 means "definitely not" and a probability of 1 means "definitely so"; hence, P(S) = 1 means that it is always sunny.

Many of the operators and notations that are used in propositional logic can also be used in probabilistic notation. For example, P(¬S) means "the probability that it is not sunny"; P(S ∧ R) means "the probability that it is both sunny and rainy." P(A ∨ B), which means "the probability that either A is true or B is true," is defined by the following rule :

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

The notation P(B|A) can be read as "the probability of B, given A." This is known as conditional probability; it is conditional on A. In other words, it states the probability that B is true, given that we already know that A is true. P(B|A) is defined by the following rule :

P(B|A) = P(A ∧ B) / P(A)

Of course, this rule cannot be used in cases where P(A) = 0. For example, let us suppose that the likelihood that it is both sunny and rainy at the same time is 0.01. Then we can calculate the probability that it is rainy, given that it is sunny, as follows :

P(R|S) = P(R ∧ S) / P(S) = 0.01 / 0.1 = 0.1

The basic approach statistical methods adopt to deal with uncertainty is via the axioms of probability :
− Probabilities are (real) numbers in the range 0 to 1.
− A probability of P(A) = 0 indicates total uncertainty in A, P(A) = 1 total certainty, and values in between some degree of (un)certainty.
− Probabilities can be calculated in a number of ways :

Probability = (number of desired outcomes) / (total number of outcomes)

So, given a pack of playing cards, the probability of being dealt an ace from a full normal deck is 4 (the number of aces) / 52 (the number of cards in the deck), which is 1/13. Similarly, the probability of being dealt a spade is 13 / 52 = 1/4.

If you have a choice of k items from a set of n items, then the formula n! / (k! (n − k)!) is applied to find the number of ways of making this choice (! = factorial). So the chance of winning the national lottery (choosing 6 from 49) is 13,983,816 to 1.
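Both calculations above are one-liners with the standard library; `math.comb` computes the n-choose-k count directly:

```python
from math import comb

# Conditional probability with the numbers from the text:
#   P(R | S) = P(R and S) / P(S)
P_S, P_RS = 0.1, 0.01
print(round(P_RS / P_S, 10))   # → 0.1

# Choosing k from n: n! / (k!(n-k)!); the 6-from-49 lottery:
print(comb(49, 6))             # → 13983816
```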

So the chance of winning the national lottery (choosing 6 from 49) is to 1. 1

Conditional probability, P(A|B), indicates the probability of of event A given that we know event B has occurred. A Bayesian Network is a directed acyclic graph: A graph where the directions are links which indicate dependencies that exist between nodes. Nodes represent propositions about events or events themselves. Conditional probabilities quantify the strength of dependencies. 1

Consider the following example :
− The probability that my car won't start.
− If my car won't start, then it is likely that
  o the battery is flat, or
  o the starting motor is broken.

In order to decide whether to fix the car myself or send it to the garage, I make the following decisions :
− If the headlights do not work, then the battery is likely to be flat, so I fix it myself.
− If the starting motor is defective, then send the car to the garage.
− If the battery and starting motor have both gone, send the car to the garage.

The network to represent this is as follows :

Fig. A simple Bayesian network

Bayesian probabilistic inference

Bayes' theorem can be used to calculate the probability that a certain event will occur or that a certain proposition is true. The theorem is stated as follows :

P(B|A) = P(A|B) P(B) / P(A)

P(B) is called the prior probability of B. P(B|A), as well as being called the conditional probability, is also known as the posterior probability of B.

P(A ∧ B) = P(A|B) P(B)

Note that due to the commutativity of ∧, we can also write

P(A ∧ B) = P(B|A) P(A)

Hence, we can deduce P(B|A) P(A) = P(A|B) P(B), which can then be rearranged to give Bayes' theorem as stated above.

For a set of hypotheses Hi and evidence E, Bayes' theorem states :

P(Hi|E) = P(E|Hi) P(Hi) / Σk P(E|Hk) P(Hk)

This reads : given some evidence E, the probability that hypothesis Hi is true is equal to the ratio of the probability that E will be true given Hi, times the a priori probability of Hi, to the sum, over the set of all hypotheses, of the probability of E given each hypothesis times the probability of that hypothesis. The set of all hypotheses must be mutually exclusive and exhaustive. Thus, to use this in medical diagnosis, we must know all the prior probabilities of finding each symptom and also the probability of having an illness based on certain symptoms being observed.
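The hypothesis form of Bayes' theorem can be sketched directly. The diagnosis labels, priors and likelihoods below are invented purely for illustration:

```python
# Bayes' theorem over a set of mutually exclusive, exhaustive hypotheses:
#   P(Hi | E) = P(E | Hi) P(Hi) / sum_k P(E | Hk) P(Hk)
# All numbers below are made up for illustration.
priors     = {'flu': 0.10, 'cold': 0.30, 'healthy': 0.60}   # P(H)
likelihood = {'flu': 0.90, 'cold': 0.40, 'healthy': 0.05}   # P(fever | H)

evidence  = sum(likelihood[h] * priors[h] for h in priors)  # P(fever)
posterior = {h: likelihood[h] * priors[h] / evidence for h in priors}

print({h: round(p, 3) for h, p in posterior.items()})
# → {'flu': 0.375, 'cold': 0.5, 'healthy': 0.125}
```

Because the hypotheses are exhaustive and mutually exclusive, the posteriors sum to 1.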

Bayesian statistics lie at the heart of most statistical reasoning systems. How is Bayes' theorem exploited? The key is to formulate the problem correctly: P(A|B) states the probability of A given only B's evidence. If there is other relevant evidence, then it must also be considered.

All events must be mutually exclusive. However, in real-world problems events are not generally unrelated. For example, in diagnosing measles, the symptoms of spots and fever are related. This means that computing the conditional probabilities gets complex; in general, given prior evidence p and some new observation N, the computation grows exponentially for large sets of p.

All events must be exhaustive. This means that in order to compute all probabilities, the set of possible events must be closed. Thus, if new information arises, the set must be created afresh and all probabilities recalculated.

Thus simple Bayes rule-based systems are not suitable for uncertain reasoning :
− Knowledge acquisition is very hard.
− Too many probabilities are needed; too large a storage space.
− Computation time is too large.
− Updating new information is difficult and time consuming.
− Exceptions like "none of the above" cannot be represented.
− Humans are not very good probability estimators.

However, Bayesian statistics still provide the core of reasoning in many uncertain-reasoning systems, with suitable enhancements to overcome the above problems. We will look at three broad categories :
− Certainty factors
− Dempster-Shafer models
− Bayesian networks

Bayesian networks are also called belief networks or probabilistic inference networks.


Application of Bayes' Theorem : Clinical Example

[The worked clinical example was presented as figures.]

Definition and importance of knowledge Knowledge can be defined as the body of facts and principles accumulated by humankind or the act, fact, or state of knowing Knowledge is having familiarity with language, concepts, procedures, rules, ideas, abstractions, places, customs, facts, and associations, coupled with an ability to use theses notions effectively in modeling different aspects of the world The meaning of knowledge is closely related to the meaning of intelligence Intelligent requires the possession of and access to knowledge

A common way to represent knowledge external to a computer or a human is in the form of written language. Example: Ramu is tall – this expresses a simple fact, an attribute possessed by a person.

Ramu loves his mother – this expresses a complex binary relation between two persons. Knowledge may be declarative or procedural. Procedural knowledge is compiled knowledge related to the performance of some task, for example, the steps used to solve an algebraic equation.

Declarative knowledge is passive knowledge expressed as statements of facts about the world, for example, personnel data in a database; such data are explicit pieces of independent knowledge. Knowledge includes and requires the use of data and information. Knowledge combines relationships, correlations, dependencies, and the notion of gestalt with data and information. Belief is a meaningful and coherent expression; thus a belief may be true or false. A hypothesis is defined as a belief which is backed up with some supporting evidence, but it may still be false. Knowledge is true justified belief. Epistemology is the study of the nature of knowledge. Metaknowledge is knowledge about knowledge, that is, knowledge about what we know.

Knowledge Based Systems
Systems that depend on a rich base of knowledge to perform difficult tasks. This includes work in vision, learning, general problem solving and natural language understanding. The systems get their power from the expert knowledge that has been coded into facts, rules, heuristics and procedures. In Fig 2.1, the knowledge is stored in a knowledge base separate from the control and inferencing components. It is possible to add new knowledge or refine existing knowledge without recompiling the control and inferencing programs. Components of a knowledge based system:

Input-Output Unit
Inference-Control Unit
Knowledge Base

Representation of knowledge
The object of a knowledge representation is to express knowledge in a computer tractable form, so that it can be used to enable our AI agents to perform well. A knowledge representation language is defined by two aspects:
Syntax – the syntax of a language defines which configurations of the components of the language constitute valid sentences.
Semantics – the semantics defines which facts in the world the sentences refer to, and hence the statement about the world that each sentence makes.
Suppose the language is arithmetic. Then 'x', '=' and 'y' are components (or symbols or words) of the language; the syntax says that 'x = y' is a valid sentence in the language, but '= = x y' is not; the semantics say that 'x = y' is false if y is bigger than x, and true otherwise.
The requirements of a knowledge representation are:
Representational Adequacy – the ability to represent all the different kinds of knowledge that might be needed in that domain.
Inferential Adequacy – the ability to manipulate the representational structures to derive new structures (corresponding to new knowledge) from existing structures.
Inferential Efficiency – the ability to incorporate additional information into the knowledge structure which can be used to focus the attention of the inference mechanisms in the most promising directions.
Acquisitional Efficiency – the ability to acquire new information easily. Ideally the agent should be able to control its own knowledge acquisition, but direct insertion of information by a 'knowledge engineer' would be acceptable.
Finding a system that optimizes these for all possible domains is not going to be feasible. In practice, the theoretical requirements for good knowledge representations can usually be achieved by dealing appropriately with a number of practical requirements:

The representations need to be complete – so that everything that could possibly need to be represented can easily be represented.


They must be computable – implementable with standard computing procedures.
They should make the important objects and relations explicit and accessible – so that it is easy to see what is going on, and how the various components interact.
They should suppress irrelevant detail – so that rarely used details don't introduce unnecessary complications, but are still available when needed.
They should expose any natural constraints – so that it is easy to express how one object or relation influences another.
They should be transparent – so you can easily understand what is being said.

The implementation needs to be concise and fast – so that information can be stored, retrieved and manipulated rapidly.
The four fundamental components of a good representation:
The lexical part – that determines which symbols or words are used in the representation's vocabulary.
The structural or syntactic part – that describes the constraints on how the symbols can be arranged, i.e. a grammar.
The semantic part – that establishes a way of associating real world meanings with the representations.
The procedural part – that specifies the access procedures that enable ways of creating and modifying representations and answering questions using them, i.e. how we generate and compute things with the representation.

Knowledge Representation in Natural Language
Advantages of natural language
o It is extremely expressive – we can express virtually everything in natural language (real world situations, pictures, symbols, ideas, emotions, reasoning).
o Most humans use it most of the time as their knowledge representation of choice.
Disadvantages of natural language
o Both the syntax and semantics are very complex and not fully understood.
o There is little uniformity in the structure of sentences.
o It is often ambiguous – in fact, it is usually ambiguous.

Knowledge Organization
The organization of knowledge in memory is key to efficient processing. Knowledge based systems perform their intended tasks well only when the facts and rules are easy to locate and retrieve; otherwise much time is wasted in searching and testing large numbers of items in memory. Knowledge can be organized in memory for easy access by a method known as indexing. As a result, the search for some specific chunk of knowledge is limited to the indexed group only.
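The indexing idea can be sketched with a small fact base keyed on predicate names. The FactBase class and the sample facts (reusing the Ramu examples from earlier) are illustrative assumptions, not a prescribed design.

```python
from collections import defaultdict

class FactBase:
    """Facts grouped under an index key (here, the predicate name), so a
    query searches only the relevant group instead of all of memory."""
    def __init__(self):
        self.index = defaultdict(list)

    def add(self, predicate, *args):
        self.index[predicate].append(args)

    def lookup(self, predicate):
        return self.index[predicate]

kb = FactBase()
kb.add("tall", "Ramu")
kb.add("loves", "Ramu", "mother")
kb.add("loves", "Sita", "Rama")
print(kb.lookup("loves"))  # only the "loves" group is examined
```

With the index in place, a query for "loves" never touches the "tall" facts, which is the whole point of indexed organization.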

Knowledge Manipulation
Decisions and actions in knowledge based systems come from manipulation of the knowledge. The known facts in the knowledge base must be located, compared, and altered in some way. This process may set up other subgoals and require further inputs, and so on until a final solution is found. The manipulations are the computational equivalent of reasoning. This requires a form of inference or deduction, using the knowledge and inferring rules. All forms of reasoning require a certain amount of searching and matching. The searching and matching operations consume the greatest amount of computation time in AI systems. It is important to have techniques that limit the amount of search and matching required to complete any given task.

Matching techniques:
Matching is the process of comparing two or more structures to discover their likenesses or differences. The structures may represent a wide range of objects including physical entities, words or phrases in some language, complete classes of things, general concepts, relations between complex entities, and the like. The representations will be given in one or more of the formalisms like FOPL, networks, or some other scheme, and matching will involve comparing the component parts of such structures. Matching is used in a variety of programs for different reasons. It may serve to control the sequence of operations, to identify or classify objects, to determine the best of a number of different alternatives, or to retrieve items from a database. It is an essential operation in such diverse programs as speech recognition, natural language understanding, vision, learning, automated reasoning, planning, automatic programming, and expert systems, as well as many others. In its simplest form, matching is just the process of comparing two structures or patterns for equality. The match fails if the patterns differ in any aspect. For example, a match between the two character strings acdebfba and acdebeba fails on an exact match since the strings differ in the sixth character position. In more complex cases the matching process may permit transformations in the patterns in order to achieve an equality match. The transformation may be a simple change of some variables to constants, or it may amount to ignoring some components during the match operation. For example, a pattern matching variable such as ?x may be used to permit successful matching between the two patterns (a b (c d) e) and (a b ?x e) by binding ?x to (c d). Such matchings are usually restricted in some way, however, as is the case with the unification of two clauses where only consistent bindings are permitted.
Thus, two patterns such as (a b (c d) e f) and (a b ?x e ?x) would not match since ?x could not be bound to two different constants. In some extreme cases, a complete change of representational form may be required in either one or both structures before a match can be attempted. This will be the case, for

example, when one visual object is represented as a vector of pixel gray levels and objects to be matched are represented as descriptions in predicate logic or some other high level statements. A direct comparison is impossible unless one form has been transformed into the other. In subsequent chapters we will see examples of many problems where exact matches are inappropriate, and some form of partial matching is more meaningful. Typically in such cases, one is interested in finding a best match between pairs of structures. This will be the case in object classification problems, for example, when object descriptions are subject to corruption by noise or distortion. In such cases, a measure of the degree of match may also be required. Other types of partial matching may require finding a match between certain key elements while ignoring all other elements in the pattern. For example, a human language input unit should be flexible enough to recognize any of the following three statements as expressing a choice of preference for the low-calorie food item:
I prefer the low-calorie choice.
I want the low-calorie item.
The low-calorie one please.
Recognition of the intended request can be achieved by matching against key words in a template containing "low-calorie" and ignoring other words except, perhaps, negative modifiers. Finally, some problems may call for a form of fuzzy matching where an entity's degree of membership in one or more classes is appropriate. Some classification problems will apply here if the boundaries between the classes are not distinct, and an object may belong to more than one class. Fig 8.1 illustrates the general match process where an input description is being compared with other descriptions. As stressed earlier, the term object is used here in a general sense. It does not necessarily imply physical objects.
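The key-word template idea can be sketched as follows. The function name, the punctuation stripping, and the list of negative modifiers are illustrative assumptions of this sketch.

```python
def matches_template(sentence, keywords, negatives=("not", "no", "never")):
    """Partial match: succeed if every key word appears and no negative
    modifier does; all other words in the sentence are ignored."""
    words = sentence.lower().replace(".", "").replace(",", "").split()
    return (all(k in words for k in keywords)
            and not any(n in words for n in negatives))

requests = ["I prefer the low-calorie choice.",
            "I want the low-calorie item",
            "The low-calorie one please."]
print([matches_template(s, ["low-calorie"]) for s in requests])
print(matches_template("Not the low-calorie one", ["low-calorie"]))
```

All three phrasings match the single template, while the negated request is rejected, mirroring the behaviour described above.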
All objects will be represented in some formalism such as a vector of attribute values, propositional logic or FOPL statements, rules, frame-like structures, or some other scheme. Transformations, if required, may involve simple instantiations or unifications among clauses or more complex operations such

as transforming a two-dimensional scene to a description in some formal language. Once the descriptions have been transformed into the same schema, the matching process is performed element-by-element using a relational or other test (like equality or ranking). The test results may then be combined in some way to provide an overall measure of similarity. The choice of measure will depend on the match criteria and representation scheme employed.

The output of the matcher is a description of the match. It may be a simple yes or no response or a list of variable bindings, or as complicated as a detailed annotation of the similarities and differences between the matched objects. To summarize then, matching may be exact, used with or without pattern variables, partial, or fuzzy, and any matching algorithm will be based on such factors as the choice of representation scheme for the objects being matched, the criteria for matching (exact, partial, fuzzy, and so on), the choice of measure required to perform the match in accordance with the chosen criteria, and the type of match description required for output. In the remainder of this chapter we examine various types of matching problems and their related algorithms. We begin with a description of representation structures and measures commonly found in matching problems. We next look at various matching techniques based on exact, partial, and fuzzy approaches. We conclude the chapter with an example of an efficient match algorithm used in some rule-based expert systems.


Fig 8.1 Typical matching process

Structures used in Matching
The types of list structures represent clauses in propositional or predicate logic such as
(or ~(MARRIED ?x ?y) ~(DAUGHTER ?z ?y) (MOTHER ?y ?z))
or rules such as
(and ((cloudy-sky) (low-bar-pressure) (high-humidity)) (conclude (rain likely)))
or

fragments of associative networks as in the Fig below. The other common structures include strings of characters a1 a2 . . . ak, where the ai belong to a given alphabet A; vectors X = (x1 x2 . . . xn), where the xi represent attribute values; matrices M (rows of vectors); general graphs; trees; and sets.

Fig Fragment of associative network and corresponding LISP code

Variables
The structures are constructed from basic atomic elements, numbers and characters.

Character string elements may represent either constants or variables. If variables, they may be classified by either the type of match permitted or their value domains. An open variable can be replaced by a single item; a segment variable can be replaced by zero or more items. Open variables are written with a preceding question mark (?x, ?y, ?class).

They may match or assume the value of any single string element or word, but they are subject to consistency constraints. For example, to be consistent, the variable ?x can be bound only to the same top level element in any single structure. Thus (a ?x d ?x e) may match (a b d b e), but not (a b d a e). Segment variable types will be preceded with an asterisk (*x, *z, *words). This type of variable can match an arbitrary number or segment of contiguous atomic elements. For example, (*x d (e f) *y) will match the patterns (a (b c) d (e f) g h) and (d (e f) (g)). Segment variables may also be subject to consistency constraints similar to open variables.

Nominal variables
Qualitative variables whose values or states have no order or rank. It is only possible to distinguish equality or inequality between two objects.

Each state can be given a numerical code For example, “marital status” has states of married, single, divorced or widowed. These states could be assigned numerical codes, such as married = 1, single = 2, divorced = 3 and widowed = 4 Ordinal variables Qualitative variables whose states can be arranged in a rank order

It may be assigned numerical values. For example, the states very tall, tall, medium, short and very short can be arranged in order from tallest to shortest and can be assigned an arbitrary scale of 5 to 1.

Binary variables

Qualitative discrete variables which may assume only one of two values, such as 0 or 1, good or bad, yes or no, high or low.

Interval (metric) variables
Quantitative variables which take on numeric values and for which equal differences between values have the same significance. For example, real numbers corresponding to temperature or integers corresponding to an amount of money are considered as interval variables.

Graphs and Trees
A graph is a collection of points called vertices, some of which are connected by line segments called edges. Graphs are used to model a wide variety of real-life applications, including transportation and communication networks, project scheduling, and games. A graph G = (V, E) is an ordered pair of sets V and E. The elements of V are nodes or vertices and the elements of E are a subset of V x V called edges.

An edge joins two distinct vertices in V. Directed graphs, or digraphs, have directed edges or arcs with arrows. If an arc is directed from node ni to nj, node ni is said to be a parent or predecessor of nj, and nj is the child or successor of ni. Undirected graphs have simple edges without arrows connecting the nodes.

A path is a sequence of edges connecting two nodes where the endpoint of one edge is the start of its successor. A cycle is a path in which the two end points coincide. A connected graph is a graph for which every pair of vertices is joined by a path. A graph is complete if every element of V x V is an edge. A tree is a connected graph in which there are no cycles and each node has at most one parent. A node with no parent is called the root node; a node with no children is called a leaf node. The depth of the root node is defined as zero; the depth of any other node is defined to be the depth of its parent plus 1.

Sets and Bags
A set is represented as an unordered list of unique elements such as the set (a d f c) or (red blue green yellow). A bag is a set which may contain more than one copy of the same member, for example a bag containing the elements a, b, d and e with some of them repeated. Sets and bags are structures used in matching operations.

Measure for Matching
The problem of comparing structures without the use of pattern matching variables requires consideration of measures used to determine the likeness or similarity between two or more structures. The similarity between two structures is a measure of the degree of association or likeness between the objects' attributes and other characteristic parts. If the describing variables are quantitative, a distance metric is used to measure the proximity.

Distance Metrics
For all elements x, y, z of the set E, the function d is a metric if and only if
d(x, x) = 0
d(x, y) ≥ 0
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)
The Minkowski metric is a general distance measure satisfying these assumptions. It is given by
dp = [ Σi=1..n |xi - yi|^p ]^(1/p)
For the case p = 2, this metric is the familiar Euclidean distance. When p = 1, dp is the so-called absolute or city block distance.
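The Minkowski metric can be sketched directly from its definition; the sample vectors below are illustrative.

```python
def minkowski(x, y, p):
    """Minkowski metric d_p = (sum_i |x_i - y_i|^p)^(1/p).
    p = 2 gives the Euclidean distance, p = 1 the city block distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))  # city block: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```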

Probabilistic measures
The representation variables should be treated as random variables. Then one requires a measure of the distance between the variates, their distributions, or between a variable and a distribution. One such measure is the Mahalanobis distance, which gives a measure of the separation between two distributions. Given the random vectors X and Y, let C be their covariance matrix. Then the Mahalanobis distance is given by
D = X'C^(-1)Y
where the prime (') denotes transpose (row vector) and C^(-1) is the inverse of C.

The X and Y vectors may be adjusted for zero means by first subtracting the vector means ux and uy.

Another popular probability measure is the product moment correlation r, given by
r = Cov(X, Y) / [Var(X) * Var(Y)]^(1/2)

where Cov and Var denote covariance and variance respectively. The correlation r, which ranges between -1 and +1, is a measure of similarity frequently used in vision applications. Other probabilistic measures used in AI applications are based on the scatter of attribute values. These measures are related to the degree of clustering among the objects. Conditional probabilities are sometimes used; for example, they may be used to measure the likelihood that a given X is a member of class Ci, P(Ci|X), the conditional probability of Ci given an observed X.
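The product moment correlation can be sketched directly from its definition; the sample data below are hypothetical.

```python
def correlation(xs, ys):
    """Product moment correlation r = Cov(X, Y) / [Var(X) * Var(Y)]^(1/2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    var_x = sum((a - mx) ** 2 for a in xs) / n
    var_y = sum((b - my) ** 2 for b in ys) / n
    return cov / (var_x * var_y) ** 0.5

print(correlation([1, 2, 3], [2, 4, 6]))  # perfectly similar: 1.0
print(correlation([1, 2, 3], [6, 4, 2]))  # perfectly dissimilar: -1.0
```

The endpoints -1 and +1 correspond to perfect negative and positive linear association, the range noted in the text.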

These measures can establish the proximity of two or more objects.

Qualitative measures
Measures between binary variables are best described using contingency tables such as the table below.

The table entries give the number of objects having attribute X or Y with the corresponding value of 1 or 0. For example, if the objects are animals, X might be horned and Y might be long-tailed. In this case, the entry a is the number of animals having both horns and long tails.

Note that n = a + b + c + d, the total number of objects. Various measures of association for such binary variables have been defined, for example
a / (a + b + c + d) = a / n,   (a + d) / n,   a / (a + b + c),   a / (b + c)

Contingency tables are also useful for describing other qualitative variables, both ordinal and nominal, since the methods are similar to those used for binary variables.
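A sketch of the contingency-table measures above; the counts and the comment labels are illustrative (the third ratio is the familiar Jaccard-style coefficient).

```python
def association_measures(a, b, c, d):
    """Association measures from a 2x2 contingency table with counts
    a = (1,1), b = (1,0), c = (0,1), d = (0,0); n = a + b + c + d."""
    n = a + b + c + d
    return (a / n,            # matches on 1s over all objects
            (a + d) / n,      # matches on both 1s and 0s over all objects
            a / (a + b + c),  # 1-matches over all pairs except (0,0)
            a / (b + c))      # 1-matches over mismatches

# e.g. 20 horned and long-tailed, 10 horned only, 10 long-tailed only, 60 neither
print(association_measures(20, 10, 10, 60))  # (0.2, 0.8, 0.5, 1.0)
```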

Whatever the variable types used in a measure, they should all be properly scaled to prevent variables having large values from negating the effects of smaller valued variables. This could happen when one variable is scaled in millimeters and another variable in meters.

Similarity measures
Measures of dissimilarity, like distance, should decrease as objects become more alike. Similarities are not in general symmetric. Any similarity measure between a subject description A and its referent B, denoted by s(A,B), is not necessarily equal to s(B,A). In general, s(A,B) ≠ s(B,A), or "A is like B" may not be the same as "B is like A".

Tests on subjects have shown that in similarity comparisons, the focus of attention is on the subject and, therefore, subject features are given higher weights than the referent's.

For example, in tests comparing countries, statements like “North Korea is similar to Red China” and “Red China is similar to North Korea” were not rated as symmetrical or equal Similarities may depend strongly on the context in which the comparisons are made An interesting family of similarity measures which takes into account such factors as asymmetry and has some intuitive appeal has recently been proposed

Let O ={o1, o2, . . . } be the universe of objects of interest Let Ai be the set of attributes used to represent oi A similarity measure s which is a function of three disjoint sets of attributes common to any two objects Ai and Aj is given as

s(Ai, Aj) = F(Ai & Aj, Ai - Aj, Aj - Ai)
where Ai & Aj is the set of features common to both oi and oj, Ai - Aj is the set of features belonging to oi and not oj, and Aj - Ai is the set of features belonging to oj and not oi. The function F is a real valued nonnegative function, for example
s(Ai, Aj) = a f(Ai & Aj) - b f(Ai - Aj) - c f(Aj - Ai)   for some a, b, c ≥ 0
where f is an additive interval metric function. The function f(A) may be chosen as any nonnegative function of the set A, like the number of attributes in A or the average distance between points in A. A normalized (ratio) form is
s(Ai, Aj) = f(Ai & Aj) / [f(Ai & Aj) + a f(Ai - Aj) + b f(Aj - Ai)]   for some a, b ≥ 0
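The ratio-form measure can be sketched with f(A) taken as the number of attributes in A; the animal attribute sets below are invented for illustration, and the asymmetry discussed above appears whenever the weights a and b differ.

```python
def similarity(A, B, a=0.5, b=0.5, f=len):
    """s(A, B) = f(A & B) / (f(A & B) + a*f(A - B) + b*f(B - A)),
    with f(S) here simply the number of attributes in the set S."""
    common = f(A & B)
    return common / (common + a * f(A - B) + b * f(B - A))

bird = {"wings", "feathers", "flies", "lays-eggs", "beak"}
bat  = {"wings", "flies", "fur", "nocturnal"}
print(similarity(bird, bat))  # 2 / (2 + 0.5*3 + 0.5*2) = 2/4.5
# With a != b the measure is asymmetric: s(A, B) != s(B, A)
print(similarity(bird, bat, 1.0, 0.1), similarity(bat, bird, 1.0, 0.1))
```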

When the representations are graph structures, a similarity measure based on the cost of transforming one graph into the other may be used For example, a procedure to find a measure of similarity between two labeled graphs decomposes the graphs into basic subgraphs and computes the minimum

cost to transform either graph into the other one, subpart-by-subpart

Matching like Patterns
We consider procedures which amount to performing a complete match between two structures. The match will be accomplished by comparing the two structures and testing for equality among the corresponding parts. Pattern variables will be used for instantiations of some parts, subject to restrictions.

Matching Substrings
A basic function required in many match algorithms is to determine if a substring S2 consisting of m characters occurs somewhere in a string S1 of n characters, m ≤ n. A direct approach to this problem is to compare the two strings character-by-character, starting with the first characters of both S1 and S2. If any two characters disagree, the process is repeated, starting with the second character of S1 and matching again against S2 character-by-character until a match is found or a disagreement occurs again.

This process continues until a match occurs or S1 has no more characters. Let i and j be position indices for string S1 and k a position index for S2. We can perform the substring match with the following algorithm:

i := 0
while i ≤ (n - m + 1) do
  begin
    i := i + 1; j := i; k := 1;
    while S1(j) = S2(k) do
      begin
        if k = m then writeln('success')
        else begin j := j + 1; k := k + 1 end
      end
  end;
writeln('fail')

This algorithm requires m(n - m) comparisons in the worst case. A more efficient algorithm will not repeat the same comparisons over and over again. One such algorithm uses two indices i and j, where i indexes the character positions in S1 and j is set to a "match state" value ranging from 0 to m. State 0 corresponds to no matched characters between the strings, while state 1 corresponds to the first letter in S2 matching character i in S1. State 2 corresponds to the first two consecutive letters in S2 matching letters i and i+1 in S1 respectively, and so on, with state m corresponding to a successful match.
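The first (naive) algorithm can be sketched in Python as follows; the function name and return convention are choices of this sketch.

```python
def substring_match(s1, s2):
    """Slide s2 along s1, comparing character by character; return the
    start position on success, -1 on failure. Worst case about m*(n - m)
    comparisons, as noted above."""
    n, m = len(s1), len(s2)
    for i in range(n - m + 1):
        k = 0
        while k < m and s1[i + k] == s2[k]:
            k += 1
        if k == m:
            return i   # success
    return -1          # fail

print(substring_match("acdebfba", "ebf"))  # 3
print(substring_match("acdebfba", "ebe"))  # -1 (no occurrence)
```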

Whenever consecutive letters fail to match, the state index is reduced accordingly.

Matching Graphs
Two graphs G1 and G2 match if they have the same labeled nodes and same labeled arcs and all node-to-node arcs are the same. If G2 with m nodes is a subgraph of G1 with n nodes, where n ≥ m, a worst case match will require n!/(n - m)! node comparisons and O(m^2) arc comparisons.

Finding subgraph isomorphisms is also an important matching problem

An isomorphism between the graphs G1 and G2 with vertices V1, V2 and edges E1, E2, that is, (V1, E1) and (V2, E2) respectively, is a one-to-one mapping f between V1 and V2, such that for all v1 ∈ V1, f(v1) = v2, and for each arc e1 ∈ E1 connecting v1 and v1', there is a corresponding arc e2 ∈ E2 connecting f(v1) and f(v1').
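A brute-force sketch of the isomorphism test for small directed graphs: it simply tries every one-to-one mapping, reflecting the factorial worst case noted above. The example graphs are invented for illustration.

```python
from itertools import permutations

def isomorphic(nodes1, edges1, nodes2, edges2):
    """True if some one-to-one mapping f: V1 -> V2 carries every arc
    (u, v) in E1 onto an arc (f(u), f(v)) in E2, with |E1| = |E2|."""
    nodes1, nodes2 = list(nodes1), list(nodes2)
    if len(nodes1) != len(nodes2) or len(edges1) != len(edges2):
        return False
    e2 = set(edges2)
    for perm in permutations(nodes2):
        f = dict(zip(nodes1, perm))
        if all((f[u], f[v]) in e2 for u, v in edges1):
            return True
    return False

cycle      = ["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]
relabelled = ["x", "y", "z"], [("y", "x"), ("x", "z"), ("z", "y")]
fan        = ["x", "y", "z"], [("x", "y"), ("x", "z"), ("y", "z")]
print(isomorphic(*cycle, *relabelled))  # True: a directed 3-cycle either way
print(isomorphic(*cycle, *fan))         # False: the fan graph has no cycle
```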

Matching Sets and Bags
An exact match of two sets having the same number of elements requires that their intersection also have that same number of elements. Partial matches of two sets can also be determined by taking their intersections.

If the two sets have the same number of elements and all elements are of equal importance, the degree of match can be the proportion of the total members which match. If the number of elements differs between the sets, the proportion of matched elements to the minimum of the total number of members can be used as a measure of likeness. When the elements are not of equal importance, weighting factors can be used to score the matched elements. For example, a measure such as
s(S1, S2) = (Σi wi N(ai)) / m
could be used, where the wi are weights and N(ai) = 1 if ai is in the intersection; otherwise it is 0. An efficient way to find the intersection of two sets of symbolic elements in LISP is to work through one set marking each element on the element's property list and then saving all elements from the other list that have been marked.

The resultant list of saved elements is the required intersection. Matching two bags is similar to matching two sets except that counts of the number of occurrences of each element must also be made. For this, a count of the number of occurrences can be used as the property mark for elements found in one set. This count can then be used to compare against a count of elements found in the second set.

Matching to Unify Literals
One of the best examples of nontrivial pattern matching is in the unification of two FOPL literals.

For example, to unify P(f(a,x), y, y) and P(x, b, z) we first rename variables so that the two predicates have no variables in common. This can be done by replacing the x in the second predicate with u to give P(u, b, z).

Compare the two symbol-by-symbol from left to right until a disagreement is found. Disagreements can be between two different variables, a nonvariable term and a variable, or two nonvariable terms. If no disagreement is found, the two are identical and we have succeeded. If a disagreement is found and both are nonvariable terms, unification is impossible, so we have failed. If both are variables, one is replaced throughout by the other. Finally, if the disagreement is between a variable and a nonvariable term, the variable is replaced by the entire term. In this last step, replacement is possible only if the term does not contain the variable that is being replaced. This matching process is repeated until the two are unified or until a failure occurs.

For the two predicates P above, a disagreement is first found between the term f(a,x) and the variable u. Since f(a,x) does not contain the variable u, we replace u with f(a,x) everywhere it occurs in the literal. This gives a substitution set of {f(a,x)/u} and the partially matched predicates P(f(a,x), y, y) and P(f(a,x), b, z). Proceeding with the match, we find the next disagreement pair, y and b, a variable and a term, respectively. Again we replace the variable y with the term b and update the substitution list to get {f(a,x)/u, b/y}. The final disagreement pair is two variables. Replacing the variable in the second literal with the first, we get the substitution set {f(a,x)/u, b/y, y/z} or {f(a,x)/u, b/y, b/z}.

As another example, here is a LISP program which uses both the open and the segment pattern matching variables to find a match between a pattern and a clause:

(defun match (pattern clause)
  (cond ((equal pattern clause) t)                 ; return t if equal, nil if not
        ((or (null pattern) (null clause)) nil)
        ((or (equal (car pattern) (car clause))    ; ?x binds to a single term
             (equal (car pattern) '?x))
         (match (cdr pattern) (cdr clause)))
        ((equal (car pattern) '*y)                 ; *y binds to several
         (or (match (cdr pattern) (cdr clause))    ; contiguous terms
             (match pattern (cdr clause))))))

When a segment variable is encountered (the *y), match is recursively executed on the cdrs of both pattern and clause or on the cdr of clause and pattern as *y matches one or more than one item respectively
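The unification procedure described above can be sketched as follows. The representation is an assumption of this sketch: terms are nested tuples, and strings beginning with "?" are variables.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def substitute(t, s):
    """Apply substitution set s to term t, following binding chains."""
    if is_var(t):
        return substitute(s[t], s) if t in s else t
    if isinstance(t, tuple):
        return tuple(substitute(a, s) for a in t)
    return t

def occurs(v, t, s):
    """Occurs check: does variable v appear inside term t under s?"""
    t = substitute(t, s)
    return t == v or (isinstance(t, tuple) and any(occurs(v, a, s) for a in t))

def unify(x, y, s=None):
    """Return a substitution set unifying x and y, or None on failure."""
    s = {} if s is None else s
    x, y = substitute(x, s), substitute(y, s)
    if x == y:
        return s
    if is_var(x):
        return None if occurs(x, y, s) else {**s, x: y}
    if is_var(y):
        return unify(y, x, s)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for a, b in zip(x, y):
            s = unify(a, b, s)
            if s is None:
                return None
        return s
    return None

# P(f(a,x), y, y) and P(u, b, z), variables already renamed apart:
s = unify(("P", ("f", "a", "?x"), "?y", "?y"),
          ("P", "?u", "b", "?z"))
print(s)  # {'?u': ('f', 'a', '?x'), '?y': 'b', '?z': 'b'}
```

This reproduces the substitution set {f(a,x)/u, b/y, b/z} derived in the walkthrough, and the failing example (a b (c d) e f) versus (a b ?x e ?x) returns None because ?x cannot be bound consistently.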

Partial Matching
For many AI applications, complete matching between two or more structures is inappropriate. For example, input representations of speech waveforms or visual scenes may have been corrupted by noise or other unwanted distortions. In such cases, we do not want to reject the input out of hand; our systems should be more tolerant of such problems. We want our system to be able to find an acceptable or best match between the input and some reference description.

Compensating for Distortions
Finding an object in a photograph given only a general description of the object is a common problem in vision applications. For example, the task may be to locate a human face or human body in photographs without the necessity of storing hundreds of specific face templates.

A better approach in this case would be to store a single reference description of the object.

Matching between photograph regions and corresponding descriptions could then be approached using either a measure of correlation or by altering the image to obtain a closer fit. If nothing is known about the noise and distortion characteristics, correlation methods can be ineffective or even misleading. In such cases, methods based on mechanical distortion may be appropriate. For example, imagine our reference image is on a transparent rubber sheet. This sheet is moved over the input image and at each location is stretched to get the best match alignment between the two images. The match between the two can then be evaluated by how well they correspond and how much push-and-pull distortion is needed to obtain the best correspondence. A discrete version uses a number of rigid pieces connected with springs. These pieces can correspond to low level areas such as pixels or even larger area segments.

Fig Discrete version of stretchable overlay image

To model any restrictions such as the relative positions of body parts, nonlinear cost functions of piece displacements can be used. The costs can correspond to different spring tensions which reflect the constraints. For example, the cost of displacing some pieces might be zero for no displacement, one unit for single increment displacements in any one of the

permissible directions, two units for two position displacements and infinite cost for displacements of more than two increments. Other pieces would be assigned higher costs for unit and larger position displacements when stronger constraints were applicable The matching problem is to find a least cost location and distortion pattern for the reference sheet with regard to the sensed picture Attempting to compare each component of some reference to each primitive part of a sensed picture is a combinatorially explosive problem In using the template-spring reference image and heuristic methods to compare against different segments of the sensed picture, the search and match process can be made tractable Any matching metric used in the least cost comparison would need to take into account the sum of the distortion costs Cd , the sum of the costs for reference and

sensed component dissimilarities Cc , and the sum of penalty costs for missing components Cm . Thus, the total cost is given by

Ct = Cd + Cc + Cm Finding Match Differences Distortions occurring in representations are not the only reason for partial matches

For example, in problem solving or analogical inference, differences are expected. In such cases the two structures are matched to isolate the differences so that they may be reduced or transformed. Once again, partial matching techniques are appropriate. In a vision application, an industrial part may be described using a graph structure where the set of nodes corresponds to rectangular or cylindrical block subparts.

The arcs in the graph correspond to positional relations between the subparts. Labels for rectangular block nodes contain length, width, and height, while labels for cylindrical block nodes contain their dimensions; the positional relation can be above, to the right of, behind, inside, and so on. Fig 8.5 illustrates a segment of such a graph.

In the figure, the following abbreviations are used.

Interpreting the graph, we see it is a unit consisting of subparts made up of rectangular and cylindrical blocks with dimensions specified by attribute values.

The cylindrical block n1 is to the right of n2 by d1 units, and the two are connected by a joint. The blocks n1 and n2 are above the rectangular block n3 by d2 and d3 units respectively, and so on. Graphs such as this are called attributed relational graphs (ARGs). Such a graph G is defined as a quintuple

G = (N, B, A, Gn, Gb)

where N = {n1, n2, . . ., nk} is a set of nodes, A = {an1, an2, . . ., ank} is an alphabet of node attributes, B = {b1, b2, . . ., bm} is a set of directed branches (b = (ni, nj)), and Gn and Gb are functions for generating node and branch attributes respectively. When the representations are graph structures like ARGs, a similarity measure may be computed as the total cost of transforming one graph into the other.

For example, the similarity of two ARGs may be determined with the following steps:
o Decompose the ARGs into basic subgraphs, each having a depth of one
o Compute the minimum cost to transform either basic ARG into the other one, subgraph by subgraph
o Compute the total transformation cost from the sum of the subgraph costs

An ARG may be transformed by the three basic operations of node or branch deletion, insertion, or substitution, where each operation is given a cost based on computation time or other factors.
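The subgraph-by-subgraph transformation cost above can be illustrated with a minimal sketch. The representation (branch labels mapping to node-attribute tuples) and the unit costs are assumptions for illustration, not the book's notation.

```python
# Hedged sketch: transformation cost between two depth-one ARG subgraphs, using
# unit costs for branch/node deletion, insertion, and substitution (assumed costs).

def subgraph_cost(g1, g2, sub_cost=1, ins_cost=1, del_cost=1):
    """g1, g2: dicts mapping a branch label to the attribute tuple of the node it reaches."""
    cost = 0
    for branch, attrs in g1.items():
        if branch not in g2:
            cost += del_cost          # branch present only in g1: delete it
        elif g2[branch] != attrs:
            cost += sub_cost          # same branch, different node attributes: substitute
    cost += ins_cost * sum(1 for branch in g2 if branch not in g1)  # insertions
    return cost

def arg_similarity(subgraphs1, subgraphs2):
    """Total cost = sum of the per-subgraph transformation costs."""
    return sum(subgraph_cost(s1, s2) for s1, s2 in zip(subgraphs1, subgraphs2))

# Example: one matching branch, one attribute substitution, one inserted branch.
g1 = {"right_of": ("cyl", 2.0), "above": ("rect", 1.0)}
g2 = {"right_of": ("cyl", 2.0), "above": ("rect", 3.0), "behind": ("rect", 1.0)}
cost = subgraph_cost(g1, g2)   # 0 + 1 + 1 = 2
```

A lower total cost means the two graphs are more similar; a cost of zero means the subgraphs are identical under this scheme.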

The RETE matching algorithm
One potential problem with expert systems is the number of comparisons that need to be made between rules and facts in the database. In some cases, where there are hundreds or even thousands of rules, running comparisons against each rule can be impractical. The Rete Algorithm is an efficient method for solving this problem and is used by a number of expert system tools, including OPS5 and Eclipse. The Rete is a directed, acyclic, rooted graph. Each path from the root node to a leaf in the tree represents the left-hand side of a rule. Each node stores details of which facts have been matched by the rules at that point in the path. As facts are changed, the new facts are propagated through the Rete from the root node to the leaves, changing the information stored at nodes appropriately.

This could mean adding a new fact, changing information about an old fact, or deleting an old fact. In this way, the system only needs to test each new fact against the rules, and only against those rules to which the new fact is relevant, instead of checking each fact against each rule. The Rete algorithm depends on the principle that, in general, when using forward chaining in expert systems, the values of objects change relatively infrequently, meaning that relatively few changes need to be made to the Rete. In such cases, the Rete algorithm can provide a significant improvement in performance over other methods, although it is less efficient in cases where objects are continually changing. The basic inference cycle of a production system is match, select, and execute, as indicated in Fig 8.6. These operations are performed as follows.

Fig Production system components and basic cycle

Match
o During the match portion of the cycle, the conditions in the LHS of the rules in the knowledge base are matched against the contents of working memory to determine which rules have their LHS conditions satisfied with consistent bindings to working memory terms.
o Rules which are found to be applicable are put in a conflict set.

Select
o From the conflict set, one of the rules is selected to execute. The selection strategy may depend on recency of usage, specificity of the rule, or other criteria.

Execute

o The rule selected from the conflict set is executed by carrying out the action or conclusion part of the rule, the RHS of the rule. This may involve an I/O operation; adding, removing, or changing clauses in working memory; or simply causing a halt.
o The above cycle is repeated until no rules are put in the conflict set or until a stopping condition is reached.

The main time-saving features of RETE are as follows:
1. In most expert systems, the contents of working memory change very little from cycle to cycle. There is persistence in the data, known as temporal redundancy, which makes exhaustive matching on every cycle unnecessary. Instead, by saving match information, it is only necessary to compare working memory changes on each cycle. In RETE, additions to, removals from, and changes to working memory are translated directly into changes to the conflict set, as shown in the figure. When a rule from the conflict set has been selected to fire, it is removed from the set and the remaining entries are saved for the next cycle. Consequently, repetitive matching of all rules against working memory is avoided. Furthermore, by indexing rules with the condition terms appearing in their LHS, only those rules which could match the working memory changes need to be examined. This greatly reduces the number of comparisons required on each cycle.

Fig Changes to working memory are mapped to the conflict set

2. Many rules in a knowledge base will have the same conditions occurring in their LHS. This is just another way in which unnecessary matching can arise. Repeated testing of the same conditions in those rules can be avoided by grouping rules which share the same conditions and linking them to their common terms. It is then possible to perform a single set of tests for all the applicable rules, as shown in the figure below.

Fig Typical rules and a portion of a compiled network
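The two time-saving ideas above can be sketched in a few lines. This is not a real Rete network, only an illustration of condition indexing: rules and facts here are hypothetical, and each rule's LHS is reduced to a set of ground condition tuples.

```python
# Minimal sketch of RETE-style condition indexing: a working-memory change is
# tested only against rules whose LHS mentions that condition, and each shared
# condition appears once in the index rather than once per rule.
from collections import defaultdict

rules = {
    "r1": {"conditions": {("goat", "hungry"), ("food", "available")}, "action": "feed goat"},
    "r2": {"conditions": {("goat", "hungry"), ("food", "none")}, "action": "buy food"},
}

# "Compile" the rules: each condition points at every rule that uses it.
condition_index = defaultdict(set)
for name, rule in rules.items():
    for cond in rule["conditions"]:
        condition_index[cond].add(name)

working_memory = set()
conflict_set = set()

def add_fact(fact):
    """Propagate one change; only rules indexed under this fact are re-checked."""
    working_memory.add(fact)
    for name in condition_index.get(fact, ()):
        if rules[name]["conditions"] <= working_memory:
            conflict_set.add(name)   # all LHS conditions satisfied

add_fact(("goat", "hungry"))
add_fact(("food", "available"))      # conflict_set is now {"r1"}
```

Note that no rule is ever re-matched in full on a cycle where none of its condition terms changed, which is the temporal-redundancy saving described in point 1.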

Knowledge Organization and Management
The advantage of using structured knowledge representation schemes (frames, associative networks, or object-oriented structures) over unstructured ones (rules or FOPL clauses) should be understood and appreciated at this point. Structured schemes group or link small related chunks of knowledge together as a unit. This simplifies the processing operations, since

knowledge required for a given task is usually contained within a limited semantic region, which can be accessed as a unit or traced through a few linkages. But, as suggested earlier, representation is not the only factor which affects efficient manipulation. A program must first locate and retrieve the appropriate knowledge in an efficient manner whenever it is needed. One of the most direct methods for finding the appropriate knowledge is exhaustive search, or the enumeration of all items in memory. This is also one of the least efficient access methods. More efficient retrieval is accomplished through some form of indexing or grouping. We consider some of these processes in the next section, where we review traditional access and retrieval methods used in memory organizations. This is followed by a description of less commonly used forms of indexing. A “smart” expert system can be expected to have thousands or even tens of thousands of rules (or their equivalent) in its KB. A good example is XCON (or R1), an expert system which was developed for the Digital Equipment Corporation to configure their customers’ computer systems. XCON has a rapidly growing KB which, at the present time, consists of more than 12,000 production rules. Large numbers of rules are needed in systems like this, which deal with complex reasoning tasks. System configuration becomes very complex when the number of components and corresponding parameters is large (several hundred). If each rule contained about four or five conditions in its antecedent, or If part, and an exhaustive search were used, as many as 40,000 to 50,000 tests could be required on each recognition cycle. Clearly, the time required to perform this number of tests is intolerable. Instead, some form of memory management is needed. We saw one way this problem was solved, using a form of indexing with the RETE algorithm described in the preceding chapter. More direct memory organization approaches to this problem are considered in this chapter. We humans live in a dynamic, continually changing environment. To cope with this change, our memories exhibit some rather remarkable properties. We are able to adapt to varied changes in the environment and still improve our performance. This is because our memory system is continuously adapting through a reorganization process. New knowledge is continually being added to our memories, existing knowledge is continually being revised, and less important knowledge is gradually being forgotten. Our memories are continually being reorganized to expand our recall and reasoning abilities. This process leads to improved memory performance throughout most of our lives.

When developing computer memories for intelligent systems, we may gain some useful insight by learning what we can from human memory systems. We would expect computer memory systems to possess some of the same features. For example, human memories tend to be limitless in capacity, and they provide a uniform grade of recall service, independent of the amount of information stored. For later use, we have summarized these and other desirable characteristics that we feel an effective computer memory organization system should possess. 1. It should be possible to add and integrate new knowledge in memory as needed without concern for limitations in size. 2. Any organizational scheme chosen should facilitate the remembering process. Thus, it should be possible to locate any stored item of knowledge efficiently from its content alone.

3. The addition of more knowledge to memory should have no adverse effects on the accessibility of items already stored there. Thus, the search time should not increase appreciably with the amount of information stored. 4. The organization scheme should facilitate the recognition of similar items of knowledge. This is essential for reasoning and learning functions. It suggests that existing knowledge be used to determine the location and manner in which new knowledge is integrated into memory. 5. The organization should facilitate the process of consolidating recurrent incidents or episodes and “forgetting” knowledge when it is no longer valid or no longer needed.

These characteristics suggest that memory be organized around conceptual clusters of knowledge. Related clusters should be grouped and stored in close proximity to each other and be linked to similar concepts through associative relations. Access to any given cluster should be possible through either direct or indirect links, such as concept pointers indexed by meaning. Index keys with synonymous meanings should provide links to the same knowledge clusters. These notions are illustrated graphically in Fig 9.1, where the clusters represent arbitrary groups of closely related knowledge, such as objects and their properties or basic conceptual categories. The links connecting the clusters are two-way pointers which provide relational associations between the clusters they connect.

Indexing and retrieval techniques

The Frame Problem
One tricky aspect of systems that must function in dynamic environments is due to the so-called frame problem. This is the problem of knowing what changes have and have not taken place following some action. Some changes will be the direct result of the action. Other changes will be the result of secondary or side effects rather than of the action itself. For example, if a robot is cleaning the floors in a house, the location of the floor sweeper changes with the robot even though this is not explicitly stated. Other objects not attached to the robot remain in their original places. The actual changes must somehow be reflected in memory, a feat that requires some ability to infer. Effective memory organization and management methods must take into account the effects caused by the frame problem.

The three basic problems related to knowledge organization are:
1. classifying and computing indices for input information presented to the system
2. access and retrieval of knowledge from memory through the use of the computed indices
3. the reorganization of memory structures when necessary to accommodate additions, revisions, and forgetting.
These functions are depicted in Fig 9.1.


Fig Memory Organization Function

When a knowledge base is too large to be held in main memory, it must be stored as a file in secondary storage (disk, drum or tape). Storage and retrieval of information in secondary memory is then performed through the transfer of equal-size physical blocks consisting of between 256 and 4096 bytes.

When an item of information is retrieved or stored, at least one complete block must be transferred between main and secondary memory. The time required to transfer a block typically ranges between 10 ms and 100 ms, about the same amount of time required to sequentially search the whole block for an item. Grouping related knowledge together as a unit can help to reduce the number of block transfers, and hence the total access time. An example of effective grouping can be found in some expert system KB organizations. Grouping together rules which share some of the same conditions and conclusions can reduce block transfer times, since such rules are likely to be needed during the same problem-solving session. Collecting rules together by similar conditions or content can thus help to reduce the number of block transfers required.

Indexed Organization
While organization by content can help to reduce block transfers, an indexed organization scheme can greatly reduce the time to determine the storage location of an item. Indexing is accomplished by organizing the information in some way for easy access. One way to index is by segregating knowledge into two or more groups and storing the locations of the knowledge for each group in a smaller index file.

To build an indexed file, knowledge stored as units is first arranged sequentially by some key value. The key can be any chosen field that uniquely identifies the record. A second file containing indices for the record locations is created while the sequential knowledge file is being loaded. Each physical block in this main file results in one entry in the index file.

The index file entries are pairs of record key values and block addresses.

The key value is the key of the first record stored in the corresponding block. To retrieve an item of knowledge from the main file, the index file is searched to find the desired record key and obtain the corresponding block address.

The block is then accessed using this address, and items within the block are searched sequentially for the desired record. An indexed file contains a list of the entry pairs (k, b), where each value k is the key of the first record in the block whose starting address is b. Fig 9.2 illustrates the process used to locate a record using the key value of 378.

Fig Indexed File Organization

The largest key value less than 378 (375) gives the block address (800) where the item will be found. Once the 800 block has been retrieved, it can be searched linearly to locate the record with key value 378. The key could be any alphanumeric string that uniquely identifies a record, since such strings usually have a collation order defined by their code set. If the index file is large, a binary search can be used to speed up the index file search. A binary search will significantly reduce the search time over linear search when the number of items is not too small.

When a file contains n records, the average time for a linear search is proportional to n/2, compared to a binary search time on the order of log2(n).
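The indexed lookup described above, using the example key 378, can be sketched as follows. The index entries and block contents are hypothetical, chosen to reproduce the figures in the text.

```python
# Sketch of indexed-file retrieval: the index holds (first-key-in-block, block-address)
# pairs; a binary search finds the last entry whose key is <= the search key, and the
# selected block is then scanned sequentially.
import bisect

index = [(100, 200), (250, 400), (375, 800), (500, 1200)]     # (key, block address)
blocks = {800: [(375, "rec-375"), (378, "rec-378"), (390, "rec-390")]}  # block contents

def find_record(key):
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1   # largest indexed key <= search key
    block_addr = index[pos][1]
    for k, record in blocks[block_addr]:       # linear search within the block
        if k == key:
            return block_addr, record
    return block_addr, None

addr, rec = find_record(378)   # block 800 is selected, then scanned for key 378
```

The binary search over the index is the log2(n) step; the scan inside one block corresponds to the short sequential search mentioned earlier.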

Further reductions in search time can be realized using secondary or higher-order index files. In this case the secondary index file would contain key and block-address pairs for the primary index file. Similar indexing would apply for higher-order hierarchies, where a separate file is used for each level. Both binary search and hierarchical index file organization may be needed when the KB is a very large file. Indexing in LISP can be implemented with property lists, A-lists, and/or hash tables. For example, a KB can be partitioned into segments by storing each segment as a list under the property value for that segment. Each list indexed in this way can be found with the get property function and then searched sequentially, or sorted and searched with binary search methods.

A hash table is a special data structure in LISP which provides a means of rapid access through key hashing.

Hashed Files
Indexed organizations that permit efficient access are based on the use of a hash function. A hash function, h, transforms key values k into integer storage location indices through a simple computation. When a maximum number of items C are to be stored, the hashed values h(k) will range from 0 to C – 1. Therefore, given any key value k, h(k) should map into one of 0 . . . C – 1. An effective hash function can be computed by choosing the largest prime number p less than or equal to C, converting the key value k into an integer k’ if necessary, and then using the value k’ mod p as the index value h. For example, if C is 1000, the largest prime less than C is p = 997. Thus, if the record key value is 123456789, the hashed value is h = (k mod 997) = 273.

When using hashed access, the value of C should be chosen large enough to accommodate the maximum number of categories needed.

The use of the prime number p in the algorithm helps to ensure that the resultant indices are somewhat uniformly distributed, or hashed, throughout the range 0 . . . C – 1. This type of organization is well suited for groups of items corresponding to C different categories. When two or more items belong to the same category, they will have the same hashed values; these values are called synonyms. One way to accommodate such collisions is with data structures known as buckets.

A bucket is a linked list of one or more items, where each item is a record, block, list, or other data structure. The first item in each bucket has an address corresponding to the hashed address. Fig 9.3 illustrates a form of hashed memory organization which uses buckets to hold all items with the same hashed key value.
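The hashed organization with buckets can be sketched directly from the numbers in the text: capacity C = 1000, largest prime p = 997, so the key 123456789 hashes to 273. The record contents are illustrative.

```python
# Sketch of hashed storage with buckets: h(k) = k mod p selects a slot, and all
# colliding keys (synonyms) are chained in a list at that slot.
C = 1000
P = 997                    # largest prime <= C

def h(key):
    return key % P

table = [[] for _ in range(C)]   # each slot is a bucket of (key, record) pairs

def store(key, record):
    table[h(key)].append((key, record))

def fetch(key):
    for k, record in table[h(key)]:   # linear scan of the bucket
        if k == key:
            return record
    return None

store(123456789, "some record")      # lands in bucket h(123456789) = 273
```

Lookups cost one hash computation plus a short scan of one bucket, rather than a search over the whole file.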

Fig Hashed Memory File Organization

The address of each bucket in this case is the indexed location in an array.

Conceptual Indexing
A better approach to indexed retrieval is one which makes use of the content or meaning associated with the stored entities rather than some nonmeaningful key value. This suggests the use of indices which name and define the entity being retrieved. Thus, if the entity is an object, its name and characteristic attributes would make meaningful indices.

If the entity is an abstract object such as a concept, the name and other defining traits would be meaningful as indices. In an associative network organization, nodes within the network correspond to different knowledge entities, whereas the links are indices or pointers to the entities. Links connecting two entities name the association or relationship between them. The relationship between entities may be defined as a hierarchical one or just through associative links. As an example of an indexed network, the concept of computer science (CS) should be accessible directly through the CS name or indirectly through associative links like a university major, a career field, or a type of classroom course. These notions are illustrated in Fig 9.4.

Fig Associative Network Indexing and Organization

Object attributes can also serve as indices to locate items based on attribute values. In this case, the best attribute keys are those which provide the greatest discrimination among objects within the same category. For example, suppose we wish to organize knowledge by object types. In this case, the choice of attributes should depend on the use intended for the knowledge. Since objects may be classified with an unlimited number of attributes, those attributes which are most discriminable with respect to the concept meaning should be chosen.

Integrating knowledge and memory
Integrating new knowledge in traditional databases is accomplished by simply adding an item at its key location, deleting an item from a key-directed location, or modifying fields of an existing item with specific input information. When an item in inventory is replaced with a new one, its description is changed accordingly. When an item is added to memory, its index is computed and it is stored at the corresponding address. More sophisticated memory systems will continuously monitor a knowledge base and make inferred changes as appropriate. A more comprehensive management system will perform other functions as well, including the formation of new conceptual structures, the computation and association of causal linkages between related concepts, generalization of items having common features, the formation of specialized conceptual categories, and specialization of concepts that have been overgeneralized.

Hypertext
Hypertext systems are examples of information organized through associative links, like associative networks. These systems are interactive window systems connected to a database through associative links. Unlike normal text, which is read in linear fashion, hypertext can be browsed in a nonlinear way by moving through a network of information nodes which are linked bidirectionally through associative links. Users of hypertext systems can wander through the database scanning text and graphics, creating new information nodes and linkages, or modifying existing ones.

This approach to documentation use is said to more closely match the cognitive process. It provides a new approach to information access and organization for authors, researchers, and other users of large bodies of information.

Memory organization system

HAM, a model of memory
One of the earliest computer models of memory was the Human Associative Memory (HAM) system developed by John Anderson and Gordon Bower.

This memory is organized as a network of propositional binary trees. An example of a simple tree, which represents the statement “In a park a hippie touched a debutante,” is illustrated in Fig 9.5. When an informant asserts this statement to HAM, the system parses the sentence and builds a binary tree representation. Nodes in the tree are assigned unique numbers, while links are labeled with the

following functions:
C: context for tree fact
P: predicate
e: set membership
R: relation
F: a fact
S: subject
L: a location
T: time
O: object

As HAM is informed of new sentences, they are parsed and formed into new tree-like memory structures or integrated with existing ones. For example, to add the fact that the hippie was tall, the following subtree is attached to the tree structure of the figure below by merging the common node hippie (node 3) into a single node.

Fig Organization of knowledge in HAM

When HAM is posed with a query, it is formed into a tree structure called a probe. This structure is then matched against existing memory structures for the best match. The structure with the closest match is used to formulate an answer to the query. Matching is accomplished by first locating the leaf nodes in memory that match leaf nodes in the probe. The corresponding links are then checked to see if they have the same labels and are in the same order.

The search process is constrained by searching only node groups that have the same relation links, based on recency of usage. The search is not exhaustive, and nodes accessed infrequently may be forgotten.
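HAM's probe matching can be illustrated with a heavily simplified sketch: each memory tree is reduced to a set of (link-label, leaf) pairs, and the probe is matched by leaf overlap. This drops the label-ordering check described above, and the sentences stored here are hypothetical.

```python
# Hedged sketch of HAM-style probe matching on (link-label, leaf) pairs.
# Link labels follow the list above: S = subject, R = relation, O = object, L = location, T = time.
memory = [
    {("L", "park"), ("S", "hippie"), ("R", "touch"), ("O", "debutante"), ("T", "past")},
    {("L", "park"), ("S", "policeman"), ("R", "watch"), ("O", "hippie"), ("T", "past")},
]

def best_match(probe):
    """Return the stored structure sharing the most (link, leaf) pairs with the probe."""
    return max(memory, key=lambda tree: len(tree & probe))

# A query such as "Did a hippie touch a debutante?" becomes a probe of leaf pairs:
probe = {("S", "hippie"), ("R", "touch"), ("O", "debutante")}
answer = best_match(probe)   # the first stored tree shares all three pairs
```

The closest-matching structure would then be used to formulate the answer, as the text describes.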

Access to nodes in HAM is accomplished through word indexing in LISP.

Memory Organization with E-MOPs
CYRUS (Computerized Yale Retrieval and Updating System) is a system developed by Janet Kolodner to study problems associated with the retrieval and organization of reconstructive memory. It stores episodes from the lives of former secretaries of state Cyrus Vance and Edmund Muskie.

The episodes are indexed and stored in long-term memory for subsequent use in answering queries posed in English.

The basic memory model in CYRUS is a network consisting of Episodic Memory Organization Packets (E-MOPs). Each E-MOP is a frame-like node structure which contains conceptual information related to different categories of episodic events. E-MOPs are indexed in memory by one or more distinguishing features. For example, there are basic E-MOPs for diplomatic meetings with foreign dignitaries, specialized political conferences, traveling, and state dinners, as well as other basic events related to diplomatic state functions. The diplomatic meeting E-MOP, called $MEET, contains information which is common to all diplomatic meeting events. The common information which characterizes such an E-MOP is called its content. For example, $MEET might contain the following information: A second type of information contained in E-MOPs is the indices, which index either individual episodes or other E-MOPs which have become specializations of their parent E-MOPs. A typical $MEET E-MOP which has indices to two particular event meetings, EV1 and EV2, is illustrated in Fig 9.6.


Fig An example of an E-MOP with two indexed events EV1 and EV2

For example, one of the meetings indexed was between Vance and Gromyko of the USSR, in which they discussed SALT. This is labeled as event EV1 in the figure. The second meeting was between Vance and Begin of Israel, in which they discussed Arab-Israeli peace. This is labeled as event EV2. Note that each of these events can be accessed through more than one feature (index). For example, EV2 can be located from the $MEET event through a topic value of “Arab-Israeli peace,” through a participants’ nationality value of “Israel,” through a participants’ occupation value of “head of state,” and so on.

As new diplomatic meetings are entered into the system, they are either integrated with the $MEET E-MOP as separately indexed events or merged with another event to form a new specialized meeting E-MOP. When several events belonging to the same MOP category are entered, common event features are used to generalize the E-MOP. This information is collected in the frame contents. Specialization may also be required when over-generalization has occurred. Thus, memory is continually being reorganized as new facts are entered. This process prevents the addition of excessive memory entries and the redundancy which would result if every event entered resulted in the addition of a separate event. Reorganization can also cause forgetting, since originally assigned indices may be changed when new structures are formed. When this occurs, an item cannot be located, so the system attempts to derive new indices from the context and through other indices by reconstructing related events. The key issues in this type of organization are:
o The selection and computation of good indices for new events, so that similar events can be located in memory for new event integration

o Monitoring and reorganization of memory to accommodate new events as they occur
o Access of the correct event information when provided clues for retrieval
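The feature-based indexing at the heart of the E-MOP scheme can be sketched in a few lines. The event names EV1 and EV2 come from the text; the feature names and values used here are illustrative.

```python
# Sketch of E-MOP-style indexing: each event is indexed under every (feature, value)
# pair, so it can later be retrieved through any one of its distinguishing features.
from collections import defaultdict

emop_index = defaultdict(list)

def add_event(name, features):
    for feature, value in features.items():
        emop_index[(feature, value)].append(name)

add_event("EV1", {"participant": "Gromyko", "nationality": "USSR", "topic": "SALT"})
add_event("EV2", {"participant": "Begin", "nationality": "Israel", "topic": "Arab-Israeli peace"})

# EV2 is reachable through any of its features:
hits = emop_index[("nationality", "Israel")]
```

A fuller model would also generalize shared feature values into the E-MOP's content and re-index events when the structure is reorganized, which is where the forgetting behavior described above arises.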

Natural Language Processing
Developing programs to understand natural language is important in AI because a natural form of communication with systems is essential for user acceptance. One of the most critical tests for intelligent behavior is the ability to communicate effectively. This was the test proposed by Alan Turing. AI programs must be able to communicate with their human counterparts in a natural way, and natural language is one of the most important mediums for that purpose. A program understands a natural language if it behaves by taking a correct or acceptable action in response to the input. For example, we say a child demonstrates understanding if it responds with the correct answer to a question. The action taken need not be the external response. It may be the creation of some internal data structures. The structures created should be meaningful and correctly interact with the world model representation held by the program. In this chapter we explore many of the important issues related to natural language understanding and language generation.

This chapter explores several techniques that are used to enable humans to interact with computers via natural human languages. Natural languages are the languages used by humans for communication (among other functions). They are distinctly different from formal languages, such as C++, Java, and PROLOG. One of the main differences, which we will examine in some detail in this chapter, is that natural languages are ambiguous, meaning that a given sentence can have more than one possible meaning, and in some cases the correct meaning can be very hard to determine. Formal languages are almost always designed to ensure that ambiguity cannot occur. Hence, a given program written in C++ can have only one interpretation. This is clearly desirable because otherwise the computer would have to make an arbitrary decision as to which interpretation to work with. It is becoming increasingly important for computers to be able to understand natural languages. Telephone systems are now widespread that are able to understand a narrow range of commands and questions to assist callers to large call centers, without needing to use human resources. Additionally, the quantity of unstructured textual data that exists in the world (and in particular, on the Internet) has reached unmanageable proportions. For humans to search through these data using traditional techniques such as Boolean queries or the database query language SQL is impractical. The idea that people should be able to pose questions in their own language, or something similar to it, is an increasingly popular one. Of course, English is not the only natural language. A great deal of research in natural language processing and information retrieval is carried out in English, but many human languages differ enormously from English. Languages such as Chinese, Finnish, and Navajo have almost nothing in common with English (although of course Finnish uses the same alphabet). Hence, a system that can work with one human language cannot necessarily deal with any other human language. In this section we will explore two main topics. First, we will examine natural language processing, which is a collection of techniques used to enable computers to “understand” human language. In general, these techniques are concerned with extracting grammatical information as well as meaning from human utterances, but they are also concerned with understanding those utterances and performing useful tasks as a result. Two of the earliest goals of natural language processing were automated translation (which is explored in this chapter) and database access. The idea here was that if a user wanted to find some information from a database, it would make much more sense if he or she could query the database in his or her own language, rather than needing to learn a new formal language such as SQL. Information retrieval is a collection of techniques used to try to match a query (or a command) to a set of documents from an existing corpus of documents. Systems such as the search engines that we use to find data on the Internet use information retrieval (albeit of a fairly simple nature).

Overview of linguistics In dealing with natural language, a computer system needs to be able to process and manipulate language at a number of levels. Phonology. This is needed only if the computer is required to understand spoken language. Phonology is the study of the sounds that make up words and is used to identify words from sounds. We will explore this in a little more detail later, when we look at the ways in which computers can understand speech. Morphology. This is the first stage of analysis that is applied to words, once they have been identified from speech, or input into the system. Morphology looks at the ways in which words break down into components and how that affects their

grammatical status. For example, the letter “s” on the end of a word can often either indicate that it is a plural noun or a third-person present-tense verb. Syntax. This stage involves applying the rules of the grammar from the language being used. Syntax determines the role of each word in a sentence and, thus, enables a computer system to convert sentences into a structure that can be more easily manipulated. Semantics. This involves the examination of the meaning of words and sentences. As we will see, it is possible for a sentence to be syntactically correct but to be semantically meaningless. Conversely, it is desirable that a computer system be able to understand sentences with incorrect syntax but that still convey useful information semantically. Pragmatics. This is the application of human-like understanding to sentences and discourse to determine meanings that are not immediately clear from the semantics. For example, if someone says, “Can you tell me the time?”, most people know that “yes” is not a suitable answer. Pragmatics enables a computer system to give a sensible answer to questions like this. In addition to these levels of analysis, natural language processing systems must apply some kind of world knowledge. In most real-world systems, this world knowledge is limited to a specific domain (e.g., a system might have detailed knowledge about the Blocks World and be able to answer questions about this world). The ultimate goal of natural language processing would be to have a system with enough world knowledge to be able to engage a human in discussion on any subject. This goal is still a long way off.

Morphological Analysis English morphology is relatively simple. We have endings such as -ing, -s, and -ed, which are applied to verbs; endings such as -s and -es, which are applied to nouns; we also have the ending -ly, which usually indicates that a word is an adverb. We also have prefixes such as anti-, non-, un-, and in-, which tend to indicate negation, or opposition. We also have a number of other prefixes and suffixes that provide a variety of semantic and syntactic information.

In practice, however, morphological analysis for the English language is not terribly complex, particularly when compared with languages such as German, which tend to combine words together into single compound words to express combinations of meaning. Morphological analysis is mainly useful in natural language processing for identifying parts of speech (nouns, verbs, etc.) and for identifying which words belong together. In English, word order tends to provide more of this information than morphology, however. In languages such as Latin, word order was almost entirely unimportant, and the morphology was extremely important. Languages such as French, Italian, and Spanish lie somewhere between these two extremes. As we will see in the following sections, being able to identify the part of speech for each word is essential to understanding a sentence. This can partly be achieved by simply looking up each word in a dictionary, which might contain for example the following entries:
(swims, verb, present, singular, third person)
(swimmer, noun, singular)
(swim, verb, present, singular, first and second persons)
(swim, verb, present, plural, first, second, and third persons)
(swimming, participle)
(swimmingly, adverb)
(swam, verb, past)
Clearly, a complete dictionary of this kind would be unfeasibly large. A more practical approach is to include information about standard endings, such as:
(-ly, adverb)
(-ed, verb, past)

(-s, noun, plural)

This works fine for regular verbs, such as walk, but for all natural languages there are large numbers of irregular verbs, which do not follow these rules. Verbs such as to be and to do are particularly difficult in English as they do not seem to follow any morphological rules. The most sensible approach to morphological analysis is thus to include a set of rules that work for most regular words, together with a list of irregular words.
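The rule-plus-exception approach just described can be sketched in a few lines of Python. The suffix rules and the irregular-word table below are illustrative fragments, not a complete description of English morphology:

```python
# A minimal sketch of rule-plus-exception morphological analysis.
# Irregular words are looked up first; regular words fall through to
# suffix rules, tried longest suffix first.

IRREGULAR = {
    "swam": ("swim", "verb, past"),
    "is":   ("be",   "verb, present, singular, third person"),
    "did":  ("do",   "verb, past"),
}

SUFFIX_RULES = [
    ("ingly", "adverb"),
    ("ly",    "adverb"),
    ("ing",   "participle"),
    ("ed",    "verb, past"),
    ("es",    "noun, plural"),
    ("s",     "noun, plural or verb, third person singular"),
]

def analyse(word):
    """Return a (stem, grammatical tag) pair for a word."""
    if word in IRREGULAR:                       # exceptions first
        return IRREGULAR[word]
    for suffix, tag in SUFFIX_RULES:            # then regular endings
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], tag
    return word, "unknown"

print(analyse("swam"))    # ('swim', 'verb, past')
print(analyse("walked"))  # ('walk', 'verb, past')
```

A conversational system would need a far longer irregular table, which is exactly the trade-off the text describes.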

For a system that was designed to converse on any subject, this second list would be extremely long. Most natural language systems currently are designed to discuss fairly limited domains and so do not need to include over-large look-up tables. In most natural languages, as well as the problem posed by the fact that word order tends to have more importance than morphology, there is also the difficulty of ambiguity at a word level. This kind of ambiguity can be seen in particular in words such as trains, which could be a plural noun or a singular verb, and set, which can be a noun, verb, or adjective.

BNF Parsing involves mapping a linear piece of text onto a hierarchy that represents the way the various words interact with each other syntactically.

First, we will look at grammars, which are used to represent the rules that define how a specific language is built up. Most natural languages are made up of a number of parts of speech, mainly the following:
o Verb
o Noun
o Adjective
o Adverb
o Conjunction
o Pronoun
o Article
In fact it is useful when parsing to combine words together to form syntactic groups. Hence, the words a dog, which consist of an article and a noun, can also be described as a noun phrase.

A noun phrase is one or more words that combine together to represent an object or thing that can be described by a noun. Hence, the following are valid noun phrases: christmas, the dog, that packet of chips, the boy who had measles last year and nearly died, my favorite color

A noun phrase is not a sentence—it is part of a sentence. A verb phrase is one or more words that represent an action. The following are valid verb phrases: swim, eat that packet of chips, walking

A simple way to describe a sentence is to say that it consists of a noun phrase and a verb phrase. Hence, for example: That dog is eating my packet of chips. In this sentence, that dog is a noun phrase, and is eating my packet of chips is a verb phrase. Note that the verb phrase is in fact made up of a verb phrase, is eating, and a noun phrase, my packet of chips. A language is defined partly by its grammar. The rules of grammar for a language such as English can be written out in full, although it would be a complex process to do so. To allow a natural language processing system to parse sentences, it needs to have knowledge of the rules that describe how a valid sentence can be constructed. These rules are often written in what is known as Backus–Naur form (also known as Backus normal form—both names are abbreviated as BNF).

BNF is widely used by computer scientists to define formal languages such as C++ and Java. We can also use it to define the grammar of a natural language. A grammar specified in BNF consists of the following components:
o Terminal symbols. Each terminal symbol is a symbol or word that appears in the language itself. In English, for example, the terminal symbols are our dictionary words such as the, cat, dog, and so on. In formal languages, the terminal symbols include variable names such as x, y, and so on, but for our purposes we will consider the terminal symbols to be the words in the language.
o Nonterminal symbols. These are the symbols such as noun, verb phrase, and conjunction that are used to define words and phrases of the language. A nonterminal symbol is so-named because it is used to represent one or more terminal symbols.
o The start symbol. The start symbol is used to represent a complete sentence in the language. In our case, the start symbol is simply sentence, but in first-order predicate logic, for example, the start symbol would be expression.
o Rewrite rules. The rewrite rules define the structure of the grammar. Each rewrite rule details what symbols (terminal or nonterminal) can be used to make up each nonterminal symbol.

Let us now look at rewrite rules in more detail. We saw above that a sentence could take the following form:
noun phrase verb phrase
We thus write the following rewrite rule:
Sentence→NounPhrase VerbPhrase
This does not mean that every sentence must be of this form, but simply that a string of symbols that takes on the form of the right-hand side can be rewritten in the form of the left-hand side. Hence, if we see the words
The cat sat on the mat
we might identify that the cat is a noun phrase and that sat on the mat is a verb phrase. We can thus conclude that this string forms a sentence.

We can also use BNF to define a number of possible noun phrases. Note how we use the “|” symbol to separate the possible right-hand sides in BNF:
NounPhrase→ Noun
| Article Noun
| Adjective Noun
| Article Adjective Noun
Similarly, we can define a verb phrase:
VerbPhrase→ Verb

| Verb NounPhrase

| Adverb Verb NounPhrase

The structure of human languages varies considerably. Hence, a set of rules like this will be valid for one language, but not necessarily for any other language. For example, in English it is usual to place the adjective before the noun (black cat, stale bread), whereas in French, it is often the case that the adjective comes after the noun (moulin rouge). Thus far, the rewrite rules we have written consist solely of nonterminal symbols. Rewrite rules are also used to describe the parts of speech of individual words (or terminal symbols):
Noun→ cat | dog | Mount Rushmore | chickens
Verb→ swims | eats | climbs
Article→ the | a
Adjective→ black | brown | green | stale
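Rewrite rules like these translate directly into a data structure. As a minimal sketch using the grammar fragments above, the following Python program expands the start symbol by repeatedly choosing a right-hand side at random, generating a syntactically valid sentence:

```python
import random

# Each nonterminal maps to its possible right-hand sides; a symbol with
# no entry in the grammar is a terminal word and is emitted as-is.
GRAMMAR = {
    "Sentence":   [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Noun"], ["Article", "Noun"],
                   ["Adjective", "Noun"], ["Article", "Adjective", "Noun"]],
    "VerbPhrase": [["Verb"], ["Verb", "NounPhrase"]],
    "Noun":       [["cat"], ["dog"], ["chickens"]],
    "Verb":       [["swims"], ["eats"], ["climbs"]],
    "Article":    [["the"], ["a"]],
    "Adjective":  [["black"], ["brown"], ["stale"]],
}

def generate(symbol):
    """Expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        return [symbol]
    words = []
    for s in random.choice(GRAMMAR[symbol]):
        words.extend(generate(s))
    return words

print(" ".join(generate("Sentence")))
```

Note that because these rules are context free, the generator is equally happy to produce sentences with poor agreement, such as chickens eats, a limitation discussed shortly.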

Grammars and Languages Noam Chomsky defined a hierarchy of grammars, consisting of four main types. The simplest grammars are used to define regular languages. A regular language is one that can be described or understood by a finite state automaton. Such languages are very simplistic and allow sentences such as “aaaaabbbbbb.” Recall that a finite state automaton consists of a finite number of states, and rules that define how the automaton can transition from one state to another. A finite state automaton could be designed that defined the language that consisted of a string of one or more occurrences of the letter a. Hence, the following strings would be valid strings in this language:
aaa
a
aaaaaaaaaaaaaaaaa
Regular languages are of interest to computer scientists, but are not of great interest to the field of natural language processing because they are not powerful enough to represent even simple formal languages, let alone the more complex natural languages. Sentences defined by a regular grammar are often known as regular expressions. The grammar that we defined above using rewrite rules is a context-free grammar. It is context free because it defines the grammar simply in terms of which word types can go together—it does not specify the way that words should agree with each other. For example, it would allow the following sentence, which is syntactically valid but meaningless:

A stale dog climbs Mount Rushmore. It also allows the following sentence, which is not grammatically correct: Chickens eats. A context-free grammar can have only a single nonterminal symbol on the left-hand side of each of its rewrite rules. Rewrite rules for a context-sensitive grammar, in contrast, can have more than one symbol on the left-hand side. This enables the grammar to specify number, case, tense, and gender agreement.

Each context-sensitive rewrite rule must have at least as many symbols on the right-hand side as it does on the left-hand side. Rewrite rules for context-sensitive grammars have the following form:
A X B→A Y B
which means that in the context of A and B, X can be rewritten as Y. Each of A, B, X, and Y can be either a terminal or a nonterminal symbol. Context-sensitive grammars are most usually used for natural language processing because they are powerful enough to define the kinds of grammars that natural languages use. Unfortunately, they tend to involve a much larger number of rules and are a much less natural way to describe language, making them harder for human developers to design than context-free grammars. The final class of grammars in Chomsky’s hierarchy consists of recursively enumerable grammars (also known as unrestricted grammars). A recursively enumerable grammar can define any language and has no restrictions on the structure of its rewrite rules. Such grammars are of interest to computer scientists but are not of great use in the study of natural language processing.
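Returning to the simplest level of the hierarchy, the finite state automaton for the language of one or more a's can be sketched directly:

```python
# A finite state automaton for the regular language "one or more a's".
# State 0 is the start state; state 1 is the only accepting state.

TRANSITIONS = {
    (0, "a"): 1,   # the first "a" moves to the accepting state
    (1, "a"): 1,   # further "a"s stay there
}
ACCEPTING = {1}

def accepts(string):
    """Run the automaton over the string, one symbol per transition."""
    state = 0
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False               # no valid transition: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

print(accepts("aaa"))   # True
print(accepts(""))      # False: at least one "a" is required
print(accepts("aab"))   # False: "b" has no transition
```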

Parsing: Syntactic Analysis As we have seen, morphological analysis can be used to determine to which part of speech each word in a sentence belongs. We will now examine how this information is used to determine the syntactic structure of a sentence.

This process, in which we convert a sentence into a tree that represents the sentence’s syntactic structure, is known as parsing. Parsing a sentence tells us whether it is a valid sentence, as defined by our grammar. If a sentence is not a valid sentence, then it cannot be parsed. Parsing a sentence involves producing a tree, such as that shown in Fig 10.1, which shows the parse tree for the following sentence: The black cat crossed the road.

Fig 10.1 This tree shows how the sentence is made up of a noun phrase and a verb phrase. The noun phrase consists of an article, an adjective, and a noun. The verb phrase consists of a verb and a further noun phrase, which in turn consists of an article and a noun. Parse trees can be built in a bottom-up fashion or in a top-down fashion. Building a parse tree from the top down involves starting from a sentence and determining which of the possible rewrites for Sentence can be applied to the sentence that is being parsed. Hence, in this case, Sentence would be rewritten using the following rule: Sentence→NounPhrase VerbPhrase

Then the verb phrase and noun phrase would be broken down recursively in the same way, until only terminal symbols were left. When a parse tree is built from the top down, it is known as a derivation tree.

To build a parse tree from the bottom up, the terminal symbols of the sentence are first replaced by their corresponding nonterminals (e.g., cat is replaced by noun), and then these nonterminals are combined to match the right-hand sides of rewrite rules. For example, the and road would be combined using the following rewrite rule: NounPhrase→Article Noun
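As a sketch of the top-down approach described above, the following recursive procedure tries each rewrite of a symbol against the remaining words. The tiny grammar and lexicon are illustrative only, and the sketch commits to the first rewrite of a symbol that succeeds, which is enough for this example:

```python
# A minimal top-down parser: expand the start symbol, trying each
# rewrite rule in turn against the remaining words.

GRAMMAR = {
    "Sentence":   [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Article", "Adjective", "Noun"], ["Article", "Noun"]],
    "VerbPhrase": [["Verb", "NounPhrase"], ["Verb"]],
}
LEXICON = {
    "the": "Article", "black": "Adjective",
    "cat": "Noun", "road": "Noun", "crossed": "Verb",
}

def parse(symbol, words):
    """Return (tree, remaining words), or None if the symbol cannot match."""
    if symbol not in GRAMMAR:                   # part-of-speech symbol
        if words and LEXICON.get(words[0]) == symbol:
            return (symbol, words[0]), words[1:]
        return None
    for rhs in GRAMMAR[symbol]:                 # try each rewrite rule
        subtrees, rest = [], words
        for s in rhs:
            result = parse(s, rest)
            if result is None:
                break                           # this rewrite fails
            tree, rest = result
            subtrees.append(tree)
        else:
            return (symbol, subtrees), rest     # whole rewrite matched
    return None

tree, rest = parse("Sentence", "the black cat crossed the road".split())
assert rest == []        # all words consumed: a valid sentence
print(tree)
```

The returned nested tuples mirror the parse tree of Fig 10.1: a Sentence node whose children are a NounPhrase and a VerbPhrase, and so on down to the words.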

Basic parsing techniques Transition Networks A transition network is a finite state automaton that is used to represent a part of a grammar. A transition network parser uses a number of these transition networks to represent its entire grammar. Each network represents one nonterminal symbol in the grammar. Hence, in the grammar for the English language, we would have one transition network for Sentence, one for Noun Phrase, one for Verb Phrase, one for Verb, and so on. Fig 10.2 shows the transition network equivalents for three production rules.

In each transition network, S1 is the start state, and the accepting state, or final state, is denoted by a heavy border. When a phrase is applied to a transition network, the first word is compared against one of the arcs leading from the first state. If this word matches one of those arcs, the network moves into the state to which that arc points. Hence, the first network shown in Fig 10.2, when presented with a Noun Phrase, will move from state S1 to state S2. If a phrase is presented to a transition network and no match is found from the current state, then that network cannot be used and another network must be tried. Hence, when starting with the phrase the cat sat on the mat, none of the networks shown in Fig 10.2 will be used because they all have only nonterminal symbols, whereas all the symbols in the cat sat on the mat are terminal. Hence, we need further networks, such as the ones shown in Figure 10.2, which deal with terminal symbols.

Fig 10.2

Transition networks can be used to determine whether a sentence is grammatically correct, at least according to the rules of the grammar the networks represent. Parsing using transition networks involves exploring a search space of possible parses in a depth-first fashion. Let us examine the parse of the following simple sentence: A cat sat. We begin in state S1 in the Sentence transition network. To proceed, we must follow the arc that is labeled NounPhrase. We thus move out of the Sentence network and into the NounPhrase network. The first arc of the NounPhrase network is labeled Noun. We thus move into the Noun network. We now follow each of the arcs in the Noun network and discover that our first word, A, does not match any of them. Hence, we backtrack to the next arc in the NounPhrase network. This arc is labeled Article, so we move on to the Article transition network. Here, on examining the second label, we find that the first word is matched by the terminal symbol on this arc. We therefore consume the word, A, and move on to state S2 in the Article network. Because this is a success node, we are able to return to the NounPhrase network and move on to state S2 in this network. We now have an arc labeled Noun. As before, we move into the Noun network and find that our next word, cat, matches. We thus move to state S4 in the NounPhrase network. This is a success node, and so we move back to the Sentence network and repeat the process for the VerbPhrase arc. It is possible for a system to use transition networks to generate a derivation tree for a sentence, so that as well as determining whether the sentence is grammatically valid, it parses it fully to obtain further information by semantic analysis from the sentence. This can be done by simply having the system build up the tree by noting which arcs it successfully followed. 
When, for example, it successfully follows the NounPhrase arc in the Sentence network, the system generates a root node labeled Sentence and an arc leading from that node to a new node

labeled NounPhrase. When the system follows the NounPhrase network and

identifies an article and a noun, these are similarly added to the tree. In this way, the full parse tree for the sentence can be generated using transition networks. Parsing using transition networks is simple to understand, but is not necessarily as efficient or as effective as we might hope for. In particular, it does not pay any attention to potential ambiguities or the need for words to agree with each other in case, gender, or number.
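The depth-first search just described can be sketched with transition networks represented as explicit arcs. Each arc is a (state, label, next-state) triple; a label naming another network causes a recursive traversal, and a generator is used so that failed paths backtrack automatically. The networks below are illustrative:

```python
# A sketch of transition network parsing. Each network is a set of arcs
# plus its final states; labels are either other networks (nonterminals)
# or parts of speech looked up in the lexicon.

NETWORKS = {
    "Sentence":   {"arcs": [(1, "NounPhrase", 2), (2, "VerbPhrase", 3)],
                   "final": {3}},
    "NounPhrase": {"arcs": [(1, "Noun", 4), (1, "Article", 2), (2, "Noun", 4)],
                   "final": {4}},
    "VerbPhrase": {"arcs": [(1, "Verb", 2)], "final": {2}},
}
LEXICON = {"a": "Article", "the": "Article", "cat": "Noun", "sat": "Verb"}

def traverse(network, state, words):
    """Yield each possible list of words left over after this network."""
    net = NETWORKS[network]
    if state in net["final"]:
        yield words                          # may stop in a final state
    for s, label, nxt in net["arcs"]:
        if s != state:
            continue
        if label in NETWORKS:                # nonterminal: enter subnetwork
            for rest in traverse(label, 1, words):
                yield from traverse(network, nxt, rest)
        elif words and LEXICON.get(words[0]) == label:
            yield from traverse(network, nxt, words[1:])   # consume word

def grammatical(sentence):
    return any(rest == [] for rest in traverse("Sentence", 1, sentence.split()))

print(grammatical("a cat sat"))   # True
print(grammatical("sat a cat"))   # False
```

Trying the Noun arc for A, failing, and falling back to the Article arc happens here implicitly: the generator simply moves on to the next matching arc, which is the backtracking the text describes.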

Augmented Transition Networks An augmented transition network, or ATN, is an extended version of a transition network. ATNs have the ability to apply tests to arcs, for example, to ensure agreement with number. Thus, an ATN for Sentence would be as shown in Figure 10.2, but the arc from node S2 to S3 would be conditional on the number of the verb being the same as the number for the noun. Hence, if the noun phrase were three dogs and the verb phrase were is blue, the ATN would not be able to follow the arc from node S2 to S3 because the number of the noun phrase (plural) does not match the number of the verb phrase (singular). In languages such as French, checks for gender would also be necessary. The conditions on the arcs are calculated by procedures that are attached to the arcs. The procedure attached to an arc is called when the network reaches that arc. These procedures, as well as carrying out checks on agreement, are able to form a parse tree from the sentence that is being analyzed.
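As a minimal sketch, the test attached to an ATN arc can be just a procedure that compares feature values; the number tags below are invented for illustration:

```python
# A sketch of an ATN arc test: the arc from S2 to S3 in the Sentence
# network may only be followed if the noun and verb agree in number.

NUMBER = {"dog": "singular", "dogs": "plural",
          "is": "singular", "are": "plural"}

def arc_test(noun, verb):
    """Procedure attached to the S2 -> S3 arc: check number agreement."""
    return NUMBER[noun] == NUMBER[verb]

assert arc_test("dog", "is")        # singular/singular: arc may be followed
assert not arc_test("dogs", "is")   # plural/singular: arc blocked
```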

Chart Parsing Parsing using transition networks is effective, but not the most efficient way to parse natural language. One problem can be seen in examining the following two sentences:
1. Have all the fish been fed?
2. Have all the fish.

Clearly these are very different sentences—the first is a question, and the second is an instruction. In spite of this, the first three words of each sentence are the same.

When a parser is examining one of these sentences, it is quite likely to have to backtrack to the beginning if it makes the wrong initial choice for the structure of the sentence. In longer sentences, this can be a much greater problem, particularly as it involves examining the same words more than once, without using the fact that the words have already been analyzed.

Fig 10.3 Another method that is sometimes used for parsing natural language is chart parsing. In the worst case, chart parsing will parse a sentence of n words in O(n³) time. In many cases it will perform better than this and will parse most sentences in O(n²) or even O(n) time. In examining sentence 1 above, the chart parser would note that the words all the fish form a noun phrase. It would note this on its first pass through the sentence and would store this information in a chart, meaning it would not need to examine those words again on a subsequent pass, after backtracking. The initial chart for the sentence The cat eats a big fish is shown in Fig 10.3, which shows the chart that the chart parse algorithm would start with for parsing the sentence. The chart consists of seven vertices, which will become connected to each other by edges. The edges will show how the constituents of the sentence combine together. The chart parser starts by adding the following edge to the chart: [0, 0, Target→• Sentence]

This notation means that the edge connects vertex 0 to itself (the first two numbers in the square brackets show which vertices the edge connects). Target is the target that we want to find, which is really just a placeholder to enable us to have an edge that requires us to find a whole sentence. The arrow indicates that in order to make what is on its left-hand side (Target) we need to find what is on its right-hand side (Sentence). The dot (•) shows

what has been found already, on its left-hand side, and what is yet to be found, on its right-hand side. This is perhaps best explained by examining an example. Consider the following edge, which is shown in the chart in Figure 10.4: [0, 2, Sentence→NounPhrase • VerbPhrase] This means that an edge exists connecting nodes 0 and 2. The dot shows us that we have already found a NounPhrase (the cat) and that we are looking for a VerbPhrase.

Fig 10.4 Once we have found the VerbPhrase, we will have what is on the left-hand side of the arrow—that is, a Sentence. The chart parser can add edges to the chart using the following three rules:
o If we have an edge [x, y, A → B • C], which needs to find a C, then an edge can be added that supplies that C (i.e., the edge [y, y, C → • E], where E is some sequence of terminals or nonterminals that can be rewritten as a C).
o If we have two edges, [x, y, A → B • C D] and [y, z, C → E •], then these two edges can be combined together to form a new edge: [x, z, A → B C • D].
o If we have an edge [x, y, A → B • C], and the word at vertex y is of type C, then we have found a suitable word for this edge, and so we extend the edge along to the next vertex by adding the following edge: [x, y + 1, A → B C •].
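The three rules above can be sketched as a small chart parser (essentially an Earley parser). Edges are tuples (start, end, lhs, found, rest); the grammar and lexicon are illustrative, and the implementation simply applies the three rules until no new edges appear:

```python
# A sketch of chart parsing: an edge (x, y, lhs, found, rest) means that
# between vertices x and y we have found `found` and still need `rest`
# in order to build an `lhs`.

GRAMMAR = {
    "Target":     [["Sentence"]],
    "Sentence":   [["NounPhrase", "VerbPhrase"]],
    "NounPhrase": [["Article", "Noun"], ["Article", "Adjective", "Noun"]],
    "VerbPhrase": [["Verb", "NounPhrase"], ["Verb"]],
}
LEXICON = {"the": "Article", "a": "Article", "cat": "Noun",
           "fish": "Noun", "big": "Adjective", "eats": "Verb"}

def chart_parse(words):
    chart = {(0, 0, "Target", (), ("Sentence",))}
    while True:
        new = set()
        for (x, y, lhs, found, rest) in chart:
            if rest:
                needed = rest[0]
                # Rule 1: add an empty edge that could supply `needed`.
                for rhs in GRAMMAR.get(needed, []):
                    new.add((y, y, needed, (), tuple(rhs)))
                # Rule 3: consume the next word if it has type `needed`.
                if y < len(words) and LEXICON.get(words[y]) == needed:
                    new.add((x, y + 1, lhs, found + (needed,), rest[1:]))
            else:
                # Rule 2: a complete edge extends any edge waiting for lhs.
                for (a, b, l2, f2, r2) in chart:
                    if b == x and r2 and r2[0] == lhs:
                        new.add((a, y, l2, f2 + (lhs,), r2[1:]))
        if new <= chart:       # no new edges: the chart is complete
            return chart
        chart |= new

chart = chart_parse("the cat eats a big fish".split())
# A complete Target edge spanning all six words means a valid sentence.
assert (0, 6, "Target", ("Sentence",), ()) in chart
```

Because every edge is kept in the chart, the noun phrase a big fish is recognized exactly once, however many larger hypotheses make use of it; this is the saving over backtracking transition networks.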

Semantic Analysis Having determined the syntactic structure of a sentence, the next task of natural language processing is to determine the meaning of the sentence.

Semantics is the study of the meaning of words, and semantic analysis is the analysis we use to extract meaning from utterances.

Semantic analysis involves building up a representation of the objects and actions that a sentence is describing, including details provided by adjectives, adverbs, and prepositions. Hence, after analyzing the sentence The black cat sat on the mat, the system would use a semantic net such as the one shown in Figure 10.5 to represent the objects and the relationships between them.

Fig 10.5 A more sophisticated semantic network is likely to be formed, which includes information about the nature of a cat (a cat is an object, an animal, a quadruped, etc.) that can be used to deduce facts about the cat (e.g., that it likes to drink milk).
Ambiguity and Pragmatic Analysis One of the main differences between natural languages and formal languages like C++ is that a sentence in a natural language can have more than one meaning. This is ambiguity—the fact that a sentence can be interpreted in different ways depending on who is speaking, the context in which it is spoken, and a number of other factors. We will now examine the more common forms of ambiguity and look at ways in which a natural language processing system can make sensible decisions about how to disambiguate them. Lexical ambiguity occurs when a word has more than one possible meaning. For example, a bat can be a flying mammal or a piece of sporting equipment. The word set is an interesting example of this because it can be used as a verb, a noun, an adjective, or an adverb. Determining which part of speech is intended can often be achieved by a parser in cases where only one analysis is possible, but in other cases semantic disambiguation is needed to determine which meaning is intended. Syntactic ambiguity occurs when there is more than one possible parse of a sentence. The sentence Jane carried the girl with the spade could be

interpreted in two different ways, as is shown in the two parse trees in Fig 10.6. In the first of the two parse trees in Fig 10.6, the prepositional phrase with the spade is applied to the noun phrase the girl, indicating that it was the girl who had a spade whom Jane carried. In the second parse tree, the prepositional phrase has been attached to the verb phrase carried the girl, indicating that Jane somehow used the spade to carry the girl.

Semantic ambiguity occurs when a sentence has more than one possible meaning—often as a result of a syntactic ambiguity. In the example shown in Fig 10.6, the sentence Jane carried the girl with the spade has two different parses, which correspond to two possible meanings for the sentence. The significance of this becomes clearer for practical systems if we imagine a robot that receives vocal instructions from a human.

Fig 10.6 Referential ambiguity occurs when we use anaphoric expressions, or pronouns, to refer to objects that have already been discussed. An anaphora occurs when a word or phrase is used to refer to something without naming it. The problem of ambiguity occurs where it is not immediately clear which object is being referred to. For example, consider the following sentences:

John gave Bob the sandwich. He smiled. It is not at all clear from this who smiled—it could have been John or Bob. In general, English speakers or writers avoid constructions such as this to avoid humans becoming

confused by the ambiguity. In spite of this, ambiguity can also occur in a similar way where a human would not have a problem, such as

John gave the dog the sandwich. It wagged its tail. In this case, a human listener would know very well that it was the dog that wagged its tail, and not the sandwich. Without specific world knowledge, the natural language processing system might not find it so obvious.

A local ambiguity occurs when a part of a sentence is ambiguous; however, when the whole sentence is examined, the ambiguity is resolved. For example, in the sentence There are longer rivers than the Thames, the phrase longer rivers is ambiguous until we read the rest of the sentence, than the Thames. Another cause of ambiguity in human language is vagueness. As we saw when we examined fuzzy logic, words such as tall, high, and fast are vague and do not have precise numeric meanings. The process by which a natural language processing system determines which meaning is intended by an ambiguous utterance is known as disambiguation. Disambiguation can be done in a number of ways. One of the most effective ways to overcome many forms of ambiguity is to use probability.
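As a minimal sketch of probability-based disambiguation, suppose we have (invented) counts of how often each sense of bat occurred near a given context word; the sense with the highest conditional probability wins:

```python
# A sketch of disambiguation by conditional probability. The counts
# below are invented for illustration: how often each sense of "bat"
# was observed together with a context word.

SENSE_COUNTS = {
    "cave":   {"mammal": 18, "equipment": 1},
    "locker": {"mammal": 1,  "equipment": 12},
}

def disambiguate(counts):
    """Return the most probable sense and its conditional probability."""
    total = sum(counts.values())
    best = max(counts, key=counts.get)
    return best, counts[best] / total

sense, p = disambiguate(SENSE_COUNTS["cave"])
print(sense, round(p, 2))   # mammal 0.95
```

A prior probability would simply be the same calculation over all contexts pooled together; real systems estimate both from large corpora rather than hand-entered counts.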

This can be done using prior probabilities or conditional probabilities. Prior probability might be used to tell the system that the word bat nearly always means a piece of sporting equipment. Conditional probability would tell it that when the word bat is used by a sports fan, this is likely to be the case, but that when it is spoken by a naturalist it is more likely to be a winged mammal. Context is also an extremely important tool in disambiguation. Consider the following sentences: I went into the cave. It was full of bats. I looked in the locker. It was full of bats. In each case, the second sentence is the same, but the context provided by the first sentence helps us to choose the correct meaning of the word “bat” in each case. Disambiguation thus requires a good world model, which contains knowledge about the world that can be used to determine the most likely

meaning of a given word or sentence. The world model would help the system to understand that the sentence Jane carried the girl with the spade is unlikely to mean that Jane used the spade to carry the girl because spades are usually used to carry smaller things than girls. The challenge, of course, is to encode this knowledge in a way that can be used effectively and efficiently by the system. The world model needs to be as broad as the sentences the system is likely to hear. For example, a natural language processing system devoted to answering sports questions might not need to know how to disambiguate the sporting bat from the winged mammal, but a system designed to answer any type of question would.
Expert System Architecture An expert system is a set of programs that manipulate encoded knowledge to solve problems in a specialized domain that normally requires human expertise. An expert system’s knowledge is obtained from expert sources and coded in a form suitable for the system to use in its inference or reasoning processes. The expert knowledge must be obtained from specialists or other sources of expertise, such as texts, journal articles, and databases. This type of knowledge usually requires much training and experience in some specialized field such as medicine, geology, system configuration, or engineering design. Once a sufficient body of expert knowledge has been acquired, it must be encoded in some form, loaded into a knowledge base, then tested, and refined continually throughout the life of the system.

Characteristic Features of Expert Systems Expert systems differ from conventional computer systems in several important ways:
1. Expert systems use knowledge rather than data to control the solution process. Much of the knowledge used is heuristic in nature rather than algorithmic.
2. The knowledge is encoded and maintained as an entity separate from the control program. As such, it is not compiled together with the control program itself. This permits the incremental addition and modification of the knowledge base without recompilation of the control programs. Furthermore, it is possible in some cases to use different knowledge bases with the same control programs to produce different types of expert systems. Such systems are known as expert system shells since they may be loaded with different knowledge bases.
3. Expert systems are capable of explaining how a particular conclusion was reached, and why requested information is needed during a consultation. This is important as it gives the user a chance to assess and understand the system’s reasoning ability, thereby improving the user’s confidence in the system.
4. Expert systems use symbolic representations for knowledge and perform their inference through symbolic computations that closely resemble manipulations of natural language.
5. Expert systems often reason with metaknowledge; that is, they reason with knowledge about themselves, and their own knowledge limits and capabilities.

Rules for Knowledge Representation

One way to represent knowledge is by using rules that express what must happen or what does happen when certain conditions are met. Rules are usually expressed in the form of IF . . . THEN . . . statements, such as:

IF A THEN B

This can be considered to have a similar logical meaning to the following:

A → B

A is called the antecedent and B the consequent in this statement. In expressing rules, the consequent usually takes the form of an action or a conclusion. In other words, the purpose of a rule is usually to tell a system (such as an expert system) what to do in certain circumstances, or what conclusions to draw from a set of inputs about the current situation. In general, a rule can have more than one antecedent, usually combined either by AND or by OR (logically the same as the operators ∧ and ∨). Similarly, a rule may have more than one consequent, which usually suggests that there are multiple actions to be taken. In general, the antecedent of a rule compares an object with a possible value, using an operator. For example, suitable antecedents in a rule might be:

IF x > 3
IF name is "Bob"
IF weather is cold

Here, the objects being considered are x, name, and weather; the operators are ">" and "is"; and the values are 3, "Bob," and cold. Note that an object is not necessarily an object in the real-world sense: the weather is not a real-world object, but rather a state or condition of the world. An object in this sense is simply a variable that represents some physical object or state in the real world. An example of a rule might be:

IF name is "Bob" AND weather is cold THEN tell Bob 'Wear a coat'

This is an example of a recommendation rule, which takes a set of inputs and gives advice as a result. The conclusion of the rule is actually an action, and the action takes the form of a recommendation to Bob that he should wear a coat. In some cases, rules provide more definite actions such as "move left" or "close door," in which case the rules are being used to represent directives. Rules can also be used to represent relations, such as:

IF temperature is below 0 THEN weather is cold
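To make the antecedent/consequent structure concrete, here is a minimal Python sketch of how such rules could be held as data. The `Rule` class and the `matches` helper are illustrative names, not part of any expert system library, and only the "is" and ">" operators from the examples above are handled:

```python
class Rule:
    def __init__(self, antecedents, consequent):
        self.antecedents = antecedents  # list of (object, operator, value) triples
        self.consequent = consequent    # an action or a conclusion

# IF name is "Bob" AND weather is cold THEN tell Bob 'Wear a coat'
coat_rule = Rule(
    antecedents=[("name", "is", "Bob"), ("weather", "is", "cold")],
    consequent="tell Bob 'Wear a coat'",
)

def matches(rule, facts):
    """True when every antecedent is satisfied by the facts dictionary."""
    for obj, op, value in rule.antecedents:
        if op == "is" and facts.get(obj) != value:
            return False
        if op == ">" and not facts.get(obj, float("-inf")) > value:
            return False
    return True

facts = {"name": "Bob", "weather": "cold"}
print(matches(coat_rule, facts))  # True: both antecedents hold
```

When `matches` returns True, a rule-based system would carry out the consequent, here a recommendation to Bob.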

Rule-Based Systems

Rule-based systems (or production systems) are computer systems that use rules to provide recommendations or diagnoses, to determine a course of action in a particular situation, or to solve a particular problem.

A rule-based system consists of a number of components:

- a database of rules (also called a knowledge base)
- a database of facts
- an interpreter, or inference engine

In a rule-based system, the knowledge base consists of a set of rules that represent the knowledge the system has. The database of facts represents inputs to the system that are used to derive conclusions or to cause actions. The interpreter, or inference engine, is the part of the system that controls the process of deriving conclusions. It uses the rules and facts, and combines them together to draw conclusions. Using deduction to reach a conclusion from a set of antecedents is called forward chaining. An alternative method, backward chaining, starts from a conclusion and tries to show it by following a logical path backward from the conclusion to a set of antecedents that are in the database of facts.

Forward Chaining

In forward chaining, the system starts from a set of facts and a set of rules, and tries to find a way of using those rules and facts to deduce a conclusion or come up with a suitable course of action. This is known as data-driven reasoning, because the reasoning starts from a set of data and ends up at the goal, which is the conclusion. When applying forward chaining, the first step is to take the facts in the fact database and see if any combination of these matches all the antecedents of one of the rules in the rule database. When all the antecedents of a rule are matched by facts in the database, then this rule is triggered. Usually, when a rule is triggered, it is then fired, which means its conclusion is added to the facts database. If the conclusion of the rule that has fired is an action or a recommendation, then the system may cause that action to take place or the recommendation to be made. For example, consider the following set of rules that is used to control an elevator in a three-story building:

Rule 1
IF on first floor AND button is pressed on first floor
THEN open door

Rule 2
IF on first floor AND button is pressed on second floor
THEN go to second floor

Rule 3
IF on first floor AND button is pressed on third floor
THEN go to third floor

Rule 4
IF on second floor AND button is pressed on first floor AND already going to third floor
THEN remember to go to first floor later

This represents just a subset of the rules that would be needed, but we can use it to illustrate how forward chaining works. Let us imagine that we start with the following facts in our database:

Fact 1
At first floor

Fact 2
Button pressed on third floor

Fact 3
Today is Tuesday

Now the system examines the rules and finds that Facts 1 and 2 match the antecedents of Rule 3. Hence, Rule 3 fires, and its conclusion "Go to third floor" is added to the database of facts. Presumably, this results in the elevator heading toward the third floor. Note that Fact 3 was ignored altogether, because it did not match the antecedents of any of the rules. Now let us imagine that the elevator is on its way to the third floor and has reached the second floor, when the button is pressed on the first floor. The fact "Button pressed on first floor" is now added to the database, which results in Rule 4 firing. Now let us imagine that later in the day the facts database contains the following information:

Fact 1
At first floor

Fact 2
Button pressed on second floor

Fact 3
Button pressed on third floor

In this case, two rules are triggered: Rules 2 and 3. In such cases, where there is more than one possible conclusion, conflict resolution needs to be applied to decide which rule to fire.
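The match–trigger–fire cycle just described can be sketched in Python. This is a minimal illustration under simplifying assumptions (facts are plain strings, and each rule is a set of antecedent strings paired with a conclusion string), not code from any real expert system shell:

```python
# Elevator rules, simplified to string matching.
rules = [
    ({"at first floor", "button pressed on first floor"}, "open door"),
    ({"at first floor", "button pressed on second floor"}, "go to second floor"),
    ({"at first floor", "button pressed on third floor"}, "go to third floor"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose antecedents are all in the fact database."""
    fired = []
    changed = True
    while changed:
        changed = False
        for antecedents, conclusion in rules:
            # Triggered: every antecedent is matched by a fact.
            if antecedents <= facts and conclusion not in facts:
                facts.add(conclusion)      # fired: conclusion joins the facts
                fired.append(conclusion)
                changed = True
    return fired

facts = {"at first floor", "button pressed on third floor", "today is Tuesday"}
print(forward_chain(facts, rules))  # ['go to third floor']
```

As in the walkthrough, the fact "today is Tuesday" is simply ignored because it matches no antecedent.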

Conflict Resolution

In a situation where more than one conclusion can be deduced from a set of facts, there are a number of possible ways to decide which rule to fire. For example, consider the following set of rules:

IF it is cold THEN wear a coat
IF it is cold THEN stay at home
IF it is cold THEN turn on the heat

If there is a single fact in the fact database, "it is cold," then clearly there are three conclusions that can be derived. In some cases, it might be fine to follow all three conclusions, but in many cases the conclusions are incompatible. In one conflict resolution method, rules are given priority levels, and when a conflict occurs, the rule that has the highest priority is fired, as in the following example:

IF patient has pain THEN prescribe painkillers (priority 10)
IF patient has chest pain THEN treat for heart disease (priority 100)

Here, it is clear that treating possible heart problems is more important than just curing the pain. An alternative method is the longest-matching strategy. This method involves firing the conclusion that was derived from the longest rule. For example:

IF patient has pain THEN prescribe painkiller
IF patient has chest pain AND patient is over 60 AND patient has history of heart conditions THEN take to emergency room

Here, if all the antecedents of the second rule match, then this rule’s conclusion should be fired rather than the conclusion of the first rule because it is a more specific match. A further method for conflict resolution is to fire the rule that has matched the facts most recently added to the database. In each case, it may be that the system fires one rule and then stops, but in many cases, the system simply needs to choose a suitable ordering for the rules because each rule that matches the facts needs to be fired at some point.
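The two strategies above reduce to simple selections over the set of triggered rules. The sketch below assumes a triggered rule is represented as a tuple; the tuples and their fields are illustrative, not from any particular system:

```python
# Highest-priority strategy: each triggered rule carries
# (priority, antecedents, conclusion); fire the largest priority.
triggered = [
    (10,  ["patient has pain"], "prescribe painkillers"),
    (100, ["patient has chest pain"], "treat for heart disease"),
]
by_priority = max(triggered, key=lambda rule: rule[0])
print(by_priority[2])  # treat for heart disease

# Longest-matching strategy: fire the rule with the most antecedents,
# i.e. the most specific match.
triggered2 = [
    (["patient has pain"], "prescribe painkiller"),
    (["patient has chest pain", "patient is over 60",
      "patient has history of heart conditions"], "take to emergency room"),
]
longest = max(triggered2, key=lambda rule: len(rule[0]))
print(longest[1])  # take to emergency room
```

A recency-based strategy would work the same way, keyed on a timestamp recorded when each matching fact was added.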

Meta Rules

In designing an expert system, it is necessary to select the conflict resolution method that will be used, and quite possibly it will be necessary to use different methods to resolve different types of conflicts. For example, in some situations it may make most sense to use the method that involves firing the most recently added rules. This method makes most sense in situations in which the timeliness of data is important. It might be, for example, that as research in a particular field of medicine develops, new rules are added to the system that contradict some of the older rules. It might make most sense for the system to assume that these newer rules are more accurate than the older rules. It might also be the case, however, that the new rules have been added by an expert whose opinion is less trusted than that of the expert who added the earlier rules. In this case, it clearly makes more sense to give the earlier rules priority. This kind of knowledge is called metaknowledge: knowledge about knowledge. The rules that define how conflict resolution will be used, and how other aspects of the system itself will run, are called meta rules. The knowledge engineer who builds the expert system is responsible for building appropriate metaknowledge into the system (such as "expert A is to be trusted more than expert B" or "any rule that involves drug X is not to be trusted as much as rules that do not involve drug X").

Meta rules are treated by the expert system as if they were ordinary rules but are given greater priority than the normal rules that make up the expert system.

In this way, the meta rules are able to override the normal rules, if necessary,

and are certainly able to control the conflict resolution process.

Backward Chaining

Forward chaining applies a set of rules and facts to deduce whatever conclusions can be derived, which is useful when a set of facts is present but you do not know what conclusion you are trying to prove. Forward chaining can be inefficient because it may end up proving a number of conclusions that are not currently interesting. In such cases, where a single specific conclusion is to be proved, backward chaining is more appropriate. In backward chaining, we start from a conclusion, which is the hypothesis we wish to prove, and we aim to show how that conclusion can be reached from the rules and facts in the database. The conclusion we are aiming to prove is called a goal, and so reasoning in this way is known as goal-driven reasoning. Backward chaining is often used in formulating plans. A plan is a sequence of actions that a program decides to take to solve a particular problem. Backward chaining can make the process of formulating a plan more efficient than forward chaining. Backward chaining in this way starts with the goal state, which is the set of conditions the agent wishes to achieve in carrying out its plan. It then examines this state and sees what actions could lead to it. For example, if the goal state involves a block being on a table, then one possible action would be to place that block on the table. This action might not be possible from the start state, and so further actions need to be added before it in order to reach the goal from the start state.

In this way, a plan can be formulated starting from the goal and working back toward the start state. The benefit of this method is particularly clear in situations where the start state allows a very large number of possible actions. In this kind of situation, it can be very inefficient to attempt to formulate a plan using forward chaining, because it involves examining every possible action without paying any attention to which action might best lead to the goal state. Backward chaining ensures that each action that is taken is one that will definitely lead to the goal, and in many cases this will make the planning process far more efficient.

Comparing Forward and Backward Chaining

Let us use an example to compare forward and backward chaining. In this case, we will revert to our use of symbols for logical statements, in order to clarify the explanation, but we could equally well be using rules about elevators or the weather.

Rules:
Rule 1: A ∧ B → C
Rule 2: A → D
Rule 3: C ∧ D → E
Rule 4: B ∧ E ∧ F → G
Rule 5: A ∧ E → H
Rule 6: D ∧ E ∧ H → I

Facts:
Fact 1: A
Fact 2: B
Fact 3: F

Goal:
Our goal is to prove H.

First let us use forward chaining. As our conflict resolution strategy, we will fire rules in the order they appear in the database, starting from Rule 1.

In the initial state, Rules 1 and 2 are both triggered. We will start by firing Rule 1, which means we add C to our fact database. Next, Rule 2 is fired, meaning we add D to our fact database. We now have the facts A, B, C, D, F, but we have not yet reached our goal, which is H. Now Rule 3 is triggered and fired, meaning that fact E is added to the database. As a result, Rules 4 and 5 are triggered. Rule 4 is fired first, resulting in fact G being added to the database, and then Rule 5 is fired, and fact H is added to the database. We have now proved our goal and do not need to go on any further. This deduction is presented in the following table:
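The forward-chaining trace can be reproduced with a short Python sketch. It encodes Rules 1–6 as antecedent sets (an illustrative representation, with facts as single-letter strings) and stops once the goal is derived:

```python
rules = [
    ({"A", "B"}, "C"),        # Rule 1: A ∧ B → C
    ({"A"}, "D"),             # Rule 2: A → D
    ({"C", "D"}, "E"),        # Rule 3: C ∧ D → E
    ({"B", "E", "F"}, "G"),   # Rule 4: B ∧ E ∧ F → G
    ({"A", "E"}, "H"),        # Rule 5: A ∧ E → H
    ({"D", "E", "H"}, "I"),   # Rule 6: D ∧ E ∧ H → I
]

def forward_chain(facts, rules, goal):
    """Fire rules in order until no rule fires or the goal is derived."""
    order = []
    changed = True
    while changed and goal not in facts:
        changed = False
        for antecedents, conclusion in rules:
            if antecedents <= facts and conclusion not in facts:
                facts.add(conclusion)
                order.append(conclusion)
                changed = True
                if conclusion == goal:
                    break
    return order

print(forward_chain({"A", "B", "F"}, rules, goal="H"))
# ['C', 'D', 'E', 'G', 'H']
```

The firing order matches the trace above: C, D, E, then G, and finally the goal H; along the way the uninteresting fact G was derived as well.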

Now we will consider the same problem using backward chaining. To do so, we will use a goals database in addition to the rule and fact databases.

In this case, the goals database starts with just the conclusion, H, which we want to prove. We now see which rules would need to fire to lead to this conclusion. Rule 5 is the only one that has H as its conclusion, so to prove H, we must prove the antecedents of Rule 5, which are A and E. Fact A is already in the database, so we only need to prove the other antecedent, E. Therefore, E is added to the goals database. Once we have proved E, we know that this is sufficient to prove H, so we can remove H from the goals database. So now we attempt to prove fact E. Rule 3 has E as its conclusion, so to prove E, we must prove the antecedents of Rule 3, which are C and D.

Neither of these facts is in the fact database, so we need to prove both of them. They are both therefore added to the goals database. D is the conclusion of Rule 2 and Rule 2’s antecedent, A, is already in the fact database, so we can conclude D and add it to the fact database. Similarly, C is the conclusion of Rule 1, and Rule 1’s antecedents, A and B, are both in the fact database. So, we have now proved all the goals in the goal database and have therefore proved H and can stop. This process is represented in the table below:
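The backward-chaining process above can also be sketched in Python, using the same illustrative rule representation (antecedent sets of single-letter facts). To prove a goal, the sketch looks for a rule concluding it and recursively proves that rule's antecedents, falling back on the fact database:

```python
rules = [
    ({"A", "B"}, "C"),        # Rule 1
    ({"A"}, "D"),             # Rule 2
    ({"C", "D"}, "E"),        # Rule 3
    ({"B", "E", "F"}, "G"),   # Rule 4
    ({"A", "E"}, "H"),        # Rule 5
    ({"D", "E", "H"}, "I"),   # Rule 6
]
facts = {"A", "B", "F"}

def prove(goal, rules, facts):
    """Goal-driven reasoning: prove a goal from facts and rule antecedents."""
    if goal in facts:
        return True
    for antecedents, conclusion in rules:
        if conclusion == goal and all(prove(a, rules, facts) for a in antecedents):
            facts.add(goal)   # record the proved subgoal as a fact
            return True
    return False

print(prove("H", rules, facts))  # True: H proved via Rules 5, 3, 2, and 1
```

Note that Rules 4 and 6 are never consulted: unlike the forward-chaining run, G and I are never derived, which is exactly the efficiency gain described in the text.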

In this case, backward chaining needed to use one fewer rule. If the rule database had had a large number of other rules that had A, B, and F as their antecedents, then forward chaining might well have been even more inefficient. In general, backward chaining is appropriate in cases where there are few possible conclusions (or even just one) and many possible facts, not very many of which are

necessarily relevant to the conclusion. Forward chaining is more appropriate when there are many possible conclusions. The way in which forward or backward chaining is usually chosen is to consider which way an expert would solve the problem. This is particularly appropriate because rule-based reasoning is often used in expert systems.

Rule-Based Expert Systems

An expert system is one designed to model the behavior of an expert in some field, such as medicine or geology. Rule-based expert systems are designed to be able to use the same rules that the expert would use to draw conclusions from a set of facts presented to the system.

The People Involved in an Expert System

The design, development, and use of expert systems involve a number of people. The end user of the system is the person who has the need for the system.

In the case of a medical diagnosis system, this may be a doctor, or it may be an individual who has a complaint that they wish to diagnose. The knowledge engineer is the person who designs the rules for the system, based either on observing the expert at work or on asking the expert questions about how he or she works. The domain expert is very important to the design of an expert system. In the case of a medical diagnosis system, the expert needs to be able to explain to the knowledge engineer how he or she goes about diagnosing illnesses.

Architecture of an Expert System

A typical expert system architecture is shown in Figure 11.1. The knowledge base contains the specific domain knowledge that is used by an expert to derive conclusions from facts. In the case of a rule-based expert system, this domain knowledge is expressed in the form of a series of rules. The explanation system provides information to the user about how the inference engine arrived at its conclusions. This can often be essential, particularly if the advice being given is of a critical nature, such as with a medical diagnosis system.

Fig Expert System Architecture

If the system has used faulty reasoning to arrive at its conclusions, then the user may be able to see this by examining the data given by the explanation system. The fact database contains the case-specific data that are to be used in a particular case to derive a conclusion. In the case of a medical expert system, this would contain information that had been obtained about the patient's condition. The user of the expert system interfaces with it through a user interface, which provides access to the inference engine, the explanation system, and the knowledge-base editor. The inference engine is the part of the system that uses the rules and facts to derive conclusions. The inference engine will use forward chaining, backward chaining, or a combination of the two to make inferences from the data that are available to it. The knowledge-base editor allows the user to edit the information that is contained in the knowledge base.

The knowledge-base editor is not usually made available to the end user of the system but is used by the knowledge engineer or the expert to provide and update the knowledge that is contained within the system.

The Expert System Shell

Note that in Figure 11.1, the parts of the expert system that do not contain domain-specific or case-specific information are contained within the expert system shell. This shell is a general toolkit that can be used to build a number of different expert systems, depending on which knowledge base is added to the shell.

An example of such a shell is CLIPS (C Language Integrated Production System). Other examples in common use include OPS5, ART, JESS, and Eclipse.

Knowledge Engineering

Knowledge engineering is a vital part of the development of any expert system. The knowledge engineer does not need to have expert domain knowledge, but does need to know how to convert such expertise into the rules that the system will use, preferably in an efficient manner. Hence, the knowledge engineer's main task is communicating with the expert, in order to understand fully how the expert goes about evaluating evidence and what methods he or she uses to derive conclusions. Having built up a good understanding of the rules the expert uses to draw conclusions, the knowledge engineer must encode these rules in the expert system shell language that is being used for the task. In some cases, the knowledge engineer will have the freedom to choose the most appropriate expert system shell for the task. In other cases, this decision will have already been made, and the knowledge engineer must work with what he or she is given.

CLIPS (C Language Integrated Production System)

CLIPS is a freely available expert system shell that has been implemented in C.

It provides a language for expressing rules and mainly uses forward chaining to derive conclusions from a set of facts and rules. The notation used by CLIPS is very similar to that used by LISP. The following is an example of a rule specified using CLIPS:

(defrule birthday
  (firstname ?r1 John)
  (surname ?r1 Smith)
  (haircolor ?r1 Red)
  =>
  (assert (is-boss ?r1)))

?r1 is used to represent a variable, which in this case is a person. assert is used to add facts to the database, and in this case the rule is used to draw a conclusion from three facts about the person: if the person has the first name John, has the surname Smith, and has red hair, then he is the boss. This can be tried in the following way:

(assert (firstname x John))
(assert (surname x Smith))
(assert (haircolor x Red))
(run)

At this point, the command (facts) can be entered to see the facts that are contained in the database:

CLIPS> (facts)

f-0 (firstname x John)
f-1 (surname x Smith)
f-2 (haircolor x Red)

f-3 (is-boss x)

So CLIPS has taken the three facts that were entered into the system and used the rule to draw a conclusion, which is that x is the boss. Although this is a simple example, CLIPS, like other expert system shells, can be used to build extremely sophisticated and powerful tools. For example, MYCIN is a well-known medical expert system that was developed at Stanford University in the 1970s. MYCIN was designed to assist doctors in prescribing antimicrobial drugs for blood infections. In this way, experts in antimicrobial drugs are able to provide their expertise to other doctors who are not so expert in that field. By asking the doctor a series of questions, MYCIN is able to recommend a course of treatment for the patient. Importantly, MYCIN is also able to explain to the doctor which rules fired, and therefore is able to explain why it produced the diagnosis and recommended the treatment that it did. MYCIN has proved successful: for example, it has been shown to provide more accurate diagnoses of meningitis in patients than most doctors. MYCIN was developed using LISP, and its rules are expressed as LISP expressions. The following is an example of the kind of rule used by MYCIN, translated into

English:

IF the infection is primary bacteremia
AND the site of the culture is one of the sterile sites
AND the suspected portal of entry is the gastrointestinal tract
THEN there is suggestive evidence (0.7) that the infection is bacteroid

The following is a very simple example of a CLIPS session where rules are defined to operate an elevator:

CLIPS> (defrule rule1
  (elevator ?floor_now)
  (button ?floor_now)
  =>
  (assert (open_door)))
CLIPS> (defrule rule2
  (elevator ?floor_now)
  (button ?other_floor)
  =>
  (assert (goto ?other_floor)))
CLIPS> (assert (elevator floor1))
==> f-0 (elevator floor1)

CLIPS> (assert (button floor3))
==> f-1 (button floor3)

CLIPS> (run)
==> f-2 (goto floor3)

The segments shown after the CLIPS> prompt are inputs by the knowledge engineer, and the remaining lines are CLIPS output. Note that ?floor_now is an example of a variable within CLIPS, which means that any object can match it for the rule to trigger and fire. In our example, the first rule simply says: if the elevator is on a floor, and the button is pressed on the same floor, then open the door. The second rule says: if the elevator is on one floor, and the button is pressed on a different floor, then go to that floor. After the rules, two facts are inserted into the database. The first fact says that the elevator is on floor 1, and the second fact says that the button has been pressed on floor 3.

When the (run) command is issued to the system, it inserts a new fact into the database, which is a command to the elevator to go to floor 3.

Backward Chaining in Rule-Based Expert Systems

A common method for building expert systems is to use a rule-based system with backward chaining. Typically, a user enters a set of facts into the system, and the system tries to see if it can prove any of the possible hypotheses using these facts. In some cases, it will need additional facts, in which case the expert system will often ask the user questions to ascertain facts that could enable further rules to fire.

The algorithm is applied as follows. To prove a conclusion, we must prove a set of hypotheses, one of which is the conclusion. For each hypothesis, H:

- If H is in the facts database, it is proved.
- Otherwise, if H can be determined by asking a question, then enter the user's answer in the facts database. Hence, it can be determined whether H is true or false, according to the user's answer.
- Otherwise, find a rule whose conclusion is H. Now apply this algorithm to try to prove this rule's antecedents.
- If none of the above applies, we have failed to prove H.

Typically, backward chaining is used in combination with forward chaining. Whenever a new fact is added to the database, forward chaining is applied to see if any further facts can be derived. Backward chaining is then used to try to prove each possible hypothesis. Let us imagine a simple medical expert system that is designed to diagnose and treat patients' diseases. The rules might be as follows:

Rule 1: If headache then prescribe painkiller
Rule 2: If headache and sore throat and coughing then diagnose flu

Rule 3: If tired and headache then diagnose glandular fever

Rule 4: If tired and sore throat then diagnose tonsillitis
Rule 5: If tired then diagnose stress

Now we might imagine a session between the user (the patient) and the expert system as follows:

Patient: I have a headache.
System: I suggest you take a painkiller. Do you have a sore throat?
Patient: Yes, I do.
System: Have you been coughing?
Patient: No, why do you ask?
System: If you had a headache and a sore throat and had been coughing, I would have concluded you might have flu. It doesn't sound as though you have flu. Have you been feeling tired recently?
Patient: Yes, I have.
System: It sounds as though you may have glandular fever or tonsillitis.

This was a rather simplistic example, but it illustrates the way that the system uses forward chaining to derive new facts and asks questions in order to carry out its backward chaining analysis. For example, having been told that the patient has a headache, the system first applies forward chaining with Rule 1 and, as a result, adds "prescribe painkiller" to its database of facts. This fact has an action, or a recommendation, associated with it, and so the system advises the patient that she should take a painkiller. Next, the system tries to see if it can prove any other hypotheses. The possible hypotheses are flu, tonsillitis, glandular fever, and stress. First, the system uses backward chaining to try to prove the hypothesis that the patient has the flu.

To prove this hypothesis, the antecedents of Rule 2 must be proved: that the patient has a headache and a sore throat and has been coughing. The patient has already said that she has a headache, so this fact is already in the fact database. Next, the system must establish whether the patient has a sore throat. She says that she does, so this fact is added to the fact database. She has not been coughing, though, so the system concludes that she does not have flu. Note also that at this point the patient asks why the system asked the last question. The system is able to use its explanation facility to provide an explanation for why it asked the question and what conclusion it was able to draw from the answer.

Finally, the patient says that she has been feeling tired, and as a result of this fact being added to the database, Rules 3, 4, and 5 are all triggered. In this case, conflict resolution has been applied in a rather simplistic way, such that Rules 3 and 4 both fire, but 5 does not. In a real medical expert system, it is likely that further questions would be asked, and more sophisticated rules applied to decide which condition the patient really had.
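The question-asking variant of backward chaining can be sketched in Python. This is a hedged illustration of the algorithm described above, not a real consultation system: `ask` stands in for the dialogue with the user, and the canned answers mirror the session with the patient.

```python
# Canned user answers standing in for the interactive dialogue.
answers = {"sore throat": True, "coughing": False, "tired": True}

def ask(symptom):
    """Stand-in for asking the user a question."""
    return answers.get(symptom, False)

def prove(h, rules, facts, askable):
    """Prove hypothesis h from facts, questions, or rule antecedents."""
    if h in facts:                      # already proved
        return True
    if h in askable:                    # determine h by asking a question
        if ask(h):
            facts.add(h)
            return True
        return False
    for antecedents, conclusion in rules:
        if conclusion == h and all(prove(a, rules, facts, askable)
                                   for a in antecedents):
            facts.add(h)
            return True
    return False                        # failed to prove h

rules = [({"headache", "sore throat", "coughing"}, "flu"),
         ({"tired", "headache"}, "glandular fever"),
         ({"tired", "sore throat"}, "tonsillitis"),
         ({"tired"}, "stress")]
facts = {"headache"}                    # the patient's opening statement
askable = {"sore throat", "coughing", "tired"}

for hypothesis in ["flu", "glandular fever", "tonsillitis"]:
    print(hypothesis, prove(hypothesis, rules, facts, askable))
# flu False
# glandular fever True
# tonsillitis True
```

As in the session, flu fails because the coughing question is answered no, while glandular fever and tonsillitis are both proved once tiredness and the sore throat are established.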

CYC

CYC is an example of a frame-based knowledge representation system, which is, in a way, the opposite of an expert system. Whereas an expert system has detailed knowledge of a very narrow domain, the developers of CYC have fed it information on over 100,000 different concepts from all fields of human knowledge. CYC also has information on over 1,000,000 different pieces of "common sense" knowledge about those concepts. The system has over 4,000 different types of links that can exist between concepts, such as inheritance and the "is-a" relationship that we have already looked at. The idea behind CYC was that humans function in the world mainly on the basis of a large base of knowledge built up over our lifetimes and our ancestors' lifetimes.

By giving CYC access to this knowledge, and the ability to reason about it, they felt they would be able to come up with a system with common sense. Ultimately, they predict, the system will be built into word processors. Then word processors will not just correct your spelling and grammar, but will also point out inconsistencies in your document. For example, if you promise to discuss a particular subject later in your document, and then forget to do so, the system will point this out to you. They also predict that search engines and other information retrieval systems will be able to find documents even though they do not contain any of the words you entered as your query.

CYC’s knowledge is segmented into hundreds of different contexts to avoid the problem of many pieces of knowledge in the system contradicting each other. In this way, CYC is able to know facts about Dracula and to reason about him, while also knowing that Dracula does not really exist. CYC is able to understand analogies, and even to discover new analogies for itself, by examining the similarities in structure and content between different frames and groups of frames. CYC’s developers claim, for example, that it discovered an analogy between the concept of “family” and the concept of “country.”

AI Deep Learning Frameworks for DS

Deep learning is arguably the most popular aspect of AI, especially when it comes to data science (DS) applications. But what exactly are deep learning frameworks, and how are they related to other terms often used in AI and data science? In this context, "framework" refers to a set of tools and processes for developing a certain system, testing it, and ultimately deploying it. Most AI systems today are created using frameworks. When developers download and install a framework on their computers, it is usually accompanied by a library. This library (or package, as it is often termed in high-level languages) will be compiled in the programming languages supported by the AI framework. The library acts like a proxy to the framework, making its various processes available through a series of functions and classes in the programming language used. This way, you can do everything the framework enables you to do without leaving the programming environment where you have the rest of your scripts and data. So, for all practical purposes, that library is the framework, even if the framework can manifest in other programming languages too. Thus, a framework supported by both Python and Julia can be accessed through either of these languages, making the language you use a matter of preference. Since enabling a framework to function in a different language is a challenging task for the creators of the framework, the options they provide for compatible languages are often rather limited. But what is a system, exactly? In a nutshell, a system is a standalone program or script designed to accomplish a certain task or set of tasks. In a data science setting, a system often corresponds to a data model. However, systems can include features beyond just models, such as an I/O process or a data transformation process.
The term model refers to a mathematical abstraction used to represent a real-world situation in a simpler, more workable manner. Models in DS are optimized through a process called training, and validated through a process called testing, before they are deployed. Another term that often appears alongside these is methodology, which refers to a set of methods, and the theory behind those methods, for solving a particular type of problem in a certain field. Different methodologies are often geared toward different applications or objectives. It's easy to see why frameworks are celebrities of sorts in the AI world. They help make the modeling aspect of the pipeline faster, and they make the data engineering demanded by deep learning models significantly easier. This makes AI frameworks great for companies that cannot afford a whole team of data scientists, or that prefer to empower and develop the data scientists they already have. These systems are fairly simple, but not quite "plug and play." In this chapter we'll explore the utility behind deep learning models, their key characteristics, how they are used, their main applications, and the methodologies they support.

About deep learning systems

Deep Learning (DL) is a subset of AI that is used for predictive analytics, using an AI system called an Artificial Neural Network (ANN). Predictive analytics is a group of data science methodologies that are related to the prediction of certain variables. This includes various techniques such as classification, regression, etc. As for an ANN, it is a clever abstraction of the human brain, at a much smaller scale. ANNs manage to approximate every function (mapping) that has been tried on them, making them ideal for any data analytics related task. In data science, ANNs are categorized as machine learning methodologies. The main drawback DL systems have is that they are “black boxes.” It is exceedingly difficult – practically unfeasible – to figure out exactly how their predictions happen, as the data flux in them is extremely complicated. Deep Learning generally involves large ANNs that are often specialized for specific tasks. Convolutional Neural Networks (CNNs), for instance, are better for processing images, video, and audio data streams. However, all DL systems share a similar structure. This involves elementary modules called neurons organized in layers, with various connections among them. These modules can perform some basic transformations (usually non-linear ones) as data passes through them. Since there is a plethora of potential connections among these neurons, organizing them in a structured way (much like real neurons are organized in networks in brain tissue) yields a more robust and functional form of these modules. This is what an artificial neural network is, in a nutshell. In general, DL frameworks include tools for building a DL system, methods for testing it, and various other Extract, Transform, and Load (ETL) processes; when taken together, these framework components help you seamlessly integrate DL systems with the rest of your pipeline. We’ll look at this in more detail later in this chapter.
Although deep learning systems share some similarities with machine learning systems, certain characteristics make them sufficiently distinct. For example, conventional machine learning systems tend to be simpler and have fewer options for training. DL systems are noticeably more sophisticated; they each have a set of training algorithms, along with several parameters regarding the systems’ architecture. This is one of the reasons we consider them a distinct framework in data science. DL systems also tend to be more autonomous than their machine learning counterparts. To some extent, DL systems can do their own feature engineering. More conventional systems tend to require more fine-tuning of the feature-set, and sometimes require dimensionality reduction to provide any decent results. In addition, the generalization of conventional ML systems generally doesn’t improve as much as that of DL systems when additional data is provided. This is also one of the key characteristics that makes DL systems a preferable option when big data is involved. Finally, DL systems take longer to train and require more computational resources than conventional ML systems. This is due to their more sophisticated functionality. However, as the work of DL systems is easily parallelizable, modern computing architectures, as well as cloud computing, benefit DL systems the most, compared to other predictive analytics systems.

How AI DL systems work

At their core, all DL frameworks work similarly, particularly when it comes to the development of DL networks. First, a DL network consists of several neurons organized in layers; many of these are connected to other neurons in other layers. In the simplest DL network, connections take place only between neurons in adjacent layers. The first layer of the network corresponds to the features of our dataset; the last layer corresponds to its outputs. In the case of classification, each class has its own node, with node values reflecting how confident the system is that a data point belongs to that class. The layers in the middle involve some combination of these features. Since they aren’t visible to the end user of the network, they are described as hidden (see Figure 1).

The connections among the nodes are weighted, indicating the contribution of each node to the nodes it is connected to in the next layer. The weights are initially randomized, when the network object is created, but are refined as the ANN is trained. Moreover, each node contains a mathematical function that transforms the received signal before it is passed on to the next layer. This is referred to as the transfer function (also known as the activation function). The sigmoid function is the most well-known of these, but others include softmax, tanh, and ReLU. We’ll delve more into these in a moment.
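These transfer functions can be written out in a few lines of plain Python. The sketch below is purely illustrative; the function names are ours, and no DL framework is involved:

```python
import math

def sigmoid(x):
    # squashes any real number into the (0, 1) interval
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # like sigmoid, but with outputs in (-1, 1)
    return math.tanh(x)

def relu(x):
    # passes positive values through unchanged, zeroes out negatives
    return max(0.0, x)

def softmax(xs):
    # turns a list of scores into probabilities that sum to 1
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]
```

Note that softmax, unlike the other three, operates on a whole list of values at once; this is why it is typically used on the output layer of a classifier, where the node values must behave like class probabilities.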

Furthermore, each layer has a bias node, which outputs a constant value that appears unchanged on each layer. Just like all the other nodes, the bias node has a weight attached to its output. However, it has no transfer function. Its weighted value is simply added to the other nodes it is connected to, much like a constant c is added to a regression model in Statistics. The presence of such a term balances out any bias the other terms inevitably bring to the model, ensuring that the overall bias in the model is minimal. As the topic of bias is a very complex one, we recommend you check out some external resources4 if you are not familiar with it. Once the transformed inputs (features) and the biases arrive at the end of the DL network, they are compared with the target variable. The differences that inevitably occur are relayed back to the various nodes of the network, and the weights are changed accordingly. Then the whole process is repeated until the error margin of the outputs is within a certain predefined level, or until the maximum number of iterations is reached. Iterations of this process are often referred to as training epochs, and the whole process is intimately connected to the training algorithm used. In fact, the number of epochs used for training a DL network is often set as a parameter, and it plays an important role in the ANN’s performance. All of the data entering a neuron (via connections with neurons of the previous layer, as well as the bias node) is summed, and then the transfer function is applied to the sum, so that the data flow from that node is y = f(Σ(wi·xi) + b), where wi is the weight of node i of the previous layer, xi is its output, and b is the bias term for that layer. Also, f() is the mathematical expression of the transfer function. This relatively simple process is at the core of every ANN. The process is equivalent to that which takes place in a perceptron system—a rudimentary AI model that emulates the function of a single neuron.
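The neuron computation just described can be sketched directly in plain Python. The weights, inputs, and bias below are arbitrary values chosen for illustration, and we use the sigmoid as the transfer function f():

```python
import math

def neuron_output(weights, inputs, bias):
    # weighted sum of the incoming signals, plus the bias term
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # the sigmoid plays the role of the transfer function f()
    return 1.0 / (1.0 + math.exp(-total))

# a neuron with three incoming connections (all values are arbitrary)
y = neuron_output(weights=[0.5, -0.2, 0.1], inputs=[1.0, 2.0, 3.0], bias=0.4)
```

Here the weighted sum is 0.5 - 0.4 + 0.3 + 0.4 = 0.8, so y is the sigmoid of 0.8 (roughly 0.69). A whole layer simply repeats this computation once per neuron, each with its own weights.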
Although a perceptron system is never used in practice, it is the most basic element of an ANN, and the first system created using this paradigm. The function of a single neuron is basically a single, predefined transformation of the data at hand. This can be viewed as a kind of meta-feature of the framework, as it takes a certain input x and, after applying a (usually non-linear) function f() to it, transforms it into something else, which is the neuron’s output y. While in the majority of cases a single meta-feature would be terrible at predicting the target variable, several of them across several layers can work together quite effectively – no matter how complex the mapping of the original features to the target variable. The downside is that such a system can easily overfit, which is why the training of an ANN doesn’t end until the error is minimal (smaller than a predefined threshold). This most rudimentary description of a DL network applies to networks of the multi-layer perceptron type. Of course, there are several variants beyond this type. CNNs, for example, contain specialized layers with huge numbers of neurons, while recurrent neural networks (RNNs) have connections that go back to previous layers. Additionally, some training algorithms involve pruning nodes of the network to ensure that no overfitting takes place. Once the DL network is trained, it can be used to make predictions about any data similar to the data it was trained on. Furthermore, its generalization capability is quite good, particularly
if the data it is trained on is diverse. What’s more, most DL networks are quite robust when it comes to noisy data, which sometimes helps them achieve even better generalization. When it comes to classification problems, the performance of a DL system is improved by the class boundaries it creates. Although many conventional ML systems create straightforward boundary landscapes (e.g. rectangles or simple curves), a DL system creates a more sophisticated line around each class (reminiscent of the borders of certain counties in the US). This is because the DL system is trying to capture every bit of signal it is given in order to make fewer mistakes when classifying, boosting its raw performance. Of course, this highly complex mapping of the classes makes interpretation of the results a very challenging, if not unfeasible, task. More on that later in this chapter.
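To make the perceptron idea concrete, here is a minimal implementation of the classic perceptron learning rule in plain Python. The learning rate, epoch count, and the choice of the (linearly separable) logical AND function as the training task are all illustrative choices of ours:

```python
def predict(w, b, x):
    # step activation: fire (1) if the weighted sum clears the threshold
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(data, epochs=20, lr=0.1):
    # data: list of (features, label) pairs, with labels in {0, 1}
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, b, x)  # 0 if correct, +1 or -1 otherwise
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# learn the logical AND function, a linearly separable toy problem
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(data)
```

A single perceptron can only draw a straight (linear) class boundary, which is precisely why it fails on non-linear problems and why the layered networks described above, with their far more intricate boundaries, are needed.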

AI Main deep learning frameworks

Having knowledge of multiple DL frameworks gives you a better understanding of the AI field. You will not be limited by the capabilities of a specific framework. For example, some DL frameworks are geared towards a certain programming language, which may make focusing on just that framework an issue, since languages come and go. After all, things change very rapidly in technology, especially when it comes to software. What better way to shield yourself from any unpleasant developments than to be equipped with a diverse portfolio of DL know-how? The main frameworks in DL include MXNet, TensorFlow, and Keras. PyTorch and Theano have also played an important role, but currently they are not as powerful or versatile as the aforementioned frameworks, which we will focus on in this book. Also, for those keen on the Julia language, there is the Knet framework, which, to the best of our knowledge, is the only deep learning framework written mainly in a high-level language (in this case, Julia). You can learn more about it at its Github repository.5 MXNet is developed by Apache, and it’s Amazon’s favorite framework. Some of Amazon’s researchers have collaborated with researchers from the University of Washington to benchmark it and make it more widely known to the scientific community. We’ll examine this framework in Chapter 3. TensorFlow is probably the most well-known DL framework, partly because it has been developed by Google. As such, it is widely used in the industry, and there are many courses and books discussing it. In Chapter 4 we’ll delve into it more. Keras is a high-level framework; it works on top of TensorFlow (as well as other frameworks like Theano). Its ease of use, without any loss of flexibility or power, makes it one of the favorite deep learning libraries today. Any data science enthusiast who wants to dig into the realm of deep learning can start using Keras with reasonably little effort.
Moreover, Keras’ seamless integration with TensorFlow, plus the official support it gets from Google, have convinced many that Keras will be one of the long-lasting frameworks for deep learning models, and that its corresponding library will continue to be maintained. We’ll investigate it in detail in Chapter 5.

AI Main deep learning programming languages

As a set of techniques, DL is language-agnostic; any computer language can potentially be used to apply its methods and construct its data structures (the DL networks), even if each DL framework focuses on specific languages only. This is because it is more practical to develop frameworks that are compatible with just a few languages, and some programming languages, such as Python, are used far more than others. The fact that certain languages are more commonly used in data science plays an important role in language selection, too. Besides, DL is treated more as a data science framework nowadays anyway, so it is marketed mainly to the data science community, as part of Machine Learning (ML). This likely contributes to the confusion about what constitutes ML and AI these days. Because of this, the language that dominates the DL domain is Python. This is also the reason why we use it in the DL part of this book. It is also one of the easiest languages to learn, even if you haven’t done any programming before. However, if you are using a different language in your everyday work, there are DL frameworks that support other languages, such as Julia, Scala, R, JavaScript, Matlab, and Java. Julia is particularly useful for this sort of task, as it is high-level (like Python, R, and Matlab) but also very fast (like lower-level languages such as Java). In addition, almost all the DL frameworks support C / C++, since they are usually written in C or its object-oriented counterpart. Note that all these languages access the DL frameworks through APIs, which take the form of packages in these languages. Therefore, in order to use a DL framework in your favorite language’s environment, you must become familiar with the corresponding package, its classes, and its various functions. We’ll guide you through all that in chapters 3 to 5 of this book.

How to leverage DL frameworks

Deep learning frameworks add value to AI and DS practitioners in various ways. The most important value-adding processes include ETL processes, building data models, and deploying these models. Beyond these main functions, a DL framework may offer other things that a data scientist can leverage to make their work easier. For example, a framework may include some visualization functionality, helping you produce some slick graphics to use in your report or presentation. As such, it’s best to read up on each framework’s documentation, becoming familiar with its capabilities to leverage it for your data science projects.

ETL processes for DL

A DL framework can be helpful in fetching data from various sources, such as databases and files. This is a rather time-consuming process if done manually, so using a framework is very advantageous. The framework will also do some formatting on the data, so that you can start using it in your model without too much data engineering. However, doing some data processing of your own is always useful, particularly if you have some domain knowledge.

Building data models

The main function of a DL framework is to enable you to efficiently build data models. The framework facilitates the architecture design part, as well as all the data flow aspects of the ANN, including the training algorithm. In addition, the framework allows you to view the performance of the system as it is being trained, so that you gain insight about how likely it is to overfit. Moreover, the DL framework takes care of all the testing required before the model is applied to data different from the dataset it was trained on (i.e., new data). All this makes building and fine-tuning a DL data model a straightforward and intuitive process, empowering you to make a more informed choice about what model to use for your data science project.
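The idea of watching a model's performance as it trains can be illustrated with a tiny gradient-descent loop on a one-parameter model. Everything here (the data values, learning rate, and epoch count) is made up for illustration and is not tied to any DL framework; a real framework records a loss history like this one for you:

```python
# fit y = a * x to the points below by gradient descent on the mean
# squared error, keeping the loss per epoch so we can watch it fall
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # roughly y = 2x, with a little noise

a = 0.0         # the single trainable "weight"
lr = 0.01       # learning rate
history = []    # training loss per epoch, as a framework's monitor would record

for epoch in range(50):
    # gradient of the mean squared error with respect to a
    grad = sum(2 * (a * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    a -= lr * grad
    loss = sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    history.append(loss)
```

If the loss on a held-out validation set started rising while this training loss kept falling, that would be the telltale sign of overfitting mentioned above.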

AI Deploying data models

Model deployment is something that DL frameworks can handle, too, making movement through the data science pipeline swifter. This mitigates the risk of errors through this critical process, while also facilitating easy updating of the deployed model. All this enables the data scientist to focus more on the tasks that require more specialized or manual attention. For example, if you (rather than the DL model) worked on the feature engineering, you would have a greater awareness of exactly what is going into the model.

AI Assessing a deep learning framework

DL frameworks make it easy and efficient to employ DL in a data science project. Of course, part of the challenge is deciding which framework to use. Because not all DL frameworks are created equal, there are factors to keep in mind when comparing or evaluating these frameworks. The number of languages supported by a framework is especially important. Since programming languages are particularly fluid in the data science world, it is best to have your language bases covered in the DL framework you plan to use. What’s more, having support for multiple languages in a DL framework enables the formation of a more diverse data science team, with each member having different specific programming expertise. You must also consider the raw performance of the DL systems developed by the framework in question. Although most of these systems use the same low-level language on the back end, not all of them are fast. There may also be other overhead costs involved. As such, it’s best to do your due diligence before investing your time in a DL framework—particularly if your decision affects other people in your organization. Furthermore, consider the ETL processes supported by a DL framework. Not all frameworks are good at ETL, which is both inevitable and time-consuming in a data science pipeline. Again, any inefficiencies of a DL framework in this aspect are not going to be advertised; you must do some research to uncover them yourself. Finally, the user community and documentation around a DL framework are important, too. Naturally, the documentation of the framework is going to be helpful, though in some cases it may leave much to be desired. If there is a healthy community of users for the DL framework you are considering, things are bound to be easier when learning its more esoteric aspects—as well as when you need to troubleshoot issues that may arise.

Interpretability

Interpretability is the capability of a model to be understood in terms of its functionality and its results. Although interpretability is often a given with conventional data science systems, it is a pain point of every DL system. This is because every DL model is a “black box,” offering little to no explanation for why it yields the results it does. Unlike the framework itself, whose various modules and their functionality are clear, the models developed by these frameworks are convoluted graphs. There is no comprehensive explanation as to how the inputs you feed them turn into the outputs they yield. Although obtaining an accurate result through such a method may be enticing, it is quite hard to defend, especially when the results are controversial or carry a demographic bias. The reason for a demographic bias has to do with the data, by the way, so no number of bias nodes in the DL networks can fix that, since a DL network’s predictions can only be as good as the data used to train it. Also, the fact that we have no idea how the predictions correspond to the inputs allows biased predictions to slip through unnoticed. However, this lack of interpretability may be resolved in the future. This may require a new approach to DL systems; but if there’s one thing that the progress of AI systems has demonstrated over the years, it is that innovations are still possible, and that new model architectures are still being discovered. Perhaps one of the newer DL systems will have interpretability as one of its key characteristics.

Model maintenance

Maintenance is essential to every data science model. This entails updating or even upgrading a model in production, as new data becomes available. Alternatively, the assumptions of the problem may change; when this happens, model maintenance is also needed. In a DL setting, model maintenance usually involves retraining the DL network. If the retrained model doesn’t perform well enough, more significant changes may be considered, such as changing the architecture or the training parameters. Whatever the case, this whole process is largely straightforward and not too time-consuming. How often model maintenance is required depends on the dataset and the problem in general. Whatever the case, it is good to keep the previous model available when making major changes, in case the new model has unforeseen issues. Also, the whole model maintenance process can be automated to some extent, at least the offline part, where the model is retrained as new data is integrated with the original dataset.

When to use DL over conventional data science systems

Deciding when to use a DL system instead of a conventional method is an important task. It is easy to be enticed by the new and exciting features of DL, and to use it for all kinds of data science problems. However, not all problems require DL. Sometimes, the extra performance of DL is not worth the extra resources required. In cases where conventional data science systems fail, or where their advantages (such as interpretability) are not needed, DL systems may be preferable. Complex problems with lots of variables, and cases with non-linear relationships between the features and the target variables, are great matches for a DL framework. If there is an abundance of data, and the main objective is good raw performance in the model, a DL system is typically preferable. This is particularly true if computational resources are not a concern, since a DL system requires quite a lot of them, especially during its training phase.
Whatever the case, it’s good to consider alternatives before setting off to build a DL model. While these models are incredibly versatile and powerful, sometimes simpler systems are good enough.

Summary

Deep Learning is a particularly important aspect of AI, and has found a lot of applications in data science. Deep Learning employs a certain kind of AI system called an Artificial Neural Network (ANN). An ANN is a graph-based system involving a series of (usually non-linear) operations, whereby the original features are transformed into a few meta-features capable of predicting the target variable more accurately than the original features. The main frameworks in DL are MXNet, TensorFlow, and Keras, though PyTorch and Theano also play roles in the whole DL ecosystem. Also, Knet is an interesting alternative for those using Julia primarily. There are various programming languages used in DL, including Python, Julia, Scala, JavaScript, R, and C / C++. Python is the most popular.

A DL framework offers diverse functionality, including ETL processes, building data models, deploying and evaluating models, and other functions like creating visuals. A DL system can be used in various data science methodologies, including Classification, Regression, Reinforcement Learning, Dimensionality Reduction, Clustering, and Sentiment Analysis. Classification and regression are supervised learning methodologies, while dimensionality reduction and clustering are unsupervised ones (reinforcement learning is usually treated as a category of its own). Applications of DL include making high-accuracy predictions for complex problems; summarizing data into a more compact form; analyzing images, sound, or video; natural language processing and sentiment analysis; text prediction; linking images to captions; chatbots; and text summarization. A DL framework needs to be assessed on various metrics (not just popularity). Such factors include the programming languages it supports, its raw performance, how well it handles ETL processes, the strength of its documentation and user communities, and the need for future maintenance. It is not currently very easy to interpret DL results and trace them back to specific features (i.e. DL results currently have low interpretability). Giving more weight to raw performance or interpretability can help you decide whether a DL system or a conventional data science system is ideal for your particular problem. Other factors, like the amount of computational resources at your disposal, are also essential for making this decision.

AI Building a DL Network Using MXNet

We’ll begin our in-depth examinations of the DL frameworks with one that seems among the most promising: Apache’s MXNet. We’ll cover its core components, including the Gluon interface, NDArrays, and the MXNet package in Python. You will learn how you can save your work (such as the networks you have trained) to data files, along with some other useful things to keep in mind about MXNet. MXNet supports a variety of programming languages through its API, most of which are useful for data science. Languages like Python, Julia, Scala, R, Perl, and C++ have their own wrappers of the MXNet system, which makes it easy to integrate with your pipeline. Also, MXNet allows for parallelism, letting you take full advantage of your machine’s additional hardware resources, such as extra CPUs and GPUs. This makes MXNet quite fast, which is essential when tackling computationally heavy problems, like the ones found in most DL applications. Interestingly, the DL systems you create in MXNet can be deployed on all kinds of computer platforms, including smart devices. This is possible through a process called amalgamation, which ports a whole system into a single file that can then be executed as a standalone program. Amalgamation in MXNet was created by Jack Deng, and involves the development of .cc files, which use the BLAS library as their only dependency. Files like this tend to be quite large (more than 30,000 lines long). There is also the option of compiling .h files using a program called emscripten. This program is independent of any library, and can be used by other programming languages with the corresponding API. Finally, there exist several tutorials for MXNet, should you wish to learn more about its various functions. Because MXNet is an open-source project, you can even create your own tutorial, if you are so inclined. What’s more, it is a cross-platform tool, running on all major operating systems.
MXNet has been around long enough that it is a topic of much research, including a well-known academic paper by Chen et al.7

Core components

Gluon interface

Gluon is a simple interface for all your DL work using MXNet. You install it on your machine just like any Python library:

pip install mxnet --pre --user

The main selling point of Gluon is that it is straightforward. It offers an abstraction of the whole network building process, which can be intimidating for people new to the craft. Also, Gluon is very fast, not adding any significant overhead to the training of your DL system. Moreover, Gluon can handle dynamic graphs, offering some malleability in the structure of the ANNs created. Finally, Gluon has an overall flexible structure, making the development process for any ANN less rigid. Naturally, for Gluon to work, you must have MXNet installed on your machine (although you don’t need to if you are using the Docker container provided with this book). This is achieved using the familiar pip command:

pip install mxnet --pre --user

Because of its utility and excellent integration with MXNet, we’ll be using Gluon throughout this chapter, as we explore this DL framework. However, to get a better understanding of MXNet, we’ll first briefly consider how you can use some of its other functions (which will come in handy for one of the case studies we examine later).

NDArrays

The NDArray is a particularly useful data structure that’s used throughout an MXNet project. NDArrays are essentially NumPy arrays, but with the added capability of asynchronous CPU processing. They are also compatible with distributed cloud architectures, and can even utilize automatic differentiation, which is particularly useful when training a deep learning system. Still, NDArrays can be effectively used in other ML applications too. NDArrays are part of the MXNet package, which we will examine shortly. You can import the NDArrays module as follows:

from mxnet import nd

To create a new NDArray consisting of 4 rows and 5 columns, for example, you can type the following:

nd.empty((4, 5))

The output will differ every time you run it, since the framework fills the array with whatever values it finds in the parts of memory it allocates to it. If you want the NDArray to have just zeros instead, type:

nd.zeros((4, 5))

To find the number of rows and columns of a variable having an NDArray assigned to it, you use the .shape attribute, just like in NumPy:

x = nd.empty((2, 7))
x.shape

Finally, if you want to find the total number of elements in an NDArray, you use the .size attribute:

x.size

The operations in an NDArray are just like the ones in NumPy, so we won’t elaborate on them here. Contents are also accessed in the same way, through indexing and slicing. Should you want to turn an NDArray into a more familiar data structure from the NumPy package, you can use the asnumpy() function:

y = x.asnumpy()

The reverse can be achieved using the array() function:

z = nd.array(y)

One of the distinguishing characteristics of NDArrays is that they can assign different computational contexts to different arrays—either on the CPU or on a GPU attached to your machine (this is referred to as the “context” of an NDArray). This is made possible by the ctx parameter in all the package’s relevant functions. For example, when creating an array of zeros that you want to assign to the first GPU, simply type:

a = nd.zeros(shape=(5,5), ctx=mx.gpu(0))

Of course, the data assigned to a particular processing unit is not set in stone. It is easy to copy data to a different location, linked to a different processing unit, using the copyto() function: y = x.copyto(mx.gpu(1)) # copy the data of NDArray x to the 2nd GPU

You can find the context of a variable through the .context attribute:

print(x.context)

It is often more convenient to define the context of both the data and the models, using a separate variable for each. For example, say that your DL project uses data that you want to be processed by the CPU, and a model that you prefer to be handled by the first GPU. In this case, you’d type something like:

DataCtx = mx.cpu()
ModelCtx = mx.gpu(0)

MXNet package in Python

The MXNet package (or “mxnet,” with all lower-case letters, when typed in Python) is a very robust and self-sufficient library in Python. MXNet provides deep learning capabilities through the MXNet framework. Importing this package in Python is fairly straightforward:

import mxnet as mx

If you want to perform some additional processes that make the MXNet experience even better, it is highly recommended that you first install the following packages on your computer:

graphviz (ver. 0.8.1 or later)
requests (ver. 2.18.4 or later)
numpy (ver. 1.13.3 or later)

You can learn more about the MXNet package through the corresponding GitHub repository.8

MXNet in action

Now let’s take a look at what we can do with MXNet, using Python, on a Docker image with all the necessary software already installed. We’ll begin with a brief description of the datasets we’ll use, and then proceed to a couple specific DL applications using that data (namely classification and regression). Upon mastering these, you can explore some more advanced DL systems of this framework on your own.

Datasets description

In this section we’ll introduce the two synthetic datasets that we prepared in order to demonstrate classification and regression methods. The first dataset is for classification, and the other is for regression. The reason we use synthetic datasets in these exercises is to maximize our understanding of the data, so that we can evaluate the results of the DL systems independently of data quality. The first dataset comprises 4 variables: 3 features and 1 labels variable. With 250,000 data points, it is adequately large for a DL network to work with. Its small dimensionality makes it ideal for visualization (see Figure 2). It is also made to have a great deal of non-linearity, making it a good challenge for any data model (though not too hard for a DL system). Furthermore, classes 2 and 3 of this dataset are close enough to be confusing, but still distinct. This makes them a good option for a clustering application, as we’ll see later.
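As an aside, a synthetic classification dataset of this general shape (a few features, a labels column, and partly overlapping classes) can be generated with NumPy along the following lines. This is only an illustration of the idea; the class centers, noise level, and sizes below are our own arbitrary choices, not the recipe used for the book's actual data:

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_class, n_classes, n_features = 1000, 3, 3

# give each class its own center, then add noise so that
# neighboring classes partly overlap (as classes 2 and 3 do here)
centers = rng.uniform(-5, 5, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(0, 1.5, size=(n_per_class, n_features))
               for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

data = np.column_stack([X, y])  # 3 feature columns + 1 labels column
```

Knowing exactly how such data was produced is what lets us judge a model's output on its own merits, which is the point made above about preferring synthetic datasets for these exercises.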

The second dataset is somewhat larger, comprising 21 variables: 20 features used to predict the last one, the target variable. With 250,000 data points, again, it is ideal for a DL system. Note that only 10 of the 20 features are relevant to the target variable (which is a combination of these 10). A bit of noise is added to the data to make the whole problem a bit more challenging. The remaining 10 features are just random data that must be filtered out by the DL model. Relevant or not, this dataset has enough features altogether to render a dimensionality reduction application worthwhile. Naturally, due to its dimensionality, we cannot plot this dataset. Loading a dataset into an NDArray Let’s now take a look at how we can load a dataset in MXNet, so that we can process it with a DL model later on. First let’s set some parameters: DataCtx = mx.cpu() # assign context of the data used BatchSize = 64 # batch parameter for dataloader object r = 0.8 # ratio of training data nf = 3 # number of features in the dataset (for the classification problem)

Now, we can import the data like we’d normally do in a conventional DS project, but this time store it in NDArrays instead of Pandas or NumPy arrays (the nd module is imported via from mxnet import nd): with open("../data/data1.csv") as f: data_raw = f.read() lines = data_raw.splitlines() # split the data into separate lines ndp = len(lines) # number of data points X = nd.zeros((ndp, nf), ctx=DataCtx) Y = nd.zeros((ndp, 1), ctx=DataCtx) for i, line in enumerate(lines): tokens = line.split() Y[i] = int(tokens[0]) for token in tokens[1:]: index = int(token[:-2]) - 1 X[i, index] = 1

Now we can split the data into a training set and a testing set, so that we can use it both to build and to validate our classification model: import numpy as np # we’ll be needing this package as well

data = nd.concat(X, Y, dim=1) # combine features and labels into a single array n = int(np.round(ndp * r)) # number of training data points (as an integer index) train = data[:n, ] # training set partition test = data[n:, ] # testing set partition (data[n:], so no data point is skipped) data_train = gluon.data.DataLoader(gluon.data.ArrayDataset(train[:, :nf], train[:, nf]), batch_size=BatchSize, shuffle=True) data_test = gluon.data.DataLoader(gluon.data.ArrayDataset(test[:, :nf], test[:, nf]), batch_size=BatchSize, shuffle=True)
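The split arithmetic above can be sketched independently of MXNet with plain NumPy. The array contents here are placeholders; the point is the index math:

```python
import numpy as np

N = 1000                 # stand-in for the 250,000 points used in the text
r = 0.8                  # ratio of training data
data = np.arange(N * 4, dtype=float).reshape(N, 4)  # 3 features + 1 label column

n = int(np.round(N * r))            # number of training data points (an integer index)
train, test = data[:n], data[n:]    # data[n:], not data[(n + 1):], so no row is lost

print(len(train), len(test))  # 800 200
```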

We’ll then need to repeat the same process to load the second dataset—this time using data2.csv as the source file. Also, to avoid confusion with the dataloader objects of dataset 1, you can name the new dataloaders data_train2 and data_test2, respectively.

Classification with mxnet Now let’s explore how we can use this data to build an MLP system that can discern the different classes within the data we have prepared. For starters, let’s see how to do this using the mxnet package on its own; then we’ll examine how the same thing can be achieved using Gluon. First, let’s define some constants that we’ll use later to build, train, and test the MLP network: nhn = 256 # number of hidden nodes for each layer WeightScale = 0.01 # scale multiplier for weights ModelCtx = mx.cpu() # assign context of the model itself no = 3 # number of outputs (classes) ne = 10 # number of epochs (for training) lr = 0.001 # learning rate (for training) sc = 0.01 # smoothing constant (for training) ns = test.shape[0] # number of samples (for testing)

Next, let’s initialize the network’s parameters (weights and biases) for the first layer: W1 = nd.random_normal(shape=(nf, nhn), scale=WeightScale, ctx=ModelCtx) b1 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)

And do the same for the second layer: W2 = nd.random_normal(shape=(nhn, nhn), scale=WeightScale, ctx=ModelCtx) b2 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)

Then let’s initialize the output layer and aggregate all the parameters into a single data structure called params: W3 = nd.random_normal(shape=(nhn, no), scale=WeightScale, ctx=ModelCtx) b3 = nd.random_normal(shape=no, scale=WeightScale, ctx=ModelCtx) params = [W1, b1, W2, b2, W3, b3]

Finally, let’s allocate some space for a gradient for each one of these parameters: for param in params: param.attach_grad()

Remember that without any non-linear functions in the MLP’s neurons, the whole system would be too rudimentary to be useful. We’ll make use of the ReLU and the Softmax functions as activation functions for our system: def relu(X): return nd.maximum(X, nd.zeros_like(X)) def softmax(y_linear): exp = nd.exp(y_linear - nd.max(y_linear)) partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1)) return exp / partition

Note that the Softmax function will be used in the output neurons, while the ReLU function will be used in all the remaining neurons of the network. For the cost function of the network (or, in other words, the fitness function of the optimization method under the hood), we’ll use the cross-entropy function: def cross_entropy(yhat, y): return - nd.nansum(y * nd.log(yhat), axis=0, exclude=True)

To make the whole system a bit more efficient, we can combine the softmax and the cross-entropy functions into one, as follows: def softmax_cross_entropy(yhat_linear, y): return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
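The reason this combination is more than a convenience is numerical stability: computing the softmax in isolation overflows for large inputs, whereas the combined log-softmax form stays finite. A NumPy illustration of the effect (not MXNet code; the function names are our own):

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)            # overflows to inf for large z
    return e / e.sum()

def log_softmax(z):
    z = z - z.max()          # shifting by the max leaves the result unchanged
    return z - np.log(np.exp(z).sum())

z = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(z))          # [nan nan nan]: exp(1000) overflows
print(np.exp(log_softmax(z)))        # valid probabilities summing to 1
```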

After all this, we can now define the function of the whole neural network, based on the above architecture: def net(X): h1_linear = nd.dot(X, W1) + b1 h1 = relu(h1_linear) h2_linear = nd.dot(h1, W2) + b2 h2 = relu(h2_linear) yhat_linear = nd.dot(h2, W3) + b3 return yhat_linear

The optimization method for training the system must also be defined. In this case we’ll utilize a form of Gradient Descent: def SGD(params, lr): for param in params: param[:] = param - lr * param.grad

For the purposes of this example, we’ll use a simple evaluation metric for the model: accuracy rate. Of course, this needs to be defined first: def evaluate_accuracy(data_iterator, net): numerator = 0. denominator = 0. for i, (data, label) in enumerate(data_iterator): data = data.as_in_context(ModelCtx).reshape((-1, nf)) # nf = 3 features in our dataset label = label.as_in_context(ModelCtx) output = net(data) predictions = nd.argmax(output, axis=1) numerator += nd.sum(predictions == label) denominator += data.shape[0] return (numerator / denominator).asscalar()

Now we can train the system as follows: for e in range(ne): cumulative_loss = 0 for i, (data, label) in enumerate(data_train): data = data.as_in_context(ModelCtx).reshape((-1, nf)) label = label.as_in_context(ModelCtx) label_one_hot = nd.one_hot(label, no) with autograd.record(): output = net(data) loss = softmax_cross_entropy(output, label_one_hot) loss.backward() SGD(params, lr) cumulative_loss += nd.sum(loss).asscalar() test_accuracy = evaluate_accuracy(data_test, net) train_accuracy = evaluate_accuracy(data_train, net) print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss / n, train_accuracy, test_accuracy))

Finally, we can use the system to make some predictions with the following code: def model_predict(net, data): output = net(data) return nd.argmax(output, axis=1) for i, (data, label) in enumerate(data_test): data = data.as_in_context(ModelCtx) pred = model_predict(net, data.reshape((-1, nf))) print('model predictions are:', pred) print('true labels :', label) break

If you run the above code (preferably in the Docker environment provided), you will see that this simple MLP system does a good job at predicting the classes of some unknown data points —even if the class boundaries are highly non-linear. Experiment with this system more and see how you can improve its performance even further, using the MXNet framework. Now we’ll see how we can significantly simplify all this by employing the Gluon interface. First, let’s define a Python class to cover some common cases of Multi-Layer Perceptrons, transforming a “gluon.Block” object into something that can be leveraged to gradually build a neural network, consisting of multiple layers (also known as MLP): class MLP(gluon.Block): def __init__(self, **kwargs): super(MLP, self).__init__(**kwargs) with self.name_scope(): self.dense0 = gluon.nn.Dense(64) # architecture of 1st layer (hidden) self.dense1 = gluon.nn.Dense(64) # architecture of 2nd layer (hidden) self.dense2 = gluon.nn.Dense(3) # architecture of 3rd layer (output) def forward(self, x): # a function enabling an MLP to process data (x) by passing it forward (towards the output layer) x = nd.relu(self.dense0(x)) # outputs of first hidden layer x = nd.relu(self.dense1(x)) # outputs of second hidden layer x = self.dense2(x) # outputs of final layer (output) return x

Of course, this is just an example of how you can define an MLP using Gluon, not a one-size-fits-all kind of solution. You may want to define the MLP class differently, since the architecture you use will have an impact on the system’s performance. (This is particularly true for complex problems where additional hidden layers would be useful.) However, if you find what follows too challenging, and you don’t have the time to assimilate the theory behind DL systems covered in Chapter 1, you can use an MLP object like the one above for your project. Since DL systems are rarely as compact as the MLP above, and since we often need to add more layers (which would be cumbersome in the above approach), it is common to use a different class called Sequential. After we define the number of neurons in each hidden layer, and specify the activation function for these neurons, we can build an MLP like a ladder, with each step representing one layer in the MLP: nhn = 64 # number of hidden neurons (in each layer) af = "relu" # activation function to be used in each neuron net = gluon.nn.Sequential() with net.name_scope(): net.add(gluon.nn.Dense(nhn, activation=af)) net.add(gluon.nn.Dense(nhn, activation=af)) net.add(gluon.nn.Dense(no))

This takes care of the architecture for us. To make the above network functional, we’ll first need to initialize it: sigma = 0.1 # sigma value for distribution of weights for the ANN connections ModelCtx = mx.cpu() lr = 0.01 # learning rate oa = 'sgd' # optimization algorithm net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx) softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss() trainer = gluon.Trainer(net.collect_params(), oa, {'learning_rate': lr}) ne = 10 # number of epochs for training

Next, we must define how we assess the network’s progress, through an evaluation metric function. For the sake of simplicity, we’ll use the standard accuracy rate metric: def AccuracyEvaluation(iterator, net): acc = mx.metric.Accuracy() for i, (data, label) in enumerate(iterator): data = data.as_in_context(ModelCtx).reshape((-1, 3)) label = label.as_in_context(ModelCtx) output = net(data) predictions = nd.argmax(output, axis=1) acc.update(preds=predictions, labels=label) return acc.get()[1]

Finally, it’s time to train and test the MLP, using the aforementioned settings: for e in range(ne): cumulative_loss = 0 for i, (data, label) in enumerate(data_train): data = data.as_in_context(ModelCtx).reshape((-1, 3)) label = label.as_in_context(ModelCtx) with autograd.record(): output = net(data) loss = softmax_cross_entropy(output, label) loss.backward() trainer.step(data.shape[0]) cumulative_loss += nd.sum(loss).asscalar() train_accuracy = AccuracyEvaluation(data_train, net) test_accuracy = AccuracyEvaluation(data_test, net) print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss / ns, train_accuracy, test_accuracy))

Running the above code should yield results similar to those from the conventional mxnet commands. To make things easier, we’ll rely on the Gluon interface in the example that follows. Nevertheless, we still recommend that you experiment with the standard mxnet functions afterwards, should you wish to develop your own architectures (or better understand the theory behind DL). Regression Creating a regression MLP system is similar to creating a classification one, with some differences. The regression case is in fact simpler, since regressors are typically lighter architecturally than classifiers. For this example, we’ll use the second dataset.

First, let’s start by importing the necessary classes from the mxnet package and setting the context for the model: import mxnet as mx from mxnet import nd, autograd, gluon ModelCtx = mx.cpu()

To load data to the model, we’ll use the dataloaders created previously (data_train2 and data_test2). Let’s now define some basic settings and build the DL network gradually: nf = 20 # we have 20 features in this dataset sigma = 1.0 # sigma value for distribution of weights for the ANN connections net = gluon.nn.Dense(1, in_units=nf) # the 1 here is the number of output neurons, which is 1 in regression

Let’s now initialize the network with some random values for the weights and biases: net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx)

Just like any other DL system, we need to define the loss function. Using this function, the system understands how much each deviation from the target variable’s values costs. At the same time, cost functions can also penalize the complexity of the models (since overly complex models tend to overfit): square_loss = gluon.loss.L2Loss()
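For reference, this loss amounts to half the mean squared error over a batch. A plain NumPy sketch of that definition (our own stand-in, not the gluon implementation):

```python
import numpy as np

def l2_loss(pred, label):
    # half the mean squared error over the batch,
    # mirroring the definition behind gluon.loss.L2Loss
    return 0.5 * np.mean((pred - label) ** 2)

pred = np.array([2.0, 0.0, 1.0])
label = np.array([1.0, 0.0, 3.0])
print(l2_loss(pred, label))  # 0.5 * mean([1, 0, 4]) = 0.8333...
```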

Now it’s time to train the network using the data at hand. After we define some essential parameters (just like in the classification case), we can create a loop for the network to train: ne = 10 # number of epochs for training loss_sequence = [] # cumulative loss for the various epochs nb = ns / BatchSize # number of batches trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01}) # optimizer for the network for e in range(ne): cumulative_loss = 0 for i, (data, label) in enumerate(data_train2): # inner loop data = data.as_in_context(ModelCtx) label = label.as_in_context(ModelCtx) with autograd.record(): output = net(data) loss = square_loss(output, label) loss.backward() trainer.step(BatchSize) cumulative_loss += nd.mean(loss).asscalar() print("Epoch %s, loss: %s" % (e, cumulative_loss / ns)) loss_sequence.append(cumulative_loss)

If you wish to view the parameters of the model, you can do so by collecting them into a dictionary structure: params = net.collect_params() for param in params.values(): print(param.name, param.data())

Printing out the parameters may not seem particularly useful, since there are usually too many of them, especially when we add new layers to the system, something we’d accomplish as follows: net.add(gluon.nn.Dense(nhn))

where nhn is the number of neurons in that additional hidden layer. Note that the network requires an output layer with a single neuron, so be sure to insert any additional layers between the input and output layers.

Creating checkpoints for models developed in MXNet As training a system may take some time, the ability to save and load DL models and data through this framework is essential. We must create “checkpoints” in our work so that we can pick up from where we’ve stopped, without having to recreate a network from scratch every time. This is achieved through the following process. First import all the necessary packages and classes, and then define the context parameter: import mxnet as mx from mxnet import nd, autograd, gluon import os ctx = mx.cpu() # context for NDArrays

We’ll then save the data, but let’s put some of it into a dictionary first: data_dict = {"X": X, "Y": Y} # named data_dict to avoid shadowing Python’s built-in dict

Now we’ll set the name of the file and save it: filename = "test.dat" nd.save(filename, data_dict)

We can verify that everything has been saved properly by loading that checkpoint as follows: Z = nd.load(filename) print(Z)

When using gluon, there is a shortcut for saving all the parameters of the DL network we have developed. It involves the save_params() function: filename = "MyNet.params" net.save_params(filename)

To restore the DL network, however, you’ll need to recreate the original network’s architecture, and then load the original network’s parameters from the corresponding file: net2 = gluon.nn.Sequential() with net2.name_scope(): net2.add(gluon.nn.Dense(nhn, activation="relu")) net2.add(gluon.nn.Dense(nhn, activation="relu")) net2.add(gluon.nn.Dense(no)) net2.load_params(filename, ctx=ctx)
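One convenient convention is to encode the epoch number in each checkpoint’s file name, so checkpoints sort naturally. The sketch below is a plain-Python stand-in: a real pipeline would call net.save_params() rather than json.dump, and the file names are our own:

```python
import json
import os
import tempfile

def checkpoint_name(prefix, epoch):
    # zero-padded epoch numbers keep checkpoint files sorted chronologically
    return "%s_epoch%03d.params" % (prefix, epoch)

tmp = tempfile.mkdtemp()
for epoch in range(3):
    params = {"epoch": epoch, "W": [0.1 * epoch]}  # stand-in for real parameters
    with open(os.path.join(tmp, checkpoint_name("mynet", epoch)), "w") as f:
        json.dump(params, f)

print(sorted(os.listdir(tmp)))
# ['mynet_epoch000.params', 'mynet_epoch001.params', 'mynet_epoch002.params']
```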

It’s best to save your work at different parts of the pipeline, and give the checkpoint files descriptive names. It is also important to keep in mind that there is no “untraining” option, and it is likely that the optimal performance occurs before the completion of the training phase. Because of this, we may want to create a checkpoint after each training epoch, so that we can revert to the one at which the optimal performance was achieved. Moreover, for the computer to make sense of these files when you load them in your programming environment, you’ll need to have the nd class of mxnet in memory, in whatever programming language you are using. MXNet tips The MXNet framework is a very robust and versatile platform for a variety of DL systems. Although we demonstrated its functionality in Python, it is equally powerful when used with other programming languages. In addition, the Gluon interface is a useful add-on. If you are new to DL applications, we recommend you use Gluon as your go-to tool when employing the MXNet framework. This doesn’t mean that the framework itself is limited to Gluon, though, since the mxnet package is versatile and robust on a variety of programming platforms. Moreover, in this chapter we covered just the basics of MXNet and Gluon; going through all the details of these robust systems would take a whole book! Learn more about the details of the Gluon interface in the Straight Dope tutorial, which is part of the MXNet documentation.9 Finally, the examples in this chapter are executed in a Docker container; as such, you may experience some lag. When developing a DL system on a computer cluster, of course, everything runs significantly faster. Summary MXNet is a deep learning framework developed by Apache. It exhibits ease of use, flexibility, and high speed, among other perks. All of this makes MXNet an attractive option for DL, in a variety of programming languages, including Python, Julia, Scala, and R. 
MXNet models can be deployed to all kinds of computing systems, including smart devices. This is achieved by exporting them as a single file, to be executed by these devices.

Gluon is a package that provides a simple interface for all your DL work using MXNet. Its main benefits include ease of use, no significant overhead, the ability to handle dynamic graphs for your ANN models, and flexibility. NDArrays are useful data structures when working with the MXNet framework. They can be imported as modules from the mxnet package as nd. They are similar to NumPy arrays, but more versatile and efficient when it comes to DL applications. The mxnet package is Python’s API for the MXNet framework and contains a variety of modules for building and using DL systems. Data can be loaded into MXNet directly from the data file into an NDArray; a dataloader object is then created to feed the data into the model built afterward. Classification in MXNet involves creating an MLP (or some other DL network), training it, and using it to predict unknown data, allocating one neuron for every class in the dataset. Classification is significantly simpler when using Gluon. Regression in MXNet is like classification, but the output layer has a single neuron. Also, additional care must be taken so that the system doesn’t overfit; therefore we often use some regularization function such as L2. Creating project checkpoints in MXNet involves saving the model and any other relevant data into NDArrays, so that you can retrieve them at another time. This is also useful for sharing your work with others, for reviewing purposes. Remember that MXNet generally runs faster than it does on the Docker container used in this chapter’s examples, and that it is equally useful and robust in other programming languages.

Artificial Intelligence Building an Optimizer Based on the Particle Swarm Optimization Algorithm We’ll start our examination of optimization frameworks with one of the most powerful and easy-to-use optimizers, known as Particle Swarm Optimization (or PSO). This optimizer was named after the biological phenomenon of a “swarm,” say of bees or of starlings. In such swarms, large groups of individuals behave in a cooperative manner, more like one large organism than the sum of its parts. The name fits because the optimizer mimics the swarm movement in an attempt to solve the complex optimization problems it is designed for. In fact, many of the other optimizers we’ll discuss later in the book are similarly named after such types of natural phenomena. The significance of PSO lies in the fact that many of the alternative optimizers are merely variations of the cornerstone PSO. As a result, understanding this optimizer grants you access to a whole set of optimization methods that can solve much more than conventional data analytics problems. In fact, their applications span so many fields that one can argue that many data analytics methods are just a niche application of this AI framework. PSO belongs to a general class of systems called Evolutionary Computation, which is a type of Computational Intelligence. Computational Intelligence is a popular subclass of AI (at least in the research world) that involves the development and application of clever ways to solve complex problems, using just a computational approach. In this chapter, we’ll examine the inner workings of PSO, as well as some of its most important variants, with a focus on the Firefly optimizer. We’ll also show how PSO can be implemented in Julia. We’ll close with some useful considerations about PSO, and a summary of the key points of this chapter.

PSO algorithm for AI The logic behind PSO is to have a set of potential solutions (akin to a swarm of particles) that continuously evolve, becoming better and better, based on some fitness function the system tries to maximize or minimize. The particles “move” with varying speeds throughout several dimensions (also called variables), influenced by the best-performing particle, so that they collectively reach an optimal solution in an efficient manner. In addition, each particle “remembers” its best performance historically, and it takes that into account when changing its position. Naturally, the best performing particle may be a different one over the duration of the search (you can imagine the group of solutions moving towards the best possible solution like a swarm of insects, so which insect is closest to that solution is bound to be different every time you look at the swarm). Still, there is generally an improvement in the best solution over time, even if the rate of this improvement gradually diminishes. This is because the closer you get to the best solution, the more likely the swarm is bound to deviate from it (albeit slightly) while “zeroing in” on that best solution. All these traits make the PSO algorithm ideal for optimizing the parameters of a complex system. PSO is relatively new as an algorithm; its creators, Dr. Eberhart and Dr. Kennedy, invented it in 1995. 
The pseudocode of PSO is as follows:
For each particle in the swarm
    Initialize the particle with a random position and velocity
End
Do
    For each particle in the swarm
        Calculate the particle’s fitness value
        If the fitness value is better than the best value in the particle’s history (pBest)
            Set the current value as the new pBest
        End
    End
    Set gBest to the best pBest of all the particles
    For each particle in the swarm
        Update the particle’s velocity, based on its pBest and on gBest
        Update the particle’s position, based on its velocity
    End
While the maximum number of iterations or the minimum error criterion is not attained
In the Julia implementation, the main loop also terminates early when the improvement of the global best (gb) over a window of iterations falls below a tolerance (tol), at which point it returns the best solution (Pg_best) and its fitness value.
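The book’s implementation is in Julia; as a language-neutral companion to the pseudocode above, here is a minimal Python sketch of the same algorithm. The parameter names and default values (w, c1, c2, span, iters) are our own assumptions, not the book’s defaults:

```python
import random

def pso(f, nv, minimize=True, ns=None, iters=200,
        w=0.7, c1=1.5, c2=1.5, span=2.0):
    """Minimal PSO sketch: nv variables, ns particles (default 10 * nv)."""
    ns = ns or 10 * nv
    sign = 1.0 if minimize else -1.0          # maximization = minimizing -f
    P = [[random.uniform(-span, span) for _ in range(nv)] for _ in range(ns)]
    V = [[0.0] * nv for _ in range(ns)]
    pbest = [p[:] for p in P]                 # each particle's best position
    pcost = [sign * f(p) for p in P]          # ...and its (signed) fitness
    g = min(range(ns), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]      # global best position / fitness
    for _ in range(iters):
        for i in range(ns):
            for d in range(nv):
                r1, r2 = random.random(), random.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - P[i][d])
                           + c2 * r2 * (gbest[d] - P[i][d]))
                P[i][d] += V[i][d]
            c = sign * f(P[i])
            if c < pcost[i]:                  # improved on its own history
                pbest[i], pcost[i] = P[i][:], c
                if c < gcost:                 # improved on the swarm's best
                    gbest, gcost = P[i][:], c
    return gbest, sign * gcost

# the polynomial from the chapter's minimization example:
F = lambda X: X[0]**2 + X[1]**2 + abs(X[2]) + abs(X[3] * X[4])**0.5 + 1.0
sol, val = pso(F, 5)
print(round(val, 2))  # close to the true minimum of 1.0
```

The call signature loosely mirrors the chapter’s Julia usage, e.g. pso(F, 5) for minimization and pso(F2, 6, False) for maximization.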

Despite its length, the core of the algorithm is simple, quite fast, and relatively light on computational resources. Note that most of the parameters are optional, since their default values are predefined. Simply feed it the fitness function and the number of variables, and decide whether you want it to be minimized or not. If you don’t specify the latter, the PSO method defaults to minimization of the fitness function. Note that we use here the “vanilla” version of PSO, with minimal add-ons; as a result, its performance is not great. We’ll investigate an improved Julia script of PSO in Chapter 10, along with its parallelized version. PSO in action

The first practical application of PSO proposed by its creators was training ANNs. However, PSO’s flexible nature has made it useful in various other domains, such as combinatorial optimization, signal processing, telecommunications, control systems, data mining, design, power systems, and more. Also, as more specialized algorithms for training ANNs became available, PSO ceased being a relevant option for optimizing the weights of an ANN. Although most versions of PSO involve a single-objective approach (having a single fitness function), with some changes, PSO can be used in multiple-objective and dynamic problems (with varying configurations). The possibility of having constraints in the solution space has also been explored (the constraints in this latter case are inherently different from the constriction PSO variant). So, even though PSO was originally a data science-oriented algorithm, its applicability has made it a useful tool for all sorts of problems. This clearly shows how AI is an independent field that dovetails well with almost any data-related scenario. Nevertheless, some organizational problems require the use of an optimizer, rather than a machine learning system. Examples of such issues include creating an optimal schedule, finding the best way to stock a warehouse, or working out the most efficient route for a delivery driver. These problems are so common in so many industries that familiarity with a robust optimizer like PSO can be a good distinguishing factor, professionally. Besides, having a variety of skills can help you develop a more holistic view of a challenging situation, empowering you to find a better strategy for tackling it. Note that just like any other AI optimizer, PSO does not provide the best solution to a problem, nor does it have mathematical precision. However, it is very efficient. As such, PSO adds a lot of value in cases where an approximate solution is sufficient, especially if the time it takes to find this solution is also important. Furthermore, when the problems involve functions that cannot be easily analyzed mathematically (e.g. functions that aren’t “smooth” enough to calculate a derivative function), a method like PSO is the most viable option. Minimizing a polynomial expression The examples of PSO that follow involve two different problems, expressed by two different fitness functions. The first is a minimization problem, while the second is a maximization problem. Let’s start by defining the fitness function, F, for the first problem, which involves a complex (highly non-linear) polynomial expression: function F(X::Array{Float64}) return y = X[1]^2 + X[2]^2 + abs(X[3]) + sqrt(abs(X[4]*X[5])) + 1.0 end

You can also write the above function as: F(X::Array{Float64}) = X[1]^2 + X[2]^2 + abs(X[3]) + sqrt(abs(X[4]*X[5])) + 1.0

Though more compact, this may not be as useful for complex functions involving a lot of variables. Whatever the case, we expect to get a solution close to (0, 0, 0, 0, 0), since this is the solution that corresponds to the absolute minimum of this function (which is in this case 1.0, since 0^2 + 0^2 + |0| + sqrt(|0*0|) + 1 = 1). Next, we need to run the PSO algorithm, using the above function as an input. We’ll work with the default values for most of the input parameters: pso(F, 5)

One run of PSO yielded the solution [-0.0403686, 0.0717666, -0.0102388, 0.0966982, -0.129386], corresponding to a fitness score of approximately 1.243. Although this solution is not particularly impressive, it is decent, considering the complexity of the problem and the fact that we used the most basic version of the optimizer. We can try a smaller swarm – say, of 20 particles – for comparison: pso(F, 5, true, 20)

The result in this case was [0.164684, -0.241848, 0.0640438, -0.0186612, -0.882855], having a fitness score of about 1.388. Additional runs may yield better scores. This shows that PSO systems can yield acceptable results, even without lots of particles. We can measure how long this whole process takes using the @time meta-command, as follows: @time pso(F, 5)

In this case, for a solution of comparable fitness, we learn that the whole process took about 0.008 seconds—not bad at all. As a bonus, we get some information about how many computational resources the process consumes. That is, 7.179 MB of RAM through its 87.6K allocations. Note that for this report to be accurate, the command must run more than once. This is true of all Julia functions benchmarked using this meta-command.

Maximizing an exponential expression Let’s try something a bit more challenging for the maximization example. This problem consists of six variables, one of which is raised to the 4th power, making the solution space a bit rougher. function F2(X::Array{Float64}) return y = exp(-X[1]^2) + exp(-X[2]^2) + exp(-abs(X[3])) + exp(-sqrt(abs(X[4]*X[5]))) + exp(-X[6]^4) end

Like in the previous case, we expect to get something close to (0, 0, 0, 0, 0, 0) as a solution, since this is the absolute maximum of this function (which is equal to 5.0, since F2(0, 0, 0, 0, 0, 0) = exp(-0^2) + exp(-0^2) + exp(-|0|) + exp(-sqrt(|0*0|)) + exp(-0^4) = 1 + 1 + 1 + 1 + 1 = 5). To use PSO, we simply type: pso(F2, 6, false)

The solution obtained is [0.370003, 0.0544304, 0.0980422, 0.00426721, -0.011095, 0.294815], corresponding to a fitness score of about 4.721, which is quite close to the maximum value we were expecting. Again, we can see how much time and computational resources this whole process took in this case: @time pso(F2, 6, false)

The time the whole problem took was about 0.009 seconds, while it took about 15.006 MB of memory, and around 183.1K allocations. Clearly, this is a somewhat tougher problem, involving a larger swarm, so it takes a bit more time and memory (though the time overhead is quite small). If we were to solve either one of these problems with a deterministic optimizer, though, it would probably take the same computer longer. PSO tips Despite its simplicity, avoiding suboptimal results with PSO requires some attention to detail. For instance, if you use a low value for Vmax, the algorithm will take a long time to converge (not to mention the increased risk of it getting stuck at a local optimum, yielding a mediocre solution). On the other hand, a very large value would make the whole process very unstable (and unable to converge on any optimum).
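The effect of Vmax described above, together with the decay-by-a-factor-k refinement discussed in the tips below, can be sketched in a few lines of Python (the parameter values are illustrative):

```python
def clamp(v, vmax):
    # keep a velocity component within [-vmax, vmax]
    return max(-vmax, min(vmax, v))

def vmax_schedule(vmax0, k=0.9, every=50, iters=300):
    # shrink Vmax by a factor k every `every` iterations,
    # so the swarm searches more finely as it converges
    vmax, out = vmax0, []
    for t in range(iters):
        if t > 0 and t % every == 0:
            vmax *= k
        out.append(vmax)
    return out

sched = vmax_schedule(1.0)
print(clamp(3.7, sched[0]), sched[-1])  # clamped to 1.0; final Vmax = 0.9**5
```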

Furthermore, a very large number of particles makes the whole system fairly slow, while too few particles make it difficult to find the optimum solution. The empirical default value of 10 times the number of variables seems to work well for all the benchmarks tried, but it’s just a rule of thumb; make sure you experiment with this parameter when you fine-tune your PSO model. In addition, in some cases PSO is used with a variable Vmax parameter, to ensure that it converges more smoothly. For example, you can reduce it by a factor k every so many iterations, so that as the swarm approaches the optimum value of the function, its particles will be closer together, yielding better precision. Once you get the hang of PSO, you can experiment with such parameters to improve its performance. What’s more, it’s a good idea to make sure that the swarm covers a meaningful area when deployed, to ensure that it won’t get stuck in a local optimum. In other words, if you are optimizing a set of three parameters that all take values between 0 and 1, it’s best to spread the swarm to cover as much volume as possible, instead of having all the particles close to (0, 0, 0). This is because if the optimal solution is close to (0, 1, 1), for example, it could take the swarm a long time to approach it. Exactly how much area a swarm covers when deployed is something you may want to experiment with, since it largely depends on the problem at hand. Also consider the distribution of the particles across the various dimensions of the problem space. The distribution used in this implementation is Gaussian, as shown through the randn() function used to initialize the particles. The algorithm’s performance can be greatly improved if you parallelize it. 
The best way to do so involves defining a number of workers, each one undertaking an instance of the algorithm, and then comparing their findings, taking the smaller or larger of their solutions, depending on the type of optimization problem you are solving. Make sure you use the @everywhere metacommand in front of all the functions, however, or the parallelization will not work. We’ll further examine the parallelized version of PSO in Chapter 10. Finally, PSO is still a work in progress, so don’t be afraid to experiment a bit, changing it to suit the problem you need to solve. We also recommend you try to implement the Firefly algorithm. We’ll be using the latter a bit in Chapter 10, where we’ll explore the possibilities of optimization ensembles. Summary Particle Swarm Optimization (PSO) is a fundamental optimization algorithm under the umbrella of nature-inspired optimizers. It is also part of the Computational Intelligence group of systems, which is a subclass of AI. PSO entails a set of potential solutions which constantly evolve as a group, becoming better and better, based on some fitness function the system tries to optimize. Just like most robust algorithms of this type, PSO is ideal for tackling complex, highly nonlinear problems, usually involving many variables, such as the parameters of a complex system like an ANN.

PSO is noticeably different from Ant Colony Optimization (ACO) as well as from Genetic Algorithms (GAs). There also exist some differences among the variants of PSO; these differences mainly concern the scope and the specifics of the method. There are various versions of PSO. Firefly is one of the most noteworthy variations, partly due to its distinct approach to the problem space. The “swarm” used in Firefly is a set of fireflies, attracted to each other based on how well they perform on the fitness function the swarm is trying to optimize. Instead of using velocities, the particles in this case are “pulled” by all of the other particles, based on how far away they are and how “bright” they shine. Firefly is generally faster and more accurate as an optimizer, compared to PSO (as well as a few other nature-inspired optimizers). The original PSO and most of its variants are ideal for optimizing continuous variables. The fitness function of an optimizer like PSO does not need to be differentiable, since no derivatives of it are ever calculated. PSO has a variety of applications, including ANN training, signal processing, and combinatorial optimization problems. Different versions of PSO can handle more sophisticated optimization scenarios, such as multiple-objective problems, constraint-based cases, and dynamic problems. One version of PSO (Discrete PSO) even tackles discrete optimization problems. PSO on its own is not as robust as its variants, but it’s very useful to know. Understanding its original form makes learning its variants (or creating new ones) significantly easier.

AI Building an Advanced Deep Learning System The Genetic Algorithm (GA) is a popular optimization method predating most similar approaches to nature-inspired optimizers. It is part of the Evolutionary Computing family of methods, which is a very robust kind of AI. Although this optimization approach was first introduced in the 1960s by Ingo Rechenberg, the GA framework wasn’t fully realized until a bit later, in the early 1970s, by John Holland’s team. Holland popularized this new approach with his book Adaptation in Natural and Artificial Systems, which was published in 1975. GAs are heavily influenced by Darwinian evolution. The idea behind them is that each solution is part of a group of chromosomes that evolve over a number of generations (the equivalent of epochs in ANNs and iterations in PSO). As the group evolves, it gets closer and closer to the optimal solution of the optimization problem it models. We’ll examine the specifics of the GA optimization framework and its core algorithm, see how to implement it in Julia, point out several variants, and discuss how GAs are applicable to data science. The idea of the GA framework is to view the problem as a set of discrete elements, forming what is referred to as a chromosome. Each one of these elements is referred to as a gene, and they can be arbitrary in number, depending on the problem at hand. Although each gene is usually a bit, the encoding can take a variety of forms.14 A collection of all these chromosomes is called a genome. Through a series of processes, the genome evolves into the ideal combination of genes. This “perfect combination” is called the “genotype,” and it encapsulates the solution we are after. The information captured in each gene encoding is referred to as a trait. Unlike in PSO, solution elements of GAs don’t change through motion, but through a pair of processes called mutation and crossover. 
These terms are again borrowed from biology, as the processes are similar to those that occur in replicating DNA. In nature, this process leads to the birth of new organisms; that’s why we refer to different iterations of this evolutionary process as “generations”. Mutation is the simplest process, as it involves a single chromosome. Basically, it ensures that over each generation, there is a chance that some gene in the chromosome will change randomly. The probability of this happening is fairly small, but the whole evolutionary process takes so long that it is almost guaranteed to happen at least once. Furthermore, it is possible to have multiple mutations in the same chromosome (especially if it is large enough). The purpose of mutation is to ensure diversity in the traits, which would otherwise remain stagnant. Crossover (or recombination) is the most common process by which elements change. It involves two chromosomes merging into a single one, at either a random or a predefined location such as the middle, as can be seen in Figure 6. However, certain instances of crossover can involve two locations, or even a logical operator like AND. For simplicity, we’ll work with basic single-point crossover in this chapter.
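The two processes above can be sketched in a few lines (shown in Python for illustration; the helper names are ours, not from the book’s Julia implementation):

```python
import random

def crossover(parent1, parent2, point=None):
    """Single-point crossover: swap the tails of two chromosomes."""
    if point is None:
        point = random.randrange(1, len(parent1))  # random cut, never at the ends
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, pm=0.01):
    """Flip each gene (bit) independently with a small probability pm."""
    return [1 - g if random.random() < pm else g for g in chromosome]
```

With a cut at the middle, crossing [0,0,0,0] with [1,1,1,1] produces [0,0,1,1] and [1,1,0,0]; mutation then occasionally flips individual bits to keep the traits diverse.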

The crossover process ensures that the genome changes over time, through traits that already manifest in the parents (e.g. eye color). Which of these traits survive in the long run depends on another aspect of the evolutionary process called fitness. Not all chromosomes get to cross over, since there exists a selection process to ensure that the best-performing chromosomes are most likely to have descendants in the next generation, much as in a species, where only the better-equipped individuals (e.g. faster, more adaptable, with better immune systems) manage to survive and procreate, ensuring that their genes don’t die out. Fitness is the measure of how well these chromosomes perform as potential solutions to the problem we are solving. Just as with PSO, we are trying to maximize or minimize a fitness function that evaluates a solution. As the number of chromosomes must remain constant through the whole evolutionary process (otherwise we’d risk a population explosion, draining our computational resources), only the best chromosomes make it to the next generation, based on their fitness. Elitism is an auxiliary aspect of the GA framework that is often used to ensure that the best solution is constantly present. It’s like a fail-safe, guarding against the possibility that the new genome is worse than that of the previous generation, due to some bad crossovers and/or bad mutations. Elitism makes sure that the best-performing chromosome or chromosomes remain in the next generation regardless. Although elitism was not part of the original GA framework, it is strongly recommended you make use of it, as it has been shown to generally improve the performance of the optimizer. However, if you overdo it with elitism, carrying too many well-performing chromosomes into the next generation at the expense of other, less well-performing chromosomes, you may end up with an overly homogeneous population. 
This would result in an optimization process that converges prematurely, with the yielded solution more likely to be sub-optimal. Note that the elitism option is controlled by a parameter that indicates how many best-performing chromosomes to keep (see the elitism() function later on). The search space of problems tackled with GAs ideally involves a huge number of potential solutions to the problem—usually larger than what could be solved analytically. A modest example: if a GA tried to solve a problem where each chromosome has 60 genes represented as bits, it would have 2^60, or over a billion billion, potential solutions. In general, problems that lend themselves to GAs fall under the umbrella of “NP-hard” problems. These are problems whose solving cannot be reduced to a fast process, as they take exponential time. This means that if the dimensionality of the problem doubles, the complexity of the problem at least quadruples, and typically grows much faster than that. A typical NP-hard problem with many applications in logistics is the Traveling Salesman Problem (TSP). This involves finding the optimal way to traverse a graph, visiting each node exactly once, so that at the end of your trip you are back where you started. Despite its simple description, this is an exceptionally difficult problem as the number of nodes in the graph gets larger. As the scope of these problems makes finding the best solution quite unrealistic, we opt for a “good enough” solution—one that yields a quite large (or small) value for the fitness function

we are trying to maximize or minimize.

Standard Genetic Algorithm Let’s now look at the actual algorithm that lies at the core of the GA framework, the original Genetic Algorithm itself. The main process is as follows: Initialization stage: Generate a random population of n chromosomes (potential solutions for the problem). Define the fitness function F() and the optimization mode (maximization or minimization). Define stopping conditions such as the maximum number of generations, or the minimum progress of fitness over a given number of generations. Define the crossover and mutation probabilities (pc and pm respectively), as well as the selection scheme. Fitness evaluation: Evaluate the fitness of each chromosome x in the population by calculating F(x). New population: Create a new genome by repeating the following steps until the new set of chromosomes is complete: Selection: Select two parent chromosomes from the population according to their fitness. Namely, select them with a probability p that is proportional to their fitness scores. Crossover: With a crossover probability pc, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of their parents. Mutation: With a mutation probability pm, mutate the new offspring at each position. Population update: Place the new offspring in a new population and discard the previous population. Loop process: Repeat steps 2-3 until a stopping condition has been met. Output results: Output the best-performing chromosome and its fitness score. The selection process involves one of two main methods to stochastically determine which chromosomes get to be parents (candidates for the crossover process) and which don’t. These are roulette wheel selection and rank selection. The first approach involves creating a “wheel” based on the fitnesses of all the chromosomes, by basically normalizing them so that they add up to 1. 
This normalization takes place based on a scaling function like exp(x) or sqrt(x), depending on the problem at hand. Afterward, we obtain a random number in the [0, 1) interval, and we pick the chromosome corresponding to the wheel section that includes that random number. We then repeat that process one more time to find the other parent. The rank selection approach uses the ranking of the fitness scores instead of the scores themselves. So, the worst-performing chromosome will have a value of 1, the second worst a value of 2, and the best one a value of n, where n is the total number of chromosomes. In all other aspects, it’s the same as the roulette wheel approach. The rank selection approach ensures that all chromosomes have a decent chance of getting selected, especially in cases where a small number of chromosomes dominate the population in terms of performance (because they are significantly better than the rest).
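Roulette wheel selection can be sketched as follows (again in Python for illustration; for rank selection you would simply replace the fitness scores with their ranks 1 through n before spinning the wheel):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    wheel = random.uniform(0, total)       # a random point on the wheel
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit                     # each chromosome owns a slice of size fit
        if running >= wheel:
            return chrom
    return population[-1]                  # guard against floating-point rounding
```

Calling this twice yields the two parents for a crossover; chromosomes with larger fitness own larger slices of the wheel and are therefore selected more often.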

With so many parameters in the GA framework, it can be overwhelming to figure out how to use it for your optimization problems. What follows are some rules of thumb for selecting values for these parameters. Naturally, fiddling with these parameters is a great way to learn, but these guidelines will help you at least get started. As far as crossover is concerned, you can use a probability between 0.8 and 0.95. This means that around 90% of the time, there will be a crossover taking place for a given chromosome. Regarding mutation probability, a value around 0.005 to 0.01 generally works. Over time, mutation on its own can produce a decent solution without any crossover at all. Setting this too high will result in a highly unstable genome that will change uncontrollably and never converge. Population size is a bit trickier to set, since a larger population would still work, but take longer for the algorithm to run. That’s why having a number of chromosomes equal to the number of genes in a chromosome is generally a good place to start. When it comes to selection type, generally the roulette wheel method is fine. However, if you find that a small set of chromosomes monopolize the solution process (resulting in largely suboptimal results for the whole system), then rank selection may be a better option. Implementation of GAs in Julia Let’s now look at how this algorithm can be implemented in Julia. Below is a sample implementation of a GA, with the elitism add-on included. We’ve also included a sample fitness function so that you can test it. Note that some variables in this code are abbreviated. 
These are as follows: X = population data (matrix) c = coefficients vector for the sample function, for testing purposes ff = fitness function to maximize nv = number of variables to consider maximize = whether the function needs to be maximized or not ips = initial population size s = desired sum of chromosomes in the generated population (an optional but useful parameter for certain problems) px = probability of event x happening ng = number of generations The code is written to be easily customized whenever needed, as per the functional programming paradigm that Julia follows. [The Julia listing, beginning with the sample fitness function sample_ff(), is not legible in this copy of the text.]

Next Steps We’ve covered many things in this book. Most of the book is dedicated to the popular and useful approaches that have broad applicability in today’s AI challenges. We discussed basic deep learning concepts and models, as well as several programming libraries that prove to be very convenient in implementing deep learning models. The basic optimization algorithm mostly used in deep learning today is backpropagation, and we’ve provided several examples that use this optimization method. However, modern tasks involving AI are rather huge; backpropagation is applicable only to functions that are differentiable. 
Prominent AI researcher François Chollet put it this way:34 “Backprop is a centralized answer, that only applies to differentiable functions where “control” means adjusting some parameters and where the optimization objective is already known. Its range of applicability is minuscule.” With this in mind, we provided other optimization algorithms (such as Particle Swarm Optimization, Genetic Algorithms, and Simulated Annealing) to cover the optimization possibilities of many of the AI tasks, as well as several other applications in both the industry and in scientific research. And finally, we presented a general picture of alternative AI frameworks, so that you can have a complete grasp of artificial intelligence in today’s data science world. In this final chapter of the book, we’ll cover the “next steps” required to advance beyond this book and enhance your understanding of the data science domain. First, we briefly discuss big data, which has become essentially inescapable. After that, we outline some of the prominent specialization areas of data science today. We hope that this discussion will provide you with some perspectives regarding the broad areas that can be involved in your future projects. Finally, we list some useful, publicly-available datasets that you can use to practice and to research.

Big data The term “Big Data” is probably one of the most commonly used terms in computer science today. Although the definition of the term is somewhat blurry (as the name includes the very subjective term “big”), we can succinctly define it as an amount of data big enough that an average personal computer is unable to process it. By this definition, the term “big” refers to terabytes, petabytes, or even exabytes of data (to give you a sense of proportion, if you gathered all the data a person would normally access throughout his or her lifetime, it’s unlikely to reach over 1 exabyte). To further narrow the definition, consider the four Vs described by IBM.35 Big data is constantly increasing in volume, arriving with growing velocity, sourced from ever-increasing variety, and carrying uncertainty around veracity. These so-called four Vs—volume, velocity, variety, and veracity—are among the most important characteristics of big data in today’s highly-connected digital world. Why is big data important? The answer to this question lies within just two statements: Big data enables us to get more complete answers to our questions, because it includes a lot of information. We rely on these answers with more confidence, because as the amount of data increases, our confidence level usually goes up. The successes of the deep learning algorithms that we covered in the previous chapters are also related to big data. The two main reasons that deep learning has been so successful are the advances in the computational power of modern computers, and the huge amount of data that is available for use. To work with huge amounts of data, two recent technologies are quite important; we dedicate the rest of the big data discussion to these technologies. One of them is Hadoop and the other is Spark, both open-source projects created by the Apache Software Foundation. We encourage you to learn more about them, as they are especially useful in tackling huge amounts of data.

Hadoop If we want to store data just for backup purposes, then the amount of data may not be a first-order concern for a data scientist. More often than not, though, we want to analyze that data, gleaning useful insights from the information hidden there. A distributed data storage technology called Hadoop was developed just to meet this need. Hadoop (usually referred to as a “Hadoop cluster”) has four key components: Commons, YARN, HDFS, and MapReduce. Commons is Hadoop’s utilities structure, and YARN is a tool for resourcing and scheduling in Hadoop. HDFS is the abbreviation for Hadoop Distributed File System; this is where the data is actually stored. MapReduce is a data processing tool for distributed systems. Figure 26 illustrates the components of Hadoop in a simple way.

As a data scientist, you probably use a larger set of tools than just the above four. You use these tools because they make your life easier: they usually abstract away the architectural and administrative parts of working in a Hadoop cluster. What a data scientist usually wants to do with data stored on a Hadoop cluster is query it as if it were in a single-machine relational database management system, rather than administer the cluster itself. To this end, you can use Hive and its querying language, HiveQL. Using HiveQL, one can forget about the distributed nature of the data and write queries as if all the data were on a single machine. Due to the huge volumes of data on HDFS, though, HiveQL queries usually take more time than traditional SQL queries! Other tools (like Presto) provide faster access to the data on HDFS, but they are usually more computationally expensive, and their requirements are more demanding.

Apache Spark So far we’ve covered how to store and access big data in a clustered and distributed manner. Since the ultimate objective of a data scientist is to build analytical models upon that data, we need some other technology to conveniently work with the data stored on HDFS. Apache Spark is one of the most commonly-used technologies that provides distributed computing. Spark simplifies some common tasks including resource scheduling, job execution, and reliable communication between nodes of a cluster. As of 2018, Spark supports five programming languages: Scala, Java, Python, R, and SQL. Spark uses a data representation called the Resilient Distributed Dataset (RDD), which is very similar to the dataframes in Python, Julia, and R. RDD supports distributed computing by working on multiple machines, where it uses the memories of all these distributed computers in a cluster. Once an RDD is in place, you can use Spark to interact with this distributed data as if the data were in the memory of a single machine. This way, Apache Spark isolates the distributed nature of the computing infrastructure from the user, which is very convenient when working on a data science task. Figure 27 depicts the basic architecture of Apache Spark. The Python example below illustrates how a few lines of code in Apache Spark can easily count words in a document stored on HDFS: text_file = sc.textFile("hdfs://...") counts = text_file.flatMap(lambda line: line.split(" ")) \ .map(lambda word: (word, 1)) \ .reduceByKey(lambda a, b: a + b) counts.saveAsTextFile("hdfs://...")

The first line reads distributed data from an HDFS. The flatMap() function separates each word in the text file. The map() function combines each word with 1 as a key-value tuple. The reduceByKey() function adds up the numbers for the same keys (which are words, in this example). The last line just saves the results—that’s it!

Machine Learning for AI

AI Machine Synopsis In this part we will focus on various techniques that form the foundation of all static Machine Learning models. Static ML models address the class of problems where the data is static and there is no concept of time series. The model does not learn incrementally; if there is a new set of data, the entire learning process is repeated. We will discuss these techniques from the conceptual and mathematical standpoint, without going too deeply into the mathematical details and proofs. We will point the reader to reference papers and books for more details on the theory whenever required.

Linear Methods

4.1 Introduction In general, machine learning algorithms are divided into two types: supervised learning algorithms and unsupervised learning algorithms. Supervised learning algorithms deal with problems that involve learning with guidance. In other words, the training data in supervised learning methods needs labelled samples. For example, for a classification problem we need samples with class labels, and for a regression problem we need samples with a desired output value for each sample. The underlying mathematical model then learns its parameters using the labelled samples, after which it is ready to make predictions on samples that the model has not seen, called test samples. Most of the applications in machine learning involve some form of supervision, and hence most of the chapters in this part of the book will focus on the different supervised learning methods. Unsupervised learning deals with problems that involve data without labels. In some sense one can argue that this is not really a machine learning problem, as there is no knowledge to be learned from past experiences. Unsupervised approaches try to find some structure or some form of trends in the training data. Some unsupervised algorithms try to understand the origin of the data itself. A common example of unsupervised learning is clustering. In Chap. 2 we briefly talked about the linearity of the data and models. Linear models are the machine learning models that deal with linear data, or with nonlinear data that can somehow be transformed into linear data using suitable transformations. Although these linear models are relatively simple, they illustrate fundamental concepts in machine learning theory and pave the way for more complex models. These linear models are the focus of this chapter.

4.2 Linear and Generalized Linear Models The models that operate on strictly linear data are called linear models, and the models that use some nonlinear transformation to map original nonlinear data to linear data and then process it are called generalized linear models. The concept of linearity in the case of supervised learning implies that the relationship between the input and output can be described using linear equations. For unsupervised learning, the concept of linearity implies that the distributions that we can impose on the given data are defined using linear equations. It is important to note that the notion of linearity does not imply any constraints on the dimensions. Hence we can have multivariate data that is strictly linear. In the case of one-dimensional input and output, the equation of the relationship would define a straight line in two-dimensional space. In the case of two-dimensional data with one-dimensional output, the equation would describe a two-dimensional plane in three-dimensional space, and so on. In this chapter we will study all these variations of linear models.

4.3 Linear Regression

Linear regression is a classic example of a strictly linear model. It is also called polynomial fitting, and it is one of the simplest linear methods in machine learning. Let us consider a problem of linear regression where the training data contains p samples. The input is n-dimensional, (x_i, i = 1, . . . , p) with x_i ∈ ℝⁿ. The output is single-dimensional, (y_i, i = 1, . . . , p) with y_i ∈ ℝ.

4.3.1 Defining the Problem The method of linear regression defines the following relationship between input x_i and predicted output ŷ_i in the form of a linear equation:

ŷ_i = Σ_{j=1}^{n} x_ij · w_j + w_0    (4.1)

ŷ_i is the predicted output when the actual output is y_i. The w_j, j = 1, . . . , n are called the weight parameters and w_0 is called the bias. Evaluating these parameters is the objective of training. The same equation can also be written in matrix form as

ŷ = X^T · w + w_0    (4.2)

where X = [x_i^T], i = 1, . . . , p and w = [w_j], j = 1, . . . , n. The problem is to find the values of all weight parameters using the training data.

4.3.2 Solving the Problem The most commonly used method to find the weight parameters is to minimize the mean square error between the predicted and actual values. This is called the least squares method. When the error is distributed as Gaussian, this method yields an estimate called the maximum likelihood estimate, or MLE. This is the best unbiased estimate one can find given the training data. The optimization problem can be defined as

min Σ_{i=1}^{p} (y_i − ŷ_i)²    (4.3)

Expanding the predicted value term, the full minimization problem to find the optimal weight vector w^lr can be written as

w^lr = arg min_w Σ_{i=1}^{p} ( y_i − Σ_{j=1}^{n} x_ij · w_j − w_0 )²    (4.4)
This is a standard quadratic optimization problem and is widely studied in the literature. As the entire formulation is defined using linear equations, only linear relationships between input and output can be modelled. Figure 4.1 shows an example.

Fig. 4.1 Plot of logistic sigmoid function
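Since Eq. 4.4 is a standard quadratic problem, it admits the well-known least-squares solution. As a minimal illustration (in Python with NumPy; the function and variable names are our own), the bias w_0 can be absorbed into the weight vector via a column of ones:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit: returns (w, w0) minimizing sum((y - X.w - w0)^2).

    X is a (p, n) matrix of p samples; y is a length-p vector of outputs.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # append a bias column of ones
    w_full, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # solves the normal equations
    return w_full[:-1], w_full[-1]                   # (weights, bias)
```

Fitting noiseless data generated by y = 2x + 1 recovers w ≈ 2 and w_0 ≈ 1 exactly, as expected from Eq. 4.4.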

4.4 Regularized Linear Regression Although, in general, the solution obtained by solving Eq. 4.4 gives the best unbiased estimate, in some specific cases, where it is known that the error distribution is not Gaussian or the optimization problem is highly sensitive to noise in the data, the above procedure can result in what is called overfitting. In such cases, a mathematical technique called regularization is used. 4.4.1 Regularization Regularization is a formal mathematical technique that modifies the problem statement with additional constraints. The main idea behind regularization is to simplify the solution. The theory of regularization is typically attributed to the Russian mathematician Andrey Tikhonov. Many problems are what is referred to as ill posed. This means that the training data, if used to its full extent, can produce a solution that is highly overfitted and possesses poor generalization capabilities. Regularization tries to add additional constraints on the solution, thereby making sure that overfitting is avoided and the solution is more generalizable. The full mathematical theory of regularization is quite involved; the interested reader can refer to [39]. Multiple regularization approaches have been proposed in the literature and one can experiment with many. However, we will discuss the two most commonly used ones. The approaches discussed below are also sometimes referred to as shrinkage methods, as they try to shrink the weight parameters close to zero.

4.4.2 Ridge Regression In the Ridge Regression approach, the minimization problem defined in Eq. 4.4 is constrained with an additional penalty on the L2 norm of the weights, λ · Σ_{j=1}^{n} w_j², which shrinks all the weights smoothly toward zero.

4.4.3 Lasso Regression In the Lasso Regression approach, the minimization problem defined in Eq. 4.4 is constrained with an additional penalty on the L1 norm of the weights, λ · Σ_{j=1}^{n} |w_j|, which tends to drive some weights exactly to zero, yielding sparse solutions.
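As a rough illustration of the difference between the two (in Python with NumPy; the names and the choice not to penalize the bias are our own assumptions): ridge, with its L2 penalty, retains a closed-form solution, whereas lasso’s L1 penalty has no closed form and is usually solved iteratively:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Ridge solution w = (X'X + lam*I)^(-1) X'y, with an unpenalized bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column of ones
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0                              # do not shrink the bias term
    w_full = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
    return w_full[:-1], w_full[-1]
```

As lam approaches 0 this reduces to ordinary least squares; a large lam shrinks the weights toward zero, trading some bias for lower variance.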

4.5 Generalized Linear Models (GLM) The Generalized Linear Models or GLMs represent a generalization of linear models, expanding their scope to handle nonlinear data that can be converted into linear form using suitable transformations. The obvious limitation of linear regression is the assumption of a linear relationship between input and output. In quite a few cases, a nonlinear relationship between input and output can be converted into a linear one by adding an additional step of transforming one of the data (input or output) into another domain. The function that performs this transformation is called a basis function or link function. For example, logistic regression uses the logistic function as its basis function to transform the nonlinearity into linearity. The logistic function is a special case in that it also maps the output into the range [0, 1], so that it can be interpreted as a probability. Also, sometimes the response between input and output is monotonic, but not necessarily linear due to discontinuities. Such cases can also be converted into a linear space with the use of specially constructed basis functions. We will discuss logistic regression to illustrate the concept of GLM.

4.5.1 Logistic Regression

Logistic regression adds a sigmoidal (logistic) function on top of linear regression to constrain the output y_i ∈ [0, 1], rather than y_i ∈ ℝ as in linear regression. The relationship between input and predicted output for logistic regression can be given as

ŷ_i = σ( Σ_{j=1}^{n} x_ij · w_j + w_0 )    (4.10)
As the output is constrained between [0, 1], it can be treated as a probabilistic measure. Also, due to the symmetric shape of the logistic function’s output as its input ranges over (−∞, ∞), it is better suited for classification problems. Other than these differences, there is no fundamental difference between linear and logistic regression. Although there is a nonlinear sigmoid function present in the equation, it should not be mistaken for a nonlinear method of regression. The sigmoid function is applied after the linear mapping between input and output, and at heart this is still a variation of linear regression. The minimization problem that needs to be solved for logistic regression is a trivial update of the one defined in Eq. 4.4. Because it is valid for classification as well as regression problems, logistic regression, unlike linear regression, is commonly used in the field of machine learning as a default first choice.
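A minimal logistic regression trained by gradient descent on the cross-entropy loss might be sketched as follows (Python; the learning rate, epoch count, and names are illustrative choices, not prescriptions):

```python
import numpy as np

def sigmoid(z):
    """The logistic function, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Gradient descent for the model of Eq. 4.10.

    X: (p, n) matrix of inputs; y: length-p vector of 0/1 labels.
    """
    p, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + w0)   # predictions in (0, 1)
        err = y_hat - y               # gradient of cross-entropy w.r.t. the logits
        w -= lr * (X.T @ err) / p
        w0 -= lr * err.mean()
    return w, w0
```

On a simple separable one-dimensional dataset, the learned weights push the sigmoid’s 0.5 threshold to the boundary between the two classes.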

4.6 k-Nearest Neighbor (KNN) Algorithm

The KNN algorithm is not exactly an example of a linear method, but it is one of the simplest algorithms in machine learning, and it is apt to discuss it here in the first chapter of this part. KNN is also a generic method that can be used as a classifier or a regressor. Unlike the linear methods described earlier in this chapter, this algorithm does not assume any type of equation or functional relationship between the input and output.

4.6.1 Definition of KNN

In order to illustrate the concept of the k-nearest neighbor algorithm, consider a case of 2-dimensional input data as shown in Fig. 4.2. The top plot in the figure shows the distribution of the data. Let there be some relationship between this input data and the output data (not shown here); for the time being we can ignore its form. Let us use the value k = 3. As shown in the bottom plot, let there be a test sample located at the red dot. We then find the 3 nearest neighbors of the test point in the training distribution as shown. Now, in order to predict the output value for the test point, all we need to do is find the value

Fig. 4.2 Figure showing a distribution of input data and showing the concept of finding nearest neighbors

of the output for the 3 nearest neighbors and average that value. This can be written in equation form as

ŷ = (1/k) Σ_{i=1}^{k} y_i    (4.11)

where y_i is the output value of the ith nearest neighbor. As can be seen, this is one of the simplest ways to define the input-to-output mapping. There is no need to assume any prior knowledge, or to perform any type of optimization. All you need to do is keep all the training data in memory, find the nearest neighbors for each test point, and predict the output. This simplicity does come at a cost, though: this lazy execution of the algorithm requires a heavy memory footprint along with high computation to find the nearest neighbors for each test point. However, when the data is fairly densely populated and the computation requirements can be handled by the hardware, KNN produces good results in spite of its overly simplistic logic.
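The averaging of the k nearest outputs can be sketched as follows; this is a minimal NumPy illustration using Euclidean distance, with names of our own choosing:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict by averaging the outputs of the k nearest training points (Eq. 4.11)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    return y_train[nearest].mean()

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([1.0, 2.0, 3.0, 100.0])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3))  # 2.0
```

The distant outlier at (5, 5) never enters the average, which hints at the local character of the method discussed next.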

4.6.2 Classification and Regression

As the formula in Eq. 4.11 can be applied to classification as well as regression problems, KNN can be applied to both types of problems without any change in the architecture. Figure 4.2 showed an example of regression. Also, as KNN is a local method as opposed to a global method, it can easily handle nonlinear relationships, unlike the linear methods described above. Consider the two-class nonlinear distribution shown in Fig. 4.3. KNN can easily separate the two classes by creating the circular boundaries shown, based on the local neighborhood information expressed by Eq. 4.11.

4.6.3 Other Variations of KNN

As such, the KNN algorithm is completely described by Eq. 4.11. However, there exist some variations of the algorithm, such as weighted KNN, where the value of each neighbor's output is inversely weighted by its distance from the test point. In another variation, instead of the Euclidean distance one can use the Mahalanobis distance [28] to accommodate variable variance of the data along different dimensions.

Fig. 4.3 Figure showing nonlinear distribution of the data

4.7 Conclusion

In this chapter we looked at some simple techniques to introduce the topic of machine learning algorithms. Linear methods form the basis of all the subsequent algorithms that we will study throughout the book. Generalized linear models extend the scope of the linear methods to some simple nonlinear cases as well as probabilistic settings. KNN is another simple technique that can be used to solve the most basic problems in machine learning, and it also illustrates the use of local methods as opposed to global methods.

AI Perceptron & Neural Networks

5.1 Introduction

The perceptron was introduced by Rosenblatt [44] as a generalized computational framework for solving linear problems. It was quite effective, a one-of-a-kind machine at the time, and it seemingly had unlimited potential. However, some fundamental flaws were soon detected in the theory that limited the scope of the perceptron significantly. All these difficulties were overcome in time with the addition of multiple layers in the architecture, converting the perceptron into an artificial neural network, and with the addition of nonlinear activation functions like the sigmoid. We will study the concept of the perceptron and its evolution into modern artificial neural networks in this chapter. However, we will restrict the scope to small neural networks and will not delve into deep networks; those will be studied later in a separate chapter.

5.2 Perceptron

Geometrically, a single-layered perceptron with linear mapping represents a linear plane in n dimensions. In n-dimensional space the input vector is represented as (x1, x2, . . . , xn) or x. The coefficients or weights are represented as (w1, w2, . . . , wn) or w. The equation of the perceptron in n dimensions is then written in vector form as

y = w · x + w_0 = Σ_{i=1}^{n} w_i x_i + w_0    (5.1)

Figure 5.1 shows an example of an n-dimensional perceptron. This equation looks a lot like the linear regression equation that we studied in Chap. 4, which is essentially accurate, as the perceptron represents a computational architecture for solving that same problem.
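The perceptron equation above amounts to a single dot product plus a bias. A minimal sketch (names are illustrative, not from the text):

```python
import numpy as np

def perceptron(x, w, w0):
    """Single-layer perceptron with linear mapping: y = w . x + w0."""
    return np.dot(w, x) + w0

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5, 1.0])
print(perceptron(x, w, w0=0.1))  # 0.5 - 1.0 + 3.0 + 0.1 = 2.6
```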

Fig. 5.1 Perceptron

5.3 Multilayered Perceptron or Artificial Neural Network

The multilayered perceptron (MLP) is a logical extension of the single-layer architecture, where we use multiple layers instead of one. Figure 5.2 shows an illustration of a generic MLP with m layers. Let n1 be the number of nodes in layer 1, which is the same as the input dimensionality. The subsequent layers have ni nodes, where i = 2, . . . , m. The number of nodes in all the layers except the first can take any arbitrary value, as they are not dependent on the input or output dimensionality. One other obvious difference between the single-layer perceptron and the MLP is full connectedness: each internal node is now connected to all the nodes in the subsequent layer. However, as long as we are using the linear mapping described above, the single-layer perceptron and the multilayered perceptron are mathematically equivalent. In other words, having multiple layers does not really improve the capabilities of the model, and this can be rigorously proved mathematically.

5.3.1 Feedforward Operation

The network shown in Fig. 5.2 also emphasizes another important aspect of the MLP, called the feedforward operation. Information entered at the input propagates through each layer towards the output. There is no feedback of information from any layer backwards when the network is used for predicting the output, whether for regression or classification. This process closely resembles the operation of the human brain.

Fig. 5.2 Multilayered perceptron

5.3.2 Nonlinear MLP or Nonlinear ANN

The major improvement in the MLP architecture comes from using a nonlinear mapping. Instead of using a simple dot product of the input and weights, a nonlinear function, called an activation function, is used.

5.3.2.1 Activation Functions

The simplest activation function is the step function, also called the sign function, shown in Fig. 5.3. This activation function is suited for applications like binary classification. However, as it is not a continuous function, it is not suitable for most training algorithms, as we will see in the next section. The continuous counterpart of the step function is the sigmoid or logistic function discussed in the previous chapter. Sometimes a hyperbolic tangent or tanh function is used, which has a similar shape but ranges over [−1, 1] instead of [0, 1] as the sigmoid does. Figure 5.4 shows the plot of the tanh function.

5.3.3 Training MLP

During the training process, the weights of the network are learned from the labelled training data. Conceptually the process can be described as follows. Present the input to the neural network; all the weights of the network are assigned some default value.

The input is transformed into the output by passing through each node or neuron in each layer.

Fig. 5.3 Activation function sign

Fig. 5.4 Activation function tanh

The output generated by the network is then compared with the expected output or label. The error between the prediction and the label is then used to update the weights: the error is propagated backwards through every layer, updating the weights in each layer so as to minimize the error. Cömert [45] summarizes various backpropagation training algorithms commonly used in the literature along with their relative performance. I am not going to go into the mathematical details of these algorithms here, as the theory quickly becomes quite advanced and can make the topic very hard to understand. Also, as we will see in the implementation part of the book, with a conceptual understanding of the training framework and open source libraries, one is sufficiently armed to apply these concepts to real problems. Thus the backpropagation algorithm for training and the feedforward operation for prediction mark the two phases in the life of a neural network. Backpropagation-based training can be done in one of two modes:
• Online or stochastic learning
• Batch learning

5.3.3.1 Online or Stochastic Learning

In this method a single sample is sent as input to the network, and based on the output error the weights are updated. The optimization method most commonly used to update the weights is called stochastic gradient descent, or SGD. The word stochastic here implies that the samples are drawn randomly from the whole data set, rather than used sequentially. The process can converge to the desired accuracy level even before all the samples are used. It is important to understand that in stochastic learning a single sample is used in each iteration, so the learning path is noisier. In some cases, rather than a single sample, a mini-batch of samples is used. SGD is beneficial when the learning path is expected to contain multiple local minima.
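The single-sample update can be sketched as follows. This is a simplified illustration for one linear neuron with squared error loss (the data, learning rate, and names are our assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, w0, x, y, lr=0.1):
    """One stochastic gradient descent update on a single sample,
    using squared error loss for a linear neuron."""
    err = (np.dot(w, x) + w0) - y          # prediction error for this one sample
    return w - lr * err * x, w0 - lr * err

# toy data generated from y = 2*x1 - x2
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - X[:, 1]
w, w0 = np.zeros(2), 0.0
for _ in range(2000):
    i = rng.integers(len(X))               # samples drawn at random, not sequentially
    w, w0 = sgd_step(w, w0, X[i], y[i])
print(w)  # approaches [2, -1]
```

Each iteration touches only one sample, so the path is noisy, but over many random draws the weights settle near the true values.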

5.3.3.2 Batch Learning

In the batch method the total data set is divided into a number of batches. An entire batch of samples is sent through the network before the error is computed and the weights are updated. Each batch processed is called one iteration, and when all the samples have been used once, that is considered one epoch of the training process. Typically multiple epochs are needed before the algorithm fully converges. As batch learning uses a batch of samples in each iteration, it reduces the overall noise and the learning path is cleaner. However, the process is a lot more computation heavy and needs more memory and compute resources. Batch learning is preferred when the learning path is expected to be relatively smooth.

5.3.4 Hidden Layers

The concept of hidden layers needs a little more explanation. They are not directly connected to the inputs and outputs, and there is no theory for how many such layers are optimal in a given application. Each layer in the MLP transforms its input into a new dimensional space. The hidden layers can have higher dimensionality than the actual input, and thus they can transform the input into an even higher dimensional space. Sometimes, if the distribution of the input in its original space has nonlinearities and is ill conditioned, the higher dimensional space can improve the distribution and, as a result, the overall performance. These transformations also depend on the activation function used. Increasing the dimensionality of a hidden layer also makes the training process that much more complicated, and one needs to carefully trade off the added complexity against the performance improvement. How many hidden layers to use is another variable for which there are no theoretical guidelines. Both of these parameters are called hyperparameters, and one needs to do an open-ended exploration over a grid of possible values and choose the combination that gives the best results within the constraints of the training resources.

5.4 Radial Basis Function Networks

Radial basis function networks (RBFN), or radial basis function neural networks (RBFNN), are a variation of feedforward neural networks (we will call them RBF networks to avoid confusion). Although their architecture, shown in Fig. 5.5, looks similar to the MLP described above, functionally they are closer to support vector machines with radial kernel functions. RBF networks are characterized by three layers: an input layer, a single hidden layer, and an output layer. The input and output layers are linear weighting functions, and the hidden layer has a radial basis activation function instead of the sigmoid-type activation function used in a traditional MLP. The basis function is defined as

f_RBF(x) = e^(−β(x−μ)²)    (5.2)

The above equation is defined for a scalar input, but without loss of generality it can be extended to multivariate inputs. μ is called the center and lies in the input space; β represents the spread or variance of the radial basis function. Figure 5.6 shows the plot of the basis function, whose shape is similar to a Gaussian distribution.
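Equation 5.2 for scalar input can be sketched directly (illustrative names, not from the text):

```python
import numpy as np

def rbf(x, mu, beta):
    """Radial basis activation (Eq. 5.2): peaks at the center mu and
    decays towards 0 as the input moves away from it."""
    return np.exp(-beta * (x - mu) ** 2)

print(rbf(0.0, mu=0.0, beta=1.0))  # 1.0 exactly at the center
print(rbf(3.0, mu=0.0, beta=1.0))  # nearly 0 far from the center
```

The rapid decay away from μ is what gives each hidden node its local, cluster-like influence, as discussed below.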

Fig. 5.5 Architecture of radial basis function neural network

Fig. 5.6 Plot of radial basis function

5.4.1 Interpretation of RBF Networks

Aside from the mathematical definition, RBF networks have a very interesting interpretability that a regular MLP does not have. Suppose the desired output values form n clusters corresponding to n clusters in the input space. Each node in the hidden layer can then be thought of as representing one transformation from an input cluster to an output cluster. As can be seen from Fig. 5.6, the value of the radial basis function drops to 0 rather quickly as the distance between the input and the center μ increases relative to the spread β. Thus the RBF network as a whole maps the input space to the output space by a linear combination of the outputs generated by the hidden RBF nodes. It is important to choose the cluster centers carefully to make sure the input space is mapped uniformly and there are no gaps. The training algorithm is capable of finding the optimal centers, but the number of clusters to use is a hyperparameter (in other words, it needs to be tuned by exploration). If an input is presented to an RBF network that is significantly different from those used in training, the output of the network can be quite arbitrary; in other words, the generalization performance of RBF networks in extrapolation situations is not good. However, if these requirements are respected, an RBF network produces accurate predictions.

5.5 Overfitting and Regularization

Neural networks open up a feature-rich framework with practically unlimited scope for improving the performance on the given training data by increasing the complexity of the network. Complexity can be increased by manipulating various factors:

1. Increasing the number of hidden layers
2. Increasing the number of nodes in hidden layers
3. Using complex activation functions
4. Increasing the number of training epochs

Such improvements in training performance through arbitrary increases in complexity typically lead to overfitting. Overfitting is a phenomenon where we try to model the training data so accurately that in essence we just memorize it rather than identifying its features and structure. Such memorization leads to significantly worse performance on unseen data. However, determining the optimal threshold at which the optimization should be stopped, so as to keep the model generic enough, is not trivial. Multiple approaches are proposed in the literature, e.g., Optimal Brain Damage [47] or Optimal Brain Surgeon [46].

5.5.1 L1 and L2 Regularization

Regularization approaches the problem using a Lagrangian formulation: on top of minimizing the prediction error, we add another term to the optimization problem that restricts the complexity of the network, weighted by a Lagrangian factor λ.

Equations 5.3 and 5.4 show the updated cost function C(x) using L1 and L2 regularization, respectively, to reduce overfitting:

C(x) = L(x) + λ Σ_i |w_i|    (5.3)

C(x) = L(x) + λ Σ_i w_i²    (5.4)

L(x) is the loss function that depends on the prediction error, while W stands for the vector of weights in the neural network. The L1 norm tries to minimize the sum of absolute values of the weights, while the L2 norm tries to minimize the sum of squared values of the weights. Each type has pros and cons. L1 regularization requires less computation and is less sensitive to strong outliers, but it is prone to driving weights all the way to zero. L2 regularization is overall a better metric and provides slow weight decay towards zero, but it is more computation intensive.
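The two penalty terms of Eqs. 5.3 and 5.4 can be sketched as follows (function names and the toy values are our assumptions):

```python
import numpy as np

def cost_l1(loss, W, lam):
    """L1-regularized cost: loss plus lambda times the sum of absolute weights."""
    return loss + lam * np.sum(np.abs(W))

def cost_l2(loss, W, lam):
    """L2-regularized cost: loss plus lambda times the sum of squared weights."""
    return loss + lam * np.sum(W ** 2)

W = np.array([0.5, -2.0, 1.0])
print(cost_l1(1.0, W, lam=0.1))  # 1.0 + 0.1 * 3.5  = 1.35
print(cost_l2(1.0, W, lam=0.1))  # 1.0 + 0.1 * 5.25 = 1.525
```

Note how the L2 term penalizes the large weight −2.0 much more heavily (4.0 vs. 2.0 in the sum), which is what produces its characteristic slow decay of all weights towards zero.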

5.5.2 Dropout Regularization

This is an interesting method that is only applicable to neural networks, while L1 and L2 regularization can be applied to any algorithm. In dropout regularization, the neural network is considered as an ensemble of neurons, and instead of using the fully populated network, some neurons are randomly dropped from the path. The effect of each dropout on overall accuracy is considered, and after some iterations an optimal set of neurons is selected for the final model. As this technique makes the model simpler rather than adding complexity as the L1 and L2 techniques do, it is quite popular, specifically in the more complex and deep neural networks that we will study in later chapters.
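The random dropping of neurons can be sketched as a mask applied to a layer's activations. This sketch uses the common "inverted dropout" rescaling of the surviving neurons, which is an assumption of ours and is not described in the text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5):
    """Randomly zero a fraction of neuron outputs during training; the survivors
    are rescaled so the expected layer output stays unchanged (inverted dropout)."""
    mask = rng.random(activations.shape) >= p_drop  # True = neuron is kept
    return activations * mask / (1.0 - p_drop)

a = np.ones(10)
print(dropout(a, p_drop=0.5))  # every entry is either 0.0 or 2.0
```

At prediction time no neurons are dropped; thanks to the rescaling during training, no further adjustment is needed.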

5.6 Conclusion

In this chapter we studied machine learning models based on simple neural networks. We studied the concept of the single perceptron and its evolution into a full-fledged neural network. We also studied the variation of neural networks that uses radial basis function kernels. In the end we studied the effect of overfitting and how to reduce it using regularization techniques.

Artificial Intelligence Decision Trees

6.1 Introduction

Decision trees represent a conceptually and mathematically different approach to machine learning. The other approaches deal with data that is strictly numerical and can increase and/or decrease monotonically; the equations that define those approaches cannot process data that is not numerical, e.g., categorical or string types. The theory of decision trees, however, does not rely on the data being numerical. While other approaches start by writing equations about the properties of the data, decision trees start by drawing a tree-type structure such that at each node there is a decision to be made. At heart, decision trees are heuristic structures built by a sequence of choices or comparisons made in a certain order. Let's take the example of classifying the different species on earth. We can start by asking a question like: "Can they fly?". Based on the answer, we split the whole gamut of species into two parts: ones that can fly and ones that cannot. We then go to the branch of species that cannot fly and ask another question: "How many legs do they have?". Based on the answer we create multiple branches, for 2 legs, 4 legs, 6 legs, and so on. Similarly, we can ask the same or a different question of the flying species, and continue splitting until we reach leaf nodes containing only a single species. This essentially summarizes the conceptual process of building a decision tree. Although the above describes the high-level operation underlying a decision tree, the actual building process in a generalized setting is much more complex. The complexity lies in answering the question: "how do we choose which questions to ask, and in which order?". One can always start asking random questions and ultimately still converge on a full solution, but when the data is large and high dimensional, this random or brute force approach is never practical. There are multiple variations of implementations of this concept that are widely used in machine learning applications.

6.2 Why Decision Trees?

Before going into the details of decision tree theory, let's understand why decision trees are so important. Here are the main advantages of decision tree algorithms:

1. More human-like behavior.
2. Can work directly on non-numeric data, e.g., categorical data.
3. Can work directly with missing data, so the data cleaning step can be skipped.
4. A trained decision tree has high interpretability, compared to the abstract nature of models trained using other algorithms like neural networks or SVMs.
5. Decision tree algorithms scale easily from linear to nonlinear data without any change in the core logic.
6. Decision trees can be used as non-parametric models, so hyperparameter tuning becomes unnecessary.

6.2.1 Types of Decision Trees

Based on the application (classification or regression) there are some differences in how the trees are built; consequently they are called classification decision trees and regression decision trees. However, we will treat the machine learning techniques by application in the next part of the book; in this chapter, we focus on the fundamental concepts underlying decision trees that are common to both.

6.3 Algorithms for Building Decision Trees

The most commonly used algorithms for building decision trees are:
• CART, or Classification and Regression Tree
• ID3, or Iterative Dichotomiser
• CHAID, or Chi-Squared Automatic Interaction Detector

CART, or classification and regression tree, is a generic term describing the process of building decision trees as described by Breiman and Friedman [39]. ID3 is a variation of the CART methodology with a slightly different optimization method. CHAID uses a significantly different procedure and we will study it separately. The development of classification trees differs slightly from, but follows similar arguments to, that of regression trees. Let's consider a two-dimensional space defined by

axes (x1, x2). The space is divided into 5 regions (R1, R2, R3, R4, R5), using a set of rules as defined in Fig. 6.1.

Fig. 6.1 Rectangular regions created by the decision tree

6.4 Regression Tree

Regression trees are trees designed to predict the value of a function at given coordinates. Let us consider a set of n-dimensional input data {x_i, i = 1, . . . , p}, with x_i ∈ ℝⁿ, and corresponding outputs {y_i, i = 1, . . . , p}, with y_i ∈ ℝ. Regression trees require the input and output data to be numerical rather than categorical. Given this training data, it is the job of the algorithm to build the set of rules. How many rules to use, which dimensions to split on, and when to terminate the tree are all parameters the algorithm needs to optimize based on the desired error rate. Based on the example shown in Figs. 6.1 and 6.2, let the classes be the regions R1 to R5 with two-dimensional input data. In this case, the desired response of the decision tree is defined as

Fig. 6.2 Hierarchical rules defining the decision tree

ŷ(x) = r_k if x ∈ R_k, for k = 1, . . . , 5

where r_k ∈ ℝ is a constant value of the output in region R_k. If we define the optimization problem as minimizing the mean square error, we seek the regions and constants that minimize

Σ_{i=1}^{p} (y_i − ŷ(x_i))²

Solving the problem to find the globally optimal regions that minimize the mean square error is NP-hard and cannot in general be solved in finite time. Hence greedy methods resulting in a local minimum are employed. Such greedy methods typically result in a large tree that overfits the data; let us denote this large tree by T0. The algorithm must then apply a pruning technique to reduce the tree size, finding the optimal tradeoff that captures most of the structure in the data without overfitting. This is achieved using the squared-error node impurity measure as described in [39].

6.5 Classification Tree

In the case of classification, the output is not a continuous numerical value but a discrete class label. The development of the large tree follows the same steps as described in the regression tree section, but the pruning methods need to be updated, as the squared error method is not suitable for classification. Three different types of measures are popular in the literature:

1. Misclassification error
2. Gini index
3. Cross-entropy or deviance

Let there be k classes and n nodes. Let the frequency of class m predictions at node i be denoted f_mi, and the fraction of samples predicted as class m at node i be denoted p_mi. Let the majority class at node i be c_i; the fraction of samples of class c_i at node i is then p_ic_i.

6.6 Decision Metrics

Let's define the metrics used for making the decision at each node. Differences in the metric definitions distinguish the different decision tree algorithms.

6.6.1 Misclassification Error

Based on the variables defined above, the misclassification rate is defined as 1 − p_ic_i. As can be seen from Fig. 6.3, this metric is not smooth and cannot be differentiated everywhere. However, it is one of the most intuitive formulations and hence is fairly popular.

6.6.2 Gini Index

The Gini index is the measure of choice in CART. The concept of the Gini index can be summarized as the probability of misclassifying a randomly selected input sample if it were labelled according to the distribution of classes in the given node. Mathematically it is defined as

G_i = Σ_m p_mi (1 − p_mi) = 1 − Σ_m p_mi²

Fig. 6.3 The plot of decision metrics for a case of 2 class problem. X-axis shows the proportion in class 1. Curves are scaled to fit, without loss of generality

As the plot in Fig. 6.3 shows, this is a smooth function of the proportion; it is continuously differentiable and can be safely used in optimization.

6.6.3 Cross-Entropy or Deviance

Cross-entropy is an information-theoretic metric defined as

H_i = − Σ_m p_mi log p_mi

This definition resembles the classical entropy of a single random variable. However, as the random variable here is already a combination of the class prediction and the nodes of the tree, it is called cross-entropy. ID3 models use cross-entropy as the measure of choice. As the plot in Fig. 6.3 shows, this too is a smooth function of the proportion; it is continuously differentiable and can be safely used in optimization.
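The two smooth impurity measures can be computed directly from the class proportions at a node. A small sketch (illustrative names; base-2 logarithm chosen so entropy is in bits):

```python
import numpy as np

def gini(p):
    """Gini index: probability of misclassifying a sample labelled at random
    according to the node's class proportions p."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def cross_entropy(p):
    """Cross-entropy (deviance) of the class proportions at a node."""
    p = np.asarray(p)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))           # 0.5: maximally impure 2-class node
print(gini([1.0, 0.0]))           # 0.0: pure node
print(cross_entropy([0.5, 0.5]))  # 1.0 bit
```

Both measures are 0 at a pure node and largest when the classes are evenly mixed, matching the curves in Fig. 6.3.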

6.7 CHAID

The chi-squared automatic interaction detector, or CHAID, is a decision tree technique that derives from the statistical chi-square test for goodness of fit. It was first published by G. V. Kass in 1980, but some parts of the technique were already in use in the 1950s. The chi-square test uses the chi-square distribution to compare a sample with a population and predict, at a desired statistical significance, whether the sample belongs to the population. The CHAID technique uses this theory to build a decision tree. Due to the use of the chi-square technique, this method is quite different from the other types of decision trees discussed so far. The following subsection briefly discusses the details of the algorithm.

6.7.1 CHAID Algorithm

The first task in building a CHAID tree is to find the most dependent variable; this is directly related to the final application of the tree. The algorithm works best if a single desired variable can be identified. Once such a variable is identified, it is called the root node. The algorithm then tries to split this node into two or more nodes, called initial or parent nodes. All subsequent nodes are called child nodes, until we reach the final set of nodes that are not split any further; these are called terminal nodes. Splitting at each node is based entirely on statistical dependence, as dictated by the chi-square distribution in the case of categorical data and by the F-test in the case of continuous data. As each split is based on the dependency of variables, rather than a more complex expression like Gini impurity or cross-entropy as in CART or ID3-based trees, the tree structure developed using CHAID is more interpretable and human readable in most cases.

6.8 Training a Decision Tree

We will not go into the full mathematical details of building a decision tree using CART or ID3, but the following steps explain the methodology with sufficient detail and clarity.

6.8.1 Steps

1. Start with the training data.
2. Choose the metric of choice (Gini index or cross-entropy).
3. Choose the root node such that it splits the data into two branches with the optimal value of the metric.
4. Split the data into two parts by applying the decision rule of the root node.
5. Repeat steps 3 and 4 for each branch.
6. Continue the splitting process until leaf nodes are reached in all branches, according to a predefined stopping rule.
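The node-splitting search at the heart of steps 3 and 4 can be sketched as an exhaustive scan over features and thresholds, keeping the split with the lowest weighted Gini index. This is a simplified sketch with our own names, not the full CART procedure:

```python
import numpy as np

def gini_of(labels):
    """Gini index of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for one decision rule: try every feature and every
    candidate threshold, keep the split with the lowest weighted Gini index."""
    best_score, best_rule = np.inf, None
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # largest value would leave a branch empty
            left = X[:, j] <= t
            score = (left.sum() * gini_of(y[left]) +
                     (~left).sum() * gini_of(y[~left])) / n
            if score < best_score:
                best_score, best_rule = score, (j, t)
    return best_rule, best_score

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # splits feature 0 at threshold 2.0; both branches pure
```

A full tree builder would apply this search recursively to each branch until the stopping rule fires, then prune the resulting tree.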

6.9 Ensemble Decision Trees

In the previous sections we learned ways to develop a single decision tree using different techniques. In many situations such trees work very well, but more performance can be extracted from a similar architecture if we create multiple such trees and aggregate them. Such techniques are called ensemble methods, and they typically deliver superior performance at the cost of computational and algorithmic complexity. In ensemble methods, a single decision tree is called a single learner or weak learner, and the ensemble methods deal with a group of such learners. Various approaches proposed in the literature can successfully combine multiple weak learners to create a strong overall model.1 Each weak learner in the ensemble captures certain aspects of the information contained in the data used to train it. The job of the ensemble is to optimally unite the weak learners to obtain better overall metrics. The primary advantage of ensemble methods is a reduction in overfitting. There are three main types of ensembles:
• Bagging
• Random forest
• Boosting

6.10 Bagging Ensemble Trees

The term bagging finds its origins in Bootstrap Aggregation. Coincidentally, the literal meaning of bagging, putting multiple decision trees in a bag, is not too far from how the technique works. The bagging technique can be described by the following steps:

1. Split the total training data into a predetermined number of sets by random sampling with replacement. With replacement means that the same sample can appear in multiple sets; each sampled set is called a bootstrap sample.
2. Train a decision tree on each of the data sets using the CART or ID3 method. Each learned tree is called a weak learner.
3. Aggregate all the weak learners: by averaging the outputs of the individual learners for the case of regression, and by voting

1 The words weak and strong have a particular meaning in this context. A weak learner is a decision tree that is trained using only a fraction of the total data and is not capable of, or even expected to give, metrics close to the desired ones. The theoretical definition of a weak learner is one whose performance is only slightly better than pure random chance. A strong learner is a single decision tree that uses all the data and is capable of producing reasonably good metrics. In ensemble methods an individual tree is always a weak learner, as it is not exposed to the full data set.

for the case of classification. The aggregation step involves optimization such that the prediction error is minimized. The output of the aggregate, or ensemble, of the weak learners is considered the final output. The steps described above are quite straightforward and do not involve any complex mathematics or calculus. However, the method is quite effective. If the data has some outliers,2 a single decision tree can be affected by them far more than an ensemble can. This is one of the inherent advantages of bagging methods.
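The bootstrap sampling and regression-style aggregation can be sketched as below. The "weak learners" here are stand-in constant predictors, purely for illustration; in practice each would be a tree trained on its bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_sets(X, y, n_sets):
    """Random sampling WITH replacement: the same sample may appear in several sets."""
    idx = rng.integers(0, len(X), size=(n_sets, len(X)))
    return [(X[i], y[i]) for i in idx]

def bagged_predict(learners, x):
    """Regression aggregation: average the predictions of the individual weak learners."""
    return np.mean([f(x) for f in learners])

# stand-in weak learners: each just predicts a constant, for illustration only
learners = [lambda x, c=c: c for c in (1.0, 2.0, 3.0)]
print(bagged_predict(learners, x=None))  # 2.0
```

For classification, the `np.mean` aggregation would be replaced by a majority vote over the learners' predicted labels.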

6.11 Random Forest Trees

The bagging process described above improves the resilience of decision trees with respect to outliers. Random forest methods go one step further to make the ensemble even more resilient when feature importances vary. Even after using a carefully crafted feature space, not all features are equally influential on the outcome. Also, certain features can have interdependencies that affect their influence on the outcome in a counterproductive manner. The random forest architecture improves model performance in such situations over the previously discussed methods by partitioning the feature space as well as the data for each individual weak learner. Thus each weak learner sees only a fraction of the samples and a fraction of the features. The features are sampled randomly with replacement [48], just as the data is sampled with replacement in bagging methods. The process is also called the random subspace method, as each weak learner works in a subspace of the features. In practice, this sampling improves the diversity among the individual trees and overall makes the model more robust and resilient to noisy data. The original algorithm proposed by Tin Kam Ho was extended by Breiman [49], merging multiple existing approaches in the literature into what is now commonly known as the random forest method.

6.11.1 Decision Jungles

Recently a modification to the method of random forests was proposed in the form of decision jungles [62]. One of the drawbacks of random forests is that they can grow exponentially with data size, and if the compute platform is limited by memory, the depth of the trees needs to be restricted. This can result in suboptimal performance. Decision jungles propose to improve on this by representing each weak learner in the random forest by a directed acyclic graph (DAG) instead of an open-ended tree. The DAG has the capability to fuse some of the nodes, thereby creating multiple paths to a leaf from the root node. As a result, decision jungles can represent the same logic as random forest trees, but in a significantly more compact manner.

A note on outliers: outliers represent an important concept in the theory of machine learning. Although their meaning is obvious, their impact on learning is not trivial. An outlier is a sample in the training data that does not represent the generic trends in the data. From a mathematical standpoint, the distance of an outlier from the other samples in the data is typically large. Such large distances can pull a machine learning model significantly away from the desired behavior. In other words, a small set of outliers can adversely affect the learning of a machine learning model and reduce the metrics significantly. It is thus an important property of a machine learning model to be resilient to a reasonable number of outliers.

6.12 Boosted Ensemble Trees

The fundamental difference between boosting and bagging (or random forests, for that matter) is the sequential training of the trees as opposed to parallel training. In bagging and random forest methods, all the individual weak learners are generated using random sampling and random subspaces. As the individual weak learners are independent of each other, they can all be trained in parallel. Only after they are completely trained are their results aggregated. Boosting employs a very different approach, where the first tree is trained on a random sample of the data. However, the data used by the second tree depends on the outcome of training the first tree: the second tree is made to focus on the specific samples where the first decision tree is not performing well. Thus the training of the second tree depends on the training of the first, and they cannot be trained in parallel. The training continues in this fashion to the third tree, the fourth, and so on. Due to this unavailability of parallel computation, the training of boosted trees is significantly slower than training trees using bagging or random forests. Once all the trees are trained, the outputs of the individual trees are combined with suitable weights to generate the final output. In spite of this computational disadvantage, boosted trees are often preferred over the other techniques due to their superior performance in most cases.

6.12.1 AdaBoost

AdaBoost was one of the first boosting algorithms, proposed by Freund and Schapire [51]. The algorithm was primarily developed for the case of binary classification, and it was quite effective in improving the performance of a decision tree in a systematic, iterative manner. The algorithm was later extended to support multi-class classification as well as regression.
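A minimal sketch of the AdaBoost loop for binary labels in {−1, +1}, using one-dimensional threshold stumps as the weak learners; the stump search, the toy data, and the round count are illustrative simplifications, not the published algorithm in full:

```python
import math

def stump_predict(threshold, polarity, x):
    """A threshold stump: the weakest practical learner for 1-D data."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, w):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(threshold, polarity, xi) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                   # uniform sample weights to start
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, w)
        err = max(err, 1e-10)                     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this learner
        model.append((alpha, threshold, polarity))
        # Re-weight: misclassified samples gain weight for the next round.
        w = [wi * math.exp(-alpha * yi * stump_predict(threshold, polarity, xi))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Weighted vote of all the stumps trained so far."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in model)
    return 1 if score >= 0 else -1

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])
```

The re-weighting line is the essence of the method: each round, the next weak learner is forced to pay more attention to the samples the previous ones got wrong.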

6.12.2 Gradient Boosting

Breiman proposed an algorithm called ARCing [50], for Adaptive Reweighting and Combining. This algorithm marked the next step in improving the capabilities of boosting-type methods using a statistical framework. Gradient boosting is a generalization of the AdaBoost algorithm using the statistical framework developed by Breiman and Friedman [39]. In gradient boosted trees, the boosting problem is stated as a numerical optimization problem whose objective is to minimize the error by sequentially adding weak learners using a gradient descent algorithm. Gradient descent being a greedy method, the gradient boosting algorithm is susceptible to overfitting the training data. Hence regularization techniques are always used with gradient boosting to limit the overfitting.

6.13 Conclusion

In this chapter we studied the concept of decision trees. These methods are extremely important and useful in many applications encountered in practice. They are directly motivated by a hierarchical decision making process very similar to human behavior in tackling real life problems, and hence are more intuitive than the other methods. Also, the results obtained using decision trees are easier to interpret, and these insights can be used to determine the action to be taken after knowing the results. We also looked at ensemble methods that use an aggregate of multiple decision trees to optimize the overall performance and make the models more robust and generic.

AI Support Vector Machines

7.1 Introduction

The theory of support vector machines, or SVMs, is typically attributed to Vladimir Vapnik. He was working in the field of optimal pattern recognition using statistical methods in the early 1960s. His paper with Lerner [52] on the generalized portrait algorithm marks the beginning of support vector machines. The method was primarily designed to solve the problem of binary classification by constructing an optimal hyperplane that separates the two classes with maximum separation.

7.2 Motivation and Scope

The original SVM algorithm was developed for binary classification. The algorithm tries to separate the two classes with maximal separation using a minimum number of data points, called support vectors, as shown in Figs. 7.1 and 7.2. Figure 7.1 shows the case of linearly separable classes, where the result is trivial. The solid line represents the hyperplane that optimally separates the two classes. The dotted lines represent the boundaries of the classes as defined by the support vectors. The class-separating hyperplane tries to maximize the distance between the class boundaries. However, as can be seen in Fig. 7.2, where the classes are not entirely linearly separable, the algorithm still finds the optimal support vectors. Once the support vectors are identified, the classification does not need the rest of the samples for predicting the class. The beauty of the algorithm lies in the drastic reduction in the number of support vectors compared to the total number of training samples.

Fig. 7.1 Linear binary SVM applied on separable data

7.2.1 Extension to Multi-Class Classification

As per the conceptual setup of SVM, it is not directly extensible to the problem of multi-class classification. However, there are a few approaches commonly used to extend the framework to such cases. One approach is to use the SVM as a binary classifier to separate each pair of classes and then apply some heuristic to predict the class for each sample. This is an extremely time consuming method and not the preferred one. For example, in the case of a 3-class problem, one has to train SVMs separating classes 1–2, 1–3, and 2–3, thereby training 3 separate SVMs, and the number of pairwise classifiers grows at a polynomial rate with more classes. In the other approach, a binary SVM is used to separate each class from the rest of the classes. With this approach, for a 3-class problem, one still has to train 3 SVMs: 1-(2,3), 2-(1,3), and 3-(1,2). However, with a further increase in the number of classes, the complexity increases only linearly.

7.2.2 Extension for Nonlinear Case

The case of nonlinear separation can be solved by using suitable kernels. The original data can be transformed into arbitrarily high dimensional vectors using suitable kernel functions. After the transformation, the equations of linear SVM are applicable as is, leading to optimal classification.

Fig. 7.2 Linear binary SVM applied on non-separable data
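The classifier counts for the two multi-class strategies of Sect. 7.2.1 can be checked with a trivial computation; the function names below are made up for illustration:

```python
def one_vs_one_count(n_classes):
    """Number of binary SVMs when separating every pair of classes:
    C(n, 2) = n * (n - 1) / 2, i.e. polynomial growth."""
    return n_classes * (n_classes - 1) // 2

def one_vs_rest_count(n_classes):
    """Number of binary SVMs when separating each class from the rest:
    one per class, i.e. linear growth."""
    return n_classes

for k in (3, 10, 100):
    print(k, one_vs_one_count(k), one_vs_rest_count(k))
```

For 3 classes the two strategies coincide at 3 SVMs each, but at 100 classes one-vs-one needs 4950 classifiers against 100 for one-vs-rest, which is the growth difference described above.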

7.3 Theory of SVM

In order to understand how SVMs are trained, it is important to understand the theory behind the SVM algorithm. This can get highly mathematical and complex; however, I am going to try to avoid the gory details of the derivations. The reader is encouraged to read [37] for a detailed theoretical overview. I will state the assumptions on which the derivations are based and then move to the final equations that can be used to train the SVM, without losing the essence of SVM. Let us consider a binary classification problem with an n-dimensional training data set consisting of p pairs (xi, yi), such that xi ∈ ℝⁿ and yi ∈ {−1, +1}. Let the equation of the hyperplane that separates the two classes with maximal separation be given as

With some manipulations, the Lagrangian in Eq. 7.5 reduces to a subset that contains only a very small number of training samples, called the support vectors. As can be seen from Fig. 7.1, these support vectors are the vectors that represent the boundary of each class. After some more mathematical trickery performed using the well-known Karush–Kuhn–Tucker conditions, one arrives at a convex quadratic optimization problem, which is relatively straightforward to solve. The equation to compute the optimal weight vector ŵ can then be given in terms of the Lagrange multipliers αi ≥ 0 as

7.4 Separability and Margins

The SVM algorithm described above is designed to separate classes that are in fact completely separable. In other words, when the separating hyperplane is constructed between the two classes, the entirety of one class lies on one side of the hyperplane and the entirety of the other class lies on the opposite side, with 100% accuracy in separation. The

margins defined in Eq. 7.2 are called hard margins, as they impose complete separability between the classes. However, in practice such cases are seldom found. In order to account for the cases that are not entirely separable, soft margins were introduced. In order to understand soft margins, let us write Eq. 7.3 in a slightly different manner as

7.4.1 Regularization and Soft Margin SVM

For all the cases when the samples are not separable, the above inequality will not be satisfied. To accommodate such cases, the optimization problem is re-formulated using regularization techniques. The new Lagrangian to be minimized is given as

Here, with the max function we essentially ignore the samples that are classified correctly with sufficient margin, so that only the samples that violate the margin contribute to the loss.

7.4.2 Use of Slack Variables

Another way to accommodate the case of non-separable data is the use of slack variables, denoted as ξi. With the use of slack variables, the cost functional is updated as

where ξi ≥ 0, i = 1, . . . , m. Now the optimization also needs to find the values of all the slack variables. This approach is also called C-SVM.

7.5 Nonlinearity and Use of Kernels

The use of kernels is one of the groundbreaking discoveries in the field of machine learning. With the help of this method, one can elegantly transform a nonlinear problem into a linear problem. These kernel functions are different from the link functions that we discussed in Chap. 4. In order to understand the use of kernels in the case of support vector machines, let's look at Eq. 7.7, specifically the term (x.x). Here we are taking a dot product of the input vector with itself and, as a result, generating a real number. The use of a kernel function states that we can replace the dot product operation with a function that accepts two parameters (in this case both will be input vectors) and outputs a real valued number. Mathematically, this kernel function is written as

7.5.1 Radial Basis Function

The radial basis function kernel with variance σ is given as

This is sometimes also called the kernel trick, although it is far more than a simple trick. A function needs to satisfy certain properties in order to be called a kernel function; for more details on kernel functions refer to [37].

This representation of SVM closely resembles the radial basis function neural networks that we learned about in Chap. 5. In some cases, the use of the squared distance between the two inputs can lead to vanishingly low values. In such cases a variation of the above function, called the Laplacian radial basis function, is used. It is defined as

7.5.2 Polynomial

For a polynomial of degree d, the kernel function is given as

7.5.3 Sigmoid

A sigmoid kernel, which resembles a traditional neural network, can also be used. It is defined as
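The four kernels discussed in this section can be written down directly. The exact parameter conventions (σ as variance vs. standard deviation, the constants in the polynomial and sigmoid forms) vary between texts, so treat these as one common convention rather than the book's exact equations:

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Radial basis function kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    """Laplacian variant: uses the distance itself, not its square, which
    avoids vanishingly small values for well-separated inputs."""
    dist = math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
    return math.exp(-dist / sigma)

def polynomial_kernel(x, z, degree=3, c=1.0):
    """Polynomial kernel of degree d: (x . z + c)^d."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (dot + c) ** degree

def sigmoid_kernel(x, z, kappa=1.0, c=0.0):
    """Sigmoid kernel: tanh(kappa * x . z + c), resembling a neural unit."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return math.tanh(kappa * dot + c)

x, z = [1.0, 0.0], [0.0, 1.0]
print(rbf_kernel(x, x), rbf_kernel(x, z))   # identical inputs give exactly 1.0
```

Each function takes two input vectors and returns a real number, which is precisely the dot-product replacement described above.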

7.6 Risk Minimization

Methods based on risk minimization, sometimes called structural risk minimization [64], essentially aim at learning to optimize the given system with constraints on the parameters imposed by regularization as well as by the problem definition itself. Support vector machines solve this problem of risk minimization in an elegant fashion. These methods strike a balance between performance optimization and reduction in overfitting in a programmatic manner. Vapnik further extended the theory of structural risk minimization to the case of generative models using vicinal risk minimization [63]. These methods can be applied to cases that do not fit the traditional SVM architecture, such as problems with missing or unlabeled data.

7.7 Conclusion

In this chapter we studied an important pillar of machine learning theory, the support vector machines. SVM represents a mathematically elegant architecture to build an optimal classification or regression model. The training process is a bit complex and needs tuning of some hyperparameters, but once properly tuned, SVM models tend to provide very high accuracy and generalization capabilities.

AI Probabilistic Models

8.1 Introduction

Most algorithms studied thus far are based on algebraic, graphical, and/or calculus based methods. In this chapter we are going to focus on probabilistic methods. Probabilistic methods assign some form of uncertainty to the unknown variables and some form of belief probability to the known variables, and try to find the unknown values using the extensive library of probabilistic models. The probabilistic models are mainly classified into two types: 1. Generative 2. Discriminative. The generative model takes a more holistic approach towards understanding the data compared to discriminative models. Commonly, the difference between the two types is given in terms of the probabilities that they deal with. If we have an observable input X and observable output Y, then the generative models try to model the joint probability P(X, Y), while the discriminative models try to model the conditional probability P(Y|X). Most non-probabilistic approaches discussed so far also belong to the class of discriminative models. This definition of the separation between the two approaches can, however, be quite vague and confusing at times. Hence, we will try to define the two more intuitively. Before going into the definition we need to add a few more concepts. Let there be a hidden entity called state S, along with the input and output. The input actually makes some changes to the state of the system, and that change, along with the input, dictates the output. Now, let's define the discriminative models as the models that try to predict the changes in the output based only on changes in the input. The generative models are the models that try to model the changes in the output based on changes in the input as well as changes in the state. This inclusion and modeling of the state gives a deeper insight into the systemic aspects, and generative models are typically harder to build and need more information and assumptions to start with.
However, there are some inherent advantages that come with this added complexity, as we will see in later sections of this chapter.

Probabilistic approaches (discriminative as well as generative) are also divided between two schools of thought: 1. Maximum likelihood estimation 2. Bayesian approach

8.2 Discriminative Models We will first discuss the distinction between these two classes from the perspective of discriminative models and then we will turn to generative models.

8.2.1 Maximum Likelihood Estimation

The maximum likelihood estimation or MLE approach deals with the problem at face value and parameterizes the information into variables. The values of the variables that maximize the probability of the observed variables lead to the solution of the problem. Let us define the problem using formal notation. Let there be a function f(x; θ) that produces the observed output y. Here x ∈ ℝⁿ represents the input, over which we don't have any control, and θ represents a parameter vector that can be single or multidimensional. The MLE method defines a likelihood function denoted as L(y|θ). Typically the likelihood function is the joint probability of the parameters and observed variables, L(y|θ) = P(y; θ). The objective is to find the optimal values of θ that maximize the likelihood function, as given by

This is a purely frequentist approach that is only data dependent.

8.2.2 Bayesian Approach

The Bayesian approach looks at the problem in a different manner. All the unknowns are modelled as random variables with known prior probability distributions. Let us denote the conditional probability of observing the output y for parameter vector θ as P(y|θ). The marginal probabilities of these variables are denoted as P(y) and P(θ). The joint probability of the variables can be written in terms of the conditional and marginal probabilities as

Here the probability P(θ|y) is called the posterior probability. Combining Eqs. 8.3 and 8.4,

Equation 8.6 is called Bayes' theorem. This theorem gives the relationship between the posterior probability and the prior probability in a simple and elegant manner. This equation is the foundation of the entire Bayesian framework. Each term in the above equation is given a name: P(θ) is called the prior, P(y|θ) is called the likelihood, P(y) is called the evidence, and P(θ|y) is called the posterior. Thus in this form Bayes' theorem is stated as

The Bayes' estimate is based on maximizing the posterior. Hence, the optimization problem based on Bayes' theorem can now be stated as

Comparing this equation with Eq. 8.2, we can see that the Bayesian approach adds more information in the form of the prior probability. Sometimes this information is available, and then the Bayesian approach clearly becomes the preferred choice; in cases when this information is not explicitly available, one can still assume a certain default distribution and proceed.

8.2.3 Comparison of MLE and Bayesian Approach

These formulations are relatively abstract and in general can be quite hard to comprehend. In order to understand them fully, let us consider a simple numerical example. Let there be an experiment in which a coin is tossed 5 times. The two possible outcomes of each toss are H and T, for Head and Tail. The outcome of our experiment is H, H, T, H, H. The objective is to find the outcome of the 6th toss. Let's work out this problem using the MLE and Bayes' approaches.

8.2.3.1 Solution Using MLE

The likelihood function is defined as L(y|θ) = P(y; θ), where y denotes the outcome of the trial and θ denotes the property of the coin in the form of the probability of a given outcome. Let the probability of getting a Head be h; the probability of getting a Tail is then 1 − h. The outcome of each toss is independent of the outcomes of the other tosses. Hence the total likelihood of the experiment can be given as

Now, let us solve this equation:

P(y; θ) = h · h · (1 − h) · h · h = h^4 − h^5

In order to maximize the likelihood, we use the fundamental principle from differential calculus that at any maximum or minimum of a continuous function the first-order derivative is 0. Differentiating the above equation with respect to h and equating it to 0 (assuming h ≠ 0):

∂/∂h (h^4 − h^5) = 0
4 · h^3 − 5 · h^4 = 0
4 · h^3 = 5 · h^4
4 = 5 · h
h = 4/5

The probability of getting a Head in the next toss would thus be 4/5.

8.2.3.2 Solution Using Bayes' Approach

Comparing this equation with Eq. 8.10, we can see that the likelihood function is the same as the term P(y|θ) in the current equation. However, we need a value for another entity, P(θ), and that

is the prior. This is something that we are going to assume, as it is not explicitly given to us. If we assume the prior probability to be uniform, then it is independent of θ and the outcome of the Bayes' approach will be the same as the outcome of MLE. However, in order to showcase the differences between the two approaches, let us use a different and non-intuitive prior: let P(θ = h) = 2h (the corresponding probability for a Tail follows consequently). While defining this prior, we need to make sure that it is a valid probability density function. The easiest way to do that is to confirm that P(θ = h) integrates to 1 over [0, 1], and as can be seen from Fig. 8.1, it indeed does. There is one more factor in the equation, in the form of the evidence P(y). However, this value is the probability of occurrence of the output without any dependency on the coin bias, and it is constant with respect to h. When we differentiate with respect to h, the effect of this term vanishes. Hence, we can safely ignore it for the purpose of optimization (Fig. 8.1). We can now proceed with the optimization problem as before. In order to maximize the posterior, let's differentiate it with respect to h as before,

Fig. 8.1 Probability density function (pdf) for the prior



∂/∂h (2^5 · (h^9 − h^10)) = 0

9 · h^8 − 10 · h^9 = 0
9 · h^8 = 10 · h^9
h = 9/10

With the Bayes' approach, the probability of getting a Head in the next toss would be 9/10. Thus the assumption of a non-trivial prior with the Bayes' approach leads to a different answer compared to MLE.
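The two point estimates from this worked example can be confirmed numerically with a brute-force search. The helper argmax_on_grid is a throwaway illustration, and the Bayesian objective 2^5 · (h^9 − h^10) follows the computation above:

```python
def argmax_on_grid(f, steps=100000):
    """Brute-force maximiser of f over h in (0, 1): no calculus needed."""
    best_h, best_v = None, float("-inf")
    for i in range(1, steps):
        h = i / steps
        v = f(h)
        if v > best_v:
            best_h, best_v = h, v
    return best_h

likelihood = lambda h: h**4 * (1 - h)         # MLE objective: h^4 - h^5
posterior = lambda h: 2**5 * (h**9 - h**10)   # Bayesian objective from above

print(round(argmax_on_grid(likelihood), 3))   # 0.8  -> h = 4/5
print(round(argmax_on_grid(posterior), 3))    # 0.9  -> h = 9/10
```

The grid search recovers both the MLE answer 4/5 and the Bayesian answer 9/10, confirming the hand derivation.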

8.3 Generative Models

As discussed earlier, generative models try to understand how the data being analyzed came to be in the first place. They find applications in multiple fields where we need to synthesize speech, images, or 3D environments that resemble real life but are not directly

copied from any of the real examples. Generative models can be broadly classified into two types: (1) classical models and (2) deep learning-based models. We will briefly look at a few examples of classical generative models.

Fig. 8.2 Sample Bayesian network

8.3.1 Mixture Methods

One of the fundamental aspects of generative models is to understand the composition of the input, that is, how the input data came into existence in the first place. The most simplistic case would be to have all the input data as the outcome of a single process. If we can identify the parameters describing the process, we can understand the input to its fullest extent. However, typically any input data is far from such an ideal case, and it needs to be modelled as an outcome of multiple processes. This gives rise to the concept of mixture models.

8.3.2 Bayesian Networks

Bayesian networks are directed acyclic graphs, as shown in Fig. 8.2. Each node represents an observable variable or a state. The edges represent the conditional dependencies between the nodes. Training a Bayesian network involves identifying the nodes and predicting the conditional probabilities that best represent the given data.

8.4 Some Useful Probability Distributions

We will conclude this chapter by detailing some of the commonly used probability distributions. There are likely hundreds of different distributions studied in the literature of probability and statistics. We don't want to study them all, but in my experience with machine learning projects so far, I have realized that knowledge of a few key distributions goes a long way. Hence, I am going to describe these distributions here without going into the theoretical details of their origins. We will look at the probability density functions or pdf's and cumulative density functions or cdf's of these distributions, and take a look at the parameters that define them. Here are the definitions of these quantities for reference:

Definition 8.1 (pdf) A probability density function or pdf is a function P(X = x) that provides the probability of occurrence of value x for a given variable X. The plot of P(X = x) is non-negative, can spread over [−∞, ∞] on the x-axis, and integrates to 1.

Definition 8.2 (cdf) A cumulative density function or cdf is a function C(X = x) that provides the sum of the probabilities of occurrence of values of X between [−∞, x]. This plot is also bounded between [0, 1]. Unlike the pdf, this plot starts at 0 on the left and ends at 1 on the right.

I would strongly advise the reader to go through these distributions and see the trends in the probabilities as the parameters are varied. We come across distributions like these in many situations, and if we can match a given distribution to a known distribution, the problem can be solved in a far more elegant manner.

8.4.1 Normal or Gaussian Distribution

The normal distribution is one of the most widely used probability distributions. It is also called the bell-shaped distribution due to the shape of its pdf. The distribution has a vast array of applications, including error analysis. It also approximates a multitude of other distributions with more complex formulations. Another reason the normal distribution is popular is the central limit theorem.

Definition 8.3 (Central Limit Theorem) The central limit theorem states that if a sufficiently large number of samples are taken from a population with any distribution with finite variance, then the mean of the samples asymptotically approaches the mean of the population. In other words, the sampling distribution of the mean, taken from a population with any distribution, asymptotically approaches a normal distribution.

Hence the normal distribution is sometimes also called the distribution of distributions. The normal distribution is also an example of a continuous and unbounded distribution, where the value of x can span [−∞, ∞]. Mathematically, the pdf of the normal distribution is given as

where μ is the mean and σ is the standard deviation of the distribution; the variance is σ². The cdf of the normal distribution is given as

where erf is a standard error function, defined as

The function being integrated is symmetric; hence it can also be written as
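The pdf and cdf just described can be sketched with the standard library's erf function; the helper names are illustrative:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """pdf of the normal distribution with mean mu and std deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """cdf expressed through the standard error function erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(round(normal_pdf(0.0), 4))   # peak of the standard normal pdf
print(normal_cdf(0.0))             # half the mass lies below the mean: 0.5
```

At x = μ the cdf is exactly 0.5, reflecting the symmetry mentioned above.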

The plots of the pdf and cdf are shown in Figs. 8.3 and 8.4.

8.4.2 Bernoulli Distribution

The Bernoulli distribution is an example of a discrete distribution, and its most common application is the probability of a coin toss. The distribution owes its name to a great mathematician of the

seventeenth century, Jacob Bernoulli. The distribution is based on two parameters p and q, which are related as p = 1 − q. Typically p is called the probability of success (in the case of a coin toss, the probability of getting a Head) and q is called the probability of failure (the probability of getting a Tail). Based on these parameters, the pdf of the Bernoulli distribution (in the case of discrete variables this is sometimes called the probability mass function or pmf, but for the sake of consistency we will call it the pdf) is given as

Here we use the discrete variable k instead of the continuous variable x. The cdf is given as
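A sketch of the Bernoulli pdf and cdf, treating the support as k ∈ {0, 1}; the function names are illustrative:

```python
def bernoulli_pdf(k, p):
    """P(X = k) for k in {0, 1}; p is the probability of success."""
    return p if k == 1 else 1.0 - p

def bernoulli_cdf(k, p):
    """P(X <= k): 0 below 0, then q = 1 - p on [0, 1), then 1 from 1 up."""
    if k < 0:
        return 0.0
    if k < 1:
        return 1.0 - p
    return 1.0

print(bernoulli_pdf(1, 0.7), round(bernoulli_cdf(0, 0.7), 10))
```

The cdf is a step function, which is the typical shape for discrete distributions.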

Fig. 8.3 Plot of normal pdfs for 0 mean and different variances

Fig. 8.4 Plot of normal pdfs for different means and different variances

8.4.3 Binomial Distribution

The binomial distribution generalizes the Bernoulli distribution to multiple trials. It has two parameters, n and p: n is the number of trials of the experiment, and p is the probability of success in each trial. The probability of failure is q = 1 − p, just as in the Bernoulli distribution, but it is not considered a separate third parameter. The pdf for the binomial distribution is given as

The term C(n, k) is called the binomial coefficient in this context. It also represents the number of combinations of k items out of n from permutation-combination theory, where it is represented as

The cdf of the binomial distribution is given as
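The binomial pdf and cdf can be sketched directly, using math.comb for the binomial coefficient; the helper names are illustrative:

```python
import math

def binomial_pdf(k, n, p):
    """P(X = k): C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(X <= k): cumulative sum of the pdf."""
    return sum(binomial_pdf(i, n, p) for i in range(k + 1))

print(round(binomial_pdf(2, 5, 0.5), 4))   # C(5, 2) / 2^5 = 10/32 = 0.3125
print(binomial_cdf(5, 5, 0.5))             # whole support sums to 1.0
```

With n = 1 this reduces exactly to the Bernoulli pdf, reflecting the generalization described above.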

8.4.4 Gamma Distribution

The gamma distribution is also one of the most highly studied distributions in the theory of statistics. It forms a basic distribution for other commonly used distributions like the chi-squared distribution,

exponential distribution, etc., which are special cases of the gamma distribution. It is defined in terms of two parameters, α and β. The pdf of the gamma distribution is given as

where x > 0 and α, β > 0. The simple definition of Γ(α) for an integer parameter is given as a factorial function, Γ(α) = (α − 1)!

The same definition is generalized to complex numbers with positive real parts as Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx.
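A sketch of the gamma pdf using the standard library's math.gamma; note that β is used here as a rate parameter, which is one of two common conventions (the other uses a scale parameter 1/β):

```python
import math

def gamma_pdf(x, alpha, beta):
    """pdf of the gamma distribution (shape alpha, rate beta) for x > 0."""
    return (beta**alpha) * x**(alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

# For integer arguments, math.gamma reduces to a factorial: gamma(n) = (n-1)!
print(math.gamma(5))                         # 4! = 24.0
print(round(gamma_pdf(1.0, 1.0, 1.0), 4))    # alpha = beta = 1 gives exp(-1)
```

With α = 1 the pdf reduces to the exponential distribution, one of the special cases mentioned above.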

8.4.5 Poisson Distribution

The Poisson distribution is a discrete distribution loosely similar to the binomial distribution. It was developed to model the number of occurrences of an outcome in a fixed interval of time, and is named after the French mathematician Siméon Poisson. The pdf of the Poisson distribution is given in terms of the number of events k in the interval as
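The Poisson pdf and its cumulative sum can be sketched as follows; the function names are illustrative:

```python
import math

def poisson_pdf(k, lam):
    """P(K = k) = lam^k * exp(-lam) / k! for k = 0, 1, 2, ..."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def poisson_cdf(k, lam):
    """P(K <= k): cumulative sum of the pdf."""
    return sum(poisson_pdf(i, lam) for i in range(k + 1))

print(round(poisson_pdf(0, 2.0), 4))    # P(no events) = exp(-2)
print(round(poisson_cdf(10, 2.0), 4))   # nearly all mass lies below k = 10
```

Here λ plays the role of the expected number of events in the interval; small λ concentrates the mass near k = 0.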

Fig. 8.5 Plot of Gamma pdfs for different values of α and β

Fig. 8.6 Plot of Gamma cdfs for different values of α and β

Fig. 8.7 Plot of Poisson pdfs for different values of λ

Fig. 8.8 Plot of Poisson cdfs for different values of λ

8.5 Conclusion

In this chapter, we studied various methods based on the probabilistic approach. These methods start with some fundamentally different assumptions compared to the other methods, specifically the ones based on Bayesian theory. The use of prior knowledge separates them from all the other methods. If available, this prior knowledge can improve the model performance significantly, as we saw in the fully worked example. We concluded the chapter by studying a number of different probability distributions along with their density and cumulative functions.

AI Dynamic Programming and Reinforcement Learning

9.1 Introduction

The theory of dynamic programming was developed by Bellman [38] in the 1950s. In the preface of his iconic book, he defines dynamic programming as follows: "The purpose of this book is to provide introduction to the mathematical theory of multi-stage decision process. Since these constitute a somewhat complex set of terms we have coined the term dynamic programming to describe the subject matter." This is a very interesting and apt name, as the set of methods that come under the umbrella of dynamic programming is quite vast. These methods are deeply rooted in pure mathematics, but are more favorably expressed so that they can be directly implemented as computer programs. In general, such multi-stage decision problems appear in all sorts of industrial applications, and tackling them is always a daunting task. However, Bellman describes a structured and sometimes iterative manner in which the problem can be broken down and solved sequentially. There exists a notion of a state machine in the sequential solution of the subproblems, and the context of subsequent problems changes dynamically based on the solutions of previous problems. This non-static behavior of the methods imparts the name dynamic. These types of problems also marked the early stages of Artificial Intelligence in the form of expert systems.

9.2 Fundamental Equation of Dynamic Programming

In general, the problem that dynamic programming tries to solve can be given in the form of a single equation, called the Bellman equation (Fig. 9.1). Let us consider a process that goes through N steps. At each step there exist a state and a set of possible actions. Let the initial state be s0 and the first action taken be a0. We also

Fig. 9.1 The setup for the Bellman equation

constrain the set of possible actions in step t as at ∈ A(st), where A(st) denotes the set of actions available in state st. Depending on the action taken, the next state is reached. Let us call the function that combines the current state and action and produces the next state T(s, a); hence s1 = T(s0, a0). As the process goes through multiple states, let the problem we are trying to solve be to optimize a value function at step t, V(st). The optimality principle, stated in iterative manner, is: "In order to have the optimum value in the last step, one needs to have the optimum value in the previous step that will lead to the final optimum value". To translate this into an equation, we can write
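The Bellman recursion can be illustrated on a tiny shortest-path problem. The graph below is a made-up example, and value() implements V(s) = min over available actions of [cost(s, s') + V(s')] with memoization, which is exactly the breaking-down of a multi-stage problem into subproblems described above:

```python
# A tiny DAG of named states with edge costs (an illustrative example).
graph = {
    "s0":   {"a": 1.0, "b": 4.0},
    "a":    {"b": 2.0, "goal": 6.0},
    "b":    {"goal": 3.0},
    "goal": {},
}

def value(state, memo=None):
    """Bellman recursion: V(s) = min over actions of cost(s, s') + V(s').
    The memo dict caches solved subproblems so each state is solved once."""
    if memo is None:
        memo = {}
    if state == "goal":
        return 0.0                      # terminal state has zero cost-to-go
    if state not in memo:
        memo[state] = min(cost + value(nxt, memo)
                          for nxt, cost in graph[state].items())
    return memo[state]

print(value("s0"))   # cheapest route s0 -> a -> b -> goal costs 1 + 2 + 3
```

Each call solves the "previous step" subproblems first, which is the optimality principle stated above expressed as code.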

9.3 Classes of Problems Under Dynamic Programming Dynamic programming defines a generic class of problems that share the same assumptions as the theory of machine learning. The exhaustive list of problems that can be solved using the theory of dynamic programming is quite large, as can be seen in [4]. However, the most notable classes of problems that are studied and applied are: • Travelling salesman problem

• Recursive Least Squares (RLS) method • Finding the shortest distance between two nodes in a graph • Viterbi algorithm for solving hidden Markov models (HMMs) Other than these specific problems, the area that is most relevant in the context of modern machine learning is reinforcement learning and its derivatives. We will study these concepts in the rest of the chapter.

9.4 Reinforcement Learning Most of the machine learning techniques we have explored so far and will explore in later chapters primarily focus on two types of learning: (1) supervised and (2) unsupervised. Both are classified based on the availability of labelled data. However, neither really focuses on interaction with the environment. Even in supervised learning techniques, the labelled data is available beforehand. Reinforcement learning takes a fundamentally different approach towards learning, one that follows the biological aspects of learning more closely. When a newborn baby starts interacting with the environment, its learning begins. In the initial period, the baby makes mostly random actions and is greeted by the environment in some way. This is reinforcement learning; it cannot be classified into either of the two earlier types. Let us look at some of the fundamental characteristics of reinforcement learning to understand precisely how it differs from these methods. The reinforcement learning framework is based on interaction between two primary entities: (1) the system and (2) the environment.

9.4.1

Characteristics of Reinforcement Learning

1. There is no preset labelled training data available. 2. The action space is predefined and typically can contain a very large number of possible actions that the system can take at any given instance.

3. The system chooses to make an action at every instance of time. The meaning of an instance is different for each application.


4. At every instance of time a feedback (also called a reward) from the environment is recorded. It can be positive, negative, or neutral. 5. There can be a delay in the feedback. 6. The system learns while interacting with the environment. 7. The environment is not static, and every action made by the system can potentially change the environment itself. 8. Due to the dynamic nature of the environment, the total training space is practically infinite. 9. The training phase and application phase are not separate in reinforcement learning. The model is continuously learning even as it is predicting.

9.4.2

Framework and Algorithm

It is important to note that reinforcement learning is a framework and not an algorithm like most other methods discussed in this book; hence it can only be compared with other learning frameworks such as supervised learning. When an algorithm follows the above characteristics, it is considered a reinforcement learning algorithm. Figures 9.2 and 9.3 show the architectures of the reinforcement learning and supervised learning frameworks. Unsupervised learning is completely different, as it does not involve any type of feedback or labels, and is not considered here.

Fig. 9.2 Architecture of supervised learning

Fig. 9.3 Architecture of reinforcement learning

9.5 Exploration and Exploitation Reinforcement learning introduces two new concepts into the process of learning, called exploration and exploitation. As the system starts its learning process, there is no knowledge learned so far and every action taken by the system is purely random. This is called exploration. During exploration, the system is just trying out different possible actions and registering the feedback from the environment as a positive, negative, or neutral reward. After a while in the learning phase, when sufficient feedback is gathered, the system can start using the knowledge learned from previous exploration and start producing actions that are not random but deterministic. This is called exploitation. Reinforcement learning needs to find a good tradeoff between exploration and exploitation. Exploration opens up more possible actions that can lead to better long-term rewards in the future, at the cost of lower possible rewards in the short term, while exploitation tends to get better short-term rewards at the cost of possibly missing out on greater long-term rewards from actions not yet known.
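A common way to implement this tradeoff is an ε-greedy rule: with a small probability ε the system explores a random action, and otherwise it exploits the best-known action. The following minimal sketch is illustrative; the function name and the numbers are assumptions, not from the text:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action); otherwise exploit
    the action with the highest current value estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation

random.seed(0)
q = [0.1, 0.9, 0.4]                       # current value estimates for 3 actions
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
print(picks.count(1) / 1000)              # mostly the best-known action (index 1)
```

Annealing ε from a high value towards a low one over time shifts the system gradually from exploration to exploitation.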

9.6 Examples of Reinforcement Learning Applications The theory of reinforcement learning is better understood after looking at some real-life applications: 1. Chess programs: Solving the problem of winning a chess game by computer is one of the classic applications of reinforcement learning. Every move made by either side opens up a new position on the board. The ultimate objective is to capture the king of the other side, but the short-term goals are to capture the other side's pieces, gain control of the center, etc. The action space is practically infinite, as there are 32 pieces in total on 64 squares and each piece has different types of moves allowed. One conservative estimate of the number of possible move sequences in chess is around 10^120, also known as the Shannon number [10]. Deep Blue, a supercomputer by IBM specifically designed to play chess, was able to defeat the reigning world champion Garry Kasparov in 1997 [11]. This was considered a landmark in machine learning. It did use some elements of reinforcement learning, but it was heavily augmented with a huge database of past games. Since then, computers have become increasingly better at the game. However, the real victory of reinforcement learning came with Google's AlphaZero system. This system was trained by playing against itself and learning all the concepts of chess on its own. After just 9 hours of training, and without any knowledge of previously played games, it was able to defeat the reigning world champion chess program, Stockfish, in 2017 [60]. 2. Robotics: Training a robot to maneuver in the complex real world is a classic reinforcement learning problem that closely resembles biological learning. In this case, the action space is defined by the combination of the ranges of the robot's moving parts, and the environment is the area in which the robot needs to maneuver along with all the objects in it. 
If we want to train the robot to lift an object from one place and drop it at another, the rewards would be given accordingly. 3. Video games: Solving video games is another interesting application of reinforcement learning. A video game creates a simulated environment in which the user needs to navigate and achieve certain goals, in the form of, say, winning a race or killing a monster. Only certain combinations of moves allow the user to pass through the various challenging levels. The action space is also well defined, in the form of up, down, left, right, accelerate, brake, or attack with a certain weapon, etc. OpenAI has created a platform for testing reinforcement

learning models on video games in the form of Gym [13]. One application uses Gym to solve stages of the classic game Super Mario [12]. 4. Personalization: Various e-commerce and streaming websites like Amazon and Netflix have most of their content personalized for each user. This can be achieved with the use of reinforcement learning as well. The action space here is the set of possible recommendations, and the reward is the user engagement that results from a certain recommendation.

9.7 Theory of Reinforcement Learning Figure 9.4 shows the signal flow and value updates in the reinforcement learning architecture. s_k denotes the state of the system, which is a combination of the environment and the learning system itself at time instance k. a_k is the action taken by the system, and r_k is the reward given by the environment at the same time instance. π_k is the policy for determining the action at that time instance and is a function of the current state. V^π denotes the value function that updates the policy using the current state and reward.

Fig. 9.4 Reinforcement learning model architecture

9.7.1

Variations in Learning

This depiction of reinforcement learning combines several different methods into a single generic representation. The methods typically used are: 1. Q-learning

2. SARSA 3. Monte Carlo

9.7.1.1 Q-Learning In order to understand Q-learning, let's consider the most generic form of the Bellman equation, Eq. 9.3. In the Q-learning framework, the function Q(s, a) is called the action-value function. The technique of Q-learning focusses on learning the values of Q(s, a) for all the given state and action combinations. The algorithm of Q-learning can be summarized as: 1. Initialize the Q-table for all the possible state and action combinations. 2. Initialize the value of the discount factor β. 3. Choose an action using a tradeoff between exploration and exploitation. 4. Perform the action and measure the reward. 5. Update the corresponding Q value using Eq. 9.3. 6. Update the state to the next state. 7. Continue the iterations (steps 3–6) till the target is reached.
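The seven steps above can be sketched as a tabular Q-learning loop on a toy corridor environment. The environment, constants, and function name here are illustrative assumptions; β is the discount factor and α the learning rate:

```python
import random

def train_q_learning(n_states=5, episodes=500, alpha=0.5, beta=0.9, epsilon=0.1):
    """Tabular Q-learning on a corridor: start at state 0, reward 1 at the right end."""
    actions = [-1, +1]                            # step left or step right
    Q = [[0.0, 0.0] for _ in range(n_states)]     # step 1: initialize the Q-table
    random.seed(42)
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # step 3: epsilon-greedy tradeoff between exploration and exploitation
            if random.random() < epsilon:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] >= Q[s][1] else 1
            # step 4: perform the action and measure the reward
            s_next = min(max(s + actions[a], 0), n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # step 5: Q update with learning rate alpha and discount factor beta
            Q[s][a] += alpha * (r + beta * max(Q[s_next]) - Q[s][a])
            s = s_next                            # step 6: move to the next state
    return Q

Q = train_q_learning()
policy = [0 if q[0] >= q[1] else 1 for q in Q[:-1]]
print(policy)   # greedy policy per state; should be all 1s (always step right)
```

After enough episodes, the greedy policy derived from the Q-table steps right in every state, since that is the only way to reach the reward.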

9.7.1.2 SARSA SARSA stands for state-action-reward-state-action [66]. The SARSA algorithm is an incremental update to Q-learning that adds learning based on the policy. Hence it is also sometimes called on-policy Q-learning, whereas traditional Q-learning is off-policy. The update equation for SARSA can be given as
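The update, written with the discount factor β and learning rate α defined in the surrounding text, takes the standard SARSA form:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \beta \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \Big]
```

Note that the update uses the action a_{t+1} actually chosen by the current policy in the next state, rather than the maximizing action, which is what makes SARSA on-policy.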

where β is the discount factor as before, and α is called the learning rate.

9.8 Conclusion In this chapter, we studied methods belonging to the class of dynamic programming as defined by Bellman. The specific case of reinforcement learning and its variations is a topic of its own, and we devoted a section to studying these concepts and their applications. Reinforcement learning marks a whole new type of learning that resembles human learning more closely than traditional supervised and unsupervised techniques do. It enables a fully automated way of learning in a given environment. These techniques are becoming quite popular in the context of deep learning, and we will study those aspects in later chapters.

AI Evolutionary Algorithms

10.1 Introduction All the traditional algorithms, including the new deep learning frameworks, tackle the problem of optimization using the calculus of gradients. These methods have evolved significantly to solve harder problems that were once considered impossible. However, the reach of such algorithms is essentially local and predictable. Evolutionary algorithms attack optimization problems in a fundamentally different manner: massive exploration in a random but supervised manner. This approach opens up whole new types of solutions for the problems at hand. Also, these methods are inherently suitable for embarrassingly parallel computation, which is the mantra of modern GPU-based computation.

10.2 Bottleneck with Traditional Methods In machine learning applications, one comes across many problems for which it is practically impossible to find a universally optimal solution. In such cases one has to be satisfied with a solution that is optimal within some reasonable local neighborhood (the neighborhood is from the perspective of the hyperspace spanned by the feature values). Figure 10.1 shows an example of such a space. Most traditional methods employ some form of linear search in a greedy manner.1 In order to see a greedy method in action, let us zoom into the previous

1 In general, all the algorithms that use gradient-based search are called greedy algorithms. These algorithms use the fact from calculus that at any local optimum (minimum or maximum) the value of the gradient is 0. To distinguish whether the optimum is a minimum or a maximum, the second-order gradient is used. When the second-order gradient is positive, a minimum has been reached; otherwise it is a maximum.


Fig. 10.1 Example of complex search space with multiple local minima and a unique single global minimum

figure, as shown in Fig. 10.2. The red arrows show how the greedy algorithm progresses based on the gradient search and results in a local minimum.

10.3 Darwin's Theory of Evolution There exists a very close parallel to this problem in the theory of natural evolution. In any given environment, there exists a complex set of constraints within which all the inhabitant animals and plants are fighting for survival and evolving in the process. The setup is quite dynamic, and a perfectly ideal species does not exist for any given environment at all times. All species have some advantages and some limitations at any given time. The evolution of species is governed by Darwin's theory of evolution by natural selection. The theory can be stated briefly as:

Over a sufficiently long span of time, only those individual organisms survive in a given environment that are better suited to the environment.

Fig. 10.2 Example of greedy search in action resulting into a local minimum

This time-span can extend over multiple generations, and its effects are typically not seen in a matter of a few years or even a few hundred years. However, over a few thousand years or more, the effect of evolution by natural selection is seen beyond doubt. There is one more aspect to the theory of evolution without which it cannot work, and that is random variation in the species, which typically happens by the process of mutation. If the process of reproduction kept producing offspring identical to the parents, there would never be any change in the setup and natural selection could not happen. However, when we add random variation in the offspring during each reproduction, it changes everything. The new features that are created as a result of mutation are put to the test in the environment. If the new features help the organism cope better with the environment, the organisms with those features thrive and tend to reproduce more than the organisms that do not have them. Thus, over time, the offspring of the weaker organisms go extinct and the species as a whole evolves. As a

result of continued evolution, over time, the species get better and better at surviving in the given environment. In the long run, the process of evolution leads the species in the direction of better adaptation, and at the aggregate level it never retrogresses. These characteristics of evolution are well suited to the problems described at the beginning of the chapter. All the algorithms that come under the umbrella of evolutionary algorithms are inspired by this concept. Each such algorithm interprets the concepts of random variation, environmental constraints, and natural selection in its own way to create a resulting evolution. Fortunately, as the processes of random variation and natural selection are implemented on computers running at GHz speeds, they can happen in a matter of mere seconds, compared to the thousands or hundreds of thousands of years needed in the biological setup. Thus, evolutionary methods rely on creating an initial population of randomly chosen samples, instead of the single starting point used in greedy methods. Then they let the processes of mutation-based sample variation and natural selection do their job of finding which samples evolve into better estimates. Evolutionary algorithms thus do not guarantee a global minimum either, but they typically have a higher chance of finding one. The following sections describe a few of the most common examples of evolutionary algorithms. 10.4 Genetic Programming Genetic programming models try to implement Darwin's idea as closely as possible. They map the concepts of genetic structure to solution spaces and implement the concepts of natural selection and reproduction, with the possibility of mutation, in a programmatic manner. Let us look at the steps in the algorithm.

Steps in Genetic Programming 1. Set the process parameters, such as the stopping criteria, mutation fraction, etc. 2. Initialize the population of solution candidates using random selection. 3. Create a fitness index based on the problem at hand.

4. Apply the fitness index to all the population candidates and trim the number of candidates to a predetermined value by eliminating the lowest-scoring candidates. 5. Randomly select pairs of candidates from the population as parents and carry out the process of reproduction. The process of reproduction can contain two alternatives: (a) Crossover: In crossover, the parent candidates are combined in a predefined structural manner to create the offspring. (b) Mutation: In mutation, the children created by the process of crossover are modified randomly. The mutation is applied only to a fraction of the offspring, as determined by one of the settings of the process. 6. Augment the original population with the newly created offspring. 7. Repeat steps 4, 5, and 6 till the desired stopping criteria are met.

Although the steps listed in the algorithm are fairly straightforward from a biological standpoint, they need customization based on the problem at hand for programmatic implementation. In order to illustrate the complexity of this customization, let us take a real problem. A classic problem that is very hard to solve using traditional methods but is quite suitable for genetic programming is the travelling salesman problem. The problem is as follows: a salesman wants to travel to n destinations in sequence. The distance between each pair of destinations is given as d_ij, where i, j ∈ {1, 2, . . . , n}. The problem is to select the sequence that connects all the destinations, visiting each only once, with the shortest overall distance. Although apparently straightforward, this problem is actually considered one of the hardest to solve,2 and a universally optimal solution to it is not feasible even when the number of destinations is as small as, say, 100. Let's try to solve this problem using the above steps.

1. Let us define the stopping criteria as either the successive improvement in distance becoming less than some value α, or a maximum number of iterations. 2. We first need to create a random population of solutions containing, say, k distinct solutions. Each solution is a random sequence of the destinations from 1 to n. 3. The fitness test is the sum of distances between successive destinations. 4. We keep the top k candidates when sorted in increasing order of total distance. 5. The reproduction step is where things get a little tricky. First we choose two parents randomly. Now, let's consider the two cases one by one. a. For crossover, we select the first k1 destinations, k1 < n, directly from parent-1 and then the remaining destinations from parent-2. However, this simple crossover can lead to duplicating some destinations and missing others in the new sequence, called the offspring sequence. These errors need to be fixed by appropriate adjustment.

2 This problem belongs to a class of problems called NP-hard, which stands for nondeterministic polynomial time hard problems [27]. The worst-case solution time for this problem increases in near exponential time and quickly goes beyond the scope of current hardware.

b. For mutation, once the crossover-based offspring sequence is generated, some destinations are randomly swapped. 6. Once we have reproduced the full population, we will have a population of twice the size. Then we can repeat the above steps as described in the algorithm till the stopping criteria are reached. Unfortunately, due to the random factor in the design of genetic programs, one

cannot have a deterministic bound on how much time it will take to reach an acceptable solution, how large the population should be, what percentage of mutations to use, etc. One has to experiment with multiple values of these parameters to find the optimal setting for each given case. In spite of this uncertainty, genetic programs are known to provide significant improvements in computation time for certain types of problems and are in general a strong tool to have in the machine learning toolkit.
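The TSP recipe above can be sketched as a short program. The fitness-based trimming, crossover with duplicate adjustment, and swap mutation follow the numbered steps; the population size, mutation fraction, and helper names are illustrative choices:

```python
import math
import random

def tour_length(tour, dist):
    """Step 3's fitness: total distance over successive destinations."""
    return sum(dist[tour[i]][tour[i + 1]] for i in range(len(tour) - 1))

def crossover(p1, p2, k1):
    """Step 5a: first k1 stops from parent-1, the rest in parent-2's order.
    Skipping duplicates is the 'appropriate adjustment' mentioned in the text."""
    head = p1[:k1]
    return head + [d for d in p2 if d not in head]

def mutate(tour):
    """Step 5b: swap two randomly chosen destinations."""
    i, j = random.sample(range(len(tour)), 2)
    tour[i], tour[j] = tour[j], tour[i]

def evolve_tsp(dist, k=30, generations=200, mutation_frac=0.3):
    n = len(dist)
    pop = [random.sample(range(n), n) for _ in range(k)]      # step 2: random population
    for _ in range(generations):
        pop.sort(key=lambda t: tour_length(t, dist))          # step 4: keep the
        pop = pop[:k]                                         # k shortest tours
        children = []
        while len(children) < k:                              # step 5: reproduction
            p1, p2 = random.sample(pop, 2)
            child = crossover(p1, p2, random.randrange(1, n))
            if random.random() < mutation_frac:
                mutate(child)
            children.append(child)
        pop += children                                       # step 6: augment population
    return min(pop, key=lambda t: tour_length(t, dist))

# 8 destinations on a circle: the best open tour follows the circular order.
random.seed(1)
pts = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8)) for i in range(8)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
best = evolve_tsp(dist)
print(best, tour_length(best, dist))
```

Because the best candidates always survive the trimming step, the best tour length is non-increasing from generation to generation, even though reproduction itself is random.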

10.5 Swarm Intelligence Swarm intelligence is a generic term used to denote algorithms that are influenced by the biological behavior of groups of primitive organisms. The origin of swarm intelligence techniques can be traced back to 1987, when Craig Reynolds published his work on boids [68]. In this work Reynolds designed a simulated flock of birds and assigned a set of rules governing the behavior of each bird in the flock. When the behavior of the group is aggregated over time, some completely startling and non-trivial trends emerge. This behavior can be attributed to the saying that sometimes 1 + 1 > 2. When a single bird is considered as a singular entity and let loose in the same environment, it has no chance of survival. If all the birds in the flock act as single entities, they are all likely to perish. However, when the birds form a social group that communicates internally without any specific governing body, the abilities of the group improve significantly. Some of the very long migrations of birds are classic examples of the success of swarm intelligence. In recent years, swarm intelligence has found applications in computer graphics for simulating groups of animals or even humans in movies and video games. As a matter of fact, the encyclopedia of the twenty-first century, Wikipedia, can also be attributed to swarm intelligence. The techniques of swarm intelligence are also finding applications in controlling groups of autonomous flying drones. In general, the steps in designing an algorithm based on swarm intelligence can be outlined as follows: 1. Initialize the system by introducing a suitable environment and defining its constraints. 2. Initialize the individual organisms by defining the rules of possible actions and possible ways of communicating with others. 3. Establish the number of organisms and the period of evolution.

4. Define the individual goals for each organism and the group goals for the whole flock, as well as the stopping criteria. 5. Define the randomness factor that will affect the decisions made by individual organisms, trading between exploration and exploitation. 6. Repeat the process till the finishing criteria are met.
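A Reynolds-style flock can be sketched minimally. The toy below implements only two of the boids rules, cohesion (steer towards the flock's center) and alignment (match the average velocity); the coefficients and names are illustrative assumptions, and the full boids model also includes a separation rule:

```python
import random

def step(boids, cohesion=0.05, alignment=0.1):
    """One update of a 2-D flock. Each boid is [x, y, vx, vy]; each adjusts its
    velocity towards the flock's center of mass and average velocity."""
    n = len(boids)
    cx = sum(b[0] for b in boids) / n          # flock center of mass
    cy = sum(b[1] for b in boids) / n
    avx = sum(b[2] for b in boids) / n         # flock average velocity
    avy = sum(b[3] for b in boids) / n
    for b in boids:
        b[2] += cohesion * (cx - b[0]) + alignment * (avx - b[2])
        b[3] += cohesion * (cy - b[1]) + alignment * (avy - b[3])
        b[0] += b[2]
        b[1] += b[3]

def spread(boids):
    """How dispersed the flock is: sum of squared distances from the centroid."""
    cx = sum(b[0] for b in boids) / len(boids)
    cy = sum(b[1] for b in boids) / len(boids)
    return sum((b[0] - cx) ** 2 + (b[1] - cy) ** 2 for b in boids)

random.seed(3)
flock = [[random.uniform(-10, 10), random.uniform(-10, 10), 0.0, 0.0] for _ in range(20)]
before = spread(flock)
for _ in range(50):
    step(flock)
print(spread(flock) < before)   # the scattered individuals pull together
```

No boid has a global plan; the coherent flocking emerges purely from each individual applying the same local rules, which is the aggregate behavior the section describes.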

10.6 Ant Colony Optimization Although ant colony optimization can be considered a subset of swarm intelligence, there are some unique aspects to this method and it needs separate consideration. Ant colony optimization algorithms, as the name suggests, are based on the behavior of large groups of ants in a colony. An individual ant possesses a very limited set of skills: it has very limited vision and in many cases is completely blind, it has a very small brain with very little intellect, and its auditory and olfactory senses are not very advanced either. In spite of these limitations, ant colonies as such are known to have some extraordinary capabilities, like building complex nests and finding the shortest path towards food sources that can be at large distances from the nest. Another important aspect of ant colonies is that they are a completely decentralized system. There is no central decision maker, no king or queen ant that orders the group to follow certain actions. All decisions and actions are decided and executed by each individual ant based on its own method of functioning. For the ant colony optimization algorithm, we are specifically going to focus on the ants' capability of finding the shortest path from a food source to the nest. At the heart of this technique lies the concept of pheromones. A pheromone is a chemical substance that is dropped by each ant as it passes along any route. These dropped pheromones are sensed by the ants that later follow the same path. When an ant reaches a junction, it chooses the path with the higher level of pheromones with higher probability. This probabilistic behavior combines random exploration with exploitation of the paths travelled by other ants. The path that connects the food source with the nest over the least distance is likely to be used more often than the other paths. 
This creates a form of positive feedback, and the shortest path keeps getting chosen more and more often over time. In biological terms, the shortest path evolves over time. This is quite a different way of interpreting the process of evolution, but it conforms to the fundamentals nonetheless. All the different paths connecting the nest with the food source mark the initial population. The subsequent choices of different

paths, similar to the process of reproduction, are governed in a probabilistic manner using pheromones. The positive feedback created by the aggregation of pheromones then acts as a fitness test and controls the evolution in general. These biological concepts related to the emission of pheromones and their decay and aggregation can be modelled using mathematical functions to implement this algorithm programmatically. The travelling salesman problem is also a good candidate for the ant colony optimization algorithm; it is left as an exercise for the reader to experiment with this implementation. It should be noted that, as the ant colony optimization algorithm has the graph-based nature of the solution at its heart, it has a relatively limited scope compared to genetic programs.

10.7 Simulated Annealing Simulated annealing [67] is the odd man out in this group of evolutionary algorithms, as it finds its origins in metallurgy rather than biology. The process of annealing involves heating a metal above a certain temperature, called the recrystallization temperature, and then slowly cooling it down. When the metal is heated above the recrystallization temperature, the atoms and molecules involved in the crystallization process can move. Typically this movement occurs such that defects in the crystallization are repaired. After the annealing process is complete, the metal typically improves its ductility and machinability as well as its electrical conductivity. The simulated annealing process is applied to the problem of finding a global minimum (or maximum) in a solution space that contains multiple local minima (or maxima). The idea can be described using Fig. 10.1. Let's say that from some initial starting point, the gradient descent algorithm converges to the nearest local minimum. The simulated annealing program then generates a disturbance in the solution by essentially throwing the algorithm's current state to a random point in a predefined neighborhood. It is expected that the new starting point leads to another local minimum. If the new local minimum is smaller than the previous one, it is accepted as the solution; otherwise the previous solution is preserved. The algorithm is repeated till the stopping criteria are reached. By adjusting the neighborhood radius, which corresponds to the temperature in metallurgical annealing, the algorithm can be fine-tuned for better performance.
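The procedure described above can be sketched as follows. This implements the text's accept-only-if-better variant, with a shrinking neighborhood radius playing the role of temperature; classical simulated annealing additionally accepts some worse moves with a temperature-dependent probability. The test function, constants, and names are illustrative assumptions:

```python
import math
import random

def f(x):
    """A 1-D function with several local minima and its global minimum at x = 0."""
    return x * x + 10 * math.sin(x) ** 2

def anneal(f, x0, radius=5.0, cooling=0.95, iters=300):
    """Perturb within a shrinking neighborhood and keep only improvements.
    The radius acts as the temperature: large early (wide jumps that can escape
    a local minimum), small late (fine refinement near the best point found)."""
    x = x0
    random.seed(7)
    for _ in range(iters):
        candidate = x + random.uniform(-radius, radius)   # random disturbance
        if f(candidate) < f(x):                           # accept only if better
            x = candidate
        radius *= cooling                                 # cool down
    return x

x = anneal(f, x0=3.0)        # starts near the local minimum at x ≈ pi
print(x, f(x))               # x should end up near the global minimum at 0
```

A plain gradient descent started at x = 3.0 would stay trapped near the local minimum at x ≈ π; the early wide disturbances let the search jump into the global basin before the cooling freezes it there.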

10.8 Conclusion In this chapter, we studied different algorithms inspired by the biological aspects of evolution and adaptation. In general, the entire theory of machine learning is inspired by human intelligence, but the various algorithms used to achieve that goal may not be directly applicable to humans, or even to other organisms for that matter. The evolutionary algorithms, however, are specifically designed to solve some very hard problems using methods that are used by different organisms individually or as a group.

AI Time Series Models

11.1 Introduction All the algorithms discussed so far are based on static analysis of the data. By static it is meant that the data used for training purposes is constant and does not change over time. However, there are many situations where the data is not static: for example, the analysis of stock trends, weather patterns, audio or video signals, etc. Static models can be used to a certain extent to solve some problems dealing with dynamic data, by taking snapshots of the time series data at certain times. These snapshots can then be used as static data to train the models. However, this approach is seldom optimal and usually produces less than ideal results. Time series analysis has been studied quite extensively for centuries as part of statistics and signal processing, and the theory is quite mature. Typical applications of time series analysis involve trend analysis, forecasting, etc. In signal processing theory, time series analysis also deals with the frequency domain, which leads to spectral analysis. These techniques are extremely powerful in handling dynamic data. We are going to look at this problem from the perspective of machine learning, and we will not delve too much into the signal processing aspects of the topic, which essentially represent fixed-mode analysis. The essence of machine learning is feedback. When a certain computation is performed on the training data and a result is obtained, the result must somehow be fed back into the computation to improve it. If this feedback is not present, then it is not a machine learning application. We will use this concept as a yardstick to separate the pure signal processing or statistical algorithms from machine learning algorithms and only focus on the latter. 11.2 Stationarity Stationarity is a core concept in the theory of time series, and it is important to understand some of its implications before going on to modelling the processes. 
Stationarity, or a stationary process, is defined as a process for which the unconditional joint probability distribution of its parameters does not change over time. Sometimes this definition is also referred to as strict stationarity. A more practical definition, based on the normality assumption, would be that the mean and variance of the process remain constant over time. These conditions make the process strictly stationary only when the normality condition is satisfied. If that is not satisfied, then the process is called weakly stationary or wide-sense stationary. In general, when a process is non-stationary, the joint probability distribution of its parameters changes over time, or the mean and variance of its parameters are not constant. It becomes very hard to model such a process. Although most processes encountered in real life are non-stationary, we always make the

assumption of stationarity to simplify the modelling process. We then add the concepts of trends and seasonality to partially address the effects of non-stationarity. Seasonality means that the mean and variance of the process can change periodically with changing seasons. Trends essentially model slow changes in the mean and variance with time. We will first see simple models that are built on the assumption of stationarity, and then we will look at some of their extensions that take seasonality into consideration. To understand the nuances of trends and seasonality, let's look at the plots shown in Figs. 11.1 and 11.2. The plot of the Microsoft stock price seems almost periodic, with a period of roughly 6 months and an upward trend, while the Amazon stock plot shows irregular changes with an overall downward trend. On top of these macro trends, there is additional periodicity on a daily basis. Figure 11.3 shows the use of mobile phones per 100 people in Germany from 1960 to 2017. This plot does not show any periodicity, but there is a clear upward trend

Fig. 11.1 Plot showing Microsoft stock price on daily basis during calendar year 2001

Fig. 11.2 Plot showing Amazon stock price on daily basis during calendar year 2001

Fig. 11.3 Plot showing mobile phone use in Germany per 100 people from 1960 to 2017. The data is courtesy of [7]

The trend in the values is not linear and not uniform. Thus, it represents a good example of a non-stationary time series.

11.3 Autoregressive and Moving Average Models

Autoregressive moving average, or ARMA, analysis is one of the simplest techniques of univariate time series analysis. As the name suggests, this technique is based on two separate concepts: autoregression and moving average. In order to define the two processes mathematically, let's start by defining the system. Let there be a discrete-time system that takes white noise inputs denoted as $\epsilon_i$, $i = 1, \ldots, n$, where $i$ denotes the instant of time. Let the output of the system be denoted as $x_i$, $i = 1, \ldots, n$. For ease of definition and without loss of generality, let's assume all these variables are univariate and numerical.

11.3.1 Autoregressive, or AR Process

An autoregressive, or AR, process is a process in which the current output of the system is a weighted sum of a certain number of previous outputs plus the current noise input. We can define an autoregressive process of order p, AR(p), using the established notation as

$x_i = \sum_{j=1}^{p} a_j x_{i-j} + \epsilon_i$
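The AR definition can be made concrete with a short simulation. The following sketch (assuming NumPy; the order, coefficients and sample size are arbitrary illustrative choices) simulates an AR(2) process and then recovers its coefficients by ordinary least squares, regressing each output on its two predecessors:

```python
import numpy as np

rng = np.random.default_rng(42)
a1, a2 = 0.6, -0.3          # true AR(2) coefficients, chosen to give a stationary process
n = 5000
eps = rng.normal(0.0, 1.0, n)

x = np.zeros(n)
for i in range(2, n):
    # AR(2): current output is a weighted sum of the two previous outputs
    # plus the current white noise input.
    x[i] = a1 * x[i - 1] + a2 * x[i - 2] + eps[i]

# Estimate the coefficients by ordinary least squares:
# regress x[i] on (x[i-1], x[i-2]).
X = np.column_stack([x[1:-1], x[:-2]])
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # close to [0.6, -0.3]
```

With enough samples the least-squares estimates converge to the true coefficients, which is the simplest way of fitting an AR model to data.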

11.3.2 Moving Average, or MA Process

A moving average process is always stationary by design. A moving average, or MA, process is a process in which the current output is a moving average of a certain number of past values of the driving white noise process. We can define a moving average process of order q, MA(q), as

$x_i = \epsilon_i + \sum_{j=1}^{q} b_j \epsilon_{i-j}$

11.3.3 Autoregressive Moving Average (ARMA) Process

Now, we can combine the two processes into a single ARMA(p, q) process with parameters p and q as

$x_i = \sum_{j=1}^{p} a_j x_{i-j} + \epsilon_i + \sum_{j=1}^{q} b_j \epsilon_{i-j}$
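The combined process can be simulated directly from its definition. A sketch (assuming NumPy; the coefficients are illustrative) for an ARMA(1, 1) series:

```python
import numpy as np

# Simulating an ARMA(1, 1) process: x_i = a1*x_{i-1} + eps_i + b1*eps_{i-1}.
rng = np.random.default_rng(7)
a1, b1 = 0.7, 0.4
n = 1000
eps = rng.normal(0.0, 1.0, n)

x = np.zeros(n)
for i in range(1, n):
    x[i] = a1 * x[i - 1] + eps[i] + b1 * eps[i - 1]

# With |a1| < 1 the process is stationary: the sample mean stays near zero
# and the variance settles to a constant value.
print(x.mean())
print(x.var())
```

Swapping the coefficients changes how much of the series' memory comes from past outputs (the AR part) versus past noise inputs (the MA part).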

11.4 Autoregressive Integrated Moving Average (ARIMA) Models

Although an ARMA(p, q) process can in general be non-stationary, it cannot explicitly model a non-stationary process well. This is why the ARIMA process was developed. The added term, integrated, adds differencing terms to the equation. The differencing operation, as the name suggests, computes the deltas between consecutive values of the outputs as

$x'_i = x_i - x_{i-1} \qquad (11.9)$

Equation 11.9 shows the first-order differences. The differencing operation in discrete time is similar to the differentiation or derivative operation in continuous time. First-order differencing can make polynomial-based second-order non-stationary processes into stationary ones, just as differentiating a second-order polynomial equation leads to a linear equation. Processes with higher polynomial-order non-stationarity need higher-order differencing to convert them into stationary processes. For example, second-order differencing can be defined as

$x''_i = x'_i - x'_{i-1} = x_i - 2x_{i-1} + x_{i-2}$
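The effect of differencing on polynomial non-stationarity can be checked numerically. A sketch assuming NumPy, whose `np.diff` computes exactly the differences defined above:

```python
import numpy as np

t = np.arange(10, dtype=float)
x = 2.0 * t**2 + 3.0 * t + 1.0   # a second-order polynomial trend

d1 = np.diff(x)        # first-order differences: still trending (linear in t)
d2 = np.diff(x, n=2)   # second-order differences: constant

print(d1)  # linearly increasing values
print(d2)  # all equal to 4.0 (twice the leading coefficient)
```

Just as the second derivative of a quadratic is constant, the second-order differences of a quadratic trend are constant, i.e., the differenced series no longer has a trend.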

11.5 Hidden Markov Models (HMM)

Hidden Markov models, or HMMs, represent a popular generative modelling tool in time series analysis. HMMs evolved from Markov processes in statistical signal processing. Consider a statistical process generating a series of observations represented as $y_1, y_2, \ldots, y_k$. The process is called a Markov process if the current observation depends only on the previous observation and is independent of all the observations before that. Mathematically it can be stated as

$P(y_k \mid y_{k-1}, y_{k-2}, \ldots, y_1) = P(y_k \mid y_{k-1})$

Fig. 11.4 A sequence of states and outcomes that can be modelled using hidden Markov models technique

Fig. 11.5 A full HMM with three states, shown as three different dice (red, blue, green), and six outcomes (1, 2, 3, 4, 5, 6)

The hidden state sequence itself obeys the Markov property,

$P(s_i \mid s_{i-1}, \ldots, s_1) = F_s(s_i \mid s_{i-1})$

where Fs is the probabilistic function of state transitions.

The observations depend only on the current hidden state,

$P(y_i \mid s_i) = F_o(y_i \mid s_i)$

where Fo is the probabilistic function of observations. Consider a real-life example with three different states represented by three dice: red, blue, and green. Each die is biased differently in producing the outcomes (1, 2, 3, 4, 5, 6). The state transition probabilities are given by $F_s(s_i \mid s_j)$ and the outcome probabilities are given as $F_o(o_j \mid s_i)$. Figure 11.5 shows the details of the model. Once a given problem is modelled using an HMM, various techniques exist to solve the optimization problem using training data and estimate all the transition and outcome probabilities [41].
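Given such a model, the probability of an observation sequence can be computed with the standard forward algorithm, which sums over all possible hidden state paths. A minimal sketch for the three-dice model, assuming NumPy; all transition and outcome probabilities below are made-up illustrative numbers, not values from the figure:

```python
import numpy as np

# Hypothetical numbers for the three-dice HMM: three hidden states
# (red, blue, green die) and six outcomes (faces 1..6).
Fs = np.array([[0.8, 0.1, 0.1],   # Fs[i, j] = P(next state j | current state i)
               [0.2, 0.6, 0.2],
               [0.3, 0.3, 0.4]])
Fo = np.array([[0.5, 0.1, 0.1, 0.1, 0.1, 0.1],   # red die biased toward face 1
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],   # blue die biased toward face 6
               [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]])  # green die is fair
pi = np.array([1/3, 1/3, 1/3])                   # uniform initial state

def forward(obs):
    """Forward algorithm: total probability of an observation sequence,
    summing over all hidden state paths."""
    alpha = pi * Fo[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ Fs) * Fo[:, o]
    return alpha.sum()

# Observed faces, 0-based: face '1' is index 0, face '6' is index 5.
print(forward([0, 0, 5]))
```

The same recursion underlies the training techniques mentioned above: the Baum-Welch procedure repeatedly evaluates such forward (and backward) quantities to re-estimate the transition and outcome probabilities.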

11.5.1 Applications

HMMs have been widely used to solve problems in natural language processing with notable success, for example part-of-speech (POS) tagging, speech recognition, and machine translation. As generic time series analysis problems, they are also used in financial analysis, genetic sequencing, etc. They are also used, with some modifications, in image processing applications such as handwriting recognition.

11.6 Conditional Random Fields (CRF)

Conditional random fields, or CRFs, represent a discriminative modelling tool, as opposed to HMMs, which are generative. CRFs were introduced by Lafferty et al. [69] in 2001. In spite

of having a fundamentally different perspective, CRFs share a significant amount of architecture with HMMs. In some ways CRFs can be considered a generalization of HMMs and logistic regression. Where generative models typically try to model the structure and distribution of each participating class, discriminative models try to model the discriminative properties between the classes, or the boundaries between the classes. Where HMMs model the state transition probabilities first and then the outcome or observation probabilities based on the states, CRFs directly model the conditional probability of the label sequence given the observations. The fundamental quantity a CRF models can be stated as

$P(L \mid X)$

where $X$ is the sequence of observations and $L$ the sequence of labels.

In order to model the sequential inputs and states, the CRF introduces feature functions. A feature function is defined based on four entities:

1. The input vectors X.
2. The instance i of the data point being predicted.
3. The label of the data point at the (i − 1)th instance, $l_{i-1}$.
4. The label of the data point at the ith instance, $l_i$.

The function is then given as

$f_j(X, i, l_{i-1}, l_i)$

Using this feature function, the conditional probability is written as

$P(l \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i} \sum_{j} \lambda_j f_j(X, i, l_{i-1}, l_i) \Big)$

where the $\lambda_j$ are the learned weights and the normalization constant $Z(X)$ is defined as

$Z(X) = \sum_{l'} \exp\Big( \sum_{i} \sum_{j} \lambda_j f_j(X, i, l'_{i-1}, l'_i) \Big)$
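For a very small label set and short sequences, the conditional probability and Z(X) can be computed by brute-force enumeration, which makes the definitions above concrete. A sketch assuming NumPy; the labels, feature functions and weights below are invented purely for illustration:

```python
import itertools
import numpy as np

# A toy linear-chain CRF over two labels, scored by hand-written feature
# functions f_j(X, i, l_prev, l_cur).
LABELS = ["NOUN", "VERB"]

def f0(X, i, l_prev, l_cur):
    # word ends with 's' and is labelled VERB
    return 1.0 if X[i].endswith("s") and l_cur == "VERB" else 0.0

def f1(X, i, l_prev, l_cur):
    # a NOUN immediately followed by a VERB
    return 1.0 if l_prev == "NOUN" and l_cur == "VERB" else 0.0

features = [f0, f1]
weights = [1.5, 2.0]   # the lambda_j

def score(X, labels):
    """Sum of weighted features over all positions of the sequence."""
    s = 0.0
    for i in range(len(X)):
        l_prev = labels[i - 1] if i > 0 else "START"
        s += sum(w * f(X, i, l_prev, labels[i])
                 for w, f in zip(weights, features))
    return s

def conditional_prob(X, labels):
    """P(labels | X) = exp(score) / Z(X), with Z computed by brute force."""
    Z = sum(np.exp(score(X, list(l)))
            for l in itertools.product(LABELS, repeat=len(X)))
    return np.exp(score(X, labels)) / Z

X = ["John", "gesticulates"]
print(conditional_prob(X, ["NOUN", "VERB"]))
```

In a real CRF the weights are learned from data and Z(X) is computed efficiently by dynamic programming rather than enumeration, but the probability being computed is exactly the one defined above.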

11.7 Conclusion

Time series analysis is an interesting area of machine learning that deals with data that changes with time. The entire thought process of designing a time series pipeline is fundamentally different from the one used in all the static models we studied in previous chapters. In this chapter, we studied the concept of stationarity, followed by multiple techniques to analyze and model time series data and generate insights.

Artificial Intelligence The Nature of Language

1.1 Syntax versus Semantics

It has been claimed many times that humans are somehow born for grammar and speech, with an innate ability to see the structure underlying a string of symbols. A classic example is the ease with which children pick up languages, an ability which, in turn, has undergone evolutionary pressure. Without language, knowledge cannot be passed on, but only demonstrated. For instance, chimps can show their offspring processes but cannot tell them about them, since a demonstration is required. Languages are, therefore, closely connected with information, sometimes quite crucially. Language can help you to make plans. Many of the Spanish conquistadores who conquered the Mesoamericans could not read, but their priests could. Moreover, being able to record language provides access to thousands of years of knowledge.

Generation and recognition of sentences pose two main problems for the concept of language as an assembly of a set of valid sentences. Though most textbooks deal with the recognition of languages, one cannot ignore the generation of language if one aspires to understand recognition seriously. A language can be described by a series of simple syntactic rules. For instance, English is a language defined by some simple rules, which are more loose than strict. This fact, however, also highlights that a language can be hard to define with only a series of simple syntactic rules. Consider the following sentences:

• John gesticulates
• John gesticulates vigorously
• The dog ate steak
• The dog ate ravenously

There are semantic rules (rules related to the meanings of sentences) in addition to the syntactic rules (rules regarding grammar). The rules usually specified are strictly syntactic and, at least for computer languages, the easiest to formulate.
The semantic rules, however, are notoriously difficult to formulate. They are anchored subconsciously in one's brain, associating concepts with words and structuring the words into phrases and groups of phrases that convey the intended meanings. At the syntactic level, working towards some grammatical patterns or rules in English, one might be doing this consciously. In almost every language there will always be a person or thing (a subject) and a verb describing an action (a verb phrase). In addition, there will sometimes be an object that the subject acts upon. In order to reflect these abstract

structures, one might find oneself using some other symbols acting as containers or patterns for sentences with a similar structure. For instance, one may end up with something like:

• Subject gesticulates
• Subject gesticulates vigorously
• Subject ate steak
• Subject ate ravenously

Next, abstract the verb phrases:

• Subject VerbPhrase
• Subject VerbPhrase
• Subject VerbPhrase steak
• Subject VerbPhrase

Finally, abstracting away the objects, we may end up with something like:

• Subject VerbPhrase
• Subject VerbPhrase
• Subject VerbPhrase Object
• Subject VerbPhrase

It is now easy to spot two main types of sentences that underpin the lexical-syntactic meanings of these four sentences:

• Subject VerbPhrase
• Subject VerbPhrase Object

You may also break down subject or verb phrases, with emerging sub-structures such as noun (e.g., John) and determiner-noun (e.g., The dog) for subject phrases, or verb (e.g., ate) and verb-adverb (e.g., ate ravenously) for verb phrases. Subsequently, you may end up with a finite language defined by the following rules of grammar:

1. A sentence is a subject followed by a verb phrase, optionally followed by an object.
2. A subject is a noun optionally preceded by a determiner.
3. A verb phrase is a verb optionally followed by an adverb.
4. A noun is John or dog.
5. A verb is gesticulates or ate.
6. An adverb is vigorously or ravenously.

7. An object is steak.
8. A determiner is The.

Although the structure of these rules might seem right, this is exactly where the problem lies with the meaning of the sentences and the semantic rules associated with them. For example, the rules allow you to say "dog gesticulates ravenously," which is perfectly meaningless, a situation frequently encountered when specifying grammars. Having seen how easily things become complex when we need to define semantic rules on top of the syntactic ones, even for such a finite language, one can imagine that defining a strict grammar, i.e., one including semantic rules, is almost impossible. For instance, a book on English grammar can easily become four inches thick. Besides, a natural language such as English is a moving target: consider the difference between Elizabethan English and Modern English. As one discovers meta-languages in the next section, one should bear in mind that there is sometimes a gap between the language one means and the language one can easily specify. Another lesson is that if one uses a language like English, which is neither precise nor terse, to describe other languages, and therefore as a meta-language, one ends up with a meta-language with the same drawbacks.

The manner in which computer scientists have specified languages has been quite similar and is continuously evolving. Regardless of the variety and diversity of computer languages, semantic rules have rarely been an integral part of a language specification, if at all; it is mostly syntactic rules that dominate the specification. Take, for instance, context-free grammar (CFG), the first meta-language used extensively and preferred by most computer scientists. CFG specifications provide a list of rules with left and right hand sides separated by a right-arrow symbol.
One of the rules is identified as the start rule or start symbol, implying that the overall structure of any sentence in the language is described by that rule. The left-hand side specifies the name of the substructure being defined, and the right-hand side specifies the actual structure (sometimes called a production): a sequence of references to other rules and/or words in the language's vocabulary. Although language theorists love CFG notation, most language reference guides use BNF (Backus-Naur Form) notation, which is really just a more readable version of CFG notation. In BNF, all rule names are surrounded by angle brackets (< and >) and the arrow (→) is replaced with "::=". Also, alternative productions are separated by '|' rather than repeating the name of the rule on the left-hand side. BNF is more verbose, but has the advantage that one can write meaningful rule names and is not constrained vis-à-vis capitalization. Rules in BNF take the form:

<rule> ::= production 1 | production 2 … | production n

Using BNF, one can write the eight rules used previously in this chapter as follows:

<sentence> ::= <subject> <verbPhrase> <object>
<subject> ::= <determiner> <noun>
<verbPhrase> ::= <verb> <adverb>
<noun> ::= John | dog
<verb> ::= gesticulates | ate
<adverb> ::= vigorously | ravenously |
<object> ::= steak |
<determiner> ::= The |

(where a trailing '|' denotes an empty alternative, making the element optional). Even if one uses alternatives such as YACC, the de facto standard for around 20 years, or ANTLR, or many other extended BNF (EBNF) forms, the highest level of semantic specification one might achieve would be by introducing grammatical categories such as DETERMINER, NOUN, VERB, ADVERB, with words such as The, dog, ate, ravenously, respectively, belonging to one of these categories. The intended grammar may take the following form:

sentence : subject verbPhrase (object)?;
subject : (DETERMINER)? NOUN;
verbPhrase : VERB (ADVERB)?;
object : NOUN;

which still leaves plenty of room for the construction of meaningless sentences such as "The dog gesticulates ravenously". It is also worth mentioning that even with alternatives to CFG such as regular expressions, which were meant to simplify things by working with characters only and with no rules referencing other rules on the right-hand side, things did not improve towards embedding semantic rules in a language specification. In fact, things turned out to be more complex with regular expressions, since without recursion (no stack) one cannot specify nested structures. In short, one needs to think about the difference between the sequence of words in a sentence and what really dictates the validity of sentences. If one is about to design state machinery capable of recognizing semantically sensitive sentences, the key idea must be that a sentence is not merely a cleverly combined sequence of words, but rather groups of words and groups of groups of words.
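The grammar above is small enough to be turned into a working recognizer. A minimal recursive-descent sketch in Python (the set and function names are ours, not part of any standard tool): it accepts every syntactically valid sentence, including the meaningless ones, which is precisely the limitation being discussed.

```python
# A minimal recursive-descent recognizer for the toy grammar
# sentence : subject verbPhrase (object)?

NOUNS = {"John", "dog"}
VERBS = {"gesticulates", "ate"}
ADVERBS = {"vigorously", "ravenously"}
OBJECTS = {"steak"}
DETERMINERS = {"The"}

def parse_sentence(words):
    """Return True iff the word list matches: subject verbPhrase (object)?"""
    i = 0
    # subject : (DETERMINER)? NOUN
    if i < len(words) and words[i] in DETERMINERS:
        i += 1
    if i >= len(words) or words[i] not in NOUNS:
        return False
    i += 1
    # verbPhrase : VERB (ADVERB)?
    if i >= len(words) or words[i] not in VERBS:
        return False
    i += 1
    if i < len(words) and words[i] in ADVERBS:
        i += 1
    # optional object (here, only 'steak')
    if i < len(words) and words[i] in OBJECTS:
        i += 1
    return i == len(words)

print(parse_sentence("The dog ate ravenously".split()))       # True
print(parse_sentence("dog gesticulates ravenously".split()))  # True, yet meaningless
print(parse_sentence("steak ate The dog".split()))            # False
```

Note that the recognizer happily accepts "dog gesticulates ravenously": it checks structure and category membership only, with no access to semantics.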
Consider the programming expression (a[i+3)]: humans can immediately recognize that there is something wrong with it, whereas it is notoriously difficult to design state machinery to recognize the faulty expression, since the number of left parentheses and brackets matches the number on the right. In other words, sentences have a structure, like this book. This book is organized into a series of chapters, each containing sections, which in turn contain subsections, and so on. Nested structures abound in computer science too. For example, in an object-oriented class library, classes group all elements beneath them in the hierarchy into categories (any kind of cat might be a subclass of feline, etc.). The first hint of a solution to the underpowered state machinery is now apparent. Just as a class library is not an unstructured category of classes, a sentence is not just a flat list of words. Can one, therefore, argue that the role given to each word plays an

equally large part in one's understanding of a sentence? Certainly it does, but it is not enough. The examples used earlier highlight very clearly the fact that structure imparts meaning. It is not purely the words, though, nor the sequence, that impart meaning. Can it also be argued that if state machines can generate invalid sentences, they must be having trouble with structure? These questions will be left unanswered for the time being. It turns out that even if we manage to define state machinery to cope with structure in a sentence, claiming that the semantic rules are then perfectly defined is far-reaching, since there are more significant issues to consider, as we will see in the following sections.

1.2 Meaning and Context

The difficulty of defining semantic rules to cope with meaningful states, operations or statements is exacerbated by the conclusions drawn from the study of ‘‘Meaning’’ as a key concept for understanding a variety of processes in living systems. It turns out that ‘‘Meaning’’ has an elusive nature and a ‘‘subjective’’ appearance, which is perhaps the reason why it has been ignored by information science. Attempts have been made to circumscribe a theory of meaning in order to determine the meaning of an indeterminate sign. Meaning-making has been considered as the procedure for extracting the information conveyed by a message, where information is considered to be the set of values one might assign to an indeterminate signal. In this context, meaning-making is described in terms of a constraint-satisfaction problem that relies heavily on contextual cues and inferences. The lack of any formalization of the concepts ‘‘meaning’’ and ‘‘context’’ for the working scientist is probably due to the theoretical obscurity of concepts associated with the semiotic axis of information processing and science.
Even with regard to information and information flow, it has been argued that ‘‘the formulation of a precise, qualitative conception of information and a theory of the transmission of information has proved elusive, despite the many other successes of computer science’’ (Barwise and Seligman 1993). Since Barwise's publication, little has changed. Researchers in various fields still find it convenient to conceptualize data in terms of information theory. By doing so, they exclude the more problematic concept of meaning from the analysis. It is clear, however, that the meaning of a message cannot be reduced to its information content. In a certain sense, the failure to reduce meaning to information content is like the failure to measure organization through information content. Moreover, the relevance of information theory is criticized by those who argue that when we study a living system, as opposed to artificial devices, our focus should be on meaning-making rather than information processing per se. In the context of artificial devices, the probabilistic sense of information prevails. Meaning, however, is a key concept for understanding a variety of processes in living systems, from the recognition capacity of the immune system to the neurology of perception. Take, for instance, the use of information theory in biology, as discussed by Emmeche and Hoffmeyer (1991). They argue that unpredictable events are an essential part of life, which makes it impossible to assign distinct probabilities to every event and to conceptualize the behavior of living systems in terms of information theory. Therefore, biological information must embrace the ‘‘semantic openness’’ that is evident, for example, in human communication.

In a nutshell, meaning has been viewed in two main ways: as divorced from information, and as intimately related to information yet not reducible to it. The first view stems from the fact that the concept of information relies heavily on ‘‘information theory’’, such as Shannon's statistical definition of information, whereas the latter stems from the conception that information can broadly be considered as something conveyed by a message in order to provoke a response (Bateson 2000). Hence, the message can be considered as a portion of the world that comes to the attention of a cogitative system, human or non-human. Simply stated, information is a differentiated portion of reality (i.e., a message), a difference which makes a difference, a piece of the world that comes to notice and results in some response (i.e., meaning). In this sense, information is interactive: it exists in between the responding system and the differentiated environment, external or internal. For example, if one leaves one's house to take a walk and notices that the sky is getting cloudy, one is likely to change one's plans in order to avoid the rain. In this case the cloudy sky may be considered the message (i.e., the difference) and one's avoidance the information conveyed by the message (i.e., a difference that makes a difference). In this conception, information and meaning are intimately related, yet they cannot be reduced to each other. In the same spirit, Bateson presents the idea that a differentiated unit, e.g., a word, has meaning only at a higher level of logical organization, e.g., the sentence, only in context, and as a result of interaction between the organism and the environment. In this sense, the internal structure of the message is of no use in understanding its meaning.
The pattern(s) into which the sign is woven, and the interaction in which it is located, are what turn a differentiated portion of the world into a response by the organism. This idea implies that turning a signal (i.e., a difference) into a meaningful event (i.e., a difference that makes a difference) involves an active extraction of information from the message. Based on this, the following ideas have been proposed: (a) meaning-making is a procedure for extracting the information conveyed by a message; (b) information is the value one may assign to an indeterminate signal (i.e., a sign). These ideas are very much in line with conceptions that see meaning-making as an active process that is a condition for information processing rather than the product of information processing per se. The most interesting aspects of the conception of meaning-making as an active process are its three organizing concepts: (a) indeterminacy of the signal, (b) contextualization, and (c) transgradience. The indeterminacy (or variability) of the signal is an important aspect of any meaning-making process. What is the indeterminacy of the signal, and why is it important for a theory of meaning-making? The main idea is that in itself every sign/unit is devoid of meaning until it is contextualized in a higher-order form of organization such as a sentence; it can be assigned a range of values and interpretations. For instance, in natural language the sign ‘‘shoot’’ can be used in one context to express an order to a soldier to fire his gun and in a different context as a synonym for ‘‘speak’’. In

immunology, the meaning of a molecule's being an antigen is not encapsulated in the molecule itself. That is, at the most basic level of analysis, a sign has the potential to mean different things (i.e., to trigger different responses) in different contexts, a property known in linguistics as polysemy, which endows language with enormous flexibility and cognitive economy. In the field of linguistics this is the domain of pragmatics, which deals with meaning in context. The single most obvious way in which the relation between language and context is reflected in the structure of languages themselves is called deixis (from the Greek for pointing or indicating). To this extent, linguistic variables (e.g., this, he, that) are used to indicate something in a particular context; they are indeterminate signals. The indeterminacy of a signal or word can be conceived as a constraint-satisfaction problem. This, in turn, is defined as a triple {V, D, C}, where: (a) V is a set of variables, (b) D is a domain of values, and (c) C is a set of constraints {C1, C2, . . ., Cq}. In the context of semiotics, V is considered to be the set of indeterminate signals and D the finite set of interpretations/values one assigns to them. Based on the above definition, a sign is indeterminate if assigning it a value is a constraint-satisfaction problem. One should note that solving the constraint-satisfaction problem is a meaning-making process, since it involves the extraction of the information conveyed by a message (e.g., to whom does the ‘‘he’’ refer?). Rather than a simple mapping from V to D, however, this process also involves contextualization and inference. The problematic notion of context, in the conception of meaning-making as an active process, can best be introduced as an environmental setting composed of communicating units and their relations in time and space.
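The triple {V, D, C} can be made concrete with a brute-force solver. A toy sketch in Python; the signs, candidate interpretations and the single context constraint are invented purely for illustration:

```python
import itertools

# A toy constraint-satisfaction view of disambiguation: V is a set of
# indeterminate signs, D their candidate interpretations, and C a set of
# constraints supplied by context.
V = ["shoot", "he"]
D = {
    "shoot": ["fire_a_gun", "start_speaking"],
    "he": ["the_soldier", "the_reporter"],
}

def battlefield_context(assignment):
    # Contextual constraint: 'shoot' addressed to a soldier means firing,
    # addressed to a reporter it means 'start speaking'.
    if assignment["he"] == "the_soldier":
        return assignment["shoot"] == "fire_a_gun"
    return assignment["shoot"] == "start_speaking"

C = [battlefield_context]

def solve(V, D, C):
    """Brute-force search: return every assignment satisfying all constraints."""
    solutions = []
    for values in itertools.product(*(D[v] for v in V)):
        assignment = dict(zip(V, values))
        if all(c(assignment) for c in C):
            solutions.append(assignment)
    return solutions

print(solve(V, D, C))
# [{'shoot': 'fire_a_gun', 'he': 'the_soldier'},
#  {'shoot': 'start_speaking', 'he': 'the_reporter'}]
```

Without the contextual constraint, all four assignments would be admissible; the context is what prunes the interpretations down to the consistent ones, which is exactly the role the text assigns to it.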
The general idea of situation theory (Seligman and Moss 1997) is one possible way of looking into these aspects. In situation theory, a situation consists of ‘‘individuals in relations having properties and standing in various spatiotemporal relations’’. More generally, we can define a situation as a pattern: an ordered array of objects that has an identified and repeatable form. In an abstract sense, a contextualization process can be conceived as a functor, or structure-preserving mapping, of the particularities or tokens of a specific occurrence onto the generalities of the pattern. Regarding the interpretation of things as a constraint-satisfaction problem, a context forms the constraints on the possible values (i.e., interpretations) that one may attribute to a sign. According to this logic, a situation type is a structure or pattern of situations. In other words, a situation is defined as a set of objects organized both spatially and temporally in a given relation; if this relation characterizes a set of situations, one can consider it a structure, or a situation type. For example, the structure of hierarchical relations is the same no matter who the boss is: the situation type is one of hierarchical relations. Based on this type of situation, we can make inferences about other situations. For example, violations of a rigid hierarchical relationship by a subordinate are usually met with another situation, one of penalties imposed by the superiors.

Although a sign, like the meaning of a word in a sentence, is indeterminate, in a given context one would like to use it to communicate only one meaning (i.e., to invite only one specific response) and not others. Hence the word disambiguation problem arises. In a sense, inferences via contextualization work as a zoom-in, zoom-out function. Contextualization offers the ability to zoom out and reflexively zoom back in, in a way that constrains the possible values we may assign to the indeterminate signal. In other words, in order to determine the meaning of a micro element and extract the information it conveys, one has to situate it at a level of higher-order organization. Let us consider the following example: ‘‘I hate turkeys’’. The vertical zooming-out from the point representing ‘‘I’’ to the point representing ‘‘human’’ captures the most basic denotation of ‘‘I’’, given that denotation is the initial meaning captured by a sign. As such, it could be considered the most reasonable starting point for contextualization. It is also a reasonable starting point because, both evolutionarily and ontologically, denotation is the most basic semiotic category. According to the Oxford English Dictionary, a turkey can be zoomed out to its closest ontological category, a ‘‘bird’’. This ontological category commonly describes any feathered vertebrate animal. Therefore, if we are looking for a function from the indeterminate sign ‘‘hate’’ to a possible value/interpretation, the first constraint is that it is a relation between a human being and an animal. There is, however, one more contextual cue, dictated by the denotation of dinner as a token of a meal. In that situation, there is a relationship of eating between human beings and food.
Given that the zoomed-out concept of human beings for the sign ‘‘I’’ participates in this relationship as well, another candidate value for the interpretation of ‘‘hate’’ arises, which apparently makes more sense, since it is much closer to the meaning of the sentence ‘‘I hate turkeys’’: the situation in which humans consume turkeys as food is the one giving meaning to this sentence. Contextualization, however, is not sufficient for meaning-making. Transgradience, the third dimension of the meaning-making process, refers to the need for interpretation, inference and integration: a process in which inferences are applied to a signal-in-context in order to achieve a global, integrated view of a situation. In general terms, transgradience refers to the ability of the system to achieve a global view of a situation by a variety of means. An interesting parallel can be found in the immune system deciding whether a molecule is an antigen by means of a complex network of immune agents that communicate, make inferences, and integrate the information they receive. Further sub-dimensions may arise, though, which could potentially complicate things: (1) the spatiotemporal array in which the situation takes place; (2) our background knowledge of the situation; and (3) our beliefs concerning the situation. In brief, our ability to extract the information that a word may convey as a signal is a meaning-making process that relies heavily on contextual cues and inferences. The challenge is then to pick the right situation, which constrains the interpretation values to be allocated to indeterminate signals, i.e., ambiguous words. Although one's understanding of semiotic systems has advanced (Sebeok and Danesi 2000) and computational tools have reached a high

level of sophistication, one still does not have a satisfactory answer to the question of how meaning emerges in a particular context.

1.3 The Symbol Grounding Problem

The whole discussion so far is underpinned by the assumption that the adherence of meaning to symbols and signals is a result of a meaning-making process rather than something intrinsic to the symbols and the chosen symbolic system itself. It is this innate feature of any symbolic system which limits the extent to which any symbol manipulator can interpret symbols as having meaning systematically. This makes the interpretation of any symbols, such as the letters or words in a book, parasitic: they derive their meaning from us. Similarly, no such symbolic system can be used as a cognitive model, and therefore cognition cannot just be the manipulation of symbols. Overcoming this limitation would mean grounding every symbol in a symbolic system with its meaning, rather than leaving interpretation merely to its shape. This was referred to in the 90s as the famous ‘symbol grounding problem’ (Harnad 1990), which raises the following questions:

• “How can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than remain parasitic, depending solely on the meanings in our heads?”
• “How can the meanings of the meaningless symbol tokens, manipulated solely on the basis of their (arbitrary) shapes, be grounded in anything but other meaningless symbols?”

The problem of constructing a symbol manipulator able to understand the extrinsic meaning of symbols has been brought into analogy with another famous problem: trying to learn Chinese from a Chinese/Chinese dictionary. It also sparked off the discussion about symbolists, symbolic Artificial Intelligence and the symbolic theory of mind, which has been challenged by Searle's “Chinese Room Argument”. According to these trends, it had been assumed that if a system of symbols can generate behavior indistinguishable from a person's, this system must have a mind. More specifically, according to the symbolic theory of mind, if a computer could pass the Turing Test in Chinese, i.e., if it could respond to all Chinese symbol strings it receives as input with Chinese symbol strings that are indistinguishable from the replies a real Chinese speaker would make (even if we keep on testing indefinitely), the computer would understand the meaning of Chinese symbols in the same sense that one understands the meaning of English symbols. As Searle's demonstration suggests, this turns out to be impossible, for both humans and computers, since the meaning of the Chinese symbols is not intrinsic and cannot be derived from the shapes of the chosen symbols. In other words, imagine that you try to learn Chinese with a Chinese/Chinese dictionary only. The trip through the dictionary would amount to a merry-go-round, passing endlessly from one meaningless symbol or symbol-string (the definientes) to another (the definienda), never stopping to explicate what anything meant.
The standard reply and approach of the symbolists and the symbolic theory of mind, which prevails in the views of semantic aspects in natural language processing within this book as well, is that the meaning of the symbols comes from connecting the symbol system to the

world “in the right way.” Though this view trivializes the symbol grounding problem and the meaning-making process in a symbolic system, it also highlights the fact that if each definiens in a Chinese/Chinese dictionary were somehow connected to the world in the right way, we would hardly need the definienda; the difficulty of picking out the objects, events and states of affairs in the world that the symbols refer to would be alleviated. With respect to these views, hybrid non-symbolic/symbolic systems have been proposed, in which the elementary symbols are grounded in some kind of non-symbolic representations that pick out, from their proximal sensory projections, the distal object categories to which the elementary symbols refer. These groundings are driven by insights into how humans can (1) discriminate, (2) manipulate, (3) identify and (4) describe the objects, events and states of affairs in the world they live in, and how they can also (5) “produce descriptions” and (6) “respond to descriptions” of those objects, events and states of affairs. The attempted groundings are also based on discrimination and identification, two complementary human skills. To be able to discriminate, one has to judge whether two inputs are the same or different and, if different, to what degree. Discrimination is a relative judgment, based on our capacity to tell things apart and discern the degree of similarity. Identification is based on our capacity to tell whether a certain input is a member of a category or not; it is also connected with the capacity to assign a unique response, e.g., a name, to a class of inputs. Therefore, the attempted groundings must rely on the answer to the question of what kind of internal representation would be needed in order to be able to discriminate and identify. In this context, iconic representations have been proposed (Harnad 1990).
For instance, in order to be able to discriminate and identify horses, we need horse icons. Discrimination is also independent of identification, in that one might be able to discriminate things without knowing what they are. According to the same theorists, icons alone are not sufficient to identify and categorize things in an underdetermined world full of potentially confusable categories. In order to identify, one must selectively reduce the icons to those “invariant features” of the sensory projection that will reliably distinguish a member of a category from any non-members with which it could be confused. The output of this reduction is called a “categorical representation”. In some cases these representations may be innate, but since evolution could hardly anticipate all the categories one may ever need or choose to identify, most of these features must be learned from experience. In this sense, the categorical representation of a horse is probably a learned one. It must be noted, however, that both kinds of representation are still sensory and non-symbolic. The former are analogous copies of the sensory projection, preserving its “shape” faithfully. The latter are icons that have been filtered selectively to preserve only some features of the shape of the sensory projection, namely those which distinguish members of a category from

non-members reliably. This sort of non-symbolic representation differs from the symbolic theory of mind and from currently known symbol manipulators, such as conventional computers trying to cope with natural language processing and its semantic aspects. Despite the interesting views emerging from the solution approaches to the symbol grounding problem, the symbol grounding scheme as introduced above has one prominent gap: no mechanism has been suggested to explain how the all-important categorical representations can be formed. How does one find the invariant features of the sensory projection that make it possible to categorize and identify objects correctly? To this extent, connectionism, with its general pattern learning capability, seems to be one natural candidate to complement identification. In effect, the “connection” between the names and the objects that give rise to their sensory projections and icons would be provided by connectionist networks. Icons, paired with feedback indicating their names, could be processed by a connectionist network that learns to identify icons correctly from the sample of confusable alternatives it has encountered. This can be done by adjusting the weights of the features and feature combinations that are reliably associated with the names in a way that resolves the confusion. Nevertheless, the choice of names to categorize things is not free from extrinsic interpretation, since some symbols must still be selected to describe the categories.

Mathematics for AI

Relations

The concept of a relation is fundamental for understanding a broad range of mathematical phenomena. In natural language, a relation is understood as a correspondence, a connection. We say that two objects are related if there is a common property linking them.

Definition 2.0.1. Consider the sets A and B. We call a (binary) relation between the elements of A and B any subset R ⊆ A × B. An element a ∈ A is in relation R with an element b ∈ B if and only if (a, b) ∈ R. An element (a, b) ∈ R will also be denoted by aRb.

Definition 2.0.2. If A₁, . . . , Aₙ, n ≥ 2, are sets, we call an n-ary relation any subset R ⊆ A₁ × ⋯ × Aₙ. If n = 2, the relation R is called binary; if n = 3, it is called ternary. If A₁ = A₂ = . . . = Aₙ = A, the relation R is called homogeneous. In the following, the presentation will be restricted to binary relations.

Remark 1 The direct product A × B is defined as the set of all ordered pairs of elements of A and B, respectively: A × B := {(a, b) | a ∈ A, b ∈ B}.

Remark 2 If A and B are finite sets, we can represent relations as cross tables. The rows correspond to the elements of A, while the columns correspond to the elements of B. We represent the elements of R, i.e., the pairs (a, b) ∈ R, by a cross (X) in this table. Hence, the relation R is represented by a series of entries (crosses) in this table. If there is no entry at the intersection of row a with column b, it means that a and b are not related by R.
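As an illustration of the cross-table representation in Remark 2, a finite relation can be stored as a set of ordered pairs. The following sketch is ours, not the book's; the helper name `cross_table` and the sample relation are illustrative only.

```python
# Illustrative sketch: a binary relation R ⊆ A × B stored as a Python set
# of ordered pairs, rendered as a cross table (rows: A, columns: B).

def cross_table(A, B, R):
    """Render the relation R as a cross table: 'X' marks (a, b) ∈ R."""
    header = "   " + " ".join(str(b) for b in B)
    rows = []
    for a in A:
        cells = " ".join("X" if (a, b) in R else "." for b in B)
        rows.append(f"{a}: {cells}")
    return "\n".join([header] + rows)

A = [1, 2]
B = ["a", "b"]
R = {(1, "a"), (2, "a"), (2, "b")}   # an arbitrary example relation

print(cross_table(A, B, R))
print((1, "a") in R)   # True: 1 is related to a
print((1, "b") in R)   # False: no cross in row 1, column b
```

Membership in the relation is exactly membership of the pair in the set, which is what the cross table visualizes.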

Example 2.0.1. Relations represented as cross tables.

(1) Let A = {a}. There are only two relations on A: the empty relation R = ∅ (an empty table) and the total relation R = A × A (a single cross in the cell (a, a)).

(2) A := {1, 2}, B := {a, b}. Then all relations R ⊆ A × B are described by the 2⁴ = 16 possible cross tables with rows 1, 2 and columns a, b, one for each subset of A × B: the empty table, the four tables with a single cross, the six tables with two crosses, the four tables with three crosses, and the full table with crosses in all four cells.
Example 2.0.2. In applications, A, B, and R ⊆ A × B are no longer abstract sets; they have a precise semantics, and the relation R represents certain correspondences between the elements of A and B. The following example describes some arithmetic properties of the first ten natural numbers:

The cross table has rows 1 to 10 and columns even, odd, div. by 3, div. by 5, div. by 7, prime, x² + y², and x² − y². Its recoverable entries are: even: 2, 4, 6, 8, 10; odd: 1, 3, 5, 7, 9; divisible by 3: 3, 6, 9; divisible by 5: 5, 10; divisible by 7: 7; prime: 2, 3, 5, 7. [The crosses of the last two columns, marking numbers expressible as x² + y² or x² − y², depend on the book's convention for x and y and are not recoverable from the source.]
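The properties behind Example 2.0.2 can be recomputed programmatically. The sketch below is ours; in particular, letting x and y range over the non-negative integers in the last two columns is our assumption, since the book's convention is not stated here.

```python
# Recomputing the arithmetic properties of Example 2.0.2 for n = 1..10.
# Assumption (ours): x, y range over non-negative integers.
from math import isqrt

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

def sum_of_two_squares(n):
    return any(x * x + y * y == n
               for x in range(isqrt(n) + 1) for y in range(isqrt(n) + 1))

def diff_of_two_squares(n):
    return any(x * x - y * y == n
               for x in range(n + 1) for y in range(x + 1))

for n in range(1, 11):
    props = {"even": n % 2 == 0, "odd": n % 2 == 1,
             "div.by 3": n % 3 == 0, "div.by 5": n % 5 == 0,
             "div.by 7": n % 7 == 0, "prime": is_prime(n),
             "x^2+y^2": sum_of_two_squares(n),
             "x^2-y^2": diff_of_two_squares(n)}
    print(n, [name for name, holds in props.items() if holds])
```

Each printed row corresponds to one row of crosses in the table.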

Example 2.0.3. Other relations.
(1) The divisibility relation in Z: R := {(m, n) ∈ Z² | ∃k ∈ Z. n = km}.
(2) R := {(x, y) ∈ R² | x² + y² = 1}. This relation consists of all points located on the circle centered at the origin with radius 1.
(3) The equality relation on a set A: Δ_A := {(x, x) | x ∈ A}.
(4) The equality relation in R consists of all points located on the first bisecting line.
(5) The universal relation on a set A expresses the fact that all elements of that set are related to each other: ∇_A := {(x, y) | x, y ∈ A}.
(6) The empty relation R = ∅ ⊆ A × B means that none of the elements of A and B are related.
(7) Let A = B = Z and R the divisibility relation on Z: R := {(x, y) ∈ Z × Z | ∃k ∈ Z. y = kx}.
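Infinite relations such as those in Example 2.0.3 cannot be enumerated as pair sets, but they can be represented by membership predicates. This is a sketch of ours; the function names and the tolerance are illustrative.

```python
# Infinite relations represented as membership predicates (sketch).

def divides(pair):
    """Divisibility relation on Z: (m, n) ∈ R iff ∃k ∈ Z. n = km."""
    m, n = pair
    if m == 0:
        return n == 0          # 0 = k·0, but 0 divides nothing else
    return n % m == 0

def on_unit_circle(pair):
    """Circle relation on R²: x² + y² = 1, up to floating-point tolerance."""
    x, y = pair
    return abs(x * x + y * y - 1.0) < 1e-9

print(divides((3, 12)))          # True: 3 | 12
print(divides((5, 12)))          # False
print(on_unit_circle((0.6, 0.8)))   # True: 0.36 + 0.64 = 1
```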

2.1 Operations with Relations

Let (A₁, B₁, R₁) and (A₂, B₂, R₂) be two binary relations. They are equal if and only if A₁ = A₂, B₁ = B₂, R₁ = R₂.

Definition 2.1.1. Let A and B be two sets, R and S relations on A × B. Then R is included in S if R ⊆ S.

Definition 2.1.2. Let R, S ⊆ A × B be two relations on A × B. The intersection of R and S is the relation R ∩ S on A × B.

Definition 2.1.3. Let R, S ⊆ A × B be two relations on A × B. The union of R and S is the relation R ∪ S on A × B.

Definition 2.1.4. Let R ⊆ A × B be a relation on A × B. The complement of R is the relation CR on A × B, where CR := {(a, b) ∈ A × B | (a, b) ∉ R}.

Remark 3 If R and S are relations on A × B then
(1) a(R ∩ S)b ⇔ aRb and aSb.
(2) a(R ∪ S)b ⇔ aRb or aSb.
(3) a(CR)b ⇔ (a, b) ∈ A × B and (a, b) ∉ R.

Definition 2.1.5. Let R ⊆ A × B and S ⊆ C × D be two relations. The product or composition of R and S is the relation S ∘ R ⊆ A × D defined by S ∘ R := {(a, d) ∈ A × D | ∃b ∈ B ∩ C. (a, b) ∈ R and (b, d) ∈ S}. If B ∩ C = ∅ then S ∘ R = ∅.

Definition 2.1.6. Let R ⊆ A × B be a relation. The inverse of R is the relation R⁻¹ ⊆ B × A defined by R⁻¹ := {(b, a) ∈ B × A | (a, b) ∈ R}.

Theorem 2.1.1. Let R ⊆ A × B, S ⊆ C × D, and T ⊆ E × F be relations. Then the composition of relations is associative: (T ∘ S) ∘ R = T ∘ (S ∘ R).

Proof. Let (a, f) ∈ (T ∘ S) ∘ R. By the definition of the relational product, there exists b ∈ B ∩ C with (a, b) ∈ R and (b, f) ∈ T ∘ S. By the same definition, there exists d ∈ D ∩ E with (b, d) ∈ S and (d, f) ∈ T. Now, (a, b) ∈ R and (b, d)

∈ S imply (a, d) ∈ S ∘ R. Together with (d, f) ∈ T, this implies that (a, f) ∈ T ∘ (S ∘ R). Hence, (T ∘ S) ∘ R ⊆ T ∘ (S ∘ R). The converse inclusion is proved analogously.

Theorem 2.1.2. Let R₁ ⊆ A × B, R₂ ⊆ A × B, S₁ ⊆ C × D, S₂ ⊆ C × D be relations. The following hold true:
(1) R₁ ∘ (S₁ ∪ S₂) = (R₁ ∘ S₁) ∪ (R₁ ∘ S₂).
(2) (R₁ ∪ R₂) ∘ S₁ = (R₁ ∘ S₁) ∪ (R₂ ∘ S₁).
(3) R₁ ∘ (S₁ ∩ S₂) ⊆ (R₁ ∘ S₁) ∩ (R₁ ∘ S₂).
(4) (R₁ ∩ R₂) ∘ S₁ ⊆ (R₁ ∘ S₁) ∩ (R₂ ∘ S₁).
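Composition, inverse, and the associativity of Theorem 2.1.1 can be checked concretely on small finite relations. The sketch below is ours, with sample relations of our own choosing; the identity tested last is the inverse law (R ∘ S)⁻¹ = S⁻¹ ∘ R⁻¹, which appears as Theorem 2.1.3(3) below.

```python
# Sketch: composition (Definition 2.1.5) and inverse (Definition 2.1.6)
# of relations stored as sets of pairs, with two laws checked by example.

def compose(S, R):
    """S ∘ R = {(a, d) | ∃b. (a, b) ∈ R and (b, d) ∈ S}."""
    return {(a, d) for (a, b) in R for (b2, d) in S if b == b2}

def inverse(R):
    return {(b, a) for (a, b) in R}

R = {(1, "x"), (2, "y")}
S = {("x", "p"), ("y", "q"), ("y", "r")}
T = {("p", 10), ("q", 20), ("r", 30)}

print(compose(S, R))   # the three pairs (1,'p'), (2,'q'), (2,'r')
# Theorem 2.1.1 (associativity): (T ∘ S) ∘ R = T ∘ (S ∘ R)
print(compose(compose(T, S), R) == compose(T, compose(S, R)))   # True
# Inverse law: (S ∘ R)⁻¹ = R⁻¹ ∘ S⁻¹
print(inverse(compose(S, R)) == compose(inverse(R), inverse(S)))  # True
```

A single set comprehension suffices because a pair belongs to the composite exactly when some middle element links it through both relations.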

Proof. (1) Let (c, b) ∈ R₁ ∘ (S₁ ∪ S₂). Then there exists an element d ∈ D ∩ A with (c, d) ∈ S₁ ∪ S₂ and (d, b) ∈ R₁, i.e., (c, d) ∈ S₁ or (c, d) ∈ S₂, and (d, b) ∈ R₁. This means that (c, b) ∈ R₁ ∘ S₁ or (c, b) ∈ R₁ ∘ S₂, i.e., (c, b) ∈ (R₁ ∘ S₁) ∪ (R₁ ∘ S₂). We have proved that R₁ ∘ (S₁ ∪ S₂) ⊆ (R₁ ∘ S₁) ∪ (R₁ ∘ S₂). The converse inclusion follows analogously.
(2) By a similar argument, one obtains (R₁ ∪ R₂) ∘ S₁ = (R₁ ∘ S₁) ∪ (R₂ ∘ S₁).
(3) Let (c, b) ∈ R₁ ∘ (S₁ ∩ S₂). Then there exists an element d ∈ D ∩ A with (c, d) ∈ S₁ ∩ S₂ and (d, b) ∈ R₁. Hence (c, d) ∈ S₁, (c, d) ∈ S₂ and (d, b) ∈ R₁. By the definition of the relational product, (c, b) ∈ R₁ ∘ S₁ and (c, b) ∈ R₁ ∘ S₂, hence (c, b) ∈ (R₁ ∘ S₁) ∩ (R₁ ∘ S₂).

(4) Left to the reader.

Theorem 2.1.3. Let R₁, R₂ ⊆ A × B and S ⊆ C × D be binary relations. Then

(1) (R₁ ∪ R₂)⁻¹ = R₁⁻¹ ∪ R₂⁻¹.
(2) (R₁ ∩ R₂)⁻¹ = R₁⁻¹ ∩ R₂⁻¹.
(3) (R₁ ∘ S)⁻¹ = S⁻¹ ∘ R₁⁻¹.
(4) (CR₁)⁻¹ = C(R₁⁻¹).
(5) ((R₁)⁻¹)⁻¹ = R₁.
The proof is left as an exercise to the reader.

Corollary 2.1.4.
(1) If S₁ ⊆ S₂, then R ∘ S₁ ⊆ R ∘ S₂.
(2) If R₁ ⊆ R₂, then R₁ ∘ S ⊆ R₂ ∘ S.
(3) R₁ ⊆ R₂ ⇔ R₁⁻¹ ⊆ R₂⁻¹.

Definition 2.1.7. Let R ⊆ A × B be a binary relation, X ⊆ A a subset of A and a ∈ A an arbitrary element. The cut of R after X is defined by R(X) := {b ∈ B | ∃x ∈ X. (x, b) ∈ R}. The set R(A) is called the image of R and R⁻¹(B) is called the preimage of R. The cut after a is denoted by R(a).

Theorem 2.1.5. Let R ⊆ A × B be a binary relation and X₁, X₂ ⊆ A subsets. Then
(1) R(X₁ ∪ X₂) = R(X₁) ∪ R(X₂).
(2) R(X₁ ∩ X₂) ⊆ R(X₁) ∩ R(X₂).
(3) If X₁ ⊆ X₂ then R(X₁) ⊆ R(X₂).

Proof. (1) Suppose y ∈ R(X₁ ∪ X₂). Then there is an x ∈ X₁ ∪ X₂ with (x, y) ∈ R, i.e., x ∈ X₁, (x, y) ∈ R or x ∈ X₂, (x, y) ∈ R, hence y ∈ R(X₁) or y ∈ R(X₂), i.e., y ∈ R(X₁) ∪ R(X₂). The converse inclusion is proved by a similar argument.

(2) If y ∈ R(X₁ ∩ X₂), then there is an x ∈ X₁ ∩ X₂ with (x, y) ∈ R, i.e., x ∈ X₁, (x, y) ∈ R and x ∈ X₂, (x, y) ∈ R, hence y ∈ R(X₁) and y ∈ R(X₂), which implies y ∈ R(X₁) ∩ R(X₂). The converse inclusion does not hold in general.
(3) Follows directly from (1), since X₁ ⊆ X₂ implies R(X₂) = R(X₁ ∪ X₂) = R(X₁) ∪ R(X₂).

Theorem 2.1.6. Let R₁, R₂ ⊆ A × B and S ⊆ B × C be relations and X ⊆ A. Then
(1) (R₁ ∪ R₂)(X) = R₁(X) ∪ R₂(X).
(2) (R₁ ∩ R₂)(X) ⊆ R₁(X) ∩ R₂(X).
(3) (S ∘ R₁)(X) = S(R₁(X)).
(4) If R₁ ⊆ R₂ then R₁(X) ⊆ R₂(X).

Proof.

(1) Suppose y ∈ (R₁ ∪ R₂)(X). Then there is an x ∈ X with (x, y) ∈ R₁ ∪ R₂, i.e., (x, y) ∈ R₁ or (x, y) ∈ R₂, hence y ∈ R₁(X) or y ∈ R₂(X), which implies y ∈ R₁(X) ∪ R₂(X). The converse inclusion follows similarly.
(2) Let y ∈ (R₁ ∩ R₂)(X). Then there is an x ∈ X with (x, y) ∈ R₁ ∩ R₂, i.e., (x, y) ∈ R₁ and (x, y) ∈ R₂, hence y ∈ R₁(X) and y ∈ R₂(X), which implies y ∈ R₁(X) ∩ R₂(X). The converse inclusion is not generally true.
(3) Let z ∈ (S ∘ R₁)(X). Then there is an x ∈ X with (x, z) ∈ S ∘ R₁, hence there is also a y ∈ B with (x, y) ∈ R₁ and (y, z) ∈ S, i.e., y ∈ R₁(X) and (y, z) ∈ S, wherefrom follows z ∈ S(R₁(X)). The converse inclusion follows similarly.
(4) Follows directly from (R₁ ∪ R₂)(X) = R₁(X) ∪ R₂(X).

2.2 Homogeneous Relations

Definition 2.2.1. A relation R ⊆ A × A on a set A is called
(1) reflexive if for every x ∈ A, we have (x, x) ∈ R;
(2) transitive if for every x, y, z ∈ A, from (x, y) ∈ R and (y, z) ∈ R follows (x, z) ∈ R;
(3) symmetric if for every x, y ∈ A, from (x, y) ∈ R follows (y, x) ∈ R;
(4) antisymmetric if for every x, y ∈ A, from (x, y) ∈ R and (y, x) ∈ R follows x = y.

Remark 4 As we have seen before, if A is finite, a relation R ⊆ A × A can be represented as a cross table. For homogeneous relations, this cross table is a square.
(1) A finite relation is reflexive if all elements on the main diagonal are marked.

Example 2.2.1. The equality relation on a set A is always reflexive. For A := {1, 2, 3, 4, 5}, the equality on A is given by

   1 2 3 4 5
1  X
2    X
3      X
4        X
5          X

Another example of a reflexive relation on the five-element set A := {1, 2, 3, 4, 5} is any square cross table whose main diagonal is completely marked, together with further crosses off the diagonal. [The particular table shown in the original is not recoverable from the source.]

(2) R is symmetric if and only if the distribution of crosses in the table is symmetric with respect to transposition. [The example table shown in the original is not recoverable from the source.]
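The four properties of Definition 2.2.1 can be tested mechanically for finite relations. The following sketch is ours; the function names are illustrative.

```python
# Sketch: testing reflexivity, symmetry, antisymmetry and transitivity
# of a homogeneous relation R ⊆ A × A stored as a set of pairs.

def is_reflexive(A, R):
    return all((x, x) in R for x in A)

def is_symmetric(R):
    return all((y, x) in R for (x, y) in R)

def is_antisymmetric(R):
    return all(x == y for (x, y) in R if (y, x) in R)

def is_transitive(R):
    return all((x, w) in R for (x, y) in R for (z, w) in R if y == z)

A = {1, 2, 3, 4, 5}
equality = {(x, x) for x in A}   # the equality relation on A
print(is_reflexive(A, equality), is_symmetric(equality),
      is_antisymmetric(equality), is_transitive(equality))
# prints: True True True True
```

Equality satisfies all four properties at once, matching Example 2.2.1.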

Classification is one of the major quests in mathematics. For this, we need to group elements according to analogies, similarities or rules. The following definition introduces the concept of equivalence between elements of a set.

Definition 2.2.2. A relation (A, A, R) is called an equivalence relation on the set A if R is reflexive, transitive and symmetric.

Example 2.2.2 (1) Let n ∈ N. On the set Z of integers, we define the relation aRb ⇔ ∃k ∈ Z. b − a = kn ⇔ b = a + kn. This relation is an equivalence relation.

Proof. (a) (Reflexivity:) Let a ∈ Z. Then a − a = 0 and we choose k = 0.
(b) (Transitivity:) Let a, b, c ∈ Z and k₁, k₂ ∈ Z with b = a + k₁n, c = b + k₂n. Then c = (a + k₁n) + k₂n = a + (k₁ + k₂)n, hence aRc.
(c) (Symmetry:) Let a, b ∈ Z with aRb, i.e., b = a + kn for some k ∈ Z. Then a = b + (−k)n, hence bRa.

This relation is denoted by ≡ₙ and called the congruence modulo n relation. If a ≡ b (mod n), we say that a is congruent with b modulo n.

(2) The equality relation on A is an equivalence relation. It is the only reflexive relation on A that is simultaneously symmetric and antisymmetric.

Definition 2.2.3. Let M be a set. A partition of M is a collection 𝒫 of subsets of M satisfying
(1) ∀P ∈ 𝒫. P ≠ ∅.

(2) ∀P, Q ∈ 𝒫. P ∩ Q = ∅ or P = Q.
(3) M = ⋃_{P∈𝒫} P.

Remark 5 A partition of M is a cover of M with disjoint, non-empty sets; every element of M lies in exactly one of the covering subsets.

Example 2.2.3. (1) In the set of integers, 2Z denotes the set of even numbers and 2Z + 1 the set of odd numbers. Then {2Z, 2Z + 1} is a partition of Z.
(2) If f : M → N is a mapping, the set {f⁻¹(n) | n ∈ Im(f)} is a partition of M.
(3) The set of all quadratic surfaces in R³ can be partitioned into ellipsoids, one-sheeted hyperboloids and two-sheeted hyperboloids (if the degenerate ones are ignored).

In the following, Part(A) denotes the set of all partitions of A, and EqRel(A) the set of all equivalence relations on A.

Definition 2.2.4. Let R be an equivalence relation on A. For every a ∈ A, define the equivalence class of a by [a]_R := {b ∈ A | aRb}.
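The classes of Definition 2.2.4 can be computed directly for a finite sample. The sketch below is ours, using congruence modulo n restricted to a finite set of integers.

```python
# Sketch: grouping a finite set A into equivalence classes
# [a] = {b ∈ A | a R b} for a given equivalence relation.

def equivalence_classes(A, related):
    """Group the elements of A into classes of the equivalence `related`."""
    classes = []
    for a in A:
        for cls in classes:
            # a joins an existing class iff it is related to its members
            if related(a, next(iter(cls))):
                cls.add(a)
                break
        else:
            classes.append({a})
    return classes

n = 3
A = range(-6, 7)
congruent_mod_n = lambda a, b: (b - a) % n == 0

for cls in equivalence_classes(A, congruent_mod_n):
    print(sorted(cls))   # three classes: the residues 0, 1, 2 modulo 3
```

The classes are pairwise disjoint and cover A, exactly as Proposition 2.2.1 below asserts in general.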

Proposition 2.2.1. Let R be an equivalence relation on A. Then the equivalence classes [a]_R, a ∈ A, form a partition of A.

Proof. (1) The relation R being reflexive, for every a ∈ A we have aRa, hence a ∈ [a]_R. This proves that every equivalence class [a]_R is non-empty.
(2) Suppose two classes [a]_R and [b]_R have a common element x ∈ [a]_R ∩ [b]_R. Then aRx and bRx. Using the symmetry and transitivity of R, we get aRb, hence [a]_R = [b]_R.
(3) ∀a ∈ A. a ∈ [a]_R. It follows that A = ⋃_{a∈A} [a]_R.

Example 2.2.4. In the set Z consider the congruence relation modulo n. The equivalence classes are:
[0] = {kn | k ∈ Z} = nZ
[1] = {1 + kn | k ∈ Z} = 1 + nZ
[2] = {2 + kn | k ∈ Z} = 2 + nZ
. . .
[n − 1] = {n − 1 + kn | k ∈ Z} = (n − 1) + nZ.

We obtain exactly n equivalence classes, since [n] = [0], [n + 1] = [1], [n + 2] = [2], . . .

Definition 2.2.5. Let M be a set and R an equivalence relation on M. The quotient set of M with respect to R is the set of all equivalence classes of R: M/R := {[a]_R | a ∈ M}.

Example 2.2.5. Consider on the set of all integers, Z, the congruence relation modulo n, ≡ₙ. Then Zₙ := Z/≡ₙ = {[0], [1], . . . , [n − 1]}.

Definition 2.2.6. Let n ∈ N. On Zₙ define the operations +ₙ, addition modulo n, and ·ₙ, multiplication modulo n, as follows:
[a] +ₙ [b] := [r], where r is the remainder of the division of a + b by n.
[a] ·ₙ [b] := [s], where s is the remainder of the division of a · b by n.

Example 2.2.6.

(1) n = 2:

+2 | 0 1        ·2 | 0 1
 0 | 0 1         0 | 0 0
 1 | 1 0         1 | 0 1

(2) n = 3:

+3 | 0 1 2      ·3 | 0 1 2
 0 | 0 1 2       0 | 0 0 0
 1 | 1 2 0       1 | 0 1 2
 2 | 2 0 1       2 | 0 2 1

(3) n = 4:

+4 | 0 1 2 3    ·4 | 0 1 2 3
 0 | 0 1 2 3     0 | 0 0 0 0
 1 | 1 2 3 0     1 | 0 1 2 3
 2 | 2 3 0 1     2 | 0 2 0 2
 3 | 3 0 1 2     3 | 0 3 2 1

(4) n = 5:

+5 | 0 1 2 3 4    ·5 | 0 1 2 3 4
 0 | 0 1 2 3 4     0 | 0 0 0 0 0
 1 | 1 2 3 4 0     1 | 0 1 2 3 4
 2 | 2 3 4 0 1     2 | 0 2 4 1 3
 3 | 3 4 0 1 2     3 | 0 3 1 4 2
 4 | 4 0 1 2 3     4 | 0 4 3 2 1

(5) n = 6:

+6 | 0 1 2 3 4 5    ·6 | 0 1 2 3 4 5
 0 | 0 1 2 3 4 5     0 | 0 0 0 0 0 0
 1 | 1 2 3 4 5 0     1 | 0 1 2 3 4 5
 2 | 2 3 4 5 0 1     2 | 0 2 4 0 2 4
 3 | 3 4 5 0 1 2     3 | 0 3 0 3 0 3
 4 | 4 5 0 1 2 3     4 | 0 4 2 0 4 2
 5 | 5 0 1 2 3 4     5 | 0 5 4 3 2 1

If n is prime, every element occurs exactly once in every row and column of the addition and multiplication mod n tables (except 0 for multiplication). This is no longer true if n is not prime. If n is composite but a is relatively prime to n (i.e., they do not share any common divisor), then again every element occurs exactly once in the row and column of a. In the following, we are going to prove that there exists a bijection between Part(A) and EqRel(A), i.e., partitions and equivalence relations describe the same phenomenon.
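The observation about rows of the multiplication table can be verified programmatically. The sketch below is ours: a row of a mod n is a permutation of the residues exactly when gcd(a, n) = 1.

```python
# Sketch: the row of a in the multiplication table mod n contains every
# residue exactly once iff a is relatively prime to n.
from math import gcd

def row_is_permutation(a, n):
    return sorted((a * b) % n for b in range(n)) == list(range(n))

for n in (5, 6):
    for a in range(1, n):
        assert row_is_permutation(a, n) == (gcd(a, n) == 1)
        print(n, a, gcd(a, n) == 1, row_is_permutation(a, n))
```

For n = 5 (prime) every row passes; for n = 6 only the rows of 1 and 5 do, matching the tables of Example 2.2.6.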

Theorem 2.2.2. Let A be a set.
(1) Let 𝒫 be a partition of A. Then there is an equivalence relation E(𝒫), called the equivalence relation induced by 𝒫.
(2) Let R be an equivalence relation on A. The equivalence class [x]_R of an element x ∈ A induces a map [·]_R : A → P(A). The quotient set A/R := {[x]_R | x ∈ A} ⊆ P(A) is a partition of A.
(3) If 𝒫 is a partition of A, then 𝒫 = A/E(𝒫).
(4) If R is an equivalence relation on A, then R = E(A/R).

Proof. (1) Let 𝒫 be a partition of A. Then there exists a surjective mapping R_𝒫 : A → 𝒫 with x ∈ R_𝒫(x) for every x ∈ A. Indeed, if 𝒫 is a partition of A, then for every x ∈ A there is a unique class A_x with x ∈ A_x. Define R_𝒫 : A → 𝒫 by R_𝒫(x) := A_x. Because of the definition of a partition, R_𝒫 is well defined and surjective, since the partition 𝒫 covers A with disjoint, non-empty classes. The relation ker(R_𝒫) ⊆ A × A defined by (x, y) ∈ ker(R_𝒫) :⇔ R_𝒫(x) = R_𝒫(y) is an equivalence relation on A, denoted by E(𝒫), and called the equivalence relation induced by 𝒫. To prove this, we prove a more general fact, namely that for every map f : A → B, its kernel, given by the relation ker(f) ⊆ A × A defined by

x ker(f) y :⇔ f(x) = f(y), is an equivalence relation on A.
Reflexivity: For all x ∈ A, f(x) = f(x), hence x ker(f) x.
Transitivity: For all x, y, z ∈ A, from x ker(f) y and y ker(f) z follows f(x) = f(y) = f(z), so f(x) = f(z), i.e., x ker(f) z.
Symmetry: For all x, y ∈ A, if x ker(f) y then f(x) = f(y), i.e., f(y) = f(x), and so y ker(f) x.
(2) Let R be an equivalence relation on A. It has been proved in Proposition 2.2.1 that A/R is a partition of A.
(3) Let now 𝒫 be a partition of A and P ∈ 𝒫. We prove that for x ∈ P, P = [x]_E(𝒫). Let y ∈ P. Then, because of the first part of the theorem, we have (x, y) ∈ E(𝒫) and so y ∈ [x]_E(𝒫). Conversely, if y ∈ [x]_E(𝒫) then (x, y) ∈ E(𝒫), hence y ∈ R_𝒫(y) = R_𝒫(x) = P. This proves that P = [x]_E(𝒫), hence P ∈ A/E(𝒫) and so 𝒫 ⊆ A/E(𝒫). For the inverse inclusion, take [x]_E(𝒫). Then, for every y ∈ [x]_E(𝒫) there is a unique P ∈ 𝒫 with x, y ∈ P. Hence A/E(𝒫) ⊆ 𝒫, which proves the equality.
(4) Let R be an equivalence relation on A. We prove that R = E(A/R). First we prove the inclusion R ⊆ E(A/R). Let (x, y) ∈ R be arbitrarily chosen. Then [x]_R = [y]_R, hence (x, y) ∈ E(A/R). This proves the above inclusion. For the converse inclusion, E(A/R) ⊆ R, take an arbitrary pair of elements (x, y) ∈ E(A/R). Then there exists an element z ∈ A with x, y ∈ [z]_R, since A/R is a partition of A. We deduce that (x, z) ∈ R and (y, z) ∈ R. Using the symmetry of R, we have (x, z) ∈ R and (z, y) ∈ R. By transitivity, (x, y) ∈ R, i.e., E(A/R) ⊆ R.

2.3 Order Relations

Definition 2.3.1. A reflexive, transitive and antisymmetric relation R ⊆ A × A is called an order relation on the set A. The tuple (A, R) is called an ordered set.

Remark 6 An order relation on a set A is often denoted by ≤. If x, y ∈ A and x ≤ y or y ≤ x, then x and y are comparable. Otherwise, they are called incomparable.

Definition 2.3.2. The ordered set (A, ≤) is called totally ordered, or a chain, if for every x, y ∈ A we have x ≤ y or y ≤ x.

Example 2.3.1 (1) The relation ≤ on R is an order, hence (R, ≤) is an ordered set. Moreover, it is totally ordered.
(2) The inclusion relation on the power set P(M) is an order. It is not a chain, since we can find subsets A, B ∈ P(M) which are not comparable: neither A ⊆ B nor B ⊆ A.
(3) The divisibility relation on N is an order.
(4) The equality relation on a set A is an order, the smallest order on A. Moreover, it is the only relation on A which is simultaneously an equivalence and an order relation.

Definition 2.3.3. Let (A, ≤) be an ordered set and a ∈ A. Then
(1) a is a minimal element of A if ∀x ∈ A. x ≤ a ⇒ x = a.
(2) a is a maximal element if ∀x ∈ A. a ≤ x ⇒ x = a.
(3) a is the smallest element of A if ∀x ∈ A. a ≤ x.
(4) a is the greatest element of A if ∀x ∈ A. x ≤ a.
(5) a is covered by b ∈ A if a < b and there is no c ∈ A such that a < c < b. In this case, we say that a is a lower neighbor of b and b is an upper neighbor of a.
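Definition 2.3.3 can be explored concretely in the divisibility order. The sketch below is ours; it shows that minimal and maximal elements need not be unique, anticipating Remark 7.

```python
# Sketch: minimal and maximal elements of a finite subset of N
# under the divisibility order x ≤ y :⇔ x | y.

def leq(x, y):
    return y % x == 0          # divisibility order on positive integers

def minimal_elements(A):
    return {a for a in A if all(not (leq(x, a) and x != a) for x in A)}

def maximal_elements(A):
    return {a for a in A if all(not (leq(a, x) and x != a) for x in A)}

A = {2, 3, 4, 6, 9, 12}
print(minimal_elements(A))   # {2, 3}: two minimal elements
print(maximal_elements(A))   # {9, 12}: two maximal elements
```

Since A has two minimal and two maximal elements, it has neither a smallest nor a greatest element.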

Remark 7 An ordered set may have more than one minimal or maximal element. If (A, ≤) has a smallest element or a greatest element, it is uniquely determined. The smallest element of an ordered set is called the minimum of that set; similarly, the greatest element is called the maximum.

Example 2.3.2 (1) 0 is the smallest element of (N, ≤), but there are no maximal elements and no maximum.
(2) The ordered set (N, |) has 1 as a minimum and 0 as a maximum.
(3) The ordered set (P(M) \ {∅}, ⊆) has no minimum; every singleton {x}, x ∈ M, is minimal.

Definition 2.3.4. An ordered set (A, ≤) is called well ordered if every non-empty subset B of A has a minimum.

Example 2.3.3. The set (N, ≤) is well ordered. The sets (Z, ≤), (Q, ≤), (R, ≤) are totally ordered but not well ordered.

Theorem 2.3.1. If (A, ≤) is an ordered set, the following are equivalent:
(1) The minimality condition: Every non-empty subset B ⊆ A has at least one minimal element.
(2) The inductivity condition: If B ⊆ A is a subset satisfying
(a) B contains all minimal elements of A;
(b) from a ∈ A and {x ∈ A | x < a} ⊆ B follows a ∈ B,
then A = B.
(3) The decreasing chain condition: Every strictly decreasing sequence of elements of A is finite.

Definition 2.3.5. Let (A, ≤) be an ordered set and B ⊆ A. An element a ∈ A is called an upper bound of B if all elements of B are smaller than or equal to a: ∀x ∈ B. x ≤ a. The element a ∈ A is called a lower bound of B if all elements of B are greater than or equal to a: ∀x ∈ B. a ≤ x.

Lemma 2.3.2 (Zorn's Lemma). If (A, ≤) is a non-empty ordered set and every chain in A has an upper bound in A, then A has maximal elements.

2.4 Lattices. Complete Lattices

Let (L, ≤) be an ordered set and x, y ∈ L.

Definition 2.4.1. We call infimum of x, y an element z ∈ L, denoted by z = inf(x, y) = x ∧ y, with
(1) z ≤ x, z ≤ y;
(2) ∀a ∈ L. a ≤ x, a ≤ y ⇒ a ≤ z.

Definition 2.4.2. If X ⊆ L, we call infimum of the set X the element z ∈ L, denoted by z = inf X = ∧X, with
(1) ∀x ∈ X. z ≤ x;
(2) ∀a ∈ L. (∀x ∈ X. a ≤ x) ⇒ a ≤ z.

Definition 2.4.3. We call supremum of x, y an element z ∈ L, denoted by z = sup(x, y) = x ∨ y, with
(1) x ≤ z, y ≤ z;
(2) ∀a ∈ L. x ≤ a, y ≤ a ⇒ z ≤ a.

Definition 2.4.4. If X ⊆ L, we call supremum of the set X the element z ∈ L, denoted by z = sup X = ∨X, with
(1) ∀x ∈ X. x ≤ z;
(2) ∀a ∈ L. (∀x ∈ X. x ≤ a) ⇒ z ≤ a.

Remark 8
(1) The infimum of a subset X ⊆ L is the greatest lower bound of the set X.
(2) The supremum of a subset X ⊆ L is the least upper bound of the set X.

Definition 2.4.5. An ordered set (L, ≤) is called a lattice if for every x, y ∈ L both x ∧ y and x ∨ y exist.

Definition 2.4.6. An ordered set (L, ≤) is called a complete lattice if every subset of L has an infimum and a supremum.

Theorem 2.4.1. (L, ≤) is a complete lattice if and only if ∀X ⊆ L. inf X exists in L.

Example 2.4.1 (1) (N, |) is a lattice. If m, n ∈ N then m ∧ n = gcd(m, n) and m ∨ n = lcm(m, n).
(2) (P(M), ⊆) is a complete lattice, with inf(Xᵢ)ᵢ∈I = ∩ᵢ∈I Xᵢ and sup(Xᵢ)ᵢ∈I = ⋃ᵢ∈I Xᵢ.
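Example 2.4.1(1) can be made executable. This sketch is ours; besides computing meet and join as gcd and lcm, it spot-checks the lattice absorption laws x ∧ (x ∨ y) = x and x ∨ (x ∧ y) = x, which hold in every lattice.

```python
# Sketch: in the lattice (N, |), meet is gcd and join is lcm.
from math import gcd

def meet(m, n):
    return gcd(m, n)

def join(m, n):
    return m * n // gcd(m, n)      # lcm(m, n)

print(meet(12, 18), join(12, 18))    # 6 36

# Absorption laws, checked on one sample pair:
x, y = 8, 20
print(meet(x, join(x, y)) == x, join(x, meet(x, y)) == x)   # True True
```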

Definition 2.4.7. Let (P, ≤) and (Q, ≤) be ordered sets. A map f : P → Q is called order-preserving or monotone if for every x, y ∈ P, from x ≤ y follows f(x) ≤ f(y) in Q. An order-isomorphism is a bijective map f : P → Q such that both f and f⁻¹ are order-preserving.

Definition 2.4.8. Let (A, ≤) be an ordered set and Q ⊆ A a subset. The set Q is called a downward closed set if x ∈ Q and y ≤ x always imply y ∈ Q. Dually, Q is an upward closed set if x ∈ Q and x ≤ y always imply y ∈ Q. For P ⊆ A, we define
↓P := {y ∈ A | ∃x ∈ P. y ≤ x},
↑P := {y ∈ A | ∃x ∈ P. x ≤ y}.
The set ↓P is the smallest downward closed set containing P. The set P is downward closed if and only if P = ↓P. Dually, ↑P is the smallest upward closed set containing P, and the set P is upward closed if and only if P = ↑P.

Definition 2.4.9. Let (L, ≤) be a lattice. An element a ∈ L is called join-irreducible or ∨-irreducible if ∀x, y ∈ L. x < a, y < a ⇒ x ∨ y < a. A meet-irreducible or ∧-irreducible element is defined dually.

Remark 9 In a finite lattice, an element is join-irreducible if and only if it has exactly one lower neighbor. An element is meet-irreducible if and only if it has exactly one upper neighbor. The upper neighbors of 0 are called atoms; if they exist, they are always ∨-irreducible, while the lower neighbors of 1 (called coatoms), if they exist, are ∧-irreducible.

Definition 2.4.10. Let L be a lattice. A subset X ⊆ L is called supremum-dense in L if ∀a ∈ L. ∃Y ⊆ X. a = ∨Y. Dually, it is called infimum-dense if ∀a ∈ L. ∃Y ⊆ X. a = ∧Y.

Remark 10 In a finite lattice L, every supremum-dense subset of L contains all join-irreducible elements and every infimum-dense subset of L contains all meet-irreducible elements.

2.5 Graphical Representation of Ordered Sets

Every finite ordered set (A, ≤) can be represented graphically by an order diagram. Every element is represented by a circle; the order relation is represented by ascending or descending lines. An element x is placed directly below an element y if x is a lower neighbor of y. The order is then represented (because of transitivity) by an ascending path connecting two elements a ≤ b. If there is no strictly ascending (or descending) path between two elements a, b ∈ A, then a and b are incomparable with respect to the order of A.

Example 2.5.1. Here are some examples of ordered sets and their graphical representations; the practical use of this representation is readily apparent.

(1) The ordered set (A, R), where A := {a, b, c, d, e, f} and R := {(a, d), (a, e), (b, d), (b, f), (c, e), (c, f), (a, a), (b, b), (c, c), (d, d), (e, e), (f, f)}, is graphically represented by an order diagram with d, e, f on the upper level, a, b, c on the lower level, and ascending lines a–d, a–e, b–d, b–f, c–e and c–f.

(2) The next order diagram represents an ordered set which proves to be a lattice: A := {0, a, b, c, d, e, f, 1}, where the smallest element of A is denoted by 0 and the greatest element of A is denoted by 1. [Order diagram: 0 at the bottom, a, b, c on the level above it, d, e, f on the level above that, and 1 at the top; the connecting lines are not recoverable from the source.]

As one can easily check, for every pair (x, y) of elements of A we can compute the infimum and the supremum of (x, y): x ∧ y is the greatest common lower bound of x and y, while x ∨ y is the least common upper bound of x and y.

(3) The following diagram, slightly different from the previous one, displays just an ordered set and not a lattice, since there is no infimum for the elements d and e, for instance. [Order diagram: as before, with 0 at the bottom and 1 at the top; here both d and e lie above both a and b.]

Indeed, d and e have three common lower bounds, {0, a, b}, but no greatest one, since a and b are not comparable.

2.6 Closure Systems. Galois Connections

Definition 2.6.1. Let M be a set and 𝒫 ⊆ P(M) a family of subsets of M. The family 𝒫 is a closure system on M if
(1) for every non-empty family (Aᵢ)ᵢ∈I ⊆ 𝒫, we have ∩ᵢ∈I Aᵢ ∈ 𝒫;
(2) M ∈ 𝒫.
The elements of a closure system are called closed sets.

Example 2.6.1 (1) The power set of a set M is a closure system.
(2) The set E(M) of all equivalence relations on M is a closure system on M.

Proposition 2.6.1. Let 𝒫 be a closure system on a set M. Then (𝒫, ⊆) is a complete lattice.

Proof. Let (Aᵢ)ᵢ∈I be an arbitrary family of sets of 𝒫. Then ∧ᵢ∈I Aᵢ = ∩ᵢ∈I Aᵢ is the infimum of (Aᵢ)ᵢ∈I. Hence, by Theorem 2.4.1, (𝒫, ⊆) is a complete lattice. The supremum of (Aᵢ)ᵢ∈I is given by ∨ᵢ∈I Aᵢ = ∩{X ∈ 𝒫 | ⋃ᵢ∈I Aᵢ ⊆ X}.
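Proposition 2.6.1 can be checked on a small closure system. The sketch below is ours, with a hand-picked family: infima are plain intersections, while suprema are the smallest closed sets containing the union.

```python
# Sketch: inf and sup in the complete lattice (P, ⊆) of a closure system.
from functools import reduce

M = frozenset({1, 2, 3})
# A closure system on M: contains M and is closed under intersections.
P = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2}), M]

def inf(family):
    return reduce(lambda x, y: x & y, family, M)

def sup(family):
    union = reduce(lambda x, y: x | y, family, frozenset())
    return inf([X for X in P if union <= X])   # smallest closed superset

print(sorted(inf([frozenset({1, 2}), M])))            # [1, 2]
print(sorted(sup([frozenset({1}), frozenset({2})])))  # [1, 2]
```

Note that sup({1}, {2}) is {1, 2} rather than the plain union whenever the union itself fails to be closed; here the union happens to be closed already.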

Definition 2.6.2. A map φ : P(M) → P(M) is a closure operator if, for all A, B ⊆ M, the following hold true:
(1) Extensivity: A ⊆ φ(A);
(2) Monotony: if A ⊆ B, then φ(A) ⊆ φ(B);
(3) Idempotency: φ(φ(A)) = φ(A).
For every A ⊆ M, the set φ(A) is called the closure of A. The set A is called closed if φ(A) = A.

Remark 11
(1) The idempotency condition is equivalent to ∀A ∈ P(M). (φ ∘ φ)(A) = φ(A).
(2) A subset A ∈ P(M) is closed if and only if A ∈ φ(P(M)).
Proof. Suppose A is closed; then A = φ(A), hence A ∈ φ(P(M)). Conversely, if A ∈ φ(P(M)), then A = φ(X) for some X ⊆ M. Hence φ(A) = (φ ∘ φ)(X) = φ(X) = A, i.e., A is closed.
(3) If we take, instead of P(M), an arbitrary ordered set, we can define a closure operator over an ordered set as well.

Example 2.6.2. Let (P, ≤) be an ordered set. A closure operator is given by ↓(−) : P(P) → P(P), X ↦ ↓X. The corresponding closure system is the set of all downward closed sets of P.

Theorem 2.6.2. Let 𝒫 be a closure system on A. The map φ : P(A) → P(A) defined by

φ(X) := ∩{Y ∈ 𝒫 | X ⊆ Y} is a closure operator.

Proof. We have to check the three properties of a closure operator: extensivity, monotony and idempotency. From φ(X) := ∩{Y ∈ 𝒫 | X ⊆ Y} it follows that extensivity and monotony hold for φ. In order to prove the idempotency, we first prove that φ(X) = X ⇔ X ∈ 𝒫. If φ(X) = X, then X is the intersection of some sets of 𝒫, hence X ∈ 𝒫. Conversely, if X ∈ 𝒫, then X is a term of the intersection φ(X) := ∩{Y ∈ 𝒫 | X ⊆ Y}, hence φ(X) ⊆ X. By extensivity, we have φ(X) = X. From φ(X) ∈ 𝒫 and this equivalence, one has φ(φ(X)) = φ(X). Hence, φ is a closure operator. Moreover, φ(X) is the smallest set of 𝒫 which contains X.

Theorem 2.6.3. If φ : P(A) → P(A) is a closure operator on A, then 𝒫 := {X ⊆ A | φ(X) = X} is a closure system on A.

Proof. If ℛ ⊆ 𝒫 and X = ∩_{Y∈ℛ} Y, then X ⊆ Y, hence φ(X) ⊆ φ(Y) = Y for every Y ∈ ℛ. We obtain φ(X) ⊆ ∩_{Y∈ℛ} Y = X, and by extensivity we get φ(X) = X, wherefrom follows X ∈ 𝒫.

Remark 12 It has just been proved that the closed sets of a closure operator form a closure system, and every closure system is the system of all closed sets of a closure operator. If φ is a closure operator on A, denote by 𝒫_φ the corresponding closure system. If 𝒫 is a closure system, denote by φ_𝒫 the induced closure operator.

Theorem 2.6.4. 𝒫_{φ_𝒫} = 𝒫 and φ_{𝒫_φ} = φ.

Example 2.6.3 (1) The set of all closed subsets of a topological space is a closure system. Moreover, a special property holds true: the union of finitely many closed sets is again closed.

(2) Let (X, d) be a metric space. A closed set is defined to be a subset A ⊆ X which contains all its limit points. The closed sets of a metric space form a closure system; the corresponding closure operator adjoins to a subset A all its limit points.

The definitions of closure system and closure operator can easily be generalized to ordered sets.

Definition 2.6.3. A closure operator on an ordered set (P, ≤) is a map φ: P → P satisfying, for all a, b ∈ P:
(1) Extensivity: a ≤ φ(a);
(2) Monotony: if a ≤ b, then φ(a) ≤ φ(b);
(3) Idempotency: φ(a) = φ(φ(a)).
If a ∈ P, then φ(a) is called the closure of a. An element a ∈ P is called closed if a = φ(a).

Remark 13. The subset of P consisting of all elements closed with respect to φ is φ(P).

Definition 2.6.4. A closure system on a complete lattice (L, ≤) is a subset P ⊆ L closed under arbitrary infima: H ⊆ P ⇒ ⋀H ∈ P.

Definition 2.6.5. Let (P, ≤) and (Q, ≤) be ordered sets. A pair of maps φ: P → Q and Ψ: Q → P is called a Galois connection between P and Q if
(1) for all x, y ∈ P, x ≤ y implies φ(y) ≤ φ(x);
(2) for all a, b ∈ Q, a ≤ b implies Ψ(b) ≤ Ψ(a);
(3) for every p ∈ P and every q ∈ Q, we have p ≤ Ψ ∘ φ(p) and q ≤ φ ∘ Ψ(q).

Proposition 2.6.5. Let (P, ≤) and (Q, ≤) be ordered sets. A pair of maps φ: P → Q and Ψ: Q → P is a Galois connection if and only if
(4) for all p ∈ P and q ∈ Q: p ≤ Ψ(q) ⇔ q ≤ φ(p).

Proof. Suppose (φ, Ψ) is a Galois connection and let p ∈ P and q ∈ Q be arbitrary but fixed. Suppose p ≤ Ψ(q). Then, by (1), φ(Ψ(q)) ≤ φ(p). Condition (3) yields q ≤ φ(Ψ(q)), hence q ≤ φ(p). The converse follows by a similar argument. Suppose now

that for all p ∈ P and q ∈ Q: p ≤ Ψ(q) ⇔ q ≤ φ(p). Choosing q := φ(p) in (4), from φ(p) ≤ φ(p) we get p ≤ Ψ(φ(p)), i.e., condition (3). If x, y ∈ P with x ≤ y, we deduce that x ≤ Ψ(φ(y)), and using (4) we conclude that φ(y) ≤ φ(x); condition (2) follows dually.

Proposition 2.6.6. For every Galois connection (φ, Ψ) we have φ = φ ∘ Ψ ∘ φ and Ψ = Ψ ∘ φ ∘ Ψ.

Proof. Choose q := φ(p). Then, by (3), we obtain φ(p) ≤ (φ ∘ Ψ ∘ φ)(p). From p ≤ (Ψ ∘ φ)(p), using (1), we get (φ ∘ Ψ ∘ φ)(p) ≤ φ(p). Hence φ = φ ∘ Ψ ∘ φ; the second identity follows by symmetry.

Theorem 2.6.7. If (φ, Ψ) is a Galois connection between (P, ≤) and (Q, ≤), then
(1) the map Ψ ∘ φ is a closure operator on P, and the subset of all closed elements of P is Ψ(Q);
(2) the map φ ∘ Ψ is a closure operator on Q, and the subset of all closed elements of Q is φ(P).

Proof. (1) Extensivity of Ψ ∘ φ follows from (3) of 2.6.5, and its monotony from (1) and (2) of 2.6.5. Let p ∈ P. Then φ(p) ∈ Q and, by (3), φ(p) ≤ (φ ∘ Ψ)(φ(p)). From p ≤ Ψ(φ(p)) and the antitonicity of φ, we get (φ ∘ Ψ)(φ(p)) ≤ φ(p). Hence (φ ∘ Ψ)(φ(p)) = φ(p). Applying Ψ, one gets (Ψ ∘ φ) ∘ (Ψ ∘ φ)(p) = (Ψ ∘ φ)(p), i.e., the idempotency of Ψ ∘ φ. This proves that Ψ ∘ φ is a closure operator. The subset P0 of all closed elements of P is (Ψ ∘ φ)(P). Using φ(P) ⊆ Q, we have P0 ⊆ Ψ(Q). Exchanging the roles of φ and Ψ in the equality (φ ∘ Ψ)(φ(p)) = φ(p), we get (Ψ ∘ φ)(Ψ(q)) = Ψ(q) for every q ∈ Q. Hence Ψ(Q) ⊆ P0, from which P0 = Ψ(Q) follows.
(2) We proceed similarly, exchanging the roles of φ and Ψ.

Definition 2.6.6. Let A and B be sets and R ⊆ A × B a relation. We define the following derivations:

X^R := {b ∈ B | ∀a ∈ X. aRb} for X ⊆ A, and Y^R := {a ∈ A | ∀b ∈ Y. aRb} for Y ⊆ B.

Proposition 2.6.8. If R ⊆ A × B is a binary relation, the derivations defined in 2.6.6 yield a Galois connection φR: P(A) → P(B) and ΨR: P(B) → P(A) given by φR(X) := X^R for X ∈ P(A) and ΨR(Y) := Y^R for Y ∈ P(B). Conversely, if (φ, Ψ) is a Galois connection between P(A) and P(B), then

R(φ,Ψ) := {(a, b) ∈ A × B | b ∈ φ({a})}

is a binary relation between A and B. Moreover, φ of R(φ,Ψ) equals φ, Ψ of R(φ,Ψ) equals Ψ, and R of (φR, ΨR) equals R.
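The derivation operators and the Galois connection they induce can be checked mechanically on a small example. The following Python sketch (the concrete sets and names are illustrative, not from the text) verifies condition (4) of Proposition 2.6.5 for the derivations of a relation R ⊆ A × B, and that Ψ ∘ φ is a closure operator on P(A), as Theorem 2.6.7 asserts.

```python
# Derivations of a binary relation R (Definition 2.6.6) and a brute-force
# check of the Galois-connection property and the induced closure operator.
from itertools import chain, combinations

def subsets(S):
    """All subsets of S as frozensets (powerset)."""
    S = list(S)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))]

A = frozenset({1, 2})
B = frozenset({'x', 'y'})
R = {(1, 'x'), (1, 'y'), (2, 'x')}   # an arbitrary small relation

def phi(X):   # X^R: elements of B related to every element of X
    return frozenset(b for b in B if all((a, b) in R for a in X))

def psi(Y):   # Y^R: elements of A related to every element of Y
    return frozenset(a for a in A if all((a, b) in R for b in Y))

# Proposition 2.6.5 (4): X <= psi(Y)  iff  Y <= phi(X)
assert all((X <= psi(Y)) == (Y <= phi(X))
           for X in subsets(A) for Y in subsets(B))

# Theorem 2.6.7: psi o phi is extensive, monotone and idempotent
cl = lambda X: psi(phi(X))
for X in subsets(A):
    assert X <= cl(X)                 # extensivity
    assert cl(cl(X)) == cl(X)         # idempotency
    for Y in subsets(A):
        if X <= Y:
            assert cl(X) <= cl(Y)     # monotony
```

This pair (φ, Ψ) is exactly the construction underlying formal concept analysis, where A plays the role of objects and B of attributes.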

Algebraic Structures for AI

Algebraic structures play a major role in mathematics and its applications. As a natural generalization of arithmetic operations on numbers and their properties, different algebraic structures are used to describe general patterns and the behavior of elements under operations. This chapter briefly introduces basic algebraic structures.

3.1 Functions

Definition 3.1.1. A function or map is a triple f = (A, B, F), where A and B are sets and F ⊆ A × B is a relation satisfying
(xFy ∧ xFy') ⇒ y = y' (uniqueness of F),
∀x ∈ A. ∃y ∈ B. xFy (totality of F).
The set A is called the domain of f, B the codomain of f, and F the graph of f. We write f: A → B if there is a graph F such that f = (A, B, F) is a map. We define B^A := {f | f: A → B}, the set of all functions from A to B.

If f = (A, B, F) and g = (B, C, G) are functions, the composition of f and g, denoted by g ∘ f, is defined as (A, C, R), where xRz :⇔ ∃y ∈ B. xFy ∧ yGz. The function idA := (A, A, {(x, x) | x ∈ A}) is called the identity of A.

Remark 14
(1) In practice, we often identify the function f with its graph F.
(2) The graph of the identity of A is the set ΔA := {(a, a) | a ∈ A}, called the diagonal of A.

Definition 3.1.2. A function f: A → B is called
• injective or one-to-one if for all a1, a2 ∈ A, f(a1) = f(a2) implies a1 = a2;
• surjective or onto if for every b ∈ B there exists a ∈ A with f(a) = b;
• bijective if f is both injective and surjective, i.e., for every b ∈ B there exists a unique a ∈ A with f(a) = b.

Remark 15
(1) A function f: A → B is injective if and only if for all a1, a2 ∈ A, a1 ≠ a2 implies f(a1) ≠ f(a2).

(2) A function f: A → B is surjective if and only if f(A) = B.
(3) A function f = (A, B, F) is
(a) injective if and only if F⁻¹ ∘ F = ΔA;
(b) surjective if and only if F ∘ F⁻¹ = ΔB;
(c) bijective if and only if F⁻¹ ∘ F = ΔA and F ∘ F⁻¹ = ΔB. In this case, F⁻¹ is the graph of the function f⁻¹: B → A with f ∘ f⁻¹ = idB and f⁻¹ ∘ f = idA.
(4) If f⁻¹ is a map such that f ∘ f⁻¹ = idB and f⁻¹ ∘ f = idA, then f is called invertible and f⁻¹ is called the inverse map of f.

Definition 3.1.3. Let f: A → B be a function. Then f(A) := {f(a) | a ∈ A} is called the image of f. If X ⊆ B, the set f⁻¹(X) := {a ∈ A | f(a) ∈ X} is called the preimage of X.

Theorem 3.1.1. A function f: A → B is invertible if and only if f is bijective.

Proof. Let f: A → B be an invertible map. Then the inverse function f⁻¹: B → A exists with f ∘ f⁻¹ = idB and f⁻¹ ∘ f = idA. Let a1, a2 ∈ A with f(a1) = f(a2). Applying f⁻¹, we get a1 = f⁻¹(f(a1)) = f⁻¹(f(a2)) = a2, hence f is injective. Let b ∈ B. Then f(f⁻¹(b)) = b; defining a := f⁻¹(b), there exists an a ∈ A with f(a) = b, i.e., f is surjective. Since f is both injective and surjective, it is bijective.

Suppose now f is bijective. Let b ∈ B be arbitrarily chosen. Then there exists an a ∈ A with f(a) = b. Because of the injectivity of f, this a is uniquely determined. Hence {(b, a) ∈ B × A | b = f(a)} is the graph of a function g: B → A with g(b) = a for every b ∈ B. Since b = f(a) and a = g(b), we have f(g(b)) = b for every b ∈ B, i.e., f ∘ g = idB. Similarly, we can prove that g ∘ f = idA, hence f is invertible.

Theorem 3.1.2. Let f: A → B and g: B → C be bijective maps. Then
(1) g ∘ f: A → C is bijective;

(2) (g ∘ f)⁻¹ = f⁻¹ ∘ g⁻¹.

Proof. (1) Let a1, a2 ∈ A with a1 ≠ a2. Then f(a1) ≠ f(a2) and g(f(a1)) ≠ g(f(a2)), since both f and g are injective. It follows that g ∘ f is injective. Let now c ∈ C. Since g is surjective, there exists b ∈ B with g(b) = c. Using the surjectivity of f, we can find an a ∈ A with f(a) = b, hence g(f(a)) = c, i.e., g ∘ f is surjective. The function g ∘ f is injective and surjective, hence bijective.

(2) g ∘ f is bijective, hence invertible; denote its inverse by (g ∘ f)⁻¹. From c = (g ∘ f)(a) follows a = (g ∘ f)⁻¹(c). There exists a b ∈ B with b = g⁻¹(c) and a = f⁻¹(b). We conclude a = f⁻¹(b) = f⁻¹(g⁻¹(c)) = (f⁻¹ ∘ g⁻¹)(c) for every c ∈ C, which means (g ∘ f)⁻¹ = f⁻¹ ∘ g⁻¹.

46 Natural Language Processing: Semantic Aspects

Theorem 3.1.3. Let A be a finite set and f: A → A a function. The following are equivalent:

(1) f is injective;
(2) f is surjective;
(3) f is bijective.

Proof. Since A is finite, write A := {a1, a2, ..., an}. Suppose f is injective but not surjective. Then there exists ak ∈ A with ak ≠ f(ai) for all i ∈ {1, 2, ..., n}. Hence f(A) = {f(a1), f(a2), ..., f(an)} is a proper subset of {a1, a2, ..., an} and has fewer than n elements. Then there must exist indices i ≠ j with f(ai) = f(aj); injectivity yields ai = aj, a contradiction. Suppose now f is surjective but not injective. Then there exist ai, aj ∈ A with ai ≠ aj and f(ai) = f(aj), so f(A) has at most n – 1 elements. Since A has n elements, there exists ak ∈ A with ak ∉ f(A), hence f is not surjective, a contradiction.
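Theorem 3.1.3 can be confirmed exhaustively for a small set. The following Python sketch (an illustration of ours, not from the text) enumerates every self-map of a three-element set, encoding a map as the tuple of its images, and checks that injectivity and surjectivity coincide.

```python
# Exhaustive check of "injective iff surjective" for self-maps of a
# finite set: a map f on A = {0, 1, 2} is encoded as a tuple f where
# f[a] is the image of a.
from itertools import product

A = [0, 1, 2]

def is_injective(f):
    return len(set(f)) == len(f)       # no two elements share an image

def is_surjective(f):
    return set(f) == set(A)            # every element is an image

for f in product(A, repeat=len(A)):    # all 27 self-maps of A
    assert is_injective(f) == is_surjective(f)
```

For maps between sets of different sizes the equivalence fails, which is why the theorem requires a self-map of one finite set.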

3.2 Binary Operations

The concept of a binary operation is a natural generalization of the arithmetic operations on real numbers.

Definition 3.2.1. A binary operation on a set M is a map *: M × M → M, which assigns to every pair (a, b) ∈ M × M an element a * b of M. A set M together with a binary operation * is sometimes called a groupoid.

Many other symbols may be used instead of *, such as ∘, +, ·, ⊕, ⊗, ⊤. The use of + is called additive notation and often implies a + b = b + a for all a, b ∈ M; the element a + b is called the sum of a and b. The symbol · is called multiplicative notation; a · b (or simply ab) is called the product of a and b.

Example 3.2.1. Examples of operations:
(1) the usual addition on N, Z, Q, R, C, M(m × n, R), R^n;
(2) the usual multiplication on N, Z, Q, R, C, Mn(R);
(3) the cross product on R³;
(4) for a set A, the composition of functions on the set A^A of all maps of A into itself;
(5) for a set M with power set P(M), the union and intersection of subsets of M are binary operations on P(M).

The following are not operations in the sense defined above:
(1) Scalar multiplication on R^n: R × R^n → R^n, (λ, v) ↦ λv. Here we combine objects of different types, scalars and vectors, which the definition of an operation does not cover.
(2) The scalar product on R^n: R^n × R^n → R, (v, w) ↦ ⟨v, w⟩.

Here we combine objects of the same type, vectors in R^n, but the result is an object of a different type (a scalar), which the definition of an operation does not cover.
(3) Subtraction (a, b) ↦ a – b is not an operation on N, but it is an operation on Z, Q, R, C, R^n.

Definition 3.2.2. Let M be a set and * an operation on M. If a * b = b * a for all a, b ∈ M, the operation * is called commutative.

3.3 Associative Operations. Semigroups

Definition 3.3.1. The operation * defined on the set S is called associative if a * (b * c) = (a * b) * c for all a, b, c ∈ S. A pair (S, *), where S is a set and * an associative operation on S, is called a semigroup. A semigroup with a commutative operation is called a commutative semigroup.

Remark 16. There are five ways to form a product of four factors:
((a * b) * c) * d, (a * b) * (c * d), a * (b * (c * d)), (a * (b * c)) * d, a * ((b * c) * d).

If the operation * is associative, all five products are equal.

Example 3.3.1
(1) Addition and multiplication on N, Z, Q, R, C, Mm×n(R), Mn(R) are associative and commutative (matrix multiplication on Mn(R) being the exception treated in (3)).
(2) Subtraction on N, Z, Q, R, C is not associative: 2 – (3 – 4) = 3 ≠ –5 = (2 – 3) – 4.
(3) Multiplication is commutative on N, Z, Q, R, C, but it is no longer commutative on Mn(R) for n ≥ 2. Writing (a b; c d) for the 2 × 2 matrix with rows (a b) and (c d),
(0 1; 0 0) · (1 0; 0 0) = (0 0; 0 0) ≠ (0 1; 0 0) = (1 0; 0 0) · (0 1; 0 0).
(4) The composition of permutations, i.e., of bijective functions, is associative but in general not commutative.

(5) The composition of functions is not commutative: let f, g: R → R be defined by f(x) := x³ and g(x) := x + 1. Then (f ∘ g)(x) = f(g(x)) = (x + 1)³ and (g ∘ f)(x) = g(f(x)) = x³ + 1. Hence f ∘ g ≠ g ∘ f.

Example 3.3.2
(1) Let A be a singleton, i.e., a one-element set. Then there exists a unique binary operation on A, which is associative; hence A forms a semigroup.
(2) The set 2N of even natural numbers is a semigroup with respect to both addition and multiplication.

Definition 3.3.2. Let M be a set with two binary operations * and ∘. We say that ∘ is right distributive with respect to * if for all x, y, z ∈ M we have (x * y) ∘ z = (x ∘ z) * (y ∘ z), and left distributive if z ∘ (x * y) = (z ∘ x) * (z ∘ y). We say that ∘ is distributive with respect to * if it is both left and right distributive.

Definition 3.3.3. Let (S, *) be a semigroup, a ∈ S and n ∈ N*. We define the n-th power of a to be a^n := a * a * ... * a (n factors). The following computation rules hold: a^n * a^m = a^(n+m) and (a^n)^m = a^(n·m) for all n, m ∈ N*. In the additive notation, we speak of multiples of a and write n·a := a + a + ... + a (n terms).
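The power rules above can be illustrated in code. A minimal Python sketch (the choice of string concatenation as the associative operation is ours; any semigroup operation would do):

```python
# n-th power in a semigroup, defined by repeated application of the
# operation, together with a check of a^n * a^m = a^(n+m) and
# (a^n)^m = a^(n*m). The semigroup here is strings under concatenation.

def power(op, a, n):
    """a * a * ... * a, n factors (n >= 1)."""
    result = a
    for _ in range(n - 1):
        result = op(result, a)
    return result

concat = lambda u, v: u + v
a = "ab"

# a^2 * a^3 = a^5
assert concat(power(concat, a, 2), power(concat, a, 3)) == power(concat, a, 5)
# (a^2)^3 = a^6
assert power(concat, power(concat, a, 2), 3) == power(concat, a, 6)
```

Because the operation is associative, the bracketing used inside `power` does not matter, which is exactly what makes the notation a^n well defined.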

3.4 Neutral Elements. Monoids

Definition 3.4.1. Let S be a set and * an operation on S. An element e ∈ S is called a neutral element of the operation * if a * e = a = e * a for every a ∈ S.

Remark 17. Not every operation admits a neutral element. If a neutral element exists, it is uniquely determined: if e1 and e2 were both neutral elements, then e1 = e1 * e2 = e2.

Definition 3.4.2. The triple (M, *, e) is called a monoid if M is a set and * an associative operation on M having e as its neutral element. If the monoid operation * is commutative, then (M, *, e) is called a commutative monoid.

In the additive notation, the neutral element is denoted by 0; in the multiplicative notation, by 1. This does not mean that we refer to the numbers 0 and 1. The notation only suggests that these abstract elements behave, with respect to some operations, similarly to the role played by 0 and 1 in arithmetic.
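The monoid axioms are finite enough to verify exhaustively on a small carrier set. A sketch of ours, using addition modulo 5 as the operation (a commutative monoid, in fact a group):

```python
# Exhaustive check of the monoid axioms, associativity and neutrality,
# for (Z_5, + mod 5, 0).
n = 5
M = range(n)
op = lambda a, b: (a + b) % n
e = 0

for a in M:
    assert op(a, e) == a == op(e, a)                    # e is neutral
    for b in M:
        for c in M:
            assert op(op(a, b), c) == op(a, op(b, c))   # associativity
```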

Example 3.4.1
(1) (N, ·, 1) and (Z, ·, 1) are commutative monoids. Their neutral element is 1.
(2) (M2(R), ·, I2) is a non-commutative monoid. The neutral element is the unit matrix I2 = (1 0; 0 1).
(3) Let M be a set. Consider P(M) with the operations of set union and set intersection. Then (P(M), ∪, ∅) and (P(M), ∩, M) are commutative monoids.
(4) The word monoid: let A be a set of symbols called an alphabet. A word over the alphabet A is a finite sequence of symbols from A: w = a1a2...an, ai ∈ A, i = 1, ..., n. The length of w is defined to be n. We denote by A* the set of words over the alphabet A. The empty sequence is denoted by ε. On the set A*, we define a binary operation called juxtaposition: for all v = a1a2...an and w = b1b2...bm in A*, define v ∘ w := a1a2...anb1b2...bm ∈ A*.
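Juxtaposition can be modeled directly in code; a sketch of ours, with words as tuples of symbols:

```python
# The word monoid: words over an alphabet as tuples of symbols,
# juxtaposition as tuple concatenation, the empty word as the empty
# tuple. Associativity and the neutral element are checked on examples.

EPSILON = ()                      # the empty word

def juxtapose(v, w):
    return v + w                  # tuple concatenation

u = ('a', 'b')
v = ('b',)
w = ('c', 'a')

assert juxtapose(juxtapose(u, v), w) == juxtapose(u, juxtapose(v, w))
assert juxtapose(u, EPSILON) == u == juxtapose(EPSILON, u)
```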

An easy computation shows that juxtaposition is associative and has the empty sequence ε as neutral element. Hence (A*, ∘, ε) is a monoid, called the word monoid over A.

Lemma 3.4.1. If A is a set, then (A^A, ∘, idA) is a monoid.

Proof. Let f, g, h: A → A be maps in A^A. We want to prove that ∘ is associative, i.e., f ∘ (g ∘ h) = (f ∘ g) ∘ h. Let F, G, H be the graphs of f, g, h, respectively. Then, for all a1, a2 ∈ A, we have
a1(F ∘ (G ∘ H))a2 ⇔ ∃x ∈ A. a1Fx ∧ x(G ∘ H)a2 ⇔ ∃x ∈ A. a1Fx ∧ (∃y ∈ A. xGy ∧ yHa2) ⇔ ∃x ∈ A. ∃y ∈ A. a1Fx ∧ xGy ∧ yHa2 ⇔ ∃y ∈ A. (∃x ∈ A. a1Fx ∧ xGy) ∧ yHa2 ⇔ a1((F ∘ G) ∘ H)a2.
We have just proved that composition of maps is associative on A^A. The identity map always acts as a neutral element with respect to composition, hence (A^A, ∘, idA) is a monoid.

3.5 Morphisms

Every algebraic structure defines a natural class of maps called morphisms. These maps allow us to compare processes that take place in one structure with those in a similar one. The definition of a morphism involves a so-called compatibility condition with the operations of the algebraic structure.

Definition 3.5.1
• Let (S1, *1) and (S2, *2) be semigroups. A map f: S1 → S2 is a semigroup morphism if f(a *1 b) = f(a) *2 f(b) for all a, b ∈ S1. This relation is called the compatibility of the map f with the semigroup structure.
• Let (M1, ∘1, e1) and (M2, ∘2, e2) be monoids. A map f: M1 → M2 is a monoid morphism if f is a semigroup morphism and f(e1) = e2.
• A semigroup endomorphism is a semigroup morphism f: S → S. We denote the set of all endomorphisms of S by End(S). A semigroup isomorphism is a bijective morphism. A semigroup automorphism is a bijective endomorphism. The set of all automorphisms of a semigroup S is denoted by Aut(S). These notions are defined analogously for monoids.

Example 3.5.1

(1) Let (S, *) be a semigroup and s ∈ S. The map ps: N\{0} → S, ps(n) := s^n, is a semigroup morphism from (N\{0}, +) to (S, *).

Proof. Remember that ps(n) := s^n = s * s * ... * s (n factors). Let m, n ∈ N\{0} be two arbitrary natural numbers. The semigroup morphism condition requires that ps(n + m) = ps(n) * ps(m). We have ps(n + m) = s * s * ... * s (n + m factors) = (s * ... * s) (n factors) * (s * ... * s) (m factors) = ps(n) * ps(m), which proves that ps is indeed a semigroup morphism.

(2) Let (M, ∘, e) be a monoid. For m ∈ M, we define m⁰ := e. The map pm: N → M, pm(n) := m^n, is a monoid morphism.

Proof. Remember that a monoid morphism is a semigroup morphism that additionally maps the neutral element to the neutral element. The semigroup morphism condition is checked as in the previous example. The neutral element condition means pm(0) = e; by definition, pm(0) = m⁰ = e, hence pm is a monoid morphism.

(3) Let A := {a} be a singleton and A* the set of words over A. The map pa: N → A*, pa(n) := aa...a (n times), is a monoid isomorphism.

Proof. For the monoid (M, ∘, e) in (2), take the word monoid A* with juxtaposition as monoid operation and ε as neutral element. Then pa is a monoid morphism, as proved above. Let m, n ∈ N be arbitrary natural numbers. If pa(m) = pa(n), then aa...a (m times) = aa...a (n times), i.e., a word of length m equals a word of length n. Hence m = n and pa is one-to-one. Let now w ∈ A* be a word. By the definition of the word monoid A*, there is an n ∈ N with w = aa...a (n times), i.e., w = pa(n); hence pa is onto.

(4) Consider the semigroups (R, +) and (R₊, ·). The map f: R → R₊ given by f(x) := a^x for some a ∈ R₊, a ≠ 1, is a semigroup morphism.

Proof. For all x, y ∈ R, we have f(x + y) = a^(x+y) = a^x · a^y = f(x) · f(y).

(5) Let x ∈ R^n. The map fx: R → R^n, fx(λ) := λx, is a monoid morphism from (R, +, 0) to (R^n, +, 0n). The proof is left to the reader.

(6) The morphism f: R → R₊ defined by f(x) := a^x, a ∈ R₊, a ≠ 1, is an isomorphism.

Proof. Define g: R₊ → R by g(x) := loga(x). Then g(x · y) = loga(xy) = loga x + loga y = g(x) + g(y) for all x, y ∈ R₊, hence g is a semigroup morphism. Let x ∈ R be arbitrarily chosen. Then (g ∘ f)(x) = g(f(x)) = g(a^x) = loga a^x = x, i.e., g ∘ f = idR. If x ∈ R₊ is arbitrarily chosen, then (f ∘ g)(x) = f(g(x)) = f(loga x) = a^(loga x) = x, i.e., f ∘ g = idR₊.

Theorem 3.5.1. Let (M, *, e) be a monoid. Define h: M → M^M by h(x)(y) := x * y. The map h: (M, *, e) → (M^M, ∘, idM) is an injective morphism.

Proof. Let x, y ∈ M with h(x) = h(y). Then x = x * e = h(x)(e) = h(y)(e) = y * e = y, hence h is injective. We now check the morphism conditions for h. Since h(e)(x) = e * x = x = idM(x), we have h(e) = idM. For all x, y, z ∈ M, we have h(x * y) = h(x) ∘ h(y), because h(x * y)(z) = (x * y) * z = x * (y * z) = x * h(y)(z) = h(x)(h(y)(z)) = (h(x) ∘ h(y))(z).

Theorem 3.5.2. Let A be a set, iA: A → A* the function mapping every element a ∈ A to the one-element sequence a ∈ A*, and M a monoid. Then, for every map f: A → M, there exists exactly one monoid morphism h: A* → M with h ∘ iA = f. This unique map h is called the continuation of f to a monoid morphism on A*.

Proof. Uniqueness: Let h1, h2: A* → M be monoid morphisms satisfying h1 ∘ iA = h2 ∘ iA. We have to prove that h1 = h2. This is done by induction over the length n = |s| of a sequence s ∈ A*. Suppose n = 0. Then s = ε and h1(ε) = e = h2(ε). Suppose the assertion is true for n and prove it for n + 1: let s be a sequence of length n + 1. Then s = a.s' for some a ∈ A and s' ∈ A* with |s'| = n. By the induction hypothesis, h1(s') = h2(s'). Moreover, h1(a) = h1(iA(a)) = h2(iA(a)) = h2(a).

Then h1(s) = h1(a.s') = h1(a) ∘ h1(s') = h2(a) ∘ h2(s') = h2(a.s') = h2(s). We have proved that if such a continuation exists, it is uniquely determined.

Existence: Define h(a1...an) := f(a1) ∘ ... ∘ f(an). The morphism property for h is easily verified, hence such a continuation always exists.

Consider an arbitrary set S and M = (S^S, ∘, id). We obtain a one-to-one correspondence between mappings f: A → S^S and morphisms from A* to (S^S, ∘, id). This result can be used to interpret a mapping f: A → S^S as an automaton:
• S is the set of internal states of the automaton;
• A is the input alphabet;
• for every a ∈ A, f describes the state transition f(a): S → S.
The continuation of f: A → S^S to a morphism h: A* → S^S describes the state transition induced by a sequence from A*.

Definition 3.5.2. Let M be a monoid and S a set of states. An M-automaton is a monoid morphism h: M → (S^S, ∘, idS).

Remark 18. If we choose the monoid (N, +, 0) for M, then h: M → S^S is a so-called discrete system. Choosing (R≥0, +, 0) for M, we obtain a so-called continuous system.

3.6 Invertible Elements. Groups

Definition 3.6.1. Let (M, *, e) be a monoid with neutral element e. An element a ∈ M is called invertible if there is an element b ∈ M such that a * b = e = b * a. Such an element b is called an inverse of a.

Remark 19. If m ∈ M has an inverse, then this inverse is unique: let m1 and m2 both be inverse to m ∈ M. Then

m1 = m1 * e = m1 * (m * m2) = (m1 * m) * m2 = e * m2 = m2. This justifies the notation m⁻¹ for the inverse of m.

Remark 20. In the additive notation, an element a is invertible if there is an element b such that a + b = 0 = b + a. The inverse of a is denoted by –a instead of a⁻¹.

Example 3.6.1
(1) In every monoid, the neutral element e is invertible; its inverse is again e.
(2) In (N, +, 0) the only invertible element is 0.
(3) In (Z, +, 0) all elements are invertible.
(4) In (Z, ·, 1) the only invertible elements are 1 and –1.
(5) If A is a set, then the invertible elements of (A^A, ∘, idA) are exactly the bijective functions on A.

Proposition 3.6.1. Let (M, *, e) be a monoid. We denote the set of invertible elements by M×. The following hold:
(1) e ∈ M× and e⁻¹ = e.
(2) If a ∈ M×, then a⁻¹ ∈ M× and (a⁻¹)⁻¹ = a.
(3) If a, b ∈ M×, then a * b ∈ M× and (a * b)⁻¹ = b⁻¹ * a⁻¹.

Proof. (1) As e * e = e, the neutral element e is invertible and equal to its own inverse.
(2) If a is invertible, then a * a⁻¹ = e = a⁻¹ * a, hence a⁻¹ is invertible and its inverse is a.
(3) Let a and b be invertible. Then (a * b) * (b⁻¹ * a⁻¹) = a * (b * b⁻¹) * a⁻¹ = a * e * a⁻¹ = a * a⁻¹ = e and (b⁻¹ * a⁻¹) * (a * b) = b⁻¹ * (a⁻¹ * a) * b = b⁻¹ * e * b = b⁻¹ * b = e. Thus a * b is invertible and its inverse is b⁻¹ * a⁻¹.

Definition 3.6.2. A monoid (G, *, e) in which every element is invertible is called a group. If the operation * is commutative, the group is called abelian.

Remark 21
(1) A group is a set G with a binary operation *: G × G → G such that the following axioms are satisfied:
(G1) a * (b * c) = (a * b) * c for all a, b, c ∈ G;
(G2) ∃e ∈ G. ∀a ∈ G. a * e = e * a = a;
(G3) ∀a ∈ G. ∃a⁻¹ ∈ G. a * a⁻¹ = a⁻¹ * a = e.
(2) If G is a group, then G = G×.
(3) For abelian groups, we generally use the additive notation.
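Proposition 3.6.1 can be observed concretely in the multiplicative monoid of residues modulo 12 (our choice of example). The invertible elements are exactly those coprime to 12, and the sketch below checks closure, double inversion and the inverse-of-a-product rule.

```python
# Invertible elements of the monoid (Z_12, * mod 12, 1) and the rules
# of Proposition 3.6.1, checked by brute force.
n = 12
op = lambda a, b: (a * b) % n

# a is invertible iff some b satisfies a*b = 1 (mod n)
units = [a for a in range(n) if any(op(a, b) == 1 for b in range(n))]

def inverse(a):
    return next(b for b in range(n) if op(a, b) == 1)

assert 1 in units                                  # e is invertible, e^-1 = e
for a in units:
    assert inverse(inverse(a)) == a                # (a^-1)^-1 = a
    for b in units:
        assert op(a, b) in units                   # closure under *
        assert inverse(op(a, b)) == op(inverse(b), inverse(a))
```

This set of units is precisely the group M× of Theorem 3.6.3 below in the text: the invertible elements of a monoid always form a group.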

Example 3.6.2
(1) (Z, +, 0), (Q, +, 0), (R, +, 0), (C, +, 0) are abelian groups.
(2) ({–1, 1}, ·, 1) is a group.
(3) (Q\{0}, ·, 1), (R\{0}, ·, 1), (C\{0}, ·, 1) are abelian groups.
(4) (Sn, ∘, id), where Sn is the set of all permutations of an n-element set and id is the identity, is a non-commutative group for n ≥ 3.
(5) The set of all invertible n × n matrices, GLn(R) := {A ∈ Mn(R) | det A ≠ 0}, together with matrix multiplication is a group. The neutral element is the matrix In. For n = 1, this is an abelian group, isomorphic to (R\{0}, ·, 1). For n ≥ 2, this group is not abelian; with (a b; c d) denoting the 2 × 2 matrix with rows (a b) and (c d),
(1 1; 0 1) · (1 0; 1 1) = (2 1; 1 1) ≠ (1 1; 1 2) = (1 0; 1 1) · (1 1; 0 1).

This group is called the general linear group of dimension n.
(6) SLn(R) := {A ∈ GLn(R) | det A = 1} together with matrix multiplication is a group, called the special linear group of dimension n.

Proposition 3.6.2. A triple (G, *, e), where G is a set, * an associative operation on G and e ∈ G, is a group if and only if for all a, b ∈ G the equations a * x = b and y * a = b have unique solutions in G.

Proof. Let (G, *, e) be a group and consider the equations a * x = b and y * a = b. By condition (G3), every element a ∈ G is invertible, its inverse being denoted by a⁻¹. From a * x = b we get x = a⁻¹ * b, and indeed a * (a⁻¹ * b) = (a * a⁻¹) * b = e * b = b, i.e., x = a⁻¹ * b is a solution. Suppose there are two solutions x1 and x2, i.e., a * x1 = b and a * x2 = b.

Then a * x1 = a * x2. Multiplying on the left by a⁻¹, we obtain x1 = x2, i.e., the solution of the equation is uniquely determined. Similarly, one gets the existence of a unique solution of the equation y * a = b.

Suppose now that, for a triple (G, *, e), the equations a * x = b and y * a = b have unique solutions in G for all a, b ∈ G. Choose b = a; then the equation a * x = a has a unique solution for every a ∈ G. We denote this unique solution by ea. We now prove that all ea are equal. Let b ∈ G be arbitrarily chosen and z the unique solution of y * a = b, i.e., z * a = b. Then b * ea = (z * a) * ea = z * (a * ea) = z * a = b, i.e., ea does not depend on a. Writing ea =: e, we have x * e = x for every x ∈ G. Similarly, starting with y * a = b, we conclude the existence of a unique element e' ∈ G with e' * x = x for every x ∈ G. From x * e = x and e' * x = x, we get e' * e = e' and e' * e = e, i.e., e = e'. Hence the binary operation * on G admits a neutral element. From the uniqueness of the solutions of a * x = b and y * a = b, we obtain unique a', a'' ∈ G with a * a' = e and a'' * a = e. Since * is associative and admits a neutral element, we have a' = e * a' = (a'' * a) * a' = a'' * (a * a') = a'' * e = a''. We have just proved that in G every element is invertible, hence G is a group.

Theorem 3.6.3. Let (M, *, e) be a monoid with neutral element e. Then M× is a group with neutral element e with respect to the restriction of the operation * to M×.

Proof. For a, b ∈ M×, the product a * b is also in M×, since the product of invertible elements is again invertible. Thus restricting * to M× yields an operation on M×. The operation * on M is associative, hence its restriction to M× is associative too. Also, e ∈ M×, and clearly e is a neutral element for M× as well. Every a ∈ M× is invertible by definition, and its inverse a⁻¹ also belongs to M×.

Lemma 3.6.4. For a group (G, ·, e) the following hold:
(1) x·y = e ⇒ y·x = e;
(2) x·y = x·z ⇒ y = z.

Proof. Let x·y = e. Then there is a z ∈ G with y·z = e. Hence x = x·e = x·y·z = e·z = z, and so y·x = e. Suppose now that x·y = x·z. Then there is an element u ∈ G with u·x = e. Hence y = e·y = u·x·y = u·x·z = e·z = z.

Definition 3.6.3
(1) A group G is called finite if G is a finite set, and infinite otherwise.
(2) If G is a group, the order of G is the cardinality of G. We write ord(G) = |G|.

Proposition 3.6.5. Let (G, ·, 1) be a group and x ∈ G. Then, for all m, n ∈ Z:
(1) x^n · x^m = x^(n+m);
(2) (x^n)^m = x^(nm).

3.7 Subgroups

Definition 3.7.1. Let G be a group. A subset H ⊆ G is called a subgroup if
(1) e ∈ H;
(2) if g, h ∈ H, then gh ∈ H;
(3) if g ∈ H, then g⁻¹ ∈ H.

Remark 22
(1) The last two properties can be merged into a single one: g, h ∈ H ⇒ gh⁻¹ ∈ H.
(2) Every subgroup is itself a group.
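The merged subgroup criterion of Remark 22 is easy to test mechanically. A sketch of ours for subsets of the additive group Z_6:

```python
# Subgroup test via the merged criterion: H is a subgroup iff e is in H
# and g * h^-1 lies in H for all g, h in H. The group is (Z_6, + mod 6).
n = 6
add = lambda a, b: (a + b) % n
neg = lambda a: (-a) % n          # additive inverse mod n

def is_subgroup(H):
    return 0 in H and all(add(g, neg(h)) in H for g in H for h in H)

assert is_subgroup({0})
assert is_subgroup({0, 2, 4})
assert is_subgroup({0, 3})
assert not is_subgroup({0, 2})    # 0 - 2 = 4 (mod 6) is not in the set
```

The subgroups found this way, {0}, {0, 3}, {0, 2, 4} and Z_6 itself, are exactly the sets dZ_6 for the divisors d of 6, mirroring Example 3.7.1 (1) below in the text.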

Example 3.7.1
(1) All subgroups of (Z, +) are of the form H = nZ.

Proof. Let H be a subgroup of Z. If H = {0}, take n = 0. Suppose H ≠ {0}. Then there is at least one element a ∈ H with a ≠ 0. If a ∈ H, then –a ∈ H, so there is at least one element a ∈ H with a > 0. Let A := {a ∈ H | a > 0}. Then A ≠ ∅. The set A has a minimum, denoted by n. We prove that H = nZ. Since n ∈ A ⊆ H, we have nZ ⊆ H. Let a ∈ H. Then there are q, r ∈ Z with a = nq + r, 0 ≤ r < n. Since a ∈ H and nq ∈ nZ ⊆ H, we get r = a – nq ∈ H. Because n is the least positive element of H and r < n, we have r = 0, i.e., a = nq ∈ nZ.

(2) (Z, +) is a subgroup of (Q, +), (Q, +) is a subgroup of (R, +), and (R, +) is a subgroup of (C, +).

(3) The orthogonal group, i.e., the group of all orthogonal n × n matrices, and the special orthogonal group, i.e., the group of all orthogonal n × n matrices with determinant 1, are subgroups of GLn(R).

Proposition 3.7.1. Let Hi, i ∈ I, be a family of subgroups of a group G. Then H := ∩ of all Hi, i ∈ I, is a subgroup of G.

Proposition 3.7.2 (External characterization). Let G be a group and M a subset of G. Then there is a least subgroup of G containing M. This subgroup is called the subgroup generated by M and is denoted by ⟨M⟩.

Proof. Let Hi, i ∈ I, be the family of all subgroups of G containing M. Their intersection is again a subgroup of G containing M, and hence the smallest subgroup of G containing M.

Proposition 3.7.3 (Internal characterization). Let M be a subset of a group G. Then ⟨M⟩ is the set of all finite products of elements of M and their inverses:

(*) g = x1^(ε1) · x2^(ε2) · ... · xn^(εn), xi ∈ M, εi ∈ {–1, 1}, i = 1, ..., n.

Proof. We prove that the elements of form (*) constitute a subgroup H of G. The neutral element e is the empty product with n = 0 factors, hence an element of H. The product of two elements of form (*) is again an element of this form, hence in H. The inverse of g is g⁻¹ = xn^(–εn) · ... · x1^(–ε1), which again has form (*). Moreover, M is a subset of H, since every element x of M has form (*) with the single factor x. Every subgroup U containing M contains all inverses of elements x ∈ M, hence all elements of form (*), i.e., H ⊆ U.

Remark 23. If the group operation is commutative, then the elements of the subgroup generated by {a1, ..., an} are of the form a1^(k1) · a2^(k2) · ... · an^(kn), k1, ..., kn ∈ Z.

Definition 3.7.2. A group is called cyclic if it is generated by one of its elements.

Remark 24. Let G be a group and a ∈ G. The subgroup generated by a consists of all powers of a: ⟨a⟩ = {a^k | k ∈ Z}.

Definition 3.7.3. Let G be a group and H a subgroup of G. For every g ∈ G, the set gH := {gh | h ∈ H} is called the left coset of g with respect to H.

Proposition 3.7.4. The cosets of H form a partition of G into disjoint sets. The equivalence relation generated by this partition is given by x ~ y :⇔ x⁻¹y ∈ H ⇔ y ∈ xH.

Proof. Since e ∈ H, it follows that ge = g ∈ gH for every g ∈ G, hence the cosets cover G. Suppose z ∈ xH ∩ yH. Then z = xh1 = yh2. It follows that x = yh2h1⁻¹ ∈ yH and so xH ⊆ yH. Dually, y = xh1h2⁻¹ ∈ xH and so yH ⊆ xH. We have proved that if xH ∩ yH ≠ ∅, then xH = yH, which concludes the proof.

Example 3.7.2

(1) In the group (Z, +), consider the subgroup H = nZ. The cosets are exactly the equivalence classes modulo n:
[0] = 0 + nZ, [1] = 1 + nZ, ..., [n – 1] = (n – 1) + nZ.

(2) Consider the permutation group of the three-element set {1, 2, 3}:
S3 = {id = (1 2 3 / 1 2 3), p1 = (1 2 3 / 2 3 1), p2 = (1 2 3 / 3 1 2), t12 = (1 2 3 / 2 1 3), t13 = (1 2 3 / 3 2 1), t23 = (1 2 3 / 1 3 2)},
where (1 2 3 / i j k) denotes the permutation mapping 1 ↦ i, 2 ↦ j, 3 ↦ k. Then H = {id, t12} is a subgroup. The cosets are given by H, t23H = {t23, p2}, and t13H = {t13, p1}.

Definition 3.7.4. If G is a group, the right cosets are defined dually: Hx := {hx | h ∈ H}.
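Left and right cosets in S3 can be computed directly; a sketch of ours, encoding a permutation as the tuple of images of 1, 2, 3:

```python
# Left coset t13*H versus right coset H*t13 in S3, with H = {id, t12}.
def compose(p, q):
    """(p o q)(i) = p(q(i)); permutations as tuples of images of 1..3."""
    return tuple(p[q[i] - 1] for i in range(3))

ident = (1, 2, 3)
t12, t13, t23 = (2, 1, 3), (3, 2, 1), (1, 3, 2)
p1, p2 = (2, 3, 1), (3, 1, 2)

H = {ident, t12}
left = {compose(t13, h) for h in H}     # t13 H
right = {compose(h, t13) for h in H}    # H t13

assert left == {t13, p1}
assert right == {t13, p2}
assert left != right                    # left and right cosets differ
```

The mismatch computed here is exactly the non-commutative phenomenon the surrounding text describes: in an abelian group the two cosets would coincide.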

Remark 25. If the group G is commutative, then the left and right cosets coincide. In the non-commutative case, this is no longer true: in the above example of the permutation group S3 and the subgroup H, we have t13H = {t13, p1} ≠ {t13, p2} = Ht13.

3.8 Group Morphisms

Definition 3.8.1. Let (G1, ∘1, e1) and (G2, ∘2, e2) be groups. A group morphism is a monoid morphism f: G1 → G2 with f(a⁻¹) = (f(a))⁻¹ for every a ∈ G1.

The morphism f is an isomorphism if f is bijective. An isomorphism f : G → G is called an automorphism.

Example 3.8.1

(1) Let (Z, +, 0) be the additive group of integers and (G, +, 0) an arbitrary group. Let a ∈ G be fixed. The map f : Z → G defined by f(n) := na is a group morphism.

(2) Let (R\{0}, ·, 1) be the multiplicative group of reals and GL2(R) := {A ∈ M2(R) | det A ≠ 0} the multiplicative group of invertible matrices over R. The map f : R\{0} → GL2(R) defined by

f(x) := diag(x, x), the 2×2 diagonal matrix with both diagonal entries equal to x,

is a group morphism.

Proof. Indeed, for every x, y ∈ R\{0},

f(x) · f(y) = diag(x, x) · diag(y, y) = diag(xy, xy) = f(xy).

(3) Let (G, ·, 1) be a commutative group and f : G → G defined by f(x) := x^-1. The function f is a group morphism.

Proof. For every x, y ∈ G, f(xy) = (xy)^-1 = y^-1 x^-1 = x^-1 y^-1 = f(x)f(y).

(4) The identity function idG : G → G is a group isomorphism, hence an automorphism of G.

(5) The map f : (R, +, 0) → (R+, ·, 1) defined by f(x) := e^x is a group isomorphism. Its inverse is g : (R+, ·, 1) → (R, +, 0), given by g(x) := ln x.

(6) Let (G, ·, 1) be a group and x ∈ G. The function ι_x : G → G, defined by ι_x(g) := xgx^-1 for every g ∈ G, is an automorphism of G.

Proof. For g, g' ∈ G, ι_x(gg') = xgg'x^-1 = (xgx^-1)(xg'x^-1) = ι_x(g)ι_x(g'),

i.e., ι_x is a group morphism. We now prove the bijectivity of ι_x. Suppose ι_x(g) = ι_x(g'). Then xgx^-1 = xg'x^-1, hence g = g', so ι_x is injective. For every h ∈ G, take g := x^-1 hx ∈ G. Then ι_x(g) = h, so ι_x is surjective, hence bijective. The automorphisms ι_x are called inner automorphisms of G. If G is abelian, all inner automorphisms are equal to idG, the identity on G.

(7) The set of all inner automorphisms of a group G, with map composition as operation and the identity as neutral element, is a group.

Proposition 3.8.1. If f : G1 → G2 and g : G2 → G3 are group morphisms, then g ∘ f is a group morphism.

Proof. For every x, y ∈ G1, (g ∘ f)(xy) = g(f(xy)) = g(f(x)f(y)) = g(f(x))g(f(y)) = (g ∘ f)(x)(g ∘ f)(y).

Proposition 3.8.2. Let f : G → H be a group morphism. Then (1) f(e) = e; (2) f(x^-1) = f(x)^-1 for all x ∈ G; (3) if G1 is a subgroup of G, then f(G1) is a subgroup of H; (4) if H1 is a subgroup of H, then f^-1(H1) is a subgroup of G.

Proof. (1) We have f(e · e) = f(e) = f(e)f(e). Multiplying on the left by f(e)^-1, we obtain e = f(e). (2) e = x · x^-1 for every x ∈ G. Then e = f(e) = f(x · x^-1) = f(x)f(x^-1), and so f(x^-1) = f(x)^-1. (3) For every x, y ∈ G1 we have x · y^-1 ∈ G1, by the subgroup criterion. Let now s, t ∈ f(G1). Then there exist x, y ∈ G1 with s = f(x), t = f(y). We want to prove that st^-1 ∈ f(G1). Indeed, st^-1 = f(x)f(y)^-1 = f(x · y^-1) ∈ f(G1). (4) Let x, y ∈ f^-1(H1). Then s := f(x) ∈ H1 and t := f(y) ∈ H1, and f(x · y^-1) = st^-1 ∈ H1, hence x · y^-1 ∈ f^-1(H1). Therefore f^-1(H1) is a subgroup.
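Two of these examples can be checked numerically; a minimal Python sketch (an illustration, with an invertible matrix x chosen arbitrarily for the inner-automorphism check):

```python
# Sketch: (a) f(x) = e^x is a morphism (R, +) -> (R_{>0}, ·), inverted by ln;
#         (b) the inner automorphism i_x(g) = x g x^-1 preserves products.
import math

def f(x):
    return math.exp(x)

def g(x):
    return math.log(x)

for a, b in [(0.0, 1.0), (-2.5, 3.0), (1.7, 1.7)]:
    assert math.isclose(f(a + b), f(a) * f(b))  # morphism property
    assert math.isclose(g(f(a)), a)             # g inverts f

# Inner automorphism with 2x2 real matrices (plain nested lists).
def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

x = [[2, 1], [1, 1]]                    # invertible: det = 1
g1, g2 = [[1, 2], [0, 1]], [[1, 0], [3, 1]]
iota = lambda m: mul(mul(x, m), inv(x))

lhs = iota(mul(g1, g2))                 # i_x(g1 g2)
rhs = mul(iota(g1), iota(g2))           # i_x(g1) i_x(g2)
assert all(math.isclose(lhs[i][j], rhs[i][j]) for i in range(2) for j in range(2))
```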

Definition 3.8.2. Let f : G → H be a group morphism. The kernel of f is defined as ker(f) = {x ∈ G | f(x) = e}. The image of f is defined as Im(f) = f(G).

Remark 26 ker(f) = f^-1({e}), hence the kernel of a group morphism is a subgroup of G. The image of f is a subgroup of H.

Theorem 3.8.3. A group morphism f : G → H is injective if and only if ker(f) = {e}.

Proof. Since f(e) = e, we have e ∈ ker(f). Let x ∈ ker(f). Then f(x) = e = f(e). If f is injective, then x = e, i.e., ker(f) = {e}. Suppose now ker(f) = {e} and f(x) = f(y). It follows that e = f(x)^-1 f(y) = f(x^-1 y), i.e., x^-1 y ∈ ker(f) = {e}. We have x^-1 y = e, from which follows x = y, hence f is injective.

3.9 Congruence Relations

Let G be a set and * a binary operation on G.

Definition 3.9.1. An equivalence relation R on G is called a congruence relation if for all a, b, c, d ∈ G

(C) aRc, bRd ⇒ (a * b) R (c * d).

The equivalence classes are called congruence classes. The congruence class of an element a ∈ G is denoted by [a] = {g ∈ G | gRa}. The set of all congruence classes is denoted by G/R and is called the quotient of G.

Remark 27 (1) The notion of a congruence is a natural generalization of congruence modulo n for integers. For every natural n ∈ N, the relation ≡n is not only an equivalence relation but also a congruence relation on Z. (2) We can define the following operation on the quotient G/R: [a] * [b] := [a * b]. This operation is well defined, being independent of the choice of representatives. The quotient map π_R : G → G/R, π_R(a) := [a], is a surjective morphism. This can easily be seen from π_R(a * b) = [a * b] = [a] * [b] = π_R(a) * π_R(b).

The quotient set with this induced operation is called the quotient groupoid of G, and π_R is called the quotient morphism. (3) If G is a semigroup, monoid or group, respectively, then so is the quotient of G modulo the congruence R.

Proposition 3.9.1. Let R be a congruence on a group G. Then (1) xRy ⇔ xy^-1 R e; (2) xRy ⇒ x^-1 R y^-1; (3) [e] is a subgroup of G; (4) eRx ⇔ eR(yxy^-1).

Proof. (1) xRy ⇒ xy^-1 R yy^-1 ⇔ xy^-1 R e, and xy^-1 R e ⇒ (xy^-1)y R ey ⇔ xRy. (2) xRy ⇒ xy^-1 R e ⇒ x^-1(xy^-1) R x^-1 ⇔ y^-1 R x^-1 ⇔ x^-1 R y^-1. (3) Because of the reflexivity of R we have eRe, and so e ∈ [e]. If x, y ∈ [e], then eRx and eRy. It follows that e = ee R xy, hence xy ∈ [e]. Let now x ∈ [e]. Then eRx and so e = e^-1 R x^-1, hence x^-1 ∈ [e]. (4) eRx ⇒ yey^-1 R yxy^-1 ⇒ eR(yxy^-1), and eR(yxy^-1) ⇒ y^-1 e y R y^-1(yxy^-1)y ⇒ eRx.

Definition 3.9.2. Let G be a group. A subgroup H is called normal if gHg^-1 = H for all g ∈ G.

Proposition 3.9.2. Let H be a normal subgroup of a group G. The relation RH ⊆ G × G with x RH y :⇔ xy^-1 ∈ H is a congruence on G.

Proof. Let x1, x2, y1, y2 ∈ G with x1 RH x2 and y1 RH y2, i.e., x1x2^-1, y1y2^-1 ∈ H. Then

x1y1(x2y2)^-1 = x1y1y2^-1x2^-1 = (x1y1y2^-1x1^-1)(x1x2^-1) ∈ H,

since x1y1y2^-1x1^-1 ∈ H (because y1y2^-1 ∈ H and H is a normal subgroup) and x1x2^-1 ∈ H. It follows that x1y1 RH x2y2.
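The link between congruences and normal subgroups can be tested exhaustively on S3; a small Python sketch (illustrative; A3 denotes the normal subgroup of the two 3-cycles and the identity):

```python
# Sketch: x R_H y :<=> x y^-1 in H is a congruence on S3 exactly when
# H is a normal subgroup.  A3 is normal; {id, t12} is not.
from itertools import permutations

def compose(s, t):
    """(s ∘ t)(i) = s(t(i)); permutations as value tuples on {1,2,3}."""
    return tuple(s[t[i] - 1] for i in range(3))

def inverse(s):
    inv = [0, 0, 0]
    for i, v in enumerate(s):
        inv[v - 1] = i + 1
    return tuple(inv)

S3 = list(permutations((1, 2, 3)))

def is_congruence(H):
    rel = lambda x, y: compose(x, inverse(y)) in H
    # Condition (C): aRc and bRd must imply (a b) R (c d).
    return all(rel(compose(a, b), compose(c, d))
               for a in S3 for b in S3 for c in S3 for d in S3
               if rel(a, c) and rel(b, d))

A3 = {(1, 2, 3), (2, 3, 1), (3, 1, 2)}   # normal subgroup
H  = {(1, 2, 3), (2, 1, 3)}              # subgroup, but not normal

assert is_congruence(A3)
assert not is_congruence(H)
```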

The following theorem states that there is a one-to-one correspondence between congruences on a group G and its normal subgroups.

Theorem 3.9.3. Let G be a group. Then (1) [e] with respect to RH equals H, for all normal subgroups H of G; (2) the congruence induced by the subgroup [e] of R equals R, for all congruences R on G.

Proof. (1) x ∈ [e] ⇔ x RH e ⇔ xe^-1 ∈ H ⇔ x ∈ H. (2) x R' y ⇔ xy^-1 ∈ [e] ⇔ xy^-1 R e ⇔ xRy, where R' is the congruence induced by [e].

3.10 Rings and Fields

The usual arithmetic operations on number sets are addition and multiplication. The same situation occurs in the case of matrix sets. We now investigate the general situation:

Definition 3.10.1. A set R with two operations + and ·, called addition and multiplication, and a fixed element 0 ∈ R is called a ring if

(R1) (R, +, 0) is a commutative group;
(R2) (R, ·) is a semigroup;
(R3) the following distributivity conditions hold: a·(b + c) = a·b + a·c for every a, b, c ∈ R, and (b + c)·a = b·a + c·a for every a, b, c ∈ R.

If the ring R has an element 1 ∈ R such that (R, ·, 1) is a monoid, then R is called a unit ring. If the multiplication commutes, the ring is called a commutative ring.

Example 3.10.1 (1) (Z, +, ·, 0, 1) is a commutative unit ring. (2) (Mn(R), +, ·, 0n, In) is a non-commutative unit ring for n ≥ 2. (3) Z[i] := {z = a + bi ∈ C | a, b ∈ Z} is a commutative ring.

Definition 3.10.2. A commutative unit ring K with 0 ≠ 1, in which every element of K* := K\{0} is invertible, is called a field.

Example 3.10.2

(1) Q and R are fields.

(2) The quaternion field: Consider the following set of 2×2 matrices over C (rows separated by semicolons):

H := { (z  −w ; w̄  z̄) ∈ M2(C) | z, w ∈ C }.

Recall that for a complex number z = x + iy ∈ C we denote by z̄ := x − iy ∈ C the complex conjugate of z. The set H with matrix addition and multiplication forms a non-commutative field, called the quaternion field. The elements of H

e := (1 0 ; 0 1), u := (i 0 ; 0 −i), v := (0 1 ; −1 0), w := (0 i ; i 0)

satisfy

u^2 = v^2 = w^2 = −e and uv = −vu = w, vw = −wv = u, wu = −uw = v.

(3) We consider the set of residue classes modulo 2, Z2 = {0, 1}, and define the following operations:

+ | 0 1      · | 0 1
0 | 0 1      0 | 0 0
1 | 1 0      1 | 0 1

Z2 with these operations is a commutative field.

Remark 28 A field is a set K with two operations +, · and two distinct elements 0 ≠ 1, satisfying the following axioms:

(K1) ∀a, b, c ∈ K: a + (b + c) = (a + b) + c and a·(b·c) = (a·b)·c;
(K2) ∀a ∈ K: a + 0 = 0 + a = a and a·1 = 1·a = a;
(K3) ∀a ∈ K ∃b ∈ K: a + b = b + a = 0;
(K4) ∀a ∈ K\{0} ∃b ∈ K: a·b = b·a = 1;
(K5) ∀a, b ∈ K: a + b = b + a;
(K6) ∀a, b ∈ K: a·b = b·a;
(K7) ∀a, b, c ∈ K: a·(b + c) = a·b + a·c.
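The quaternion matrix identities above can be verified mechanically; a minimal Python sketch using plain nested lists for the 2×2 complex matrices:

```python
# Sketch: verifying the quaternion relations with 2x2 complex matrices
# (e, u, v, w as defined above).

def mat_mul(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def neg(A):
    return [[-x for x in row] for row in A]

e = [[1, 0], [0, 1]]
u = [[1j, 0], [0, -1j]]
v = [[0, 1], [-1, 0]]
w = [[0, 1j], [1j, 0]]

assert mat_mul(u, u) == neg(e)   # u^2 = -e
assert mat_mul(v, v) == neg(e)   # v^2 = -e
assert mat_mul(w, w) == neg(e)   # w^2 = -e
assert mat_mul(u, v) == w        # uv = w
assert mat_mul(v, w) == u        # vw = u
assert mat_mul(w, u) == v        # wu = v
assert mat_mul(v, u) == neg(w)   # vu = -w: multiplication is not commutative
```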

Linear Algebra for AI

4.1 Vectors

In geometry, a vector is an oriented segment, with a precise direction and length. Real numbers are called scalars. Two oriented segments represent the same vector if they have the same length and direction. In other words, no difference is made between parallel segments of the same length and orientation. Let u and v be two vectors. The addition of u and v is described by the parallelogram rule. The subtraction of two vectors is defined by the triangle rule, adding the opposite vector. The zero vector is denoted by 0, being a vector of length 0 and no direction. Let v be a vector and α a scalar. The scalar multiplication of v with the scalar α is denoted by αv and is defined as follows:

• If α > 0 then αv has the same direction as v and its length equals α times the length of v.
• If α = 0 then 0v = 0.
• If α < 0 then αv has the opposite direction to v and its length equals |α| times the length of v.

Remark 29 The previous operations satisfy the following rules:
(1) u + (v + w) = (u + v) + w, for all vectors u, v, w.
(2) v + 0 = 0 + v = v, for every vector v.
(3) v + (−v) = −v + v = 0, for every vector v.
(4) v + w = w + v, for all vectors v, w.
(5) λ(μv) = (λμ)v, for all scalars λ, μ and every vector v.
(6) 1v = v, for every vector v.
(7) λ(v + w) = λv + λw, for every scalar λ and all vectors v, w.
(8) (λ + μ)v = λv + μv, for all scalars λ, μ and every vector v.

Remark 30 The following derived rules hold true:
(1) u + w = v + w implies u = v.

(2) 0v = 0 and λ0 = 0.
(3) λv = 0 implies λ = 0 or v = 0.

Proof. (1) Suppose u + w = v + w. Adding −w, we obtain (u + w) + (−w) = (v + w) + (−w). Applying rules (1), (3), and (2), we obtain (u + w) + (−w) = u + (w + (−w)) = u + 0 = u. Similarly, (v + w) + (−w) = v, hence u = v. (2) We have 0v = (1 + (−1))v = 1v + (−1)v = v + (−v) = 0 by (8), (6) and (3). We proceed similarly for the second part. (3) Suppose λv = 0. If λ = 0, the proof is finished. If λ ≠ 0, then, multiplying the equation λv = 0 by λ^-1, the right-hand side is still 0, while the left-hand side becomes (λ^-1)(λv) = (λ^-1 λ)v = 1v = v, using (5) and (6). We obtain v = 0.

Definition 4.1.1. Two vectors v and w in the plane are called linearly independent if they are not parallel. If they are parallel, they are called linearly dependent.

Remark 31 The vectors v and w are linearly independent if and only if there is no scalar λ such that v = λw or w = λv. If there is a scalar λ such that v = λw or w = λv, the vectors v and w are linearly dependent.

Proposition 4.1.1. Let v and w be two vectors in the plane. The following conditions are equivalent:
(1) v and w are linearly independent.
(2) If λv + μw = 0, then λ = 0 and μ = 0.
(3) If λv + μw = λ'v + μ'w, then λ = λ' and μ = μ'.

Proof. (1 ⇒ 2) Suppose λv + μw = 0 with λ ≠ 0. Then λv = −μw and, multiplying by λ^-1, one gets v = (−μ/λ)w, a contradiction to the linear independence of the two vectors. (2 ⇒ 3) Suppose λv + μw = λ'v + μ'w. Then (λ − λ')v + (μ − μ')w = 0, implying λ = λ' and μ = μ'. (3 ⇒ 1) Suppose v and w are linearly dependent. Then there exists a scalar λ such that v = λw or w = λv. In the first case, v = λw leads to 1v + 0w = 0v + λw. From the hypothesis, one gets 1 = 0, a contradiction!
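For concrete plane vectors, condition (2) above reduces to a determinant test: v = (v1, v2) and w = (w1, w2) are linearly independent exactly when v1·w2 − v2·w1 ≠ 0. A minimal Python sketch (illustrative):

```python
# Sketch: two plane vectors are linearly independent iff the 2x2
# determinant det(v, w) = v[0]*w[1] - v[1]*w[0] is non-zero.

def independent(v, w):
    return v[0] * w[1] - v[1] * w[0] != 0

assert independent((1, 0), (0, 1))       # the coordinate axes
assert independent((2, 1), (1, 2))
assert not independent((1, 2), (2, 4))   # w = 2v: parallel, hence dependent
assert not independent((3, -1), (0, 0))  # the zero vector is dependent on anything
```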

In the following, we will describe the conditions under which three vectors are not coplanar.

Definition 4.1.2. Let v and w be two vectors and λ, μ ∈ R scalars. A vector λv + μw is called a linear combination of v and w.

Remark 32 If v and w are fixed and the scalars λ and μ run over the set of all reals, the set of all linear combinations of v and w is the set of vectors in the plane defined by the support lines of the two vectors v and w.

Definition 4.1.3. Three vectors u, v and w in space are called linearly independent if none of them is a linear combination of the other two. Otherwise, they are called linearly dependent.

Remark 33 Three vectors u, v and w are linearly independent if and only if they are not coplanar.

Proposition 4.1.2. For every three vectors u, v and w in space, the following are equivalent:
(1) u, v and w are linearly independent.
(2) λu + μv + νw = 0 implies λ = μ = ν = 0.
(3) λu + μv + νw = λ'u + μ'v + ν'w implies λ = λ', μ = μ', ν = ν'.

4.2 The space Rn

Let n be a non-zero natural number. The set Rn consists of all n-tuples

v = (a1, a2, . . . , an)ᵀ,

where a1, a2, . . . , an ∈ R (column vectors, written here as transposed rows for typographical convenience). The elements of Rn are called vectors and they describe the coordinates of a point in Rn with respect to the origin of the chosen coordinate system. The zero vector is

0 = (0, 0, . . . , 0)ᵀ.

Addition and scalar multiplication are defined componentwise: for v = (a1, . . . , an)ᵀ and w = (b1, . . . , bn)ᵀ,

v + w = (a1 + b1, a2 + b2, . . . , an + bn)ᵀ,  λv = (λa1, λa2, . . . , λan)ᵀ.
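Componentwise, these operations are easy to state in code; a minimal Python sketch using plain lists as vectors:

```python
# Sketch: componentwise vector addition and scalar multiplication in R^n.

def vec_add(v, w):
    assert len(v) == len(w)
    return [a + b for a, b in zip(v, w)]

def scal_mul(lam, v):
    return [lam * a for a in v]

v = [1.0, 2.0, 3.0]
w = [4.0, 5.0, 6.0]

print(vec_add(v, w))      # [5.0, 7.0, 9.0]
print(scal_mul(2.0, v))   # [2.0, 4.0, 6.0]

# Rule (8): (lam + mu) v = lam v + mu v, checked componentwise.
lam, mu = 2.0, -0.5
assert scal_mul(lam + mu, v) == vec_add(scal_mul(lam, v), scal_mul(mu, v))
```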

The properties (1)-(8) of vector addition and scalar multiplication hold true in Rn.

Remark 34 (Special cases)
(1) n = 0. Define R0 := {0}.
(2) n = 1. Define R1 := R.
(3) n = 2. In order to identify the set of all vectors in the plane with R2, a coordinatization is necessary. For this, we choose an origin O and two linearly independent vectors u1, u2 in the plane. Any other vector v in the plane can be represented as a linear combination of u1 and u2: v = α1u1 + α2u2. The scalars α1 and α2 are uniquely determined by the linearly independent vectors u1 and u2. Indeed, suppose there are other scalars β1 and β2 with v = α1u1 + α2u2 = β1u1 + β2u2; then from the linear independence condition we obtain α1 = β1, α2 = β2. The system B := (O, u1, u2) is called a coordinate system, and the scalars α1, α2 are called the coordinates of the vector v with respect to the system B. The R2 vector

[v]B := (α1, α2)ᵀ,

consisting of the coordinates of v with respect to the system B, is called the coordinate vector of v with respect to B. This coordinatization gives a one-to-one correspondence between the set of all vectors in the plane and the elements of R2. Indeed, every vector in the plane has a unique pair of coordinates in R2. Conversely, to every element (a1, a2)ᵀ ∈ R2 can be associated a vector v in the plane, namely the linear combination of the vectors from the coordinate system B with the scalars a1 and a2: v = a1u1 + a2u2.

Remark 35 The coordinate system B need not be the cartesian one, which consists of two perpendicular axes intersecting in the origin. Any two non-parallel vectors can be used as a coordinate system, together with a common point as origin.

(4) n = 3. Consider now the set of all vectors in space. A reference point O is chosen along with three non-coplanar vectors u1, u2 and u3. Denote B := (O, u1, u2, u3). This system allows one to coordinatize the space as follows: every vector v in space is a linear combination of the three vectors in the chosen system: v = a1u1 + a2u2 + a3u3. Similarly to the planar case, the scalars a1, a2, a3 are uniquely determined by the vector v and the system B. The R3 vector

[v]B := (a1, a2, a3)ᵀ

is called the coordinate vector of v with respect to the system B. With the help of this construction, every vector from space is associated by a one-to-one correspondence with an element from R3, allowing us to identify the set of all vectors in space with the set R3.

4.3 Vector Spaces Over Arbitrary Fields

Let K be a field, called the scalar field.

Definition 4.3.1. A vector space over K (for short, a K-vector space) is defined to be a set V with

• a fixed element 0 ∈ V,
• a binary operation + : V × V → V called addition,
• a binary operation · : K × V → V called scalar multiplication,

satisfying

(1) ∀u, v, w ∈ V: u + (v + w) = (u + v) + w.
(2) ∀v ∈ V: v + 0 = 0 + v = v.
(3) ∀v ∈ V: v + (−1)v = 0.
(4) ∀v, w ∈ V: v + w = w + v.
(5) ∀v ∈ V, ∀λ, μ ∈ K: λ(μv) = (λμ)v.
(6) ∀v ∈ V: 1v = v.
(7) ∀v, w ∈ V, ∀λ ∈ K: λ(v + w) = λv + λw.
(8) ∀v ∈ V, ∀λ, μ ∈ K: (λ + μ)v = λv + μv.

Remark 36 (V, +, 0) is a commutative group.

Remark 37 Let V be a K-vector space. Then
(1) ∀u, v, w ∈ V: u + w = v + w implies u = v.
(2) ∀v ∈ V: 0v = 0.
(3) ∀λ ∈ K: λ0 = 0.
(4) ∀v ∈ V, ∀λ ∈ K: λv = 0 implies λ = 0 or v = 0.

Remark 38 Even if the same symbol is used, it is important to distinguish between the scalar 0 ∈ K and the vector 0 ∈ V.

Example 4.3.1
(1) Rn is an R-vector space with the vector addition and scalar multiplication defined above.
(2) The set of all oriented segments in the plane is a vector space identified with R2.
(3) The set of all oriented segments in space is a vector space identified with R3.

(4) Let n ∈ N. We denote by Cn the set of elements of type

v := (α1, α2, . . . , αn)ᵀ,

where α1, α2, . . . , αn ∈ C. Vector addition and scalar multiplication are defined componentwise:

v + w = (α1 + β1, α2 + β2, . . . , αn + βn)ᵀ,  λv = (λα1, λα2, . . . , λαn)ᵀ.

The set Cn endowed with these operations is a C-vector space.

(5) Let m, n ∈ N. Consider Mm×n(C), the set of all matrices with m rows and n columns over the set of all complex numbers C. This forms a complex vector space together with the operations of matrix addition and complex scalar multiplication.

(6) Let X be a set. We denote F(X, K) := {f | f : X → K}. This set is endowed with a K-vector space structure as follows: define 0 ∈ F(X, K) as the map whose values are always 0 ∈ K. For f, g ∈ F(X, K) and λ ∈ K define

∀x ∈ X: (f + g)(x) := f(x) + g(x),
∀x ∈ X: (λf)(x) := λf(x).

4.4 Linear and Affine Subspaces

Let K be a field and V a K-vector space.

Definition 4.4.1. A subset U ⊆ V is called a linear subspace (or vector subspace) if (1) 0 ∈ U; (2) ∀u, v ∈ U: u + v ∈ U; (3) ∀λ ∈ K, ∀u ∈ U: λu ∈ U.

Remark 39 (1) Conditions 2 and 3 can be combined into one equivalent condition: ∀λ, μ ∈ K, ∀u, v ∈ U: λu + μv ∈ U. (2) By induction, if U is a linear subspace of V, then for every u1, u2, . . . , un ∈ U and every λ1, λ2, . . . , λn ∈ K one has λ1u1 + λ2u2 + . . . + λnun ∈ U.

Example 4.4.1.
(1) Every vector space V has two trivial subspaces, {0} and V. All other subspaces of V are called proper subspaces.
(2) The linear subspaces of R2 are {0}, the lines through 0, and R2.
(3) The linear subspaces of R3 are {0}, the lines through 0, the planes through 0, and R3.
(4) Let P(R) be the set of all polynomial mappings p : R → R with real coefficients and Pn(R) the set of polynomial maps of degree at most n with real coefficients. P(R) and Pn(R) are linear subspaces of the R-vector space F(R, R).

Remark 40 (1) Let U be a linear subspace of V. Then U is itself a K-vector space.

Proof. By definition, 0 ∈ U. The restrictions of vector addition and scalar multiplication to U define the corresponding operations on U. The vector space axioms are obviously fulfilled.

(2) If U is a linear subspace of V, we denote this by U ≤ V. Let S(V) be the set of all linear subspaces of V. The relation ≤ defined on S(V) by U1 ≤ U2 if and only if U1 is a linear subspace of U2, is an order relation.

Proof. If U is a subspace of V, then U is a subspace of itself, hence reflexivity is proven. Let now U1, U2, U3 be subspaces of V with U1 ≤ U2, U2 ≤ U3. It follows that U1 is a subset of U3. Since vector addition and scalar multiplication are closed in U1 and U3, it follows that U1 is a subspace of U3, hence the relation is transitive. The antisymmetry follows immediately.

Proposition 4.4.1. Let U1 and U2 be linear subspaces of V. Then U1 ∩ U2 is a linear subspace of V.

Proof. We check the subspace axioms: 1. Since U1, U2 ≤ V, we have 0 ∈ U1 and 0 ∈ U2, so 0 ∈ U1 ∩ U2. 2. Let u, v ∈ U1 ∩ U2. Then u, v ∈ U1 and u, v ∈ U2. It follows that u + v ∈ U1 and u + v ∈ U2, hence u + v ∈ U1 ∩ U2. 3. Let λ ∈ K and u ∈ U1 ∩ U2. Then u ∈ U1 and u ∈ U2. It follows that λu ∈ U1 and λu ∈ U2, hence λu ∈ U1 ∩ U2.

Corollary 4.4.2. Let Ui ∈ S(V), i ∈ I. Then ∩_{i∈I} Ui ∈ S(V).

Remark 41 (1) The set S(V) is closed under arbitrary intersections.

(2) If U1, U2 ≤ V, then U1 ∩ U2 = inf(U1, U2) in the ordered set (S(V), ≤), i.e., the infimum of two subspaces always exists and is given by their intersection.
(3) For an arbitrary index set I, if Ui ≤ V, i ∈ I, then ∩_{i∈I} Ui = inf(Ui)_{i∈I} in the ordered set (S(V), ≤), i.e., the infimum of an arbitrary family of subspaces always exists and is given by their intersection.
(4) (S(V), ≤) is a complete lattice, the infimum being described by arbitrary intersections of linear subspaces.
(5) In general, the union of two subspaces is no longer a subspace. Consider R2 and the subspaces U1 and U2 spanned by the vectors v1 := (1, 1)ᵀ and v2 := (1, −1)ᵀ, respectively, i.e., the lines through zero having v1 and v2 as support vectors. If U1 ∪ U2 were a subspace, then for all u, v ∈ U1 ∪ U2 we would have u + v ∈ U1 ∪ U2. But v1 + v2 = (2, 0)ᵀ ∉ U1 ∪ U2.
(6) The supremum in the complete lattice (S(V), ≤) is not the union of subspaces. Another construction is needed for this.

Definition 4.4.2. Let U1, U2, . . . , Un be linear subspaces of V. The sum of these subspaces is defined as

U1 + U2 + . . . + Un := {u1 + u2 + . . . + un | u1 ∈ U1, u2 ∈ U2, . . . , un ∈ Un}.

For an infinite family of subspaces (Ui)_{i∈I}, their sum is defined as the set of all finite sums of elements from the various Ui:

∑_{i∈I} Ui := {u_{i1} + u_{i2} + . . . + u_{ik} | u_{ij} ∈ U_{ij}, ij ∈ I, j = 1, . . . , k, k ∈ N}.

Proposition 4.4.3. The sum of the linear subspaces U1, U2, . . . , Un is a linear subspace and U1 + U2 + . . . + Un = sup(U1, U2, . . . , Un).

Proof. One has 0 ∈ Ui for every i = 1, . . . , n, hence 0 ∈ U1 + U2 + . . . + Un. Let v1, v2 ∈ U1 + U2 + . . . + Un. Then v1 = u1 + u2 + . . . + un with ui ∈ Ui, i = 1, . . . , n, and v2 = w1 + w2 + . . . + wn with wi ∈ Ui, i = 1, . . . , n. Then v1 + v2 = (u1 + w1) + (u2 + w2) + . . . + (un + wn). Since ui + wi ∈ Ui, i = 1, . . . , n, we have v1 + v2 ∈ U1 + U2 + . . . + Un. The third condition is verified in a similar way. We now have to prove that U1 + U2 + . . . + Un = sup(U1, U2, . . . , Un). On the one hand, Ui ⊆ U1 + U2 + . . . + Un, since one can choose all uj = 0 in the definition of the sum, except the element of Ui. Suppose W ≤ V and Ui ≤ W, i = 1, . . . , n. Then all elements of all subspaces Ui are in W; in particular, all sums u1 + u2 + . . . + un with u1 ∈ U1, u2 ∈

U2, . . . , un ∈ Un are in W. Hence U1 + U2 + . . . + Un ≤ W, and so U1 + U2 + . . . + Un = sup(U1, U2, . . . , Un).

Remark 42 Similarly, the supremum of an arbitrary family of subspaces (Ui)_{i∈I} is given by their sum.

Remark 43 (S(V), ≤) is a complete lattice, the infimum being given by subspace intersection and the supremum by their sum.

Example 4.4.2. Consider in R3 the planes U1 := x0y and U2 := x0z. Their intersection is 0x, the line through 0 along the x-axis. Their sum is the entire space R3. Consider the vector v := (1, 1, 1)ᵀ ∈ R3. This vector can be written as a sum of vectors from U1 and U2 as follows:

v = (1, 1, 0)ᵀ + (0, 0, 1)ᵀ = (0, 1, 0)ᵀ + (1, 0, 1)ᵀ,

where (1, 1, 0)ᵀ, (0, 1, 0)ᵀ ∈ U1 and (0, 0, 1)ᵀ, (1, 0, 1)ᵀ ∈ U2. Hence the decomposition of v as the sum of two vectors from U1 and U2 is not unique.

Definition 4.4.3. Let U1, U2, . . . , Un be linear subspaces of V. The vector space V is the direct sum of the Ui, i = 1, . . . , n, and we write

V = U1 ⊕ U2 ⊕ . . . ⊕ Un,

if every element v ∈ V has a unique decomposition

v = u1 + u2 + . . . + un with ui ∈ Ui, i = 1, . . . , n.

Proposition 4.4.4. Let U1 and U2 be linear subspaces of V. Then V = U1 ⊕ U2 if and only if the following hold:
(1) V = U1 + U2;
(2) U1 ∩ U2 = {0}.

Proof. (⇒) Suppose V = U1 ⊕ U2; then V = U1 + U2. Let v ∈ U1 ∩ U2. Then

v = v + 0 = 0 + v,

where in the first decomposition v ∈ U1 and 0 ∈ U2, and in the second 0 ∈ U1 and v ∈ U2. Hence v has two representations as a sum of vectors from U1 and U2. By uniqueness, it follows that v = 0, and so U1 ∩ U2 = {0}.

(⇐) Suppose now that (1) and (2) hold. By (1), every v ∈ V has a decomposition v = u1 + u2, u1 ∈ U1, u2 ∈ U2. For the uniqueness, suppose v = u'1 + u'2, with u'1 ∈ U1, u'2 ∈ U2. Subtracting these equalities, we have 0 = (u1 − u'1) + (u2 − u'2). It follows that u1 − u'1 = u'2 − u2 ∈ U1 ∩ U2. Since U1 ∩ U2 = {0}, we have u1 = u'1 and u2 = u'2, and the uniqueness of the decomposition is proven.

Definition 4.4.4. Let V be a K-vector space. A subset A ⊆ V is called an affine subspace of V if there is a linear subspace U ≤ V and a vector v ∈ V such that A = v + U := {v + u | u ∈ U}.

Example 4.4.3.

(1) The affine subspaces of R2 are the points, the lines and the entire space R2.
(2) The affine subspaces of R3 are the points, the lines, the planes and the entire space R3.
(3) Affine subspaces are obtained by translating a linear subspace with a given vector.

4.5 Linearly Independent Vectors. Generator Systems. Basis

Definition 4.5.1. Let V be a K-vector space and v1, v2, . . . , vn ∈ V. Every element of the type

v = λ1v1 + λ2v2 + . . . + λnvn = ∑_{i=1}^{n} λivi,

with λi ∈ K, i = 1, . . . , n, is called a linear combination of the vectors v1, v2, . . . , vn. We denote the set of all linear combinations of the vectors v1, v2, . . . , vn by ⟨v1, v2, . . . , vn⟩.

Proposition 4.5.1. Let V be a K-vector space and v1, v2, . . . , vn vectors from V. The set ⟨v1, v2, . . . , vn⟩ of all linear combinations of v1, v2, . . . , vn is a linear subspace of V.

Proof. Let us verify the linear subspace axioms:
(1) 0 = 0v1 + 0v2 + . . . + 0vn ∈ ⟨v1, v2, . . . , vn⟩.
(2) Let v, w ∈ ⟨v1, v2, . . . , vn⟩. Then v and w are linear combinations v = λ1v1 + λ2v2 + . . . + λnvn and w = μ1v1 + μ2v2 + . . . + μnvn. Summing up, we have v + w = (λ1 + μ1)v1 + (λ2 + μ2)v2 + . . . + (λn + μn)vn, hence v + w ∈ ⟨v1, v2, . . . , vn⟩.
(3) Let v ∈ ⟨v1, v2, . . . , vn⟩; then v = λ1v1 + λ2v2 + . . . + λnvn. Let λ ∈ K be arbitrarily chosen. Then λv = λλ1v1 + λλ2v2 + . . . + λλnvn, from which follows λv ∈ ⟨v1, v2, . . . , vn⟩.

Remark 44
(1) v1 = 1v1 + 0v2 + . . . + 0vn ∈ ⟨v1, v2, . . . , vn⟩. Similarly, v2, . . . , vn ∈ ⟨v1, v2, . . . , vn⟩.
(2) ⟨v1, v2, . . . , vn⟩ is the smallest linear subspace containing v1, v2, . . . , vn. If U ≤ V is another linear subspace containing v1, v2, . . . , vn, we deduce from the subspace properties that it also contains all linear combinations of these vectors, hence ⟨v1, v2, . . . , vn⟩ ⊆ U.
(3) The subspace ⟨v1, v2, . . . , vn⟩ is called the linear subspace spanned by the vectors v1, v2, . . . , vn. These vectors are called a generator system of the spanned subspace.

(4) If w1, w2, . . . , wk ∈ ⟨v1, v2, . . . , vn⟩, then ⟨w1, w2, . . . , wk⟩ ⊆ ⟨v1, v2, . . . , vn⟩.

Example 4.5.1
(1) Rn is spanned by the standard unit vectors

e1 = (1, 0, . . . , 0)ᵀ, e2 = (0, 1, 0, . . . , 0)ᵀ, . . . , en = (0, . . . , 0, 1)ᵀ.

(2) R2 is spanned by e1, e2, and also by v1 = (1, 1)ᵀ and v2 = (1, −1)ᵀ. This proves that a subspace can be spanned by different generator systems.
(3) The vector space Pn(R) of polynomial functions of degree at most n is generated by the maps pj(x) := x^j for every x ∈ R, j = 0, 1, . . . , n.

Proposition 4.5.2. Let G ⊆ V and v ∈ V. Define G' := G ∪ {v}. Then ⟨G⟩ = ⟨G'⟩ if and only if v ∈ ⟨G⟩.

Proof. (⇒) ⟨G'⟩ is the set of all linear combinations of vectors from G and v. Since v ∈ ⟨G'⟩ = ⟨G⟩, v is a linear combination of vectors from G, hence v is an element of ⟨G⟩. (⇐) Suppose v ∈ ⟨G⟩. Then v is a linear combination of vectors from G. It follows that G' ⊆ ⟨G⟩, hence ⟨G'⟩ ⊆ ⟨G⟩. The reverse inclusion holds trivially, so ⟨G⟩ = ⟨G'⟩.

Definition 4.5.2. A vector space V is called finitely generated if it has a finite generator system.

Remark 45
(1) Let V be a K-vector space and v ∈ V. Then ⟨v⟩ = Kv = {λv | λ ∈ K}.
(2) If v1, v2, . . . , vn ∈ V, then ⟨v1, v2, . . . , vn⟩ = Kv1 + Kv2 + . . . + Kvn.
(3) If V is finitely generated and K = R, then V is the sum of the lines Kvi, i = 1, . . . , n, defined by the generator vectors v1, . . . , vn.
(4) ⟨∅⟩ = {0}.

Definition 4.5.3. The vectors v1, v2, . . . , vn ∈ V are called linearly independent if none of the vectors vj can be written as a linear combination of the other vectors vi, i ≠ j. Otherwise, these vectors are called linearly dependent.

Proposition 4.5.3. The following are equivalent:
(1) v1, v2, . . . , vn are linearly independent.
(2) λ1v1 + λ2v2 + . . . + λnvn = 0 implies λ1 = λ2 = . . . = λn = 0.
(3) λ1v1 + λ2v2 + . . . + λnvn = λ'1v1 + λ'2v2 + . . . + λ'nvn implies λi = λ'i for every i = 1, . . . , n.

Proof. (1 ⇒ 2) Suppose there exists i ∈ {1, . . . , n} such that λi ≠ 0 and λ1v1 + λ2v2 + . . . + λnvn = 0. Then

vi = −λ1λi^-1 v1 − λ2λi^-1 v2 − . . . − λi−1λi^-1 vi−1 − λi+1λi^-1 vi+1 − . . . − λnλi^-1 vn,

i.e., vi is a linear combination of the vj, j ≠ i, contradicting (1). (2 ⇒ 3) If λ1v1 + λ2v2 + . . . + λnvn = λ'1v1 + λ'2v2 + . . . + λ'nvn, then (λ1 − λ'1)v1 + (λ2 − λ'2)v2 + . . . + (λn − λ'n)vn = 0. It follows that λi = λ'i for every i = 1, . . . , n. (3 ⇒ 1) Suppose v1, v2, . . . , vn are linearly dependent. Then there exists i ∈ {1, . . . , n} such that vi is a linear combination of the vj, j ≠ i: vi = λ1v1 + λ2v2 + . . . + λi−1vi−1 + λi+1vi+1 + . . . + λnvn, from which follows

λ1v1 + λ2v2 + . . . + λi−1vi−1 − vi + λi+1vi+1 + . . . + λnvn = 0 = 0v1 + . . . + 0vn.

From the hypothesis follows λj = 0 for j ≠ i, and −1 = 0, which is a contradiction.
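Condition (2) of the proposition gives a practical test: row-reduce the vectors and count pivots. A minimal Python sketch over Q, using exact Fraction arithmetic (illustrative only):

```python
# Sketch: testing linear independence via Gaussian elimination over Q.
from fractions import Fraction

def rank(vectors):
    """Row-reduce the list of vectors and count the non-zero rows (pivots)."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r, col, n_cols = 0, 0, len(rows[0])
    while r < len(rows) and col < n_cols:
        # find a pivot in the current column
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            col += 1
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r, col = r + 1, col + 1
    return r

def linearly_independent(vectors):
    return rank(vectors) == len(vectors)

assert linearly_independent([[1, 1, 1], [1, 1, 0], [1, 0, 0]])
assert not linearly_independent([[1, 2, 3], [2, 4, 6]])          # parallel vectors
assert not linearly_independent([[1, 0, 0], [0, 1, 0],
                                 [0, 0, 1], [1, 2, 3]])          # 4 vectors in R^3
```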

Remark 46
(1) Any permutation of the elements of a linearly independent set is linearly independent.
(2) For every v ∈ V with v ≠ 0, the set {v} is linearly independent.
(3) Two vectors v1, v2 are linearly dependent if and only if v1 = 0 or v2 is a scalar multiple of v1.
(4) If 0 ∈ {v1, v2, . . . , vn}, the set {v1, v2, . . . , vn} is linearly dependent.
(5) If v1, v2, . . . , vn are linearly independent, then they are pairwise distinct.

Example 4.5.2.
(1) Any two non-parallel vectors in the plane are linearly independent. The vectors e1 = (1, 0)ᵀ and e2 = (0, 1)ᵀ are linearly independent, as are v1 = (2, 1)ᵀ and v2 = (1, 2)ᵀ.
(2) Any three vectors in R2 are linearly dependent.
(3) Any three vectors in R3 which do not lie in the same plane are linearly independent. In particular, the vectors v1 = (1, 1, 1)ᵀ, v2 = (1, 1, 0)ᵀ, v3 = (1, 0, 0)ᵀ are linearly independent.
(4) Any collection of 4 vectors in R3 is linearly dependent.
(5) The vectors e1, e2, . . . , en are linearly independent in Kn.
(6) The polynomial functions p0, p1, . . . , pn are linearly independent in Pn(R).
(7) The non-zero rows of a matrix in triangular (echelon) form are linearly independent.

Remark 47 The notion of linear independence can be defined for infinite sets of vectors too.

Definition 4.5.4. An arbitrary family of vectors (vi)_{i∈I} in V is linearly independent if none of its vectors can be written as a linear combination of a finite subset of the family (vi)_{i∈I}.

Definition 4.5.5. A basis of a vector space V is a family of vectors (vi)_{i∈I} in V satisfying:
(1) ⟨{vi | i ∈ I}⟩ = V,
(2) the family (vi)_{i∈I} is linearly independent.

Remark 48 The vectors v1, v2, . . . , vn form a basis of V if and only if they are linearly independent and every vector in V is a linear combination of v1, v2, . . . , vn.
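The criterion of Remark 48 comes with a computation: every vector has unique coordinates in a basis. A minimal Python sketch (illustrative, 2×2 case only, solved by Cramer's rule) for the basis v1 = (1, 1)ᵀ, v2 = (1, −1)ᵀ of R2:

```python
# Sketch: coordinates of a vector with respect to a basis {v1, v2} of R^2.

def coordinates(v1, v2, v):
    """Solve a*v1 + b*v2 = v for the coordinate vector (a, b) by Cramer's rule."""
    det = v1[0] * v2[1] - v1[1] * v2[0]
    assert det != 0, "v1, v2 must be linearly independent"
    a = (v[0] * v2[1] - v[1] * v2[0]) / det
    b = (v1[0] * v[1] - v1[1] * v[0]) / det
    return (a, b)

v1, v2 = (1, 1), (1, -1)
a, b = coordinates(v1, v2, (3, 1))

# Reconstruct v from its coordinates: a*v1 + b*v2 = (3, 1).
assert (a * v1[0] + b * v2[0], a * v1[1] + b * v2[1]) == (3, 1)
print((a, b))  # (2.0, 1.0)
```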

Example 4.5.3
(1) e1 = (1, 0)ᵀ and e2 = (0, 1)ᵀ form a basis of R2, as do the vectors v1 = (1, 1)ᵀ and v2 = (1, −1)ᵀ.
(2) e1, e2, . . . , en form a basis of Kn, called the canonical basis.
(3) p0, p1, . . . , pn form a basis for the real vector space Pn(R).

4.5.1 Every vector space has a basis

Let V be a K-vector space. If V has a basis B := (vi)_{i∈I}, then every vector w ∈ V has a unique decomposition as a linear combination of the vectors of B. We ask whether every vector space has a basis. The following result uses the axiom of choice.

Theorem 4.5.4. Let V be a K-vector space and X ⊆ V. If V = ⟨X⟩ and the subset X1 ⊆ X is linearly independent, then there exists a basis B of V such that X1 ⊆ B ⊆ X.

Proof. In order to find this basis, we need a maximal linearly independent subset of X containing X1. For this, consider the set C := {X' | X1 ⊆ X' ⊆ X, X' is linearly independent}. The set C is not empty, since X1 ∈ C. We want to prove that this set has a maximal element. Consider an arbitrary non-empty chain L ⊆ C and denote X0 := ∪{X' | X' ∈ L}. We prove that X0 is linearly independent. For this, choose a finite subset v1, v2, . . . , vn ∈ X0. From the definition of X0 we obtain that for every i ∈ {1, . . . , n} there exists a subset X'i in L such that vi ∈ X'i, i = 1, . . . , n. Since L is a chain, there exists i0 ∈ {1, . . . , n} with X'i ⊆ X'i0, i = 1, . . . , n, implying vi ∈ X'i0, i = 1, . . . , n. The subset X'i0 ∈ C is linearly independent, hence any finite subset of it is linearly independent too, in particular {v1, v2, . . . , vn}. The vectors v1, v2, . . . , vn being arbitrarily chosen, we conclude the linear independence of X0, and X1 ⊆ X0 ⊆ X, i.e., X0 ∈ C and X0 is an upper bound for L. From Zorn's lemma follows the existence of a maximal element B of C. Since B ∈ C, it follows that B is linearly independent. We now prove that B is a generator system for V. Suppose V ≠ ⟨B⟩. Then X ⊄ ⟨B⟩, since V = ⟨X⟩. Let v ∈ X\⟨B⟩. We show that B ∪ {v} is linearly independent, in contradiction with the maximality of B. For this, consider a linear combination

λ1v1 + λ2v2 + . . . + λnvn + λv = 0, λi, λ ∈ K, vi ∈ B, i = 1, . . . , n.

If λ ≠ 0, then v can be written as a linear combination of vectors v1, v2, . . . , vn from B,

v = −λ1λ^-1 v1 − λ2λ^-1 v2 − . . . − λnλ^-1 vn,

hence v ∈ ⟨B⟩, which is a contradiction. Therefore λ = 0, and the linear independence of B forces λ1 = . . . = λn = 0. Thus B ∪ {v} is linearly independent, contradicting the maximality of B. Hence V = ⟨B⟩, and B is a basis with X1 ⊆ B ⊆ X.

Corollary 4.5.5.

(1) Every vector space has a basis. (2) Any linearly independent subset of a vector space can be completed to a basis. (3) The bases of a vector space V are exactly the maximal linearly independent subsets of V. (4) From any generator system of a vector space V, we can extract a basis of V. (5) The bases of a vector space V are exactly the minimal generator systems of V. (6) If X is a generator system of V and B1 is a linearly independent subset of V, then B1 can be completed with vectors from X to a basis B of V.

Proposition 4.5.6. Let V be a finitely generated vector space and v1, v2, . . . , vm a generator system. Every subset containing m+1 vectors from V is linearly dependent.

Proof. We are going to prove this assertion by induction. Consider the assertion P(m): If the vector space V is spanned by m vectors, V = ⟨v1, v2, . . . , vm⟩, then any subset of m+1 vectors {w1, w2, . . . , wm, wm+1} of V is linearly dependent. Verification step for P(1): Let {w1, w2} ⊆ ⟨{v1}⟩. There are scalars λ1, λ2 ∈ K such that w1 = λ1v1, w2 = λ2v1. If λ1 ≠ 0, it follows that v1 = λ1⁻¹w1 and w2 = λ2λ1⁻¹w1; if λ1 = 0, then w1 = 0. In both cases the vectors w1 and w2 are linearly dependent.

Induction step P(m – 1) ⇒ P(m): Suppose that any m vectors from a space generated by m – 1 vectors are linearly dependent, and let us prove the assertion for m. Let {w1, w2, . . . , wm, wm+1} ⊆ ⟨{v1, v2, . . . , vm}⟩. Then

w1 = λ11v1 + λ12v2 + . . . + λ1mvm
w2 = λ21v1 + λ22v2 + . . . + λ2mvm
. . .
wm+1 = λm+1,1v1 + λm+1,2v2 + . . . + λm+1,mvm.

We distinguish the following cases. Case 1: λ11 = λ21 = . . . = λm+1,1 = 0. Then the vectors w1, w2, . . . , wm+1 ∈ ⟨{v2, . . . , vm}⟩, a space generated by m – 1 vectors, and from the induction hypothesis follows the linear dependence of the m vectors w2, . . . , wm+1. The vectors {w1, w2, . . . , wm+1} are linearly dependent too. Case 2: At least one of the coefficients λi1 ≠ 0, i = 1, . . . , m+1. Suppose λ11 ≠ 0 and set w'i := wi – λi1λ11⁻¹w1 for i = 2, . . . , m+1. Then

w'2 = (λ22 – λ21λ11⁻¹λ12)v2 + . . . + (λ2m – λ21λ11⁻¹λ1m)vm
. . .
w'm+1 = (λm+1,2 – λm+1,1λ11⁻¹λ12)v2 + . . . + (λm+1,m – λm+1,1λ11⁻¹λ1m)vm.

Then w'2, . . . , w'm+1 ∈ ⟨{v2, . . . , vm}⟩. From the induction hypothesis follows the linear dependence of w'2, . . . , w'm+1: there exist scalars μ2, . . . , μm+1 ∈ K, not all of them zero, such that μ2w'2 + . . . + μm+1w'm+1 = 0. Since w'i = wi – λi1λ11⁻¹w1, we have

μ2(w2 – λ21λ11⁻¹w1) + . . . + μm+1(wm+1 – λm+1,1λ11⁻¹w1) = 0,

that is,

–(Σ_{i=2}^{m+1} μiλi1λ11⁻¹)w1 + μ2w2 + . . . + μm+1wm+1 = 0.

Since not all scalars μ2, . . . , μm+1 are zero, the vectors w1, w2, . . . , wm+1 are linearly dependent.

Corollary 4.5.7. If a vector space has a generator system v1, . . . , vn, any set of linearly independent vectors has at most n elements.

Theorem 4.5.8. Let v1, v2, . . . , vm ∈ V be linearly independent. If w1, w2, . . . , wr ∈ V are chosen such that V = ⟨{v1, v2, . . . , vm, w1, w2, . . . , wr}⟩, then we can find indices 1 ≤ i1 < i2 < . . . < is ≤ r such that v1, v2, . . . , vm, wi1, wi2, . . . , wis

is a basis of V.

Proof. Consider the set C := {{v1, . . . , vm, wi1, . . . , wis} | V = ⟨{v1, . . . , vm, wi1, . . . , wis}⟩, 1 ≤ i1 < i2 < . . . < is ≤ r}. The vectors v1, v2, . . . , vm, w1, w2, . . . , wr satisfy this condition, hence C ≠ ∅. We order the set C by inclusion. Then there exists in C a minimal generator set. It can be proved that this minimal generator set is a basis of V. It is sufficient to check the linear independence of the minimal system v1, . . . , vm, wi1, . . . , wis. Suppose one can find scalars λ1, . . . , λm, μ1, . . . , μs ∈ K, not all of them zero, with λ1v1 + . . . + λmvm + μ1wi1 + . . . + μswis = 0. Then there exists an index j ∈ {1, . . . , s} with μj ≠ 0: if μj = 0 for all j, then λ1v1 + . . . + λmvm = 0 and from the linear independence of v1, . . . , vm one obtains λi = 0, i ∈ {1, . . . , m}, which is a contradiction. If μj ≠ 0, then wij = –λ1μj⁻¹v1 – . . . – λmμj⁻¹vm – μ1μj⁻¹wi1 – . . . – μsμj⁻¹wis (the term wij omitted on the right) is a linear combination of v1, . . . , vm, wi1, . . . , wij–1, wij+1, . . . , wis. We have obtained a generator system V = ⟨{v1, . . . , vm, wi1, . . . , wij–1, wij+1, . . . , wis}⟩ with fewer elements than the minimal generator system we have previously considered. Contradiction! The following results are consequences of this theorem:

Theorem 4.5.9 (Basis choice theorem). If the vector space V is generated by the vectors w1, . . . , wr, then one can find indices 1 ≤ i1 < i2 < . . . < in ≤ r such that the vectors wi1, . . . , win are a basis of V.

Proof. Apply the above theorem for m = 0.

Theorem 4.5.10 [Exchange theorem (Steinitz)]. Let V be a vector space, v1, . . . , vm linearly independent vectors from V and w1, w2, . . . , wn a generator system of V. Then m ≤ n and, after renumbering the wi if necessary, V = ⟨v1, . . . , vm, wm+1, . . . , wn⟩.

Proof. We proceed by induction on m. If m = 0, then 0 ≤ n and the conclusion is obviously true. Suppose the assertion is true for m – 1 linearly independent vectors and let us prove it for m vectors. Let v1, v2, . . . , vm ∈ V be m linearly independent vectors. Then v1, . . . , vm–1 are linearly independent too, hence m – 1 ≤ n and V = ⟨v1, v2, . . . , vm–1, wm, . . . , wn⟩. If m – 1 = n, then V = ⟨v1, . . . , vm–1⟩ and vm ∈ ⟨v1, . . . , vm–1⟩, contradicting the linear independence of the vectors v1, . . . , vm; hence m – 1 < n and m ≤ n. On the other hand, from V = ⟨v1, . . . , vm–1, wm, . . . , wn⟩ it follows that vm can be expressed as a linear combination of these generators: vm = λ1v1 + . . . + λm–1vm–1 + λmwm + . . . + λnwn. Since the vectors v1, . . . , vm are linearly independent, not all the scalars λm, . . . , λn are zero. Without restricting generality, one can suppose that λm ≠ 0. Hence

wm = –λm⁻¹λ1v1 – . . . – λm⁻¹λm–1vm–1 + λm⁻¹vm – λm⁻¹λm+1wm+1 – . . . – λm⁻¹λnwn.

The vector wm can be written as a linear combination of the other vectors, hence V = ⟨v1, . . . , vm, wm+1, . . . , wn⟩.

Corollary 4.5.11. All bases of a finitely generated vector space are finite and have the same cardinality.

Theorem 4.5.12. Let V be a K-vector space and X an infinite basis. Then, for every generator system Y of V, we have |X| ≤ |Y|.

Proof. Let w ∈ Y and Xw the set of those elements of X which occur in the decomposition of w as a linear combination of vectors from X. Then X = ∪w∈Y Xw. Obviously, ∪w∈Y Xw ⊆ X. Suppose the inclusion is strict, i.e., there exists v ∈ X \ ∪w∈Y Xw. Since V is generated by Y, v has a decomposition v = λ1w1 + . . . + λnwn, λi ∈ K, wi ∈ Y, i ∈ {1, . . . , n}. Every wi is a linear combination of elements of Xwi, i ∈ {1, . . . , n}, hence v is a linear combination of elements of ∪w∈Y Xw, contradicting the linear independence of X. Contradiction! The sets Xw are finite for all vectors w ∈ Y. The set X being infinite, it follows that Y is infinite too and |X| ≤ Σw∈Y |Xw| ≤ ℵ0 · |Y| = |Y|.

Corollary 4.5.13. (1) If a vector space V has an infinite basis X, then every basis of V is infinite, having the same cardinality as X. (2) If the vector space V has an infinite basis, then V is not finitely generated.

Definition 4.5.6. Let V be a K-vector space. The dimension of V is the cardinality of a basis of V. We denote the dimension of V by dim V. The space V is called finite dimensional if it has a finite basis. The space V is called infinite dimensional if it has an infinite basis.

Example 4.5.4. (1) The dimension of R2 is 2. The dimension of R3 is 3. (2) If K is a field, dim Kn = n, since

e1 = (1, 0, . . . , 0), e2 = (0, 1, . . . , 0), . . . , en = (0, 0, . . . , 1)

is a basis, called the canonical basis of Kn. (3) dim Pn(R) = n+1. (4) The vector spaces P(R) and F(X, R) are infinite dimensional. As a direct consequence of the previous results we get:

Remark 49. (1) Let V be an n-dimensional vector space. Then any linearly independent system has at most n vectors. (2) Let V be a finitely generated vector space. Then V is finite dimensional and its dimension is smaller than or equal to the cardinality of any of its generator systems. (3) If V is a finite dimensional vector space, then any minimal generator system is a basis. (4) In a finite dimensional vector space, any maximal linearly independent set of vectors is a basis. (5) dim V = n if and only if there exist in V exactly n linearly independent vectors and any n + 1 vectors from V are linearly dependent. (6) If U ≤ V is a linear subspace of V, then dim U ≤ dim V. If V is finite dimensional, then U = V if and only if dim U = dim V.

Proposition 4.5.14. Let V be a finite dimensional vector space with dim V = n, and let v1, . . . , vn ∈ V be a family of vectors. The following are equivalent: (1) v1, . . . , vn are linearly independent. (2) v1, . . . , vn is a generator system of V. (3) v1, . . . , vn is a basis of V.

Theorem 4.5.15 (Dimension theorem). Let U1 and U2 be linear subspaces of a vector space V. Then dim(U1 + U2) + dim(U1 ∩ U2) = dim U1 + dim U2.

Proof. If dim U1 = ∞ or dim U2 = ∞, then dim(U1 + U2) = ∞ and the equality holds. Suppose U1 and U2 are finite dimensional and dim U1 = m1, dim U2 = m2. The intersection of U1 and U2 is a subspace and U1 ∩ U2 ≤ U1, U1 ∩ U2 ≤ U2. We denote dim(U1 ∩ U2) = m and let v1, . . . , vm be a basis of U1 ∩ U2. From the Steinitz theorem, we deduce the existence of the vectors um+1, . . . , um1 and wm+1, . . . , wm2 such that v1, . . . , vm, um+1, . . . , um1 is a basis of U1 and v1, . . . , vm, wm+1, . . . , wm2 is a basis of U2. Then v1, . . . , vm, um+1, . . . , um1, wm+1, . . . , wm2 is a basis of U1 + U2. Hence dim(U1 + U2) = m + (m1 – m) + (m2 – m) = m1 + m2 – m = dim U1 + dim U2 – dim(U1 ∩ U2).

4.5.2 Algorithm for computing the basis of a generated subspace

Let V be a K-vector space and v1, . . . , vm ∈ V. Let U denote the subspace generated by these vectors, U = ⟨v1, . . . , vm⟩. We search for a basis of U and an efficient algorithm to perform this task.

Proposition 4.5.16. Let v1, . . . , vn and w1, . . . , wk be two families of vectors of a K-vector space V. If w1, . . . , wk ∈ ⟨v1, . . . , vn⟩, then ⟨w1, . . . , wk⟩ ⊆ ⟨v1, . . . , vn⟩.

Corollary 4.5.17. If w1, . . . , wk ∈ ⟨v1, . . . , vn⟩ and v1, . . . , vn ∈ ⟨w1, . . . , wk⟩, then ⟨v1, . . . , vn⟩ = ⟨w1, . . . , wk⟩.

Remark 50. If the vectors w1, . . . , wk are linear combinations of v1, . . . , vn and v1, . . . , vn can be written as linear combinations of w1, . . . , wk, the two families generate the same subspace. We deduce that the following operations on a vector system v1, . . . , vn do not modify the generated subspace:

(1) Addition of a scalar multiple of a vector vj to any other vector vk, k ≠ j, of this family: ⟨v1, . . . , vk, . . . , vj, . . . , vn⟩ = ⟨v1, . . . , vk + λvj, . . . , vj, . . . , vn⟩.
(2) Permuting the order of the vectors vj and vk: ⟨v1, . . . , vk, . . . , vj, . . . , vn⟩ = ⟨v1, . . . , vj, . . . , vk, . . . , vn⟩.
(3) Multiplying a vector vj by a scalar λ ≠ 0: ⟨v1, . . . , vj, . . . , vn⟩ = ⟨v1, . . . , λvj, . . . , vn⟩.

Let now K = R or K = C, n ∈ N, and consider the K-vector space Kn. The following algorithm computes a basis for the subspace generated by v1, . . . , vm ∈ Kn, denoted by U = ⟨v1, . . . , vm⟩. The algorithm is grounded on a successive filtering of redundant elements from the list v1, . . . , vm until we obtain an equivalent generator system, i.e., a system which generates the same subspace, consisting only of linearly independent vectors.

ALGORITHM:
Step 1: Build the matrix A ∈ Mm×n(K) whose rows are the vectors vi, i ∈ {1, . . . , m}.

Step 2: Apply the Gauss-Jordan algorithm to bring A into triangular (row echelon) form by elementary row transformations. Denote by B the obtained matrix.
Step 3: Ignore all zero rows in B and denote by w1, . . . , wk the non-zero rows of B. These vectors are linearly independent and generate the same subspace as the initial vectors.
Step 4: The basis consists of the vectors w1, . . . , wk.

Example 4.5.5. In R4 consider the vectors

v1 = (1, 2, 0, 1), v2 = (2, 3, 1, −1), v3 = (−1, 1, 1, −1), v4 = (2, 6, 2, −1), v5 = (0, 2, 2, −3).

We apply the algorithm to compute a basis for the subspace ⟨v1, v2, v3, v4, v5⟩. The matrix whose rows are v1, . . . , v5 is

 1  2  0  1    (v1)
 2  3  1 −1    (v2)
−1  1  1 −1    (v3)
 2  6  2 −1    (v4)
 0  2  2 −3    (v5)

Eliminating the first column (R2 → R2 − 2R1, R3 → R3 + R1, R4 → R4 − 2R1) gives

 1  2  0  1    (v1)
 0 −1  1 −3    (−2v1 + v2)
 0  3  1  0    (v1 + v3)
 0  2  2 −3    (−2v1 + v4)
 0  2  2 −3    (v5)

Eliminating the second column (R3 → R3 + 3R2, R4 → R4 + 2R2, R5 → R5 + 2R2) gives

 1  2  0  1    (v1)
 0 −1  1 −3    (−2v1 + v2)
 0  0  4 −9    (−5v1 + 3v2 + v3)
 0  0  4 −9    (−6v1 + 2v2 + v4)
 0  0  4 −9    (−4v1 + 2v2 + v5)

and finally, subtracting the third row from the fourth and fifth rows,

 1  2  0  1    (v1)
 0 −1  1 −3    (−2v1 + v2)
 0  0  4 −9    (−5v1 + 3v2 + v3)
 0  0  0  0    (−v1 − v2 − v3 + v4)
 0  0  0  0    (v1 − v2 − v3 + v5)

We obtain the basis

w1 = (1, 2, 0, 1), w2 = (0, −1, 1, −3), w3 = (0, 0, 4, −9).

Moreover, one can express these basis vectors as linear combinations of the original generator vectors:

w1 = v1, w2 = –2v1 + v2, w3 = –5v1 + 3v2 + v3.

The last two (zero) rows indicate the decomposition of the vectors v4 and v5 as linear combinations of the vectors v1, v2, v3: v4 = v1 + v2 + v3, v5 = –v1 + v2 + v3. We have obtained two bases for the subspace spanned by v1, v2, v3, v4, v5, namely B1: w1, w2, w3 and B2: v1, v2, v3.

Internet of Things (IOT) Introduction

There is a need for us to come up with mechanisms by which to sense changes in the factors surrounding us and then take an action based on that. With the IoT, this is possible. In this case, we have a network made up of physical devices, sensors, and other devices which help us to accomplish our tasks. When these devices are connected and programmed, then we are capable of taking data from the environment, transmitting it, and then making a decision based on that. This shows how interesting it is for one to learn IoT programming. This book discusses this in detail. Enjoy reading!

What is the IOT?

IoT stands for the “Internet of Things” and it is just a network made up of physical devices which have been embedded with software, sensors, and electronics, allowing the devices to exchange data among themselves. With these, it becomes easy for us to integrate computer-based systems with the physical systems of the world.

This technology is powered by leading technologies such as Big Data and Hadoop, and it is expected to be the next great thing to impact our lives in a number of ways. Although the IoT is a new technology, it is believed that it will bring a huge change in the history of computing. Sensors built into automobiles, implants for monitoring the heart, biochip transponders, and smart thermostat systems are examples of IoT devices. It is possible for such devices to be tailor-made so as to meet the needs of a business.

The expectation is that IoT devices will be in a position to communicate with humans just as real-world devices do. IoT devices are also expected to have sensors, which should capture data such as pulse rate and body temperature, and they should further transmit such data. The devices should be capable of making decisions and exercising control, and it is believed that controllers will be used for switching the devices. The devices should also have the capability of storing data.

IOT Programming Connected Devices

Before beginning to do this, we should first prepare our environment. We want to demonstrate this using Arduino and ArdOS.

Arduino is a hardware platform designed for the purpose of prototyping and hobby projects, but one can still use it for designing more complex hardware.

Begin by downloading the latest version of the Arduino IDE from its official website. For Windows users, you just have to download an installer, which contains an FTDI USB driver and the IDE itself. Make sure that you have installed the USB driver, which is responsible for enabling communication between the IDE and the Arduino device.

After installation of the software, plug the Arduino’s USB cable into the laptop. You will see a pop-up saying “installing driver.” After this completes, open the Device Manager: right-click the Computer icon in the Start menu, then choose Properties->Device Manager.

We can then configure the IDE so that we can start to program.

Launch the IDE. From “Tools->Boards,” choose the board which you need to connect to.

Once the board has been selected, set the right serial port.


You will then have your environment ready for programming. Before moving further, let us explore the basics of an Arduino C program. This takes the following basic structure:

void setup() { // add the setup code here, to be run once:

}

void loop() { // add the main code here, to be run repeatedly: }

The “setup()” function will be run only once, but the “loop()” will be run repeatedly. The main logic has to be implemented inside the loop.

Our First Program in Arduino

Use your Desktop shortcut to open Arduino, or do it from the Program files. You will observe a default sketch in the IDE. You can then write the following program:

void setup() { pinMode(13,OUTPUT); // add the setup code here, to be run once:

}

void loop() {

// add the main code here, to be run repeatedly: digitalWrite(13,HIGH);

delay(1000); digitalWrite(13,LOW); delay(1000); }

We have just implemented a simple blinking program. Click on the upload button so that the sketch can be uploaded to the board.

Getting Input from a Serial Port

First, we should set up serial communication. This is done with the Serial.begin command. The method “Serial.available()” returns a non-zero value when data sent by the other device is waiting in the port. We should begin by checking for serial data and, if any is found, print it. This is shown below:

void setup() { pinMode(13,OUTPUT);

// add the setup code here, to be run once: Serial.begin(19200);

}

void loop() { if(Serial.available()) {

int n=Serial.read(); Serial.println(n);

}

}

After uploading the sketch, the result can be checked by opening the Serial Monitor Window by hitting “Ctrl+Shift+M,” or from the Tools menu option.

Once the window has been opened, you will see an input window at the top. Just type 1 and then hit enter, and you will get a 49. If you type in 0, you will get 48. This is an indication that our Serial monitor is sending the ASCII codes of the keys typed. We now need to convert each code to a normal number, so we subtract 48 and then implement the on and off logic.

The program should be as shown below: void setup() { pinMode(13,OUTPUT);

// add the setup code here, to be run once: Serial.begin(19200);

Serial.println("Type 1 to Turn LED on and 0 to Turn it OFF");

}

void loop() {

if(Serial.available()) {

int n=Serial.read(); // Serial.println(n);

n=n-48; if(n==0)

{

Serial.println("LED is OFF"); digitalWrite(13,LOW);

}

else if(n==1) {

Serial.println("LED is ON"); digitalWrite(13,HIGH);

} else { Serial.println("Only 0 and 1 is accepted as input"); } } }

Just upload the code and open the Serial monitor, and you will observe that the LED will be turned on when you type 1 and off when you type 0.

IOT Digital Switches

A digital switch logically involves two points A and B which are connected when the switch is closed and disconnected after the switch has been opened.

The following code shows how we can implement a simple digital switch with Arduino:

void setup() { pinMode(13,OUTPUT);

pinMode(8,INPUT);

// add the setup code here, to be run once: Serial.begin(19200);

Serial.println("Type 1 to Turn LED on and 0 to Turn it OFF"); }

void loop() { int n=0;

n=digitalRead(8);

if(n==0) {

digitalWrite(13,LOW); } else if(n==1) {

digitalWrite(13,HIGH); } delay(200);// for avoiding over polling by continuously reading port data }

The Sensors

A sensor is used for converting the physical parameters such as blood pressure, temperature, speed, humidity, etc., into a signal which is measurable electrically.

Several physical activities can be measured by the use of different types of sensors. Consider the following code:

void setup() {

pinMode(13,OUTPUT); pinMode(12,OUTPUT); digitalWrite(12,HIGH); pinMode(8,INPUT);

// add the setup code here, to be run once: Serial.begin(19200);

Serial.println("Type 1 to Turn LED on and 0 to Turn it OFF"); }

void loop() {int n=0;

n=analogRead(5); Serial.println(n); delay(500); }

Arduino’s analog pins are connected to a 10-bit ADC. The body of a human being carries a potential difference relative to ground, and Arduino is capable of detecting this when we touch an analog pin with our body. This voltage usually floats between low and high values. A voltage is always measured between a particular point and a reference ground; the body voltage acquired by the pin has the Earth as its ground, which is not common with the ground of the microcontroller. This means that a noticeable, fluctuating potential difference appears at the pin.

Before running the above program, a wire was attached to analog pin 5. The program then reads the analog voltage at pin 5.

The output at the Serial monitor will be a stream of fluctuating readings.

We now need to minimize the effect of the variations. This can be done by taking the sensor value as the average of our 10 readings. This is shown in the code given below:

void setup() { pinMode(13,OUTPUT); pinMode(12,OUTPUT); digitalWrite(12,HIGH); pinMode(8,INPUT);

// add the setup code here, to be run once: Serial.begin(19200);

Serial.println("Type 1 to Turn LED on and 0 to Turn it OFF"); }

int s=0; int j=1; int x=0;

void loop() {int n=0;

n=analogRead(5); s=s+n; x=s/j; j++; if(j>10) {

j=1; s=0; } Serial.println(x); delay(500); }

Note that body voltage varies from person to person, and the same applies to the open-port voltage of our devices, but the results you get will be in a closely related range.

We now need to trigger an event when the voltage value exceeds 600 or falls below 200. In our case, we want to turn the LED on when the pin is touched, and switch it off once the pin is released. The following program code demonstrates how this can be done:

// Touch the Switch Program. Touch the Pin 5 to Switch on the LED, Release to Switch it off void setup()

{

pinMode(13,OUTPUT); pinMode(12,OUTPUT);

digitalWrite(12,HIGH); pinMode(8,INPUT);

// add the setup code here, to be run once: Serial.begin(19200);

Serial.println("Type 1 to Turn LED on and 0 to Turn it OFF"); }

int s=0; int j=1; int x=0; void loop() {int n=0;

n=analogRead(5); s=s+n; x=s/j; j++;

if(j>10) { j=1; s=0; }

// Print the value only once in each 5 seconds
Serial.println(x);
if(x>650 || x<200) // lower threshold assumed; part of this condition was lost in extraction
{
digitalWrite(13,HIGH); // pin touched: LED on
}
else
{
digitalWrite(13,LOW); // pin released: LED off
}
}

To use ArdOS, open the Arduino IDE, go to Sketch->Import Library->Add Library, and then select the package you have downloaded and renamed.


We want to write a sample program to demonstrate how ArdOS can be used, and then modify our previous program using it. 1

Begin by importing the library in a blank sketch. The following should be its header files:

#include #include #include #include

The ArdOS programs have to incorporate the programming logic inside the tasks with a signature. This is shown below:

void task(void * p)

From the setup() method, tasks have to be initialized and then put inside a queue. The loop() should not contain any code, as the kernel takes over the scheduling. Tasks should also contain infinite loops so that they keep running indefinitely.

#include #include #include

#define NUM_TASKS 2 void task1(void *p)

{

char buffer[16]; unsigned char sreg;

int n=0; while(1)

{

sprintf(buffer, "Time: %lu ", OSticks()); Serial.println(buffer);

OSSleep(500); } }

void task2(void *p) {

unsigned int pause=(unsigned int) p; char buffer[16];

while(1) {

digitalWrite(13, HIGH);

sprintf(buffer, "==>Time: %lu ", OSticks()); Serial.println(buffer);

Serial.println("LED HIGH");

OSSleep(pause);

sprintf(buffer, "==>Time: %lu ", OSticks()); Serial.println(buffer);

digitalWrite(13, LOW); Serial.println("LED LOW"); OSSleep(pause);

}

}

void setup() { OSInit(NUM_TASKS);

Serial.begin(19200); pinMode(13, OUTPUT);

OSCreateTask(0, task1, NULL); OSCreateTask(1, task2, (void *) 250);

OSRun(); }

void loop() { // Empty }

In the above code, we have implemented two tasks, that is, task1 and task2. Task1 is responsible for printing the time in terms of ticks. Task2 will blink the LED at a specific interval, which is passed as a parameter, and then print the On and Off states with the respective times. The tasks have been implemented to run continuously via the while(1) loop structure. Note that in this case, we have used OSSleep() instead of the “delay” function: OSSleep() handles the delay without blocking the other tasks.

To see how ArdOS scheduling behaves in practice, just add the following lines to task1, and then execute it. Here are the lines:

n=analogRead(5); Serial.println(n);

The “n” should be declared before the while.

However, we need to increase the baud rate and then implement an analog read in an extra task so as to get perfect results. This is shown in the following code:

#include #include #include

#define NUM_TASKS 4 void task1(void *p)

{

char buffer[16]; unsigned char sreg; int n=0;

while(1) {

sprintf(buffer, "Time: %lu ", OSticks()); Serial.println(buffer);

OSSleep(1000); } }

void task2(void *p) {

unsigned int pause=(unsigned int) p; char buffer[16];

while(1) { digitalWrite(13, HIGH); sprintf(buffer, "==>Time: %lu ", OSticks()); Serial.println(buffer); //Serial.println("LED HIGH");

OSSleep(pause); sprintf(buffer, "==>Time: %lu ", OSticks()); Serial.println(buffer); digitalWrite(13, LOW); //Serial.println("LED LOW"); OSSleep(pause);

} } void task3(void * p) {

char buff1[16]; int n1=0; while(1)

{ n1=analogRead(5);

n1=map(n1,0,1023,0,255); sprintf(buff1, "APV: %d ", n1); Serial.println(buff1); OSSleep(1000);

} } void setup() { OSInit(NUM_TASKS);

Serial.begin(115200); pinMode(13, OUTPUT);

OSCreateTask(0, task3, NULL); OSCreateTask(1, task1, NULL); OSCreateTask(2, task2, (void *) 1000); OSCreateTask(3, task1, NULL);

OSRun();

}

void loop() { // Empty }

The output from the above program will clearly demonstrate that it is possible for many tasks to try to perform writing on a serial port in a parallel manner.

We now need to rewrite our previous code in ArdOS. This will result in a better performance in terms of scheduling, event trigger, and parallelism. Here is the code for this:

#include #include #include #include

void SerialLogic(void *p) {

int n2=0; while(1)

{ if(Serial.available())

{ n2=Serial.read();
Serial.println(n2);
if(n2!=10) // 10 is the newline character sent when you press Enter
{ n2=n2-48;
if(n2==0) {

Serial.println("SERIAL OFF"); digitalWrite(13,LOW);

} else if(n2==1) {

Serial.println("SERIAL ON"); digitalWrite(13,HIGH);

}

}

} OSSleep(500); } } void SwitchLogic(void *p) {

int n3=0; while(1)

{ n3=digitalRead(8); if(n3==0) {

//digitalWrite(13,LOW); // Switching off only through the Serial Command

} else if(n3==1) { Serial.println("SWITCH ON");

digitalWrite(13,HIGH); } OSSleep(500); } }

int s=0; int j; int x=0;

void SensorLogic(void *p) {

int n4=0; while(1)

{ n4=analogRead(5);

if(n4>750 || n4<200) // lower threshold assumed; the condition was damaged in extraction
{
// trigger the event here (e.g., switch the LED)
}
OSSleep(500); } }

3.3

Probability Distributions

Consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The “;” notation means “parametrized by”; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).

3.4

Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution. For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:

∀x ∈ x, P(x = x) = Σy P(x = x, y = y).    (3.3)

The name “marginal probability” comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row. For continuous variables, we need to use integration instead of summation:

p(x) = ∫ p(x, y) dy.    (3.4)

3.5

Conditional Probability

In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P (y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).    (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens. It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.

3.6

The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

P(x(1), . . . , x(n)) = P(x(1)) ∏_{i=2}^{n} P(x(i) | x(1), . . . , x(i−1)).

(3.6)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in Eq. 3.5. For example, applying the definition twice, we get

P(a, b, c) = P(a | b, c) P(b, c)
P(b, c) = P(b | c) P(c)
P(a, b, c) = P(a | b, c) P(b | c) P(c).

3.7

Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).

(3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).    (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.

3.8

Expectation, Variance and Covariance

The expectation or expected value of some function f (x ) with respect to a probability distribution P (x) is the average or mean value that f takes on when x is drawn from P . For discrete variables this can be computed with a summation:

Ex∼P [f(x)] = Σx P(x) f(x),    (3.9)

while for continuous variables, it is computed with an integral:

Ex∼p [f(x)] = ∫ p(x) f(x) dx.    (3.10)

When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in Ex[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.

Expectations are linear, for example,

E_x[αf(x) + βg(x)] = α E_x[f(x)] + β E_x[g(x)], (3.11)

when α and β are not dependent on x.

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

Var(f(x)) = E[(f(x) − E[f(x)])^2]. (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation. The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])]. (3.13)

High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct concepts. They are related because two variables that are independent have zero covariance, and two variables that have non-zero covariance are dependent. However, independence is a distinct property from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance. For example, suppose we first sample a real number x from a uniform distribution over the interval [−1, 1]. We next sample a random variable s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose the value of s to be −1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.

The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

Cov(x)_{i,j} = Cov(x_i, x_j). (3.14)

The diagonal elements of the covariance give the variance:

Cov(x_i, x_i) = Var(x_i). (3.15)
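The dependent-but-uncorrelated construction described above (x uniform on [−1, 1], s = ±1 with probability 1/2, y = sx) is easy to simulate; a minimal sketch using only the standard library:

```python
import random

random.seed(0)
n = 200_000

# x ~ U(-1, 1); s = +1 or -1 with probability 1/2 each; y = s * x.
xs, ys = [], []
for _ in range(n):
    x = random.uniform(-1.0, 1.0)
    s = 1 if random.random() < 0.5 else -1
    xs.append(x)
    ys.append(s * x)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# The sample covariance is approximately zero even though |y| = |x| exactly,
# so x and y are dependent but uncorrelated.
assert abs(cov_xy) < 0.01
```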

3.9 Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine learning.

3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

P(x = 1) = φ (3.16)
P(x = 0) = 1 − φ (3.17)
P(x = x) = φ^x (1 − φ)^(1−x) (3.18)
E_x[x] = φ (3.19)
Var_x(x) = φ(1 − φ) (3.20)
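Eqs. 3.19 and 3.20 can be checked by simulation. The value of φ below is arbitrary; a minimal sketch using only the standard library:

```python
import random

random.seed(0)
phi = 0.3       # Bernoulli parameter, chosen arbitrarily for illustration
n = 100_000

# Draw n Bernoulli(phi) samples.
samples = [1 if random.random() < phi else 0 for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n

# E[x] = phi (Eq. 3.19) and Var(x) = phi * (1 - phi) (Eq. 3.20), up to sampling noise.
assert abs(mean - phi) < 0.01
assert abs(var - phi * (1 - phi)) < 0.01
```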

3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite. The multinoulli distribution is parametrized by a vector p ∈ [0, 1]^(k−1), where p_i gives the probability of the i-th state. The final, k-th state's probability is given by 1 − 1⊤p. Note that we must constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, etc. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

("Multinoulli" is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term "multinomial" to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.)

The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. This is because they model discrete variables for which it is feasible to simply enumerate all of the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
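Sampling from a multinoulli parametrized by only the first k − 1 probabilities can be sketched with inverse-CDF sampling. The probability values below are hypothetical:

```python
import random

random.seed(0)
# k = 4 states; only the first k-1 probabilities are free parameters.
p = [0.1, 0.2, 0.3]        # p_i for states 0..k-2; must satisfy sum(p) <= 1
p_last = 1.0 - sum(p)      # probability of the final, k-th state (0.4 here)

def sample_multinoulli():
    """Inverse-CDF sampling over the k states."""
    u, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if u < acc:
            return i
    return len(p)          # the k-th state gets the leftover mass

counts = [0] * (len(p) + 1)
for _ in range(100_000):
    counts[sample_multinoulli()] += 1

# Empirical frequencies match the parameters up to sampling noise.
freqs = [c / 100_000 for c in counts]
for f, target in zip(freqs, p + [p_last]):
    assert abs(f - target) < 0.01
```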

3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

N(x; µ, σ^2) = sqrt(1/(2πσ^2)) exp(−(x − µ)^2 / (2σ^2)). (3.21)

See Fig. 3.1 for a plot of the density function.

The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ2. When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision or inverse variance of the distribution:

N(x; µ, β^(−1)) = sqrt(β/(2π)) exp(−(1/2) β (x − µ)^2). (3.22)

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.

Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ^2) exhibits a classic "bell curve" shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.

Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to Sec. 19.4.2. The normal distribution generalizes to Rn, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:

N(x; µ, Σ) = sqrt(1/((2π)^n det(Σ))) exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ)). (3.23)

The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for

many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:

N(x; µ, β^(−1)) = sqrt(det(β)/(2π)^n) exp(−(1/2) (x − µ)^T β (x − µ)). (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
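Eqs. 3.21 and 3.22 describe the same univariate density under two parametrizations, which is easy to confirm numerically; a minimal sketch using only the standard library:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Eq. 3.21: N(x; mu, sigma^2), parametrized by the variance."""
    return math.sqrt(1.0 / (2 * math.pi * sigma2)) * math.exp(-(x - mu) ** 2 / (2 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """Eq. 3.22: the same density parametrized by the precision beta = 1/sigma^2."""
    return math.sqrt(beta / (2 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

# The two parametrizations agree whenever beta = 1/sigma^2.
for x in (-2.0, 0.0, 0.7, 3.1):
    assert abs(normal_pdf(x, 1.0, 4.0) - normal_pdf_precision(x, 1.0, 0.25)) < 1e-12

# Peak of the standard normal: 1/sqrt(2*pi)
peak = normal_pdf(0.0, 0.0, 1.0)
```

The precision form avoids the division and square of σ on every evaluation, which is the efficiency point made in the text.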

3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

p(x; λ) = λ 1_{x≥0} exp(−λx). (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution

Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ). (3.26)

3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x): p(x) = δ(x − µ).

(3.27)

The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1. The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead it is a different kind of mathematical object called a generalized function that is defined in terms of its properties when integrated. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than µ.

By defining p(x) to be δ shifted by −µ we obtain an infinitely narrow and infinitely high peak of probability mass where x = µ.

A common use of the Dirac delta distribution is as a component of an empirical distribution,

p̂(x) = (1/m) Σ_{i=1}^{m} δ(x − x^(i)), (3.28)

which puts probability mass 1/m on each of the m points x^(1), . . . , x^(m) forming a given data set or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data (see Sec. 5.5).
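Sampling from the empirical distribution of Eq. 3.28 amounts to drawing a training point uniformly at random, since each Dirac component carries weight 1/m. The data set below is hypothetical:

```python
import random

random.seed(0)
data = [0.3, 1.7, 2.2, 2.2, 5.0]   # a hypothetical training set of m = 5 points

def sample_empirical():
    """Draw from p̂(x): each of the m points has probability 1/m."""
    return random.choice(data)

draws = [sample_empirical() for _ in range(50_000)]

# The value 2.2 appears twice among the 5 points, so its empirical
# probability is 2/5; the sampled frequency should match it closely.
freq = draws.count(2.2) / len(draws)
assert abs(freq - 2 / 5) < 0.01
```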

3.9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:

P(x) = Σ_i P(c = i) P(x | c = i), (3.29)

where P(c) is the multinoulli distribution over component identities. We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.

The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In Chapter 16, we explore the art of building complex probability distributions from simple ones in more detail.

The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable c of the mixture model provides an example. Latent variables may be related to x through the joint distribution, in this case P(x, c) = P(x | c)P(c). The distribution P(c) over the latent variable and the distribution P(x | c) relating the latent variables to the visible variables determine the shape of the distribution P(x), even though it is possible to describe P(x) without reference to the latent variable. Latent variables are discussed further in Sec. 16.5.

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(x | c = i) are Gaussians. Each component has a separately parametrized mean µ^(i) and covariance Σ^(i). Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint Σ^(i) = Σ, ∀i. As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.

In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability α_i = P(c = i) given to each component i. The word "prior" indicates that it expresses the model's beliefs about c before it has observed x. By comparison, P(c | x) is a posterior probability, because it is computed after observation of x. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be

approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components. Fig. 3.2 shows samples from a Gaussian mixture model.
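Sampling from a mixture as in Eq. 3.29 is ancestral: first draw the component identity c from the multinoulli P(c), then draw x from the chosen component. The mixture parameters below are hypothetical, chosen only for illustration:

```python
import random

random.seed(0)
# A hypothetical 1-D Gaussian mixture with three components.
alphas = [0.2, 0.5, 0.3]   # prior probabilities P(c = i)
mus = [-4.0, 0.0, 3.0]     # component means
sigmas = [1.0, 0.5, 2.0]   # component standard deviations

def sample_gmm():
    """Ancestral sampling: draw the component identity c, then x | c."""
    u, acc = random.random(), 0.0
    for a, mu, sigma in zip(alphas, mus, sigmas):
        acc += a
        if u < acc:
            return random.gauss(mu, sigma)
    return random.gauss(mus[-1], sigmas[-1])

samples = [sample_gmm() for _ in range(200_000)]
mean = sum(samples) / len(samples)

# E[x] = sum_i alpha_i * mu_i = 0.2*(-4) + 0.5*0 + 0.3*3 = 0.1
assert abs(mean - 0.1) < 0.05
```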

3.10 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models. One of these functions is the logistic sigmoid:

σ(x) = 1/(1 + exp(−x)). (3.30)

The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.

Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three components. From left to right, the first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction. This example has more variance along the x2 axis than along the x1 axis. The third component has a full-rank covariance matrix, allowing it to control the variance separately along an arbitrary basis of directions.

Another commonly encountered function is the softplus function (Dugas et al., 2001):

ζ(x) = log(1 + exp(x)). (3.31)

The softplus function can be useful for producing the β or σ parameter of a normal distribution because its range is (0, ∞). It also arises commonly when manipulating expressions involving sigmoids. The name of the softplus function comes from the fact that it is a smoothed or "softened" version of

x⁺ = max(0, x). (3.32)

See Fig. 3.4 for a graph of the softplus function.

The following properties are all useful enough that you may wish to memorize them:

σ(x) = exp(x) / (exp(x) + exp(0)) (3.33)
d/dx σ(x) = σ(x)(1 − σ(x)) (3.34)
1 − σ(x) = σ(−x) (3.35)
log σ(x) = −ζ(−x) (3.36)
d/dx ζ(x) = σ(x) (3.37)
∀x ∈ (0, 1), σ^(−1)(x) = log(x / (1 − x)) (3.38)
∀x > 0, ζ^(−1)(x) = log(exp(x) − 1) (3.39)
ζ(x) = ∫_{−∞}^{x} σ(y) dy (3.40)
ζ(x) − ζ(−x) = x (3.41)

The function σ^(−1)(x) is called the logit in statistics, but this term is more rarely used in machine learning.

Eq. 3.41 provides extra justification for the name "softplus." The softplus function is intended as a smoothed version of the positive part function, x⁺ = max{0, x}. The positive part function is the counterpart of the negative part function, x⁻ = max{0, −x}. To obtain a smooth function that is analogous to the negative part, one can use ζ(−x). Just as x can be recovered from its positive part and negative part via the identity x⁺ − x⁻ = x, it is also possible to recover x using the same relationship between ζ(x) and ζ(−x), as shown in Eq. 3.41.
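Several of the identities above are easy to verify numerically; a minimal sketch using only the standard library:

```python
import math

def sigmoid(x):
    """Eq. 3.30."""
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    """Eq. 3.31; log1p improves accuracy near exp(x) = 0."""
    return math.log1p(math.exp(x))

for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    # Eq. 3.35: 1 - sigma(x) = sigma(-x)
    assert abs((1 - sigmoid(x)) - sigmoid(-x)) < 1e-12
    # Eq. 3.36: log sigma(x) = -softplus(-x)
    assert abs(math.log(sigmoid(x)) + softplus(-x)) < 1e-12
    # Eq. 3.41: softplus(x) - softplus(-x) = x
    assert abs((softplus(x) - softplus(-x)) - x) < 1e-12

# Eq. 3.38: the logit inverts the sigmoid on (0, 1).
for p in (0.1, 0.5, 0.9):
    logit = math.log(p / (1 - p))
    assert abs(sigmoid(logit) - p) < 1e-12
```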

3.11 Bayes' Rule

We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes' rule:

P(x | y) = P(x) P(y | x) / P(y). (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = Σ_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).

Bayes' rule is straightforward to derive from the definition of conditional probability, but it is useful to know the name of this formula since many texts refer to it by name. It is named after the Reverend Thomas Bayes, who first discovered a special case of the formula. The general version presented here was independently discovered by Pierre-Simon Laplace.
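Eq. 3.42 can be applied to a small worked example. The numbers below (prior, sensitivity, false-positive rate) are hypothetical, chosen only to illustrate the computation:

```python
# A Bayes' rule example: x is a rare condition, y is a positive test result.
p_x = 0.01               # P(x): prior probability of the condition
p_y_given_x = 0.95       # P(y | x): test sensitivity
p_y_given_not_x = 0.05   # P(y | not x): false-positive rate

# P(y) computed by summing over x, as noted in the text: we do not need
# to begin with knowledge of P(y).
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Eq. 3.42: P(x | y) = P(x) P(y | x) / P(y)
p_x_given_y = p_x * p_y_given_x / p_y
# About 0.16: the condition is still unlikely despite a positive test,
# because the prior P(x) is small.
```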

3.12 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics known as measure theory. Measure theory is beyond the scope of this textbook, but we can briefly sketch some of the issues that measure theory is employed to resolve.

In Sec. 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For example, it is possible to construct two sets S1 and S2 such that p(x ∈ S1) + p(x ∈ S2) > 1 but S1 ∩ S2 = ∅. These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers (the Banach-Tarski theorem provides a fun example of such sets). One of the key contributions of measure theory is to provide a characterization of the set of sets that we can compute the probability of without encountering paradoxes. In this book, we only integrate over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.

For our purposes, measure theory is more useful for describing theorems that apply to most points in R^n but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have "measure zero." We do not formally define this concept in this textbook. However, it is useful to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within R^2, a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero. Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers has measure zero, for instance).

Another useful term from measure theory is "almost everywhere." A property that holds almost everywhere holds throughout all of space except for on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored for many applications. Some important results in probability theory hold for all discrete values but only hold "almost everywhere" for continuous values.

Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g^(−1)(y)). This is actually not the case.

As a simple example, suppose we have scalar random variables x and y. Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y), then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y) dy = 1/2, (3.43)

which violates the definition of a probability distribution. This common mistake is wrong because it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have different volume in y space.

To see how to correct the problem, we return to the scalar case. We need to preserve the property

|p_y(g(x)) dy| = |p_x(x) dx|. (3.44)

Solving from this, we obtain

p_y(y) = p_x(g^(−1)(y)) |∂x/∂y| (3.45)

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|. (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) |det(∂g(x)/∂x)|. (3.47)
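The scalar example above (x ∼ U(0, 1), y = x/2) can be checked numerically: the naive rule p_y(y) = p_x(2y) integrates to 1/2, while the corrected rule of Eq. 3.45 multiplies by |dx/dy| = 2 and integrates to 1. A minimal sketch:

```python
def p_x(x):
    """Density of U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def p_y(y):
    """Eq. 3.45 with g(x) = x/2: g^{-1}(y) = 2y and |dx/dy| = 2."""
    return p_x(2.0 * y) * 2.0

def p_y_naive(y):
    """The incorrect rule p_y(y) = p_x(g^{-1}(y)), for comparison."""
    return p_x(2.0 * y)

# Midpoint-rule integration over [0, 1/2], where both densities are supported.
n = 100_000
dy = 0.5 / n
correct = sum(p_y((i + 0.5) * dy) * dy for i in range(n))
naive = sum(p_y_naive((i + 0.5) * dy) * dy for i in range(n))

assert abs(correct - 1.0) < 1e-6   # a valid density
assert abs(naive - 0.5) < 1e-6     # Eq. 3.43: the naive rule loses half the mass
```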

3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells how to design optimal codes and calculate the expected length of messages sampled from

specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply. This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.

We would like to quantify information in a way that formalizes this intuition. Specifically,

• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

In order to satisfy all three of these properties, we define the self-information of an event x = x to be

I(x) = − log P(x). (3.48)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
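The three desired properties of self-information (Eq. 3.48) can be verified directly; a minimal sketch using only the standard library:

```python
import math

def self_information_nats(p):
    """Eq. 3.48 with the natural log: I(x) = -log P(x), in nats."""
    return -math.log(p)

def self_information_bits(p):
    """The same quantity with base-2 log: a rescaling of nats by 1/log(2)."""
    return -math.log2(p)

# A guaranteed event carries no information.
assert self_information_nats(1.0) == 0.0
# An event with probability 1/e carries exactly one nat.
assert abs(self_information_nats(1 / math.e) - 1.0) < 1e-12
# A fair coin flip carries one bit, i.e. log(2) nats.
assert abs(self_information_bits(0.5) - 1.0) < 1e-12
assert abs(self_information_nats(0.5) - math.log(2)) < 1e-12
# Independent events add: I(heads twice) = 2 * I(heads once).
assert abs(self_information_bits(0.25) - 2 * self_information_bits(0.5)) < 1e-12
```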

Reinforcement Learning

1. Introduction

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence. In this book we explore a computational approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we explore idealized learning situations and evaluate the effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.

1.1 Reinforcement Learning

Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics--trial-and-error search and delayed reward--are the two most important distinguishing features of reinforcement learning.

Reinforcement learning is defined not by characterizing learning methods, but by characterizing a learning problem. Any method that is well suited to solving that problem, we consider to be a reinforcement learning method. A full specification of the reinforcement learning problem in terms of optimal control of Markov decision processes must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting with its environment to achieve a goal. Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. The formulation is intended to include just these three aspects--sensation, action, and goal--in their simplest possible forms without trivializing any of them.

Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. Supervised learning is learning from examples provided by a knowledgeable external supervisor. This is an important kind of learning, but alone it is not adequate for learning from interaction.
In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory-- where one would expect learning to be most beneficial--an agent must be able to learn from its own experience.

One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see Chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised learning as it is usually defined.
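The exploration-exploitation trade-off can be illustrated with a minimal epsilon-greedy agent on a stochastic multi-armed bandit. This is a sketch under assumed parameters (the arm reward probabilities, the value of epsilon, and the step count are all hypothetical choices for illustration), not a method the text prescribes:

```python
import random

random.seed(0)
true_p = [0.2, 0.5, 0.8]       # each arm's reward probability, unknown to the agent
epsilon = 0.1                  # fraction of steps spent exploring
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]    # running estimate of each arm's expected reward

for step in range(20_000):
    if random.random() < epsilon:
        arm = random.randrange(3)              # explore: try a random action
    else:
        arm = estimates.index(max(estimates))  # exploit: best action found so far
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    counts[arm] += 1
    # Incremental sample-average update of the action-value estimate.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

# With enough trials the agent discovers that arm 2 is best and favors it.
assert estimates.index(max(estimates)) == 2
assert counts[2] > counts[0] and counts[2] > counts[1]
```

Setting epsilon to 0 removes exploration and risks locking onto a suboptimal arm forever; setting it to 1 never exploits what has been learned. This is the dilemma described above in miniature.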

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is in contrast with many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful. Other researchers have developed theories of planning with general goals, but without considering planning's role in real-time decision-making, or the question of where the predictive models necessary for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation. Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces. When reinforcement learning involves planning, it has to address the interplay between planning and real-time action selection, as well as the question of how environmental models are acquired and improved. When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not. For learning research to make progress,

important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.

One of the larger trends of which reinforcement learning is a part is that toward greater contact between artificial intelligence and other engineering disciplines. Not all that long ago, artificial intelligence was viewed as almost entirely separate from control theory and statistics. It had to do with logic and symbols, not numbers. Artificial intelligence was large LISP programs, not linear algebra, differential equations, or statistics. Over the last decades this view has gradually eroded. Modern artificial intelligence researchers accept statistical and control algorithms, for example, as relevant competing methods or simply as tools of their trade. The previously ignored areas lying between artificial intelligence and conventional engineering are now among the most active, including new fields such as neural networks, intelligent control, and our topic, reinforcement learning. In reinforcement learning we extend ideas from optimal control theory and stochastic approximation to address the broader and more ambitious goals of artificial intelligence.

1.2 Examples

A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.

A master chess player makes a move. The choice is informed both by planning--anticipating possible replies and counterreplies--and by immediate, intuitive judgments of the desirability of particular positions and moves.

An adaptive controller adjusts parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.

A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.

A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.

Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a complex web of conditional behavior and interlocking goal-subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series of eye movements to obtain information and to guide reaching and locomotion. Rapid judgments are continually made about how to carry the objects or whether it is better to ferry some of them to the dining table before obtaining others. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately obtaining nourishment.

These examples share features that are so basic that they are easy to overlook. All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about its environment. The agent's actions are permitted to affect the future state of the environment (e.g., the next chess position, the level of reservoirs of the refinery, the next location of the robot), thereby affecting the options and opportunities available to the agent at later times. Correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning.

At the same time, in all these examples the effects of actions cannot be fully predicted; thus the agent must monitor its environment frequently and react appropriately. For example, Phil must watch the milk he pours into his cereal bowl to keep it from overflowing. All these examples involve goals that are explicit in the sense that the agent can judge progress toward its goal based on what it can sense directly. The chess player knows whether or not he wins, the refinery controller knows how much petroleum is being produced, the mobile robot knows when its batteries run down, and Phil knows whether or not he is enjoying his breakfast. In all of these examples the agent can use its experience to improve its performance over time. The chess player refines the intuition he uses to evaluate positions, thereby improving his play; the gazelle calf improves the efficiency with which it can run; Phil learns to streamline making his breakfast. The knowledge the agent brings to the task at the start--either from previous experience with related tasks or built into it by design or evolution--influences what is useful or easy to learn, but interaction with the environment is essential for adjusting behavior to exploit specific features of the task.

1.3 Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward function, a value function, and, optionally, a model of the environment. A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic. A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. A reinforcement learning agent's sole objective is to maximize the total reward it receives in the long run. The reward function defines what are the good and bad events for the agent. In a biological system, it would not be inappropriate to identify rewards with pleasure and pain. They are the immediate and defining features of the problem faced by the agent. As such, the reward function must necessarily be unalterable by the agent. It may, however, serve as a basis for altering the policy. For example, if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward functions may be stochastic. Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. 
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. To make a human analogy, rewards are like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state. Expressed this way, we hope it is clear that value functions formalize a basic and familiar idea.
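The four elements just described can be sketched schematically in code. Everything below is an illustrative sketch with invented names (the `Agent` class, its trivial `model`, the "goal" state), not an implementation of any particular algorithm:

```python
class Agent:
    """Schematic reinforcement learning agent showing the four elements:
    a policy, a reward function, a value function, and an optional model."""

    def __init__(self):
        # Value function: state -> estimated long-run desirability.
        self.values = {}

    def policy(self, state, actions):
        """Policy: a mapping from perceived states to actions. Here we pick
        the action whose predicted next state has the highest value."""
        return max(actions, key=lambda a: self.values.get(self.model(state, a), 0.0))

    def reward(self, state):
        """Reward function: immediate desirability of a state. It is fixed
        by the problem; the agent cannot alter it."""
        return 1.0 if state == "goal" else 0.0

    def model(self, state, action):
        """Optional model of the environment: predicts the next state.
        Hypothetical rule for illustration: each action names the state it leads to."""
        return action
```

The policy alone determines behavior; the value table is what learning would adjust over time.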

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. In decision-making and planning, the derived quantity called value is the one with which we are most concerned. Unfortunately, it is much harder to determine values than it is to determine rewards. Rewards are basically given directly by the environment, but values must be estimated and reestimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.

Although all the reinforcement learning methods we consider in this book are structured around estimating value functions, it is not strictly necessary to do this to solve reinforcement learning problems. For example, search methods such as genetic algorithms, genetic programming, simulated annealing, and other function optimization methods have been used to solve reinforcement learning problems. These methods search directly in the space of policies without ever appealing to value functions. We call these evolutionary methods because their operation is analogous to the way biological evolution produces organisms with skilled behavior even when they do not learn during their individual lifetimes. If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, then evolutionary methods can be effective. In addition, evolutionary methods have advantages on problems in which the learning agent cannot accurately sense the state of its environment.

Nevertheless, what we mean by reinforcement learning involves learning while interacting with the environment, which evolutionary methods do not do. It is our belief that methods able to take advantage of the details of individual behavioral interactions can be much more efficient than evolutionary methods in many cases. Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not use the fact that the policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading (e.g., when states are misperceived), but more often it should enable more efficient search. Although evolution and learning share many features and can naturally work together, as they do in nature, we do not consider evolutionary methods by themselves to be especially well suited to reinforcement learning problems. For simplicity, in this book when we use the term "reinforcement learning" we do not include evolutionary methods.

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. The incorporation of models and planning into reinforcement learning systems is a relatively new development. Early reinforcement learning systems were explicitly trial-and-error learners; what they did was viewed as almost the opposite of planning. Nevertheless, it gradually became clear that reinforcement learning methods are closely related to dynamic programming methods, which do use models, and that they in turn are closely related to state-space planning methods. In Chapter 9 we explore reinforcement learning systems that simultaneously learn by trial and error, learn a model of the environment, and use the model for planning. Modern reinforcement learning spans the spectrum from low-level, trial-and-error learning to high-level, deliberative planning.
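The use of a model for planning can be illustrated with a one-step lookahead sketch. The `model` callable and the states and actions in the test are assumptions introduced for illustration; a real model would be learned or given:

```python
def plan_one_step(state, actions, model, values):
    """Choose an action by consulting a model of the environment.

    For each candidate action, the model predicts the next state and the
    next reward; the action is scored by that immediate reward plus the
    estimated value of the predicted next state.

    model(state, action) -> (next_state, reward)
    values: dict mapping states to value estimates (unseen states -> 0.0)
    """
    def score(action):
        next_state, reward = model(state, action)
        return reward + values.get(next_state, 0.0)

    # Planning here means evaluating imagined futures before acting.
    return max(actions, key=score)
```

This is the simplest possible planner; Chapter 9 of the source text discusses systems that learn such a model while also learning by trial and error.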

1.4 An Extended Example: Tic-Tac-Toe

To illustrate the general idea of reinforcement learning and contrast it with other approaches, we next consider a single example in more detail.

Consider the familiar child's game of tic-tac-toe. Two players take turns playing on a three-by-three board. One player plays Xs and the other Os until one player wins by placing three marks in a row, horizontally, vertically, or diagonally, as the X player has in this game.

If the board fills up with neither player getting three in a row, the game is a draw. Because a skilled player can play so as never to lose, let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning? Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent. For example, a minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent. Classical optimization methods for sequential decision problems, such as dynamic programming, can compute an optimal solution for any opponent, but require as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state. Let us assume that this information is not available a priori for this problem, as it is not for the vast majority of problems of practical interest. On the other hand, such information can be estimated from experience, in this case by playing many games against the opponent. About the best one can do on this problem is first to learn a model of the opponent's behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

An evolutionary approach to this problem would directly search the space of possible policies for one with a high probability of winning against the opponent. Here, a policy is a rule that tells the player what move to make for every state of the game--every possible configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. This evaluation would then direct which policy or policies were considered next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate a population of policies. Literally hundreds of different optimization methods could be applied. By directly searching the policy space we mean that entire policies are proposed and compared on the basis of scalar evaluations.

Here is how the tic-tac-toe problem would be approached using reinforcement learning and approximate value functions. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has higher value than state B, or is considered "better" than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are "filled up," the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning. We play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. A sequence of moves made and considered during a game can be diagrammed as in Figure 1.1.

Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our reinforcement learning player) considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing backups as suggested by the curved arrows and detailed in the text.

While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 1.1. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state. This can be done by moving the earlier state's value a

fraction of the way toward the value of the later state. If we let s denote the state before the greedy move, and s' the state after the move, then the update to the estimated value of s, denoted V(s), can be written as

    V(s) ← V(s) + α [ V(s') − V(s) ],

where α is a small positive fraction called the step-size parameter, which influences the rate of learning. This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, V(s') − V(s), between estimates at two different times.
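This backup is a one-line computation. A minimal sketch, assuming states are hashable keys in a dictionary of value estimates that default to the initial guess of 0.5:

```python
def td_update(values, state, later_state, alpha=0.1):
    """Move V(state) a fraction alpha of the way toward V(later_state).

    This is the temporal-difference update
        V(s) <- V(s) + alpha * (V(s') - V(s)),
    applied after each greedy move, with s the state before the move and
    s' the state after it. Unseen states default to 0.5.
    """
    v_s = values.get(state, 0.5)
    v_later = values.get(later_state, 0.5)
    values[state] = v_s + alpha * (v_later - v_s)
    return values[state]
```

Applied after every greedy move over many games, these small corrections propagate win probabilities backward from terminal states.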

The method described above performs quite well on this task. For example, if the step-size parameter is reduced properly over time, this method converges, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player. Furthermore, the moves then taken (except on exploratory moves) are in fact the optimal moves against the opponent. In other words, the method converges to an optimal policy for playing the game. If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing. This example illustrates the differences between evolutionary methods and methods that learn value functions. To evaluate a policy, an evolutionary method must hold it fixed and play many games against the opponent, or simulate many games using a model of the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. But each policy change is made only after many games, and only the final outcome of each game is used: what happens during the games is ignored. For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win. Credit is even given to moves that never occurred! Value function methods, in contrast, allow individual states to be evaluated. In the end, both evolutionary and value function methods search the space of policies, but learning a value function takes advantage of information available during the course of play.

This simple example illustrates some of the key features of reinforcement learning methods. First, there is the emphasis on learning while interacting with an environment, in this case with an opponent player. Second, there is a clear goal, and correct behavior requires planning or foresight that takes into account delayed effects of one's choices. For example, the simple reinforcement learning player would learn to set up multimove traps for a shortsighted opponent. It is a striking feature of the reinforcement learning solution that it can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions. While this example illustrates some of the key features of reinforcement learning, it is so simple that it might give the impression that reinforcement learning is more limited than it really is. Although tic-tac-toe is a two-person game, reinforcement learning also applies in the case in which there is no external adversary, that is, in the case of a "game against nature." Reinforcement learning also is not restricted to problems in which behavior breaks down into separate episodes, like the separate games of tic-tac-toe, with reward only at the end of each episode. It is just as applicable when behavior continues indefinitely and when rewards of various magnitudes can be received at any time.

Tic-tac-toe has a relatively small, finite state set, whereas reinforcement learning can be used when the state set is very large, or even infinite. For example, Gerry Tesauro (1992, 1995) combined the algorithm described above with an artificial neural network to learn to play backgammon, which has approximately 10^20 states. With this many states it is impossible ever to experience more than a small fraction of them. Tesauro's program learned to play far better than any previous program, and now plays at the level of the world's best human players (see Chapter 11). The neural network provides the program with the ability to generalize from its experience, so that in new states it selects moves based on information saved from similar states faced in the past, as determined by its network. How well a reinforcement learning system can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience. It is in this role that we have the greatest need for supervised learning methods with reinforcement learning. Neural networks are not the only, or necessarily the best, way to do this. In this tic-tac-toe example, learning started with no prior knowledge beyond the rules of the game, but reinforcement learning by no means entails a tabula rasa view of learning and intelligence. On the contrary, prior information can be incorporated into reinforcement learning in a variety of ways that can be critical for efficient learning. We also had access to the true state in the tic-tac-toe example, whereas reinforcement learning can also be applied when part of the state is hidden, or when different states appear to the learner to be the same. That case, however, is substantially more difficult, and we do not cover it significantly in this book. Finally, the tic-tac-toe player was able to look ahead and know the states that would result from each of its possible moves.
To do this, it had to have a model of the game that allowed it to "think about" how its environment would change in response to moves that it might never make. Many problems are like this, but in others even a short-term model of the effects of actions is lacking. Reinforcement learning can be applied in either case. No model is required, but models can easily be used if they are available or can be learned.

Exercise 1.1: Self-Play. Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

Exercise 1.2: Symmetries. Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this? In what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

Exercise 1.3: Greedy Play. Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Would it learn to play better, or worse, than a nongreedy player? What problems might occur?

Exercise 1.4: Learning from Exploration. Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time, then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

Exercise 1.5: Other Improvements. Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

1.5 Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals. Reinforcement learning uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals. The concepts of value and value functions are the key features of the reinforcement learning methods that we consider in this book. We take the position that value functions are essential for efficient search in the space of policies. Their use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by scalar evaluations of entire policies.

1.6 History of Reinforcement Learning

The history of reinforcement learning has two main threads, both long and rich, that were pursued independently before intertwining in modern reinforcement learning. One thread concerns learning by trial and error and started in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The other thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. Although the two threads have been largely independent, the exceptions revolve around a third, less distinct thread concerning temporal-difference methods such as used in the tic-tac-toe example in this chapter. All three threads came together in the late 1980s to produce the modern field of reinforcement learning as we present it in this book. The thread focusing on trial-and-error learning is the one with which we are most familiar and about which we have the most to say in this brief history. Before doing that, however, we briefly discuss the optimal control thread. The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system's behavior over time. One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and others through extending a nineteenth century theory of Hamilton and Jacobi. This approach uses the concepts of a dynamical system's state and of a value function, or "optimal return function," to define a functional equation, now often called the Bellman equation. The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a). 
Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markovian decision processes (MDPs), and Ron Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. It suffers from what Bellman called "the curse of dimensionality," meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other general method. Dynamic programming has been extensively developed since the late 1950s, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983). Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 1995; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). Bryson (1996) provides an authoritative history of optimal control. In this book, we consider all of the work in optimal control also to be, in a sense, work in reinforcement learning. We define reinforcement learning as any effective way of solving reinforcement learning problems, and it is now clear that these problems are closely related to optimal control problems, particularly those formulated as MDPs. Accordingly, we must consider the solution methods of optimal control, such as dynamic programming, also to be reinforcement learning methods. Of course, almost all of these methods require complete knowledge of the system to be controlled, and for this reason it feels a little unnatural to say that they are part of reinforcement learning. On the other hand, many dynamic programming methods are incremental and iterative. Like learning methods, they gradually reach the correct answer through successive approximations. As we show in the rest of this book, these similarities are far more than superficial. 
The theories and solution methods for the cases of complete and incomplete knowledge are so closely related that we feel they must be considered together as part of the same subject matter.

Let us return now to the other major thread leading to the modern field of reinforcement learning, that centered on the idea of trial-and-error learning. This thread began in psychology, where "reinforcement" theories of learning are common. Perhaps the first to succinctly express the essence of trial-and-error learning was Edward Thorndike. We take this essence to be the idea that actions followed by good or bad outcomes have their tendency to be reselected altered accordingly. In Thorndike's words:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)

Thorndike called this the "Law of Effect" because it describes the effect of reinforcing events on the tendency to select actions. Although sometimes controversial (e.g., see Kimble, 1961, 1967; Mazur, 1994), the Law of Effect is widely regarded as an obvious basic principle underlying much behavior (e.g., Hilgard and Bower, 1975; Dennett, 1978; Campbell, 1960; Cziko, 1995). The Law of Effect includes the two most important aspects of what we mean by trial-and-error learning. First, it is selectional, meaning that it involves trying alternatives and selecting among them by comparing their consequences. Second, it is associative, meaning that the alternatives found by selection are associated with particular situations. Natural selection in evolution is a prime example of a selectional process, but it is not associative. Supervised learning is associative, but not selectional. It is the combination of these two that is essential to the Law of Effect and to trial-and-error learning. Another way of saying this is that the Law of Effect is an elementary way of combining search and memory: search in the form of trying and selecting among many actions in each situation, and memory in the form of remembering what actions worked best, associating them with the situations in which they were best. Combining search and memory in this way is essential to reinforcement learning.
In early artificial intelligence, before it was distinct from other branches of engineering, several researchers began to explore trial-and-error learning as an engineering principle. The earliest computational investigations of trial-and-error learning were perhaps by Minsky and by Farley and Clark, both in 1954. In his Ph.D. dissertation, Minsky discussed computational models of reinforcement learning and described his construction of an analog machine composed of components he called SNARCs (Stochastic Neural-Analog Reinforcement Calculators). Farley and Clark described another neural-network learning machine designed to learn by trial and error. In the 1960s the terms

"reinforcement" and "reinforcement learning" were used in the engineering literature for the first time (e.g., Waltz and Fu, 1965; Mendel, 1966; Fu, 1970; Mendel and McClaren, 1970). Particularly influential was Minsky's paper "Steps Toward Artificial Intelligence" (Minsky, 1961), which discussed several issues relevant to reinforcement learning, including what he called the credit assignment problem: How do you distribute credit for success among the many decisions that may have been involved in producing it? All of the methods we discuss in this book are, in a sense, directed toward solving this problem.

The interests of Farley and Clark (1954; Clark and Farley, 1955) shifted from trial-and-error learning to generalization and pattern recognition, that is, from reinforcement learning to supervised learning. This began a pattern of confusion about the relationship between these types of learning. Many researchers seemed to believe that they were studying reinforcement learning when they were actually studying supervised learning. For example, neural network pioneers such as Rosenblatt (1962) and Widrow and Hoff (1960) were clearly motivated by reinforcement learning--they used the language of rewards and punishments--but the systems they studied were supervised learning systems suitable for pattern recognition and perceptual learning. Even today, researchers and textbooks often minimize or blur the distinction between these types of learning. Some modern neural-network textbooks use the term "trial-and-error" to describe networks that learn from training examples because they use error information to update connection weights. This is an understandable confusion, but it substantially misses the essential selectional character of trial-and-error learning.

Partly as a result of these confusions, research into genuine trial-and-error learning became rare in the 1960s and 1970s. In the next few paragraphs we discuss some of the exceptions and partial exceptions to this trend. One of these was the work by a New Zealand researcher named John Andreae. Andreae (1963) developed a system called STeLLA that learned by trial and error in interaction with its environment. This system included an internal model of the world and, later, an "internal monologue" to deal with problems of hidden state (Andreae, 1969a). Andreae's later work (1977) placed more emphasis on learning from a teacher, but still included trial and error. Unfortunately, his pioneering

research was not well known, and did not greatly impact subsequent reinforcement learning research. More influential was the work of Donald Michie. In 1961 and 1963 he described a simple trial-and-error learning system for learning how to play tic-tac-toe (or naughts and crosses) called MENACE (for Matchbox Educable Naughts and Crosses Engine). It consisted of a matchbox for each possible game position, each matchbox containing a number of colored beads, a different color for each possible move from that position. By drawing a bead at random from the matchbox corresponding to the current game position, one could determine MENACE's move. When a game was over, beads were added to or removed from the boxes used during play to reinforce or punish MENACE's

decisions. Michie and Chambers (1968) described another tic-tac-toe reinforcement learner called GLEE (Game Learning Expectimaxing Engine) and a reinforcement learning controller called BOXES. They applied BOXES to the task of learning to balance a pole hinged to a movable cart on the basis of a failure signal occurring only when the pole fell or the cart reached the end of a track. This task was adapted from the earlier work of Widrow and Smith (1964), who used supervised learning methods, assuming instruction from a teacher already able to balance the pole. Michie and Chambers's version of pole-balancing is one of the best early examples of a reinforcement learning task under conditions of incomplete knowledge. It influenced much later work in reinforcement learning, beginning with some of our own studies (Barto, Sutton, and Anderson, 1983; Sutton, 1984).

Michie has consistently emphasized the role of trial and error and learning as essential aspects of artificial intelligence (Michie, 1974).

Widrow, Gupta, and Maitra (1973) modified the LMS algorithm of Widrow and Hoff (1960) to produce a reinforcement learning rule that could learn from success and failure signals instead of from training examples. They called this form of learning "selective bootstrap adaptation" and described it as "learning with a critic" instead of "learning with a teacher." They analyzed this rule and

showed how it could learn to play blackjack. This was an isolated foray into reinforcement learning by Widrow, whose contributions to supervised learning were much more influential. Research on learning automata had a more direct influence on the trial-and-error thread leading to modern reinforcement learning research. These are methods for solving a nonassociative, purely selectional learning problem known as the n-armed bandit, by analogy to a slot machine, or "one-armed bandit," except with n levers (see Chapter 2). Learning automata are simple, low-memory machines for solving this problem. Learning automata research originated in Russia with the work of Tsetlin (1973) and has been extensively developed since then within engineering (see Narendra and Thathachar, 1974, 1989). Barto and Anandan (1985) extended these methods to the associative case. John Holland (1975) outlined a general theory of adaptive systems based on selectional principles. His early work concerned trial and error primarily in its nonassociative form, as in evolutionary methods and the n-armed bandit. In 1986 he introduced classifier systems, true reinforcement learning systems including association and value functions. A key component of Holland's classifier systems was always a genetic algorithm, an evolutionary method whose role was to evolve useful representations. Classifier systems have been extensively developed by many researchers to form a major branch of reinforcement learning research (e.g., see Goldberg, 1989; Wilson, 1994), but genetic algorithms--which by themselves are not reinforcement learning systems--have received much more attention. The individual most responsible for reviving the trial-and-error thread to reinforcement learning within artificial intelligence was Harry Klopf (1972, 1975, 1982). Klopf recognized that essential aspects of adaptive behavior were being lost as learning researchers came to focus almost exclusively on supervised learning.
What was missing, according to Klopf, were the hedonic aspects of behavior, the drive to achieve some result from the environment, to control the environment toward desired ends and away from undesired ends. This is the essential idea of trial-and-error learning. Klopf's ideas were especially influential on the authors because our assessment of them (Barto and Sutton, 1981a) led to our appreciation of the distinction between supervised and reinforcement learning, and to our eventual focus on reinforcement learning. Much of the early work that we and colleagues accomplished was directed toward showing that reinforcement learning

and supervised learning were indeed different (Barto, Sutton, and Brouwer, 1981; Barto and Sutton, 1981b; Barto and Anandan, 1985). Other studies showed how reinforcement learning could address important problems in neural network learning, in particular, how it could produce learning algorithms for multilayer networks (Barto, Anderson, and Sutton, 1982; Barto and Anderson, 1985; Barto and Anandan, 1985; Barto, 1985, 1986; Barto and Jordan, 1987). We turn now to the third thread of the history of reinforcement learning, that concerning temporal-difference learning. Temporal-difference learning methods are distinctive in being driven by the difference between temporally successive estimates of the same quantity--for example, of the probability of winning in the tic-tac-toe example. This thread is smaller and less distinct than the other two, but it has played a particularly important role in the field, in part because temporal-difference methods seem to be new and unique to reinforcement learning. The origins of temporal-difference learning are in part in animal learning psychology, in particular, in the notion of secondary reinforcers. A secondary reinforcer is a stimulus that has been paired with a primary reinforcer such as food or pain and, as a result, has come to take on similar reinforcing properties. Minsky (1954) may have been the first to realize that this psychological principle could be important for artificial learning systems. Arthur Samuel (1959) was the first to propose and implement a learning method that included temporal-difference ideas, as part of his celebrated checkers-playing program. Samuel made no reference to Minsky's work or to possible connections to animal learning. His inspiration apparently came from Claude Shannon's (1950) suggestion that a computer could be programmed to use an evaluation function to play chess, and that it might be able to improve its play by modifying this function on-line.
(It is possible that these ideas of Shannon's also influenced Bellman, but we know of no evidence for this.) Minsky (1961) extensively discussed Samuel's work in his "Steps" paper, suggesting the connection to secondary reinforcement theories, both natural and artificial.

As we have discussed, in the decade following the work of Minsky and Samuel, little computational work was done on trial-and-error learning, and apparently no computational work at all was done on temporal-difference learning. In 1972, Klopf brought trial-and-error learning together with an important component of temporal-difference learning. Klopf was interested in principles that would scale to learning in

large systems, and thus was intrigued by notions of local reinforcement, whereby subcomponents of an overall learning system could reinforce one another. He developed the idea of "generalized reinforcement," whereby every component (nominally, every neuron) views all of its inputs in reinforcement terms: excitatory inputs as rewards and inhibitory inputs as punishments. This is not the same idea as what we now know as temporal-difference learning, and in retrospect it is farther from it than was Samuel's work. On the other hand, Klopf linked the idea with trial-and-error learning and related it to the massive empirical database of animal learning psychology.

Sutton (1978a, 1978b, 1978c) developed Klopf's ideas further, particularly the links to animal learning theories, describing learning rules driven by changes in temporally successive predictions. He and Barto refined these ideas and developed a psychological model of classical conditioning based on temporal-difference learning (Sutton and Barto, 1981a; Barto and Sutton, 1982). There followed several other influential psychological models of classical conditioning based on temporal-difference learning (e.g., Klopf, 1988; Moore et al., 1986; Sutton and Barto, 1987, 1990). Some neuroscience models developed at this time are well interpreted in terms of temporal-difference learning (Hawkins and Kandel, 1984; Byrne, Gingrich, and Baxter, 1990; Gelperin, Hopfield, and Tank, 1985; Tesauro, 1986; Friston et al., 1994), although in most cases there was no historical connection. A recent summary of links between temporal-difference learning and neuroscience ideas is provided by Schultz, Dayan, and Montague (1997). Our early work on temporal-difference learning was strongly influenced by animal learning theories and by Klopf's work. Relationships to Minsky's "Steps" paper and to Samuel's checkers players appear to have been recognized only afterward. By 1981, however, we were fully aware of all the prior work mentioned above as part of the temporal-difference and trial-and-error threads. At this time we developed a method for using temporal-difference learning in trial-and-error learning, known as the actor-critic architecture, and applied this method to Michie and Chambers's pole-balancing problem (Barto, Sutton, and Anderson, 1983). This method was extensively studied in Sutton's (1984) Ph.D. dissertation and extended to use backpropagation neural networks in Anderson's (1986) Ph.D. dissertation. Around this time, Holland (1986) incorporated temporal-difference ideas explicitly into his classifier systems. A key step was taken by Sutton in 1988 by separating temporal-difference learning from control, treating it as a general prediction method. That paper also introduced the TD(λ) algorithm and proved some of its convergence properties.

As we were finalizing our work on the actor-critic architecture in 1981, we discovered a paper by Ian Witten (1977) that contains the earliest known publication of a temporal-difference learning rule. He proposed the method that we now call tabular TD(0) for use as part of an adaptive controller for solving MDPs. Witten's work was a descendant of Andreae's early experiments with STeLLA and other trial-and-error learning systems. Thus, Witten's 1977 paper spanned both major threads of reinforcement learning research--trial-and-error learning and optimal control--while making a distinct early contribution to temporal-difference learning. Finally, the temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins's development of Q-learning. This work extended and integrated prior work in all three threads of reinforcement learning research. Paul Werbos (1987) contributed to this integration by arguing for the convergence of trial-and-error learning and dynamic programming since 1977. By the time of Watkins's work there had been tremendous growth in reinforcement learning research, primarily in the machine learning subfield of artificial intelligence, but also in neural networks and artificial intelligence more broadly. In 1992, the remarkable success of Gerry Tesauro's backgammon playing program, TD-Gammon, brought additional attention to the field. Other important contributions made in the recent history of reinforcement learning are too numerous to mention in this brief account; we cite these at the end of the individual chapters in which they arise.

2. Evaluative Feedback

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how good the action taken is, but not whether it is the best or the worst action possible. Evaluative feedback is the basis of methods for function optimization, including evolutionary methods. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification. In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken. There are also interesting intermediate cases in which evaluation and instruction blend together.

In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case will enable us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback. The particular nonassociative, evaluative feedback problem that we explore is a simple version of the n-armed bandit problem. We use this problem to introduce a number of basic learning methods which we extend in later chapters to apply to the full reinforcement learning problem. At the end of this chapter, we take a step closer to the full reinforcement learning problem by discussing what happens when the bandit problem becomes associative, that is, when actions are taken in more than one situation.

2.1 An n-Armed Bandit Problem

Consider the following learning problem. You are faced repeatedly with a choice among n different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections. Each action selection is called a play. This is the original form of the n-armed bandit problem, so named by analogy to a slot machine, or "one-armed bandit," except that it has n levers instead of one. Each action selection is like a play of one of the slot machine's levers, and the rewards are the payoffs for hitting the jackpot. Through repeated plays you are to maximize your winnings by concentrating your plays on the best levers. Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each play is a treatment selection, and each reward is the survival or well-being of the patient. Today the term "n-armed bandit problem" is often used for a generalization of the problem described above, but in this book we use it to refer just to this simple case. In our n-armed bandit problem, each action has an expected or mean reward given that that action is selected; let us call this the value of that action. If you knew the value of each action, then it would be trivial to solve the n-armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates. If you maintain estimates of the action values, then at any time there is at least one action whose estimated value is greatest. We call this a greedy action. If you select a greedy action, we say that you are exploiting your current knowledge of the values of the actions. 
If instead you select one of the nongreedy actions, then we say you are exploring because this enables you to improve your estimate of the nongreedy action's value. Exploitation is the right thing to do to maximize the expected reward on the one play, but exploration may produce the greater total reward in the long run. For example, suppose the greedy action's value is known with certainty, while several other actions are estimated to be nearly as good but with substantial

uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don't know which one. If you have many plays yet to make, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the "conflict" between exploration and exploitation.

In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, uncertainties, and the number of remaining plays. There are many sophisticated methods for balancing exploration and exploitation for particular mathematical formulations of the n-armed bandit and related problems. However, most of these methods make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications and in the full reinforcement learning problem that we consider in subsequent chapters. The guarantees of optimality or bounded loss for these methods are of little comfort when the assumptions of their theory do not apply. In this book we do not worry about balancing exploration and exploitation in a sophisticated way; we worry only about balancing them at all. In this chapter we present several simple balancing methods for the n-armed bandit problem and show that they work much better than methods that always exploit. In addition, we point out that supervised learning methods (or rather the methods closest to supervised learning methods when adapted to this problem) perform poorly on this problem because they do not balance exploration and exploitation at all. The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement learning; the simplicity of the n-armed bandit problem enables us to show this in a particularly clear form.

2.2 Action-Value Methods

We begin by looking more closely at some simple methods for estimating the values of actions and for using the estimates to make action selection decisions. In this chapter, we denote the true (actual) value of action a as Q*(a), and the estimated value at the t-th play as Q_t(a). Recall that the true value of an action is the mean reward received when that action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. In other words, if at the t-th play action a has been chosen k_a times prior to t, yielding rewards r_1, r_2, ..., r_{k_a}, then its value is estimated to be

    Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a.

If k_a = 0, then we define Q_t(a) instead as some default value, such as Q_0(a) = 0. As k_a → ∞, by the law of large numbers Q_t(a) converges to Q*(a). We call this the sample-average method for estimating action values because each estimate is a simple average of the sample of relevant rewards. Of course this is just one way to estimate action values, and not necessarily the best one. Nevertheless, for now let us stay with this simple estimation method and turn to the question of how the estimates might be used to select actions.

The simplest action selection rule is to select the action (or one of the actions) with highest estimated action value, that is, to select at play t one of the greedy actions, a*, for which Q_t(a*) = max_a Q_t(a). This method always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better. A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select an action at random, uniformly, independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods. An advantage of these methods is that, in the limit as the number of plays increases, every action will be sampled an infinite number of times, guaranteeing that k_a → ∞ for all a, and thus ensuring that all the Q_t(a) converge to Q*(a). This of course implies that the probability of selecting the optimal action converges to greater than 1 − ε, that is, to near certainty. These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.
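The sample-average estimates and ε-greedy selection described above can be sketched in a few lines of code. This is our own illustrative sketch, not code from the text; the function and variable names are invented, and the reward function is supplied by the caller:

```python
import random

def epsilon_greedy_bandit(reward_fn, n_actions, n_plays, epsilon):
    """Epsilon-greedy action selection with sample-average value estimates.

    reward_fn(a) returns a (possibly noisy) reward for action a.
    Returns the final estimates Q_t(a) and the selection counts k_a.
    """
    q = [0.0] * n_actions   # Q_t(a): estimated value, default 0 when k_a = 0
    k = [0] * n_actions     # k_a: number of times action a has been chosen
    for _ in range(n_plays):
        if random.random() < epsilon:
            a = random.randrange(n_actions)                 # explore: uniform random
        else:
            a = max(range(n_actions), key=lambda i: q[i])   # exploit: greedy action
        r = reward_fn(a)
        k[a] += 1
        q[a] += (r - q[a]) / k[a]   # incremental form of the sample average
    return q, k
```

The incremental update `q[a] += (r - q[a]) / k[a]` keeps each estimate exactly equal to the average of the rewards received for that action, without storing the reward history.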

To roughly assess the relative effectiveness of the greedy and ε-greedy methods, we compared them numerically on a suite of test problems. This is a set of 2000 randomly generated n-armed bandit tasks with n = 10. For each task, the true action values Q*(a) were selected according to a normal (Gaussian) distribution with mean 0 and variance 1. On each play, the reward for the selected action a was drawn from a normal distribution with mean Q*(a) and variance 1. Averaging over tasks, we can plot the performance and behavior of various methods as they improve with experience over 1000 plays, as in Figure 2.1. We call this suite of test tasks the 10-armed testbed.
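The construction of the testbed can be sketched as follows (our own code, not from the book; it draws the true values Q*(a) and the per-play rewards as described above, and measures how often the optimal action is chosen):

```python
import random

def make_testbed(n_tasks=2000, n_arms=10):
    """Each task is a list of true action values Q*(a) drawn from N(0, 1)."""
    return [[random.gauss(0.0, 1.0) for _ in range(n_arms)]
            for _ in range(n_tasks)]

def run_epsilon_greedy(task, n_plays=1000, epsilon=0.1):
    """Play one bandit task with epsilon-greedy sample averages.

    Returns the fraction of plays on which the optimal action was chosen.
    """
    n = len(task)
    q = [0.0] * n
    k = [0] * n
    best = max(range(n), key=lambda a: task[a])   # the truly optimal action
    optimal_count = 0
    for _ in range(n_plays):
        if random.random() < epsilon:
            a = random.randrange(n)
        else:
            a = max(range(n), key=lambda i: q[i])
        optimal_count += (a == best)
        r = random.gauss(task[a], 1.0)            # reward ~ N(Q*(a), 1)
        k[a] += 1
        q[a] += (r - q[a]) / k[a]
    return optimal_count / n_plays
```

Averaging `run_epsilon_greedy` over all tasks, play by play, would reproduce curves of the kind shown in Figure 2.1.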

Figure 2.1: Average performance of ε-greedy action-value methods on the 10-armed testbed. These data are averages over 2000 tasks. All methods used sample averages as their action-value estimates.

Figure 2.1 compares a greedy method with two ε-greedy methods (ε = 0.1 and ε = 0.01), as described above, on the 10-armed testbed. All methods formed their action-value estimates using the sample-average technique. The upper graph shows the increase in expected reward with experience. The greedy method improved slightly faster than the other methods at the very beginning, but then leveled off at a lower level. It achieved a reward per step of only about 1, compared with the best possible of about 1.55 on this testbed. The greedy method performs significantly worse in the long run because it often gets stuck performing suboptimal actions. The lower graph shows that the greedy method found the optimal action in only approximately one-third of the tasks. In the other two-thirds, its initial samples of the optimal action were disappointing, and it never returned to it. The ε-greedy methods eventually perform better because they continue to explore, and to improve their chances of recognizing the optimal action. The ε = 0.1 method explores more, and usually finds the optimal action earlier, but never selects it more than 91% of the time. The ε = 0.01 method improves more slowly, but eventually performs better than the ε = 0.1 method on both performance measures. It is also possible to reduce ε over time to try to get the best of both high and low values.

The advantage of ε-greedy over greedy methods depends on the task. For example, suppose the reward variance had been larger, say 10 instead of 1. With noisier rewards it takes more exploration to find the optimal action, and ε-greedy methods should fare even better relative to the greedy method. On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case the greedy method might actually perform best because it would soon find the optimal action and then never explore. But even in the deterministic case, there is a large advantage to exploring if we weaken some of the other assumptions. For example, suppose the bandit task were nonstationary, that is, that the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the nongreedy actions has not changed to become better than the greedy one. As we will see in the next few chapters, effective nonstationarity is the case most commonly encountered in reinforcement learning. Even if the underlying task is stationary and deterministic, the learner faces a set of banditlike decision tasks each of which changes over time due to the learning process itself. Reinforcement learning requires a balance between exploration and exploitation.

Exercise 2.1 In the comparison shown in Figure 2.1, which method will perform best in the long run in terms of cumulative reward and cumulative probability of selecting the best action? How much better will it be?

2.3 Softmax Action Selection

Although ε-greedy action selection is an effective and popular means of balancing exploration and exploitation in reinforcement learning, one drawback is that when it explores it chooses equally among all actions. This means that it is as likely to choose the worst-appearing action as it is to choose the next-to-best action. In tasks where the worst actions are very bad, this may be unsatisfactory. The obvious solution is to vary the action probabilities as a graded function of estimated value. The greedy action is still given the highest selection probability, but all the others are ranked and weighted according to their value estimates. These are called softmax action selection rules. The most common softmax method uses a Gibbs, or Boltzmann, distribution. It chooses action a on the t-th play with probability

    e^{Q_t(a)/τ} / Σ_{b=1..n} e^{Q_t(b)/τ},

where τ is a positive parameter called the temperature. High temperatures cause the actions to be all (nearly) equiprobable. Low temperatures cause a greater difference in selection probability for actions that differ in their value estimates. In the limit as τ → 0, softmax action selection becomes the same as greedy action selection. Of course, the softmax effect can be produced in a large number of ways other than by a Gibbs distribution. For example, one could simply add a random number from a long-tailed distribution to each Q_t(a) and then pick the action whose sum was largest.

Whether softmax action selection or ε-greedy action selection is better is unclear and may depend on the task and on human factors. Both methods have only one parameter that must be set. Most people find it easier to set the ε parameter with confidence; setting τ requires knowledge of the likely action values and of powers of e. We know of no careful comparative studies of these two simple action-selection rules.

Exercise 2.2 (programming) How does the softmax action selection method using the Gibbs distribution fare on the 10-armed testbed? Implement the method and run it at several temperatures to produce graphs similar to those in Figure 2.1. To verify your code, first implement the ε-greedy methods and reproduce some specific aspect of the results in Figure 2.1.

Exercise 2.3 Show that in the case of two actions, the softmax operation using the Gibbs distribution becomes the logistic, or sigmoid, function commonly used in artificial neural networks. What effect does the temperature parameter have on the function?
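The Gibbs-distribution softmax rule can be sketched as follows (our own illustrative code; the function names are invented):

```python
import math
import random

def softmax_probs(q, tau):
    """Selection probabilities under a Gibbs distribution with temperature tau."""
    # Subtracting max(q) before exponentiating avoids overflow;
    # it cancels in the ratio, so the probabilities are unchanged.
    m = max(q)
    exps = [math.exp((v - m) / tau) for v in q]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(q, tau):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q, tau)
    x = random.random()
    cumulative = 0.0
    for a, p in enumerate(probs):
        cumulative += p
        if x < cumulative:
            return a
    return len(q) - 1   # guard against floating-point rounding
```

With a very high temperature the probabilities are nearly uniform; as the temperature approaches zero, nearly all the probability mass moves to the greedy action.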

2.4 Evaluation Versus Instruction

The n-armed bandit problem we considered above is a case in which the feedback is purely evaluative. The reward received after each action gives some information about how good the action was, but it says nothing at all about whether the action was correct or incorrect, that is, whether it was a best action or not. Here, correctness is a relative property of actions that can be determined only by trying them all and comparing their rewards. In this sense the problem is inherently one requiring explicit search among the alternative actions. You have to perform some form of the generate-and-test method whereby you try actions, observe the outcomes, and selectively retain those that are the most effective. This is learning by selection, in contrast to learning by instruction, and all reinforcement learning methods have to use it in one form or another. This contrasts sharply with supervised learning, where the feedback from the environment directly indicates what the correct action should have been. In this case there is no need to search: whatever action you try, you will be told what the right one would have been. There is no need to try a variety of actions; the instructive "feedback" is typically independent of the action selected (so is not really feedback at all). It might still be necessary to search in the parameter space of the supervised learning system (e.g., the weight space of a neural network), but searching in the space of actions is not required. Of course, supervised learning is usually applied to problems that are much more complex in some ways than the n-armed bandit. In supervised learning there is not one situation in which action is taken, but a large set of different situations, each of which must be responded to correctly. 
The main problem facing a supervised learning system is to construct a mapping from situations to actions that mimics the correct actions specified by the environment and that generalizes correctly to new situations. A supervised learning system cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives. Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment. Focusing on the special case of a single situation that is encountered repeatedly helps make plain the distinction between evaluation and instruction. Suppose there are 100 possible actions and you select action number 32. Evaluative feedback would give you a score, say 7.2, for that action, whereas instructive training

information would say what other action, say action number 67, would actually have been correct. The latter is clearly much more informative training information. Even if instructional information is noisy, it is still more informative than evaluative feedback. It is always true that a single instruction can be used to advantage to direct changes in the action selection rule, whereas evaluative feedback must be compared with that of other actions before any inferences can be made about action selection. The difference between evaluative feedback and instructive information remains significant even if there are only two actions and two possible rewards. For these binary bandit tasks, let us call the two rewards success and failure. If you received success, then you might reasonably infer that whatever action you selected was correct, and if you received failure, then you might infer that whatever action you did not select was correct. You could then keep a tally of how often each action was (inferred to be) correct and select the action that was correct most often. Let us call this the supervised algorithm because it corresponds most closely to what a supervised learning method might do in the case of a single input pattern. If the rewards are deterministic, then the inferences of the supervised algorithm are all correct and it performs excellently. If the rewards are stochastic, then the picture is more complicated. In the stochastic case, a particular binary bandit task is defined by two numbers, the probabilities of success for each possible action. The space of all possible tasks is thus a unit square, as shown in Figure 2.2. The upper-left and lower-right quadrants correspond to relatively easy tasks for which the supervised algorithm would work well. For these, the probability of success for the better action is greater than 1/2 and the probability of success for the poorer action is less than 1/2. 
For these tasks, the action inferred to be correct (as described above) will actually be the correct action more than half the time.

However, binary bandit tasks in the other two quadrants of Figure 2.2 are more difficult and cannot be solved effectively by the supervised algorithm. For example, consider a task with success probabilities 0.1 and 0.2, corresponding to point A in the lower-left difficult quadrant of Figure 2.2. Because both actions produce failure at least 80% of the time, any method that takes failure as an indication that the other action was correct will oscillate between the two actions, never settling on the better one. Now consider a task with success probabilities 0.8 and 0.9, corresponding to point B in the upper-right difficult quadrant of Figure 2.2. In this case both actions produce success almost all the time. Any method that takes success as an indication of correctness can easily become stuck selecting the wrong action. Figure 2.3 shows the average behavior of the supervised algorithm and several other algorithms on the binary bandit tasks corresponding to points A and B. For comparison, also shown is the behavior of an ε-greedy action-value method (ε = 0.1) as described in Section 2.2. In both tasks, the supervised algorithm learned to select the better action only slightly more than half the time.
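This oscillation is easy to reproduce. The following is a minimal sketch (not from the original experiment; the tally-based tie-breaking, horizon, and seed are illustrative assumptions) of the supervised algorithm on stochastic binary bandit tasks:

```python
import random

def supervised_algorithm(p_success, plays=2000, seed=0):
    """Tally inferred-correct counts per action; always play the leader.

    Success => the selected action is inferred correct.
    Failure => the *other* action is inferred correct.
    Returns the fraction of plays on which the better action was chosen.
    """
    rng = random.Random(seed)
    tally = [0, 0]
    better = 1 if p_success[1] > p_success[0] else 0
    better_picks = 0
    for _ in range(plays):
        if tally[0] == tally[1]:
            a = rng.randrange(2)          # break ties randomly
        else:
            a = 0 if tally[0] > tally[1] else 1
        if a == better:
            better_picks += 1
        success = rng.random() < p_success[a]
        tally[a if success else 1 - a] += 1
    return better_picks / plays

# Hard task (point A): both actions mostly fail, so credit keeps flipping
# to the non-selected action and the algorithm oscillates near 50%.
frac_hard = supervised_algorithm((0.1, 0.2))
# Easy quadrant: the better action's success probability exceeds 1/2,
# so the inference is usually right and its tally races ahead.
frac_easy = supervised_algorithm((0.2, 0.8))
```

On the hard task the fraction of better-action picks hovers near one half, while on the easy task it approaches one, matching the quadrant picture of Figure 2.2.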

The graphs in Figure 2.3 also show the average behavior of two other algorithms, known as L_{R-P} and L_{R-I}. These are classical methods from the field of learning automata that follow a logic similar to that of the supervised algorithm. Both methods are stochastic, updating the probabilities of selecting each action, denoted π_t(a_1) and π_t(a_2). The L_{R-P} method infers the correct action just as the supervised algorithm does, and then adjusts its probabilities as follows. If the action inferred to be correct on play t was a_1, then π_t(a_1) is incremented a fraction, α, of the way from its current value toward 1:

π_{t+1}(a_1) = π_t(a_1) + α[1 − π_t(a_1)].    (2.3)

The probability of the other action is adjusted inversely, so that the two probabilities sum to 1. For the results shown in Figure 2.3, α was 0.1. The idea of L_{R-P} is similar to that of the supervised algorithm, only it is stochastic. Rather than committing totally to the action inferred to be best, L_{R-P} gradually increases its probability.

The name L_{R-P} stands for "linear, reward-penalty," meaning that the update (2.3) is linear in the probabilities and that the update is performed on both success (reward) plays and failure (penalty) plays. The name L_{R-I} stands for "linear, reward-inaction." This algorithm is identical to L_{R-P} except that it updates its probabilities only upon success plays; failure plays are ignored entirely. The results in Figure 2.3 show that L_{R-P} performs little, if any, better than the supervised algorithm on the binary bandit tasks corresponding to points A and B in Figure 2.2. L_{R-I} eventually performs very well on the A task, but not on the B task, and learns slowly in both cases.

Binary bandit tasks are an instructive special case blending aspects of supervised and reinforcement learning problems. Because the rewards are binary, it is possible to infer something about the correct action given just a single reward. In some instances of such problems, these inferences are quite reasonable and lead to effective algorithms. In other instances, however, such inferences are less appropriate and lead to poor behavior. In bandit tasks with nonbinary rewards, such as in the 10-armed testbed, it is not at all clear how the ideas behind these inferences could be applied to produce effective algorithms. All of these are very simple problems, but already we see the need for capabilities beyond those of supervised learning methods.

Exercise 2.4 Consider a class of simplified supervised learning tasks in which there is only one situation (input pattern) and two actions. One action, say a, is correct and the other, say b, is incorrect. The instruction signal is noisy: it instructs the wrong action with probability p; that is, with probability p it says that b is correct. You can think of these tasks as binary bandit tasks if you treat agreeing with the (possibly wrong) instruction signal as success, and disagreeing with it as failure. Discuss the resulting class of binary bandit tasks. Is anything special about these tasks? How does the supervised algorithm perform on these tasks?
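The "linear, reward-penalty" and "linear, reward-inaction" automata described above can be sketched in a few lines (the step size of 0.1, the horizon, the seed, and the initial probabilities of 0.5 are illustrative assumptions):

```python
import random

def learning_automaton(p_success, alpha=0.1, plays=3000,
                       reward_inaction=False, seed=1):
    """Binary-bandit learning automaton; returns final P(select action 1)."""
    rng = random.Random(seed)
    pi = [0.5, 0.5]  # action-selection probabilities
    for _ in range(plays):
        a = 0 if rng.random() < pi[0] else 1
        success = rng.random() < p_success[a]
        if reward_inaction and not success:
            continue  # reward-inaction: failure (penalty) plays are ignored
        # Infer the correct action: the selected one on success, the other
        # on failure, and move its probability a fraction alpha toward 1.
        c = a if success else 1 - a
        pi[c] += alpha * (1.0 - pi[c])
        pi[1 - c] = 1.0 - pi[c]  # keep the two probabilities summing to 1
    return pi[1]

# Task A from Figure 2.2, success probabilities (0.1, 0.2):
p_rp = learning_automaton((0.1, 0.2), reward_inaction=False)  # reward-penalty
p_ri = learning_automaton((0.1, 0.2), reward_inaction=True)   # reward-inaction
```

The reward-penalty variant keeps being pushed back and forth by the frequent failures, while the reward-inaction variant only moves on the rare successes, which is why it can eventually commit on the A task but learns slowly.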

2.5 Incremental Implementation

The action-value methods we have discussed so far all estimate action values as sample averages of observed rewards. The obvious implementation is to maintain, for each action a, a record of all the rewards that have followed the selection of that action. Then, when the estimate of the value of action a is needed at time t, it can be computed according to (2.1), which we repeat here:

Q_t(a) = (r_1 + r_2 + ⋯ + r_{k_a}) / k_a,    (2.1)

where r_1, …, r_{k_a} are all the rewards received following all selections of action a prior to play t. A problem with this straightforward implementation is that its memory and computational requirements grow over time without bound. That is, each additional reward following a selection of action a requires more memory to store it and results in more computation being required to determine Q_t(a).

As you might suspect, this is not really necessary. It is easy to devise incremental update formulas for computing averages with small, constant computation required to process each new reward. For some action, let Q_k denote the average of its first k rewards (not to be confused with Q_k(a), the average for action a at the kth play). Given this average and a (k+1)st reward, r_{k+1}, then the average of all k+1 rewards can be computed by

Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k],    (2.4)

which holds even for k = 0, obtaining Q_1 = r_1 for arbitrary Q_0. This implementation requires memory only for Q_k and k, and only the small computation (2.4) for each new reward.
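The incremental rule (2.4) can be checked against the batch average of (2.1) in a few lines (the reward list is arbitrary):

```python
def incremental_mean(rewards):
    """Running sample average via Q_{k+1} = Q_k + (r_{k+1} - Q_k)/(k+1)."""
    q, k = 0.0, 0
    for r in rewards:
        k += 1
        q += (r - q) / k  # constant memory and computation per reward
    return q

rewards = [1.0, 0.0, 2.0, 4.0, -1.0]
q = incremental_mean(rewards)
batch = sum(rewards) / len(rewards)  # the equivalent batch computation (2.1)
```

Both computations give the same mean, but the incremental form stores only the current estimate and the count.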

The update rule (2.4) is of a form that occurs frequently throughout this book. The general form is

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate].

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by taking a step toward the "Target." The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the (k+1)st reward.

Note that the step-size parameter (StepSize) used in the incremental method described above changes from time step to time step. In processing the kth reward for action a, that method uses a step-size parameter of 1/k. In this book we denote the step-size parameter by the symbol α or, more generally, by α_k(a). For example, the above incremental implementation of the sample-average method is described by the equation α_k(a) = 1/k. Accordingly, we sometimes use the informal shorthand α = 1/k to refer to this case, leaving the action dependence implicit.

Exercise 2.5 Give pseudocode for a complete algorithm for the n-armed bandit problem. Use greedy action selection and incremental computation of action values with the α = 1/k step-size parameter. Assume a function that takes an action and returns a reward. Use arrays and variables; do not subscript anything by the time index t. Indicate how the action values are initialized and updated after each reward.

2.6 Tracking a Nonstationary Problem

The averaging methods discussed so far are appropriate in a stationary environment, but not if the bandit is changing over time. As noted earlier, we often encounter reinforcement learning problems that are effectively nonstationary. In such cases it makes sense to weight recent rewards more heavily than long-past ones. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule (2.4) for updating an average Q_k of the k past rewards is modified to be

Q_{k+1} = Q_k + α [r_{k+1} − Q_k],

where the step-size parameter, α, 0 < α ≤ 1, is constant. This results in Q_{k+1} being a weighted average of past rewards and the initial estimate Q_0:

Q_{k+1} = (1−α)^{k+1} Q_0 + Σ_{i=1}^{k+1} α (1−α)^{k+1−i} r_i.    (2.7)

We call this a weighted average because the sum of the weights is (1−α)^{k+1} + Σ_{i=1}^{k+1} α (1−α)^{k+1−i} = 1, as you can check yourself. Note that the weight, α (1−α)^{k+1−i}, given to the reward r_i depends on how many rewards ago, k+1−i, it was observed. The quantity 1−α is less than 1, and thus the weight given to r_i decreases as the number of intervening rewards increases. In fact, the weight decays exponentially according to the exponent on 1−α. Accordingly, this is sometimes called an exponential, recency-weighted average.
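The weighted-average claim of (2.7) can be verified numerically; in this sketch the values of α, k, Q_0, and the rewards are arbitrary choices:

```python
alpha, k = 0.1, 20
q0 = 5.0
rewards = [float(i % 3) for i in range(1, k + 1)]  # arbitrary reward sequence

# Weights from (2.7): (1-alpha)^k on Q_0, alpha*(1-alpha)^(k-i) on reward r_i.
w0 = (1 - alpha) ** k
w = [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]
total_weight = w0 + sum(w)  # should be exactly 1

# The same estimate computed by the constant-step-size incremental rule.
q = q0
for r in rewards:
    q += alpha * (r - q)

weighted = w0 * q0 + sum(wi * ri for wi, ri in zip(w, rewards))
```

The weights sum to one, and the incremental estimate matches the explicit exponential, recency-weighted average.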

Sometimes it is convenient to vary the step-size parameter from step to step. Let α_k(a) denote the step-size parameter used to process the reward received after the kth selection of action a. As we have noted, the choice α_k(a) = 1/k results in the sample-average method, which is guaranteed to converge to the true action values by the law of large numbers. But of course convergence is not guaranteed for all choices of the sequence {α_k(a)}. A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:

Σ_{k=1}^∞ α_k(a) = ∞   and   Σ_{k=1}^∞ α_k²(a) < ∞.    (2.8)

The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.

Note that both convergence conditions are met for the sample-average case, α_k(a) = 1/k, but not for the case of constant step-size parameter, α_k(a) = α. In the latter case, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. As we mentioned above, this is actually desirable in a nonstationary environment, and problems that are effectively nonstationary are the norm in reinforcement learning. In addition, sequences of step-size parameters that meet the conditions (2.8) often converge very slowly or need considerable tuning in order to obtain a satisfactory convergence rate. Although sequences of step-size parameters that meet these convergence conditions are often used in theoretical work, they are seldom used in applications and empirical research.

Exercise 2.6 If the step-size parameters, α_k, are not constant, then the estimate Q_k is a weighted average of previously received rewards with a weighting different from that given by (2.7). What is the weighting on each prior reward for the general case?

Exercise 2.7 (programming) Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the Q*(a) start out equal and then take independent random walks. Prepare plots like Figure 2.1 for an action-value method using sample averages, incrementally computed by α = 1/k, and another action-value method using a constant step-size parameter, α = 0.1. Use ε = 0.1 and, if necessary, runs longer than 1000 plays.

2.7 Optimistic Initial Values

All the methods we have discussed so far are dependent to some extent on the initial action-value estimates, Q_0(a). In the language of statistics, these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant α, the bias is permanent, though decreasing over time as given by (2.7). In practice, this kind of bias is usually not a problem, and can sometimes be very helpful. The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero. The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be expected.

Initial action values can also be used as a simple way of encouraging exploration. Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to +5. Recall that the Q*(a) in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of +5 is thus wildly optimistic. But this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being "disappointed" with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.
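A sketch of this effect (the +5 initialization and the constant step size follow the discussion above; the arm count, horizon, and seeds are illustrative):

```python
import random

def greedy_with_optimism(q_star, q0=5.0, alpha=0.1, plays=200, seed=3):
    """Pure greedy selection with optimistic initial estimates."""
    rng = random.Random(seed)
    q = [q0] * len(q_star)
    tried = set()
    for _ in range(plays):
        a = max(range(len(q)), key=lambda i: q[i])  # always greedy
        tried.add(a)
        r = rng.gauss(q_star[a], 1.0)   # reward ~ N(Q*(a), 1)
        q[a] += alpha * (r - q[a])      # constant-step-size update
    return tried

arm_rng = random.Random(42)
q_star = [arm_rng.gauss(0.0, 1.0) for _ in range(10)]  # true values ~ N(0, 1)
tried = greedy_with_optimism(q_star)
```

Because every pull is "disappointing" relative to +5, the greedy rule alone cycles through all ten arms early on; no explicit exploration is needed to get every action tried.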

Figure 2.4: The effect of optimistic initial action-value estimates on the 10-armed testbed.

Figure 2.4 shows the performance on the 10-armed bandit testbed of a greedy method using Q_0(a) = +5, for all a. For comparison, also shown is an ε-greedy method (ε = 0.1) with Q_0(a) = 0. Both methods used a constant step-size parameter, α = 0.1. Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values. We regard it as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration. For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses on the initial state in any special way is unlikely to help with the general nonstationary case. The beginning of time occurs only once, and thus we should not focus on it too much. This criticism applies as well to the sample-average methods, which also treat the beginning of time as a special event, averaging all subsequent rewards with equal weights. Nevertheless, all of these methods are very simple, and one of them or some simple combination of them is often adequate in practice. In the rest of this book we make frequent use of several of these simple exploration techniques.

Exercise 2.8 The results shown in Figure 2.4 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? What might make this method perform particularly better or worse, on average, on particular early plays?

2.8 Reinforcement Comparison

A central intuition underlying reinforcement learning is that actions followed by large rewards should be made more likely to recur, whereas actions followed by small rewards should be made less likely to recur. But how is the learner to know what constitutes a large or a small reward? If an action is taken and the environment returns a reward of 5, is that large or small? To make such a judgment one must compare the reward with some standard or reference level, called the reference reward. A natural choice for the reference reward is an average of previously received rewards. In other words, a reward is interpreted as large if it is higher than average, and small if it is lower than average. Learning methods based on this idea are called reinforcement comparison methods. These methods are sometimes more effective than action-value methods. They are also the precursors to actor-critic methods, a class of methods for solving the full reinforcement learning problem that we present later.

Reinforcement comparison methods typically do not maintain estimates of action values, but only of an overall reward level. In order to pick among the actions, they maintain a separate measure of their preference for each action. Let us denote the preference for action a on play t by p_t(a). The preferences might be used to determine action-selection probabilities according to a softmax relationship, such as

π_t(a) = e^{p_t(a)} / Σ_{b=1}^n e^{p_t(b)},    (2.9)

where π_t(a) denotes the probability of selecting action a on the tth play. The reinforcement comparison idea is used in updating the action preferences. After each play, the preference for the action selected on that play, a_t, is incremented by the difference between the reward, r_t, and the reference reward, r̄_t:

p_{t+1}(a_t) = p_t(a_t) + β [r_t − r̄_t],    (2.10)

where β is a positive step-size parameter. This equation implements the idea that high rewards should increase the probability of reselecting the action taken, and low rewards should decrease its probability.

The reference reward is an incremental average of all recently received rewards, whichever actions were taken. After the update (2.10), the reference reward is updated:

r̄_{t+1} = r̄_t + α [r_t − r̄_t],

where α, 0 < α ≤ 1, is a step-size parameter as usual. The initial value of the reference reward, r̄_0, can be set either optimistically, to encourage exploration, or according to prior knowledge. The initial values of the action preferences can all be set to zero. Constant α is a good choice here because the distribution of rewards is changing over time as action selection improves. We see here the first case in which the learning problem is effectively nonstationary even though the underlying problem is stationary.
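The softmax selection (2.9), preference update (2.10), and reference-reward update described above can be sketched as follows (the arm means, the step sizes of 0.1, and the horizon are illustrative assumptions):

```python
import math
import random

def softmax(prefs):
    """Action probabilities from preferences, as in (2.9)."""
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforcement_comparison(means, alpha=0.1, beta=0.1, plays=3000, seed=7):
    rng = random.Random(seed)
    prefs = [0.0] * len(means)
    ref = 0.0  # reference reward, an incremental average of all rewards
    for _ in range(plays):
        pi = softmax(prefs)
        a = rng.choices(range(len(means)), weights=pi)[0]
        r = rng.gauss(means[a], 1.0)
        prefs[a] += beta * (r - ref)   # preference update (2.10)
        ref += alpha * (r - ref)       # reference-reward update
    return softmax(prefs)

pi = reinforcement_comparison([0.0, 1.0, 0.2])
```

Rewards above the running reference push the selected action's preference up; as the policy improves, the reference rises with it, which is why the learning problem is effectively nonstationary.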

Figure 2.5: Reinforcement comparison methods versus action-value methods on the 10-armed testbed.

Reinforcement comparison methods can be very effective, sometimes performing even better than action-value methods. Figure 2.5 shows the performance of the above algorithm (α = 0.1, β = 0.1) on the 10-armed testbed. The performances of ε-greedy (ε = 0.1) action-value methods with α = 1/k and α = 0.1 are also shown for comparison.

Exercise 2.9 The softmax action-selection rule given for reinforcement comparison methods (2.9) lacks the temperature parameter, τ, used in the earlier softmax equation (2.2). Why do you think this was done? Has any important flexibility been lost here by omitting τ?

Exercise 2.10 The reinforcement comparison methods described here have two step-size parameters, α and β. Could we, in general, reduce this to one parameter by choosing α = β? What would be lost by doing this?

Exercise 2.11 (programming) Suppose the initial reference reward, r̄_0, is far too low. Whatever action is selected first will then probably increase in its probability of selection. Thus it is likely to be selected again, and increased in probability again. In this way an early action that is no better than any other could crowd out all other actions for a long time. To counteract this effect, it is common to add a factor of [1 − π_t(a_t)] to the increment in (2.10). Design and implement an experiment to determine whether or not this really improves the performance of the algorithm.

2.9 Pursuit Methods

Another class of effective learning methods for the n-armed bandit problem are pursuit methods. Pursuit methods maintain both action-value estimates and action preferences, with the preferences continually "pursuing" the action that is greedy according to the current action-value estimates. In the simplest pursuit method, the action preferences are the probabilities, π_t(a), with which each action, a, is selected on play t.

After each play, the probabilities are updated so as to make the greedy action more likely to be selected. After the tth play, let a*_{t+1} denote the greedy action (or a random sample from the greedy actions if there are more than one) for the (t+1)st play. Then the probability of selecting a*_{t+1} is incremented a fraction, β, of the way toward 1:

π_{t+1}(a*_{t+1}) = π_t(a*_{t+1}) + β [1 − π_t(a*_{t+1})],

while the probabilities of selecting the other actions are decremented toward zero:

π_{t+1}(a) = π_t(a) + β [0 − π_t(a)],   for all a ≠ a*_{t+1}.

The action values, Q_t(a), are updated in one of the ways discussed in the preceding sections, for example, to be sample averages of the observed rewards, using (2.1).

Figure 2.6: Performance of the pursuit method vis-à-vis action-value and reinforcement comparison methods on the 10-armed testbed.

Figure 2.6 shows the performance of the pursuit algorithm described above when the action values are estimated using sample averages (incrementally computed using α = 1/k). In these results, the initial action probabilities were π_0(a) = 1/n, for all a, and the parameter β was 0.01. For comparison, we also show the performance of an ε-greedy method (ε = 0.1) with action values also estimated using sample averages. The performance of the reinforcement comparison algorithm from the previous section is also shown. Although the pursuit algorithm performs the best of these three on this task at these parameter settings, the ordering could well be different in other cases. All three of these methods appear to have their uses and advantages.
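The pursuit updates described above can be sketched as follows (the arm means, horizon, and seed are illustrative; β = 0.01 and uniform initial probabilities follow the experiment described above):

```python
import random

def pursuit(means, beta=0.01, plays=4000, seed=11):
    """Action probabilities pursue the greedy action under sample averages."""
    n = len(means)
    rng = random.Random(seed)
    pi = [1.0 / n] * n            # initial action probabilities, 1/n each
    q = [0.0] * n                 # sample-average value estimates
    counts = [0] * n
    for _ in range(plays):
        a = rng.choices(range(n), weights=pi)[0]
        r = rng.gauss(means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]          # sample average, as in (2.4)
        g = max(range(n), key=lambda i: q[i])   # current greedy action
        for i in range(n):
            target = 1.0 if i == g else 0.0
            pi[i] += beta * (target - pi[i])    # pursue 1 for greedy, 0 else
    return pi, q

pi, q = pursuit([0.0, 0.5, 1.5])
```

Note that the update preserves the sum of the probabilities exactly: the total moves a fraction β of the way toward 1, and it starts at 1.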

Exercise 2.12 An ε-greedy method always selects a random action on a fraction, ε, of the time steps. How about the pursuit algorithm? Will it eventually select the optimal action with probability approaching 1?

Exercise 2.13 For many of the problems we will encounter later in this book it is not feasible to update action probabilities directly. To use pursuit methods in these cases it is necessary to modify them to use action preferences that are not probabilities but that determine action probabilities according to a softmax relationship such as the Gibbs distribution (2.9). How can the pursuit algorithm described above be modified to be used in this way? Specify a complete algorithm, including the equations for action values, preferences, and probabilities at each play.

Exercise 2.14 (programming) How well does the algorithm you proposed in Exercise 2.13 perform? Design and run an experiment assessing the performance of your method. Discuss the role of parameter settings in your experiment.

Exercise 2.15 The pursuit algorithm described above is suited only for stationary environments because the action probabilities converge, albeit slowly, to certainty. How could you combine the pursuit idea with the ε-greedy idea to obtain a method with performance close to that of the pursuit algorithm, but that always continues to explore to some small degree?

2.10 Associative Search

So far in this chapter we have considered only nonassociative tasks, in which there is no need to associate different actions with different situations. In these tasks the learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary. However, in a general reinforcement learning task there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations. To set the stage for the full problem, we briefly discuss the simplest way in which nonassociative tasks extend to the associative setting. As an example, suppose there are several different n-armed bandit tasks, and that on each play you confront one of these chosen at random. Thus, the bandit task changes randomly from play to play. This would appear to you as a single, nonstationary n-armed bandit task whose true action values change randomly from play to play. You could try using one of the methods described in this chapter that can handle nonstationarity, but unless the true action values change slowly, these methods will not work very well. Now suppose, however, that when a bandit task is selected for you, you are given some distinctive clue about its identity (but not its action values). Maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task--for instance, if red, play arm 1; if green, play arm 2.
This is an example of an associative search task, so called because it involves both trial-and-error learning in the form of search for the best actions and association of these actions with the situations in which they are best. Associative search tasks are intermediate between the n-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like our version of the n-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem. We present this problem in the next chapter and consider its ramifications throughout the rest of the book.
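The color-cue example above can be sketched as a tiny contextual bandit (the success probabilities, the ε-greedy exploration, and the horizon are illustrative assumptions, not from the text):

```python
import random

def associative_search(plays=4000, eps=0.1, seed=9):
    """Learn a separate sample-average value table per context (color)."""
    rng = random.Random(seed)
    tasks = {"red": (0.9, 0.1), "green": (0.1, 0.9)}  # P(success) per arm
    q = {c: [0.0, 0.0] for c in tasks}
    counts = {c: [0, 0] for c in tasks}
    for _ in range(plays):
        c = rng.choice(list(tasks))  # a task is chosen at random each play
        if rng.random() < eps:
            a = rng.randrange(2)     # occasional random exploration
        else:
            a = 0 if q[c][0] >= q[c][1] else 1
        r = 1.0 if rng.random() < tasks[c][a] else 0.0
        counts[c][a] += 1
        q[c][a] += (r - q[c][a]) / counts[c][a]  # per-context sample average
    # The learned policy: the greedy arm for each color signal.
    return {c: (0 if q[c][0] >= q[c][1] else 1) for c in tasks}

policy = associative_search()
```

With the cue available, the learner recovers a policy (here red → arm 0, green → arm 1) that no single, context-blind action could match.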

Exercise 2.16 Suppose you face a binary bandit task whose true action values change randomly from play to play. Specifically, suppose that for any play the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any play, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each play you are told if you are facing case A or case B (although you still don't know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?

2.11 Conclusions

We have presented in this chapter some simple ways of balancing exploration and exploitation. The ε-greedy methods choose randomly a small fraction of the time, the softmax methods grade their action probabilities according to the current action-value estimates, and the pursuit methods keep taking steps toward the current greedy action. Are these simple methods really the best we can do in terms of practically useful algorithms? So far, the answer appears to be "yes." Despite their simplicity, in our opinion the methods presented in this chapter can fairly be considered the state of the art. There are more sophisticated methods, but their complexity and assumptions make them impractical for the full reinforcement learning problem that is our real focus. Starting in Chapter 5 we present learning methods for solving the full reinforcement learning problem that use in part the simple methods explored in this chapter. Although the simple methods explored in this chapter may be the best we can do at present, they are far from a fully satisfactory solution to the problem of balancing exploration and exploitation. We conclude this chapter with a brief look at some of the current ideas that, while not yet practically useful, may point the way toward better solutions. One promising idea is to use estimates of the uncertainty of the action-value estimates to direct and encourage exploration. For example, suppose there are two actions estimated to have values slightly less than that of the greedy action, but that differ greatly in their degree of uncertainty. One estimate is nearly certain; perhaps that action has been tried many times and many rewards have been observed. The uncertainty for this action's estimated value is so low that its true value is very unlikely to be higher than the value of the greedy action. The other action is known less well, and the estimate of its value is very uncertain. The true value of this action could easily be better than that of the greedy action. Obviously, it makes more sense to explore the second action than the first.

This line of thought leads to interval estimation methods. These methods estimate for each action a confidence interval of the action's value. That is, rather than learning that the action's value is approximately 10, they learn that it is between 9 and 11 with, say, 95% confidence. The action selected is then the action whose confidence interval has the highest upper limit. This encourages exploration of actions that are uncertain and have a chance of ultimately being the best action. In some cases one can obtain guarantees that the optimal action has been found with confidence equal to the confidence factor (e.g., the 95%). Unfortunately, interval estimation methods are problematic in practice because of the complexity of the statistical methods used to estimate the confidence intervals. Moreover, the underlying statistical assumptions required by these methods are often not satisfied. Nevertheless, the idea of using confidence intervals, or some other measure of uncertainty, to encourage exploration of particular actions is sound and appealing. There is also a well-known algorithm for computing the Bayes optimal way to balance exploration and exploitation. This method is computationally intractable when done exactly, but there may be efficient ways to approximate it. In this method we assume that we know the distribution of problem instances, that is, the probability of each possible set of true action values. Given any action selection, we can then compute the probability of each possible immediate reward and the resultant posterior probability distribution over action values. This evolving distribution becomes the information state of the problem. Given a horizon, say 1000 plays, one can consider all possible actions, all possible resulting rewards, all possible next actions, all next rewards, and so on for all 1000 plays. Given the assumptions, the rewards and probabilities of each possible chain of events can be determined, and one need only pick the best. 
But the tree of possibilities grows extremely rapidly; even if there are only two actions and two rewards, the tree will have 2^{2000} leaves after the 1000 plays. This approach effectively turns the bandit problem into an instance of the full reinforcement learning problem. In the end, we may be able to use reinforcement learning methods to approximate this optimal solution. But that is a topic for current research and beyond the scope of this introductory book.
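The interval-estimation idea can be sketched as follows; this is an illustration only, which assumes Gaussian rewards with known unit variance and uses the normal 95% upper confidence limit as one simple way to realize the scheme described above:

```python
import math
import random

def interval_estimation(means, plays=2000, seed=5):
    """Select the action with the highest upper 95% confidence limit."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    q = [0.0] * n  # sample-average estimates
    for t in range(plays):
        if t < n:
            a = t  # sample each action once to initialize its interval
        else:
            # Upper limit = mean + 1.96 * sigma / sqrt(count), with the
            # reward standard deviation assumed known and equal to 1.
            a = max(range(n), key=lambda i: q[i] + 1.96 / math.sqrt(counts[i]))
        r = rng.gauss(means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]
    return counts

counts = interval_estimation([0.0, 1.0])
```

The poorly known action keeps a wide interval and so is revisited until its upper limit drops below the better action's, concentrating plays on the better arm while still guaranteeing some exploration.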

The classical solution to balancing exploration and exploitation in n-armed bandit problems is to compute special functions called Gittins indices. These provide an optimal solution to a certain kind of bandit problem more general than that considered here but that assumes the prior distribution of possible problems is known. Unfortunately, neither the theory nor the computational tractability of this

method appear to generalize to the full reinforcement learning problem that we consider in the rest of the book.

3. The Reinforcement Learning Problem

In this chapter we introduce the problem that we try to solve in the rest of the book. For us, this problem defines the field of reinforcement learning: any method that is suited to solving this problem we consider to be a reinforcement learning method. Our objective in this chapter is to describe the reinforcement learning problem in a broad sense. We try to convey the wide range of possible applications that can be framed as reinforcement learning tasks. We also describe mathematically idealized forms of the reinforcement learning problem for which precise theoretical statements can be made. We introduce key elements of the problem's mathematical structure, such as value functions and Bellman equations. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability. In this chapter we introduce this tension and discuss some of the tradeoffs and challenges that it implies.

3.1 The Agent-Environment Interface

The reinforcement learning problem is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to those actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent tries to maximize over time. A complete specification of an environment defines a task, one instance of the reinforcement learning problem.

More specifically, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, …. At each time step t, the agent receives some representation of the environment's state, s_t ∈ S, where S is the set of possible states, and on that basis selects an action, a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t. One time step later, in part as a consequence of its action, the agent receives a numerical reward, r_{t+1} ∈ ℝ, and finds itself in a new state, s_{t+1}. Figure 3.1 diagrams the agent-environment interaction.

Figure 3.1: The agent-environment interaction in reinforcement learning. At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted $\pi_t$, where $\pi_t(s, a)$ is the probability that $a_t = a$ if $s_t = s$. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms. They can be completely determined by low-level sensations, such as direct sensor readings, or they can be more high-level and abstract, such as symbolic descriptions of objects in a room. Some of what makes up a state could be based on memory of past sensations or even be entirely mental or subjective. For example, an agent could be in "the state" of not being sure where an object is, or of having just been "surprised" in some clearly defined sense. Similarly, some actions might be totally mental or computational. For example, some actions might control what an agent chooses to think about, or where it focuses its attention. In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.
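The interaction loop just described can be sketched in a few lines of code. This is a toy illustration only: the two-state environment, its dynamics, and the uniform-random policy below are invented, not taken from the text.

```python
import random

# Toy sketch of the agent-environment loop: at each step the agent picks an
# action from its policy, the environment returns the next state and a reward,
# and the agent accumulates reward over time.

def env_step(state, action):
    """Environment dynamics: map (s_t, a_t) to (s_{t+1}, r_{t+1})."""
    if state == "low" and action == "recharge":
        return "high", 0.0   # back to a full battery, no reward
    if action == "search":
        return state, 1.0    # searching finds a can
    return state, 0.0        # waiting yields nothing

def random_policy(state):
    """pi(s): here simply uniform over the three actions."""
    return random.choice(("search", "wait", "recharge"))

def run_episode(steps=5, seed=0):
    random.seed(seed)
    state, total = "high", 0.0
    for _ in range(steps):
        action = random_policy(state)        # agent selects a_t
        state, r = env_step(state, action)   # environment emits s_{t+1}, r_{t+1}
        total += r
    return total

total = run_episode()
```

Any reinforcement learning setup, however complicated, reduces to some version of this loop: the agent emits actions, the environment emits states and rewards.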

In particular, the boundary between agent and environment is not often the same as the physical boundary of a robot's or animal's body. Usually, the boundary is drawn closer to the agent than that. For example, the motors and mechanical linkages of a robot and its sensing hardware should usually be considered parts of the environment rather than parts of the agent. Similarly, if we apply the framework to a person or animal, the muscles, skeleton, and sensory organs should be considered part of the environment. Rewards, too, presumably are computed inside the physical bodies of natural and artificial learning systems, but are considered external to the agent. The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent. For example, the agent often knows quite a bit about how its rewards are computed as a function of its actions and the states in which they are taken. But we always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily. In fact, in some cases the agent may know everything about how its environment works and still face a difficult reinforcement learning task, just as we may know exactly how a puzzle like Rubik's cube works, but still be unable to solve it. The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge.

The agent-environment boundary can be located at different places for different purposes. In a complicated robot, many different agents may be operating at once, each with its own boundary. For example, one agent may make high-level decisions which form part of the states faced by a lower-level agent that implements the high-level decisions. In practice, the agent-environment boundary is determined once one has selected particular states, actions, and rewards, and thus has identified a specific decision-making task of interest. The reinforcement learning framework is a considerable abstraction of the problem of goal-directed learning from interaction.
It proposes that whatever the details of the sensory, memory, and control apparatus, and whatever objective one is trying to achieve, any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.

Of course, the particular states and actions vary greatly from application to application, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science. In this book we offer some advice and examples regarding good ways of representing states and actions, but our primary focus is on general principles for learning how to behave once the representations have been selected.

Example 3.1: Bioreactor Suppose reinforcement learning is being applied to determine moment-by-moment temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to produce useful chemicals). The actions in such an application might be target temperatures and target stirring rates that are passed to lower-level control systems that, in turn, directly activate heating elements and motors to attain the targets. The states are likely to be thermocouple and other sensory readings, perhaps filtered and delayed, plus symbolic inputs representing the ingredients in the vat and the target chemical. The rewards might be moment-by-moment measures of the rate at which the useful chemical is produced by the bioreactor. Notice that here each state is a list, or vector, of sensor readings and symbolic inputs, and each action is a vector consisting of a target temperature and a stirring rate. It is typical of reinforcement learning tasks to have states and actions with such structured representations. Rewards, on the other hand, are always single numbers.

Example 3.2: Pick-and-Place Robot Consider using reinforcement learning to control the motion of a robot arm in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the learning agent will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages. The actions in this case might be the voltages applied to each motor at each joint, and the states might be the latest readings of joint angles and velocities. The reward might be $+1$ for each object successfully picked up and placed. To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment "jerkiness" of the motion.
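The reward scheme in this example can be written down directly. A minimal sketch; the penalty coefficient below is an invented value for illustration, not one from the text:

```python
# Sketch of the pick-and-place reward described above: a positive reward for
# each successful placement, plus a small negative term for "jerky" motion.
# The penalty coefficient 0.01 is an illustrative assumption.

def reward(placed_object, jerkiness, penalty=0.01):
    """r = +1 per successful placement, minus a smoothness penalty."""
    return (1.0 if placed_object else 0.0) - penalty * jerkiness

r_success = reward(True, jerkiness=2.0)   # placement with a little jerk
r_motion = reward(False, jerkiness=5.0)   # ordinary step: small negative reward
```

Note that the agent's objective is still the cumulative reward; the shaping term only nudges it toward smooth trajectories.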

Example 3.3: Recycling Robot A mobile robot has the job of collecting empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin; it runs on a rechargeable battery. The robot's control system has components for interpreting sensory information, for navigating, and for controlling the arm and gripper. High-level decisions about how to search for cans are made by a reinforcement learning agent based on the current charge level of the battery. This agent has to decide whether the robot should (1) actively search for a can for a certain period of time, (2) remain stationary and wait for someone to bring it a can, or (3) head back to its home base to recharge its battery. This decision has to be made either periodically or whenever certain events occur, such as finding an empty can. The agent therefore has three actions, and its state is determined by the state of the battery. The rewards might be zero most of the time, but then become positive when the robot secures an empty can, or large and negative if the battery runs all the way down. In this example, the reinforcement learning agent is not the entire robot. The states it monitors describe conditions within the robot itself, not conditions of the robot's external environment. The agent's environment therefore includes the rest of the robot, which might contain other complex decision-making systems, as well as the robot's external environment.
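One way to tabulate the recycling robot's decision problem is sketched below. The state names, the per-state action sets, and the reward magnitudes are illustrative guesses consistent with the example, not values given in the text:

```python
# Sketch of the recycling robot's states, actions, and rewards as plain data.
# With a full battery there is no reason to recharge, so that action is
# omitted from the "high" state (an assumption for illustration).

STATES = ("high", "low")

ACTIONS = {
    "high": ("search", "wait"),
    "low": ("search", "wait", "recharge"),
}

REWARDS = {
    "can_found": 1.0,       # positive when the robot secures an empty can
    "nothing": 0.0,         # zero most of the time
    "battery_dead": -10.0,  # large and negative if the battery runs down
}
```

Laying the task out this way makes the point of the example concrete: the agent's state is just the battery level, and everything else, including the rest of the robot, belongs to its environment.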

Exercise 3.1 Devise three example tasks of your own that fit into the reinforcement learning framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples. Exercise 3.2 Is the reinforcement learning framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

Exercise 3.3 Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out--say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in--say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

3.2 Goals and Rewards

In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special reward signal passing from the environment to the agent. At each time step, the reward is a simple number, $r_t \in \mathbb{R}$. Informally, the agent's goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning. Although this way of formulating goals might at first appear limiting, in practice it has proved to be flexible and widely applicable. The best way to see this is to consider examples of how it has been, or could be, used. For example, to make a robot learn to walk, researchers have provided reward on each time step proportional to the robot's forward motion. In making a robot learn how to escape from a maze, the reward is often zero until it escapes, when it becomes $+1$. Another common approach in maze learning is to give a reward of $-1$ for every time step that passes prior to escape; this encourages the agent to escape as quickly as possible. To make a robot learn to find and collect empty soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of $+1$ for each can collected (and confirmed as empty). One might also want to give the robot negative rewards when it bumps into things or when somebody yells at it. For an agent to learn to play checkers or chess, the natural rewards are $+1$ for winning, $-1$ for losing, and $0$ for drawing and for all nonterminal positions.

You can see what is happening in all of these examples. The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game. The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

Newcomers to reinforcement learning are sometimes surprised that the rewards--which define the goal of learning--are computed in the environment rather than in the agent. Certainly most ultimate goals for animals are recognized by computations occurring inside their bodies, for example, by sensors for recognizing food, hunger, pain, and pleasure. Nevertheless, as we discussed in the previous section, one can redraw the agent-environment interface in such a way that these parts of the body are considered to be outside of the agent (and thus part of the agent's environment). For example, if the goal concerns a robot's internal energy reservoirs, then these are considered to be part of the environment; if the goal concerns the positions of the robot's limbs, then these too are considered to be part of the environment--that is, the agent's boundary is drawn at the interface between the limbs and their control systems. These things are considered internal to the robot but external to the learning agent. For our purposes, it is convenient to place the boundary of the learning agent not at the limit of its physical body, but at the limit of its control. The reason we do this is that the agent's ultimate goal should be something over which it has imperfect control: it should not be able, for example, to simply decree that the reward has been received in the same way that it might arbitrarily change its actions. Therefore, we place the reward source outside of the agent. This does not preclude the agent from defining for itself a kind of internal reward, or a sequence of internal rewards. Indeed, this is exactly what many reinforcement learning methods do.

3.3 Returns

So far we have been imprecise regarding the objective of learning. We have said that the agent's goal is to maximize the reward it receives in the long run. How might this be formally defined? If the sequence of rewards received after time step $t$ is denoted $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$, then what precise aspect of this sequence do we wish to maximize? In general, we seek to maximize the expected return, where the return, $R_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T, \tag{3.1}$$

where $T$ is a final time step. This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent-environment interaction breaks naturally into subsequences, which we call episodes, such as plays of a game, trips through a maze, or any sort of repeated interactions. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Tasks with episodes of this kind are called episodic tasks. In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal{S}$, from the set of all states plus the terminal state, denoted $\mathcal{S}^+$.
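The undiscounted return amounts to summing the remaining rewards of the episode. A minimal sketch; the maze episode shown, rewarded $-1$ per step until escape, is invented for illustration:

```python
# Undiscounted episodic return, as in (3.1): the return from time t is the
# sum of the rewards received after t, up to the final step T of the episode.

def episodic_return(rewards, t=0):
    """Return R_t = r_{t+1} + ... + r_T, where rewards[0] holds r_1."""
    return sum(rewards[t:])

episode = [-1, -1, -1, -1, 0]   # four steps in the maze, then escape
R0 = episodic_return(episode)   # return from the start of the episode
```

Under this reward scheme the return from the start equals minus the number of steps taken, so maximizing it means escaping as quickly as possible.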

On the other hand, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate a continual process-control task, or an application to a robot with a long life span. We call these continuing tasks. The return formulation (3.1) is problematic for continuing tasks because the final time step would be $T = \infty$, and the return, which is what we are trying to maximize, could itself easily be infinite. (For example, suppose the agent receives a reward of $+1$ at each time step.) Thus, in this book we usually use a definition of return that is slightly more complex conceptually but much simpler mathematically. The additional concept that we need is that of discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses to maximize the expected discounted return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \tag{3.2}$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the reward sequence $\{r_k\}$ is bounded. If $\gamma = 0$, the agent is "myopic" in being concerned only with maximizing immediate rewards: its objective in this case is to learn how to choose $a_t$ so as to maximize only $r_{t+1}$. If each of the agent's actions happened to influence only the immediate reward, not future rewards as well, then a myopic agent could maximize (3.2) by separately maximizing each immediate reward. But in general, acting to maximize immediate reward can reduce access to future rewards so that the return may actually be reduced. As $\gamma$ approaches 1, the objective takes future rewards into account more strongly: the agent becomes more farsighted.
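The effect of the discount rate can be seen in a few lines. The sketch below computes the discounted return (3.2) for a finite prefix of the reward sequence and compares a myopic agent with a farsighted one; the reward values are invented:

```python
# Discounted return, as in (3.2): R_t = sum over k of gamma**k * r_{t+k+1},
# here computed for a truncated reward sequence.

def discounted_return(rewards, gamma):
    """Discounted sum of a finite prefix of the reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]   # a single reward arriving four steps in the future
myopic = discounted_return(rewards, gamma=0.0)      # sees only r_{t+1}, so 0
farsighted = discounted_return(rewards, gamma=0.9)  # 0.9**3 * 10 = 7.29
```

The myopic agent assigns the distant reward no value at all, while the farsighted agent values it at most of its face value; as gamma approaches 1 the discounting vanishes.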

Example 3.4: Pole-Balancing Figure 3.2 shows a task that served as an early illustration of reinforcement learning. The objective here is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over. A failure is said to occur if the pole falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical after each failure. This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be $+1$ for every time step on which failure did not occur, so that the return at each time would be the number of steps until failure. Alternatively, we could treat pole balancing as a continuing task, using discounting. In this case the reward would be $-1$ on each failure and zero at all other times. The return at each time would then be related to $-\gamma^K$, where $K$ is the number of time steps before failure. In either case, the return is maximized by keeping the pole balanced for as long as possible.

Exercise 3.4 Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for $-1$ upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task? Exercise 3.5 Imagine that you are designing a robot to run a maze. You decide to give it a reward of $+1$ for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes--the successive runs through the maze--so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.1). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

3.4 Unified Notation for Episodic and Continuing Tasks

In the preceding section we described two kinds of reinforcement learning tasks, one in which the agent-environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), and one in which it does not (continuing tasks). The former case is mathematically easier because each action affects only the finite number of rewards subsequently received during the episode. In this book we consider sometimes one kind of problem and sometimes the other, but often both. It is therefore useful to establish one notation that enables us to talk precisely about both cases simultaneously. To be precise about episodic tasks requires some additional notation. Rather than one long sequence of time steps, we need to consider a series of episodes, each of which consists of a finite sequence of time steps. We number the time steps of each episode starting anew from zero. Therefore, we have to refer not just to $s_t$, the state representation at time $t$, but to $s_{t,i}$, the state representation at time $t$ of episode $i$ (and similarly for $a_{t,i}$, $r_{t,i}$, $\pi_{t,i}$, $T_i$, etc.). However, it turns out that, when we discuss episodic tasks we will almost never have to distinguish between different episodes. We will almost always be considering a particular single episode, or stating something that is true for all episodes. Accordingly, in practice we will almost always abuse notation slightly by dropping the explicit reference to episode number. That is, we will write $s_t$ to refer to $s_{t,i}$, and so on. We need one other convention to obtain a single notation that covers both episodic and continuing tasks. We have defined the return as a sum over a finite number of terms in one case (3.1) and as a sum over an infinite number of terms in the other (3.2). These can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero.
For example, consider the state transition diagram [diagram not reproduced here; the solid square represents the special absorbing state corresponding to the end of an episode]. Starting from $s_0$, we get the reward sequence $+1, +1, +1, 0, 0, 0, \ldots$. Summing these, we get the same return whether we sum over the first $T$ rewards (here $T = 3$) or over the full infinite sequence. This remains true even if we introduce discounting. Thus, we can define the return, in general, according to (3.2), using the convention of omitting episode numbers when they are not needed, and including the possibility that $\gamma = 1$ if the sum remains defined (e.g., because all episodes terminate). Alternatively, we can also write the return as

$$R_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}, \tag{3.3}$$

including the possibility that $T = \infty$ or $\gamma = 1$ (but not both). We use these conventions throughout the rest of the book to simplify notation and to express the close parallels between episodic and continuing tasks.
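The unification can be checked numerically: padding an episode's reward sequence with the zeros emitted by the absorbing state leaves the return unchanged, with or without discounting. A small sketch, using a reward sequence of three $+1$s like the one discussed above:

```python
# Checking the unified convention: appending zero rewards from the absorbing
# state leaves the return unchanged, whether or not discounting is used.

def discounted_return(rewards, gamma=1.0):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [1, 1, 1]            # rewards received before the terminal state
padded = episode + [0] * 100   # absorbing state emits reward 0 forever after

finite_sum = discounted_return(episode)   # sum over the first T rewards
padded_sum = discounted_return(padded)    # sum over the longer sequence
```

Because every reward past termination is zero, the finite sum of (3.1) and the (truncated) infinite sum of (3.2) agree, which is exactly what the convention exploits.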

3.5 The Markov Property

In the reinforcement learning framework, the agent makes its decisions as a function of a signal from the environment called the environment's state. In this section we discuss what is required of the state signal, and what kind of information we should and should not expect it to provide. In particular, we formally define a property of environments and their state signals that is of particular interest, called the Markov property. In this book, by "the state" we mean whatever information is available to the agent. We assume that the state is given by some preprocessing system that is nominally part of the environment. We do not address the issues of constructing, changing, or learning the state signal in this book. We take this approach not because we consider state representation to be unimportant, but in order to focus fully on the decision-making issues. In other words, our main concern is not with designing the state signal, but with deciding what action to take as a function of whatever state signal is available. Certainly the state signal should include immediate sensations such as sensory measurements, but it can contain much more than that. State representations can be highly processed versions of original sensations, or they can be complex structures built up over time from the sequence of sensations. For example, we can move our eyes over a scene, with only a tiny spot corresponding to the fovea visible in detail at any one time, yet build up a rich and detailed representation of a scene. Or, more obviously, we can look at an object, then look away, and know that it is still there. We can hear the word "yes" and consider ourselves to be in totally different states depending on the question that came before and which is no longer audible. At a more mundane level, a control system can measure position at two different times to produce a state representation including information about velocity.
In all of these cases the state is constructed and maintained on the basis of immediate sensations together with the previous state or some other memory of past sensations. In this book, we do not explore how that is done, but certainly it can be and has been done. There is no reason to restrict the state representation to immediate sensations; in typical applications we should expect the state representation to be able to inform the agent of more than that.

On the other hand, the state signal should not be expected to inform the agent of everything about the environment, or even everything that would be useful to it in making decisions. If the agent is playing blackjack, we should not expect it to know what the next card in the deck is. If the agent is answering the phone, we should not expect it to know in advance who the caller is. If the agent is a paramedic called to a road accident, we should not expect it to know immediately the internal injuries of an unconscious victim. In all of these cases there is hidden state information in the environment, and that information would be useful if the agent knew it, but the agent cannot know it because it has never received any relevant sensations. In short, we don't fault an agent for not knowing something that matters, but only for having known something and then forgotten it! What we would like, ideally, is a state signal that summarizes past sensations compactly, yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property (we define this formally below). For example, a checkers position--the current configuration of all the pieces on the board--would serve as a Markov state because it summarizes everything important about the complete sequence of positions that led to it. Much of the information about the sequence is lost, but all that really matters for the future of the game is retained. Similarly, the current position and velocity of a cannonball is all that matters for its future flight. It doesn't matter how that position and velocity came about. 
This is sometimes also referred to as an "independence of path" property because all that matters is in the current state signal; its meaning is independent of the "path," or history, of signals that have led up to it. We now formally define the Markov property for the reinforcement learning problem. To keep the mathematics simple, we assume here that there are a finite number of states and reward values. This enables us to work in terms of sums and probabilities rather than integrals and probability densities, but the argument can easily be extended to include continuous states and rewards. Consider how a general environment might respond at time $t+1$ to the action taken at time $t$. In the most general, causal case this response may depend on everything that has happened earlier. In this case the dynamics can be defined only by specifying the complete probability distribution:

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\}, \tag{3.4}$$

for all $s'$, $r$, and all possible values of the past events: $s_t, a_t, r_t, \ldots, r_1, s_0, a_0$. If the state signal has the Markov property, on the other hand, then the environment's response at $t+1$ depends only on the state and action representations at $t$, in which case the environment's dynamics can be defined by specifying only

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}, \tag{3.5}$$

for all $s'$, $r$, $s_t$, and $a_t$. In other words, a state signal has the Markov property, and is a Markov state, if and only if (3.5) is equal to (3.4) for all $s'$, $r$, and histories, $s_t, a_t, r_t, \ldots, r_1, s_0, a_0$. In this case, the environment and task as a whole are also said to have the Markov property.

If an environment has the Markov property, then its one-step dynamics (3.5) enable us to predict the next state and expected next reward given the current state and action. One can show that, by iterating this equation, one can predict all future states and expected rewards from knowledge only of the current state as well as would be possible given the complete history up to the current time. It also follows that Markov states provide the best possible basis for choosing actions. That is, the best policy for choosing actions as a function of a Markov state is just as good as the best policy for choosing actions as a function of complete histories. Even when the state signal is non-Markov, it is still appropriate to think of the state in reinforcement learning as an approximation to a Markov state. In particular, we always want the state to be a good basis for predicting future rewards and for selecting actions. In cases in which a model of the environment is learned (see Chapter 9), we also want the state to be a good basis for predicting subsequent states. Markov states provide an unsurpassed basis for doing all of these things. To the extent that the state approaches the ability of Markov states in these ways, one will obtain better performance from reinforcement learning systems. For all of these reasons, it is useful to think of the state at each time step as an approximation to a Markov state, although one should remember that it may not fully satisfy the Markov property. The Markov property is important in reinforcement learning because decisions and values are assumed to be a function only of the current state. In order for these to be effective and informative, the state representation must be informative. All of the theory presented in this book assumes Markov state signals. This means that not all the theory strictly applies to cases in which the Markov property does not strictly apply. However, the theory developed for the Markov case still helps us to understand the behavior of the algorithms, and the algorithms can be successfully applied to many tasks with states that are not strictly Markov. A full understanding of the theory of the Markov case is an essential foundation for extending it to the more complex and realistic non-Markov case. Finally, we note that the assumption of Markov state representations is not unique to reinforcement learning but is also present in most if not all other approaches to artificial intelligence.

Example 3.5: Pole-Balancing State In the pole-balancing task introduced earlier, a state signal would be Markov if it specified exactly, or made it possible to reconstruct exactly, the position and velocity of the cart along the track, the angle between the cart and the pole, and the rate at which this angle is changing (the angular velocity). In an idealized cart-pole system, this information would be sufficient to exactly predict the future behavior of the cart and pole, given the actions taken by the controller. In practice, however, it is never possible to know this information exactly because any real sensor would introduce some distortion and delay in its measurements.
Furthermore, in any real cart-pole system there are always other effects, such as the bending of the pole, the temperatures of the wheel and pole bearings, and various forms of backlash, that slightly affect the behavior of the system. These factors would cause violations of the Markov property if the state signal were only the positions and velocities of the cart and the pole. However, often the positions and velocities serve quite well as states. Some early studies of learning to solve the pole-balancing task used a coarse state signal that divided cart positions into three regions: right, left, and middle (and similar rough quantizations of the other three intrinsic state variables). This distinctly non-Markov state was sufficient to allow the task to be solved easily by reinforcement learning methods. In fact, this coarse representation may have facilitated rapid learning by forcing the learning agent to ignore fine distinctions that would not have been useful in solving the task.
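The coarse quantization just described can be sketched as follows; the region thresholds are invented for illustration, not taken from the early studies:

```python
# Sketch of a coarse, distinctly non-Markov state signal for pole-balancing:
# each continuous variable is mapped to one of three regions.
# The thresholds (-1.0 and 1.0) are illustrative assumptions.

def coarse_region(x, low=-1.0, high=1.0):
    """Quantize a continuous reading into 'left', 'middle', or 'right'."""
    if x < low:
        return "left"
    if x > high:
        return "right"
    return "middle"

def coarse_state(cart_pos, cart_vel, pole_angle, pole_vel):
    # apply the same rough quantization to all four intrinsic state variables
    return tuple(coarse_region(v) for v in (cart_pos, cart_vel, pole_angle, pole_vel))

s = coarse_state(-1.5, 0.2, 0.0, 1.4)
```

The resulting state space has only $3^4 = 81$ states, which is what made learning fast, at the price of discarding information and thus violating the Markov property.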

Example 3.6: Draw Poker In draw poker, each player is dealt a hand of five cards. There is a round of betting, in which each player exchanges some of his cards for new ones, and then there is a final round of betting. At each round, each player must match or exceed the highest bets of the other players, or else drop out (fold). After the second round of betting, the player with the best hand who has not folded is the winner and collects all the bets. The state signal in draw poker is different for each player. Each player knows the cards in his own hand, but can only guess at those in the other players' hands. A common mistake is to think that a Markov state signal should include the contents of all the players' hands and the cards remaining in the deck. In a fair game, however, we assume that the players are in principle unable to determine these things from their past observations. If a player did know them, then she could predict some future events (such as the cards one could exchange for) better than by remembering all past observations. In addition to knowledge of one's own cards, the state in draw poker should include the bets and the numbers of cards drawn by the other players. For example, if one of the other players drew three new cards, you may suspect he retained a pair and adjust your guess of the strength of his hand accordingly. The players' bets also influence your assessment of their hands. In fact, much of your past history with these particular players is part of the Markov state. Does Ellen like to bluff, or does she play conservatively? Does her face or demeanor provide clues to the strength of her hand? How does Joe's play change when it is late at night, or when he has already won a lot of money?

Although everything ever observed about the other players may have an effect on the probabilities that they are holding various kinds of hands, in practice this is far too much to remember and analyze, and most of it will have no clear effect on one's predictions and decisions. Very good poker players are adept at remembering just the key clues, and at sizing up new players quickly, but no one remembers everything that is relevant. As a result, the state representations people use to make their poker decisions are undoubtedly non-Markov, and the decisions themselves are presumably imperfect. Nevertheless, people still make very good decisions in such tasks. We conclude that the inability to have access to a perfect Markov state representation is probably not a severe problem for a reinforcement learning agent.

Exercise 3.6: Broken Vision System Imagine that you are a vision system. When you are first turned on for the day, an image floods into your camera. You can see lots of things, but not all things. You can't see objects that are occluded, and of course you can't see objects that are behind you. After seeing that first scene, do you have access to the Markov state of the environment? Suppose your camera was broken that day and you received no images at all, all day. Would you have access to the Markov state then?

3.6 Markov Decision Processes

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP). Finite MDPs are particularly important to the theory of reinforcement learning. We treat them extensively throughout this book; they are all you need to understand 90% of modern reinforcement learning. A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, $s$ and $a$, the probability of each possible next state, $s'$, is

$$\mathcal{P}^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}. \tag{3.6}$$

These quantities are called transition probabilities. Similarly, given any current state and action, $s$ and $a$, together with any next state, $s'$, the expected value of the next reward is

$$\mathcal{R}^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}. \tag{3.7}$$

These quantities, $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$, completely specify the most important aspects of the dynamics of a finite MDP (only information about the distribution of rewards around the expected value is lost). Most of the theory we present in the rest of this book implicitly assumes the environment is a finite MDP.
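These one-step dynamics are easy to represent directly in code. The sketch below is not from the text; the states, actions, and numbers are invented purely for illustration. It stores the transition probabilities as nested dictionaries and the expected rewards as a lookup table keyed by the triple $(s, a, s')$.

```python
# A minimal sketch (not from the text) of how a finite MDP's one-step
# dynamics can be stored: transition probabilities P[s][a] -> {s': prob}
# and expected rewards R[(s, a, s')]. States, actions, and numbers are
# invented for illustration.

P = {
    "s0": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 1.0}},
}
R = {
    ("s0", "a0", "s0"): 1.0,
    ("s0", "a0", "s1"): 0.0,
    ("s0", "a1", "s1"): 2.0,
    ("s1", "a0", "s0"): -1.0,
}

def next_state_probability(s, a, s_next):
    """The transition probability of moving to s_next from s under action a."""
    return P[s][a].get(s_next, 0.0)

def expected_reward(s, a, s_next):
    """The expected next reward on the transition (s, a, s')."""
    return R.get((s, a, s_next), 0.0)

# Probabilities leaving each state-action pair must sum to 1.
for state, acts in P.items():
    for act, dist in acts.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Any finite MDP can be written this way; the two functions return exactly the quantities defined above, defaulting to zero for transitions that cannot occur.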

Example 3.7: Recycling Robot MDP The recycling robot (Example 3.3) can be turned into a simple example of an MDP by simplifying it and providing some more details. (Our aim is to produce a simple example, not a particularly realistic one.) Recall that the agent makes a decision at times determined by external events (or by other parts of the robot's control system). At each such time the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. Suppose the environment works as follows. The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued (producing a low reward). The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish

two levels, high and low, so that the state set is $\mathcal{S} = \{\mathtt{high}, \mathtt{low}\}$. Let us call the possible decisions--the agent's actions--wait, search, and recharge. When the energy level is high, recharging would always be foolish, so we do not include it in the action set for this state. The agent's action sets are

$$\mathcal{A}(\mathtt{high}) = \{\mathtt{search}, \mathtt{wait}\}, \qquad \mathcal{A}(\mathtt{low}) = \{\mathtt{search}, \mathtt{wait}, \mathtt{recharge}\}.$$

If the energy level is high, then a period of active search can always be completed without risk of depleting the battery. A period of searching that begins with a high energy level leaves the energy level high with probability $\alpha$ and reduces it to low with probability $1-\alpha$. On the other hand, a period of searching undertaken when the energy level is low leaves it low with probability $\beta$ and depletes the battery with probability $1-\beta$. In the latter case, the robot must be rescued, and the battery is then recharged back to high. Each can collected by the robot counts as a unit reward, whereas a reward of $-3$ results whenever the robot has to be rescued. Let $r_{\mathtt{search}}$ and $r_{\mathtt{wait}}$, with $r_{\mathtt{search}} > r_{\mathtt{wait}}$, respectively denote the expected number of cans the robot will collect (and hence the expected reward) while searching and while waiting. Finally, to keep things simple, suppose that no cans can be collected during a run home for recharging, and that no cans can be collected on a step in which the battery is depleted. This system is then a finite MDP, and we can write down the transition probabilities and the expected rewards, as in Table 3.1.

Table 3.1: Transition probabilities and expected rewards for the finite MDP of the recycling robot example. There is a row for each possible combination of current state, $s$, next state, $s'$, and action possible in the current state, $a \in \mathcal{A}(s)$.
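One possible encoding of the table in code is sketched below. The numeric values chosen for $\alpha$, $\beta$, $r_{\mathtt{search}}$, and $r_{\mathtt{wait}}$ are assumptions made purely for illustration; only the structure follows the example.

```python
# Hypothetical encoding of the recycling robot's dynamics. The parameter
# values alpha, beta, r_search, r_wait below are illustrative assumptions.
alpha, beta = 0.8, 0.4
r_search, r_wait = 2.0, 1.0   # expected cans: searching beats waiting

# dynamics[(s, a)] -> list of (next_state, probability, expected_reward)
dynamics = {
    ("high", "search"):  [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("high", "wait"):    [("high", 1.0, r_wait)],
    ("low", "search"):   [("low", beta, r_search), ("high", 1 - beta, -3.0)],  # rescue
    ("low", "wait"):     [("low", 1.0, r_wait)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}

# Sanity check: the transitions out of each state-action pair sum to 1.
for (s, a), transitions in dynamics.items():
    assert abs(sum(p for _, p, _ in transitions) - 1.0) < 1e-9
```

Note that recharge is listed only for the low state, matching the restricted action set above, and that a depleted battery (the rescue case) transitions back to high with reward $-3$.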

A transition graph is a useful way to summarize the dynamics of a finite MDP. Figure 3.3 shows the transition graph for the recycling robot example. There are two kinds of nodes: state nodes and action nodes. There is a state node for each possible state (a large open circle labeled by the name of the state), and an action node for each state-action pair (a small solid circle labeled by the name of the action and connected by a line to the state node). Starting in state $s$ and taking action $a$ moves you along the line from state node $s$ to action node $(s, a)$. Then the environment responds with a transition to the next state's node via one of the arrows leaving action node $(s, a)$. Each arrow corresponds to a triple $(s, s', a)$, where $s'$ is the next state, and we label the arrow with the transition probability, $\mathcal{P}^{a}_{ss'}$, and the expected reward for that transition, $\mathcal{R}^{a}_{ss'}$. Note that the transition probabilities labeling the arrows leaving an action node always sum to 1.

Exercise 3.7 Assuming a finite MDP with a finite number of reward values, write an equation for the transition probabilities and the expected rewards in terms of the joint conditional distribution in (3.5).

3.7 Value Functions

Almost all reinforcement learning algorithms are based on estimating value functions--functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies. Recall that a policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. Informally, the value of a state $s$ under a policy $\pi$, denoted $V^{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $V^{\pi}(s)$ formally as

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\},$$

where $E_{\pi}\{\cdot\}$ denotes the expected value given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $V^{\pi}$ the state-value function for policy $\pi$. Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $Q^{\pi}(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s,\ a_t = a\} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a \Big\}.$$

We call $Q^{\pi}$ the action-value function for policy $\pi$.

The value functions $V^{\pi}$ and $Q^{\pi}$ can be estimated from experience. For example, if an agent follows policy $\pi$ and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state's value, $V^{\pi}(s)$, as the number of times that state is encountered approaches infinity. If separate averages are kept for each action taken in a state, then these averages will similarly converge to the action values, $Q^{\pi}(s,a)$. We call estimation methods of this kind Monte Carlo methods because they involve averaging over many random samples of actual returns. These kinds of methods are presented in Chapter 5. Of course, if there are very many states, then it may not be practical to keep separate averages for each state individually. Instead, the agent would have to maintain $V^{\pi}$ and $Q^{\pi}$ as parameterized functions and adjust the parameters to better match the observed returns. This can also produce accurate estimates, although much depends on the nature of the parameterized function approximator (Chapter 8). A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy particular recursive relationships. For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'} \Big[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \Big], \tag{3.10}$$

where it is implicit that the actions, $a$, are taken from the set $\mathcal{A}(s)$, and the next states, $s'$, are taken from the set $\mathcal{S}$, or from $\mathcal{S}^{+}$ in the case of an episodic problem. Equation (3.10) is the Bellman equation for $V^{\pi}$. It expresses a relationship between the value of a state and the values of its successor states. Think of looking ahead from one state to its possible successor states, as suggested by Figure 3.4a. Each open circle represents a state and each solid circle represents a state-action pair. Starting from state $s$, the root node at the top, the agent could take any of some set of actions--three are shown in Figure 3.4a. From each of these, the environment could respond with one of several next states, $s'$, along with a reward, $r$. The Bellman equation (3.10) averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. The value function $V^{\pi}$ is the unique solution to its Bellman equation. We show in subsequent chapters how this Bellman equation forms the basis of a number of ways to compute, approximate, and learn $V^{\pi}$. We call diagrams like those shown in Figure 3.4 backup diagrams because they diagram relationships that form the basis of the update or backup operations that are at the heart of reinforcement learning methods. These operations transfer value information back to a state (or a state-action pair) from its successor states (or state-action pairs). We use backup diagrams throughout the book to provide graphical summaries of the algorithms we discuss. (Note that unlike transition graphs, the state nodes of backup diagrams do not necessarily represent distinct states; for example, a state might be its own successor. We also omit explicit arrowheads because time always flows downward in a backup diagram.)
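The Monte Carlo averaging described above can be sketched briefly. The two-state chain, its fixed policy, and all numbers below are invented for illustration; the point is only the mechanics of averaging sampled returns.

```python
import random

# Rough sketch of Monte Carlo value estimation: average the discounted
# returns observed after visits to each state. The toy chain and its
# fixed policy here are invented purely for illustration.
random.seed(0)
gamma = 0.9

def run_episode():
    """Follow a fixed policy in a toy chain; return a list of (state, reward)."""
    trajectory, state = [], "start"
    while state != "end":
        next_state = "end" if random.random() < 0.5 else "start"
        reward = 1.0 if next_state == "end" else 0.0
        trajectory.append((state, reward))
        state = next_state
    return trajectory

returns_sum, visits = {}, {}
for _ in range(5000):
    episode = run_episode()
    G = 0.0
    # Work backwards so G accumulates the discounted return from each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns_sum[state] = returns_sum.get(state, 0.0) + G
        visits[state] = visits.get(state, 0) + 1

V = {s: returns_sum[s] / visits[s] for s in returns_sum}
```

For this chain the value of "start" satisfies $v = 0.5 \cdot 1 + 0.5 \cdot \gamma v$, giving $v \approx 0.91$; the sample average approaches this as episodes accumulate, just as the text describes.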

Example 3.8: Gridworld Figure 3.5a uses a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of $-1$. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of $+10$ and take the agent to A'. From state B, all actions yield a reward of $+5$ and take the agent to B'.

Suppose the agent selects all four actions with equal probability in all states. Figure 3.5b shows the value function, $V^{\pi}$, for this policy, for the discounted reward case with $\gamma = 0.9$. This value function was computed by solving the system of equations (3.10). Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to A', from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B', which has a positive value. From B', the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.
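The system of equations (3.10) for this gridworld can also be solved by repeatedly applying its right-hand side as an update until the values stop changing. The sketch below does this for the equiprobable random policy with $\gamma = 0.9$; the cell coordinates chosen for A, A', B, and B' are assumptions based on the figure, not given in the text.

```python
# Sketch of the gridworld of Example 3.8 under the equiprobable random
# policy, gamma = 0.9, evaluated by sweeping equation (3.10) to a fixed
# point. Coordinates for A, A', B, B' are assumed from the figure:
# A=(0,1) -> A'=(4,1) with reward +10; B=(0,3) -> B'=(2,3) with reward +5.
gamma, N = 0.9, 5
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east

def step(cell, move):
    """Deterministic transition: (next_cell, reward)."""
    if cell == A:
        return A2, 10.0            # all four actions from A
    if cell == B:
        return B2, 5.0             # all four actions from B
    r, c = cell[0] + move[0], cell[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return cell, -1.0              # off the grid: stay put, reward -1

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(2000):
    for cell in list(V):
        # Bellman backup: each action has probability 1/4 under this policy.
        V[cell] = sum(0.25 * (reward + gamma * V[nxt])
                      for nxt, reward in (step(cell, m) for m in moves))
```

The converged values reproduce the qualitative facts in the text: A is the best state yet worth less than 10, B is worth more than 5, and cells along the lower edge are negative.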

Figure 3.6: A golf example: the state-value function for putting (above) and the action-value function for using the driver (below).

Example 3.9: Golf To formulate playing a hole of golf as a reinforcement learning task, we count a penalty (negative reward) of $-1$ for each stroke until we hit the ball into the hole. The state is the location of the ball. The value of a state is the negative of the number of strokes to the hole from that location. Our actions are how we aim and swing at the ball, of course, and which club we select. Let us take the former as given and consider just the choice of club, which we assume is either a putter or a driver. The upper part of Figure 3.6 shows a possible state-value function, $V_{\mathrm{putt}}(s)$, for the policy that always uses the putter. The terminal state in-the-hole has a value of 0. From anywhere on the green we assume we can make a putt; these states have value $-1$. Off the green we cannot reach the hole by putting, and the value is greater. If we can reach the green from a state by putting, then that state must have value one less than the green's value, that is, $-2$. For simplicity, let us assume we can putt very precisely and deterministically, but with a limited range. This gives us the sharp contour line labeled $-2$ in the figure; all locations between that line and the green require exactly two strokes to complete the hole. Similarly, any location within putting range of the $-2$ contour line must have a value of $-3$, and so on to get all the contour lines shown in the figure. Putting doesn't get us out of sand traps, so they have a value of $-\infty$. Overall, it takes us six strokes to get from the tee to the hole by putting.

Exercise 3.8 What is the Bellman equation for action values, that is, for $Q^{\pi}$? It must give the action value $Q^{\pi}(s,a)$ in terms of the action values, $Q^{\pi}(s',a')$, of possible successors to the state-action pair $(s,a)$. As a hint, the backup diagram corresponding to this equation is given in Figure 3.4b. Show the sequence of equations analogous to (3.10), but for action values.

Exercise 3.9 The Bellman equation (3.10) must hold for each state for the value function $V^{\pi}$ shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at $+0.7$, with respect to its four neighboring states, valued at $+2.3$, $+0.4$, $-0.4$, and $+0.7$. (These numbers are accurate only to one decimal place.)

Exercise 3.10 In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant $C$ to all the rewards adds a constant, $K$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $K$ in terms of $C$ and $\gamma$?

Exercise 3.11 Now consider adding a constant $C$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

Exercise 3.12 The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, $V^{\pi}(s)$, in terms of the value at the expected leaf node, $Q^{\pi}(s,a)$, given $s_t = s$. This expectation depends on the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(s,a)$, such that no expected value notation appears in the equation.

Exercise 3.13 The value of an action, $Q^{\pi}(s,a)$, can be divided into two parts: the expected next reward, which does not depend on the policy $\pi$, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching to the possible next states:

Give the equation corresponding to this intuition and diagram for the action value, $Q^{\pi}(s,a)$, in terms of the expected next reward, $r_{t+1}$, and the expected next state value, $V^{\pi}(s_{t+1})$, given that $s_t = s$ and $a_t = a$. Then give a second equation, writing out the expected value explicitly in terms of $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$, defined respectively by (3.6) and (3.7), such that no expected value notation appears in the equation.

3.8 Optimal Value Functions

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi^{*}$. They share the same state-value function, called the optimal state-value function, denoted $V^{*}$, and defined as

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

for all $s \in \mathcal{S}$.

Optimal policies also share the same optimal action-value function, denoted $Q^{*}$, and defined as

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$. For the state-action pair $(s,a)$, this function gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy. Thus, we can write $Q^{*}$ in terms of $V^{*}$ as follows:

$$Q^{*}(s,a) = E\{\, r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\ a_t = a \,\}.$$

Example 3.10: Optimal Value Functions for Golf The lower part of Figure 3.6 shows the contours of a possible optimal action-value function, $Q^{*}(s, \mathtt{driver})$. These are the values of each state if we first play a stroke with the driver and afterward select either the driver or the putter, whichever is better. The driver enables us to hit the ball farther, but with less accuracy. We can reach the hole in two strokes from much farther away than by putting: we don't have to drive all the way to within the small $-1$ contour, but only to anywhere on the green; from there we can use the putter. The optimal action-value function gives the values after committing to a particular first action, in this case, to the driver, but afterward using whichever actions are best. The $-3$ contour is still farther out and includes the starting tee. From the tee, the best sequence of actions is two drives and one putt, sinking the ball in three strokes.

Because $V^{*}$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values (3.10). Because it is the optimal value function, however, $V^{*}$'s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman equation for $V^{*}$, or the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$V^{*}(s) = \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}^{a}_{ss'} \Big[ \mathcal{R}^{a}_{ss'} + \gamma V^{*}(s') \Big]. \tag{3.15}$$

The Bellman optimality equation for $Q^{*}$ is

$$Q^{*}(s,a) = \sum_{s'} \mathcal{P}^{a}_{ss'} \Big[ \mathcal{R}^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \Big].$$

The backup diagrams in Figure 3.7 show graphically the spans of future states and actions considered in the Bellman optimality equations for $V^{*}$ and $Q^{*}$. These are the same as the backup diagrams for $V^{\pi}$ and $Q^{\pi}$ except that arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value given some policy. Figure 3.7a graphically represents the Bellman optimality equation (3.15).

For finite MDPs, the Bellman optimality equation (3.15) has a unique solution independent of the policy. The Bellman optimality equation is actually a system of equations, one for each state, so if there are $N$ states, then there are $N$ equations in $N$ unknowns. If the dynamics of the environment are known ($\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$), then in principle one can solve this system of equations for $V^{*}$ using any one of a variety of methods for solving systems of nonlinear equations. One can solve a related set of equations for $Q^{*}$.
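In small cases the nonlinear system can also be solved by iterating the optimality backup itself, an approach treated at length in later chapters as value iteration. A hedged sketch on an invented two-state MDP (all names and numbers below are illustrative, not from the text):

```python
# Value iteration sketch: iterate the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s, a) * [R + gamma * V(s')]
# on an invented two-state MDP until it reaches its fixed point.
gamma = 0.9
dynamics = {                  # (s, a) -> [(next_state, prob, reward)]
    ("s0", "safe"):  [("s0", 1.0, 1.0)],
    ("s0", "risky"): [("s0", 0.5, 0.0), ("s1", 0.5, 0.0)],
    ("s1", "stay"):  [("s1", 1.0, 2.0)],
}
actions = {"s0": ["safe", "risky"], "s1": ["stay"]}

V = {"s0": 0.0, "s1": 0.0}
for _ in range(2000):
    for s in V:
        # Take the max over actions of the expected one-step backup.
        V[s] = max(sum(p * (r + gamma * V[s2])
                       for s2, p, r in dynamics[(s, a)])
                   for a in actions[s])
```

At the fixed point $V^{*}(\mathtt{s1}) = 2/(1-\gamma) = 20$ and $V^{*}(\mathtt{s0}) = 9/0.55 \approx 16.36$, the solution of the two-equation system for this toy MDP: the risky action's delayed payoff beats the safe action's immediate reward of 1.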

Once one has $V^{*}$, it is relatively easy to determine an optimal policy. For each state $s$, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy. You can think of this as a one-step search. If you have the optimal value function, $V^{*}$, then the actions that appear best after a one-step search will be optimal actions. Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function $V^{*}$ is an optimal policy. The term greedy is used in computer science to describe any search or decision procedure that selects alternatives based only on local or immediate considerations, without considering the possibility that such a selection may prevent future access to even better alternatives. Consequently, it describes policies that select actions based only on their short-term consequences. The beauty of $V^{*}$ is that if one uses it to evaluate the short-term consequences of actions--specifically, the one-step consequences--then a greedy policy is actually optimal in the long-term sense in which we are interested, because $V^{*}$ already takes into account the reward consequences of all possible future behavior. By means of $V^{*}$, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.

Having $Q^{*}$ makes choosing optimal actions still easier. With $Q^{*}$, the agent does not even have to do a one-step-ahead search: for any state $s$, it can simply find any action that maximizes $Q^{*}(s,a)$. The action-value function effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair. Hence, at the cost of representing a function of state-action pairs, instead of just of states, the optimal action-value function allows optimal actions to be selected without having to know anything about possible successor states and their values, that is, without having to know anything about the environment's dynamics.

Example 3.11: Bellman Optimality Equations for the Recycling Robot Using (3.15), we can explicitly give the Bellman optimality equation for the recycling robot example. To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, respectively, by h, l, s, w, and re. Since there are only two states, the Bellman optimality equation consists of two equations. The equation for $V^{*}(\mathtt{h})$ can be written as follows:

$$V^{*}(\mathtt{h}) = \max \left\{ \begin{array}{l} r_{\mathtt{s}} + \gamma \big[ \alpha V^{*}(\mathtt{h}) + (1-\alpha) V^{*}(\mathtt{l}) \big], \\ r_{\mathtt{w}} + \gamma V^{*}(\mathtt{h}) \end{array} \right\}$$
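As the discussion of $Q^{*}$ above suggests, once action values are in hand, selecting an optimal action requires no model of the environment at all: just maximize over the actions available in the current state. The $Q$ table below, using the robot's abbreviated state and action names, is invented purely for illustration.

```python
# With action values in hand, action selection needs no environment model:
# just take argmax_a Q(s, a). The Q values below are invented, not the
# solution of the robot's Bellman optimality equations.
Q = {
    ("h", "s"): 9.0, ("h", "w"): 7.5,
    ("l", "s"): 5.0, ("l", "w"): 6.0, ("l", "re"): 8.0,
}

def greedy_action(state):
    """Return the action maximizing Q(state, .) -- no one-step search needed."""
    candidates = [(a, q) for (s, a), q in Q.items() if s == state]
    return max(candidates, key=lambda pair: pair[1])[0]
```

Compare this with acting greedily from $V^{*}$, which would require looking up $\mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{a}_{ss'}$ for every action: that is exactly the one-step-ahead search that $Q^{*}$ caches.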

Example 3.12: Solving the Gridworld Suppose we solve the Bellman equation for $V^{*}$ for the simple grid task introduced in Example 3.8 and shown again in Figure 3.8a. Recall that state A is followed by a reward of $+10$ and transition to state A', while state B is followed by a reward of $+5$ and transition to state B'. Figure 3.8b shows the optimal value function, $V^{*}$, and Figure 3.8c shows the corresponding optimal policies. Where there are multiple arrows in a cell, any of the corresponding actions is optimal.

Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, and thus to solving the reinforcement learning problem. However, this solution is rarely directly useful. It is akin to an exhaustive search, looking ahead at all possibilities, computing their probabilities of occurrence and their desirabilities in terms of expected rewards. This solution relies on at least three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; (2) we have enough computational resources to complete the computation of the solution; and (3) the Markov property. For the kinds of tasks in which we are interested, one is generally not able to implement this solution exactly because various combinations of these assumptions are violated. For example, although the first and third assumptions present no problems for the game of backgammon, the second is a major impediment. Since the game has about $10^{20}$ states, it would take thousands of years on today's fastest computers to solve the Bellman equation for $V^{*}$, and the same is true for finding $Q^{*}$. In reinforcement learning one typically has to settle for approximate solutions. Many different decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation. For example, heuristic search methods can be viewed as expanding the right-hand side of (3.15) several times, up to some depth, forming a "tree" of possibilities, and then using a heuristic evaluation function to approximate $V^{*}$ at the "leaf" nodes. (Heuristic search methods such as A* are almost always based on the episodic case.) The methods of dynamic programming can be related even more closely to the Bellman optimality equation. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.
We consider a variety of such methods in the following chapters.

Exercise 3.14 Draw or describe the optimal state-value function for the golf example.

Exercise 3.15 Draw or describe the contours of the optimal action-value function for putting, $Q^{*}(s, \mathtt{putter})$, for the golf example.

Exercise 3.16 Give the Bellman equation for $Q^{*}$ for the recycling robot.

Exercise 3.17 Figure 3.8 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.2) to express this value symbolically, and then to compute it to three decimal places.

3.9 Optimality and Approximation

We have defined optimal value functions and optimal policies. Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens. For the kinds of tasks in which we are interested, optimal policies can be generated only with extreme computational cost. A well-defined notion of optimality organizes the approach to learning we describe in this book and provides a way to understand the theoretical properties of various learning algorithms, but it is an ideal that agents can only approximate to varying degrees. As we discussed above, even if we have a complete and accurate model of the environment's dynamics, it is usually not possible to simply compute an optimal policy by solving the Bellman optimality equation. For example, board games such as chess are a tiny fraction of human experience, yet large, custom-designed computers still cannot compute the optimal moves. A critical aspect of the problem facing the agent is always the computational power available to it, in particular, the amount of computation it can perform in a single time step.

The memory available is also an important constraint. A large amount of memory is often required to build up approximations of value functions, policies, and models. In tasks with small, finite state sets, it is possible to form these approximations using arrays or tables with one entry for each state (or state-action pair). This we call the tabular case, and the corresponding methods we call tabular methods. In many cases of practical interest, however, there are far more states than could possibly be entries in a table. In these cases the functions must be approximated, using some sort of more compact parameterized function representation. Our framing of the reinforcement learning problem forces us to settle for approximations. However, it also presents us with some unique opportunities for

achieving useful approximations. For example, in approximating optimal behavior, there may be many states that the agent faces with such a low probability that selecting suboptimal actions for them has little impact on the amount of reward the agent receives. Tesauro's backgammon player, for example, plays with exceptional skill even though it might make very bad decisions on board configurations that never occur in games against experts. In fact, it is possible that TD-Gammon makes bad decisions for a large fraction of the game's state set. The on-line nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states. This is one key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs.

3.10 Summary

Let us summarize the elements of the reinforcement learning problem that we have presented in this chapter. Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a particular task: the actions are the choices made by the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating the choices. Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known. A policy is a stochastic rule by which the agent selects actions as a function of states. The agent's objective is to maximize the amount of reward it receives over time. The return is the function of future rewards that the agent seeks to maximize. It has several different definitions depending upon the nature of the task and whether one wishes to discount delayed reward. The undiscounted formulation is appropriate for episodic tasks, in which the agent-environment interaction breaks naturally into episodes; the discounted formulation is appropriate for continuing tasks, in which the interaction does not naturally break into episodes but continues without limit. An environment satisfies the Markov property if its state signal compactly summarizes the past without degrading the ability to predict the future. This is rarely exactly true, but often nearly so; the state signal should be chosen or constructed so that the Markov property holds as nearly as possible. In this book we assume that this has already been done and focus on the decision-making problem: how to decide what to do as a function of whatever state signal is available. If the Markov property does hold, then the environment is called a Markov decision process (MDP). 
A finite MDP is an MDP with finite state and action sets. Most of the current theory of reinforcement learning is restricted to finite MDPs, but the methods and ideas apply more generally. A policy's value functions assign to each state, or state-action pair, the expected return from that state, or state-action pair, given that the agent uses the policy. The optimal value functions assign to each state, or state-action pair, the largest expected

return achievable by any policy. A policy whose value functions are optimal is an optimal policy. Whereas the optimal value functions for states and state-action pairs are unique for a given MDP, there can be many optimal policies. Any policy that is greedy with respect to the optimal value functions must be an optimal policy. The Bellman optimality equations are special consistency conditions that the optimal value functions must satisfy and that can, in principle, be solved for the optimal value functions, from which an optimal policy can be determined with relative ease. A reinforcement learning problem can be posed in a variety of different ways depending on assumptions about the level of knowledge initially available to the agent. In problems of complete knowledge, the agent has a complete and accurate model of the environment's dynamics. If the environment is an MDP, then such a model consists of the one-step transition probabilities and expected rewards for all states and their allowable actions. In problems of incomplete knowledge, a complete and perfect model of the environment is not available. Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made. A well-defined notion of optimality organizes the approach to learning we describe in this book and provides a way to understand the theoretical properties of various learning algorithms, but it is an ideal that reinforcement learning agents can only approximate to varying degrees. In reinforcement learning we are very much concerned with cases in which optimal solutions cannot be found but must be approximated in some way.
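As a concrete illustration of these ideas, here is a minimal value-iteration sketch that solves the Bellman optimality equations for a tiny, invented two-state MDP (all transition probabilities and rewards below are made up purely for illustration) and then reads off a greedy, hence optimal, policy:

```python
# Value iteration on a tiny hypothetical MDP with states {0, 1} and
# actions {0, 1}. P[s][a] is a list of (probability, next_state, reward)
# triples; all numbers here are invented for illustration only.
GAMMA = 0.9

P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def value_iteration(P, gamma=GAMMA, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of expected return
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # A policy that is greedy w.r.t. the optimal values is an optimal policy
    pi = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))
        for s in P
    }
    return V, pi
```

In this toy example the optimal policy always chooses action 1, and V[1] converges to 2/(1 - 0.9) = 20, since staying in state 1 yields reward 2 forever.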

Reinforcement Learning algorithms — an intuitive overview This article aims to highlight, in a non-exhaustive manner, the main types of algorithms used for reinforcement learning (RL). The goal is to provide an overview of existing RL methods at an intuitive level, avoiding any deep dive into the models or the math behind them. When it comes to explaining machine learning to those outside the field, reinforcement learning is probably the easiest sub-field for this challenge. RL is like teaching your dog (or cat, if you live your life in a challenging way) to do tricks: you provide goodies as a reward if your pet performs the trick you desire; otherwise, you punish it by withholding the treat, or by providing lemons. Dogs really hate lemons.

Beyond the controversy, RL is a more complex and challenging method to realize, but basically it deals with learning via interaction and feedback, or in other words learning to solve a task by trial and error, or in other-other words acting in an environment and receiving rewards for it. Essentially, an agent (or several) is built that can perceive and interpret the environment in which it is placed; furthermore, it can take actions and interact with it.

Terminologies To begin, let's tackle the terminology used in the field of RL.

Agent-environment interaction [Source]

Agent — the learner and the decision maker.
Environment — where the agent learns and decides what actions to perform.
Action — a set of actions which the agent can perform.
State — the state of the agent in the environment.
Reward — for each action selected by the agent the environment provides a reward. Usually a scalar value.
Policy — the decision-making function (control strategy) of the agent, which represents a mapping from situations to actions.
Value function — mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state and executing a particular policy.
Function approximator — refers to the problem of inducing a function from training examples. Standard approximators include decision trees, neural networks, and nearest-neighbour methods.
Markov decision process (MDP) — a probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. Essentially, the outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
Dynamic programming (DP) — a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.
Monte Carlo methods — a class of methods for learning value functions, which estimate the value of a state by running many trials starting at that state, then averaging the total rewards received on those trials.
Temporal Difference (TD) algorithms — a class of learning methods based on the idea of comparing temporally successive predictions. Possibly the single most fundamental idea in all of reinforcement learning.

Model — the agent’s view of the environment, which maps state-action pairs to probability distributions over states. Note that not every reinforcement learning agent uses a model of its environment.

OpenAI — a non-profit AI research company with the mission to build and share safe Artificial General Intelligence (AGI) — launched a program to “spin up” deep RL. The website provides a comprehensive introduction to the main RL algorithms. This blog will mainly follow that overview, with additional explanations.

Reinforcement Learning taxonomy as defined by OpenAI [Source]

Model-Free vs Model-Based Reinforcement Learning
“Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. … Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behaviour but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state. … Model-free methods are statistically less efficient than model-based methods, because information from the environment is combined with previous, and possibly erroneous, estimates or beliefs about state values, rather than being used directly.” (Peter Dayan and Yael Niv — Reinforcement learning: The Good, The Bad and The Ugly, 2008)
Well, that should’ve explained it. Generally: model-based learning attempts to model the environment and then chooses the optimal policy based on its learned model, while in model-free learning the agent relies on trial-and-error experience to set up the optimal policy.

I. Model-free RL The two main approaches to representing agents with model-free reinforcement learning are policy optimization and Q-learning.

I.1. Policy optimization or policy-iteration methods In policy optimization methods the agent learns directly the policy function that maps state to action. The policy is determined without using a value function. It is important to mention that there are two types of policies: deterministic and stochastic. A deterministic policy maps state to action without uncertainty; this is appropriate when you have a deterministic environment, like a chess board. A stochastic policy outputs a probability distribution over actions in a given state. Stochastic policies are needed, for example, when the state is not fully observable, as in a Partially Observable Markov Decision Process (POMDP).
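The distinction between the two policy types can be sketched in a few lines of Python; the states, actions and probabilities below are invented purely for illustration:

```python
import random

# Hypothetical illustration of the two policy types described above.

def deterministic_policy(state):
    # Maps each state to exactly one action, with no uncertainty.
    table = {"s0": "left", "s1": "right"}
    return table[state]

def stochastic_policy(state, rng=random):
    # Outputs a probability distribution over actions and samples from it.
    dist = {"s0": {"left": 0.9, "right": 0.1},
            "s1": {"left": 0.3, "right": 0.7}}[state]
    actions, probs = zip(*dist.items())
    return rng.choices(actions, weights=probs, k=1)[0]
```

Calling the deterministic policy twice on the same state always returns the same action; the stochastic policy can return different actions on repeated calls.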

I.1.1. Policy Gradient (PG) In this method, we have the policy π that has a parameter θ. This π outputs a probability distribution over actions.

π(a|s; θ) — the probability of taking action a given state s, with parameters θ. [Source] Then we must find the best parameters (θ) that maximize (optimize) a score function J(θ), given the discount factor γ and the reward r.

Policy score function J(θ) — the expected discounted reward accumulated under the policy. [Source]
Main steps: measure the quality of a policy with the policy score function, then use policy gradient ascent to find the best parameters that improve the policy. A great and detailed explanation, with all the math included, about policy gradients can be found in Jonathan Hui’s blog or in Thomas Simonini’s introductory blog on PG, with examples in TensorFlow.
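The two main steps above can be sketched with REINFORCE, a basic policy-gradient method, on an invented one-state problem with two actions; the softmax parameterization, reward scheme and learning rate are assumptions made purely for illustration:

```python
import numpy as np

# Minimal REINFORCE sketch (a basic policy-gradient method) on an invented
# one-state problem with two actions; the reward scheme is made up for
# illustration (action 1 is secretly the rewarding one).
rng = np.random.default_rng(0)
theta = np.zeros(2)     # one preference parameter per action
ALPHA = 0.1             # learning rate for gradient ascent

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)            # pi_theta: distribution over actions
    a = rng.choice(2, p=probs)        # sample an action from the policy
    r = 1.0 if a == 1 else 0.0        # invented reward signal
    # Gradient of log pi(a) for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += ALPHA * r * grad_log_pi  # ascend the score function J(theta)
```

After training, softmax(theta) places almost all of its probability mass on the rewarding action, which is exactly the "gradient ascent on the score function" step described above.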

I.1.2. Asynchronous Advantage Actor-Critic (A3C) These methods were published by Google’s DeepMind group, and the following key concepts are embedded in the name: Asynchronous: several agents are trained, each in its own copy of the environment, and the models from these agents are gathered in a master agent. The reasoning behind this idea is that the experience of each agent is independent of the experience of the others, so the overall experience available for training becomes more diverse. Advantage: similar to PG, where the update rule used the discounted returns from a set of experiences to tell the agent which actions were “good” or “bad”. Actor-critic: combines the benefits of both approaches, a policy-iteration method such as PG and a value-iteration method such as Q-learning (see below). The network estimates both a value function V(s) (how good a certain state is to be in) and a policy π(s). A simple but thorough explanation, with code implemented in TensorFlow, can be found in Arthur Juliani’s blog.

I.1.3. Trust Region Policy Optimization (TRPO) An on-policy algorithm that can be used for environments with either discrete or continuous action spaces. TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. A comprehensive introduction to TRPO is provided in this and this blog post, and a great repo provides TensorFlow and OpenAI Gym based solutions.

I.1.4. Proximal Policy Optimization (PPO) Also an on-policy algorithm which, similarly to TRPO, can operate on discrete or continuous action spaces. PPO shares its motivation with TRPO: how can we increase policy improvement without risking performance collapse? The idea is that PPO improves the stability of the actor training by limiting the policy update at each training step. PPO became popular when OpenAI made a breakthrough in deep RL by releasing an algorithm trained to play Dota 2 that won against some of the best players in the world. See the description on this page. For a deep dive into PPO, visit this blog.

I.2. Q-learning or value-iteration methods Q-learning learns the action-value function Q(s, a): how good it is to take an action at a particular state. Basically, a scalar value is assigned to an action a given the state s. The following chart provides a good representation of the algorithm.

Q-learning steps [Source]
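The steps in the chart can be sketched as tabular Q-learning on an invented toy corridor task; all constants below (learning rate, discount factor, epsilon) are illustrative choices, not canonical ones:

```python
import random

# Tabular Q-learning sketch on an invented one-dimensional corridor:
# states 0..4, where reaching state 4 pays reward 1 and ends the episode.
N_STATES = 5
ACTIONS = [-1, +1]               # move left / move right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
rng = random.Random(0)

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection (ties broken at random early on)
        if rng.random() < EPS:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: (Q[(s, a_)], rng.random()))
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + (0.0 if done else GAMMA * max(Q[(s2, a_)] for a_ in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# The greedy policy recovered from the learned table moves right everywhere.
greedy = {s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES - 1)}
```

The loop mirrors the chart: select an action (epsilon-greedily), observe reward and next state, then update the Q-value toward the bootstrapped target.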

I.2.1 Deep Q Neural Network (DQN) DQN is Q-learning with neural networks. The motivation behind it is simply related to environments with big state spaces, where defining a Q-table would be a very complex, challenging and time-consuming task. Instead of a Q-table, neural networks approximate the Q-values for each action based on the state. For a deep dive into DQN, visit this course and play Doom meanwhile.
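The core idea, a network that maps a state to one Q-value per action in place of a table, can be sketched as follows; the layer sizes and random weights are invented for illustration, and a real DQN would also train these weights, typically with experience replay and a target network:

```python
import numpy as np

# Sketch of the idea behind DQN: a tiny neural network that maps a state
# vector to one Q-value per action, replacing a Q-table. All sizes and
# weights here are illustrative and untrained.
rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2
W1 = rng.normal(0, 0.1, (STATE_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))

def q_values(state):
    h = np.maximum(0.0, state @ W1)   # ReLU hidden layer
    return h @ W2                     # one approximate Q-value per action

def act(state, eps=0.1):
    # Epsilon-greedy action selection over the network's Q-value outputs
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(state)))
```

The point of the sketch is the shape of the computation: instead of indexing a table by (state, action), the network generalizes across states that never appear in training.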

I.2.2 C51 C51 is an algorithm proposed by Bellemare et al. to perform iterative approximation of the value distribution Z using the distributional Bellman equation. The number 51 represents the use of 51 discrete values to parameterize the value distribution Z(s, a). See the original paper here and, for a deep dive, follow this exploratory tutorial with an implementation in Keras.

I.2.3 Distributional Reinforcement Learning with Quantile Regression (QR-DQN) In QR-DQN, for each state-action pair, instead of estimating a single value, a distribution of values is learned. The distribution of the values, rather than just the average, can improve the policy. This means that quantiles are learned: threshold values attached to certain probabilities in the cumulative distribution function. See the paper for the method here and an easy implementation using PyTorch here.

I.2.4 Hindsight Experience Replay (HER) In the Hindsight Experience Replay method, a DQN is basically supplied with a state and a desired end-state, or in other words a goal. It allows the agent to learn quickly when rewards are sparse, that is, when the rewards are uniform most of the time, with only a few rare reward values that really stand out. For a better understanding, besides the paper, check out this blog post; for coding, this GitHub repository.

I.3 Hybrid Simply as it sounds, these methods combine the strengths of Q-learning and policy gradients: both the policy function that maps state to action and the action-value function that provides a value for each action are learned. Some hybrid model-free algorithms are: Deep Deterministic Policy Gradients (DDPG): paper and code; Soft Actor-Critic (SAC): paper and code; Twin Delayed Deep Deterministic Policy Gradients (TD3): paper and code.

II. Model-based RL Model-based RL has a strong influence from control theory: the goal is to plan through a control function f(s, a) to choose the optimal actions. Think of it as the RL field where the laws of physics are provided by the creator. The drawback of model-based methods is that, although they bring more assumptions and approximations to bear on a given task, they may be limited to those specific types of tasks. There are two main approaches: learning the model, or learning given the model.

II.1. Learn the Model To learn the model, a base policy (such as a random or any educated policy) is run while the trajectory is observed, and the model is fitted using the sampled data. The steps below describe the procedure:

Supervised learning is used to train a model that minimizes the least-squares error on the sampled data for the control function. An optimal trajectory is then computed using the model and a cost function; the cost function can measure how far we are from the target location and the amount of effort spent. [source]

World models: one of my favourite approaches, in which the agent can learn from its own “dreams” thanks to Variational Autoencoders. See paper and code.
Imagination-Augmented Agents (I2A): learns to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. Basically, it is a hybrid learning method, because it combines model-based and model-free methods. Paper and implementation.
Model-Based Priors for Model-Free Reinforcement Learning (MBMF): aims to bridge the gap between model-free and model-based reinforcement learning. See paper and code.
Model-Based Value Expansion (MBVE): the authors of the paper state that this method controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, value estimation improves, which in turn reduces the sample complexity of learning.
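The model-fitting step described above can be sketched as least-squares regression on sampled transitions; the linear dynamics and all numbers below are invented purely for illustration:

```python
import numpy as np

# Sketch of "learn the model": fit a dynamics model s' ≈ A s + B a by
# least squares from transitions gathered under a random base policy.
# The "true" dynamics used to generate the data are invented, and the
# data here is noise-free, so the fit recovers them exactly.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

S = rng.normal(size=(200, 2))           # sampled states
U = rng.normal(size=(200, 1))           # random base-policy actions
S_next = S @ A_true.T + U @ B_true.T    # observed next states

# Least squares: stack [s, a] and solve for [A | B] jointly
X = np.hstack([S, U])
AB, *_ = np.linalg.lstsq(X, S_next, rcond=None)
A_fit, B_fit = AB[:2].T, AB[2:].T
```

The fitted A_fit and B_fit can then be handed to a planner (e.g. a trajectory optimizer with a cost function), which is the second step of the procedure.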

Asynchronous Advantage Actor Critic (A3C) algorithm The Asynchronous Advantage Actor Critic (A3C) algorithm is one of the newer algorithms developed in the field of deep reinforcement learning. It was developed by Google's DeepMind, the artificial intelligence division of Google, and was first described in a 2016 research paper appropriately named Asynchronous Methods for Deep Reinforcement Learning. Decoding the different parts of the algorithm's name: Asynchronous: Unlike other popular deep reinforcement learning algorithms such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. These agents interact with their respective environments asynchronously, learning with each interaction. Each agent is controlled by a global network. As each agent gains more knowledge, it contributes to the total knowledge of the global network. The presence of a global network allows each agent to have more diversified training data. This setup mimics the real-life environment in which humans live, where each human gains knowledge from the experiences of other humans, thus allowing the whole “global network” to be better. Actor-Critic: Unlike some simpler techniques, which are based on either value-iteration methods or policy-gradient methods, the A3C algorithm combines the best parts of both: it predicts both the value function V(s) and the optimal policy function π(s). The learning agent uses the value of the value function (critic) to update the optimal policy function (actor). Note that here the policy function means the probabilistic distribution over the action space. To be exact, the learning agent determines the conditional probability P(a|s; θ), i.e. the parametrized probability that the agent chooses action a when in state s.
Advantage: Typically, in the implementation of Policy Gradient, the value of the discounted returns is used to tell the agent which of its actions were rewarding and which were penalized. By using the value of the advantage instead, the agent also learns how much better the rewards were than its expectation. This gives the agent new-found insight into the environment, and thus the learning process is better. The advantage metric is given by the following expression:

Advantage: A = Q(s, a) – V(s)

The following pseudo-code is adapted from the research paper linked above:

// Assume global shared parameter vectors θ and θv, and global shared counter T = 0
// Assume thread-specific parameter vectors θ' and θ'v
repeat
    Reset gradients: dθ ← 0 and dθv ← 0
    Synchronise thread-specific parameters: θ' = θ, θ'v = θv
    t_start = t
    Obtain state s_t
    repeat
        Simulate action a_t according to policy π(a_t | s_t; θ')
        Receive reward r_t and next state s_(t+1)
        t++; T++
    until s_t is terminal or t − t_start == t_max
    if (s_t is terminal)
        R = 0
    else
        R = V(s_t; θ'v)
    for (i = t − 1; i >= t_start; i--)
        R = r_i + γR
        Accumulate gradients w.r.t. θ': dθ ← dθ + ∇θ' log π(a_i | s_i; θ') (R − V(s_i; θ'v))
        Accumulate gradients w.r.t. θ'v: dθv ← dθv + ∂(R − V(s_i; θ'v))² / ∂θ'v
    Perform asynchronous update of θ using dθ and of θv using dθv
until T > T_max

Where:
T_max – maximum number of iterations
dθ – change in the global parameter vector
R – total (discounted) reward
π – policy function
V – value function
γ – discount factor
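The backward return-and-advantage computation at the heart of the pseudo-code above can be sketched in Python as follows; any concrete rewards, values and discount factor supplied to it are illustrative:

```python
# Sketch of the backward pass an A3C worker performs at the end of a
# rollout: accumulate the bootstrapped n-step return R and the advantage
# estimate R - V(s_i) for each step, walking from the last step backward.
def n_step_returns(rewards, values, bootstrap, gamma=0.99):
    """rewards[i] and values[i] cover steps i = t_start..t-1; bootstrap is
    R's initial value: 0 for a terminal state, else V(s_t)."""
    R = bootstrap
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R            # R <- r_i + gamma * R
        returns.append(R)
        advantages.append(R - v)     # advantage estimate A_i = R - V(s_i)
    return returns[::-1], advantages[::-1]
```

In a full A3C worker, each (log-probability, advantage) pair would contribute one term to the accumulated policy gradient, and each (return, value) pair one term to the value-function gradient.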

Advantages: This algorithm is faster and more robust than the standard reinforcement learning algorithms. It performs better than other reinforcement learning techniques because of the diversification of knowledge, as explained above. It can be used on discrete as well as continuous action spaces.

Role of AI in Autonomous Driving
Use of AI in “self-driving” tasks (i.e. as described in SAE standard J3016) is out of scope in this talk. The focus is on the role of AI in Human-Autonomous Vehicle (HAV) interaction: virtual assistants and the car as a social robot. Past, present and future of virtual assistants and AI-based intelligent agents for IoT applications.

Virtual Assistants in Desktop Environments
Quite popular in the Nineties: animated characters/avatars such as Clippy the Paperclip, supporting user learning, efficiency and productivity with a specific software product. Two main behaviours: Proactive Help and User Query (typically based on text input).

Controversies on added value of virtual assistants, designed as “virtual butlers”

Virtual Assistants in Mobile Contexts

Virtual assistants are again in fashion thanks to technological advances, such as the shift from touch to speech interfaces, and AI breakthroughs. Intelligent agents are part of the operating system: Apple Siri, MS Cortana, Google Assistant, Amazon Alexa. Two main behaviours (as with desktop assistants): Proactive Help and User Query (text or speech input). Perceived usefulness and social acceptance are far from optimal: more about curiosity than utility.

Virtual Assistants and the Internet of Things

Virtual assistants designed for mobile contexts and ubiquitous interaction are part of “smart devices” and the Internet of Things (IoT). Increasingly sophisticated: meant to process and connect contextual input with large sets of data in real time. The automotive sector is among the first areas impacted. High expectations and uncertainty.
Context of Analysis

Virtual Assistants as a type of (Disembodied) Robot
Defining a robot is a challenging task. Robots can be characterised by features (RoboLaw Project):
Use/Task: two large categories – industrial and service robots
The environment where the robot operates: physical, virtual
Nature: embodied or disembodied (virtual agent, bot)
Human-Robot Interaction: interface, communication form
Autonomy: tele-operated, semi-autonomous, fully autonomous

Virtual Assistants as Social Robots

Social Robots are a special type of service robots with advanced capabilities in interacting and communicating with humans Two main categories Robot companions (embodied) Intelligent agents, virtual assistants (disembodied)

The Place for AI in Autonomous Driving: Learning from Robotics Supporting or Replacing the Human? (e.g. under which circumstances, how) Scope of action: aiming at “Operational autonomy” (linked to e.g. safety functions only) or full autonomy (i.e. the “sentient” car)? Dealing with sensitive personal data: ethical, legal and security issues.

AI in “safety-related” Autonomous Driving Functionalities is based on Standards

Five levels (SAE international standard):
Level 1-2: Assisted Driving
Level 3: Semi-Automated (current state of the art)
Level 5: Fully Autonomous (with respect to driving functionalities)
Why not adopt the same approach for the general use of AI in autonomous driving (including the role of virtual assistants)?
The Self-Driving Car as a Social Robot:

Various approaches and products
Unlike safety-related functionalities, the scope of virtual assistants in autonomous driving is completely unregulated and based on proprietary systems, different views and approaches. Two broad categories of “social” features/functionalities:
In-Car Virtual Assistants: integration of existing virtual assistants (e.g. Siri, Cortana, Alexa, Google Assistant) or implementation of ad-hoc ones
Communication Devices (e.g. displays) for informing, possibly in a language-independent way, people outside the car (e.g. pedestrians)

Communication Displays

Roof-mounted communication display through which the self-driving car communicates its intentions/plans to people outside the car (typically pedestrians). Language-independent: communication via emoticons. Supporting safety-related objectives and other purposes, such as “shared mobility” (e.g. driverless robo-taxis, being developed by Uber).

In-Car Virtual Assistants

Safety-related functionalities Customised infotainment Personal Health and Well-Being Gateway to IoT (e.g. home control)

Gateway to IoT (e.g. home control) Virtual Assistant not only as human-car interface, but as access point to “everything”

Customised Infotainment

“IoT Connected Cars and Virtual Assistants make life simpler” (Vision of Nuance, IT company specialized in voice/natural language processing)

Personal Health and Well-being

Virtual Assistants as “Personal Coach”: recommendations and in-car services supporting health and well-being

Recent Strategic Developments

Highlights on AI and Self-Driving Cars

NVIDIA and the AI Co-Pilot (CES 2017)


Nissan featuring Microsoft Cortana (CES 2017)


Mercedes-Benz Fit and Healthy Research Car (CES 2017)


Honda’s NeuV featuring HANA (CES 2017)


Volkswagen’s Sedric (Geneva Motor Show 2017)

Social Robots, AI and Self-Driving Cars Car manufacturers and ICT companies have established strategic partnerships and are investing massively in “connected and autonomous driving”, seeing the car as a gateway to the IoT. It is too early to judge the possible impact, but a few considerations: No common approach on the role of AI, apart from its use in safety-related functionalities (in line with international standards). Commercial solutions are more conservative than they may appear. Solutions are optimized for car ownership and “brand loyalty”. Shared mobility and interoperability are supported, but it is unclear how personal data would be handled.

Social Robots, AI and Self-Driving Cars Self-driving cars turn into social robots thanks to an algorithmic AI “brain”, a big data “memory”, a “voice” (virtual assistants) and “gestures” (communication displays). Virtual assistants will “know” the user better than he/she does. The role of virtual butler (as with “Clippy”) is still predominant. “Sentient” and self-aware cars are still far away. Towards hyper-connectivity: virtual assistants as the “glue” binding humans with diverse and separate technologies, with a “human touch”.
ITST 2017

AI & Industry 4.0

The Way Ahead: Social Robots in Human Societies
“It is essential that the big ethical principles which will come to govern robotics develop in perfect harmony with Europe’s humanist values” (2016 Study on European Civil Law Rules in Robotics, commissioned by the European Parliament)
“Reflect on what kind of society we want to build and live in. This includes the robots we build and use, and tells us about the model of our society” (“Situating the Human in Social Robots”, in Vincent et al. (Eds.), Social Robots from a Human Perspective, Springer 2015)
Your Role as IEEE Members and ITST Experts
Contribute to the IEEE mission: “IEEE is the world’s largest professional association dedicated to advancing technological innovation and excellence for the benefit of humanity”.

Major Impact of AI-based Applications on Global Societies Deep concerns raised, among others, by Bill Gates, Elon Musk and Stephen Hawking Adequate Solutions Require: Input from all relevant disciplines (technical and non-technical) Joint development by all relevant actors (role of standardization)

Well-defined legal framework (principles globally agreed)

Some Recent Developments ITU “AI for Good” Global Summit (Geneva, 7-9 June 2017) Technical standardization and Policy Guidance for AI “Partnership on AI” by Google, Facebook, Amazon, IBM and Microsoft: AI solutions for the benefit of people and society Apple and Elon Musk’s OpenAI initiative not taking part IEEE Global Initiative for Ethical Considerations in AI and Autonomous Systems, launched in April 2016 Six Working Groups (P7001, P7002, P7003, P7004, P7005, P7006) WG P7006 on “Personal Data Artificial Intelligence Agent”

ARTIFICIAL INTELLIGENCE AND ROBOTICS
Introduction: Basics and Definition of Terms
Modern information technologies and the advent of machines powered by artificial intelligence (AI) have already strongly influenced the world of work in the 21st century. Computers, algorithms and software simplify everyday tasks, and it is impossible to imagine how most of our life could be managed without them. But is it also impossible to imagine how most process steps could be managed without human labour? The information economy, characterised by exponential growth, is replacing the mass-production industry based on economies of scale. When we transfer the experience of the past to the future, disturbing questions arise: what will the future world of work look like, and how long will it take to get there? Will the future world of work be a world where humans spend less time earning their livelihood? Alternatively, are mass unemployment, mass poverty and social distortions also a possible scenario for the new world, a world where robots, intelligent systems and algorithms play an increasingly central role?1 What is the future role of a legal framework that is mainly based on a 20th century industry setting?
What is already clear and certain is that new technical developments will have a fundamental impact on the global labour market within the next few years, not just on industrial jobs but on the core of human tasks in the service sector that are considered ‘untouchable’. Economic structures, working relationships, job profiles and well-established working time and remuneration models will undergo major changes. In addition to companies, employees and societies, education systems and legislators are also facing the task of meeting the new challenges resulting from constantly advancing technology. Legislators are already lagging behind, and the gap between reality and the legal framework is growing.
While the digitalisation of the labour market has a widespread impact on intellectual property, information technology, product liability, competition and labour and employment laws, this report is meant to also provide an overview of the fundamental transformation of the labour market, the organisation of work and the specific consequences for employment relationships. Additionally, labour and data privacy protection issues are to be considered. For this purpose, it is first necessary to define a few basic terms.

What is artificial intelligence? The name behind the idea of AI is John McCarthy, who began research on the subject in 1955 and assumed that each aspect of learning and other domains of intelligence can be described so precisely that they can be simulated by a machine.2

See: www.spiegel.de/wirtschaft/soziales/arbeitsmarkt-der-zukunft-die-jobfresser-kommen-a-1105032.html (last accessed on 3 August 2016). See: www.spiegel.de/netzwelt/web/john-mccarthy-der-vater-der-rechner-cloud-ist-tot-a-793795.html (last accessed on 11 February 2016).

ARTIFICIAL INTELLIGENCE AND ROBOTICS AND THEIR IMPACT ON THE WORKPLACE Even the terms ‘artificial intelligence’ and ‘intelligent human behaviour’ are not clearly defined, however.


Artificial intelligence describes the work processes of machines that would require intelligence if performed by humans. The term ‘artificial intelligence’ thus means ‘investigating intelligent problem-solving behaviour and creating intelligent computer systems’.3 There are two kinds of artificial intelligence: Weak artificial intelligence: The computer is merely an instrument for investigating cognitive processes – the computer simulates intelligence. Strong artificial intelligence: The processes in the computer are intellectual, self-learning processes. Computers can ‘understand’ by means of the right software/programming and are able to optimise their own behaviour on the basis of their former behaviour and their experience.4 This includes automatic networking with other machines, which leads to a dramatic scaling effect.

Economic fields of artificial intelligence In general, the economic use of AI can be separated into five categories:5 Deep learning This is machine learning based on a set of algorithms that attempt to model high-level abstractions in data. Unlike human workers, the machines are connected the whole time. If one machine makes a mistake, all autonomous systems will keep this in mind and will avoid the same mistake the next time. Over the long run, intelligent machines will win against every human expert.

Robotization Since the 19th century, production robots have been replacing employees because of the advancement in technology. They work more precisely than humans and cost less. Creative solutions like 3D printers and the self-learning ability of these production robots will replace human workers. Dematerialisation Thanks to automatic data recording and data processing, traditional ‘back-office’ activities are no longer in demand. Autonomous software will collect necessary information and send it to the employee who needs it.

See: http://wirtschaftslexikon.gabler.de/Archiv/74650/kuenstliche-intelligenz-ki-v12.html (last accessed on 11 February 2016). www2.cs.uni-paderborn.de/cs/ag-klbue/de/courses/ss05/gwbs/ai-intro-ss05-slides.ps.nup.pdf (last accessed on 11 February 2016).

Dettmer, Hesse, Jung, Müller and Schulz, ‘Mensch gegen Maschine’ (3 September 2016) Der Spiegel p 10 ff.


Additionally, dematerialisation leads to the phenomenon that traditional physical products are becoming software; for example, CDs and DVDs are being replaced by streaming services. The replacement of traditional event tickets, travel tickets or hard cash will be the next step, owing to the enhanced possibility of contactless payment by smartphone.

Gig economy
A rise in self-employment is typical of the new generation of employees. The gig economy is usually understood to include chiefly two forms of work: ‘crowdworking’ and ‘work on-demand via apps’, organised via networking platforms.6 There are more and more independent contractors for individual tasks that companies advertise on online platforms (eg, ‘Amazon Mechanical Turk’). Traditional employment relationships are becoming less common. Many workers are performing different jobs for different clients.

Autonomous driving
Vehicles govern themselves, using sensors to navigate without human input. Taxi and truck drivers will become obsolete. The same applies to stock managers and postal carriers if deliveries are distributed by delivery drones in the future.
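The claim under ‘Deep learning’ above – that once one connected machine makes a mistake, all machines in the fleet avoid it – can be illustrated with a minimal sketch (all class and method names here are hypothetical, not from any real system):

```python
class FleetMemory:
    """Fleet-wide store of known mistakes, shared by every connected machine."""
    def __init__(self):
        self.known_mistakes = set()


class Machine:
    def __init__(self, fleet_memory):
        self.memory = fleet_memory  # every machine holds the same shared memory

    def attempt(self, action):
        # Refuse any action the fleet has already learned is a mistake.
        if action in self.memory.known_mistakes:
            return "avoided"
        return "performed"

    def report_mistake(self, action):
        # One machine's error becomes every machine's experience.
        self.memory.known_mistakes.add(action)


fleet = FleetMemory()
a, b = Machine(fleet), Machine(fleet)
a.report_mistake("overheat-press")  # machine A makes and reports a mistake
print(b.attempt("overheat-press"))  # -> avoided: machine B never repeats it
```

The essential design point is that the machines share one memory rather than each learning in isolation, which is what produces the scaling effect described above.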

The ‘second machine age’ or the ‘internet of things’ – the fourth industrial revolution

AI will lead to a redefinition and a disruption of service models and products. While technical development leads primarily to efficiency enhancement in the production sectors, new creative and disruptive service models will revolutionise the service sector. With the support of big data analyses, these are tailored to the individual requirements of the client rather than to the needs of a company.

INDUSTRY 1.0: INDUSTRIALISATION

Industry 1.0 is known as the beginning of the industrial age, around 1800. For the first time, goods and services were produced by machines. Besides the first railways, coal mining and heavy industry, the steam engine was the essential invention of the first industrial revolution; steam engines replaced many employees, which led to social unrest. At the end of the 18th century, steam engines were introduced for the first time in factories in the UK; they were a great driving force for industrialisation, since they provided energy at any location for any purpose.7

INDUSTRY 2.0: ELECTRIFICATION

The second industrial revolution began with electrification at the end of the 19th century. The equivalent of the steam engine in the first industrial revolution was the assembly line, which was first used in the automotive industry.

See: www.ilo.org/wcmsp5/groups/public/---ed_protect/---protrav/---travail/documents/publication/wcms_443267.pdf p1 (last accessed on 26 September 2016). See: www.lmis.de/im-wandel-der-zeit-von-industrie-1-0-bis-4-0 (last accessed on 11 February 2016).

ARTIFICIAL INTELLIGENCE AND ROBOTICS AND THEIR IMPACT ON THE WORKPLACE


It helped accelerate and automate production processes. The term Industry 2.0 is characterised by separate steps being executed by workers specialised in respective areas. Serial production was born. At the same time, automatically manufactured goods were transported to different continents for the first time. This was aided by the beginning of aviation.8

INDUSTRY 3.0: DIGITALISATION

The third industrial revolution began in the 1970s and was distinguished by IT and further automation through electronics. When personal computers and the internet took hold in working life, it meant global access to information and automation of working steps. Human labour was replaced by machines in serial production. A process that was intensified in the context of Industry 4.0 was already in the offing at that time.9

INDUSTRY 4.0

The term Industry 4.0 means in essence the technical integration of cyber-physical systems (CPS) into production and logistics and the use of the ‘internet of things’ (connection between everyday objects)10 and services in (industrial) processes, including the consequences for a new creation of value, business models as well as downstream services and work organisation.11 CPS refers to the network connections between humans, machines, products, objects and ICT (information and communication technology) systems.12 Within the next five years, it is expected that over 50 billion connected machines will exist throughout the world. The introduction of AI in the service sector distinguishes the fourth industrial revolution from the third. Particularly in the field of industrial production, the term ‘automatization’ is characterised essentially by four elements:13 First, production is controlled by machines. Owing to the use of intelligent machines, production processes will be fully automated in the future, and humans will be used as a production factor only in individual cases.
The so-called ‘smart factory’, a production facility with few or without humans, is representative of this.

Second, real-time production is a core feature of Industry 4.0. An intelligent machine calculates the optimal utilisation capacity of the production facility.

See: www.lmis.de/im-wandel-der-zeit-von-industrie-1-0-bis-4-0 (last accessed on 11 February 2016). Ibid. Stiemerling, ‘“Künstliche Intelligenz” – Automatisierung geistiger Arbeit, Big Data und das Internet der Dinge’ (2015) Computer und Recht 762 ff.

Forschungsunion and acatech, ‘Deutschlands Zukunft als Produktionsstandort sichern: Umsetzungsempfehlung für das Zukunftsprojekt Industrie 4.0’ (2013) Promotorengruppe Kommunikation der Forschungsunion Wirtschaft – Wissenschaft. ‘Industrie 4.0 und die Folgen für Arbeitsmarkt und Wirtschaft’ (2015) IAB Forschungsbericht 8/2015, Institute for Employment Research, 12. Ibid, 13 f.


Lead times are short in the production process, and standstills, except those caused by technical defects, can be avoided. Within the value creation chain, the coordination of materials, information and goods is tailored exactly to demand. Stocks are kept to a minimum, but if materials needed for production fall below a certain level, the machine orders more. The same applies to finished products; the machine produces depending on incoming orders and general demand, thus reducing storage costs.

The third element is the decentralisation of production. The machine is essentially self-organised. This includes a network of the manufacturing units. In addition to material planning, the handling of orders is also fully automated.

The last element is the individualisation of production, even down to a batch of one unit. The machine of the future will be able to respond, within certain limits, to individual customer requests. No adjustments to the machines by humans are required. As a result, changeover times are eliminated. The smart factory adds certain components or, in a context of optimum distribution throughout the entire process, adapts individual stages of production to correspond with customer requests.

The term Industry 4.0 thus stands for the optimisation of the components involved in the production process (machines, operating resources, software, etc) owing to their independent communication with one another via sensors and networks.14 This is supposed to reduce production costs, particularly in the area of staff planning, giving the company a better position in international competition. Well-known examples from the field of robotics and AI are the so-called ‘smart factories’, driverless cars, delivery drones or 3D printers, which, based on an individual template, can produce highly complex things without changes in the production process or human action in any form being necessary.
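The demand-driven stocking described above – order more material only when stock falls below a certain level, and only enough to meet demand – amounts to a classic reorder-point rule. A minimal sketch (the function name and all quantities are hypothetical):

```python
def reorder_quantity(stock_level, reorder_point, target_level):
    """Simple reorder-point rule: order only when stock falls below the
    reorder point, and only enough to refill to the target level, so that
    stocks are kept to a minimum."""
    if stock_level < reorder_point:
        return target_level - stock_level
    return 0

# 40 units in stock, reorder point 50, refill target 120:
print(reorder_quantity(40, 50, 120))  # -> 80 units ordered
print(reorder_quantity(60, 50, 120))  # -> 0, stocks stay minimal
```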
Well-known service models are, for example, networking platforms like Facebook or Amazon Mechanical Turk, the economy-on-demand providers Uber and Airbnb, or sharing services, such as car sharing, Spotify and Netflix. Studies show that merely due to sharing services the turnover of the sector will grow twentyfold within the next ten years. Old industry made progress by using economies of scale in an environment of mass production, but the new information economy lives on networking effects, leading to more monopolies.15

Sandro Panagl, ‘Digitalisierung der Wirtschaft - Bedeutung Chancen und Herausforderungen’ (2015) Austrian Economic Chambers 5. See: www.bloomberg.com/news/videos/2016-05-20/forward-thinking-march-of-the-machines (last accessed on 2 November 2016).


The Impact of New Technology on the Labour Market

Both blue-collar and white-collar sectors will be affected. The further the division of labour progresses and the more precisely individual working or process steps can be described, the sooner employees can be replaced by intelligent algorithms. One third of current jobs requiring a bachelor’s degree can be performed by machines or intelligent software in the future. Individual jobs will disappear completely, and new types of jobs will come into being. It must be noted in this regard, however, that no jobs will be lost abruptly. Instead, a gradual transition will take place, which has already commenced and differs from industry to industry and from company to company.16

Advantages of robotics and intelligent algorithms

Particularly in the industrial sectors of Western high-labour-cost countries, automation and the use of production robots lead to considerable savings with regard to the cost of labour and products. While one production working hour costs the German automotive industry more than €40, the use of a robot costs between €5 and €8 per hour.17 A production robot is thus cheaper than a worker in China.18 A further aspect is that a robot cannot become ill, have children or go on strike, and is not entitled to annual leave.

An autonomous computer system does not depend on external factors, meaning that it works reliably and constantly, 24/7, and it can work in danger zones.19 As a rule, its accuracy is greater than that of a human, and it cannot be distracted by fatigue or other external circumstances.20 Work can be standardised and synchronised to a greater extent, resulting in an improvement in efficiency, better control of performance and more transparency in the company.21 In the decision-making process, autonomous systems can be guided by objective standards, so decisions can be made unemotionally, on the basis of facts. Productivity gains have so far always led to an improvement in living circumstances for everybody. The same applies to intelligent algorithms.

Brzeski and Burk, ‘Die Roboter kommen, Folgen der Automatisierung für den deutschen Arbeitsmarkt’ (2015) ING DiBa 1. See: www.bcgperspectives.com/content/articles/lean-manufacturing-innovation-robots-redefine-competitiveness/ (last accessed on 3 August 2016). Krischke and Schmidt, ‘Kollege Roboter’ (2015) 38/2015 Focus Magazin 66. See: www.faz.net/aktuell/wirtschaft/fuehrung-und-digitalisierung-mein-chef-der-roboter-14165244.html (last accessed on 8 April 2016). Haag, ‘Kollaboratives Arbeiten mit Robotern – Visionen und realistische Perspektive’ in Botthof and Hartmann (eds), Zukunft der Arbeit in Industrie 4.0 (2015) 63. Maschke and Werner, ‘Arbeiten 4.0 – Diskurs und Praxis in Betriebsvereinbarungen’ (October 2015) Hans Böckler Stiftung, Report No 14, 9.


The advantage for employees is that they have to do less manual or hard work; repetitive, monotonous work can be performed by autonomous systems. The same applies for typical back-office activities in the service sector: algorithms will collect data automatically, they will transfer data from purchasers’ to sellers’ systems, and they will find solutions for clients’ problems. Once an interface between the sellers’ and the purchasers’ system has been set up, employees are no longer required to manually enter data into an IT system.22 Employees might have more free time that they can use for creative activities or individual recreational activities. Robots and intelligent machines can have not only supporting, but even life-saving functions. Examples are robots used in medical diagnostics, which have high accuracy, or for the assessment of dangerous objects using remote control and integrated camera systems. These make it possible, for example, to defuse a bomb without a human having to come close to it. The ‘Robo Gas Inspector’,23 an inspection robot equipped with remote gas sensing technology, can inspect technical facilities even in hard-to-reach areas without putting humans at risk, for example, to detect leaks in aboveground and underground gas pipelines.

A global phenomenon

While the trends of automation and digitalisation continue to develop in developed countries, the question arises as to whether this is also happening to the same extent in developing countries. According to a 2016 study by the World Economic Forum, technically highly equipped countries such as Switzerland, the Netherlands, Singapore, Qatar or the US are considered to be particularly well prepared for the fourth industrial revolution.24 Since July 2016, the Netherlands has been the first country with a nationwide internet-of-things network, which connects more intelligent technical devices than the small country has inhabitants.25 What is relevant for each country in this respect is the degree of its technological development and the technological skills of the young people who will shape the future of the labour market. Young people in developing countries are optimistic with regard to their professional future. They have more confidence in their own abilities than many young people in developed countries. Many developing countries, however, face the problem that only those employees who have already gained substantial IT knowledge show an interest in and a willingness to improve their technological skills.26 A great advantage in a number of developing countries is that more women are gaining access to education. In the UAE, for example,

See: www.spiegel.de/karriere/roboter-im-job-werde-ich-bald-wegdigitalisiert-a-1119061.html (last accessed on 2 November 2016). German Federal Ministry for Economic Affairs and Technology, ‘Mensch-Technik-Interaktion’ (2013) 3 Autonomik Bericht 18. www3.weforum.org/docs/Media/GCR15/WEF_GCR2015-2016_NR_DE.pdf (last accessed on 15 February 2016). See: http://newatlas.com/netherlands-nationwide-iot-network/44134/ (last accessed on 28 September 2016). See: http://images.experienceinfosys.com/Web/Infosys/%7B6139fde3-3fa4-42aa-83db-ca38e78b51e6%7D_InfosysAmplifying-Human-Potential.pdf (last accessed on 18 February 2016).



most of the university graduates are female. Particularly in economic systems that were originally dominated by men, the opening up of labour markets was a great opportunity for highly qualified female professionals. Women are more likely to have better developed ‘soft skills’, which makes them an important talent pool – especially in developing countries.27 Low-labour-cost countries, such as China, India and Bangladesh, are still benefiting from their surplus of low-skilled workers, while Western companies are still outsourcing their production to these countries. If, however, these companies decide to produce in their countries of origin in the future, using production robots and only a few workers, the surplus of low-skilled workers might turn into a curse for these developing countries.28 A good example of this problem is the clothing industry, in which clothing is still often produced by hand in low-labour-cost countries such as Bangladesh or Thailand, although the work could easily be done by machines because much of it is routine. The question is how to integrate the great number of unskilled production workers into a structurally difficult labour market that depends on foreign investment. Another problem for developing countries such as India, Thailand or China is the lack of social security systems. Possible mass unemployment could lead to human catastrophes and a wave of migration.29 Accordingly, the same rule applies to developing countries as to developed countries: jobs with low or medium qualification requirements will be eliminated in the end.30 The only difference is that in developing countries there will be more routine jobs with lower or medium qualification requirements.
About 47 per cent of total US employment is at risk, whereas 70 per cent of total employment in Thailand or India is at risk.31 In many sectors, the implementation of (partly) autonomous systems requires too much of an investment at present, compared to the existing labour costs.32 In addition, companies operating in developing countries have to promote their appropriate systems in order to improve their productivity and attractiveness vis-à-vis their competitors and remain competitive in the long run. At the same time, (production) robots are becoming less expensive year by year. Replacing human manual labour with robots makes economic sense in low-labour-cost countries when the cost of human labour becomes 15 per cent higher than the cost of robotic labour.33 This will happen in countries such as Mexico by 2025, according to a study by the Boston Consulting Group. Chinese companies are already starting to build factories where robots will replace 90 per cent of human workers.34
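The 15 per cent threshold cited above can be read as a simple break-even condition. A sketch under that reading (the function name is hypothetical; the example figures are taken loosely from the German automotive numbers earlier in the text):

```python
def robot_replacement_pays_off(human_cost_per_hour, robot_cost_per_hour,
                               threshold=0.15):
    """True once human labour costs at least `threshold` (here 15%) more
    than robotic labour - the break-even point cited above."""
    return human_cost_per_hour >= robot_cost_per_hour * (1 + threshold)

# German automotive example: >EUR 40 per human hour vs EUR 8 per robot hour
print(robot_replacement_pays_off(40.0, 8.0))  # -> True
# A low-labour-cost setting where the gap is still under 15 per cent
print(robot_replacement_pays_off(8.5, 8.0))   # -> False
```

As robot costs fall year by year, the right-hand side of the comparison shrinks, which is why the study expects countries such as Mexico to cross this threshold by 2025.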

International Organization of Employers, ‘Brief on Understanding the Future of Work’ (6 July 2016) 18. UBS, ‘Extreme automation and connectivity: The global, regional, and investment implications of the Fourth Industrial Revolution’ (January 2016) 24 ff. ‘Automat trifft Armut’ (15 July 2016) 135 Handelsblatt News am Abend 6. ‘Jeder zehnte Arbeitsplatz durch Roboter gefährdet’ (19 May 2016) 115 Frankfurter Allgemeine Zeitung 20. See n29 above. See: www.bcgperspectives.com/content/articles/lean-manufacturing-innovation-robots-redefine-competitiveness (last accessed on 3 August 2016). Ibid (last accessed on 3 August 2016).

See: www.spiegel.de/wirtschaft/soziales/arbeitsmarkt-der-zukunft-die-jobfresser-kommen-a-1105032.html (last accessed on 3 August 2016).


It must therefore be assumed that in most developing countries, markets for autonomous IT systems will be opened up with a delay of a few years. The driving force will most likely be international companies, which will integrate their common systems into all production facilities around the world. In future, companies will locate where they can most easily find suitable highly qualified employees for monitoring and generating AI. If developing countries can thus provide qualified staff in the technological sector, it can be assumed that they will also be able to profit from technological change.35

Potential losers of the fourth industrial revolution

For a long time, the BRIC countries (Brazil, Russia, India and China) were considered the beacon of hope for the global economy. Owing to the increased mining of raw materials and the outsourcing of numerous Western branches of industry to low-labour-cost countries, investors expected long-term yields. However, demand for raw materials is currently very low, so Brazil and Russia are becoming less attractive. With the technical development of production robots, many companies producing in low-labour-cost countries will relocate their production back to the countries they originally came from.36 The developing countries in Central and South America will also not profit from the trend of the fourth industrial revolution. It is to be feared that these countries – like the North African countries and Indonesia – are not equipped to face automation and digitalisation, owing to the lack of education of much of the population, the lack of investment in (digital) infrastructure and the lack of a legal framework.37

Further complicating the matter is the rising birth rate in the North African and Arabic countries, which will lead to high rates of youth unemployment. For every older employee in Uganda, Mali or Nigeria, seven younger employees will enter the badly structured national labour market.38 In these countries, only 40 per cent of the younger generation is in employment, and most of these jobs are low-paid jobs without social security in the third sector.39 It does not come as a surprise that many youths – especially those who are better educated – would like to leave their countries to migrate to Western developed countries. Legal frameworks, less corruption, more social security and a better infrastructure would be necessary to avoid the

See: www.alumniportal-deutschland.org/nachhaltigkeit/wirtschaft/artikel/wachstumsmotor-digitalisierung-industrie-40-ikt.html (last accessed on 17 February 2016).

See: www.bcgperspectives.com/content/articles/lean-manufacturing-innovation-robots-redefine-competitiveness (last accessed on 3 August 2016). UBS, ‘Extreme automation and connectivity: The global, regional, and investment implications of the Fourth Industrial Revolution’ (January 2016) 24 ff. ‘Die große Migrationswelle kommt noch’ (8 August 2016) 183 Frankfurter Allgemeine Zeitung 18. Ibid.


younger generation’s migration wave. Additionally, better access to higher education and training opportunities – particularly for women – would be necessary to promote the competitiveness of these countries.40

Potential winners of the fourth industrial revolution

The winners of the digital revolution are, on the other hand, likely to be the highly developed Asian countries with good education systems, such as Singapore, Hong Kong, Taiwan and South Korea.41 These countries – together with the Scandinavian countries – have long been undertaking research and working to find digital solutions for complex issues. The digital interconnection of people in these countries is also very far advanced. The share of the population at risk of unemployment is about six per cent in these countries.42 Finally, Western developed countries will profit from the relocation of companies’ production sectors when robotic production becomes cheaper than human production in low-labour-cost countries. This will create new jobs in these countries and destroy many routine jobs in the low-labour-cost countries.

Another positive trend can be seen for India and China, which are both considered very suitable candidates for participation in the digital revolution, since much of the population has a good command of English as well as IT skills. IT knowledge is taught in schools as a key qualification. It is, therefore, not surprising that Indian and Chinese professionals have more extensive computer knowledge than their French or English colleagues.43 Not only lower salaries and wages, but also the greater number of well-qualified professionals, is why, according to Forrester Research, 25,000 IT jobs are likely to be outsourced from the UK alone to India.44 Like China, India is in the process of developing from simply being a low-labour-cost country into a Western-orientated society whose population works mainly in the tertiary sector. As the most populated countries in the world, these two countries have a high level of consumer demand. Moreover, because of their rapidly growing cities, these developing countries need highly

developed solutions in terms of logistics and environmental technologies, like the smart city, in order to increase the quality of life for city residents over the long term.

See: www3.weforum.org/docs/Media/GCR15/WEF_GCR2015-2016_NR_DE.pdf (last accessed on 15 February 2016). See: www.sueddeutsche.de/wirtschaft/schwellenlaender-ticks-sind-die-neuen-brics-1.2844010 (last accessed on 15 February 2016). ‘Jeder zehnte Arbeitsplatz durch Roboter gefährdet’ (19 May 2016) 115 Frankfurter Allgemeine Zeitung 20. See: www.experienceinfosys.com/humanpotential-infographic (last accessed on 18 February 2016). See: www.zukunftsinstitut.de/artikel/die-neuerfindung-der-arbeitswelt/ (last accessed on 15 February 2016).


The digital world market leaders are based in Silicon Valley, California. In 2015, the top ten Silicon Valley start-ups created an annual turnover of approximately US$600bn with information and communication services.45 Additionally, owing to their exponential growth, the eight leading digital platforms – Alphabet, Amazon, Facebook, etc – show a significantly higher capital market value than the leading industrial companies (eg, General Electric, Siemens or Honeywell).46 The rise of AI in the service sector, especially the gig economy, can be illustrated by the example of Uber, which saw an increase in its market value from zero to US$40bn in only six years.47 Even though more than 80 per cent of the robots sold each year are deployed in Japan, South Korea, the US and Germany48 and enhance productivity in the production sector, the new business models in the service sector are the digital future. With economic growth in this sector, the US will be particularly resistant to future economic crises. It is therefore not surprising that innovative countries like Switzerland, Germany, the US or Japan are rated best in the Global Competitiveness Index by the World Economic Forum.49 In summary, it can be said that the increase of automation and digitalisation is a global concern that, owing to the lack of financial possibilities in many developing countries, will initially be strongly focused on Western developed countries and Southeast Asia. These countries will be considered the winners of Industry 4.0 because of their technological head start and their creative service models.

Necessary skillset for employees

Owing to the great number of multidisciplinary support alternatives emerging from AI and machines, the requirements for future employees will change. There will be hardly any need for employees who do simple and/or repetitive work. Already today, the number of factory workers is constantly decreasing, and humans are ever more becoming the control mechanism of the machine. The automotive industry, where many production steps are already fully automated, is the pioneer in this respect.

The lower the demand for workers, the higher will be the companies’ demand for highly qualified employees. According to common belief, better education helps.50

See: http://bundestag.at/2015/wp-content/uploads/2015/09/kollmann-digital-leadership.pdf (last accessed on 2 November 2016) See: www.rolandberger.com/publications/publication_pdf/roland_berger_ief_plattformstudie_en_final.pdf (last accessed on 2 November 2016). See n45 above.

See: www.bcgperspectives.com/content/articles/lean-manufacturing-innovation-robots-redefine-competitiveness (last accessed on 3 August 2016). See: http://reports.weforum.org/global-competitiveness-report-2015-2016/competitiveness-rankings (last accessed on 15 February 2016). ‘Automatisierung und Arbeitslosigkeit – Bürojobs sind stärker als andere bedroht’ (15 March 2015) Süddeutsche.de Digital (last accessed on 29 December 2015).


Better education helps, however, only in certain circumstances. The additional qualification of an individual employee must be connected to the work in question. Additional qualification as an accountant will be of little benefit to the individual employee, because – over time – there is a 98 per cent probability that the work of an accountant can be done by intelligent software.51 Creative people who are talented in mathematics and the sciences are best qualified for the new labour market. Although not every future employee will be required to be an IT programmer, every employee should have a fundamental grasp of analytical and technical matters. Employees should be able to form a unit with supporting machines and algorithms, to navigate the internet comfortably and to move safely in social networks. To do this, it is necessary to know how the basic structures work. The employee should also, however, be able to examine machines and software critically. There is an increasing demand for employees who can work in strategic and complex areas as well. It is necessary not only to oversee machines, but also to coordinate them. The interfaces between humans and machines, and the overlaps in the areas of responsibility among the more flexible humans, must also be coordinated. There is thus likewise an increasing demand for future executive staff with social and interdisciplinary competence.52 Employees must be able not only to communicate with other people, but also, if necessary, to lead them effectively and coordinate with them. In addition, creativity and flexibility are becoming increasingly important. In the future, critical and problem-orientated thinking will be expected of employees as the most important requirement.53 This requires sound judgment. The expectations with respect to availability will be higher for future employees. Flexible working hours and standby duties will be the rule and no longer an exception in the labour market.
Employees will be required to focus not only on one main practice area, but also to take on several multifaceted, sometimes highly complex tasks as necessary, and also to perform as part of a team. Employees are increasingly expected to have non-formal qualifications. These include, for example, the ability to act independently, to build networks, to organise themselves and their teams with a focus on targets, and to think abstractly. Special knowledge or a flair for high-quality craftsmanship will become less important, since this work is likely to be done by intelligent software or a machine.54 Mere knowledge workers will no longer be required; the focus will rather be on how to find creative solutions to problems.55 Deals will still be made between people in the future, even if the facts may be gathered beforehand by software.56

Krischke and Schmidt, ‘Kollege Roboter’ (12 September 2015) 38/2015 Focus Magazin 66. Bochum, ‘Gewerkschaftliche Positionen in Bezug auf “Industrie 4.0”’ in Botthof/Hartmann (eds), Zukunft der Arbeit in Industrie 4.0, 36. See: http://reports.weforum.org/future-of-jobs-2016/shearable-infographics (last accessed on 11 February 2016). See n12 above, 14. See: www.zukunftsinstitut.de/artikel/die-neuerfindung-der-arbeitswelt (last accessed on 15 February 2016).

Anderson, The Future of Work? The Robot Takeover is Already Here (2015) 27.


One of the most important requirements, however, will be creativity. As one can see from the examples of Tesla, Uber or Airbnb, innovations are created not only by established market participants, but also by visionary start-ups making a name for themselves with disruptive ideas.

IV. Necessary investments

Many investments will be necessary for companies to be able to ride the Industry 4.0 wave. This applies not only to the IT sector, but equally to the development and procurement of new technical assistive machines. In addition, a multitude of (mostly external) service providers will be necessary to assist in the reorganisations. Moreover, governments in several countries must very quickly make provisions for broad coverage of broadband internet.57 In their investments, companies will focus more and more on sensor technology and IT services of any type in the years to come. In addition to newer electrical equipment of any type, these so-called equipment investments also include new production machines and their repair, installation and maintenance.58 In the area of the processing and extractive industries, these investments are of vital importance because, in the long run, costs for material and personnel can be reduced only with the aid of these investments. Without this cost reduction, these companies will no longer be able to compete. Apart from this, building investments are vital. In addition to the classic extension and conversion of a company’s own production facility and workplace, this primarily concerns fast internet across the board, without which efficient communication is not possible either among humans or between human and machine. In the course of digitalisation, companies will change their focus and invest more in other areas. Seventy-one per cent of the CEOs of the world’s biggest companies are sure that the next three years will be more important for the strategic orientation of their companies than the last 50 years.59 Therefore, investments in technical devices and the focused use of AI are necessary in all branches.

Connection between different and independent computer systems, creation of intelligent communication channels

Many companies already use intelligent systems. Industry 4.0 will add still more systems, and it often turns out to be difficult in practice to connect these to the already established systems.60 Normally, the systems do not stem from the same developer and they usually cover different ranges of tasks. In order to warrant an optimal operating procedure, however, the systems must synchronise with each other and with their user. It is thus necessary to integrate the (partially) autonomous systems into the previous work organisation, which is a huge challenge for IT experts. Only if the machines are optimally synchronised with each other and with the human being operating them can an optimal added-value chain be created (so-called 'augmented intelligence').61

See n10 above. See n54 above, 28. 'Bangen von der digitalen Zukunft' (26 June 2016) No 121 Handelsblatt News am Abend 3. See: www.mckinsey.com/industries/high-tech/our-insights/digital-america-a-tale-of-the-haves-and-have-mores (last accessed on 1 April 2016).

ARTIFICIAL INTELLIGENCE AND ROBOTICS AND THEIR IMPACT ON THE WORKPLACE

REQUIREMENTS CONCERNING ARTIFICIAL INTELLIGENCE

High standards are set for the automatic systems and their certification. First, the system must be able to learn independently, that is, to optimise its own skills.62 This happens not only through the human programming individual production steps or demonstrating them to the system, but also through the IT system gathering experience during its work and independently implementing suggestions for improvement, or even learning how to improve. This requires, in turn, that the programmer of the autonomous system understands both the employee's physical properties and the cognitive processes involved in the relevant tasks, and makes use of this understanding when programming the system.63 The core element of artificial intelligence and of a functioning production IT system is thus an interactive, lifelong process of learning from the human partner and responding to human needs.64 Moreover, the robot must be able to draw up highly complex plans as needed by the customer and to carry them out autonomously.

It is vital that the IT system comes with comprehensive 'collective' intelligence and communicates with other devices and with the human being. A production robot, in particular, is supposed to be designed in such a way that it has nearly human capabilities, for instance fine motor skills, perception, adaptability and cognition. In order to achieve its full functionality, however, it must be programmed dynamically rather than rigidly.65 The operating human must thus be able to adapt the system's functions to their individual needs if the system does not recognise them itself.

Barth, ‘Digitale Konkurrenz’ (April 2016) 19 JUVE Rechtsmarkt 24.

See n10 above. See n23 above, 12. See: www.spiegel.de/wirtschaft/soziales/arbeitsmarkt-der-zukunft-die-jobfresser-kommen-a-1105032.html (last accessed on 3 August 2016). Ibid.

AT LEAST: SMART FACTORY

The target in this regard is the so-called 'smart factory'. A smart factory is characterised by the intelligent machine taking an active part in the production process. In this context, the machines exchange information and control themselves in real time, so that production runs fully automatically. The machine takes over the digital receipt of the incoming order, the – if necessary, individual – planning of the product, the request for required materials, the production as such, the handling of the order and even the shipment of the product. The human has only a supervisory function.66 Most companies are still a long way from reaching this target, but there are many attempts in individual production areas to work towards achieving a smart factory situation. It must also be noted that the created interfaces open another gateway to the outside.67 The manufacturers of the autonomous systems must, for example, protect their own know-how against potential hacker attacks, against the customer itself and against competitors with whose systems a connection is made under certain circumstances. It is therefore recommended that contractual precautions for the (restricted) use of data also be made in addition to the technical precautions.

Preparation of future workers by equipping them with the required skills

Many employees and trade unions are hostile towards intelligent IT systems, although AI is a phenomenon without which certain industries and services would be unthinkable. Many people, for instance, have got used to small robotic vacuum cleaners. In principle, there is no structural difference between this household aid and an intelligent production system. Moreover, only 11 per cent of US employees assume that they will lose their jobs because of intelligent IT systems or production robots.68 The biggest fear is of a plant closure as a consequence of mismanagement.

The reservations of the (representatives of the) employees are primarily associated with the fear of massive job cutbacks. The machine costs money only once and then pays for itself, whereas labour costs are a major, recurring expenditure for a company. The machine or the algorithm carries out its work with a precision and reliability that a human cannot achieve. Humans can thus be considered inferior to machines in a competitive situation. The situation is aggravated by science fiction blockbusters and isolated industrial accidents with robots that cast a poor light on the robot systems. It is the responsibility of governments and companies, however, to create general acceptance, and this will be possible after a certain period of time; for example, 25 per cent of people can presently imagine being cared for by a robot when they are old.69 Employers must proceed sensitively and gradually when introducing new systems. They should establish clear rules for handling the machines and specify the relevant hierarchies, for example, that the machine has only an assistive and not a replacing function, and that the power to make decisions still lies with the human being and not vice versa. Employees should be involved in the development and the process of change at an early stage in order to grow accustomed to the new technology themselves.70

See n52 above, 35. See: http://presse.cms-hs.net/pdf/1602/20160208MarktUndMittelstand_M2M_Interview.pdf (last accessed on 9 February 2016). See: www.pewinternet.org/2016/03/10/public-predictions-for-the-future-of-workforce-automation 5 (last accessed on 30 March 2016).

VI.

Adaptation of the education system is necessary

In order to be able to meet the above-mentioned standards set for Industry 4.0, future employees must learn new key qualifications, but the educational system must also be adapted to these new framework conditions. There was agreement at the World Economic Forum 2016, for instance, that both schools and universities ‘should not teach the world as it was, but as it will be’.71 New qualification strategies for individual countries are thus needed. They must encourage students’ interest in subjects such as mathematics, information technology, science and technology when they are still in school, and teachers with digital competence must teach students how to think critically when using new media and help them to achieve a fundamental grasp of new digital and information devices.72

Furthermore, increased use should be made of the design thinking method in order to encourage creative minds at schools and universities. This method designates an integrated degree programme during which creative work at a company is accompanied by degree courses.73 Adaptability is one of the major challenges humans face, yet at the same time it can be a major strong point. The next generation of employees must learn to adapt quickly to technical, social and digital change, because it is to be expected that even a 'fifth industrial revolution' will not be long in coming. Lifelong learning is the buzzword, and it applies not only to fully automated robots but also to human beings! If an employee's field of work is automated, the employee must be able to reposition themselves or to set themselves apart from the machine through individual skills.74

Albers, Breuer, Fleschner and Gottschling, ‘Mein Freund, der Roboter’ (2015) 41/2015 Focus 78 ff. IG Metall Robotik-Fachtagung, ‘Die neuen Roboter kommen – der Mensch bleibt’ (2015).

See: www.faz.net/aktuell/wirtschaft/weltwirtschaftsforum/weltwirtschaftsforum-in-davos-das-ist-die-groessteherausforderung-der-digitalisierung-14031777.html (last accessed on 25 January 2016). Hadeler and Dross, 'ME Gesamtmetall EU-Informationen' (13 November 2015) 32/2015 RS 2. See n21 above, 18. See: www.computerwoche.de/a/die-elf-wichtigsten-soft-skills,1902818 (last accessed on 22 February 2016).

Besides tried and tested school subjects and degree courses, new degree courses and vocational occupations based on imparting extensive skills in IT, communication and the sciences must be created. This includes data processing occupations, in particular. Although previous degree courses such as classic information technology or business information technology include numerous elements of significant importance for Industry 4.0, they deal too superficially with some aspects owing to their great variety, whereas other aspects are superfluous for the intended work. For example, 'industrial cognitive science' and 'automation bionics' are suggested as innovative degree courses that deal mainly with researching and optimising the interaction between robot systems and employees.75 In addition to the area of robotics, extended degree courses in the area of big data will be necessary. Employers' demand for data artists, data scientists and big data developers is rapidly increasing. The main subjects for the professional field of data science include researching data of all types and their structures. Uniform education in this area is, however, still not available.76 Governments are responsible not only for making education possible, but also for focusing young people's interests on technical and IT jobs at an early age. This will increase the number of graduates in the long run.77

Ultimately, neither the 'tried and tested' nor the 'new' degree courses may focus solely on imparting specific technical knowledge. The employees of the future must, for instance, be given an understanding of the possibilities of technical aids. This applies, however, not only to the theoretical background, but also to practical applications and thus to handling the technical aids. US investors do not expect the new generation of employees to be technical geniuses, but employees should always be willing to learn new skills.78 A lifelong learning process characterises the new labour market, which is changing rapidly because of technical development. The challenge for schools and universities is to teach employees 'soft skills' that are becoming more important than ever, such as the ability to work in a team and to accept criticism, assertiveness, reliability, social and communicative skills and good time management. Learning 'soft skills' will prepare employees optimally for the future labour market: 'To Switch the Skills, Switch the Schools.'79

See n52 above, 40 f. See: www.pwc.de/de/prozessoptimierung/assets/pwc-big-data-bedeutung-nutzen-mehrwert.pdf, 27f (last accessed on 31 March 2016). See n14 above, 19. See: www.rolandberger.com/publications/publication_pdf/roland_berger_amcham_business_barometer_2.pdf, 8 (last accessed on 22 September 2016). Brynjolfsson and McAfee, The Second Machine Age (2014) Chapter 12: 'Learning to Race with Machines: Recommendations for Individuals'.

VII.

New job structures

About 47 per cent of total US employment is at risk: so ran the headline of the 2013 report by Frey and Osborne.80 Consistent with this, according to a survey by the Pew Research Center, 65 per cent of US citizens expect that within 50 years a robot or an intelligent algorithm will be doing their work.81 Experts hold vastly different opinions with regard to how dramatic the changes in job structures will be. Others claim that, thanks to digitalisation and automation, many employees whose jobs are at high risk will not be replaced completely, even if technical advances would allow a replacement.82 Not every occupation will be replaced by the work of machines in its entirety, but it is certain that some individual occupational activities will be performed by machines. For example, the risk of being replaced by a robot is 87 per cent for a barkeeper.83 Already today, it is technically feasible for a robotic machine to mix drinks, send clients' orders directly to the kitchen, receive complaints and accept clients' money. Nevertheless, the atmosphere in the bar or restaurant would no longer be the same. Because of the lack of acceptance by potential clients and the high acquisition costs, it is safe to say that 87 per cent of all barkeepers will not lose their jobs in the next few years. Small and medium-sized companies, in particular, are likely to shy away from technical devices because of the high acquisition costs and the lack of highly qualified specialists who can handle the new systems.84 In view of the occupational work structure and the legal, technical, ethical and social barriers, only nine to 12 per cent