Brain and Nature-Inspired Learning, Computation and Recognition 0128197951, 9780128197950

Brain and Nature-Inspired Learning, Computation and Recognition presents a systematic analysis of neural networks, nature-inspired computation, machine learning, and compressive sensing, together with their applications in image processing and recognition.


English Pages 788 [763] Year 2020


Table of contents :
Cover
Brain and Nature-Inspired Learning, Computation and Recognition
Copyright
1. Introduction
1.1 A brief introduction to the neural network
1.1.1 The development of neural networks
1.1.2 Neuron and feedforward neural network
1.1.3 Backpropagation algorithm
1.1.4 The learning paradigm of neural networks
1.2 Natural inspired computation
1.2.1 Fundamentals of nature-inspired computation
1.2.2 Evolutionary algorithm
1.2.3 Artificial immune system (AIS)
1.2.4 Other methods
1.3 Machine learning
1.3.1 Development of machine learning
1.3.2 Dimensionality reduction
1.3.3 Sparseness and low-rank
1.3.4 Semisupervised learning
1.4 Compressive sensing learning
1.4.1 The development of compressive sensing
1.4.2 Sparse representation
1.4.3 Compressive observation
1.4.4 Sparse reconstruction
1.5 Applications
1.5.1 Community detection
1.5.2 Capacitated arc routing optimization
1.5.3 Synthetic aperture radar image processing
1.5.4 Hyperspectral image processing
References
2. The models and structure of neural networks
2.1 Ridgelet neural network
2.2 Contourlet neural network
2.2.1 Nonsubsampled contourlet transforms
2.2.2 Deep contourlet neural network
2.3 Convolutional neural network
2.3.1 Convolution
2.3.2 Pooling
2.3.3 Activation function
2.3.4 Batch normalization
2.3.5 LeNet5
2.4 Recurrent artificial neural network
2.5 Generative adversarial nets
2.5.1 Biological description—human behavior
2.5.2 Data augmentation
2.5.3 Model description
2.6 Autoencoder
2.6.1 Layer-wise pretraining
2.6.2 Autoencoder network
2.7 Restricted Boltzmann machine and deep belief network
Further reading
3. Theoretical basis of natural computation
3.1 Evolutionary algorithms
3.1.1 Pattern theorem
3.1.2 Implicit parallelism
3.1.3 Building block assumption
3.2 Artificial immune system
3.2.1 Markov chain-based convergence analysis
3.2.2 Nonlinear dynamic model
3.3 Multiobjective optimization
3.3.1 Introduction
3.3.2 Mathematical concepts
3.3.3 Multiobjective optimization algorithms
3.3.3.1 The first generation of evolutionary multiobjective optimization algorithms
3.3.3.1.1 MOGA
3.3.3.1.2 NSGA
3.3.3.1.3 NPGA
3.3.3.2 The second generation of evolutionary multiobjective optimization algorithms
3.3.3.2.1 SPEA and SPEA2
3.3.3.2.2 PAES, PESA, and PESA-II
3.3.3.2.3 NSGA-II
References
4. Theoretical basis of machine learning
4.1 Dimensionality reduction
4.1.1 Subspace segmentation
4.1.2 Nonlinear dimensionality reduction
4.2 Sparseness and low rank
4.2.1 Sparse representation
4.2.2 Matrix recovery and completion
4.3 Semisupervised learning and kernel learning
4.3.1 Semisupervised learning
4.3.2 Nonparametric kernel learning
References
5. Theoretical basis of compressive sensing
5.1 Sparse representation
5.1.1 Stationary dictionary
5.1.2 Learning dictionary
5.2 Compressed observation
5.3 Sparse reconstruction
5.3.1 Relaxation methods
5.3.2 Greedy methods
5.3.3 Natural computation methods
5.3.4 Other methods
References
6. Multiobjective evolutionary algorithm (MOEA)-based sparse clustering
6.1 Introduction
6.1.1 The introduction of MOEA on constrained multiobjective optimization problems
6.1.2 An introduction to MOEA on clustering learning and classification learning
6.1.3 The introduction of MOEA on sparse spectral clustering
6.2 Modified function and feasible-guiding strategy-based constrained MOPs
6.2.1 Problem description
6.2.2 Modified objective function
6.2.3 The feasible-guiding strategy
6.2.4 Procedure for the proposed algorithm
6.3 Learning simultaneous adaptive clustering and classification learning via MOEA
6.3.1 Objective functions of MOASCC
6.3.2 The framework of MOASCC
6.3.3 Computational complexity
6.4 A sparse spectral clustering framework via MOEA
6.4.1 Mathematical description of SRMOSC
6.4.2 Extension on semisupervised clustering
6.4.3 Initialization
6.4.4 Crossover
6.4.5 Mutation
6.4.6 Laplacian matrix construction
6.4.7 Final solution selection phase
6.4.8 Complexity analysis
6.5 Experiments
6.5.1 The experiments of MOEA on constrained multiobjective optimization problems
6.5.1.1 Experimental setup
6.5.1.2 Performance metrics
6.5.1.2.1 IGD
6.5.1.2.2 Minimal spacing
6.5.1.2.3 Coverage of two sets (ς)
6.5.1.3 Comparison experiment results
6.5.2 The experiments of MOEA on clustering learning and classification learning
6.5.2.1 Experiment setup
6.5.2.2 Experiment on synthetic datasets
6.5.2.3 Experiment on real-life datasets
6.5.3 The experiments of MOEA on sparse spectral clustering
6.5.3.1 Detailed analysis of SRMOSC
6.5.3.2 Experimental comparison between SRMOSC and other algorithms
6.6 Summary
References
7. MOEA-based community detection
7.1 Introduction
7.2 Multiobjective community detection based on affinity propagation
7.2.1 Background to APMOEA
7.2.1.1 Affinity propagation method
7.2.1.2 Multiobjective optimization
7.2.2 Objective functions
7.2.3 The selection method for nondominated solutions
7.2.4 Preliminary partition by the AP method
7.2.5 Further search using multiobjective evolutionary algorithm
7.2.5.1 Representation and initialization
7.2.5.2 Genetic operators
7.2.6 Elitist strategy of the external archive
7.3 Multiobjective community detection based on similarity matrix
7.3.1 Background of GMOEA-net
7.3.1.1 Structural balance theory
7.3.1.2 Tchebycheff approach
7.3.2 Objective functions
7.3.3 The construction of similarity matrix and k-nodes update policy
7.3.3.1 The function of node similarity
7.3.3.2 The k-nodes update policy
7.3.4 Evolutionary operators
7.3.4.1 The cross-merging operator based on local node sets
7.3.4.2 The mutation operator based on similarity matrix
7.3.5 The whole framework of GMOEA-net
7.4 Experiments
7.4.1 Evaluation index
7.4.2 Networks for simulation
7.4.2.1 Computer-generated networks
7.4.2.2 Real-world networks
7.4.3 Comparison algorithms and parameter settings
7.4.3.1 Comparison algorithms
7.4.3.2 Parameter settings
7.4.4 Experiments on computer-generated networks
7.4.4.1 Experiments on APMOEA
7.4.4.2 Experiments on GMOEA-net
7.4.5 Experiments on real-world networks
7.5 Summary
References
8. Evolutionary computation-based multiobjective capacitated arc routing optimizations
8.1 Introduction
8.2 Multipopulation cooperative coevolutionary algorithm
8.2.1 Related works
8.2.1.1 The model of MO-CARP
8.2.1.2 The description of direction vector
8.2.2 Initial population and subpopulations partition
8.2.3 The fitness evaluation in each subpopulation
8.2.4 The elitism archiving mechanism
8.2.4.1 The external elitism archive
8.2.4.2 The internal elitism archive
8.2.5 The cooperative coevolutionary process
8.2.5.1 Construct evolutionary pool for each subregion
8.2.5.2 Crossover
8.2.5.3 Local search
8.2.5.4 The selection of offspring solutions and diversity preservation mechanism
8.2.6 The processing flow of MPCCA
8.3 Immune clonal algorithm via directed evolution
8.3.1 Antibody initialization
8.3.2 Immune clonal operation
8.3.3 Immune gene operations
8.3.3.1 The decomposition operation of the population
8.3.3.2 Gene recombination operator
8.3.3.3 Gene mutation operator
8.3.3.4 Directed comparison operator
8.3.3.5 Clonal selection operator
8.3.4 The processing flow of DE-ICA
8.4 Improved memetic algorithm via route distance grouping
8.4.1 Solutions for the timely replacement of IRDG-MAENS
8.4.2 Determine the regions which individuals belong to
8.4.3 The processing flow of IRDG-MAENS
8.5 Experiments
8.5.1 Test problems and experimental setup
8.5.1.1 MPCCA
8.5.1.2 DE-ICA
8.5.1.3 IRDG-MAENS
8.5.2 The performance metrics
8.5.2.1 The distance to the reference set (ID)
8.5.2.2 Purity
8.5.2.3 Hypervolume (HV)
8.5.3 Wilcoxon signed rank test
8.5.4 Comparison of the evaluation metrics
8.5.4.1 MPCCA
8.5.4.2 DE-ICA
8.5.4.3 IRDG-MAENS
8.5.5 Comparison of nondominant solutions
8.5.5.1 MPCCA
8.5.5.2 DE-ICA
8.5.5.3 IRDG-IDMAENS
8.6 Summary
References
9. Multiobjective optimization algorithm-based image segmentation
9.1 Introduction
9.2 Multiobjective evolutionary fuzzy clustering with MOEA/D
9.2.1 Fuzzy-C means clustering algorithms with local information
9.2.2 Framework of MOEFC
9.2.3 Opposition-based learning operator
9.2.4 Mixed population initialization
9.2.5 The time complexity analysis
9.3 Multiobjective immune algorithm for SAR image segmentation
9.3.1 Definitions of AIS-based, multiobjective optimization
9.3.2 The stage of features extraction and preprocessing
9.3.2.1 Watershed raw segmentation
9.3.2.2 Feature extraction using Gabor filters and GLCP
9.3.3 The immune multiobjective framework for SAR imagery segmentation
9.4 Experiments
9.4.1 The MOEFC experiments
9.4.1.1 Experimental setting of MOEFC
9.4.1.2 Segmentation results on synthetic images
9.4.1.3 Segmentation results on natural images
9.4.1.4 Segmentation results on medical images
9.4.1.5 Segmentation results on SAR images
9.4.2 The IMIS experiments
9.4.2.1 IMIS experimental settings
9.4.2.2 Analysis of experimental results
9.5 Summary
References
10. Graph-regularized feature selection based on spectral learning and subspace learning
10.1 Nonnegative spectral learning and subspace learning-based graph-regularized feature selection
10.1.1 Dual-graph nonnegative spectral learning
10.1.2 Dual-graph sparse regression
10.1.3 Feature selection
10.1.4 Optimization
10.1.5 Local structure preserving
10.1.6 Update rules for SGFS
10.2 Experiments of spectral learning and subspace learning methods for feature selection
10.2.1 Experiments and analysis of NSSRD
10.2.1.1 Experimental settings
10.2.1.2 Simple illustrative example problem
10.2.1.3 Evaluating the effectiveness of NSSRD
10.2.1.4 Clustering results and analysis
10.2.2 Experiments and analysis of SGFS
10.2.2.1 Experimental setting
10.2.2.2 Convergence test
10.2.2.3 AT&T face dataset example
10.2.2.4 Experimental results and analysis
10.2.2.5 Robustness test of algorithms
10.2.2.6 Parameter sensitivity analysis
References
11. Semisupervised learning based on nuclear norm regularization
11.1 Framework of semisupervised learning (SSL) with nuclear norm regularization
11.1.1 A general framework
11.1.2 Nuclear norm regularized model
11.1.3 Modified fixed point algorithm
11.1.4 Implementation
11.1.5 Label propagation
11.1.6 Valid kernel
11.2 Experiments and analysis
11.2.1 Compared algorithms and parameter settings
11.2.2 Synthetic data
11.2.3 Real-world data sets
11.2.4 Transduction classification results
References
12. Fast clustering methods based on learning spectral embedding
12.1 Learning spectral embedding for semisupervised clustering
12.1.1 Graph construction and spectral embedding
12.1.1.1 Symmetry-favored graph
12.1.1.2 Spectral embedding of graph Laplacian
12.1.2 Problem formulation
12.1.2.1 The unit hypersphere
12.1.2.2 Squared loss model
12.1.2.3 Hinge loss model
12.1.2.4 Clustering
12.1.3 Algorithm
12.1.4 Experiments
12.1.4.1 Parameter selection
12.1.4.2 Vector-based clustering
12.1.4.3 Graph-based clustering
12.2 Fast semisupervised clustering with enhanced spectral embedding
12.2.1 Problem formulation
12.2.1.1 Objective function
12.2.1.2 Solving the objective function
12.2.1.3 Clustering
12.2.2 Algorithm
12.2.2.1 Experimental results
12.2.2.2 Parameter selection
12.2.2.3 Toy examples
12.2.2.4 Vector-based clustering
12.2.2.5 Graph-based clustering
References
13. Fast clustering methods based on affinity propagation and density weighting
13.1 The framework of fast clustering methods based on affinity propagation and density weighting
13.1.1 Related works
13.1.1.1 AP clustering
13.1.1.2 Spectral clustering
13.1.1.3 Nyström method
13.1.1.4 Local length and global distance
13.1.2 Fast AP algorithm
13.1.2.1 Coarsening phase
13.1.2.1.1 Fast sampling algorithm
13.1.2.1.2 Determine the number of representative exemplars
13.1.2.2 Exemplar-clustering phase
13.1.2.3 Refinement phase
13.1.3 Fast two-stage spectral clustering framework
13.1.3.1 Fast two-stage AP algorithm
13.1.3.2 Determine the number of representative exemplars
13.1.3.3 Sampling phase
13.1.3.4 Fast-weighted approximation spectral clustering phase
13.1.3.5 Robustness
13.1.3.6 Fast nearest-neighbors research
13.2 Experiments and analysis
13.2.1 Experiments on the method based on affinity propagation
13.2.1.1 Synthetic data sets
13.2.1.2 Compared algorithms and parameter settings
13.2.1.3 Vector-based clustering
13.2.1.4 Evaluation metrics
13.2.1.5 Experimental results
13.2.1.6 Graph-based clustering
13.2.2 Experiments on the method based on density-weighting
13.2.2.1 Intertwined spirals data set
13.2.2.2 Real-world data sets
13.2.2.3 Compared algorithms
13.2.2.4 Algorithm performances
13.2.2.5 Spectral embedding
References
14. SAR image processing based on similarity measures and discriminant feature learning
14.1 SAR image retrieval based on similarity measures
14.1.1 Semantic classification and region-based similarity measures
14.1.1.1 Semisupervised learning
14.1.1.2 Classification recovery scheme
14.1.1.3 Improved integrated region matching measure
14.1.1.3.1 Self-adapting k-means segmentation
14.1.1.3.2 Region-based IRM distance computation
14.1.1.3.3 Improved IRM scheme
14.1.1.3.4 Edge regions calculation
14.1.1.3.5 IIRM computation
14.1.1.4 Methodology summary
14.1.1.4.1 Off-line process
14.1.1.4.2 On-line process
14.1.1.5 Experiment
14.1.1.5.1 Performance of improved integrated region matching (IIRM) measure
14.1.1.5.2 Query example (proposed method, IRM, one of the latest retrieval methods)
14.1.1.5.3 Land cover statistical analysis
14.1.2 Fusion similarity-based reranking for SAR image retrieval
14.1.2.1 Fusion similarity-based reranking
14.1.2.1.1 Preprocessing
14.1.2.1.2 Reranking
14.1.2.1.2.1 Modal-image matrix construction and fusion similarity calculation
14.1.2.1.2.2 Reranking function and solution
14.1.2.2 Experiments and discussion
14.1.2.2.1 Experiment settings
14.1.2.2.2 Numerical assessment
14.1.2.2.2.1 Based on different retrieval methods
14.1.2.2.2.2 Compared with different reranking algorithms
14.1.2.3 Influence of different parameters
14.1.2.4 Reranking efficiency
14.1.2.5 Reranking examples
14.1.3 SAR image content retrieval based on fuzzy similarity and relevance feedback
14.1.3.1 Region-based fuzzy matching
14.1.3.1.1 Introduction to the improved integrated region matching algorithm
14.1.3.1.2 RFM measure
14.1.3.1.2.1 Superpixel-based segmentation for brightness-texture regions
14.1.3.1.2.2 Multiscale edge detector-based segmentation for edge regions
14.1.3.1.2.3 Fuzzy region representation
14.1.3.1.2.4 RFM similarity calculation
14.1.3.1.2.5 RFM summarization and computational complexity
14.1.3.2 Multiple relevance feedback (MRF)
14.1.3.3 Experiments and discussion
14.1.3.3.1 Setting parameters
14.1.3.3.2 Evaluation criteria
14.1.3.3.3 Retrieval examples
14.1.3.4 Numerical evaluation
14.1.3.4.1 Performance of the RFM
14.1.3.4.2 Performance of the proposed retrieval method
14.1.3.4.3 Importance of the multiple RF schemes' integration
14.1.3.4.4 Significance of the RFM Gaussian kernel
14.1.3.4.5 Influences of different parameters
14.2 SAR image change detection based on spatial coding and similarity
14.2.1 Saliency-guided change detection for SAR imagery using a semisupervised Laplacian SVM
14.2.1.1 Learning a pseudotraining set via saliency detection
14.2.1.2 Obtaining change result via Laplacian support vector machine
14.2.1.3 Experimental results
14.2.1.3.1 Description of data sets
14.2.1.3.2 Quantitative analysis
14.2.1.3.3 Parameter selection
14.2.1.3.4 Experiment results and analysis on three data sets
14.2.2 SAR images change detection based on spatial coding and nonlocal similarity pooling
14.2.2.1 Producing the difference image
14.2.2.2 Learning dictionary via affinity propagation
14.2.2.3 Creating feature vectors via sparse coding and nonlocal similarity pooling
14.2.2.3.1 Obtaining a change map by k-means clustering
14.2.2.4 Experimental results
14.2.2.4.1 Quantitative analysis
14.2.2.4.2 Parameter selection
14.2.2.4.3 Experiment results and analysis of the first three data sets
14.2.2.4.4 Experiment results and analysis on the last two image pairs
14.2.2.4.5 Results and analysis on simulated images
14.2.2.4.6 Experiment for sparse representation
References
15. Hyperspectral image processing based on sparse learning and sparse graph
15.1 Hyperspectral image denoising based on hierarchical sparse learning
15.1.1 Spatial-spectral data extraction
15.1.2 Hierarchical sparse learning for denoising each band-subset
15.1.3 Experimental results and discussion
15.1.3.1 Experiment on simulated data
15.1.3.2 Experiment on real data
15.1.3.2.1 Denoising for urban data
15.1.3.2.2 Experimental results on Indian Pines data
15.2 Hyperspectral image restoration based on hierarchical sparse Bayesian learning
15.2.1 Beta process
15.2.1.1 Full hierarchical sparse Bayesian model
15.2.2 Experimental results
15.2.2.1 Denoising
15.2.2.2 Predicting the missing data
15.2.2.3 Discussion
15.3 Hyperspectral image dimensionality reduction using a sparse graph
15.3.1 Sparse representation
15.3.2 Sparse graph-based dimensionality reduction
15.3.3 Sparse graph learning
15.3.4 Spatial-spectral clustering
15.3.5 Experimental results
15.3.5.1 Introduction of hyperspectral datasets
15.3.5.2 Classification results
15.3.5.3 Influence of spatial-spectral clustering
15.3.5.4 Convergence analysis
References
16. Nonconvex compressed sensing framework based on block strategy and overcomplete dictionary
16.1 Introduction
16.2 The block compressed sensing framework based on the overcomplete dictionary
16.2.1 Block compressed sensing
16.2.2 Overcomplete dictionary
16.2.3 Structured compressed sensing model
16.3 Image sparse representation based on the ridgelet overcomplete dictionary
16.4 Structured reconstruction model
16.4.1 Structural sparse prior based on image self-similarity
16.4.2 Reconstruction model based on an estimation of the direction structure of image blocks
16.5 Nonconvex reconstruction strategy
References
17. Sparse representation combined with fuzzy C-means (FCM) in compressed sensing
17.1 Basic introduction to fuzzy C-means (FCM) and sparse representation (SR)
17.2 Two versions combining FCM with SR
17.2.1 FDCM_SSR
17.2.2 SL_FCM
17.3 Experimental results
17.3.1 FDCM_SSR
17.3.1.1 UCI data set
17.3.1.2 Artificial images
17.3.1.3 Natural images
17.4 SAR images
17.4.1 SL_FCM
17.4.1.1 Artificial and natural images
17.4.1.2 Synthetic aperture radar images
References
18. Compressed sensing by collaborative reconstruction
18.1 Introduction
18.2 Methods
18.2.1 Block CS of images
18.2.2 Collaborative reconstruction method based on an overcomplete dictionary
18.2.3 Geometric structure-guided collaborative reconstruction method
18.3 Experiment
18.3.1 Collaborative reconstruction method based on an overcomplete dictionary
18.3.2 Geometric structure-guided collaborative reconstruction method
References
19. Hyperspectral image classification based on spectral information divergence and sparse representation
19.1 The research status and challenges of hyperspectral image classification
19.1.1 The research status of hyperspectral image classification
19.1.2 The challenges of hyperspectral image classification
19.2 Motivation
19.3 Spectral information divergence (SID)
19.4 Sparse representation classification method based on SID
19.5 Joint sparse representation classification method based on SID
19.6 Experimental results and analysis
19.6.1 Comparison of the measurements
19.6.2 Comparison of the performance of sparse representation classification methods
19.6.3 Analysis of parameters
19.6.4 The proof of convergence
References
20. Neural network-based synthetic aperture radar image processing
20.1 Discriminant deep belief network for SAR image classification
20.1.1 Weak classifiers training
20.1.2 Discriminative projection
20.1.3 High-level discriminative feature learning
20.1.4 Experiment and result
20.2 Convolutional-wavelet neural network for SAR image segmentation
20.2.1 Overall framework
20.2.2 Experiment and result
20.3 Deep neural network for SAR image registration
20.3.1 Train deep neural network
20.3.2 Predicting the matching label and eliminate the wrong matching points
20.3.3 Experiment and result
References
21. Neural networks-based polarimetric SAR image classification
21.1 PolSAR decomposition
21.2 Autoencoder for PolSAR image classification
21.2.1 Data processing and feature learning
21.2.2 Experiment and result
21.3 DBN for PolSAR image classification
21.3.1 DBN structure and feature learning
21.3.2 Experiment and result
21.4 Wishart deep stacking networks for PolSAR image classification
21.4.1 Wishart distance and network structure
21.4.2 Experiment and results
References
22. Deep neural network models for hyperspectral images
22.1 Deep fully convolutional network
22.1.1 Fully convolutional networks
22.1.2 Deep multiscale spatial distribution prediction via FCN-8s
22.1.3 Spatial-spectral feature fusion and classification for HSI
22.1.4 Experiment and results
22.2 Recursive autoencoders
22.2.1 Unsupervised RAE
22.2.2 Experiments and results
22.3 Superpixel-based multiple local CNN
22.3.1 Multiple local regions joint representation CNN model
22.3.2 Experiments and results
References
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Back Cover


Brain and Nature-Inspired Learning, Computation and Recognition Licheng Jiao Xidian University, Xi’an, China

Ronghua Shang Xidian University, Xi’an, China

Fang Liu Xidian University, Xi’an, China

Weitong Zhang Xidian University, Xi’an, China

Elsevier Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States Copyright © 2020 Tsinghua University Press. Published by Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-819795-0 For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Matthew Deans Acquisition Editor: Glyn Jones Editorial Project Manager: Naomi Robertson Production Project Manager: Sruthi Satheesh Cover Designer: Greg Harris Typeset by TNQ Technologies

CHAPTER 1

Introduction

Chapter Outline
1.1 A brief introduction to the neural network
1.1.1 The development of neural networks
1.1.2 Neuron and feedforward neural network
1.1.3 Backpropagation algorithm
1.1.4 The learning paradigm of neural networks
1.2 Natural inspired computation
1.2.1 Fundamentals of nature-inspired computation
1.2.2 Evolutionary algorithm
1.2.3 Artificial immune system (AIS)
1.2.4 Other methods
1.3 Machine learning
1.3.1 Development of machine learning
1.3.2 Dimensionality reduction
1.3.3 Sparseness and low-rank
1.3.4 Semisupervised learning
1.4 Compressive sensing learning
1.4.1 The development of compressive sensing
1.4.2 Sparse representation
1.4.3 Compressive observation
1.4.4 Sparse reconstruction
1.5 Applications
1.5.1 Community detection
1.5.2 Capacitated arc routing optimization
1.5.3 Synthetic aperture radar image processing
1.5.4 Hyperspectral image processing
References

1.1 A brief introduction to the neural network

Over the years, scientists have been exploring the secrets of the human brain from various perspectives, such as medicine, biology, physiology, philosophy, computer science, cognition, and organization synergetics, hoping to make artificial neurons that simulate the human brain. In the process of this research, in recent years a new multidisciplinary cross-technology field has been formed, called the "artificial neural network." The research into neural networks involves a wide range of disciplines, which combine, infiltrate, and promote each other. An artificial neural network (ANN) is an adaptive nonlinear dynamic system composed of a large number of simple basic elements: neurons. The structure and function of each neuron are relatively simple, but the system behavior produced by a large number of neurons in combination is very complex. The basic structure of an artificial neural network mimics the human brain and reflects some basic characteristics of human brain function. It can adapt itself to the environment, summarize rules, and complete operations, recognition, or process control. Artificial neural networks have the characteristic of parallel processing, which can greatly improve working speed.

1.1.1 The development of neural networks

The development of artificial neural networks has gone through three climaxes: control theory from the 1940s to the 1960s [1-3], connectionism from the 1980s to the mid-1990s [4,5], and deep learning since 2006 [6,7]. In 1943, Warren McCulloch and Walter Pitts created a neural network model based on a threshold logic algorithm [8]. This linear model distinguishes two different types of input by testing whether the response output is positive or negative. The study of neural networks is divided into the study of biological processes in the brain and the study of artificial intelligence (artificial neural networks). In 1949, Hebb published Organization of Behavior and put forward the famous "Hebb theory" [2]. Hebb theory argues that when the axon of neuron A is close to neuron B and neuron A repeatedly and persistently participates in exciting neuron B, a growth or metabolic change occurs in one or both neurons, which enhances the effectiveness of neuron A in stimulating neuron B [9]. Hebb theory was confirmed by the animal experiments of Kandel, who won the Nobel Prize in 2000 [10]. Later unsupervised machine learning algorithms are, more or less, variants of Hebb theory. In 1958, Frank Rosenblatt built a neural network model called the "perceptron," implemented on an IBM 704 computer [11]. This model could perform some simple visual processing tasks, and Rosenblatt believed that the perceptron would eventually be able to learn, make decisions, and translate languages. In 1959, two other American engineers, Widrow and Hoff [12], put forward the adaptive linear element (Adaline). This was a variation on the perceptron and one of the progenitor models of machine learning; the main difference was that the Adaline neuron has a linear activation function, which allows the output to take any value.

In 1969, Marvin Minsky and Seymour Papert identified two major defects in the neural network: first, the basic perceptron could not handle the XOR problem [13]; second, the computing power available at the time was not sufficient to deal with large neural networks. The study of neural networks stagnated. In 1974, Paul Werbos proposed that the multilayer perceptron be trained by a "backpropagation algorithm" to overcome the defect that left the single-layer perceptron unable to deal with the XOR problem [14]. However, because neural network research was at a low ebb at that time, this method did not attract much attention. The neural network idea began to revive in the 1980s. In 1982, Hopfield [15] proposed a novel neural network called the "Hopfield network." The Hopfield neural network is a kind of recurrent neural network that combines a storage system and a binary system. It introduced the concept of an energy function for the first time, so that the equilibrium state of the neural network had a clear criterion. But owing to the limitations of computing, for the rest of the 20th century the popularity of support vector machines and other simpler algorithms, such as linear classifiers, gradually exceeded that of neural networks. In 1998, LeCun proposed a convolutional neural network, called LeNet-5, which was trained by backpropagation and achieved good results on a handwritten digit database [16]. In the early 21st century, the computing power of computers was greatly improved with the help of GPUs and distributed computing, and neural networks have since gained great attention and development. In 2006, Geoffrey Hinton [17] effectively trained a deep belief network with greedy layer-wise pretraining. This technique was then extended to many different neural networks by researchers, greatly improving the generalization of models on test sets. In 2012, Hinton's group won the ImageNet 2012 competition [18]; their image classification accuracy far surpassed that of the second-place entry. The deep neural network algorithm has a great advantage over traditional algorithms in some areas. In 2016, AlphaGo [19], an artificial intelligence program developed by Google DeepMind, beat a top human professional Go player. Its principle was to use a Monte Carlo tree search combined with two different deep neural networks. The emergence of AlphaGo once again pushed the development of neural networks to a peak.

1.1.2 Neuron and feedforward neural network

The neuron model is based on the nerve cells of a biological nervous system. In the study of biological nervous systems, the biological mechanism of the neuron can be represented mathematically and a computational model based on the neuron obtained. A neuron contains three parts: the cell body, dendrites, and axons. The cell body is a complex structure formed of many molecules; it is the energy supply area of neuronal activity, where metabolic activities are carried out. Dendrites are the entry points that receive information from other neurons, and axons are the outlets through which a neuron transmits information to others. The synapse is the structure that enables communication between one neuron and another and transmits information between them. Neural networks are described on the basis of the mathematical model of neurons. The model is represented by network topology, node characteristics, and learning rules. The main advantages of neural networks are as follows:

(1) Parallel distributed processing
(2) High robustness and fault tolerance
(3) Distributed storage and learning ability
(4) The ability to closely approximate complex nonlinear relationships.

According to the characteristics of neurons and their biological function, the neuron is an information-processing unit with multiple inputs and a single output. The processing of information is nonlinear, and we abstract it into a simple mathematical model, as shown in Fig. 1.1. The specific mathematical formula is as follows:

$$
y = \varphi\left(\sum_{i=1}^{m} x_i w_i + b\right)
\tag{1.1}
$$

where $\varphi(\cdot)$ is the activation function. Commonly used activation functions include:

$$
\begin{cases}
\operatorname{sigmoid}(x) = \dfrac{1}{1+e^{-x}} \\[2mm]
\tanh(x) = \dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \\[2mm]
\operatorname{ReLU}(x) = \max(0, x) \\[2mm]
\operatorname{softplus}(x) = \log\left(1+e^{x}\right)
\end{cases}
\tag{1.2}
$$
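To make the four activation functions in Formula (1.2) concrete, the following is a minimal NumPy sketch; the function names and the use of NumPy are illustrative choices, not part of the original text.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), squashes input to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), squashes input to (-1, 1)
    return np.tanh(x)

def relu(x):
    # ReLU(x) = max(0, x): unilateral inhibition and sparse activation
    return np.maximum(0.0, x)

def softplus(x):
    # softplus(x) = log(1 + e^x), a smooth form of ReLU;
    # its derivative is the logistic (sigmoid) function
    return np.log1p(np.exp(x))

if __name__ == "__main__":
    z = np.linspace(-3, 3, 7)
    print(sigmoid(z), tanh(z), relu(z), softplus(z), sep="\n")
```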

Neuroscientists have found that neurons exhibit characteristics such as unilateral inhibition, a wide excitation boundary, and sparse activation. Compared with other activation functions, the rectified linear unit (ReLU) has biological interpretability. In addition, the derivative of the softplus function is the logistic function; softplus is the smooth form of the rectified linear unit. Although it also has the unilateral inhibition and wide excitation boundary characteristics, it does not have the sparse activation property. Based on the mathematical neuron model, neural networks can be divided into forward networks (directed acyclic) and feedback networks (undirected complete graph, also called cyclic networks) according to the topology of the network connections. For a feedback network, the stability of the network model is closely related to associative memory; the Hopfield network and the Boltzmann machine are of this type. The forward network can be realized by the repeated composition of simple nonlinear functions, and its network structure is very simple. The following is an introduction to the forward neural network. Its network structure is shown in Fig. 1.2, and the corresponding mathematical formula is (1.3).


Figure 1.2 Feedforward neural network with a single hidden layer.

$$
\begin{cases}
h = \varphi^{(1)}\left(\sum_{i=1}^{m} x_i \cdot w_i^{(1)} + b^{(1)}\right) \\[3mm]
y = \varphi^{(2)}\left(\sum_{j=1}^{n} h_j \cdot w_j^{(2)} + b^{(2)}\right)
\end{cases}
\tag{1.3}
$$

where the input $x \in \mathbb{R}^{m}$, the hidden layer $h \in \mathbb{R}^{n}$, and the output $y \in \mathbb{R}^{K}$. $w^{(1)} \in \mathbb{R}^{m \times n}$ and $b^{(1)} \in \mathbb{R}^{n}$ are the weight connection matrix and bias from the input layer to the hidden layer, respectively. $w^{(2)} \in \mathbb{R}^{n \times K}$ and $b^{(2)} \in \mathbb{R}^{K}$ are the weight connection matrix and bias from the hidden layer to the output layer. $\varphi^{(1)}$ and $\varphi^{(2)}$ are the activation functions. In practical applications, the training data set is assumed to be

$$
\left\{\left(x^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}, \quad x^{(n)} \in \mathbb{R}^{m}, \quad y^{(n)} \in \mathbb{R}^{K}
\tag{1.4}
$$

The model between the input and output is Formula (1.5):

$$
y = T(x; \theta) = \varphi^{(2)}\left(\sum_{j=1}^{n} \varphi^{(1)}\left(\sum_{i=1}^{m} x_i \cdot w_i^{(1)} + b^{(1)}\right) w_j^{(2)} + b^{(2)}\right)
\tag{1.5}
$$
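As an illustration of Formulas (1.3) and (1.5), here is a minimal NumPy sketch of the forward pass of a single-hidden-layer network $T(x;\theta)$; the layer dimensions and the choice of sigmoid for both layers are assumptions made for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    # Formula (1.3): hidden layer h = phi1(x W1 + b1), output y = phi2(h W2 + b2)
    h = sigmoid(x @ W1 + b1)      # h in R^n
    y = sigmoid(h @ W2 + b2)      # y in R^K
    return h, y

# Example dimensions (assumed): m inputs, n hidden units, K outputs
m, n, K = 4, 8, 3
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, n)), np.zeros(n)   # input-to-hidden weights and bias
W2, b2 = rng.normal(size=(n, K)), np.zeros(K)   # hidden-to-output weights and bias

x = rng.normal(size=m)
_, y = forward(x, W1, b1, W2, b2)
print(y.shape)  # (3,)
```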

The parameter $\theta = \left(w^{(1)}, b^{(1)}; w^{(2)}, b^{(2)}\right)$ is further optimized for the target (composed of the loss term and the regular term):

$$
\min_{\theta} L(\theta) = \frac{1}{N}\sum_{n=1}^{N}\left\|y^{(n)} - T\left(x^{(n)}; \theta\right)\right\|_F^2 + \lambda \sum_{l=1}^{2}\left\|w^{(l)}\right\|_F^2
\tag{1.6}
$$

The gradient descent method is used to solve for the parameter $\theta$:

$$
\begin{cases}
\theta^{k} = \theta^{k-1} - \alpha \cdot \nabla\theta\big|_{\theta=\theta^{k-1}} \\[2mm]
\nabla\theta\big|_{\theta=\theta^{k-1}} = \left.\dfrac{\partial L(\theta)}{\partial \theta}\right|_{\theta=\theta^{k-1}}
\end{cases}
\tag{1.7}
$$

With the increase in the iteration number $k$, the parameters will converge (observed indirectly through the value of the target function $L(\theta^{k})$):

$$
\lim_{k \to \infty} \theta^{k} = \theta^{*}
\tag{1.8}
$$
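To illustrate the update rule of Formula (1.7), the following minimal sketch applies gradient descent to a regularized least-squares objective of the same form as Formula (1.6), with a linear model standing in for $T(x;\theta)$ so that the gradient has a simple closed form; the data, step size, and regularization weight are placeholders for this example.

```python
import numpy as np

# Gradient descent (Formula 1.7) on L(theta) = (1/N)||y - X theta||^2 + lambda ||theta||^2,
# used here as a convex stand-in for the objective in Formula (1.6).
rng = np.random.default_rng(1)
N, d = 100, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.01 * rng.normal(size=N)

lam, alpha = 0.1, 0.05          # regularization weight and learning rate
theta = np.zeros(d)             # initial parameters

for k in range(500):
    residual = X @ theta - y
    grad = (2.0 / N) * X.T @ residual + 2.0 * lam * theta   # dL/dtheta
    theta = theta - alpha * grad                             # theta^k = theta^{k-1} - alpha * grad
    # As k grows, L(theta^k) decreases and theta^k converges (Formula 1.8).

print(np.round(theta, 3))
```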

The reason for convergence is that the above objective function is convex. To optimize the objective function, it can be solved directly in closed form. However, when the amount of data is large, storage and reading will be very time-consuming. Therefore, it is usually solved by stochastic gradient descent (with batch processing). For neural networks with a determined topology, Hornik et al. [20-22] proved that if the output layer adopts a linear activation function and the hidden layer adopts the sigmoid function, a single-hidden-layer neural network can approximate any rational function with any accuracy. When the number of hidden layers of the network is more than one, it is called a multi-hidden-layer feedforward neural network, or a deep feedforward neural network. Its structure is shown in Fig. 1.3.

Figure 1.3 Feedforward neural network with multi hidden layers.

The topology of the deep feedforward neural network is multi-hidden-layer, fully connected, and directed acyclic. Using the following notation, the model between the input and output of the network is given. The input layer is $x \in \mathbb{R}^{m}$, the output layer is $y \in \mathbb{R}^{s}$, and the output of the hidden layers is written as

$$
\begin{cases}
h^{(l)} = \varphi^{(l)}\left(\sum_{i=1}^{n_{l-1}} h_i^{(l-1)} w_i^{(l)} + b^{(l)}\right), & l = 1, 2, \ldots, L \\[3mm]
h^{(0)} = x \\
h^{(L)} = y
\end{cases}
\tag{1.9}
$$

Removing the input layer $h^{(0)}$ and the output layer $h^{(L)}$, the number of hidden layers is $L-1$, and the corresponding hyperparameters (the number of layers, the number of hidden units, the activation functions) are represented as:

$$
\begin{cases}
L + 1 \;\rightarrow\; \text{the number of layers} \\
[n_0, n_1, n_2, \ldots, n_{L-1}, n_L] \;\rightarrow\; \text{dimensions of each layer} \\
\left[\varphi^{(1)}, \varphi^{(2)}, \ldots, \varphi^{(L-1)}, \varphi^{(L)}\right] \;\rightarrow\; \text{activation functions}
\end{cases}
\tag{1.10}
$$

where $n_0 = m$ and $n_L = s$. The parameters to be learned are represented as

$$
\begin{cases}
\theta = (\theta_1, \theta_2, \ldots, \theta_L) \\
\theta_l = \left(w^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l},\; b^{(l)} \in \mathbb{R}^{n_l}\right) \\
l = 1, 2, \ldots, L
\end{cases}
\tag{1.11}
$$

The relationship between input and output is represented as

$$
\begin{aligned}
y = h^{(L)} &= \varphi^{(L)}\left(\sum_{i_L=1}^{n_L} h_{i_L}^{(L-1)} \cdot w_{i_L}^{(L)} + b^{(L)}\right) \;\triangleq\; \varphi^{(L)}\left(h^{(L-1)}; \theta_L\right) \\
&= \varphi^{(L)}\left(\sum_{i_L=1}^{n_L} \varphi^{(L-1)}\left(\sum_{i_{L-1}=1}^{n_{L-1}} h_{i_{L-1}}^{(L-2)} \cdot w_{i_{L-1}}^{(L-1)} + b^{(L-1)}\right) w_{i_L}^{(L)} + b^{(L)}\right) \\
&\;\triangleq\; \varphi^{(L)}\left(\varphi^{(L-1)}\left(h^{(L-2)}; \theta_{L-1}\right); \theta_L\right) \\
&= \cdots \\
&= \varphi^{(L)}\left(\varphi^{(L-1)}\left(\cdots \varphi^{(1)}\left(x; \theta_1\right)\cdots; \theta_{L-1}\right); \theta_L\right) \;\triangleq\; f(x; \theta)
\end{aligned}
\tag{1.12}
$$

In practical applications, the training data set is assumed to be

$$
\left\{\left(x^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}, \quad x^{(n)} \in \mathbb{R}^{m}, \quad y^{(n)} \in \mathbb{R}^{s}
\tag{1.13}
$$
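The layer-wise composition $f(x;\theta)$ in Formulas (1.9) and (1.12) can be sketched as a simple loop over layers; the layer sizes and the uniform use of tanh below are assumptions made only for this illustration.

```python
import numpy as np

def deep_forward(x, params, activations):
    # params: list of (W_l, b_l) for l = 1..L; activations: list of phi_l
    # Formula (1.9): h^(0) = x, h^(l) = phi_l(h^(l-1) W_l + b_l), h^(L) = y
    h = x
    for (W, b), phi in zip(params, activations):
        h = phi(h @ W + b)
    return h                      # f(x; theta), Formula (1.12)

# Assumed layer dimensions [n0, n1, ..., nL]
dims = [4, 16, 16, 3]
rng = np.random.default_rng(0)
params = [(rng.normal(size=(dims[l], dims[l + 1])), np.zeros(dims[l + 1]))
          for l in range(len(dims) - 1)]
activations = [np.tanh] * (len(dims) - 1)

y = deep_forward(rng.normal(size=dims[0]), params, activations)
print(y.shape)  # (3,)
```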

The optimized objective function (the loss term and the regular term) is as follows:

$$
\min_{\theta} J(\theta) = L(\theta) + \lambda R(\theta)
\tag{1.14}
$$

where $\hat{y}_n = f(x_n; \theta)$ and

$$
\begin{cases}
l\left(y_n, \hat{y}_n\right) = \left\|y_n - \hat{y}_n\right\|_F^2 \\[2mm]
L(\theta) = \dfrac{1}{N}\sum_{n=1}^{N} l\left(y_n, \hat{y}_n\right) \\[2mm]
R(\theta) = \sum_{l=1}^{L}\left\|w^{(l)}\right\|_F^2 = \sum_{l=1}^{L}\left\|\theta_l\right\|_F^2
\end{cases}
\tag{1.15}
$$

There are many forms of the loss function $l(\cdot,\cdot)$, such as the energy function and the cross-entropy loss, and the regularization term $R(\cdot)$ includes the Frobenius norm (preventing overfitting) and sparse regularization (simulating biological response characteristics).
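A minimal sketch of the objective in Formulas (1.14) and (1.15), combining a squared loss with a Frobenius-norm regularizer, is given below; the helper names and the one-layer linear model used in the usage example are placeholders, not part of the original text.

```python
import numpy as np

def objective(params, X, Y, predict, lam):
    # J(theta) = L(theta) + lambda * R(theta), cf. Formulas (1.14)-(1.15)
    Y_hat = np.stack([predict(x, params) for x in X])
    L = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))   # loss term: mean squared error
    R = sum(np.sum(W ** 2) for W, _ in params)      # regular term: sum of ||W_l||_F^2
    return L + lam * R

# Example usage with a single linear layer (assumed for illustration):
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(3))]
predict = lambda x, p: x @ p[0][0] + p[0][1]
X, Y = rng.normal(size=(10, 4)), rng.normal(size=(10, 3))
print(objective(params, X, Y, predict, lam=0.1))
```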


1.1.3 Backpropagation algorithm

In order to optimize the objective function in Formula (1.14), first we must determine the convexity or nonconvexity of the function (Fig. 1.4). If the feasible region is a convex set, minimizing a convex function defined on that convex set is convex optimization, and the obtained solution does not depend on the selection of the initial value and is the global optimal solution. Usually, the optimization objective function of a deep feedforward neural network is nonconvex, therefore the solution of the parameters depends on the setting of the initial parameters (there are many saddle points and local extreme points in the feasible region). If the setup is reasonable, falling into a local optimum can be avoided. In order to illustrate the backpropagation algorithm (based on the gradient descent method), the parameters are updated as follows:

$$
\begin{cases}
\theta^{(k)} = \theta^{(k-1)} - \alpha \cdot \nabla\theta\big|_{\theta=\theta^{(k-1)}} \\[2mm]
\nabla\theta\big|_{\theta=\theta^{(k-1)}} = \left.\dfrac{\partial L(\theta)}{\partial \theta}\right|_{\theta=\theta^{(k-1)}} + \lambda \left.\dfrac{\partial R(\theta)}{\partial \theta}\right|_{\theta=\theta^{(k-1)}}
\end{cases}
\tag{1.16}
$$

where $\alpha$ is the learning rate, and the specific parameters on each layer are updated as

$$
\begin{cases}
\theta_l^{(k)} = \theta_l^{(k-1)} - \alpha \cdot \nabla\theta_l\big|_{\theta_l=\theta_l^{(k-1)}} \\[2mm]
\nabla\theta_l\big|_{\theta_l=\theta_l^{(k-1)}} = \left.\dfrac{\partial L(\theta)}{\partial \theta_l}\right|_{\theta_l=\theta_l^{(k-1)}} + \lambda \left.\dfrac{\partial R(\theta)}{\partial \theta_l}\right|_{\theta_l=\theta_l^{(k-1)}}
\end{cases}
\tag{1.17}
$$


Figure 1.4 An illustration of the backpropagation algorithm.

where $\theta_l^{(k)}$ is the value to be updated for the $l$th layer in the $k$th iteration. The error propagation term is introduced for the solution of the gradient descent. According to the chain rule, the gradient is expanded as:

$$
\frac{\partial L(\theta)}{\partial \theta_l} = \frac{\partial h^{(l)}}{\partial \theta_l} \cdot \frac{\partial h^{(l+1)}}{\partial h^{(l)}} \cdots \frac{\partial h^{(L)}}{\partial h^{(L-1)}} \cdot \frac{\partial L(\theta)}{\partial h^{(L)}}
\tag{1.18}
$$

The error propagation term is written as:

$$
\delta^{(l)} = \frac{\partial L(\theta)}{\partial h^{(l)}}
\tag{1.19}
$$

With the further use of $\theta_l = \left(w^{(l)}, b^{(l)}\right)$, the corresponding derivatives of the hidden layer output with respect to the parameters are represented as:

$$
\begin{cases}
\dfrac{\partial h^{(l)}}{\partial w^{(l)}} = \dfrac{\partial \varphi^{(l)}\left(\left(h^{(l-1)}\right)^{T} w^{(l)} + b^{(l)}\right)}{\partial w^{(l)}} = h^{(l-1)} \odot \left(\varphi^{(l)}\right)' \\[4mm]
\dfrac{\partial h^{(l)}}{\partial b^{(l)}} = \dfrac{\partial \varphi^{(l)}\left(\left(h^{(l-1)}\right)^{T} w^{(l)} + b^{(l)}\right)}{\partial b^{(l)}} = 1 \odot \left(\varphi^{(l)}\right)'
\end{cases}
\tag{1.20}
$$

where $\odot$ is the Hadamard product. The preceding formulas give the derivatives of the parameters on the loss term, and the derivatives of the regular term are:

$$
\frac{\partial R(\theta)}{\partial \theta_l} = \frac{\partial}{\partial \theta_l}\left(\sum_{l=1}^{L}\left\|\theta_l\right\|_F^2\right) = \frac{\partial \left\|\theta_l\right\|_F^2}{\partial \theta_l}
\tag{1.21}
$$

Usually, the constraints in the regular term are imposed only on the weight matrices, and the biases are not regularized, so:

$$
\begin{cases}
\dfrac{\partial R(\theta)}{\partial w^{(l)}} = \dfrac{\partial \left\|w^{(l)}\right\|_F^2}{\partial w^{(l)}} = 2 w^{(l)} \\[4mm]
\dfrac{\partial R(\theta)}{\partial b^{(l)}} = \dfrac{\partial \left\|w^{(l)}\right\|_F^2}{\partial b^{(l)}} = 0
\end{cases}
\tag{1.22}
$$

The process of optimizing the parameter $\theta_l$ of the $l$th hidden layer is mainly determined by the gradients (first derivatives) of the loss term $L(\theta)$ and the regular term $R(\theta)$ with respect to the parameter $\theta_l$. The error propagation is realized by introducing the error propagation term [Formula (1.19)]. Training of the feedforward neural network is divided into two steps. The first is to calculate the output value of each layer in the forward propagation process according to the current parameter values. The second is to backpropagate the error term of each layer according to the difference between the actual output and the expected output.

The partial derivatives of each layer's output are combined to update the parameters, and the two steps are repeated until the network converges. When the network is deep, the gradient error of the parameters on each layer gradually decreases from the output to the input (the gradient is larger for layers closer to the output and smaller, possibly vanishing to zero, for layers closer to the input). This makes it difficult for the whole network to obtain better parameters through training; it keeps the objective function away from the global minima and saddle points of the feasible region and makes it tend to fall into a local optimum. This is the vanishing gradient problem.
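As a concrete illustration of the two-step forward/backward procedure described above, here is a minimal NumPy sketch that trains the single-hidden-layer network of Formula (1.3) with manually derived gradients; the sigmoid activations, layer sizes, learning rate, and random data are assumptions for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, K, N = 4, 8, 3, 200
X = rng.normal(size=(N, m))
Y = rng.normal(size=(N, K))

W1, b1 = 0.1 * rng.normal(size=(m, n)), np.zeros(n)
W2, b2 = 0.1 * rng.normal(size=(n, K)), np.zeros(K)
alpha, lam = 0.1, 1e-3

for epoch in range(200):
    # Step 1: forward propagation with the current parameters
    H = sigmoid(X @ W1 + b1)
    Y_hat = sigmoid(H @ W2 + b2)

    # Step 2: backpropagate the error term of each layer (chain rule, Formula 1.18)
    dY = 2.0 * (Y_hat - Y) / N * Y_hat * (1 - Y_hat)   # error at the output layer
    dH = dY @ W2.T * H * (1 - H)                        # error propagated to the hidden layer

    # Gradient descent updates with Frobenius-norm regularization on the weights only
    W2 -= alpha * (H.T @ dY + 2 * lam * W2); b2 -= alpha * dY.sum(axis=0)
    W1 -= alpha * (X.T @ dH + 2 * lam * W1); b1 -= alpha * dH.sum(axis=0)

print(float(np.mean((Y - sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)) ** 2)))
```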

1.1.4 The learning paradigm of neural networks

The basic neural network still uses the paradigm of machine learning, that is, four parts: data, model, optimization, and solving. Machine learning emphasizes learning data features based on prior knowledge (including extracting and screening features to obtain discriminative features) and classifier design, but the expressive ability of the model is limited by the learned features. The advantage is that the objective function can be optimized quickly using a convex optimization algorithm or software; its core is the pursuit of speed and precision. Compared with machine learning, a deep neural network reduces the dependence on prior knowledge of the data, and the representation ability of the model becomes deeper and more essential as the number of layers increases. However, several difficulties arise: (1) In the training stage, labeled data are scarce and there are many model parameters to be trained, which leads to insufficient training or overfitting. (2) The optimization objective is a nonconvex optimization problem, so the solution depends on the selection of the initial value. Choosing a proper initial value can avoid prematurely falling into a local optimum, and the obtained solution is close to the optimal one; if the selection is poor, the network is prone to underfitting. (3) When the backpropagation algorithm is used, the vanishing gradient problem can easily occur, which leads to inadequate training of the network model. The difference in data is crucial to a deep neural network. For classification tasks, stronger aggregation means that data belonging to the same class have greater similarity: common features are the main part, and individual characteristics are supplementary. Large sparsity between classes indicates greater difference between classes: that is, individuality is the main feature, and common features are supplementary. When a deep neural network is used for feature learning, the multilevel combination of hierarchical parameters gives the weight parameters a discriminative characteristic; it emphasizes commonality within a class and pays attention to individuality among classes. The most satisfying model under the combination of parameters also indirectly indicates that the two factors mentioned above are contradictory yet unified. In essence, a deep neural network represents data in a hierarchical manner: an advanced representation is built on low-level representations, and a complex problem is divided into a series of nested, simple representation learning problems. For example, the first hidden layer identifies edges from some pixels and the values of their adjacent pixels in the image; the second hidden layer integrates the edges to identify outlines and corners; the third hidden layer extracts specific outlines and corners as abstract high-level semantic features; finally, a linear classifier is used to identify the target in the image.

1.2 Natural inspired computation

1.2.1 Fundamentals of nature-inspired computation

Bio-intelligence is a very important source of theoretical inspiration in artificial intelligence research. From the perspective of information processing, an organism is an excellent information processor, and its ability to solve problems through its own evolution dwarfs that of the best current computers. In recent years, artificial intelligence researchers have become accustomed to referring to intelligent algorithms developed with inspiration from natural phenomena as nature-inspired computation (NIC). Based on the functions, characteristics, and mechanisms of organisms in nature, NIC studies the abundant processing mechanisms contained in them, constructs corresponding computational models, designs corresponding algorithms, and applies them to various fields. Natural computing is not only a new hotspot in artificial intelligence research, but also a new way of thinking for the development of artificial intelligence and a new result of the transformation of methodology. Its research results include artificial neural networks, evolutionary algorithms, artificial immune systems, fuzzy logic, quantum computing, and complex adaptive systems, etc. Natural computing can solve many complex problems that are difficult to solve by traditional computing methods, and it has good application prospects in fields such as large-scale complex optimization, intelligent control, and computer network security. This section focuses on evolutionary algorithms and artificial immune systems.

1.2.2 Evolutionary algorithm

Evolutionary computation is a kind of adaptive artificial intelligence technique that simulates the process and mechanism of biological evolution to solve problems. The core idea comes from a basic understanding that evolution from the simple to the complex and from low levels to high levels is a natural, parallel, and robust optimization process. The goal of this process is to achieve optimization through adaptation to the environment, "survival of the fittest," and genetic variation of the biological population.

The evolutionary algorithm (EA) is a kind of random search technique based on the above ideas. EAs simulate the learning process of a group of individuals, each of which represents a point in the search space of a given problem. The evolutionary algorithm starts from a selected initial solution and gradually improves the current solution through an iterative evolutionary process until the best solution or a satisfactory solution is found. In the course of evolution, the algorithm uses a method similar to natural selection and sexual reproduction on a set of solutions to generate next-generation solutions with better performance indicators on the basis of the inherited superior genes. The general steps for solving an optimization problem using an evolutionary algorithm are as follows (a minimal code sketch of these steps is given at the end of this subsection): (1) Randomly give a set of initial solutions; (2) Evaluate the performance of the current set of solutions; (3) If the current solutions satisfy the requirements or the evolution process reaches a certain number of generations, terminate the calculation; (4) According to the evaluation result of (2), select a certain number of solutions from the current set as the objects of genetic operations; (5) Perform genetic operations such as crossover and mutation on the selected solutions to get a new set of solutions, then go to (2). The commonly used search methods fall into three categories: enumeration, analytical, and random. Enumeration refers to enumerating all feasible solutions within the set of feasible solutions in order to find the optimal solution; a continuous function needs to be discretized. However, many practical problems correspond to a large search space, so this method is very inefficient. The analytical method mainly uses properties of the objective function in the solution process, such as the first derivative and the second derivative. It can be divided into two kinds: direct and indirect. The direct method determines the next search direction based on the gradient of the objective function, so it is difficult to find the global optimal solution, while the indirect method derives a set of equations from the necessary conditions for extreme values and then solves the system of equations. However, the derived equations are generally nonlinear and very difficult to solve. The random method introduces random changes to the search direction during the search process, making the algorithm jump out of local extreme points with a greater probability. Randomization can be further divided into blind randomization and guided randomization: the former randomly selects different points in the feasible solution space for evaluation, while the latter changes the current search direction with a certain probability and searches in other directions. EAs belong to the random search methods, adopting random processing in initial solution generation and in genetic operations such as selection, crossover, and mutation. Compared with traditional search algorithms, they have the following differences: (1) EAs do not act directly on the solution space, but use some kind of encoded representation of the solution. (2) EAs start from a group of multiple points rather than one point, which is one of the main reasons why they can find the global optimal solution with a large probability. (3) EAs only use the adaptive information of the solution (i.e., the value of the objective function) and weigh increasing revenue against reducing overhead, while traditional search algorithms typically use derivatives. (4) EAs use stochastic transition rules rather than deterministic transition rules. In addition, the main features of EAs compared with traditional algorithms are reflected in the following two aspects. Intelligence: The intelligence of EAs includes self-organization, self-adaptation, and self-learning. When using EAs to solve a problem, once the coding scheme, fitness function, and genetic operators are determined, the algorithm will use the information obtained in the evolution process to self-organize the search. This intelligent feature of EAs also gives them the ability to automatically discover the characteristics and laws of the environment based on changes in the environment. Essential parallelism: The essential parallelism of EAs is manifested in two aspects. The first is that EAs are inherently parallel, that is, well-suited for massive parallelization; the second is their implicit parallelism: EAs use a population-based search, so they can search multiple areas of the solution space simultaneously and exchange information among them. The currently studied EAs are mainly divided into four types [1-11]: genetic algorithms (GAs), evolutionary programming (EP), evolution strategy (ES), and genetic programming (GP). The first three algorithms were developed independently of each other, and the last is a branch developed on the basis of the genetic algorithm. Although these branches have some subtle differences in the implementation of the algorithm, they have a common feature, that is, they all rely on the ideas and principles of biological evolution to solve practical problems. Evolutionary computation is the product of multidisciplinary integration and infiltration. It has developed into a comprehensive technology of self-organization and self-adaptation, which has been widely used in computer science, engineering technology, management science, and social science. At present, the research into evolutionary computation mainly focuses on basic theory, function optimization, combinatorial optimization, classification systems, parallel evolutionary algorithms, image processing, evolutionary neural networks, and artificial life.
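As a concrete illustration of the five general steps listed above, the following is a minimal genetic algorithm sketch for maximizing a simple fitness function; the bit-string encoding, "OneMax" objective, population size, and rates are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(bits):
    # Toy objective (assumed): maximize the number of 1-bits ("OneMax")
    return bits.sum()

pop_size, n_bits, n_gen, p_mut = 30, 20, 50, 0.02
pop = rng.integers(0, 2, size=(pop_size, n_bits))            # (1) random initial solutions

for gen in range(n_gen):
    fit = np.array([fitness(ind) for ind in pop])             # (2) evaluate the population
    if fit.max() == n_bits:                                    # (3) stop if requirement is met
        break
    # (4) selection: binary tournament picks parents for genetic operations
    parents = np.array([pop[max(rng.integers(0, pop_size, 2), key=lambda i: fit[i])]
                        for _ in range(pop_size)])
    # (5) genetic operations: one-point crossover and bit-flip mutation, then go to (2)
    children = parents.copy()
    for i in range(0, pop_size - 1, 2):
        cut = rng.integers(1, n_bits)
        children[i, cut:], children[i + 1, cut:] = parents[i + 1, cut:], parents[i, cut:]
    children ^= (rng.random(children.shape) < p_mut).astype(children.dtype)
    pop = children

print("best fitness:", max(fitness(ind) for ind in pop))
```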


1.2.3 Artificial immune system (AIS)

The artificial immune system (AIS), inspired by immunology, is an adaptive system for solving complex problems by simulating immune functions, principles, and models [12]. As early as the mid-1980s, Farmer et al. [13] took the lead in providing a dynamic model of the immune system based on immune network theory and discussed the relationship between the immune system and artificial intelligence methods, which opened up research on artificial immune systems. However, subsequent research findings were rare. It was not until December 1996, at an international symposium on the immune system held in Japan, that the concept of the "artificial immune system" was first proposed. Subsequently, research on the artificial immune system developed rapidly and the related papers and research results increased year by year. In 1997 and 1998, the IEEE Systems, Man and Cybernetics International Conference organized a related topic discussion and established the "Artificial Immune System Memory Application Branch." Subsequently, sessions on the artificial immune system were successively opened at famous international conferences in the field of artificial intelligence, such as the International Joint Conference on Artificial Intelligence (IJCAI), the International Joint Conference on Neural Networks (IJCNN), the IEEE Congress on Evolutionary Computation (CEC), and the Genetic and Evolutionary Computation Conference (GECCO), etc. Since 2002, six consecutive international conferences on artificial immune systems have been held in the United Kingdom, Italy, Canada, and Brazil. After more than a decade of development, the research into artificial immune system algorithms has focused on the negative selection algorithm [14], the clonal selection algorithm [15], and the immune network algorithm [16], and the research results mainly relate to anomaly detection, computer security, data mining, and optimization, etc. The organism is a complex large system whose information-processing function is completed by three subsystems with different temporal and spatial dimensions: the brain nervous system, the immune system, and the endocrine system. The immune system, consisting of immune-functioning organs, tissues, cells, immune effector molecules, and related genes, is a necessary defense mechanism for organisms, especially vertebrates, and can protect the organism against the invasion of pathogens, harmful foreign bodies, cancer cells, and pathogenic factors [13]. The immune function mainly includes immune defense, immune stability, and immune surveillance. From the perspective of engineering applications and information processing, biological immune systems provide many information-processing mechanisms for artificial intelligence. It was the full recognition of the rich information-processing mechanisms in the biological immune system that enabled Farmer et al. to take the lead in giving a dynamic model of the immune system based on immune network theory and discussing the relationship between the immune system and other artificial intelligence methods, which began the research into the artificial immune system [13].

The artificial immune system is a kind of intelligent method that imitates the natural immune system. It realizes a learning technology inspired by the biological immune system and its natural defense mechanism against external substances, and it provides the essential properties of noise tolerance, learning without a teacher, self-organization, and memory. Combined with some of the advantages of classifiers, neural networks, and machine inference, the artificial immune system has the potential to provide novel solutions to problems. Its research results involve many fields such as control, mathematical processing, optimization learning, and fault diagnosis, etc. It has become another research hotspot of artificial intelligence, following neural networks, fuzzy logic, and evolutionary computation. Although the artificial immune system has gradually been emphasized by researchers, compared with artificial neural networks, whose methods and models are more mature, research on the artificial immune system is at a relatively low level, whether in the understanding of immune mechanisms, the construction of immune algorithms, or engineering applications. The research into the artificial immune system mainly focuses on three aspects, namely research into artificial immune system models, research into artificial immune system algorithms, and applications of the artificial immune system. This book focuses on the research and applications of immune optimization algorithms. Looking at the research results of the artificial immune system, immune computation for the purpose of solving optimization problems has attracted the attention of many researchers. Representative research results include the clonal selection algorithm proposed by de Castro et al. [15], the B-cell algorithm proposed by Timmis et al. [16], the immune network algorithm proposed by de Castro et al. [17], the vaccine-based immune algorithm proposed by Jiao et al. [18], the immune optimization algorithm (opt-IA) proposed by Cutello et al. [19], and a series of advanced clonal selection algorithms, etc. Many scholars have shown great interest in these studies and proposed a series of improved algorithms in succession; furthermore, they have conducted extensive research on the application of these algorithms.

1.2.4 Other methods
In addition, research into NIC also includes quantum computation (QC) and complex adaptive systems (CAS), among other topics. The study of quantum computing began in 1982, when Richard Feynman, the Nobel Prize winner in physics, first viewed computation as a physical process; it has since become one of the foremost disciplines followed closely by countries around the world. The parallelism, exponential storage capacity, and exponential acceleration of quantum computing demonstrate its powerful computational capabilities [20,21]. In 1994, Peter Shor proposed a quantum algorithm for factoring large integers,

which could complete in a few minutes the RSA-129 factoring problem that required 1600 classical computers working for 250 days. RSA is a public-key cryptosystem long regarded as secure because it cannot be broken by classical computers, yet it could be broken easily by a quantum computer [22]. In 1996, Grover proposed a quantum search algorithm that can replace approximately 3.5 × 10^16 steps of a classical computer with only about 200 million steps when deciphering the widely used 56-bit Data Encryption Standard (DES, a cipher used to protect interbank and other financial transactions), demonstrating that quantum computers need only O(√N) steps for exhaustive search problems that take classical computers O(N) steps [23]. At present, quantum computing has been successfully applied in the fields of secure communication, cryptographic systems, and database search. The United States developed a prototype quantum computer as early as 1999, and computational experts predict that this century will see the emergence and application of quantum computers that are a thousand times faster than electronic computers at solving certain hard problems.

Quantum algorithms are related to classical algorithms; their most essential features are the use of the superposition and coherence of quantum states, as well as the entanglement between quantum bits. They are the product of quantum mechanics in the field of algorithms, and their quantum parallelism is the most essential difference from classical algorithms [24, 25]. In a probabilistic algorithm, the system is no longer in a fixed state but is described by a state probability vector that assigns a probability to each possible state. Given the initial state probability vector and the state transition matrix, the probability vector at any later time is obtained by repeatedly multiplying the probability vector by the transition matrix [26]. A quantum algorithm is similar, except that the probability amplitudes of the quantum states must be considered; because the amplitudes are normalized, a probability amplitude is roughly √N times larger than the corresponding classical probability. The state transitions are realized by the Walsh–Hadamard transform, rotation phase operations, and so on [27].

The complex adaptive system (CAS), proposed by Professor Holland at the Santa Fe Institute (SFI), consists of networks of parallel, interacting agents [28, 29]. Such systems include the human brain, the immune system, ecosystems, cells, ant colonies, political parties, and organizations in human society. The basic idea of a complex adaptive system is that the individuals (elements) in the system, called agents [30], have their own purposes and initiative and are active and adaptive. Agents can "learn" and "accumulate experience" in ongoing interaction with the environment and with other agents, and can change their structure and behavior on the basis of the learned "experience." It is this initiative, together with the interactions among agents and between agents and their environment, that constantly changes both the agents and the environment, and it becomes the basic driving force of system development and evolution. The evolution of the entire system, including the emergence and

differentiation of new levels, the emergence of diversity, and the aggregation of larger agents, derives from this foundation. The basic idea of complex adaptive system theory is that the complexity of a complex adaptive system originates from the adaptability of its agents.

Within NIC research, this book concentrates on evolutionary computation and artificial immune systems, and the following chapters address their theoretical basis and specific application areas.

1.3 Machine learning
1.3.1 Development of machine learning
Langley defined machine learning as a science of the artificial, whose systems improve their performance with experience [1]. Alpaydin described machine learning as a method that optimizes a performance criterion of a computer program using past experience [2]. The problems of traditional machine learning mainly include the following four aspects [3]: (1) understanding and simulating the human learning process; (2) research on the natural language interface between computer systems and human users; (3) the ability to reason with incomplete information, that is, the automatic planning problem; and (4) constructing programs that discover new knowledge.

The initial rise of machine learning can be traced back to the study of artificial neural networks. In 1943, Warren McCulloch and Walter Pitts put forward the hierarchical structure model of the neural network [4], which became the computational model of neural networks and laid the foundation for the development of machine learning. In 1950, Turing, "the father of artificial intelligence," put forward the famous "Turing test" [5], and artificial intelligence became an important research topic in computer science. Frank Rosenblatt put forward the concept of the Perceptron [6] in 1957, defining for the first time the mathematical model of a self-organizing, self-learning neural network together with a learning algorithm; this became the pioneer of neural network models. Arthur Samuel of IBM in the United States designed a checkers program with learning ability in 1959; the program once defeated a champion who had been unbeaten for 8 years in the United States, demonstrating the ability of machine learning. In 1962, Hubel and Wiesel found that the unique neural network structure in the cat cerebral cortex can effectively reduce the complexity of learning, and the famous Hubel–Wiesel biological vision model was proposed [7]; the neural network models proposed later were inspired by it. In 1969, Marvin Minsky and Seymour Papert published the book Perceptrons [8], which had a profound influence on machine learning research. The XOR problem discussed in the book drove perceptron research into a

dead end, and artificial intelligence research based on neural networks entered a low tide for the following 10 years. Nevertheless, the basic ideas of machine learning have had a far-reaching influence to this day.

In 1980, the first International Symposium on Machine Learning was held at Carnegie Mellon University in the United States, marking the worldwide rise of machine learning research. In 1986, the journal Machine Learning was founded, indicating that machine learning was gradually attracting attention, and its development then began to accelerate around the world. In 1986, Rumelhart, Hinton, and Williams jointly published the famous backpropagation (BP) algorithm in Nature [9]. The application of the BP algorithm to a shallow feedforward neural network model was described for the first time; the algorithm markedly reduced the computational complexity of the optimization problem, and by adding a hidden layer it solved the XOR problem, which cannot be solved by a single-layer perceptron. The BP algorithm has since become the most basic algorithm of neural networks, and research into and applications of neural networks began to recover. Hopfield published a paper on the neural network model in 1987 [10] that constructed an energy function and introduced this concept into the Hopfield network; by analyzing the nature of the dynamic system, optimization with the Hopfield network was realized, which promoted further research and development of neural networks. Professor Yann LeCun of Bell Laboratories in the United States proposed the now widely used convolutional neural network (CNN) model in 1989 [11]; an efficient training method based on the BP algorithm was derived and successfully applied to English handwriting recognition. The CNN was the first artificial neural network of its kind to be successfully trained and became one of the most successful and widely used models in later deep learning. In 2006, Geoffrey Hinton and Ruslan Salakhutdinov proposed a deep learning model [12]. They pointed out that artificial neural networks with multiple hidden layers have good feature-learning ability and that the difficulty of training can be overcome by layer-wise initialization, so that overall optimization of the network can be realized. This model opened a new era of deep neural network machine learning. With the support of cloud computing, big data, and advances in computer hardware, deep learning has made impressive progress in many areas and has launched a number of successful commercial applications, such as Google Translate, Apple's voice assistant "Siri," Microsoft's personal voice assistant "Cortana," the "Smile to Pay" face-recognition payment technique of Ant Financial Services Group, and, notably, Google's AlphaGo winning its man–machine Go matches.


1.3.2 Dimensionality reduction
The given data X ∈ ℝ^{m×n} consist of n data vectors x_i of m dimensions, and the intrinsic dimension of the data set is d_e (d_e ≪ m in general). The basic idea of dimensionality reduction is to map the high-dimensional data set to a low-dimensional space by a linear or nonlinear transformation, in order to obtain a d-dimensional representation Y ∈ ℝ^{n×d} (d ≥ d_e in general) while preserving as much of the information in the original high-dimensional data as possible. Dimensionality reduction techniques encompass not only the classical principal component analysis (PCA) [13] and linear discriminant analysis (LDA) [14] but also other methods, such as random projection [15] in compressive sensing and image-sampling strategies. Dimensionality reduction is usually divided into feature extraction (e.g., PCA and LDA) and feature selection (e.g., image sampling). It can avoid the curse of dimensionality to a great extent, making learning tasks such as classification or clustering more stable and efficient and improving generalization. In fact, for data of tens of thousands of dimensions or more, obtaining an effective representation through dimensionality reduction has become increasingly important, and increasingly challenging. Two basic requirements need to be satisfied [16]: first, the dimension of the data should be reduced sufficiently to effectively identify the important components, internal structure, and hidden variables of the data; second, the data can be reduced to two or three dimensions for visualization, so that people can accurately perceive and discover the internal structures and laws hidden in the data.
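As a concrete illustration of linear dimensionality reduction by feature extraction, the following sketch implements PCA with plain NumPy. Note that, for convenience, the code stores samples as rows (an n × m matrix) rather than as columns as in the notation above; the data matrix and the target dimension d are hypothetical.

```python
import numpy as np

def pca_reduce(X, d):
    """Project the n x m data matrix X (samples as rows) onto its d leading
    principal components, returning an n x d low-dimensional representation."""
    X_centered = X - X.mean(axis=0)                  # remove the mean of each feature
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:d]                              # d x m projection basis
    return X_centered @ components.T                 # n x d representation Y

# Toy example: 100 samples of 50-dimensional data reduced to 3 dimensions
X = np.random.default_rng(0).normal(size=(100, 50))
Y = pca_reduce(X, d=3)
print(Y.shape)   # (100, 3)
```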

1.3.3 Sparseness and low-rank
Sparse representation is a theoretical framework proposed by Donoho et al. [17]. It was first used to recover a high-dimensional original signal x ∈ ℝ^{m×1} (m ≫ d) from a low-dimensional observation signal y ∈ ℝ^{d×1}. The optimization problem is as follows:

min ‖x‖₀,  s.t. Ax = y    (1.23)

where ‖·‖₀ denotes the l₀ norm, that is, the number of nonzero elements in a vector, and A ∈ ℝ^{d×m} is the observation matrix. The framework has been widely used in signal- and image-processing fields, such as image denoising, restoration, and super-resolution. The theory also tells us that when a signal has a sparse representation or is compressible, it can be accurately reconstructed from a minimal number of samples or observations.

That is to say, many real signals have substantial redundancy. Similar arguments include Ockham's razor and the principle of minimum description length. Sparse representation has become a research hotspot in the fields of signal processing, machine learning, pattern recognition, and computer vision in recent years. In fact, the concept of sparse representation appeared in Nature in 1996 [18], where sparse regularization was introduced into the least squares problem; the directional image patches obtained in this way can explain the working principle of the primary visual cortex. In the same year, the famous Lasso algorithm [19] was also proposed to solve the least squares problem with sparse constraints.

In recent years, low-rank matrix reconstruction, derived from compressive sensing technology, has become one of the hottest research directions in machine learning, computer vision, signal processing, optimization, and related fields. It has achieved successful applications in image and video processing, computer vision, text analysis, multitask learning, recommendation systems, etc. [20]. The sparsity of a matrix is mainly manifested in two aspects: first, the sparsity of the matrix elements, that is, there are few nonzero elements in the matrix (the l₀ norm of the matrix); second, the sparsity of the singular values of the matrix (the eigenvalues, in the case of a symmetric matrix), that is, there are few nonzero singular values. Consider first the sparsity of the singular values. The matrix to be restored or completed is usually assumed to be low rank, and the observations b come from some linear operation on the matrix. The matrix can then be accurately reconstructed by the following optimization problem:

min rank(X),  s.t. A(X) = b    (1.24)

where rank(·) is the rank function of the matrix and A(·) is a linear operator. A concrete low-rank matrix completion problem may be stated in the following form:

min_X rank(X),  s.t. P_U(X) = P_U(Z)    (1.25)

where U is the set of subscripts of the known elements and the projection P_U(Z) is defined as

[P_U(Z)]_{ij} = Z_{ij} if (i, j) ∈ U, and 0 otherwise    (1.26)

When the sparsity of the matrix elements and of the matrix singular values are considered at the same time, three types of problem model have become very popular in recent years: robust principal component analysis (RPCA), sparse and low-rank matrix decomposition, and

low-rank representation (LRR). The robust principal component analysis model can be described by the following optimization problem:

min_{Z,E} rank(Z) + λ‖E‖_ℓ,  s.t. X = Z + E    (1.27)

where λ > 0 is a regularization parameter and ‖·‖_ℓ denotes a specific regularization strategy, such as the Frobenius norm (‖·‖_F) used to model Gaussian noise [21, 22], the l₀ norm used to deal with a small amount of large-amplitude noise [23], and the l_{2,0} norm that can deal effectively with noise or outliers [24, 25]. However, the model above implicitly assumes that the underlying structure of the observed data is a single low-rank linear subspace [26, 27]. Many real data are distributed over unions of several linear subspaces, and it is not known in advance which subspace each data point belongs to. An extended model of low-rank and sparse matrix factorization, called the low-rank representation (LRR) model, has therefore been proposed; it combines subspace segmentation with noise identification in a single framework so as to handle multiple-subspace problems. The low-rank representation model has the following form:

min_{Z,E} rank(Z) + λ‖E‖_ℓ,  s.t. X = DZ + E    (1.28)

where Z ∈ ℝ^{m×n} is the lowest-rank representation of the given data X, D ∈ ℝ^{m×m} is a dictionary that linearly spans the data space, and m is the number of atoms or bases in the dictionary. Essentially, data with sparse or low-rank structure can be reconstructed or robustly restored from a small number of samples, and the sparse and low-rank assumptions are also applicable to the distribution of high-dimensional data.
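A common way to make problem (1.25) tractable is to relax the rank with the nuclear norm. The following is a minimal singular-value soft-thresholding style sketch of matrix completion in that spirit; the step size, threshold, iteration count, and test matrix are chosen purely for illustration and are not tuned values from any reference.

```python
import numpy as np

def complete_matrix(Z_obs, mask, tau=5.0, step=1.2, iters=500):
    """Fill the missing entries of Z_obs (observed where mask is True) by
    iteratively soft-thresholding the singular values, a convex surrogate
    for minimizing the rank in problem (1.25)."""
    Y = np.zeros_like(Z_obs)
    X = Y
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt         # singular value shrinkage
        Y = Y + step * mask * (Z_obs - X)               # enforce agreement on observed entries
    return X

# Toy example: approximately recover a rank-2 matrix from 60% of its entries
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))    # ground-truth low-rank matrix
mask = rng.random(Z.shape) < 0.6                            # observed positions (the set U)
X_hat = complete_matrix(Z * mask, mask)
print(np.linalg.norm(X_hat - Z) / np.linalg.norm(Z))        # relative recovery error
```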

1.3.4 Semisupervised learning
Traditional machine learning methods fall into two main categories: supervised learning and unsupervised learning. The former assumes that there are data inputs and their corresponding outputs, and the aim is to learn a mapping function that can predict the output for new data samples; typical problems are classification and regression. Unsupervised learning, by contrast, assumes only data inputs without any supervised information, and its purpose is to find properties hidden in the data; typical problems include clustering, probability density estimation, and dimensionality reduction. Sometimes the labeled data are not sufficient for training a supervised learner, while using unsupervised learning alone would waste the information contained in the labeled data. Semisupervised learning (SSL) [28] was put forward to solve this problem. It can exploit the information in small amounts of labeled data and large

amounts of unlabeled data to achieve a better learning effect. The study in Ref. [29] shows that semisupervised learning also fits well with human learning.

SSL is also called learning from labeled and unlabeled data and is a hotspot in machine learning, data mining, and computer vision. Traditional supervised learning uses only labeled data for training, but large numbers of labeled data are usually difficult to obtain: labeling is costly, requires manpower and material resources, and often needs experienced experts to mark the data. Although active learning can effectively reduce the cost of labeling, like traditional supervised learning it cannot exploit the information in unlabeled data. With the development of data acquisition and computer hardware technology, however, it is very easy to collect large numbers of unlabeled samples. SSL can learn from a small amount of labeled data and a large amount of unlabeled data at the same time. Taking semisupervised classification as an example, using unlabeled samples together with labeled samples can build classifiers with better performance. In addition, auxiliary information such as pairwise constraints is easier to obtain than labels; pairwise constraints indicate that two target samples belong to the same class or to different classes, and are commonly known as must-link (ML) and cannot-link (CL) constraints [30-32]. An approach similar to semisupervised learning is transductive learning, which assumes that the unlabeled samples are exactly the test data. In other words, SSL is an open system in which any unknown sample can be predicted, whereas transductive learning is a closed system in which the test data to be predicted are already known at training time [33].

Currently, SSL rests on two basic assumptions: the cluster assumption and the manifold assumption. The cluster assumption states that samples in the same cluster have a good chance of having the same label. The decision boundary should therefore pass through sparsely populated regions of the data, so as to avoid splitting the points of a dense cluster onto the two sides of the boundary; this can be expressed as low-density separation: the decision boundary should lie in low-density regions. Typical methods are transductive support vector machines (TSVMs) [34, 35] and convex relaxation algorithms [36, 37]. The manifold assumption states that all the data lie on or near a low-dimensional submanifold embedded in the high-dimensional space. Unlike the cluster assumption, which focuses on global characteristics, the manifold assumption mainly considers the local characteristics of the model. Many SSL methods describe the inherent geometric structure of the data using the graph Laplacian; typical methods are Gaussian random fields [38], local and global consistency [39], and manifold regularization [40, 41]. Recently, Li et al. [42] applied the pairwise constraint assumption and the clustering assumption to the classification problem.
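The sketch below illustrates graph-based semisupervised classification in the spirit of the local and global consistency method cited above [39]: labels are spread over an affinity graph while staying anchored to the labeled seeds. The RBF affinity, the propagation weight alpha, and the toy data are assumptions made purely for illustration, not the settings of the cited method.

```python
import numpy as np

def propagate_labels(X, y, alpha=0.9, sigma=1.0, iters=100):
    """Graph-based SSL: y holds class indices for labeled samples and -1 for
    unlabeled ones; labels are spread over a normalized RBF affinity graph."""
    n = X.shape[0]
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-dist2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                           # no self-loops
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # symmetrically normalized affinity
    classes = np.unique(y[y >= 0])
    Y = np.zeros((n, classes.size))
    for k, c in enumerate(classes):
        Y[y == c, k] = 1.0                              # one-hot seed labels
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y             # spread, but stay close to the seeds
    return classes[F.argmax(axis=1)]

# Toy example: two Gaussian clusters with one labeled point per cluster
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = -np.ones(40, dtype=int)
y[0], y[20] = 0, 1
print(propagate_labels(X, y))
```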


1.4 Compressive sensing learning
1.4.1 The development of compressive sensing
Compressed sensing (CS) [1, 2] is a new framework for signal acquisition and sensing, and the development of its theory and technology has had a profound impact on signal acquisition, analysis, and processing methods. Compressed sensing is a new sampling theory: it samples randomly at a rate far below the Nyquist sampling frequency to obtain partial information about the signal, and then restores the whole signal through a nonlinear reconstruction algorithm. Compressed sensing offers a new way of thinking about signal acquisition, representation, and processing; it not only makes people re-examine existing signal-processing technology, but also brings a wealth of new ideas about signal acquisition and processing, which greatly promotes the combination of mathematical theory and engineering applications [3], and it will play an important role in the processing of large and complex data.

From the point of view of signal acquisition, compressed sensing provides a scheme for sampling at rates much lower than the Nyquist frequency, which can greatly reduce the costs of signal acquisition, transmission, and storage. Compressed sensing makes it possible to acquire a signal and its information even when data acquisition capacity is poor or data are limited, which greatly extends the range of objects that humans can detect, perceive, and study in the natural environment. For example, the single-pixel camera [4], the Xampling sampling system [5], ultra-low-sampling-rate ultrawideband (UWB) signal detection [6], and other new imaging and sampling devices and systems are under development. Under a given data acquisition capability, we can acquire and transmit more complete information about natural signals and scenes faster, thus greatly promoting the development of related technologies and applications. For example, compressed sensing technology has increased magnetic resonance imaging speed to seven times the original speed [7], which has greatly promoted the development of medical imaging technology. In the field of remote sensing, compressed sensing can improve the imaging resolution of synthetic aperture radar signals under existing imaging conditions, reducing the cost and enhancing the efficiency of imaging.

From the point of view of signal analysis, compressed sensing is closely related to the sparsity and low-dimensional structures of signals. The development of compressed sensing therefore promotes the field of signal analysis, which can provide more effective expression and description for extensive and complex data types, including high-dimensional and massive data and the complex relationships among data. At present, research into the analysis and representation of signals has shifted from spectral analysis methods based on orthogonal bases and frame transforms [8] to

sparse representation analysis based on overcomplete dictionaries and redundant dictionary learning [9, 10]. Compared with the former, the latter can obtain sparser, more flexible, and more adaptive signal representations. On the basis of sparse representation, many new signal-processing methods are also being developed, for example data separation based on sparse representation, face recognition, anomaly detection, image fusion, image restoration, and image super-resolution [11]. In addition, other low-dimensional structures related to sparsity, such as low rank and manifolds, are gaining attention [12].

From the point of view of signal processing, research into compressed sensing reconstruction addresses the problem and models of reconstructing a signal from its linear compressed observations on the basis of its sparsity and sparse representation. Its theory and methods provide new solutions to many signal-processing applications outside compressed sensing and open up new prospects for many applied signal problems, for example image inverse problems in the compressed sensing framework such as deblurring and super-resolution, remote sensing image fusion based on compressed sensing, UWB signal applications, medical image processing, and remote sensing image processing.

Compressed sensing began to gain wide attention with the work of Candès, Romberg, Tao, and Donoho around 2004. The classical compressive sensing theory they put forward shows that signals that are sparse, or can be represented sparsely, can be accurately restored from a small number of nonadaptive compressed observations. The compressed sensing framework mainly includes three parts: sparse representation, compressed observation, and the reconstruction model and method. Among them, the sparsity of the signal and its sparse representation are the basic prerequisites for compressed sensing; the theory of compressed observation studies how to use few nonadaptive observations to capture enough information to reconstruct the signal; and the reconstruction model and method are the core of compressed sensing, studying how to restore and reconstruct the signal from the compressed observations.

1.4.2 Sparse representation
Sparsity and sparse representation are preconditions and prerequisites for compressed sensing. In the theory of compressed sensing, the information contained in a sparse signal can be measured by the sparsity of the signal. Therefore, in compressed sensing applications, sparsity is closely related to the signal sampling rate and recoverability, in contrast to traditional sampling methods, where the data sampling rate is tied to the signal bandwidth through the Nyquist frequency. In traditional sampling methods, the higher the maximum frequency of the signal, the higher the required uniform sampling

frequency. In compressed sensing, by contrast, the sparser the signal is, the fewer compressed observations are needed to reconstruct it accurately. Therefore, in compressive sensing applications for real signals, it is necessary to find or learn a sparse representation of the signal.
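As a small illustration of sparsity in a transform domain, the following sketch builds a signal that is exactly sparse under the discrete cosine transform and verifies that analyzing it with the DCT recovers only a handful of nonzero coefficients. The choice of the DCT as the sparsifying basis, the signal length, and the sparsity level are assumptions made for illustration.

```python
import numpy as np
from scipy.fft import dct, idct

n = 256
rng = np.random.default_rng(0)

# Build a signal that is exactly 8-sparse in the DCT domain.
c_true = np.zeros(n)
support = rng.choice(n, size=8, replace=False)
c_true[support] = rng.normal(size=8)
x = idct(c_true, norm='ortho')                  # dense-looking signal in the time domain

# Analyzing the signal with the DCT recovers the sparse coefficient vector.
c = dct(x, norm='ortho')
print("nonzero coefficients:", int(np.sum(np.abs(c) > 1e-8)), "out of", n)
```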

1.4.3 Compressive observation
The research content of compressive observation theory and technology is how to use as few nonadaptive observations as possible while still capturing sufficient signal information for reconstruction. In the traditional signal sampling mode based on the Nyquist sampling theorem, signals are first sampled at high speed to obtain a large number of samples; then all the collected samples are compressed by signal encoding, which discards a large proportion of them, before the signal is transmitted and processed further [13]. In compressed sensing, however, sampling and compression are performed simultaneously: signal samples are obtained by low-rate nonadaptive linear projections, namely inner products between the signal and the observation vectors. Therefore, compared with traditional signal sampling, compressive sensing greatly reduces the cost of sampling and transmission in the sampling phase. In the signal reconstruction phase, traditional methods need only simple decoding and interpolation operations to restore the signal stably, whereas in compressive sensing the reconstruction relies on the design of a reconstruction algorithm and uses complex numerical methods. In other words, compressed sensing reduces the cost of signal acquisition and transmission compared with the traditional approach but increases the computational complexity required for signal recovery.
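The short sketch below illustrates the nonadaptive linear observation step just described: each observation is simply an inner product between the signal and a random measurement vector. The Gaussian measurement matrix and the 4:1 compression ratio are arbitrary illustrative choices; the reconstruction of the signal from such observations is illustrated in the sketch at the end of Section 1.4.4.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 256, 64                      # signal length and number of compressed observations
x = np.zeros(m)
x[rng.choice(m, 10, replace=False)] = rng.normal(size=10)   # a 10-sparse signal

# Nonadaptive compressed observation: each measurement is <phi_i, x>.
Phi = rng.normal(size=(d, m)) / np.sqrt(d)   # random Gaussian observation matrix
y = Phi @ x                                  # d observations instead of m samples
print(y.shape)                               # (64,)
```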

1.4.4 Sparse reconstruction
Signal reconstruction is the core of compressed sensing; its aim is to develop methods and techniques for estimating the original signal from its compressed observations. Unlike traditional sampling methods, which mainly use linear interpolation through the sinc function to recover the signal, the reconstruction in compressed sensing usually requires solving a highly nonlinear reconstruction optimization problem at considerable computational cost. Nowadays, compressed sensing is moving from theoretical research toward practical signal applications, with the aim of establishing compressed sensing frameworks and processing methods for actual signals and applications, so as to realize the reconstruction and processing of more extensive and complex signals. In this process, signal priors and the application environment are the key factors. How to mine and effectively express prior knowledge, how to establish solution models that combine multiple kinds of prior information, and how to design efficient reconstruction methods are the core

contents of the research. In the framework of the new generation of structured compressed sensing [14], prior and structural information are introduced into the three basic parts of compressive sensing, namely sparse representation, compressed observation, and the reconstruction model. Specific measures include establishing structured redundant dictionaries to obtain structured sparse representations of the signal; establishing observation methods adapted to the signal's structure so that all the information can be captured with fewer observations, and in particular building practical hardware sampling systems for analog signals; and mining the structural characteristics of the signal to establish structured sparse reconstruction models. In the review by Liu, Wu, et al. [15], the main ideas of structured compressed sensing were summarized as follows: based on a structured dictionary and sparse representation, a structured observation method that matches the structure and information of the signal is used to reconstruct the signal from its structural prior information. Today, establishing application-oriented structured reconstruction models remains a hot topic. In addition, the design of a sparse recovery model and the construction of the corresponding solver are inseparable: the effect and performance of signal recovery depend not only on how well the recovery model fits the practical application, but also on the performance of the algorithm established to solve the model.
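To make the reconstruction step concrete, the sketch below recovers a sparse signal from random compressed observations with orthogonal matching pursuit, one of the greedy methods discussed in Chapter 5. The problem sizes, the Gaussian observation matrix, and the fixed-sparsity stopping rule are illustrative assumptions.

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedily pick the column of Phi most
    correlated with the residual, then refit the coefficients by least squares."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        correlations = np.abs(Phi.T @ residual)
        correlations[support] = 0.0                 # do not pick an atom twice
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

# Compressed observation followed by greedy reconstruction
rng = np.random.default_rng(0)
m, d, k = 256, 128, 10
x = np.zeros(m)
x[rng.choice(m, k, replace=False)] = rng.normal(size=k)     # k-sparse ground truth
Phi = rng.normal(size=(d, m)) / np.sqrt(d)                   # observation matrix
y = Phi @ x                                                  # compressed observations
x_hat = omp(Phi, y, k)
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```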

1.5 Applications
1.5.1 Community detection
In the real world, many systems can be represented by complex networks, such as collaboration networks, the World Wide Web, biological networks, communication networks, transportation networks, social networks, and so on. Such networks are characterized by community structure: the connections between nodes within the same community are relatively dense, while the connections between nodes in different communities are sparse. This statement is rather informal, and there is generally no strict definition. The most formal definition [1] considers the degree k_i = Σ_j A_ij, where A is the adjacency matrix of the network G: the entry (i, j) is 1 if there is an edge between node i and node j, and 0 otherwise. Given a subgraph S ⊂ G, the degree of node i in S can be decomposed as k_i(S) = k_i^in(S) + k_i^out(S), where k_i^in(S) = Σ_{j∈S} A_ij is the number of connections between node i and the other nodes in S, and k_i^out(S) = Σ_{j∉S} A_ij is the number of connections between node i and nodes that do not belong to S. The subgraph S is a strong community if k_i^in(S) > k_i^out(S) for every i ∈ S, and a weak community if Σ_{i∈S} k_i^in(S) > Σ_{i∈S} k_i^out(S). Thus every node of a strong community has more connections inside the community than outside it, whereas a weak community only requires this to hold for the community as a whole (a short numerical sketch of this test is given after Fig. 1.5). Fig. 1.5 shows two typical complex networks that are commonly used in research.



Figure 1.5 Example of typical complex networks. (A) American college football network; (B) Books about US politics network.

The American college football network [2] consists of 115 nodes and 616 edges; each node represents an American college football team, and each edge represents a match between the two connected teams. The network is divided into 12 categories. The books about US politics network represents 105 American political books on sale at Amazon.com; the network links books bought by the same buyers, and it was divided into three categories by Newman [3]. In a real network, nodes belonging to the same community may have similar properties or similar mechanisms. For example, communities in social networks represent real social groups with similar backgrounds or interests, communities in citation networks represent related papers on the same theme, and in electronic circuit networks or biochemical networks communities may be functional units of some kind. Discovering the community structure of a network helps us to understand and exploit these networks more effectively and to classify their nodes, and new phenomena and knowledge can be discovered in the process, deepening our understanding of the relationship between network structure and function. Complex network community detection is thus an important method for describing and studying the structure and behavior of complex systems. In recent years it has quickly become a research hotspot in the scientific community and has been used to solve problems in information communication, network search, signal transmission, infectious disease control, and the analysis of socially significant events.
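The sketch below applies the strong/weak community test defined above to a tiny synthetic graph; the toy adjacency matrix is invented purely for illustration.

```python
import numpy as np

def community_strength(A, S):
    """Classify the node set S as a strong or weak community of the graph with
    adjacency matrix A, using the internal/external degree test."""
    S = np.asarray(sorted(S))
    outside = np.setdiff1d(np.arange(A.shape[0]), S)
    k_in = A[np.ix_(S, S)].sum(axis=1)           # edges from each node in S to S
    k_out = A[np.ix_(S, outside)].sum(axis=1)    # edges from each node in S to the rest
    if np.all(k_in > k_out):
        return "strong community"
    if k_in.sum() > k_out.sum():
        return "weak community"
    return "not a community"

# Toy graph: two groups of four nodes, densely connected inside, one bridge edge.
A = np.zeros((8, 8), dtype=int)
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1
A[3, 4] = A[4, 3] = 1                             # single edge between the two groups
print(community_strength(A, [0, 1, 2, 3]))         # -> strong community
```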

Community detection was first applied in computer science and originated from graph partitioning problems. The division of roles in social networks by White et al. in 1976 and the VLSI design method of Fjallstrom in 1998 were both aimed at what are essentially community detection problems. In 2002, Girvan and Newman published an article entitled "Community structure in social and biological networks," which triggered an upsurge of research into community detection in complex networks [4]. The first class of community detection algorithms is graph partitioning, mainly including the Kernighan–Lin algorithm and the spectral bisection algorithm. The second class, also known as hierarchical clustering in sociology, aims to reveal such topological structures in complex networks and can be divided into agglomerative methods and divisive methods. A common practice is to abstract this kind of problem as an optimization problem: by establishing an appropriate model, we can obtain the optimal solution, or a near-optimal solution, by optimizing the model. For instance, a primary merit of multiobjective optimization algorithms over single-objective optimization algorithms is that they produce a set of optimal solutions, called Pareto optimal solutions, in a single run. Normally this set covers the optimal solution generated by a single-objective algorithm. To determine a final solution, an appropriate one is selected from the Pareto optimal solutions according to the preferences of the decision-makers. In most cases, one purpose of using multiobjective optimization for community detection is to mitigate the resolution limit that commonly affects single-objective optimization algorithms.

1.5.2 Capacitated arc routing optimization
Arc routing problems (ARPs) are among the classical constrained multiobjective optimization problems [5]. In many disciplines, such as electronic information, artificial intelligence, and biomedicine, we encounter a large number of constrained multiobjective optimization problems, and handling such problems well is of great significance for the field of electronic information. The ARP can be summarized as follows. Given an undirected connected graph containing a set of task edges and a special vertex called the depot, several vehicles with the same capacity start from the depot, serve the task edges, and return to the depot; the goal is to find the routes that minimize the vehicles' total cost [6]. The origin of the ARP is the Seven Bridges of Königsberg problem [7]. Two islands lie in the river at Königsberg, and seven small bridges allowed people to walk across the river. One day someone wondered whether it was possible to start from one of the islands, cross each of the seven bridges exactly once, and eventually return

to the starting island. In 1736 Euler proved, through what became the one-stroke problem in graph theory, that no such route exists. Building on this problem, Golden added a more realistic condition, a capacity constraint, in 1981. Since then the typical ARP model, arc routing with capacity constraints, has attracted the attention of researchers [8]; it is called the capacitated arc routing problem (CARP), in which all the task edges of a given graph must be served by vehicles with limited capacity. Arc routing problems with capacity constraints are more practical and widely applicable, for example to snow removal in winter, garbage collection, and urban street sprinkling. A simple CARP solution is illustrated in Fig. 1.6, where the bold lines represent task edges and the dotted lines indicate nontask edges.

CARP has a wide range of applications, such as winter gritting, waste collection, and snow removal, and it has been studied intensively for several decades. The solutions proposed for the basic capacitated arc routing problem mainly include heuristic algorithms and metaheuristic algorithms. Heuristic algorithms converge efficiently to a local optimum and obtain relatively good solutions, which makes them well suited to small-scale CARP instances; they include Path-Scanning [9], Augment-Merge [10], and Ulusoy's tour-splitting technique [11]. However, with the trend toward big data, heuristic algorithms can only converge to a local optimum of the problem and fail to reach the lower bound of the test cases. To overcome this weakness, scholars have proposed a series of higher-level metaheuristics in recent years, including tabu search algorithms [12], the tabu scatter search algorithm [13], the variable neighborhood search algorithm [14], the guided local search algorithm [15], memetic algorithms (MAs) [16], and the global repair operator [17]. These metaheuristic algorithms have shown their advantages and achieved good results in solving capacitated arc routing problems.


Figure 1.6 A simple scheme for CARP and its coding.

In practical applications, there are situations in which the relevant departments not only want to minimize the total cost, but must also take other factors into account. For example, in the garbage collection case in Troyes, France, mentioned in Ref. [18], garbage trucks start from the same garage and each runs a route to collect the garbage. The relevant departments not only require that the total collection cost be minimal, but also hope that all the garbage is collected as soon as possible, so that other tasks can easily be assigned to the staff. Based on this consideration, Lacomme et al. established a corresponding problem model in 2006, which we call the multiobjective capacitated arc routing problem (MO-CARP). In this model, minimizing the total cost over all routes is one important objective; at the same time, it also aims to minimize the maximum cost of any single vehicle's route. These two objectives conflict and cannot be optimized simultaneously, so we usually look for a solution that achieves a good balance between them. Multiobjective CARP is compounded by the need to solve a multiobjective optimization problem and a combinatorial optimization problem simultaneously, which makes it extremely challenging.

In 1989, Moscato proposed the memetic algorithm (MA) [19], which combines a population-based global search with an individual-based local heuristic search. For an excellent review of work in the field of "adaptive MAs," see Reference [20]. MAs are widely applied to NP-hard combinatorial problems. Tang et al. [21] proposed an MA with extended neighborhood search (MAENS), which is superior to a number of other state-of-the-art algorithms. MAENS employs a novel local search operator that is capable of large step sizes and thus has the potential to search the solution space more efficiently [22]. However, this algorithm is only intended for single-objective CARP. To overcome this shortcoming, D-MAENS, based on problem decomposition, was presented by Mei et al. to solve multiobjective CARP [23]. MAENS is incorporated into D-MAENS, whose framework is similar to that of MOEA/D, and fast nondominated sorting is combined with the crowding distance method in D-MAENS [24]. The performance of D-MAENS is evidently better than that of LMOGA; however, there remained room for improvement in the offspring update and allocation mechanisms. An improved decomposition-based memetic algorithm (IDMAENS) [25] was therefore presented to further improve D-MAENS. IDMAENS adopts an elitist strategy, which means that it retains an optimal solution for each decomposed problem according to the direction vector of each subproblem while seeking solutions, and the old solution is immediately replaced. When solving each subproblem, the optimal solution of one

subproblem can provide favorable information to adjacent subproblems via a neighborhood sharing approach, so as to accelerate convergence.
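To make the two conflicting MO-CARP objectives concrete, the following sketch evaluates a candidate solution by computing the total cost and the maximum single-route cost, with deadheading distances obtained from all-pairs shortest paths. The small graph, demands, routes, and capacity are invented for illustration and are not taken from any benchmark instance.

```python
import numpy as np

def shortest_paths(cost):
    """All-pairs shortest path costs (Floyd-Warshall) on a dense cost matrix."""
    d = cost.copy()
    for k in range(d.shape[0]):
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d

def evaluate(routes, cost, demand, capacity, depot=0):
    """Return (total cost, max route cost) of a CARP solution, or None if a route
    exceeds the vehicle capacity. Each route is a list of directed task edges (u, v)."""
    sp = shortest_paths(cost)
    route_costs = []
    for route in routes:
        load = sum(demand[u, v] for u, v in route)
        if load > capacity:
            return None                                  # infeasible solution
        c, pos = 0.0, depot
        for u, v in route:
            c += sp[pos, u] + cost[u, v]                 # deadhead to the task, then serve it
            pos = v
        route_costs.append(c + sp[pos, depot])           # return to the depot
    return sum(route_costs), max(route_costs)

# Toy instance: 4 vertices, depot 0, three task edges, vehicle capacity 5.
INF = 1e9
cost = np.array([[0, 2, INF, 3],
                 [2, 0, 1, INF],
                 [INF, 1, 0, 2],
                 [3, INF, 2, 0]], dtype=float)
demand = np.zeros((4, 4))
demand[0, 1] = demand[1, 0] = 3
demand[1, 2] = demand[2, 1] = 2
demand[2, 3] = demand[3, 2] = 4
routes = [[(0, 1), (1, 2)], [(3, 2)]]                    # two vehicles
print(evaluate(routes, cost, demand, capacity=5))         # -> (14.0, 8.0)
```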

1.5.3 Synthetic aperture radar image processing
Synthetic aperture radar (SAR) is a breakthrough achievement in the field of remote sensing. It has changed the basic functions of radar and made radar an important means of obtaining information. SAR systems have very strong penetration capability and can provide high-resolution surface information in harsh environments; they can obtain two-dimensional, high-resolution images of ground targets in real time, offer a variety of polarization modes, and can operate from variable viewing angles. The gray value of a SAR image represents the intensity of the target's scattering function: the weaker the scattering, the lower the gray value. This property helps us to obtain the information we need by processing the SAR image.

In recent years, the rapid advance of remote sensing technology has greatly improved the resolution of SAR images. Compared with the resolutions of several tens of meters in the past, the spatial resolution of high-resolution SAR images can now reach 10 m or less, and internationally advanced SAR systems can provide image data with a spatial resolution of less than 1 m. With this increase in resolution, high-resolution SAR images exhibit more specific and detailed texture features and contain more abundant image information for later processing, which facilitates SAR image interpretation. At the same time, the increase in spatial resolution also brings new challenges for researchers in high-resolution SAR image processing.

Because traditional SAR uses a fixed transmit-receive combination, that is, a single polarization and a single channel, the ground surface information it obtains is correspondingly limited. Radar is an active sensing device, so the transmitting and receiving polarizations of the antenna can be adjusted to obtain the best polarization information and thereby improve detection performance. At the same time, new techniques for measuring the polarization scattering characteristics of targets with radar provide a powerful means of analyzing the characteristics of different types of ground objects. Compared with earlier SAR, polarimetric SAR (polSAR) is a more complex radar system with multiple channels and multiple parameters; the corresponding polarization scattering matrix, which contains amplitude and phase information, is obtained by measuring the microwaves scattered from the ground target. This has led to a considerable increase in observation capability and promoted rapid development in the corresponding fields. In recent years, many countries have attached great importance to the theoretical research and technical application of polSAR systems and

have successfully launched many satellites with fully polarimetric Earth observation capability. PolSAR is deployed in two main ways, airborne and spaceborne, and some developed countries have established representative airborne SAR systems to better observe the Earth. Because of the obvious advantages of SAR systems, such as all-weather, all-day operation and wide coverage, they are widely used in military reconnaissance, environmental monitoring, urban change analysis, crop growth monitoring, and other fields. Data are the carrier of both useful and redundant information; to acquire and mine the useful information in SAR image data more comprehensively, SAR image-processing technology must keep pace with the development of SAR systems.

SAR image retrieval: A large number of SAR images are produced every day by Earth observation (EO) satellites, and it is hard to find the useful information in those SAR images rapidly and accurately by hand. To handle this problem, a popular image-processing technology has been introduced: content-based image retrieval (CBIR). CBIR is a comprehensive technique, ranging from similarity metric learning to automatic annotation [26]. As an application of CBIR, remote sensing (RS) image retrieval (RSIR) is becoming increasingly mature, and many RSIR methods have been proposed over the last few years. An interactive learning and probabilistic retrieval method [27] was presented for RS image archives, in which the cover type was defined by the users and a Bayesian network was adopted to link users' interests and image content. Shyu et al. [28] proposed an RSIR system named the geospatial information retrieval and indexing system (GeoIRIS); in addition to normal image content retrieval, it could also accomplish more complicated tasks such as multiobject relationship analysis. A SAR image content retrieval method was introduced in Reference [29], where a speckle-robust similarity distance was used to measure the similarities between SAR images. Jiao et al. [30] presented a general-purpose SAR image retrieval method based on semisupervised learning and a region-based similarity measure. A fast RSIR method [31] was proposed with the help of a hashing technique, aiming at searching RS images in large archives. With the development of RSIR methods, researchers have increasingly found that post-processing algorithms can be added after the existing retrieval methods to enhance their performance; the most popular is image reranking. Roughly speaking, existing image reranking methods can be divided into two groups [32]: (1) example-dependent and (2) example-independent. In the first group, the reranking problem is usually treated as binary classification: the user selects positive and negative samples to train a machine learning method for reranking. This kind of reranking is essentially relevance feedback (RF) or pseudo-relevance feedback (PRF), which is popular

in the RS community. In contrast, the image reranking methods in the second group rerank the images by discovering the relationships among them: only the top-ranked images of the initial retrieval results are considered, and it is not necessary to select examples during the reranking process.

SAR image change detection: Change detection is an important technique that identifies differences in target status, or analyzes information about the same geographical location acquired at different times, in order to identify changes in the surface. After years of development, change detection technology has made gratifying achievements in both theory and practice; its methods have become increasingly mature and many new techniques have been introduced. Change detection has been widely applied to land-use and land-cover (LULC) analysis, video surveillance, medical diagnosis, disaster relief, agricultural inspection, urban planning, and so on. However, owing to the interference of scattering echoes, speckle noise is inevitably present in SAR images, which makes changes difficult to detect. Unsupervised SAR change detection generally involves three basic steps: speckle reduction, difference image generation, and classification, and there have been many good achievements in these areas over the past few decades. In SAR image despeckling, many classic methods have been proposed, such as the widely used Lee filter, Kuan filter, and gamma MAP filter; later, inspired by nonlocal ideas, many nonlocal redundancy-based denoising methods were proposed, such as PPB filters, SAR-BNL filters, SAR-BM3D filters for single-polarization SAR images, and statistical-distance nonlocal mean filters. Other work has studied difference image generation, among which ratio and logarithmic-ratio operators are widely used; there are also methods that construct difference images based on the statistical distribution of multitemporal SAR images. It is worth pointing out that false alarms and missed alarms are two opposing concepts in change detection. A constant false alarm rate algorithm based on statistical hypothesis testing can therefore meet different practical needs by adjusting the false alarm probability, a property that is very important for decision-makers. For example, in the early stages of a sudden disaster, the limited rescue force must be allocated to areas with a very low probability of false alarm so that rescue resources are not wasted, whereas when rescue resources are adequate it is necessary to ensure a very low probability of missed alarms so that all affected areas receive timely assistance. The selection of the threshold therefore needs to balance false alarms and missed alarms; in the constant false alarm rate algorithm, the threshold is a function of the false alarm probability, and threshold adjustment is achieved by changing that probability. However, the existing change detection algorithms still have some drawbacks.

First, existing algorithms do not consider the difference in residual noise between homogeneous regions and textured regions. Second, because the statistical distribution after despeckling is difficult to estimate, few existing constant false alarm rate algorithms include a despeckling step, so they are ineffective for SAR images under strong noise interference.

SAR image segmentation: With the continuous innovation and development of synthetic aperture radar systems, SAR image segmentation, an important step in SAR image processing, has drawn more and more attention from researchers. The pixel-level SAR image segmentation task considered in this chapter uses gray-level intensity information to divide the image into nonoverlapping but connected homogeneous regions, where the pixels in the same region share the same class label. A detailed and accurate segmentation result is of great significance for further understanding and processing of SAR images. Many methods have been used to solve this problem, such as level sets, thresholding, Markov random fields (MRFs), support vector machines (SVMs), clustering, and so on, but these traditional approaches have several disadvantages. It is arduous to choose an appropriate parameter for the thresholding method, and the MRF and SVM methods require labeled samples that are generally unavailable for SAR images. Wang et al. [33] presented higher-order neighborhood-based triplet Markov fields (HN-TMF) for SAR image segmentation, using fuzzy c-means (FCM) to initialize the label of each pixel; however, the segmentation result is sensitive to the initial labels. A kernel FCM algorithm with pixel intensity and location information (ILKFCM) [34] has also been proposed for SAR image segmentation. Nevertheless, these are pixel-based clustering methods, which have high time complexity for large-scale SAR images. With the development of SAR technology, large data sets such as SAR images can be obtained easily. Sparse representation (SR) has recently attracted growing interest as an effective technique for analyzing the sparsity of large data. Compared with traditional methods, SR has several advantages: it can efficiently reveal the global similarity of the data, reduce noise, and preserve detail, thereby avoiding the negative influence of the filtering on which many existing methods rely. According to research on the SRC and SSC methods, the magnitude of the sparse representation coefficients can reveal which atoms in the dictionary belong to the same class or subspace as the test sample. This framework is used in an algorithm called multitask low-rank affinity pursuit (MLAP) [35] to segment a single natural image. Moreover, SR-based spectral graph segmentation is effective, insensitive to its parameters, and requires no training phase. The success of SR and spectral graph methods in image classification motivates us to apply them to SAR image segmentation.
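As an example of the difference image generation step discussed above, the following sketch computes the widely used log-ratio operator for two coregistered SAR intensity images and thresholds it into a change map. The synthetic speckled images, the multiplicative gamma speckle model, and the simple global threshold are illustrative assumptions rather than a recommended change detection pipeline.

```python
import numpy as np

def log_ratio_change_map(img1, img2, eps=1e-6, k=1.5):
    """Generate a change map from two coregistered SAR intensity images using
    the log-ratio difference image and a crude global threshold."""
    lr = np.abs(np.log((img2 + eps) / (img1 + eps)))   # log-ratio difference image
    threshold = lr.mean() + k * lr.std()               # simple global threshold
    return lr > threshold                              # True where a change is declared

# Synthetic example: multiplicative speckle on a flat scene, with a changed square.
rng = np.random.default_rng(0)
scene1 = np.ones((128, 128))
scene2 = scene1.copy()
scene2[40:60, 40:60] = 8.0                                            # reflectivity change
img1 = scene1 * rng.gamma(shape=4.0, scale=0.25, size=scene1.shape)   # speckled image, time 1
img2 = scene2 * rng.gamma(shape=4.0, scale=0.25, size=scene1.shape)   # speckled image, time 2
change = log_ratio_change_map(img1, img2)
print("changed pixels detected:", int(change.sum()))
```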


1.5.4 Hyperspectral image processing
In 1983, the NASA Jet Propulsion Laboratory (JPL) developed the world's first airborne imaging spectrometer (AIS-1), from which the first hyperspectral image was obtained and applied to mineral exploration, vegetation monitoring, chemical analysis, and other fields. This marked the first generation of practical imaging spectrometers and opened a new era of hyperspectral remote sensing technology. Different types of materials are made of different substances and therefore exhibit different spectral characteristics. Hyperspectral images contain a wealth of spectral information and are better able to distinguish different ground objects than ordinary optical images. From an application point of view, hyperspectral remote sensing has broad prospects in geological exploration, agricultural development, marine and atmospheric monitoring, space exploration, and military activities. A hyperspectral remote sensing image has many advantages, such as broad spectral coverage, high spectral resolution, and high signal-to-noise ratio, and the development of hyperspectral data acquisition technology provides a reliable foundation for applications in various fields. However, compared with the rapid development of data acquisition technology, the capabilities for analyzing, processing, and recognizing remote sensing information show obvious deficiencies and lags; existing processing technology still cannot fully realize the value of remote sensing information or meet people's needs. It is therefore an important and difficult task in contemporary remote sensing to propose image analysis models and methods that can more comprehensively mine the information in remote sensing data and improve the accuracy of analysis and recognition. For different applications, the corresponding processing technologies need to be improved.

Hyperspectral image denoising: An HSI can be considered as a set of grayscale images whose entries are the spectral responses. Hyperspectral images are contaminated by noise with different statistical properties arising from the imaging system, photon effects, limited light, and calibration error. The noise greatly reduces the usefulness of the HSI and makes subsequent processing, such as agricultural assessment, target detection, ground-cover classification, and mineral exploitation, more difficult. Reducing noise is therefore an essential research topic in hyperspectral image analysis. Over the past two decades, various approaches have been proposed for HSI noise reduction. Traditional two-dimensional denoising algorithms have been widely used to remove noise from HSI, such as total variation, nonlocal means, the wavelet transform, and K-singular value decomposition (KSVD). These methods destroy the

latent high-dimensional structure of the HSI, which leads to a great loss of spectral correlations. Lam et al. [36] indicated that spectral domain statistics can improve the quality of the restored HSI. In this regard, some techniques take the HSI as a 3D data cube, which can fully exploit the correlations in the spectral domain. Notably, their efficacy and popularity for image recovery have been demonstrated. This can be attributed to the fact that the noises are uniformly spread in many domains while the valid data in corrupted images are intrinsically sparse [37]. 3D sparse coding has been exploited to denoise HSI [38]; it explores the spectral information and obtains competitive results by extracting different patches. In recent years, deep learning has been widely used in HSI analysis. A deep dictionary learning method has been developed to effectively suppress the noise in HSI [39], which consists of a hierarchical dictionary, feature denoising, and fine-tuning. Hyperspectral image dimensionality reduction: With the evolution of optical sensing technology, more elaborate spectral data can be obtained at high spectral resolution. Besides abundant information, high computational cost and the Hughes phenomenon also arise with increasing spectral resolution. To deal with the above problems, dimensionality reduction has become a common preprocessing procedure for HSI. DR methods for hyperspectral imagery can be divided into two categories: band selection and feature extraction. For band selection, a few important bands are selected according to some criteria to replace the original bands, so the physical significance of the bands can be preserved. Meanwhile, for feature extraction, new features are learned from the original bands; they lose the physical significance of the bands but contain more discriminant information beneficial for classification. Band selection is one kind of dimensionality reduction method for HSI, which selects a few important bands instead of all bands to reduce the dimension of the HSI. Different from feature extraction, the physical significance of the original bands can be preserved during dimensionality reduction. Ranking and clustering are two basic ideas used in band selection. Ranking-based band selection ranks all bands according to their importance scores and selects the few bands with the highest scores. Many measurements, such as entropy, mutual information, and variance, have been used to define the importance score of bands. However, ranking-based band selection ignores the redundancy among the selected bands. Clustering-based band selection divides all bands into clusters and selects the representative band of each cluster to form the band subset, so the redundancy of the selected bands is minimized, but their importance is not considered. To reduce the limitations of these two kinds of methods, some methods merging ranking and clustering together have been proposed.

Considering the utilization of label information, feature extraction methods can be divided into unsupervised, supervised, and semisupervised methods. Supervised and semisupervised methods both need labeled samples during the learning procedure, which requires some artificial work to label pixels. In contrast, because they do not need any label information, unsupervised approaches have attracted great attention in recent years. Hyperspectral image classification: For each image pixel, the dedicated spectral information provides the ability to exactly identify and distinguish different materials of interest. Therefore, classification of HSI has been broadly applied in various fields, such as geological science, precision agriculture, environmental monitoring, and military target detection. To deal with HSI classification, various pixel-based methods have been developed over the last decades, including maximum likelihood, independent component analysis, the support vector machine (SVM), artificial neural networks, the extreme learning machine (ELM), the sparse representation classifier (SRC), the collaborative representation classifier (CRC), and multinomial logistic regression. The core idea of these approaches is that different materials have specific spectral responses at certain spectral bands and that the pixels in an HSI are independent of each other. Among these approaches, ELM has shown popularity and effectiveness in supervised HSI classification by training single-hidden-layer feedforward neural networks (SLFNs). Compared with more traditional computational intelligence techniques, including SVM, ELM has proved to be a competitive alternative in terms of generalization performance, learning speed, and computational scalability. Based on l1-minimization, SRC can be performed by solving a convex optimization problem, and the associated sparse coefficients can also be solved by greedy algorithms such as orthogonal matching pursuit (OMP). In Reference [40], the collaborative representation classifier (CRC) was proposed for classification, which points out that the improvement in classification performance is attributable to the "collaborative" nature of the training data rather than the sparsity constraint. Nevertheless, most existing SR-based and CRC-based methods take the minimal reconstruction residues, calculated from the recovered coefficients, as the classification criterion. However, the pixels in the same class of an HSI deviate from each other in the spectral domain, which significantly weakens the discriminative ability of pixel-based approaches and usually makes the classification maps appear noisy. Meanwhile, the scarcity of class labels greatly limits the performance of supervised classifiers, which rely on the number and selection of training samples. In this context, spectral-spatial classification has attracted great attention; it enhances the feature space for improving classification performance by exploiting both the spectral signatures and spatial information such as textural features and neighbor similarity.


References








Applications
[1] Radicchi F, Castellano C, Cecconi F, et al. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America 2004;101(9):2658-63.
[2] Shang R, Bai J, Jiao L, et al. Community detection based on modularity and an improved genetic algorithm. Physica A: Statistical Mechanics and Its Applications 2013;392(5):1215-31.
[3] Newman MEJ. Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America 2006;103:8577-82.
[4] Girvan M, Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 2002;99(12):7812-26.
[5] Shang R, Wang Y, Wang J, et al. A multi-population cooperative coevolutionary algorithm for multiobjective capacitated arc routing problem. Information Sciences 2014;277:609-42.
[6] Dror M. Arc routing: theory, solutions and applications. Boston: Kluwer Academic Publishers; 2000.
[7] Euler L. Solutio problematis ad geometriam situs pertinentis. Commentarii Academiae Scientiarum Petropolitanae 1736;8(53):128-40.
[8] Golden BL, Wong RT. Capacitated arc routing problems. Networks 1981;11(3):305-16.
[9] Golden BL, DeArmon JS, Baker EK. Computational experiments with algorithms for a class of routing problems. Computers and Operations Research 1983;10(1):47-59.
[10] Eglese RW. Routing winter gritting vehicles. Discrete Applied Mathematics 1994;48(3):231-44.
[11] Ulusoy G. The fleet size and mix problem for capacitated arc routing. European Journal of Operational Research 1985;22(3):329-37.
[12] Hertz A, Laporte G, Mittaz M. A tabu search heuristic for the capacitated arc routing problem. Operations Research 2000;48(1):129-35.
[13] Greistorfer P. A tabu scatter search meta-heuristic for the arc routing problem. Computers and Industrial Engineering 2003;44(2):249-66.
[14] Hertz A, Mittaz M. A variable neighborhood descent algorithm for the undirected capacitated arc routing problem. Transportation Science 2001;35(4):425-34.

[15] Beullens P, Muyldermans L, Cattrysse D, Van Oudheusden D. A guided local search heuristic for the capacitated arc routing problem. European Journal of Operational Research 2003;147(3):629-43.
[16] Lacomme P, Prins C, Ramdane-Cherif W. Competitive memetic algorithms for arc routing problems. Annals of Operations Research 2004;131(1):159-85.
[17] Mei Y, Tang K, Yao X. A global repair operator for capacitated arc routing problem. IEEE Transactions on Systems, Man, and Cybernetics Part B 2009;39(3):723-34.
[18] Lacomme P, Prins C, Ramdane-Cherif W. Competitive memetic algorithms for arc routing problems. Annals of Operations Research 2004;131(1):159-85.
[19] Moscato P. On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms. Caltech Concurrent Computation Program, Publication Rep 1989;790.
[20] Ong Y, Lim M, Zhu N, Wong K. Classification of adaptive memetic algorithms: a comparative study. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2006;36(1):141-52.
[21] Tang K, Mei Y, Yao X. Memetic algorithm with extended neighborhood search for capacitated arc routing problems. IEEE Transactions on Evolutionary Computation 2009;13(5):1151-66.
[22] Mei Y, Tang K, Yao X. A global repair operator for capacitated arc routing problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2009;39(14):723-34.
[23] Mei Y, Tang K, Yao X. Decomposition-based memetic algorithm for multi-objective capacitated arc routing problems. IEEE Transactions on Evolutionary Computation 2011;15(2):151-65.
[24] Deb K, Agrawal S, Pratap A, Meyarivan T. A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182-97.
[25] Shang RH, Wang J, Jiao LC, Wang YY. An improved decomposition-based memetic algorithm for multi-objective capacitated arc routing problem. Applied Soft Computing 2014;19:343-61.
[26] Datta R, Joshi D, Li J, Wang JZ. Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys 2008;40(2):5.
[27] Schröder M, Rehrauer H, Seidel K, Datcu M. Interactive learning and probabilistic retrieval in remote sensing image archives. IEEE Transactions on Geoscience and Remote Sensing 2000;38(5):2288-98.
[28] Shyu C-R, Klaric M, Scott GJ, Barb AS, Davis CH, Palaniappan K. GeoIRIS: geospatial information retrieval and indexing system - content mining, semantics modeling, and complex queries. IEEE Transactions on Geoscience and Remote Sensing 2007;45(4):839-52.
[29] Espinoza-Molina D, Chadalawada J, Datcu M. SAR image content retrieval by speckle robust compression based methods. In: EUSAR 2014; 10th European conference on synthetic aperture radar; proceedings of. VDE; 2014. p. 1-4.
[30] Jiao L, Tang X, Hou B, Wang S. SAR images retrieval based on semantic classification and region-based similarity measure for earth observation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2015;8(8):3876-91.
[31] Demir B, Bruzzone L. Hashing-based scalable remote sensing image search and retrieval in large archives. IEEE Transactions on Geoscience and Remote Sensing 2016;54(2):892-904.
[32] Wang M, Li H, Tao D, Lu K, Wu X. Multimodal graph-based reranking for web image search. IEEE Transactions on Image Processing 2012;21(11):4649-61.
[33] Wang F, Wu Y, Zhang Q, Zhao W, Li M, Liao G. Unsupervised SAR image segmentation using higher order neighborhood-based triplet Markov fields model. IEEE Transactions on Geoscience and Remote Sensing 2014;52(8):5193-205.
[34] Xiang D, Tang T, Hu C, Yu L, Su Y. A kernel clustering algorithm with fuzzy factor: application to SAR image segmentation. IEEE Geoscience and Remote Sensing Letters 2014;11(7):1290-4.
[35] Cheng B, Liu G, Wang J, Huang Z, Yan S. Multi-task low-rank affinity pursuit for image segmentation. In: Proc. int. conf. comput. vis.; 2011. p. 2439-46.
[36] Lam A, Sato I, Sato Y. Denoising hyperspectral images using spectral domain statistics. In: Proceedings of the IEEE international conference on pattern recognition (ICPR'12), Tsukuba, Japan, 11-15 November 2012. p. 477-80.

[37] Mairal J, Bach F, Ponce J. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision 2014;8:85-283.
[38] Wu D, Zhang Y, Chen Y. 3D sparse coding based denoising of hyperspectral images. In: Proceedings of the 2015 IEEE international geoscience and remote sensing symposium, Milan, Italy, 26-31 July 2015. p. 3115-8.
[39] Huo L, Feng X, Huo C, Pan C. Learning deep dictionary for hyperspectral image denoising. IEICE Transactions on Information and Systems 2015;7:1401-4.
[40] Zhang L, Yang M, Feng X. Sparse representation or collaborative representation: which helps face recognition? In: Proc. IEEE int. conf. comput. vis.; 2011. p. 471-8.

CHAPTER 2

The models and structure of neural networks

Chapter Outline
2.1 Ridgelet neural network 47
2.2 Contourlet neural network 50
2.2.1 Nonsubsampled contourlet transforms 50
2.2.2 Deep contourlet neural network 51
2.3 Convolutional neural network 53
2.3.1 Convolution 53
2.3.2 Pooling 55
2.3.3 Activation function 55
2.3.4 Batch normalization 57
2.3.5 LeNet5 58
2.4 Recurrent artificial neural network 61
2.5 Generative adversarial nets 64
2.5.1 Biological description - human behavior 64
2.5.2 Data augmentation 65
2.5.3 Model description 65
2.6 Autoencoder 68
2.6.1 Layer-wise pretraining 68
2.6.2 Autoencoder network 69
2.7 Restricted Boltzmann machine and deep belief network 73
Further reading 77

2.1 Ridgelet neural network In the ridgelet neural network, there are some differences from traditional neural networks based on the first-generation wavelets (e.g., the Meyer wavelet, Morlet wavelet, Haar wavelet, Gaussian wavelet, Daubechies wavelet series, and so on). The differences are mainly concentrated in two aspects. The first is that the (third-generation) wavelet is almost no longer used as an activation function to adjust the distortion or response characteristics of the linear output of the hidden layer. It is used more in the initialization of the weight matrix, the design of multiscale and multipath networks, avoiding falling into local optima,



and obtaining the characteristics to enhance the network representation ability at the same level and different resolutions (or angles). The second is that the third-generation wavelet is more likely to obtain a multiband and multiangle topological description of the input scene than the first-generation wavelet. We give the definition of the ridgelet neural network as follows. If ψ(x) ∈ L²(ℝ) meets the condition (2.1):
$$C_\psi = \int_{\mathbb{R}} \frac{|\hat{\psi}(u)|^2}{|u|^{d}}\, du < +\infty \qquad (2.1)$$
where
$$\hat{\psi}(u) = \int_{\mathbb{R}} \psi(x)\, e^{-jux}\, dx \qquad (2.2)$$

We call ψ(x) an admissible activation function. The ridgelet function is generated by this function:
$$\psi_{(a,u,b)}(x) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{u\cdot x - b}{a}\right),\qquad a \in \mathbb{R}^{+},\ b \in \mathbb{R},\ u \in S^{d-1},\ \|u\| = 1 \qquad (2.3)$$
where S^{d-1} is the unit sphere, and a, b, and u represent the scale factor, the translation factor, and the direction factor, respectively. Usually, the activation function in neural networks (such as the sigmoid function) is replaced by the ridgelet function to build the ridgelet neural network, and its structure can be seen in Fig. 2.1.
(1) Data: training pairs {(x^(t), y^(t))}, t = 1, 2, ..., T.
(2) Model
$$\psi\!\left(\frac{u\cdot x - b}{a}\right) = \left[\psi\!\left(\frac{u(1,:)\cdot x - b(1)}{a(1)}\right), \ldots, \psi\!\left(\frac{u(s,:)\cdot x - b(s)}{a(s)}\right)\right]^{T} \in \mathbb{R}^{s} \qquad (2.4)$$
$$h = \sigma(u\cdot x - b) = \psi\!\left(\frac{u\cdot x - b}{a}\right),\qquad y = \omega\cdot h + \beta \in \mathbb{R}^{m} \qquad (2.5)$$


Figure 2.1 The structure of a ridgelet neural network.

The parameters to be optimized in the model are as follows:
$$\theta = [u, b, a, \omega, \beta] \qquad (2.6)$$

(3) Objective function
$$\min_{\theta} J(\theta) = \frac{1}{2T}\sum_{t=1}^{T} \left\| \hat{y}^{(t)} - y^{(t)} \right\|_2^2 + \frac{\lambda}{2}\|\omega\|_F^2 + \frac{\gamma}{2}\|u\|_F^2 \qquad (2.7)$$

(4) Solving According to the gradient descent method, the corresponding parameter updating formulas can be obtained:
$$u^{(k+1)} = u^{(k)} - \eta\,\nabla u\big|_{u=u^{(k)}},\quad b^{(k+1)} = b^{(k)} - \eta\,\nabla b\big|_{b=b^{(k)}},\quad a^{(k+1)} = a^{(k)} - \eta\,\nabla a\big|_{a=a^{(k)}},\quad \omega^{(k+1)} = \omega^{(k)} - \eta\,\nabla \omega\big|_{\omega=\omega^{(k)}},\quad \beta^{(k+1)} = \beta^{(k)} - \eta\,\nabla \beta\big|_{\beta=\beta^{(k)}} \qquad (2.8)$$
where η is the learning rate, and the gradients are:
$$\nabla \omega\big|_{\omega=\omega^{(k)}} = -\sum_{t=1}^{T} \left( y^{(t)} - \hat{y}^{(t)} \right)\left( h^{(t)} \right)^{T} + \gamma\,\omega \in \mathbb{R}^{m\times s},\qquad \nabla \beta\big|_{\beta=\beta^{(k)}} = -\sum_{t=1}^{T} \left( y^{(t)} - \hat{y}^{(t)} \right) \in \mathbb{R}^{m} \qquad (2.9)$$

To sum up, the deep ridgelet neural network can be formed by the method of layer-wise stacking and self-encoding. The advantages of the model include: first, it follows a semi-supervised learning mode (unsupervised layer-wise parameter initialization and supervised fine-tuning of the whole network); second, the integration of ridgelet characteristics brings a flexible structure, fast parallel processing speed, and strong fault tolerance and robustness.
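To make Eqs. (2.3)-(2.5) concrete, the following is a minimal NumPy sketch of a single ridgelet hidden layer (forward pass only). The particular ridge profile psi, the layer sizes, and the random parameter values are illustrative assumptions rather than the book's reference design.

```python
import numpy as np

def psi(t):
    # Illustrative admissible ridge profile (a Gaussian-modulated odd function).
    return t * np.exp(-0.5 * t ** 2)

def ridgelet_layer(x, u, b, a, omega, beta):
    """Forward pass of Eqs. (2.4)-(2.5): h = psi((u.x - b)/a), y = omega.h + beta."""
    ridge = (u @ x - b) / a          # s-dimensional ridge responses
    h = psi(ridge)                   # hidden representation, Eq. (2.4)
    return omega @ h + beta          # m-dimensional output, Eq. (2.5)

# Toy dimensions: d-dimensional input, s hidden ridgelets, m outputs.
d, s, m = 8, 16, 3
rng = np.random.default_rng(0)
u = rng.normal(size=(s, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)   # directions on the unit sphere S^{d-1}
b = rng.normal(size=s)                          # translations
a = np.ones(s)                                  # scales (a > 0)
omega = rng.normal(size=(m, s))
beta = np.zeros(m)

x = rng.normal(size=d)
print(ridgelet_layer(x, u, b, a, omega, beta))
```

Stacking several such layers and applying the layer-wise initialization and fine-tuning described above would give a deep ridgelet network; the gradients of Eq. (2.9) can be checked numerically against this forward pass.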

2.2 Contourlet neural network Since the contourlet was put forward, we have explored the characteristics of data after a contourlet transform. It has been found that this transformation can bring many properties, such as sparsity, multiscale characteristics, multidirectionality, localization, low redundancy, translation invariance, easy implementation, and efficient computation. The combination of a contourlet and a deep neural network can give full play to the advantages of both. The core of the contourlet is to obtain directional information on top of the multiscale decomposition. The filter bank corresponding to the transform is divided into two parts: the Laplacian pyramid decomposition and the directional filter bank. The Laplacian pyramid filter mainly completes the separation of the singular points. The directional filter bank mainly completes the collection of singular points; it collects singular points into a basis function according to the same-direction criterion. Common contourlet transforms include nonsubsampled contourlet transforms, all-phase contourlet transforms, wavelet-based contourlet transforms, anti-aliasing contourlet transforms, and complex contourlet transforms.

2.2.1 Nonsubsampled contourlet transforms For input images, we use the Laplacian pyramid decomposition. The decomposition stage mainly consists of two filters. One is the low-pass filter (which mainly obtains the low-frequency components of the input), and the other is the high-pass filter (which mainly obtains the high-frequency components). The directional filter bank is further used to filter the high-frequency components, in which the number of directions is manually set (assume eight directions). Then the transform coefficients after the first-order decomposition consist of high-frequency components in eight directions and one low-frequency component. If a second-order decomposition is carried out, only the low-frequency component needs to be processed, and the low-frequency component is considered as the input of the second-order decomposition. Because the operation is nonsubsampled, the size of all the transform coefficients is consistent with the input. Similar to wavelet decomposition, only the low-frequency components are decomposed by the Laplacian pyramid, and the high-frequency components are filtered in different directions. The difference between the contourlet transform and the nonsubsampled contourlet transform is mainly whether there

is a subsampling operation or not. If we want to perform the subsampling operation, it is applied along with the hierarchical decomposition of the low-frequency components. Whether the Laplacian pyramid decomposition or the directional filter is used, the structure of the filter is independent of the input signal. That is, all the filters are determined beforehand. The first-order decomposition of a nonsubsampled contourlet transform is divided into two stages. One is the Laplacian pyramid decomposition:
$$x_H^{(1)} = x * PF_H^{(D)} \in \mathbb{R}^{n\times m},\qquad x_L^{(1)} = x * PF_L^{(D)} \in \mathbb{R}^{n\times m} \qquad (2.10)$$
where x ∈ ℝ^{n×m} is the input signal, PF_H^{(D)} is the high-pass filter in the decomposition stage, and x_H^{(1)} is the high-frequency component after the first-order decomposition. PF_L^{(D)} is the low-pass filter, and x_L^{(1)} is the low-frequency component after the first-order decomposition, with x_L^{(0)} = x. The other is the directional filter bank:
$$x_{H,1}^{(1)} = x_H^{(1)} * DF_1 \in \mathbb{R}^{n\times m},\quad x_{H,2}^{(1)} = x_H^{(1)} * DF_2 \in \mathbb{R}^{n\times m},\quad \ldots,\quad x_{H,K}^{(1)} = x_H^{(1)} * DF_K \in \mathbb{R}^{n\times m} \qquad (2.11)$$
where DF_k (k = 1, 2, ..., K) is the directional filter bank, and usually K is a power of 2. After the first-order nonsubsampled contourlet decomposition of the input x_L^{(0)}, the transform coefficients become Eq. (2.12):
$$X^{(1)} = \left\{ \left\{ x_{H,k}^{(1)} \right\}_{k=1}^{K},\ x_L^{(1)} \right\} \qquad (2.12)$$
If the input x_L^{(0)} is to be decomposed in two stages, the second stage is simply a one-stage decomposition of x_L^{(1)}, which achieves the same effect. The number of directional filters for the high-frequency components at each level may be different.
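The sketch below illustrates the structure of Eqs. (2.10)-(2.12) in NumPy/SciPy, with a simple averaging kernel and rotated difference kernels standing in for the Laplacian pyramid and directional filter banks; the real nonsubsampled contourlet filters are carefully designed 2D filter banks, so this is only a structural illustration under those simplifying assumptions.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import rotate

def first_order_nsct(x, K=8):
    """Structural sketch of Eqs. (2.10)-(2.12): one pyramid level plus K directional bands."""
    # Stand-in low-pass filter PF_L^(D); the high-pass branch is taken as the residual.
    pf_l = np.ones((3, 3)) / 9.0
    x_l1 = convolve2d(x, pf_l, mode='same', boundary='symm')   # low-frequency part, Eq. (2.10)
    x_h1 = x - x_l1                                            # high-frequency part, Eq. (2.10)
    # Stand-in directional filters DF_k: a difference kernel rotated to K orientations.
    base = np.array([[1.0, 1.0, 1.0],
                     [0.0, 0.0, 0.0],
                     [-1.0, -1.0, -1.0]])
    bands = []
    for k in range(K):
        df_k = rotate(base, angle=180.0 * k / K, reshape=False, order=1)
        bands.append(convolve2d(x_h1, df_k, mode='same', boundary='symm'))  # Eq. (2.11)
    # Eq. (2.12): K directional high-frequency maps plus the low-frequency map,
    # all the same size as the input because nothing is subsampled.
    return bands, x_l1

image = np.random.rand(64, 64)
bands, low = first_order_nsct(image)
print(len(bands), bands[0].shape, low.shape)   # 8 (64, 64) (64, 64)
```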

2.2.2 Deep contourlet neural network If the input is processed by the nonsubsampled contourlet transform, the transform coefficients of the contourlet are obtained. For example, in Formula (2.12), there are K+1 feature maps. Each feature map has the same size as the input. In essence, this step can also be regarded as the convolutional operation in a deep convolutional neural network. The feature map is obtained after the nonsubsampled contourlet transform, and it is followed by pooling, the nonlinear operation, and batch normalization. The first operation flow (i.e.,

convolution, pooling, nonlinear operation, and batch normalization) in convolutional neural networks then has no parameters to be optimized. The output of the first operation flow (which can be regarded as the input features) is put into the subsequent "convolutional network" to realize the optimization of the model parameters. This operation method is very reasonable. Because the first convolution flow extracts primitive structures such as edges, textures, and corners, it can be replaced by a primary-visual-cortex-like transform (such as the Gabor transform, the contourlet transform, etc.). This replacement reduces the number of parameters to be trained in the convolutional neural network and indirectly increases the relative number of training samples, improving the generalization performance of the network. The mathematical description of the above process is as follows. It is assumed that the framework of the deep convolutional neural network is:
$$X^{(1)} = \mathrm{CPRN}(x; \theta_1),\quad X^{(2)} = \mathrm{CPRN}(X^{(1)}; \theta_2),\quad \ldots,\quad X^{(L)} = \mathrm{CPRN}(X^{(L-1)}; \theta_L),\quad F^{(1)} = \mathrm{FC}(X^{(L)}; \theta_{L+1}),\quad \ldots,\quad F^{(T)} = \mathrm{FC}(F^{(T-1)}; \theta_{L+T}),\quad y = \sigma(F^{(T)}; \omega) \qquad (2.13)$$
CPRN represents convolution, pooling, nonlinear activation, and normalization, respectively; it is a combination (the convolution flow). The convolution operation is the core, and the other three operations can be selectively added. In addition, θ_i (i = 1, 2, ..., L+T) denotes the hyperparameters and parameters of the corresponding level or convolution-flow module. FC is a fully connected layer (T is the number of such layers). Finally, after the high-level abstract feature F^(T) is obtained, a classifier or regressor is designed to produce the output. The convolution operation in the first convolution flow in the equation is replaced by the nonlinear contourlet transform, that is:
$$X^{(1)} = \mathrm{CPRN}(x; \theta_1) \rightarrow \mathrm{PRN}\!\left(\mathrm{NSCT}(x, s); \tilde{\theta}_1\right) \qquad (2.14)$$
NSCT(x, s) is the feature map after an s-level decomposition of the input x, and θ̃_1 denotes the parameters corresponding to the pooling, nonlinear operation, and batch normalization. After X^(1) is obtained, the remaining operation of Formula (2.13) is consistent with the previous network model.


2.3 Convolutional neural network The convolutional neural network (CNN) is an efficient recognition method that has been developed in recent years and has attracted a great deal of attention. In the 1960s, Hubel and Wiesel studied the local sensitivity and direction selectivity of neurons in the cerebral cortex of the cat and found a unique network structure that can effectively reduce the complexity of the feedback neural network. On this basis, the convolutional neural network was put forward. Generally speaking, the basic structure of the CNN consists of two parts: the feature extraction part and the feature mapping layer. The input of each neuron is connected to the local receptive field in the previous layer to extract the local feature. Then the positional relationship with other features can be determined. Multiple feature maps make up each computing layer. Each feature map is a plane in which all neuron weights are equal. The sigmoid function is used as the activation function. In addition, the number of free parameters of the network is reduced because of the weights shared by the neurons on the mapping surface. Each convolutional layer in the network is closely followed by a computing layer for the local average and a second extraction of the feature. This unique feature extraction structure reduces the feature resolution. The feature detection layer of the CNN learns from training data. When using the CNN, explicit feature extraction is avoided, and learning from the training data is implicit. Furthermore, the weights of the neurons on the same feature mapping surface are the same. Therefore, the network can learn in parallel, which is also a big advantage of a network of connected neurons. The CNN has unique advantages in speech recognition and image processing with its special structure in which local weights are shared. Its layout is closer to the actual biological neural network. Weight sharing reduces the complexity of the network. In particular, an image with a multidimensional input vector can be directly input into the network, which avoids the complex operations of feature extraction and data reconstruction. The basic operations of the CNN include convolution (for dimension expansion), nonlinear operation (sparsity, saturation, and unilateral inhibition), pooling (the aggregation of space or features), and batch normalization (to accelerate the convergence speed of the training process and avoid falling into local optima).

2.3.1 Convolution Using the convolutional kernel to process the input pictures, we can learn more robust features. In mathematics, convolution is an important linear operation. Three different kinds of convolution operation are commonly used in digital signal processing including full

convolution, same convolution, and valid convolution. In the following, it is assumed that the input signal is one-dimensional, x ∈ ℝ^n, and the filter is one-dimensional, w ∈ ℝ^m.
(a) Full convolution
$$y = \mathrm{conv}(x, w, \text{'full'}) = (y(1), \ldots, y(t), \ldots, y(n+m-1)) \in \mathbb{R}^{n+m-1},\qquad y(t) = \sum_{i=1}^{m} x(t-i+1)\cdot w(i) \qquad (2.15)$$
where t = 1, 2, ..., n+m-1.
(b) Same convolution
$$y = \mathrm{conv}(x, w, \text{'same'}) = \mathrm{center}(\mathrm{conv}(x, w, \text{'full'}), n) \in \mathbb{R}^{n} \qquad (2.16)$$
The returned result is the central part of the full convolution, which has the same size as the input signal x ∈ ℝ^n.
(c) Valid convolution
$$y = \mathrm{conv}(x, w, \text{'valid'}) = (y(1), \ldots, y(t), \ldots, y(n-m+1)) \in \mathbb{R}^{n-m+1},\qquad y(t) = \sum_{i=1}^{m} x(t+i-1)\cdot w(i) \qquad (2.17)$$
where t = 1, 2, ..., n-m+1 and n > m. If there is no special statement, valid convolution is used in the convolution operations. The above convolution formulas can be easily extended from the one-dimensional to the two-dimensional scene. There are two commonly used parameters in convolution, stride and zero padding, where the stride is the number of steps moved from the current location to the next location, and the zero padding is the number of rings of 0 values added around the original data. Usually, in the calculation process, if the input signal is x ∈ ℝ^{n×m} and the size of the convolution kernel (filter) is w ∈ ℝ^{s×k}, the output signal size obtained by valid convolution with stride and zero padding is:
$$y = x * w \in \mathbb{R}^{u\times v},\qquad u = \left\lfloor \frac{n - s + 2\cdot\mathrm{ZeroPadding}}{\mathrm{Stride}} \right\rfloor + 1,\qquad v = \left\lfloor \frac{m - k + 2\cdot\mathrm{ZeroPadding}}{\mathrm{Stride}} \right\rfloor + 1 \qquad (2.18)$$
The convolution operation can reduce unnecessary weight connections and introduce sparsity and local connections into the network (Fig. 2.2).


Figure 2.2 The type of connection.

The weight-sharing strategy greatly reduces the number of parameters and relatively increases the amount of training data per parameter. Thus, the phenomenon of overfitting can be avoided. Due to the invariance of the convolution operation, topological correspondence and robustness can be learned in the generated features.
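As a concrete check of Eqs. (2.17)-(2.18), here is a minimal NumPy sketch of 2D valid convolution (implemented as correlation, as is conventional in CNNs) with stride and zero padding; the kernel values and sizes are arbitrary illustrations.

```python
import numpy as np

def conv2d_valid(x, w, stride=1, zero_padding=0):
    """Valid convolution with stride and zero padding; output size follows Eq. (2.18)."""
    n, m = x.shape
    s, k = w.shape
    x = np.pad(x, zero_padding)                       # add rings of zeros around the input
    u = (n - s + 2 * zero_padding) // stride + 1      # output height, Eq. (2.18)
    v = (m - k + 2 * zero_padding) // stride + 1      # output width,  Eq. (2.18)
    y = np.zeros((u, v))
    for i in range(u):
        for j in range(v):
            patch = x[i * stride : i * stride + s, j * stride : j * stride + k]
            y[i, j] = np.sum(patch * w)               # Eq. (2.17), extended to 2D
    return y

x = np.arange(36, dtype=float).reshape(6, 6)
w = np.ones((3, 3)) / 9.0                             # simple averaging kernel
print(conv2d_valid(x, w, stride=2, zero_padding=1).shape)   # (3, 3)
```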

2.3.2 Pooling Pooling is a subsampling operation; that is, in a small area, a specific value is used as the output value. In essence, the pooling operation performs aggregation over space or feature types to reduce the spatial dimension. The main purpose is to reduce computation and to provide translation invariance. The input dimension of the next layer is reduced (an effective reduction of the parameters in the next layer). It can also effectively control the risk of overfitting. There are many forms of pooling operation, such as max pooling, average pooling, and norm pooling. The most commonly used pooling method is max pooling (a nonlinear subsampling method). Spatial pyramid pooling is a multiscale pooling method, which can obtain multiscale information from the input (the feature map after convolution). In addition, spatial pyramid pooling can transform a convolutional feature of arbitrary size into a fixed dimension. This means that the convolutional neural network can not only process an image of arbitrary size but also avoid the loss of information caused by cropping and warping operations. Spatial pyramid pooling is also used in the final convolution flow to prevent the loss of information caused by stretching or vectorization.
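The following is a minimal NumPy sketch of nonoverlapping max pooling over p x p windows, as one concrete instance of the aggregation described above; the window size and the toy input are illustrative.

```python
import numpy as np

def max_pool2d(x, p=2):
    """Nonoverlapping p x p max pooling: each window is summarized by its maximum."""
    n, m = x.shape
    n, m = (n // p) * p, (m // p) * p                 # crop so the windows tile exactly
    x = x[:n, :m].reshape(n // p, p, m // p, p)
    return x.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x, p=2))
# [[ 5.  7.]
#  [13. 15.]]
```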

2.3.3 Activation function The activation function is a nonlinear operation that improves the representation capability by bending or distorting the mapped space. The composition of hierarchical nonlinear mappings enhances the whole network's nonlinear representation ability. If there is no nonlinear operation in the network, the hierarchical


Figure 2.3 Activation function.

combination approximates the model only in a linear manner, and the capacity to represent high-level semantic characteristics or to mine data is limited. The most frequently used activation functions include the rectified linear unit (which accelerates convergence), the soft plus function (a smooth approximation of the ReLU), and the sigmoid function (including the logistic-sigmoid and tanh-sigmoid functions). Here we illustrate the biological and neurological characteristics of these activation functions, as shown in Fig. 2.3. From a mathematical point of view, the gain of the nonlinear sigmoid is large near the center and small on both sides of the center. This has a good effect in the feature mapping of the signal. From the point of view of biological neuroscience, however, the central area resembles the excitatory state of neurons and both sides resemble the inhibitory state of neurons. Thus, in neural network learning, the central area focuses on the important features, and unimportant features are pushed to the two sides. With the development of biological neuroscience, in 2001 the neuroscientists Dayan and Abbott simulated a more precise activation model of brain neurons from a biological perspective (Fig. 2.4). Unlike the sigmoid system, there are three main changes in the activation function of biological brain neurons. The first is unilateral inhibition, the second is a relatively wide excitability boundary, and the third is sparse activation. In the same year, Charles Dugas et al. used the soft plus function in positive-valued regression prediction, and the derivative of the soft plus function is the logistic-sigmoid. The soft plus function and the rectified linear unit in the machine learning field are similar to the brain neuron firing-rate activation function in the neuroscience field (Fig. 2.5).


Figure 2.4 Biological brain neuron activation model.

Figure 2.5 Soft plus function and rectified linear unit function.
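A minimal NumPy sketch of the three activation functions discussed above (ReLU, soft plus, and logistic-sigmoid), included so their unilateral-inhibition and saturation behavior can be inspected numerically; the probe values are arbitrary.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: unilateral inhibition and sparse activation.
    return np.maximum(0.0, x)

def softplus(x):
    # Smooth approximation of ReLU; its derivative is the logistic-sigmoid.
    return np.log1p(np.exp(x))

def sigmoid(x):
    # Logistic-sigmoid: large gain near 0, saturation (small gain) on both sides.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(relu(x))       # [0.    0.    0.    1.    4.   ]
print(softplus(x))   # approx [0.018 0.313 0.693 1.313 4.018]
print(sigmoid(x))    # approx [0.018 0.269 0.5   0.731 0.982]
```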

2.3.4 Batch normalization Batch normalization is an optimization operation that reduces instability in the training process. The batch normalization operation is designed to avoid the tendency of information to decay as it passes layer by layer with the deepening of the hierarchy. An input with a large data range is likely to play a big role in pattern classification, and an input with a small data range is likely to play a small role. In a word, a data range that is too large or too small may lead to slow convergence of the deep neural network and a long training time.

The commonly used normalization operations include L2-norm normalization, sigmoid normalization, etc. Convolutional neural networks usually used various kinds of normalization layers, especially prior to 2015. However, in recent years, studies have shown that the benefits of such layers appear to be minimal for the final results.
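The following is a minimal NumPy sketch of the batch normalization transform in its standard form (per-feature standardization over a mini-batch followed by a learnable scale gamma and shift beta); the epsilon value and the toy batch are illustrative assumptions, and the running statistics used at test time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch x of shape (batch, features)."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardize to zero mean, unit variance
    return gamma * x_hat + beta               # learnable rescale and shift

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # approximately zeros and ones
```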

2.3.5 LeNet5 LeNet5 is a very successful deep convolutional neural network model. It is mainly used for handwritten numeral recognition and has been applied to identify the numbers on checks in the banking system.
(1) Data Training data and testing data are 60,000 images belonging to 10 classes. The training data set and the test data set are represented as:
$$\left\{ x_n^{TR}, y_n^{TR} \right\}_{n=1}^{N_{TR}},\qquad \left\{ x_n^{TE}, y_n^{TE} \right\}_{n=1}^{N_{TE}} \qquad (2.19)$$
TR denotes the training data set, and TE denotes the test data set. The input is x_n ∈ ℝ^{32×32}, and the output is y_n ∈ {0, 1, 2, ..., 9}.
(2) Model
$$X = \varphi(x; W, b),\qquad Y = \mathrm{softmax}(X; \theta) \qquad (2.20)$$
The relationship between input and output is shown in Formula (2.21):
$$X = \varphi(x; W, b),\qquad Y = \mathrm{softmax}(X; \theta) \qquad (2.21)$$

X is an abstract feature or hierarchical representation of the input signal x, and the parameters are divided into convolutional kernels and biases:
$$W = \left\{ W^1 \in \mathbb{R}^{6@1\times 5\times 5},\ W^2 \in \mathbb{R}^{16@6\times 5\times 5},\ W^3 \in \mathbb{R}^{120@16\times 5\times 5} \right\},\qquad b = \left\{ b^1 \in \mathbb{R}^{6},\ b^2 \in \mathbb{R}^{16},\ b^3 \in \mathbb{R}^{120} \right\} \qquad (2.22)$$
In the connection table, the letter c represents a connection, and a blank represents no connection. For example, the first feature map of the third hidden layer is related to the first, second, and third feature maps of the second hidden layer (Fig. 2.6):
$$h_1^3 = \varphi_2\!\left( \sum_{j \in I_1} C_{1,j}\cdot\left( W_{j,1}^2 * h_j^2 \right) + b_1^2 \right) \in \mathbb{R}^{10\times 10} \qquad (2.23)$$


Figure 2.6 The connection between the second hidden layer and third hidden layer.

where h_1^3 is the first feature map in the third hidden layer, φ_2(t) = t, b_1^2 is the bias, and C_{1,j} is the connection indicator: it is 1 when connected and 0 otherwise. I_1 = {1, 2, 3} is the relational indicator set. If there is no connection table, the default is full connection. W_{j,1}^2 ∈ ℝ^{5×5} is the filter between the first feature map in the third hidden layer and the jth feature map in the second hidden layer. For the design of the classifier, the parameters are:
$$Y(k) = P(y = k \mid X; \theta_k) = \frac{e^{X^{T}\theta_k}}{\sum_{s=0}^{9} e^{X^{T}\theta_s}} \in \mathbb{R},\qquad \theta = [\theta_0, \theta_1, \ldots, \theta_9] \qquad (2.24)$$
where k = 0, 1, 2, ..., 9, and the final output is:
$$y = \arg\max_{k}\{Y(k)\} \qquad (2.25)$$

(3) Objective function On the training data set, cross-entropy is used to construct the objective function:
$$\min_{\{W,b,\theta\}} J(W, b, \theta) = -\frac{1}{N_{TR}} \sum_{n=1}^{N_{TR}} \sum_{k=0}^{9} \delta\!\left( y_n^{TR} = k \right)\cdot \log Y_n^{TR}(k) + \lambda_1 R(W) + \lambda_2 R(\theta) \qquad (2.26)$$
The last two items are regularization terms, where
$$Y_n^{TR}(k) = \mathrm{softmax}\!\left( X_n^{TR}; \theta \right) = \mathrm{softmax}\!\left( \varphi\!\left( x_n^{TR}; W, b \right); \theta \right),\qquad R(W) = \sum_{l=1}^{3} \left\| W^{l} \right\|_F^2,\qquad R(\theta) = \sum_{k=0}^{9} \left\| \theta_k \right\|_F^2 \qquad (2.27)$$


(4) Optimal solution We use the gradient descent method to achieve parameter learning for the optimization objective function. The deepening of the network leads to a nonconvex optimization problem, so we need to give good initial values of the parameters before solving for the optimal solution. This differs from the backpropagation algorithm used in the deep feedforward neural network: when doing backpropagation, both the error formula from the pooling layer to the convolutional layer and the error from the convolutional layer to the pooling layer should be considered. The optimization formulas for updating the parameters are as follows:
$$W^{(k)} = W^{(k-1)} - \alpha\,\frac{\partial J(W,b,\theta)}{\partial W}\bigg|_{W=W^{(k-1)}},\qquad b^{(k)} = b^{(k-1)} - \alpha\,\frac{\partial J(W,b,\theta)}{\partial b}\bigg|_{b=b^{(k-1)}},\qquad \theta^{(k)} = \theta^{(k-1)} - \alpha\,\frac{\partial J(W,b,\theta)}{\partial \theta}\bigg|_{\theta=\theta^{(k-1)}} \qquad (2.28)$$
The hidden-layer quantities involved in solving the partial derivatives of the objective function are:
$$h_l^1 = \varphi_1\!\left( W_l^1 * x + b_l^1 \right) \in \mathbb{R}^{28\times 28},\qquad h_l^2 = \mathrm{Maxpooling}\!\left( h_l^1; r^2 \right) \in \mathbb{R}^{14\times 14},\qquad l = 1, 2, \ldots, 6,\ \varphi_1(t) = t \qquad (2.29)$$
$$h_s^3 = \varphi_2\!\left( \sum_{l \in I_s} W_{l,s}^2 * h_l^2 + b_s^2 \right) \in \mathbb{R}^{10\times 10},\qquad h_s^4 = \mathrm{Maxpooling}\!\left( h_s^3; r^4 \right) \in \mathbb{R}^{5\times 5},\qquad s = 1, 2, \ldots, 16,\ \varphi_2(t) = t \qquad (2.30)$$
$$h_q^5 = \varphi_3\!\left( \sum_{s \in I_q} W_{s,q}^3 * h_s^4 + b_q^3 \right) \in \mathbb{R}^{1\times 1},\quad q = 1, 2, \ldots, 120,\ \varphi_3(t) = \max(0, t);\qquad h^5 = \left[ h_1^5, h_2^5, \ldots, h_{120}^5 \right] \in \mathbb{R}^{120\times 1},\qquad h^6 = \mathrm{softmax}\!\left( h^5; \theta \right) \in \mathbb{R}^{10\times 1} \qquad (2.31)$$
The error propagation gradient for the rth hidden layer of the objective function is:
$$\delta^{(r)} = \frac{\partial J(W, b, \theta)}{\partial h^{r}} \qquad (2.32)$$

where r = 1, 2, 3, 4, 5, and the corresponding parameter updating formula is as follows:
$$\frac{\partial J(W,b,\theta)}{\partial W^{s}} = \frac{\partial h^{2s-1}}{\partial W^{s}}\cdot\frac{\partial J(W,b,\theta)}{\partial h^{2s-1}} = \frac{\partial h^{2s-1}}{\partial W^{s}}\cdot\delta^{(2s-1)},\qquad \frac{\partial J(W,b,\theta)}{\partial b^{s}} = \frac{\partial h^{2s-1}}{\partial b^{s}}\cdot\frac{\partial J(W,b,\theta)}{\partial h^{2s-1}} = \frac{\partial h^{2s-1}}{\partial b^{s}}\cdot\delta^{(2s-1)} \qquad (2.33)$$
where s = 1, 2, 3.
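Read together, Eqs. (2.29)-(2.31) and (2.24) describe the layer shapes 32x32 -> 6@28x28 -> 6@14x14 -> 16@10x10 -> 16@5x5 -> 120 -> 10. The following is a minimal PyTorch sketch with those shapes, assuming PyTorch is available; it uses full connections between layers instead of the partial connection table C of Eq. (2.23), so it is an illustrative approximation rather than the exact network in the text.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Shapes follow Eqs. (2.29)-(2.31): 32x32 -> 6@28x28 -> 6@14x14 -> 16@10x10 -> 16@5x5 -> 120 -> 10."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)     # W1: 6@1x5x5, phi_1(t) = t
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)    # W2: 16@6x5x5 (full connection, no C table)
        self.conv3 = nn.Conv2d(16, 120, kernel_size=5)  # W3: 120@16x5x5, followed by ReLU (phi_3)
        self.pool = nn.MaxPool2d(2)                     # Maxpooling with a 2x2 window
        self.classifier = nn.Linear(120, 10)            # softmax parameters theta, Eq. (2.24)

    def forward(self, x):
        h2 = self.pool(self.conv1(x))                   # Eq. (2.29)
        h4 = self.pool(self.conv2(h2))                  # Eq. (2.30)
        h5 = torch.relu(self.conv3(h4)).flatten(1)      # Eq. (2.31)
        return torch.log_softmax(self.classifier(h5), dim=1)

model = LeNet5()
x = torch.randn(4, 1, 32, 32)                           # a mini-batch of 32x32 "digits"
print(model(x).shape)                                   # torch.Size([4, 10])
# Training would minimize the cross-entropy of Eq. (2.26), e.g., with nn.NLLLoss and SGD.
```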

2.4 Recurrent artificial neural network This section introduces the recurrent neural network, which is different from the previous deep neural network models. By introducing a directed cycle, it can better characterize the overall logical characteristics of high-dimensional information. In the feedforward neural network, the network topology is a directed and acyclic structure. The connections only exist between layers: the nodes cannot connect to nodes belonging to the same layer but can connect to nodes belonging to adjacent layers. During feedforward propagation, nodes of a high layer do not transmit information to low-level nodes. As everyone knows, the brain contains tens of billions of neurons, and these neurons are connected by hundreds of trillions of synapses; uncovering these connections is a seemingly impossible task. However, in 2015, researchers from the Baylor College of Medicine successfully tackled this task, and the results were published in Science. These researchers analyzed the connections of neurons in the cerebral cortex of mice and found that the basic wiring of the local cortical circuit can be captured by a series of interconnection rules. These rules recur throughout the cerebral cortex. This provides some insight into the recurrent connectivity of the local cerebral cortex and can further help in understanding the working principles of the brain. By using neurons with self-feedback, a recurrent neural network can handle sequences of any length. Compared to the traditional deep feedforward neural network, it is more consistent with the connection mode of biological neurons. The recurrent neural network has been widely used in natural language processing and other fields, and has achieved many excellent results. We give a simple description of the mathematical model of a recurrent neural network. The recurrent neural network is similar to a dynamical system (a system in which the state changes with time).

(1) Data
$$\left\{ x_t \in \mathbb{R}^{n},\ y_t \in \mathbb{R}^{m} \right\}_{t=1}^{T} \qquad (2.34)$$
where x_t represents the input at time t. The length of the time sequence is T, and the output y_t is related to the inputs at the times before t (including t).

(2) Model
$$\{x_1, x_2, \ldots, x_t\} \rightarrow y_t \qquad (2.35)$$
$$s_t = \sigma(U\cdot x_t + W\cdot s_{t-1} + b),\qquad o_t = V\cdot s_t + c \in \mathbb{R}^{m},\qquad y_t = \mathrm{softmax}(o_t) \in \mathbb{R}^{m} \qquad (2.36)$$
The softmax here is not a classifier but an activation function; that is to say, an m-dimensional vector is compressed into another m-dimensional real vector in which each element lies in (0, 1):
$$\mathrm{softmax}(o_t) = \frac{1}{Z}\cdot\left( e^{o_t(1)}, \ldots, e^{o_t(m)} \right)^{T},\qquad Z = \sum_{j=1}^{m} e^{o_t(j)} \qquad (2.37)$$
Z is a normalization factor. The parameters to be optimized in Formula (2.36) include the weight connections U, W, V and the offsets b, c. σ(·) is the activation function of the hidden layer.
(3) Objective function Based on Eqs. (2.35) and (2.36), the loss function is constructed using the negative log-likelihood (cross-entropy), and the objective function is obtained:
$$\min_{\theta} J(\theta) = \sum_{t=1}^{T} \mathrm{loss}\!\left( \hat{y}_t, y_t \right) = -\sum_{t=1}^{T}\left( \sum_{j=1}^{m} \left[ y_t(j)\cdot\log \hat{y}_t(j) + (1 - y_t(j))\cdot\log\!\left( 1 - \hat{y}_t(j) \right) \right] \right) \qquad (2.38)$$
where y_t(j) is the jth element of y_t, and θ is:
$$\theta = [U, V, W, b, c] \qquad (2.39)$$


(4) Solving
Because the recurrent neural network has supervision information y_t at every time step t (t = 1, 2, ..., T), the loss at time t is:
$$J_t(\theta) = \mathrm{loss}(\hat y_t, y_t) \tag{2.40}$$
The optimization objective (Formula (2.38)) can be solved by the backpropagation through time (BPTT) algorithm, whose core is the computation of the following five partial derivatives:
$$\frac{\partial J(\theta)}{\partial V},\quad \frac{\partial J(\theta)}{\partial c},\quad \frac{\partial J(\theta)}{\partial W},\quad \frac{\partial J(\theta)}{\partial U},\quad \frac{\partial J(\theta)}{\partial b} \tag{2.41}$$
The first two are obtained from the error propagation term
$$\delta_{o_t} = \frac{\partial J_t(\theta)}{\partial o_t} \tag{2.42}$$
which is the partial derivative of the objective function with respect to the output o_t at time t. The remaining three are obtained from the error propagation term
$$\delta_{s_t} = \frac{\partial J_t(\theta)}{\partial s_t} \tag{2.43}$$
The partial derivative of the objective function with respect to W is:
$$\frac{\partial J(\theta)}{\partial W} = \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial s_k}{\partial W}\cdot\frac{\partial s_t}{\partial s_k}\cdot\frac{\partial J_t(\theta)}{\partial s_t} = \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial s_k}{\partial W}\cdot\frac{\partial s_t}{\partial s_k}\cdot\delta_{s_t} \tag{2.44}$$
The hidden state s_t at time t is related to the previous states s_k (k = 1, 2, ..., t). According to the chain rule:
$$\frac{\partial s_t}{\partial s_k} = \prod_{j=k+1}^{t}\frac{\partial s_j}{\partial s_{j-1}} = \prod_{j=k+1}^{t} W^T\cdot\mathrm{diag}\big(\sigma'(s_{j-1})\big) \tag{2.45}$$
where σ'(·) is the derivative of the activation function and diag(·) expands a vector into a diagonal matrix. In practice, the hidden-layer activation σ(·) is often the tanh(·) function. In the later stages of training, the gradients in Formula (2.43) become relatively small, and their continued multiplication makes Formula (2.44) even smaller, so the vanishing gradient phenomenon easily occurs; if σ(·) is the sigmoid(·) function, the same situation occurs. Techniques used to mitigate vanishing gradients include careful parameter initialization and the use of the ReLU activation function. Although a recurrent neural network can in theory establish dependencies over long time intervals, because of the vanishing gradient problem and the exploding gradient problem (gradients tending to infinity after continued multiplication), in practice only dependencies over short periods can be learned; this is called the long-term dependency problem.
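To make the recurrence in Formulas (2.36)-(2.38) concrete, the following sketch implements the forward pass and the sequence loss in NumPy. It is a minimal illustration, not the book's implementation; the layer sizes, the tanh activation, and the random toy data are assumptions chosen only for the example.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())          # shift for numerical stability
    return e / e.sum()               # Eq. (2.37): entries lie in (0, 1)

def rnn_forward(xs, U, W, V, b, c):
    """xs: list of input vectors x_1..x_T; returns hidden states and output probabilities."""
    s = np.zeros(W.shape[0])
    states, probs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s + b)   # s_t = sigma(U x_t + W s_{t-1} + b)
        o = V @ s + c                    # o_t = V s_t + c
        states.append(s)
        probs.append(softmax(o))         # y_t = softmax(o_t)
    return states, probs

def sequence_loss(probs, ys):
    # negative log-likelihood summed over the sequence (Eq. 2.38 with one-hot targets)
    return -sum(np.log(p[y]) for p, y in zip(probs, ys))

# toy usage with assumed sizes: n = 4 inputs, hidden size 8, m = 3 classes, T = 5 steps
rng = np.random.default_rng(0)
n, h, m, T = 4, 8, 3, 5
U = rng.normal(0, 0.1, (h, n))
W = rng.normal(0, 0.1, (h, h))
V = rng.normal(0, 0.1, (m, h))
b, c = np.zeros(h), np.zeros(m)
xs = [rng.normal(size=n) for _ in range(T)]
ys = rng.integers(0, m, size=T)
_, probs = rnn_forward(xs, U, W, V, b, c)
print(sequence_loss(probs, ys))
```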

2.5 Generative adversarial nets

In June 2014, Ian Goodfellow and other scholars proposed generative adversarial nets, a generative model. It aims to learn the underlying probability distribution from training samples and to obtain more "generated" samples from that distribution, thereby achieving data expansion. It includes two subnetwork models: the generative model, which tries to make the generated pseudo images as similar as possible to natural images, and the discriminative model, which tries to correctly distinguish generated pseudo images from natural images. The whole network is trained by making the two networks compete with each other. The generative model can describe the probability distribution of natural samples and generate new data similar to the real data by learning the essential characteristics of the natural data. In order to further understand and analyze the network, we mainly explain the motivation of the model and its mathematical description.

Generative adversarial nets should not be confused with adversarial examples. Adversarial examples are inputs obtained by applying small, carefully constructed perturbations (for example, a suitable noise pattern) to the data so that a trained model makes an incorrect prediction on them. Ian Goodfellow explains the reason clearly: the perturbed input goes through the forward calculation (multiplication with the weight matrices and addition of the biases), so the noise added to the input affects the final output.

2.5.1 Biological description: human behavior

The inspiration behind generative adversarial nets comes mainly from the two-person zero-sum game in game theory. In strict competition, one side's gain necessarily means the other side's loss: the sum of the gains and losses of the two players is always 0, and there is no cooperation between them. An example of a noncooperative, purely competitive game is two people playing table tennis, where one person winning means the other person losing. The abstract problem is to find a theoretical equilibrium point when the participant set (the two sides), the strategy set (the players' table tennis skills), and the payoff set (winning or losing) are known; this equilibrium is the most reasonable and optimal strategy for both sides. Von Neumann proved mathematically that the two-person zero-sum game has a minimax equilibrium that can be found by linear programming. The famous minimax principle amounts to hoping for the best while planning for the worst.

2.5.2 Data augmentation

As we all know, the main driving force of deep learning is the amount of available data (inputs and outputs): the more data available, the better the generalization ability (test performance) obtained. However, in practical applications, labeled data are scarce and expensive. Simple statistical methods can be used to augment the data, for example cropping, shifting, rotating, multiresolution nonsubsampled transforms, adding random noise with different distributions, and so on; the obtained samples can be seen as adversarial samples. The multiresolution structure, rotational invariance, and robustness of these samples are beneficial when they are integrated into the model, but the predictive ability of the model is also limited by this kind of augmentation.

Generative adversarial nets can also augment data in an unsupervised way. When the generative model and the discriminative model alternately perform optimal learning and finally reach the zero-sum state (Nash equilibrium), the quality of the resulting generative model (the topology of the generated data) depends on the amount of training (natural) data: if the number of parameters of the generative model is much smaller than the amount of training data, it can effectively internalize the distributional characteristics of the data, so that the generated data are close to the natural data. In addition, the design of the two models in the network and the alternating optimization algorithm are very important. The main purpose of generative adversarial nets is to optimize the generative model; the role of the discriminative model is to adjust the generative model so that the generated data come closer to the natural data and divergence caused by repeated training is prevented.
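As a small illustration of the simple statistical augmentation operations mentioned above, the following sketch applies cropping, shifting, rotation, and additive noise to a toy image array; the image size and noise level are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))                             # toy single-channel image

crop     = img[2:30, 2:30]                             # cropping
shift    = np.roll(img, shift=3, axis=1)               # horizontal shift (translation)
rotate90 = np.rot90(img)                               # rotation by a multiple of 90 degrees
noisy    = img + rng.normal(0.0, 0.05, img.shape)      # additive Gaussian noise

print(crop.shape, shift.shape, rotate90.shape, noisy.shape)
```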

2.5.3 Model description

The following is a physical explanation of generative adversarial nets. The random noise is z ∈ ℝ^m, the natural data are x ∈ ℝ^n, and the generated data are x̃ ∈ ℝ^n. Because the discriminative model is a binary classifier, y ∈ [0, 1]^2. The following four aspects are described in detail.

(1) Data
$$\big\{x^{(t)},\; z^{(t)},\; y^{(t)}\big\}_{t=1}^{T} \tag{2.46}$$

For the tth data pair (x^(t), z^(t)), the corresponding output y^(t) is [1, 0], which means that the probability that the natural data are predicted to be true is 1 and the probability that the generated data are predicted to be true is 0. Equivalently, y^(t) may be written as [0, 1], meaning that the probability that the natural data are predicted to be false is 0 and the probability that the generated data are predicted to be false is 1.

(2) Model
$$
\begin{cases}
G:\; \tilde x = g\big(z;\; \theta^G\big) \in \mathbb{R}^n\\[2mm]
D:\;
\begin{cases}
\text{Feature learning: }
\begin{cases}
X = D^F\big(x;\; \theta^F\big)\\
\tilde X = D^F\big(\tilde x;\; \theta^F\big)
\end{cases}\\[2mm]
\text{Classifier design: } y =
\begin{pmatrix}
P\big(L(x) = \text{real}\mid X;\; \theta^C\big)\\
P\big(L(\tilde x) = \text{real}\mid \tilde X;\; \theta^C\big)
\end{pmatrix} \in \mathbb{R}^2
\end{cases}
\end{cases}
\tag{2.47}
$$
where G is the generator, namely the generative model; its parameters θ^G and the nonlinear function g(·) need to be further specified. D is the discriminator, namely the discriminative model, which is divided into two stages: feature learning, with parameters θ^F, and classifier design, with parameters θ^C. The mappings D^F(·) and P(·) also need to be specified. In this process the amount of data needs to be far greater than the number of parameters of the generative model (T ≫ Num(θ^G)), so that the network can be guaranteed to reach a zero-sum game solution. L(x) denotes the authenticity (real or fake) of x.

(3) Objective function
The discriminative part of Formula (2.47) can be written as:
$$y = \begin{pmatrix} D(x)\\ D(\tilde x)\end{pmatrix} = \begin{pmatrix} D(x)\\ D(G(z))\end{pmatrix} \in \mathbb{R}^2 \tag{2.48}$$
D(x) ∈ [0, 1] is the estimated probability that x is a true sample. When the generator is fixed, the loss function of the discriminative model is:
$$
\min_{\theta^D}\; -\left\{\sum_{x\sim P(x)}\log D\big(x;\theta^D\big) + \sum_{\tilde x\sim P(\tilde x)}\log\Big(1 - D\big(\tilde x;\theta^D\big)\Big)\right\},\qquad \theta^D = \big[\theta^F;\; \theta^C\big]
\tag{2.49}
$$
Here x ∼ P(x) denotes the samples drawn from the natural data distribution P(x) (the datasets in Formula (2.46)), and x̃ ∼ P(x̃) denotes the samples that obey the generated distribution P(x̃) (the datasets in Formula (2.46)). The term log(D(x)) in Formula (2.49) means that the greater the probability that the natural data are judged to be true, the better the model is; the best state is log(D(x)) = 0, that is, D(x) = 1.

Meanwhile, log(1 − D(x̃)) means that the greater the probability that x̃ is judged to be false, the better the model is. Summing these terms over all samples gives an entropy-like measure of the total uncertainty. In short, the design of the discriminative model requires that the probability of judging the natural data to be true is high and the probability of judging the generated data to be false is high. The requirement on the generative model is that, when the discriminative model is fixed, the distribution of the generated data should remain as consistent as possible with that of the natural data. This amounts to maximizing the following objective under the condition that P(x̃) agrees with P(x) as closely as possible:
$$\max_{\theta^G}\; \sum_{\tilde x\sim P(\tilde x)}\log\big(D(\tilde x)\big) = \sum_{\tilde x\sim P(x)}\log\big(D(\tilde x)\big) \tag{2.50}$$
Because x̃ = G(z):
$$\max_{\theta^G}\; \sum_{z\sim P(z)}\log D\big(G\big(z;\theta^G\big)\big) \;\to\; d\big(P(\tilde x),\; P(x)\big) \tag{2.51}$$
Under the condition z ∼ P(z), the greater the sum of log(D(G(z))) is, the smaller the gap between the generated data and the natural data:
$$\big(D(G(z)) \sim P(\tilde x)\big) \;\to\; d\big(G(z),\; x\big) \;\to\; \big(D(G(z)) \sim P(x)\big) \tag{2.52}$$
The most ideal state is that, for all z, log(D(G(z))) = 0 and hence D(G(z)) = 1; the generated data are then identified as natural data, and G(z) obeys the probability distribution of the natural data. Finally, the two distributions in d(P(x̃), P(x)) are as close as possible. According to the data in Formula (2.46) and the loss function in Formula (2.49), the objective function of the discriminative model is obtained:
$$
\min_{\theta^D} J\big(\theta^D\big) = -\frac{1}{T}\left[\sum_{t=1}^{T}\delta\big(y^{(t)}(1) = \text{real}\big)\cdot\log D\big(x^{(t)}\big) + \sum_{t=1}^{T}\delta\big(y^{(t)}(2) = \text{fake}\big)\cdot\log\Big(1 - D\big(\tilde x^{(t)}\big)\Big)\right]
\tag{2.53}
$$
Since the natural data and the generated data carry the true and false labels respectively, the formula satisfies:
$$\delta\big(y^{(t)}(1) = \text{real}\big) = 1,\qquad \delta\big(y^{(t)}(2) = \text{fake}\big) = 1 \tag{2.54}$$
where δ(·) is the indicator (delta) function.

On the basis of Formula (2.53), integrating the requirement on the generative model, the final optimization objective function is obtained:
$$
\min_{\theta^D}\min_{\theta^G} J\big(\theta^D, \theta^G\big) = -\frac{1}{T}\left[\sum_{t=1}^{T}\log D\big(x^{(t)};\theta^D\big) + \sum_{t=1}^{T}\log\Big(1 - D\big(G\big(z^{(t)};\theta^G\big);\theta^D\big)\Big)\right]
\tag{2.55}
$$
$$
\max_{\theta^G}\; \sum_{z\sim P(z)}\log D\big(G\big(z;\theta^G\big)\big) \;\Longleftrightarrow\; \min_{\theta^G}\; \sum_{z\sim P(z)}\log\Big(1 - D\big(G\big(z;\theta^G\big)\big)\Big)
\tag{2.56}
$$

(4) Solving
The gradient descent method is used to alternately optimize the parameters (θ^G, θ^D).
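The alternating optimization of Formulas (2.55) and (2.56) can be illustrated on a one-dimensional toy problem. The sketch below uses a linear generator and a logistic discriminator and is only a hedged illustration of the alternating gradient steps; the data distribution, learning rate, and the non-saturating generator update are assumptions, not details from the book.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
wd, bd = 0.1, 0.0          # discriminator parameters (theta^D)
wg, bg = 0.1, 0.0          # generator parameters (theta^G)
lr, batch = 0.05, 64

for step in range(2000):
    x = rng.normal(3.0, 1.0, batch)        # natural data x ~ P(x)
    z = rng.normal(0.0, 1.0, batch)        # noise z ~ P(z)
    x_fake = wg * z + bg                   # generated data = G(z)

    # discriminator step: ascend log D(x) + log(1 - D(G(z))) (cf. Eq. 2.49)
    d_real, d_fake = sigmoid(wd * x + bd), sigmoid(wd * x_fake + bd)
    g_wd = np.mean((1 - d_real) * x) - np.mean(d_fake * x_fake)
    g_bd = np.mean(1 - d_real) - np.mean(d_fake)
    wd += lr * g_wd
    bd += lr * g_bd

    # generator step: ascend log D(G(z)) (non-saturating form of the update in Eq. 2.56)
    d_fake = sigmoid(wd * (wg * z + bg) + bd)
    g_wg = np.mean((1 - d_fake) * wd * z)
    g_bg = np.mean((1 - d_fake) * wd)
    wg += lr * g_wg
    bg += lr * g_bg

print(round(bg, 2), round(wg, 2))   # the generated mean should drift toward the data mean
```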

2.6 Autoencoder

As the number of layers in a neural network increases, the objective function becomes a nonconvex optimization problem. The performance of a deep neural network then depends on the choice of initial values: a good choice can avoid falling into poor local optima, while a bad choice makes the network prone to underfitting (poor performance caused by a slow decline of the training error). Hinton et al. proposed using an unsupervised pretraining method to optimize the initial values of the network weights and then using some labeled data to fine-tune the entire network. Brain science research has also found that the human brain has a deep structure and that cognition of the outside world is gradually abstracted. Information processing in the human visual system is hierarchical: a series of biologically functional areas act in sequence, each area forms a representation of its input, and signals flow from one area to the next. Along the direction of the signal flow, the hierarchical representations of the different functional areas become increasingly abstract, and the corresponding features between layers are continually strengthened. In the following, we introduce the method of Hinton et al. and give a description of the autoencoder network.

2.6.1 Layer-wise pretraining

The layer-wise pretraining strategy learns the parameters of a hierarchical neural network level by level. It views each pair of adjacent layers as a shallow neural network, so as to make full use of the learning advantages of shallow networks; this can greatly reduce computation time and resources and improve the generalization performance of the network model. Generally, parameter initialization methods based on the layer-wise training strategy take the following three forms: the first is the analytical form (such as independent component analysis, principal component analysis, etc.); the second is the synthetic form (such as sparse encoding, sparse representation, convolutional sparse coding); and the third is the analytical-synthetic form (such as various autoencoder networks based on three-layer feedforward neural networks, restricted Boltzmann machines, and Boltzmann machines). The methods for initializing the parameters of a deep neural network are not limited to the above; nonlearning methods can also be used. The Gabor transform, wavelet transform, and multiscale geometric analysis can all be used to build filter banks: we can randomly select several filters from such a filter bank and assign them to the weight matrices between layers, or make semirandom parameter assignments under a certain distribution. Fig. 2.7 shows a deep neural network training model based on a layer-wise learning strategy.

Figure 2.7 Layer-wise pretraining strategy.

2.6.2 Autoencoder network

The autoencoder network maintains consistency between input and output (measured by an information loss) and realizes parameter learning and feature extraction for a hidden layer in an unsupervised mode. Its core is unsupervised learning, realized with a shallow neural network (convex optimization theory), and it is characterized by the feature dimension of the hidden layer (a larger dimension corresponds to sparsity, a smaller dimension to compression). For a deep neural network, the ultimate goal is to learn the parameters. In this section, we detail the implementation and interpretation of an autoencoder network based on a three-layer feedforward neural network (input layer, hidden layer, and output layer). The network structure is illustrated in Fig. 2.8. Whether the feature dimension of the hidden layer is increased or decreased, the network always has a forward-propagating, topologically acyclic structure.

Figure 2.8 Autoencoder networks based on three-layer feedforward neural networks.

(1) Data
$$\big\{x^{(n)} \in \mathbb{R}^u\big\}_{n=1}^{N} \tag{2.57}$$
The input data are also the desired output.

(2) Model
$$
\begin{cases}
X = \sigma_a(W_a\cdot x + b_a) \in \mathbb{R}^v\\
\hat x = \sigma_s(W_s\cdot X + b_s) \in \mathbb{R}^u
\end{cases}
\tag{2.58}
$$

The parameters of the analysis (coding) stage are (W_a ∈ ℝ^{v×u}, b_a ∈ ℝ^v), with activation (nonlinear) function σ_a(·); the subscript "a" stands for "analysis." The parameters of the synthesis (decoding) stage are (W_s ∈ ℝ^{u×v}, b_s ∈ ℝ^u), with activation function σ_s(·); the subscript "s" stands for "synthesis." The output x̂ is the estimate of the input x. The relation between the feature dimension of the hidden layer and that of the input layer is:
$$u < v \;(\text{increased dimension}),\qquad u > v \;(\text{decreased dimension}),\qquad u = v \;(\text{same dimension}) \tag{2.59}$$

(3) Objective function
According to different loss criteria (such as energy, entropy, etc.), different optimization objective functions can be constructed; the objective function based on the energy loss is built below:
$$\min_{\theta} J(\theta) = \frac{1}{2N}\sum_{n=1}^{N}\big\| \hat x^{(n)} - x^{(n)} \big\|^2 + \lambda\cdot R(\theta) \tag{2.60}$$
The output x̂^(n) in the loss term is the prediction of the input x^(n), and its expected output is x^(n). The parameters and the regularization term are defined as:
$$
\begin{cases}
\theta = [W_a;\; b_a;\; W_s;\; b_s]\\
R(\theta) = \|W_a\|_F^2 + \|W_s\|_F^2
\end{cases}
\tag{2.61}
$$
The hyperparameters of the activation functions σ_a(·) and σ_s(·) and the number of hidden-layer nodes (the feature dimension) are assumed to be given.

(4) Solving
Usually the objective function (2.60) is treated as a convex optimization problem, and an iterative optimization algorithm based on stochastic gradient descent can be used to solve it.

(a) Synthesis stage
The partial derivatives of the objective function with respect to the parameters are:
$$
\begin{cases}
\dfrac{\partial J(\theta)}{\partial W_s} = \dfrac{2}{N}\displaystyle\sum_n \big(\hat x^{(n)} - x^{(n)}\big)\cdot\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial W_s} + 2\lambda\dfrac{\partial R(\theta)}{\partial W_s}\\[3mm]
\dfrac{\partial J(\theta)}{\partial b_s} = \dfrac{2}{N}\displaystyle\sum_n \big(\hat x^{(n)} - x^{(n)}\big)\cdot\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial b_s}
\end{cases}
\tag{2.62}
$$
The partial derivative of the error term of each sample (the difference between the predicted output and the expected output) is obtained from:
$$
\begin{cases}
\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial W_s} = \dfrac{\partial\big(\hat x^{(n)}\big)^T}{\partial W_s} = \sigma_s' \odot \operatorname{diag}\big(X^{(n)}\big) \in \mathbb{R}^{1\times v}\\[3mm]
\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial b_s} = \dfrac{\partial\big(\hat x^{(n)}\big)^T}{\partial b_s} = \sigma_s' \odot \mathbf{1}_v \in \mathbb{R}^{1\times 1}
\end{cases}
\tag{2.63}
$$
where ⊙ denotes the element-wise product of vectors, diag(·) is a function that extends a vector into a diagonal square matrix whose diagonal elements are the elements of the vector and whose off-diagonal elements are 0, 1_v is a v-dimensional column vector whose elements are all 1, and σ'_s is the derivative of the activation function in the synthesis stage.

(b) Analysis stage
The partial derivatives of the objective function with respect to the parameters are:
$$
\begin{cases}
\dfrac{\partial J(\theta)}{\partial W_a} = \dfrac{2}{N}\displaystyle\sum_n \big(\hat x^{(n)} - x^{(n)}\big)\cdot\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial W_a} + 2\lambda\dfrac{\partial R(\theta)}{\partial W_a}\\[3mm]
\dfrac{\partial J(\theta)}{\partial b_a} = \dfrac{2}{N}\displaystyle\sum_n \big(\hat x^{(n)} - x^{(n)}\big)\cdot\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial b_a}
\end{cases}
\tag{2.64}
$$
To facilitate the analysis, the error propagation term (the derivative of the error term of each sample with respect to the output of the hidden layer) is written as:
$$\delta_X^{(n)} = \frac{\partial\big(\hat x^{(n)} - x^{(n)}\big)}{\partial X^{(n)}} \in \mathbb{R}^{v\times u} \tag{2.65}$$
Further, according to the chain rule:
$$
\begin{cases}
\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial W_a} = \delta_X^{(n)}\cdot\dfrac{\partial X^{(n)}}{\partial W_a} = \delta_X^{(n)}\cdot\big(\sigma_a' \odot \operatorname{diag}(x^{(n)})\big)\\[3mm]
\dfrac{\partial\big(\hat x^{(n)} - x^{(n)}\big)^T}{\partial b_a} = \delta_X^{(n)}\cdot\dfrac{\partial X^{(n)}}{\partial b_a} = \delta_X^{(n)}\cdot\big(\sigma_a' \odot \mathbf{1}_u\big)
\end{cases}
\tag{2.66}
$$
where ⊙ and diag(·) are as in the synthesis stage, 1_u is a u-dimensional column vector whose elements are all 1, and σ'_a is the derivative of the activation function in the analysis stage. The partial derivatives of the regularization term with respect to the parameters are:
$$\frac{\partial R(\theta)}{\partial W_s} = 2W_s,\qquad \frac{\partial R(\theta)}{\partial W_a} = 2W_a \tag{2.67}$$
Based on the above analysis, the parameter update formulas are obtained.

$$
\begin{cases}
W_a^{(k+1)} = W_a^{(k)} - \alpha\cdot\dfrac{\partial J(\theta)}{\partial W_a}\Big|_{W_a = W_a^{(k)}}\\[3mm]
b_a^{(k+1)} = b_a^{(k)} - \alpha\cdot\dfrac{\partial J(\theta)}{\partial b_a}\Big|_{b_a = b_a^{(k)}}\\[3mm]
W_s^{(k+1)} = W_s^{(k)} - \alpha\cdot\dfrac{\partial J(\theta)}{\partial W_s}\Big|_{W_s = W_s^{(k)}}\\[3mm]
b_s^{(k+1)} = b_s^{(k)} - \alpha\cdot\dfrac{\partial J(\theta)}{\partial b_s}\Big|_{b_s = b_s^{(k)}}
\end{cases}
\tag{2.68}
$$
where α is the learning rate.

The core of the autoencoder network is to establish reasonable loss terms based on effective criteria, so that the input and the encoded features (i.e., the output of the hidden layer) have a good topological correspondence. Furthermore, the encoded features can be used as a new input, and corresponding encoded features can be obtained again in the same way (the hyperparameter settings may differ from before). Repeating this operation several times forms a deep stacked neural network. The encoded features can be regarded as a reasonable description of the input: as the number of layers increases, they become more abstract and capture more global characteristics.
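The following sketch illustrates Formulas (2.58), (2.60), and (2.68) for a single three-layer autoencoder: a sigmoid encoder and decoder, a squared-error loss with Frobenius-norm regularization, and plain gradient-descent updates. The sizes, learning rate, and random data are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
u, v, N, lam, alpha = 6, 3, 100, 1e-3, 0.1
X = rng.random((N, u))                       # inputs are also the desired outputs

Wa, ba = rng.normal(0, 0.1, (v, u)), np.zeros(v)   # analysis (encoding) parameters
Ws, bs = rng.normal(0, 0.1, (u, v)), np.zeros(u)   # synthesis (decoding) parameters

for _ in range(200):
    H = sigmoid(X @ Wa.T + ba)               # hidden code, first line of Eq. (2.58)
    Xhat = sigmoid(H @ Ws.T + bs)            # reconstruction, second line of Eq. (2.58)
    err = Xhat - X                           # per-sample error terms
    loss = 0.5 / N * np.sum(err**2) + lam * (np.sum(Wa**2) + np.sum(Ws**2))

    # backpropagation through decoder and encoder (chain rule, cf. Eqs. 2.62-2.66)
    d_out = err * Xhat * (1 - Xhat)          # derivative through the decoder sigmoid
    gWs = d_out.T @ H / N + 2 * lam * Ws
    gbs = d_out.mean(axis=0)
    d_hid = (d_out @ Ws) * H * (1 - H)       # error propagated to the hidden layer
    gWa = d_hid.T @ X / N + 2 * lam * Wa
    gba = d_hid.mean(axis=0)

    # gradient-descent parameter updates, Eq. (2.68)
    Wa -= alpha * gWa; ba -= alpha * gba
    Ws -= alpha * gWs; bs -= alpha * gbs

print(round(loss, 4))
```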

2.7 Restricted Boltzmann machine and deep belief network

In essence, restricted Boltzmann machines and Boltzmann machines also belong to the category of autoencoder networks. The Boltzmann machine is a feedback neural network composed of fully connected stochastic neurons; the connections are symmetric and there is no self-feedback. It contains two layers, a visible layer and a hidden layer; nodes in the same layer are connected, and nodes in different layers are also connected. In the restricted Boltzmann machine, by contrast, there are no connections within a layer, only connections between adjacent layers. Boltzmann machines have a strong unsupervised learning ability and can learn complex regularities in the data, but the cost is that the training (learning) time is very long. In addition, it is difficult to compute the distribution of the Boltzmann machine exactly and hence to obtain random samples from that distribution. Therefore, the restricted Boltzmann machine was introduced. Its main characteristic is that the activations of the hidden-layer units are conditionally independent when the state of the visible layer (the input data) is given; in turn, the activations of the visible-layer units are conditionally independent when the state of the hidden layer is given. In this way, although the distribution of the restricted Boltzmann machine still cannot be computed effectively, a random sample of the distribution can be obtained by Gibbs sampling. As long as the number of hidden-layer units is sufficient, the restricted Boltzmann machine can fit any discrete distribution.

Figure 2.9 Boltzmann machine and restricted Boltzmann machine.

In order to optimize the objective function effectively, in 2002 Hinton et al. proposed a fast learning algorithm, the contrastive divergence algorithm. In applications, the model has been successfully used to solve different machine learning problems, such as classification, regression, dimensionality reduction, high-dimensional time series modeling, feature extraction, and so on. The core of the deep stacked neural network is pretraining and fine-tuning, which is also a mainstream approach to semisupervised learning at present. Whether for autoencoder networks based on energy (to construct the loss function) or for restricted Boltzmann machines and Boltzmann machines based on information loss (to construct the loss function), the improvements and applications of this kind of network model are very extensive. The following details the topology and training techniques of the deep belief network, mainly based on the restricted Boltzmann machine. The overall structure of the network is a deep feedforward neural network, while the parameter initialization of the network is obtained using restricted Boltzmann machines (Fig. 2.9): the multiplicative biases and the weight connection matrices of the restricted Boltzmann machines are directly assigned to the weight matrices and biases of the corresponding layers of the network.

(1) Data
The data are consistent with the data requirements of the autoencoder network (based on the three-layer feedforward neural network) and are divided into labeled and unlabeled data:
$$
\begin{cases}
\big\{x^{(n)},\; y^{(n)}\big\}_{n=1}^{N} \;\to\; \text{Training data}\\[2mm]
\big\{\tilde x^{(t)}\big\}_{t=1}^{T} \;\to\; \text{Test data}
\end{cases}
\tag{2.69}
$$
The total number of data is T + N.

(2) Model
The network structure of the model is a deep feedforward neural network, and its design is divided into two stages: a feature learning stage and a classifier design stage. In the feature learning stage,
$$
\begin{cases}
X_l = \sigma_l(W_l\cdot X_{l-1} + b_l) \in \mathbb{R}^{n_l}\\
X_0 = x
\end{cases}
\tag{2.70}
$$
where l = 1, ..., L. The parameter initialization of the autoencoding network in each layer is
$$
\begin{cases}
X_l = \sigma_{a_l}\big(W_{a_l}\cdot X_{l-1} + b_{a_l}\big)\\
\hat X_{l-1} = \sigma_{s_l}\big(W_{s_l}\cdot X_l + b_{s_l}\big)
\end{cases}
\tag{2.71}
$$
where W_l = W_{a_l}, b_l = b_{a_l}, and σ_l = σ_{a_l}; "a" represents the analysis stage and "s" the synthesis stage. In the classifier design stage, if the number of classes is K, then
$$
\begin{cases}
y(k) = \dfrac{e^{(X_L\cdot\theta_k)}}{\sum_{j=1}^{K} e^{(X_L\cdot\theta_j)}}\\[3mm]
y = [y(1), \ldots, y(K)]^T
\end{cases}
\tag{2.72}
$$
where θ_k (k = 1, 2, ..., K) are the parameters to be learned. The only difference from the deep stacked network formed by autoencoder networks is the parameter initialization in the feature learning stage:
$$
\begin{cases}
h \sim P(h\mid v;\; W, a, b) = \dfrac{1}{P(v)\cdot Z}\cdot e^{(b^T\cdot h + v^T\cdot W\cdot h)}\\[3mm]
\hat v \sim P(v\mid h;\; W, a, b) = \dfrac{1}{P(h)\cdot Z}\cdot e^{(a^T\cdot v + v^T\cdot W\cdot h)}
\end{cases}
\tag{2.73}
$$
where v is the visible layer, that is, the data after input normalization; h is the hidden layer; and W is the weight connection matrix between the visible layer v and the hidden layer. Its transpose W^T maps the hidden layer back to the (reconstructed) visible layer v̂. The symbol a is the multiplicative bias on the visible layer, and the symbol b is the multiplicative bias on the hidden layer. In addition, h obeys the corresponding distribution and is obtained by Gibbs sampling, but in practical applications the specific way to obtain it varies. If the parameters (W, a, b) are known, then, given the input v, the hidden layer h is computed as:
$$
\begin{cases}
P\big(h(i) = 1\mid v\big) = \sigma\big(v^T\cdot W_{:,i} + b_i\big)\\
P\big(h(i) = 0\mid v\big) = 1 - \sigma\big(v^T\cdot W_{:,i} + b_i\big)
\end{cases}
\tag{2.74}
$$
where σ(·) is the sigmoid activation function, h(i) is the output value of the ith node in the hidden layer h, W_{:,i} is the ith column of the weight matrix W, and b_i is the ith component of the multiplicative bias b of the hidden layer. The formula for estimating the visible layer from the hidden layer h is:
$$
\begin{cases}
P\big(v(j) = 1\mid h\big) = \sigma\big(W_{j,:}\cdot h + a_j\big)\\
P\big(v(j) = 0\mid h\big) = 1 - \sigma\big(W_{j,:}\cdot h + a_j\big)
\end{cases}
\tag{2.75}
$$

(3) Objective function
The optimization of the objective function is divided into two stages: parameter initialization and fine-tuning. The objective function for the initialization of the hierarchical parameters is:

$$\min_{\theta} J(\theta) = -\sum_{t=1}^{T}\log P\big(v^{(t)}\big) = -\sum_{t=1}^{T}\log\sum_{h} P\big(v^{(t)}, h\big) \tag{2.76}$$
The parameter initialization between the input and the first hidden layer is explained as follows; the visible layer is obtained by normalizing the data:
$$\big\{\tilde x^{(t)}\big\}_{t=1}^{T} \;\xrightarrow{\text{normalization}}\; \big\{v^{(t)}\big\}_{t=1}^{T} \tag{2.77}$$
The parameters (W*, a*, b*) are obtained using Formula (2.76) and assigned to the weight matrix and (additive) bias between the first layers of the deep feedforward neural network:
$$W_1 = W^*,\qquad b_1 = b^* \tag{2.78}$$
Next, the parameter initialization between the first hidden layer and the second hidden layer is explained. First, the output of the first hidden layer in the feedforward neural network is obtained using:
$$X_1 = \sigma_1(W_1\cdot x + b_1) \tag{2.79}$$
where σ_1(·) is the activation function of the first hidden layer. The data are then normalized:

$$\big\{X_1^{(t)}\big\}_{t=1}^{T} \;\xrightarrow{\text{normalization}}\; \big\{v_1^{(t)}\big\}_{t=1}^{T} \tag{2.80}$$
Similarly, Formula (2.76) is used to obtain the parameters (note that the input has changed, so the sizes of the corresponding parameters also change), and the obtained parameters are assigned to (W_2, b_2). In this way, every parameter of the feature learning stage can be trained, and the initialization of the parameters is completed.

(4) Solving
The fine-tuning operation is the same as for a traditional deep feedforward neural network, which uses the backpropagation algorithm to fine-tune the entire network. The objective function (Formula (2.76)) in the parameter initialization stage can be solved by the contrastive divergence algorithm.

At present, the deep belief network has been successfully applied to pattern classification problems, but there is still a lot of work to be done and many problems to be solved. For example, the following three problems urgently need to be addressed. The first is theory, which includes two aspects: mathematical physics and computation. A deep model has a better ability to represent nonlinear functions than a shallow model; what needs to be pointed out is that representation ability is not equivalent to learnability. For some functions, a deep network can express them with very few parameters, but it takes a large number of training samples and computing resources to learn a good enough model. The second is the modeling problem. Since the deep belief network was proposed, many similar structures have been produced, such as using an autoencoder network to replace the restricted Boltzmann machine. We should look for new hierarchical models (such as sparse hierarchical MAX, S-HMAX) that not only have the strong expressive ability of traditional deep models but are also easier to analyze theoretically. The last is the parallel optimization problem: the backpropagation algorithm in the deep belief network and the minibatch stochastic gradient optimization algorithm are difficult to train in parallel across multiple machines. Training is usually accelerated with GPUs, but the GPUs of a single machine are not suitable for large-scale data recognition and similar tasks. In the future, cloud computing platforms and parallel acceleration based on FPGAs are key issues for applying the deep belief network to large-scale data recognition.
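As an illustration of how the conditionals (2.74) and (2.75) are used to optimize Formula (2.76) approximately, the sketch below performs one CD-1 (contrastive divergence) update for a small binary restricted Boltzmann machine. The sizes, learning rate, and random binary data are assumptions made only for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, (n_vis, n_hid))   # weights between visible and hidden layer
a = np.zeros(n_vis)                       # visible-layer bias
b = np.zeros(n_hid)                       # hidden-layer bias
v0 = (rng.random((20, n_vis)) > 0.5).astype(float)   # a batch of binary "data"

# positive phase: P(h = 1 | v0), Eq. (2.74)
ph0 = sigmoid(v0 @ W + b)
h0 = (rng.random(ph0.shape) < ph0).astype(float)      # Gibbs sample of the hidden layer

# negative phase: one step of Gibbs sampling back to the visible layer, Eq. (2.75)
pv1 = sigmoid(h0 @ W.T + a)
v1 = (rng.random(pv1.shape) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + b)

# CD-1 parameter update: difference between data and reconstruction statistics
W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
a += lr * (v0 - v1).mean(axis=0)
b += lr * (ph0 - ph1).mean(axis=0)
print(W.shape, a.shape, b.shape)
```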


CHAPTER 3

Theoretical basis of natural computation

Chapter Outline
3.1 Evolutionary algorithms
    3.1.1 Pattern theorem
    3.1.2 Implicit parallelism
    3.1.3 Building block assumption
3.2 Artificial immune system
    3.2.1 Markov chain-based convergence analysis
    3.2.2 Nonlinear dynamic model
3.3 Multiobjective optimization
    3.3.1 Introduction
    3.3.2 Mathematical concepts
    3.3.3 Multiobjective optimization algorithms
        3.3.3.1 The first generation of evolutionary multiobjective optimization algorithms
        3.3.3.2 The second generation of evolutionary multiobjective optimization algorithms
References

3.1 Evolutionary algorithms

Since simulating biological evolution has led to success in solving problems, researchers have tried to develop a theory that explains why the evolutionary process works. Holland [1] and Goldberg [2] established the pattern (schema) theorem, the implicit parallelism theorem, and the building block hypothesis to explain the role of evolutionary algorithms. The genetic algorithm (GA) is used as an example to illustrate these theorems and assumptions below.

3.1.1 Pattern theorem

Definition 3.1 Let an individual in the genetic algorithm be p ∈ {0, 1}^l and let S = {0, 1, *}^l. Any s ∈ S is called a schema, where "*" is the wildcard symbol.
Definition 3.2 If every fixed bit of the schema s is matched by the corresponding bit of an individual p, then p is said to be a representative of s.
Definition 3.3 The order of a schema s is the number of "0"s and "1"s (the fixed positions) in s, denoted O(s).



Definition 3.4 The defining length of a schema s is the distance between the first and the last fixed positions in s, denoted δ(s).
Assume that at time step t a particular schema s has m representative strings in the population P(t), denoted m = m(s, t). After the operations of crossover, mutation, and selection, the expected number of representative strings of the schema s in the population P(t + 1) satisfies:
$$E[m(s, t+1)] \approx m(s, t)\,\frac{\bar f(s)}{\bar f}\left[1 - p_c\,\frac{\delta(s)}{l-1} - O(s)\,p_m\right] \tag{3.1}$$
where f̄(s) denotes the mean fitness at time t of all representative strings of s, called the fitness of s; f̄ is the average fitness of all individuals in the population P(t); p_c is the crossover probability; and p_m is the mutation probability. From Formula (3.1) we can draw the following theorem.
Theorem 3.1 (Schema theorem) Schemata with short defining length, above-average fitness, and low order are sampled at an exponentially growing rate during the iterations of the GA.
The schema theorem says that the GA allocates search effort to schemata according to their fitness, defining length, and order. If a schema has short defining length, high fitness, and low order, the number of its samples rises at an exponential rate; if a schema has low fitness, long defining length, and high order, the number of its samples declines at an exponential rate.
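The quantities appearing in Definitions 3.1-3.4 and Formula (3.1) can be computed directly for binary strings, as in the following sketch; the example schema, fitness values, and GA parameters are illustrative assumptions.

```python
def matches(individual, schema):
    # Definition 3.2: every fixed bit of the schema must match the individual
    return all(s == '*' or s == c for s, c in zip(schema, individual))

def order(schema):
    return sum(c != '*' for c in schema)          # Definition 3.3: number of fixed positions

def defining_length(schema):
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0   # Definition 3.4: first-to-last fixed position

def expected_count(m, f_s, f_avg, p_c, p_m, schema, l):
    # Eq. (3.1): expected number of representatives of the schema in the next generation
    return m * (f_s / f_avg) * (1 - p_c * defining_length(schema) / (l - 1)
                                - order(schema) * p_m)

schema = '1*0**'
print(matches('10011', schema), order(schema), defining_length(schema))
print(expected_count(m=20, f_s=1.2, f_avg=1.0, p_c=0.8, p_m=0.01, schema=schema, l=5))
```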

3.1.2 Implicit parallelism

If the string length is l and the population size is n, between 2^l and n·2^l different schemata are represented. Therefore, the number of schemata processed during an iteration is much greater than the number of individuals. Professor Holland referred to this feature of genetic algorithms as implicit parallelism, and Professor Goldberg proved the following theorem.
Theorem 3.2 (Implicit parallelism theorem) Suppose ε ∈ (0, 1) is a very small number and l_s < ⌈(l − 1)ε⌉ + 1, and let the population size be N = 2^{l_s/2}. Then the number of schemata whose "survival rate" is larger than (1 − ε) is of order O(N³). Here ⌈·⌉ denotes rounding up.
The implicit parallelism theorem reflects the fact that the genetic algorithm searches the space very efficiently: each time a population is processed, about O(N³) schemata are processed in parallel. At the same time, the theorem reflects the strong ability of GAs to retain spatial information, since information about O(N³) schemata is kept in each population.


3.1.3 Building block assumption

Definition 3.5 A schema with a high fitness value, short defining length, and low order is called a building block.
Just like building blocks, these "good" schemata are put together and combined with each other under the genetic operators to produce strings with higher fitness values, so as to find better feasible solutions. This is what the building block hypothesis states.
Hypothesis 3.1 (Building block hypothesis) Schemata with above-average fitness, short defining length, and low order combine with each other under the action of the genetic operators to generate schemata with above-average fitness, longer defining length, and higher order, and can ultimately generate a globally optimal solution.
The schema theorem guarantees exponential growth of the samples of better schemata, which is a necessary condition for the GA to be able to find the globally optimal solution. The building block hypothesis indicates that the GA has the ability to find the globally optimal solution: under the action of the genetic operators, building blocks combine into schemata with above-average fitness, longer defining length, and higher order, and can ultimately produce a globally optimal solution.

3.2 Artificial immune system

Theoretical analysis has always been a difficult point in the research of bioinspired intelligent computation, such as evolutionary algorithms and immune algorithms. The current work on theoretical analysis of the artificial immune system is very limited, and, as far as we know, there are few formal proofs for artificial immune system algorithms. Jiao et al. have provided a proof of the convergence of the clonal selection algorithm based on the Markov chain. Also based on the Markov chain, Coello Coello et al. gave a proof of the convergence of the multiobjective clonal selection algorithm [3], and Stepney et al. proved the convergence of the B-cell algorithm by Markov chain theory [4]. Most of these works give axiomatic proofs of the convergence of the algorithm. In addition, analysis models based on nonlinear dynamics are also feasible.

3.2.1 Markov chain-based convergence analysis

Global convergence is one of the most important properties of an optimization algorithm. At present, the performance of many immune optimization algorithms is verified through empirical analysis on various test optimization problems. In fact, most immune optimization algorithms do not consider the interaction of different antibodies between populations. Given the population at time t, after the affinity maturation and selection operations, the state of the population at time t + 1 is still a random variable. Therefore, the change of the population over time t can be described by a Markov chain [5]. For the typical clonal selection algorithm, the convergence analysis based on the Markov chain is as follows.
Definition 3.6 The global optimal solution set of the problem is defined as:
$$U^* \triangleq \{x^* \in U : f(x^*) = \min(f(x)\mid x\in U)\} \tag{3.2}$$

For an antibody population M, let w(M) ≜ |M ∩ U*| denote the number of optimal solutions in M.
Definition 3.7 If for any initial state M_0 the following condition is met:
$$\lim_{t\to\infty} P\{w(M(t)) \ge 1 \mid M(0) = M_0\} = 1 \tag{3.3}$$
then the algorithm is said to converge to the global optimal solution with probability 1.
Theorem 3.3 The clonal selection algorithm converges to the global optimal solution with probability 1.
Proof: Set P_0(t) = P{w(M(t)) = 0}. By the total probability formula,
$$
\begin{aligned}
P_0(t+1) &= P\{w(M(t+1)) = 0\}\\
&= P\{w(M(t+1)) = 0 \mid w(M(t)) \ne 0\}\cdot P\{w(M(t)) \ne 0\}\\
&\quad + P\{w(M(t+1)) = 0 \mid w(M(t)) = 0\}\cdot P\{w(M(t)) = 0\}
\end{aligned}
\tag{3.4}
$$
Since the optimal solutions in the memory population do not deteriorate, we have P{w(M(t+1)) = 0 | w(M(t)) ≠ 0} = 0, and therefore
$$P_0(t+1) = P\{w(M(t+1)) = 0 \mid w(M(t)) = 0\}\cdot P_0(t) \tag{3.5}$$
From the property of the affinity maturation operation, we know that
$$P\{w(M(t+1)) > 0 \mid w(M(t)) = 0\}_{\min} > 0 \tag{3.6}$$
Denote ζ = min_t P{w(M(t+1)) = 1 | w(M(t)) = 0}_min, t = 0, 1, 2, ...; then
$$P\{w(M(t+1)) = 1 \mid w(M(t)) = 0\}_{\min} \ge \zeta > 0 \tag{3.7}$$

So,
$$
\begin{aligned}
P\{w(M(t+1)) = 0 \mid w(M(t)) = 0\} &= 1 - P\{w(M(t+1)) \ne 0 \mid w(M(t)) = 0\}\\
&= 1 - P\{w(M(t+1)) \ge 1 \mid w(M(t)) = 0\}\\
&\le 1 - P\{w(M(t+1)) = 1 \mid w(M(t)) = 0\}\\
&\le 1 - \zeta
\end{aligned}
$$
Combining this with (3.5) gives P_0(t+1) ≤ (1 − ζ)·P_0(t), so P_0(t) ≤ (1 − ζ)^t·P_0(0) → 0 as t → ∞, and the clonal selection algorithm converges to the global optimal solution with probability 1.

3.3.2 Mathematical concepts

A multiobjective optimization problem can be stated as:
$$
\begin{aligned}
\min\;\; & y = F(x) = \big(f_1(x), f_2(x), \ldots, f_m(x)\big)\\
\text{s.t.}\;\; & g_i(x) \le 0\quad (i = 1, 2, \ldots, q)\\
& h_j(x) = 0\quad (j = 1, 2, \ldots, p)
\end{aligned}
\tag{3.12}
$$
where x = (x_1, x_2, ..., x_n) ∈ X ⊆ R^n is an n-dimensional decision vector and X is the n-dimensional decision space; y = (y_1, y_2, ..., y_m) ∈ Y ⊆ R^m is an m-dimensional objective vector and Y is the m-dimensional objective space. The objective function F(x) defines m mapping functions from the decision space to the objective space, g_i(x) ≤ 0 (i = 1, 2, ..., q) are the q inequality constraints, and h_j(x) = 0 (j = 1, 2, ..., p) are the p equality constraints. On this basis, the following important definitions are given.
Definition 3.8 Feasible solution For an x ∈ X, if x satisfies the constraints of (3.12), i.e., g_i(x) ≤ 0 (i = 1, 2, ..., q) and h_j(x) = 0 (j = 1, 2, ..., p), then x is called a feasible solution.
Definition 3.9 Feasible solution set The set of all feasible solutions in X is called the feasible solution set, denoted X_f, with X_f ⊆ X.
Definition 3.10 Pareto dominance Based on Formula (3.12), given x_A, x_B ∈ X_f, x_A dominates (Pareto-dominates) x_B, denoted x_A ≻ x_B, if it satisfies the following two conditions:
$$\forall i = 1, 2, \ldots, m:\; f_i(x_A) \le f_i(x_B)\qquad\wedge\qquad \exists j = 1, 2, \ldots, m:\; f_j(x_A) < f_j(x_B) \tag{3.13}$$

Definition 3.11 Pareto-optimal A solution x* ∈ X_f is called Pareto-optimal (nondominated) if
$$\neg\exists x \in X_f :\; x \succ x^* \tag{3.14}$$
Definition 3.12 Pareto-optimal set The set of all Pareto-optimal solutions is called the Pareto-optimal set:
$$P^* \triangleq \{x^* \mid \neg\exists x \in X_f :\; x \succ x^*\} \tag{3.15}$$
Definition 3.13 Pareto-optimal front The surface generated by combining the objective function vectors of all Pareto-optimal solutions in the set P* is called the Pareto-optimal front (PF*):
$$PF^* \triangleq \big\{F(x^*) = \big(f_1(x^*), f_2(x^*), \ldots, f_m(x^*)\big)^T \mid x^* \in P^*\big\} \tag{3.16}$$
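Definitions 3.10-3.12 translate directly into code for a finite set of candidate objective vectors, as in the following sketch (all objectives are minimized); the example points are illustrative assumptions.

```python
def dominates(fa, fb):
    # Definition 3.10: no worse in every objective and strictly better in at least one
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))

def nondominated(points):
    # Definitions 3.11/3.12 restricted to a finite set of objective vectors
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

points = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 2.5)]
print(nondominated(points))   # the nondominated subset approximates the Pareto-optimal front
```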

3.3.3 Multiobjective optimization algorithms

In 1967, Rosenberg [31] suggested using an evolution-based method to deal with multiobjective optimization problems, but he did not realize this idea at the time. In 1975, John Holland [32] proposed the genetic algorithm (GA). Ten years later, Schaffer [11] proposed the vector evaluated genetic algorithm, which for the first time combined a genetic algorithm with multiobjective optimization problems. In 1989, Goldberg, in his book Genetic Algorithms in Search, Optimization, and Machine Learning [33], proposed the new idea of combining the Pareto theory of economics with evolutionary algorithms to solve multiobjective optimization problems, which has had important guiding significance for research on multiobjective optimization. Subsequently, evolutionary multiobjective optimization algorithms attracted the attention of many scholars, and a large number of research results emerged. According to our statistics, among the papers published in IEEE Transactions on Evolutionary Computation, an authoritative journal for evolutionary computation, from the beginning of 1997 to the end of 2007, the two most cited articles in SCI are both EMO research results, NSGA-II [8] and SPEA [9]. As we can see, evolutionary multiobjective optimization is a very hot research topic in evolutionary computation. Next, we discuss some of the major algorithms in the evolutionary multiobjective optimization domain according to the classification of Coello Coello [34-37].

3.3.3.1 The first generation of evolutionary multiobjective optimization algorithms

The first-generation evolutionary multiobjective optimization algorithms were inspired by Goldberg's suggestion [33]. In 1989, he suggested using nondominated sorting and niching to solve multiobjective optimization problems. Nondominated sorting proceeds by assigning rank 1 to the nondominated individuals of the current population and removing them from the competition, then selecting the nondominated individuals of the remaining population and assigning them rank 2; the process continues until all individuals in the population are assigned a rank. Niche technology is used to maintain population diversity and prevent premature convergence. Although Goldberg did not himself implement these ideas in evolutionary multiobjective optimization, they were instructive to later scholars. Subsequently, some scholars proposed MOGA [12], NSGA [13], and NPGA [14] based on his idea.

3.3.3.1.1 MOGA

Fonseca and Fleming proposed MOGA in 1993. This method ranks each individual: the rank of all nondominated individuals is defined as 1, and the rank of every other individual is defined as the number of individuals that dominate it plus one. Individuals with the same rank are selected using a fitness-sharing mechanism. Fitness allocation is performed as follows: first, the population is sorted by rank, and then all individuals are assigned fitness using the linear or nonlinear interpolation method proposed by Goldberg, so that individuals with the same rank have the same fitness. Selection is then performed stochastically through the fitness-sharing mechanism. MOGA relies heavily on the choice of the sharing function and may generate excessive selection pressure, leading to premature convergence.

3.3.3.1.2 NSGA

The NSGA is also based on Goldberg's idea of nondominated sorting. The nondominated solutions are first identified and assigned a large dummy fitness value. To maintain the diversity of the population, these nondominated solutions share their dummy fitness values. These nondominated individuals are then set aside, and the second batch of nondominated individuals is determined from the remaining population; they are assigned a dummy fitness value smaller than the smallest shared fitness value of the previous batch. These individuals are set aside in turn, and the third batch of nondominated individuals is determined from the remaining population. The process continues until the entire population is divided into several levels. NSGA uses proportional selection to reproduce the new generation. The computational complexity of NSGA is O(mN³), where m is the number of objectives and N is the population size, so its computational complexity is high, and the sharing parameter needs to be determined in advance.
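The fitness-sharing mechanism used by MOGA and NSGA can be sketched as follows: each individual's fitness is divided by its niche count, so crowded regions are penalized. The triangular sharing function, the one-dimensional positions, and the sharing radius are illustrative assumptions.

```python
def sharing(d, sigma_share):
    # triangular sharing function: full weight at distance 0, zero beyond the radius
    return 1.0 - d / sigma_share if d < sigma_share else 0.0

def shared_fitness(fitness, positions, sigma_share):
    shared = []
    for i, fi in enumerate(fitness):
        # niche count: how crowded the neighborhood of individual i is
        m_i = sum(sharing(abs(positions[i] - positions[j]), sigma_share)
                  for j in range(len(fitness)))
        shared.append(fi / m_i)
    return shared

fitness   = [3.0, 3.0, 2.0, 1.0]
positions = [0.10, 0.12, 0.50, 0.90]   # 1-D decision values for simplicity
print(shared_fitness(fitness, positions, sigma_share=0.2))
```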

3.3.3.1.3 NPGA

NPGA designed a tournament selection mechanism based on Pareto dominance. The specific idea is as follows: two individuals are randomly selected from the evolutionary population, and a comparison set is also randomly selected from the evolutionary population. If only one of the two individuals is not dominated by the comparison set, that individual is selected to enter the next generation; when both are nondominated or both are dominated by the comparison set, niche sharing is used to select one of them, namely the individual with the larger shared fitness value. The selection and adjustment of the niche radius in this algorithm are difficult, and a suitable size of the comparison set must be chosen.

The main features of the first-generation evolutionary multiobjective optimization algorithms are nondominated sorting and selection together with sharing functions that maintain the diversity of the population. During the development of first-generation evolutionary multiobjective optimization, some issues that urgently needed to be resolved also emerged. First, other methods that can replace niching to maintain population diversity must be found. Fitness sharing was proposed by Goldberg and Richardson [38] for multimodal function optimization; it usually requires prior knowledge of the number of peaks and the assumption that the niches are uniformly distributed over the solution space. For multiobjective optimization problems, prior information on the sharing radius must also be determined, and the computational complexity is quadratic in the population size.

3.3.3.2 The second generation of evolutionary multiobjective optimization algorithms

From the end of the 20th century, research trends in the field of evolutionary multiobjective optimization underwent tremendous changes. In 1999, Zitzler et al. proposed SPEA [9], which made elitism preservation popular in the field of evolutionary multiobjective optimization. The birth of the second-generation evolutionary multiobjective optimization algorithms was marked by the introduction of the elitism preservation strategy, which in this field refers to the use of an external population (as opposed to the evolved population) to retain nondominated individuals. Subsequently, several classic evolutionary multiobjective optimization algorithms were proposed, most of which adopted the elitism preservation strategy. In 2000, Knowles and Corne proposed PAES [16], and later proposed PESA [17] and PESA-II [18], improved versions of PAES. In 2001, Zitzler et al. proposed SPEA2 [15], an improved version of SPEA. Deb et al. also improved the NSGA and proposed NSGA-II [8], and Erickson et al. proposed NPGA2 [19], an improved version of NPGA. Coello Coello has been dedicated to the study of evolutionary multiobjective optimization.

In 2001, he proposed Micro-GA [20] and also established a web-based database on EMO (http://lania.mx/~ccoello/EMOO/), which collects most of the research results in the EMO field. Below, we discuss some classic second-generation evolutionary multiobjective optimization algorithms.

3.3.3.2.1 SPEA and SPEA2

SPEA was put forward by Zitzler and Thiele in 1999. In this algorithm, an individual's fitness is also called its Pareto strength. The fitness of a nondominated individual is defined as the proportion of individuals in the population that it dominates; the fitness of every other individual is defined as the total number of individuals that dominate it plus 1, and individuals with low fitness correspond to higher selection probabilities. In addition to the evolutionary population, there is an external population that holds the current nondominated individuals; when the number of individuals in the external population exceeds an agreed-upon value, clustering techniques are used to remove individuals. Tournament selection is used to select individuals from the evolutionary population and the external population, and the selected individuals enter the mating pool for crossover and mutation. The computational complexity of this algorithm is cubic in the population size.

SPEA2 is an improved version of SPEA proposed by Zitzler and Thiele in 2001, with improvements in three aspects: the fitness assignment strategy, the estimation of the individual distribution (density), and the update of the nondominated solutions. The fitness function in SPEA2 is F(i) = R(i) + D(i), where R(i) takes into account the dominance information of individual i in both the external population and the evolutionary population, and D(i) is a measure of the degree of crowding determined by the distance from individual i to its k-th nearest neighbor. When constructing a new population, environmental selection is carried out first and then mating selection. In environmental selection, individuals with fitness less than 1 are first selected to enter the external population; when the number of such individuals is smaller than the size of the external population, individuals with lower fitness in the evolutionary population are added, and when it is greater than the size of the external population, a truncation based on environmental selection is used for deletion. In mating selection, a tournament is used to select individuals for the mating pool. Based on neighbor rules, SPEA2 introduces environmental selection, which simplifies the cluster-based update of the external population in SPEA. Although its computational complexity is still cubic in the population size, the uniformity of the solution distribution obtained with the neighbor-based environmental selection rule outperforms many other algorithms.
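The SPEA2 fitness assignment F(i) = R(i) + D(i) described above can be sketched in one common concrete form, where R(i) sums the strengths of the individuals dominating i and D(i) is derived from the distance to the k-th nearest neighbor in objective space; the example points and the choice k = 1 are illustrative assumptions.

```python
import math

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def spea2_fitness(objs, k=1):
    n = len(objs)
    # strength S(i): how many individuals i dominates
    strength = [sum(dominates(objs[i], objs[j]) for j in range(n)) for i in range(n)]
    fitness = []
    for i in range(n):
        R = sum(strength[j] for j in range(n) if dominates(objs[j], objs[i]))
        dists = sorted(math.dist(objs[i], objs[j]) for j in range(n) if j != i)
        D = 1.0 / (dists[k - 1] + 2.0)      # density from the k-th nearest neighbor
        fitness.append(R + D)               # nondominated individuals get F(i) < 1
    return fitness

objs = [(1, 5), (2, 3), (3, 4), (4, 1), (2.5, 2.5)]
print([round(f, 3) for f in spea2_fitness(objs)])
```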

3.3.3.2.2 PAES, PESA, and PESA-II

PAES uses a (1 + 1) evolution strategy: it mutates the current solution, evaluates the mutated individual, compares its dominance relationship with the premutation individual, and uses an elite retention strategy to keep the better one. Its classic contribution is the introduction of a spatial hypergrid mechanism to maintain the diversity of the population, in which each individual is assigned to a grid cell. The time complexity is O(N·N̄), where N is the size of the evolutionary population and N̄ is the size of the external population. The spatial hypergrid strategy introduced by this algorithm has been adopted by many evolutionary multiobjective algorithms. Later, Corne et al. proposed PESA based on this idea. PESA has an internal population and an external population; during evolution, the nondominated individuals of the internal population are incorporated into the external population. When the external population accepts a new individual, it also needs to eliminate one: it finds the individual with the largest crowding coefficient in the external population and deletes it, and if several individuals share the largest crowding coefficient it randomly deletes one of them. The crowding coefficient refers to the number of individuals gathered in the same hypergrid cell. Corne et al. made a further improvement to PESA in 2001, called PESA-II, which introduces region-based selection: compared with the individual-based selection in PESA, PESA-II selects grid cells rather than individuals, which improves the efficiency of the algorithm to a certain degree.

3.3.3.2.3 NSGA-II

NSGA-II is an improved version of NSGA proposed by Deb et al. in 2002. It is one of the best evolutionary multiobjective optimization algorithms to date. Compared to NSGA, NSGA-II has the following advantages.
1. A fast nondominated sorting approach, which reduces the computational complexity from O(mN³) to O(mN²), where m is the number of objective functions and N is the number of individuals in the population.
2. A crowding-distance measure that discriminates between individuals within the same nondominated front, encouraging the current Pareto front to extend toward the entire Pareto front and to spread as evenly as possible. The crowding-distance comparison operator replaces the fitness-sharing method used in NSGA, and the time complexity of computing crowding distances is O(m(2N)log(2N)).
3. An elitism preservation mechanism. Individuals selected to participate in breeding compete with their offspring to form the next generation of the population, which helps retain good individuals and improves the overall evolution of the population.
NSGA-II, SPEA2, and PESA-II are the main classical algorithms of second-generation evolutionary multiobjective optimization. During this period, many other evolutionary algorithms were proposed to solve multiobjective optimization problems, such as the multiobjective messy genetic algorithm (MOMGA) [39] proposed by Van Veldhuizen et al.

Coello Coello et al. proposed the micro-GA [20]. The algorithms of this period are characterized by elitism preservation strategies, and most of them no longer rely on niching techniques to maintain the diversity of the population. Better strategies have been proposed instead, such as clustering-based methods, crowding-distance-based methods, and methods based on a spatial hypergrid.
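To make the crowding-distance mechanism of NSGA-II concrete, the following Python sketch computes the crowding distance of every solution within one nondominated front. It is a minimal sketch rather than a reference implementation; the array layout and the convention of assigning infinite distance to boundary solutions are assumptions of this example.

import numpy as np

def crowding_distance(front_objs):
    # front_objs: (n, m) array of the objective values of the n solutions in one front
    n, m = front_objs.shape
    distance = np.zeros(n)
    for j in range(m):
        order = np.argsort(front_objs[:, j])                 # sort the front by objective j
        f_min, f_max = front_objs[order[0], j], front_objs[order[-1], j]
        distance[order[0]] = distance[order[-1]] = np.inf    # boundary solutions are always kept
        if f_max == f_min:
            continue                                         # objective constant on this front
        for k in range(1, n - 1):
            gap = front_objs[order[k + 1], j] - front_objs[order[k - 1], j]
            distance[order[k]] += gap / (f_max - f_min)
    return distance

Solutions of equal nondomination rank are then compared by this distance, larger values being preferred.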

References
[1] Burnet SFM. The clonal selection theory of acquired immunity. Clonal Selection Theory of Acquired Immunity 1959;241(4).
[2] Kelsey J, Timmis J. Immune inspired somatic contiguous hypermutation for function optimisation. In: Genetic and Evolutionary Computation Conference. Berlin, Heidelberg: Springer; 2003. p. 207–18.
[3] Villalobos-Arias M, Coello Coello AC, Hernández-Lerma O. Convergence analysis of a multiobjective artificial immune system algorithm. In: International Conference on Artificial Immune Systems. Berlin, Heidelberg: Springer; 2004. p. 226–35.
[4] Stepney S, Smith RE, Timmis J, et al. Conceptual frameworks for artificial immune systems. International Journal of Ubiquitous Computing 2005;1(3):315–38.
[5] Cox DR. The theory of stochastic processes. Routledge; 2017.
[6] Perelson AS, Weisbuch G. Immunology for physicists. Reviews of Modern Physics 1997;69(4):1219.
[7] Nowak MA, May RM. Virus dynamics. 2000.
[8] Deb K, Pratap A, Agarwal S, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182–97.
[9] Zitzler E, Thiele L. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation 1999;3(4):257–71.
[10] Deb K. Multiobjective optimization using evolutionary algorithms. John Wiley & Sons; 2001.
[11] Schaffer JD. Multiobjective optimization with vector evaluated genetic algorithms. In: Proc. of the 1st International Conference on Genetic Algorithms; 1985. p. 93–100.
[12] Fonseca CM, Fleming PJ. Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. ICGA 1993;93(July):416–23.
[13] Srinivas N, Deb K. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 1994;2(3):221–48.
[14] Horn J, Nafpliotis N, Goldberg DE. A niched Pareto genetic algorithm for multiobjective optimization. In: Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence. IEEE; 1994. p. 82–7.
[15] Zitzler E, Laumanns M, Thiele L. SPEA2: improving the strength Pareto evolutionary algorithm. TIK-report 103; 2001.
[16] Knowles JD, Corne DW. Approximating the nondominated front using the Pareto archived evolution strategy. Evolutionary Computation 2000;8(2):149–72.
[17] Corne DW, Knowles JD, Oates MJ. The Pareto envelope-based selection algorithm for multiobjective optimization. In: International Conference on Parallel Problem Solving from Nature. Berlin, Heidelberg: Springer; 2000. p. 839–48.
[18] Corne DW, Jerram NR, Knowles JD, et al. PESA-II: region-based selection in evolutionary multiobjective optimization. In: Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation. Morgan Kaufmann Publishers Inc.; 2001. p. 283–90.
[19] Erickson M, Mayer A, Horn J. The niched Pareto genetic algorithm 2 applied to the design of groundwater remediation systems. In: International Conference on Evolutionary Multi-Criterion Optimization. Berlin, Heidelberg: Springer; 2001. p. 681–95.

[20] Coello Coello A, Pulido GT. Multiobjective optimization using a micro-genetic algorithm. In: Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation. Morgan Kaufmann Publishers Inc.; 2001. p. 274–82.
[21] Laumanns M, Thiele L, Deb K, et al. Combining convergence and diversity in evolutionary multiobjective optimization. Evolutionary Computation 2002;10(3):263–82.
[22] Brockhoff D, Zitzler E. Are all objectives necessary? On dimensionality reduction in evolutionary multiobjective optimization. In: Parallel problem solving from nature-PPSN IX. Berlin, Heidelberg: Springer; 2006. p. 533–42.
[23] Hernández-Díaz AG, Santana-Quintero LV, Coello Coello CA, et al. Pareto-adaptive ε-dominance. Evolutionary Computation 2007;15(4):493–517.
[24] Deb K, Saxena DK. On finding Pareto-optimal solutions through dimensionality reduction for certain large-dimensional multiobjective optimization problems. KanGAL report 2005011; 2005.
[25] Saxena DK, Deb K. Non-linear dimensionality reduction procedures for certain large-dimensional multiobjective optimization problems: employing correntropy and a novel maximum variance unfolding. In: International Conference on Evolutionary Multi-Criterion Optimization. Berlin, Heidelberg: Springer; 2007. p. 772–87.
[26] Coello Coello AC, Pulido GT, Lechuga MS. Handling multiple objectives with particle swarm optimization. IEEE Transactions on Evolutionary Computation 2004;8(3):256–79.
[27] Gong M, Jiao L, Du H, et al. Multiobjective immune algorithm with nondominated neighbor-based selection. Evolutionary Computation 2008;16(2):225–55.
[28] Zhou A, Zhang Q, Jin Y, et al. Global multiobjective optimization via estimation of distribution algorithm with biased initialization and crossover. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation. ACM; 2007. p. 617–23.
[29] Zhang QF, Zhou AM, Jin Y. RM-MEDA: a regularity model-based multiobjective estimation of distribution algorithm. IEEE Transactions on Evolutionary Computation 2008;12(1):41–63.
[30] Coello Coello AC, Lamont GB, Van Veldhuizen DA. Evolutionary algorithms for solving multiobjective problems. New York: Springer; 2007.
[31] Rosenberg RS. Simulation of genetic populations with biochemical properties: I. The model. Mathematical Biosciences 1970;7:223–57.
[32] Holland JH. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press; 1992.
[33] Goldberg DE, Richardson J. Genetic algorithm for search, optimization and machine learning. In: Proceedings of the Second International Conference on Genetic Algorithms; 1987. p. 41–9.
[34] Coello Coello AC. An updated survey of evolutionary multiobjective optimization techniques: state of the art and future trends. In: Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), vol. 1. IEEE; 1999. p. 3–13.
[35] Coello Coello AC. Evolutionary multiobjective optimization: current and future challenges. In: Advances in Soft Computing. London: Springer; 2003. p. 243–56.
[36] Coello Coello AC. Recent trends in evolutionary multiobjective optimization. In: Evolutionary Multiobjective Optimization. London: Springer; 2005. p. 7–32.
[37] Coello Coello AC. Evolutionary multiobjective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 2006;1(1):28–36.
[38] Goldberg DE, Richardson J. Genetic algorithms with sharing for multimodal function optimization. In: Genetic algorithms and their applications: Proceedings of the Second International Conference on Genetic Algorithms. Hillsdale, NJ: Lawrence Erlbaum; 1987. p. 41–9.
[39] Van Veldhuizen DA, Lamont GB. Multiobjective optimization with messy genetic algorithms. In: Proceedings of the 2000 ACM Symposium on Applied Computing, vol. 1. ACM; 2000. p. 470–6.

CHAPTER 4

Theoretical basis of machine learning

Chapter Outline
4.1 Dimensionality reduction
    4.1.1 Subspace segmentation
    4.1.2 Nonlinear dimensionality reduction
4.2 Sparseness and low rank
    4.2.1 Sparse representation
    4.2.2 Matrix recovery and completion
4.3 Semisupervised learning and kernel learning
    4.3.1 Semisupervised learning
    4.3.2 Nonparametric kernel learning
References

4.1 Dimensionality reduction

4.1.1 Subspace segmentation

Non-negative matrix factorization (NMF) is a matrix decomposition method that incorporates non-negativity constraints. It is a typical linear subspace dimensionality reduction method. Because this decomposition is consistent with the real physical properties of data, is strongly interpretable, and conforms to how people perceive the objective world [1,2], it has attracted increasing attention. The idea of NMF dates back to the concept of positive matrix factorization, proposed by Paatero and Tapper in 1994 [3]; because of its high complexity, however, that method did not attract much attention. Lee and Seung formally proposed the basic conceptual framework of the NMF algorithm in 1999. They described and defined the objective function of the algorithm in theory, gave a simple and practical non-negative alternating least squares method, and applied it to face recognition and text feature extraction. Since then, research on NMF and its applications has developed steadily. The NMF method has been widely used in text data clustering, image data representation, face recognition, blind source signal separation, gene expression analysis, and spectral data analysis. Detailed theoretical analysis and related applications can be found in the literature [4–6]. The existing NMF algorithms can be divided into four categories: (1) the basic NMF algorithm, with typical acceleration algorithms such as the projected gradient method [7] and Newton's method [8]; (2) the constrained NMF algorithm, with representative work including graph-regularized NMF [9], orthogonally constrained NMF [10], semisupervised NMF [11], and robust l1- or l2,1-norm-constrained NMF [12,13]; (3) the structured NMF algorithm, such as weighted NMF [14] and non-negative matrix tri-factorization [15]; and (4) generalizations of the NMF algorithm, such as non-negative tensor decomposition [16] and semi-non-negative matrix decomposition [17]. The subspace method is easy to implement and very effective in practical problems, so there are many linear subspace methods, such as principal component analysis (PCA), linear discriminant analysis (LDA), and non-negative matrix factorization. Using the kernel technique to map data into a reproducing kernel Hilbert space, these linear methods can be extended to handle nonlinear problems; typical examples include kernel principal component analysis (KPCA) [18] and kernel linear discriminant analysis (KLDA) [19]. Kernel techniques have become a powerful tool for nonlinear applications in many fields. In unsupervised learning, it is generally assumed that data lie on, or approximately on, multiple low-dimensional submanifolds. These submanifolds can sometimes be well approximated by linear subspaces of slightly higher dimension, as with handwritten digit data: handwritten images vary in the rotation of the target, the scale, the position, and the stroke thickness of the characters. Simard et al. [20] presented a seven-dimensional manifold model to describe these variations in handwritten images and obtained good recognition results. As a classical data analysis technique, PCA can find a single hidden linear subspace structure in data. However, many real data are approximately distributed over multiple linear subspaces, and in recent years many subspace clustering methods have been proposed, such as generalized PCA (GPCA) [21]. Subspace segmentation is defined as follows [22]: the given data samples are drawn from a union of linear subspaces and usually contain noise; subspace segmentation aims to remove the noise and, at the same time, assign every sample to the subspace it belongs to. The existing subspace segmentation methods can be broadly divided into four categories: (1) algebraic methods, such as GPCA [21]; (2) iterative methods, such as K-subspace segmentation algorithms [23]; (3) statistical methods, typified by the random sample consensus method [24] and agglomerative lossy compression [25]; and (4) methods based on spectral clustering, such as sparse subspace clustering (SSC) [26] and the low-rank representation (LRR) method.
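As a concrete illustration of the basic NMF model discussed above, the following Python sketch factorizes a non-negative data matrix X ≈ WH with the classical multiplicative update rules of Lee and Seung. It is a minimal sketch assuming a Frobenius-norm objective; the rank r, the iteration count, and the small constant added to avoid division by zero are choices of this example rather than prescriptions from the text.

import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9):
    # X: (m, n) non-negative data matrix; r: target rank of the factorization
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # multiplicative updates keep W and H non-negative at every step
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

Each column of W can then be read as a non-negative basis element, and each column of H as the non-negative coefficients representing the corresponding sample.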

4.1.2 Nonlinear dimensionality reduction

Dimension reduction methods can be grouped according to different criteria. According to whether the mapping is linear, they can be divided into linear and nonlinear subspace methods. PCA and LDA are typical linear subspace methods, while isometric mapping (ISOMAP) [27], locally linear embedding (LLE) [28], Laplacian eigenmaps (LE) [29], local tangent space alignment [30], and spectral clustering (SC) [31] are common nonlinear algorithms, as are the kernelized versions of the linear methods above, such as KPCA and KLDA. According to the type of geometric structure preserved, dimension reduction can be divided into local and global methods. Manifold learning algorithms such as LLE and LE, together with their linearized versions such as neighborhood preserving embedding [32] and locality preserving projection [33], are local methods; ISOMAP, PCA, and LDA are global methods. According to whether the reduction process uses labels or other supervisory information, dimension reduction can be divided into supervised, semisupervised, and unsupervised methods: PCA, ISOMAP, LLE, LE, and SC are unsupervised algorithms; SSDR [34] and FME [35] are semisupervised algorithms; LDA and the maximum margin criterion [36] are supervised algorithms. The literature [37] presents a unified graph-based embedding framework in which dimension reduction algorithms are regarded as linearizations, kernelizations, or tensorizations of graph embedding. Nonlinear dimensionality reduction remains one of the hottest issues in machine learning. As an effective manifold learning technique, the ISOMAP algorithm captures the inherent global geometric structure of the data through geodesic distances. Although the algorithm has succeeded on artificial data sets, biomedical data visualization, and face recognition, its topological stability is poor. In addition, computing geodesic distances and performing the eigenvalue decomposition of a dense matrix give it high time complexity, usually cubic in the number of samples, that is, O(n³) (where n is the number of samples). In recent years, the spectral clustering algorithm has received wide attention in machine learning, data mining, and computer vision, and it has been successfully applied to many practical problems, such as image and video segmentation, speech recognition, VLSI design, text mining, and biological information mining. This kind of algorithm is based on spectral graph partitioning theory. Its essence is to divide the constructed graph into two or more smaller subgraphs and, according to the partition criterion, minimize the sum of the weights of the edges cut between subgraphs. The main existing partitioning criteria are the normalized cut, ratio cut, and min-max cut. Finding the optimal partition is an NP-hard problem; the effective approach is to relax the discrete combinatorial problem into a continuous one and transform the original problem into a matrix eigenvalue decomposition, from which a globally optimal relaxed solution can be obtained. As a typical spectral method, a spectral clustering algorithm can accomplish dimensionality reduction of high-dimensional data, and its intermediate steps are also central to some semisupervised learning and graph regularization methods. Although the spectral clustering algorithm is widely used, it still has several open issues, such as the choice of the scale factor when a Gaussian function is used to compute connection weights, and the choice of the number of nearest neighbors, or of the neighborhood radius, when constructing a sparse graph. The choice of the number of neighbors or of the radius can be avoided by constructing a complete graph, but the time complexity of the eigenvalue decomposition of the graph Laplacian then inevitably limits its range of application.
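The following Python sketch illustrates the spectral clustering pipeline described above: build an affinity graph with a Gaussian kernel, form the normalized graph Laplacian, embed the data with its smallest eigenvectors, and cluster the embedding with k-means. It is a minimal sketch; the Gaussian scale sigma and the use of scikit-learn's KMeans are assumptions of this example, and the scale-selection problem mentioned in the text is left to the user.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    # pairwise squared Euclidean distances and Gaussian affinities
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    W = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # eigenvectors of the k smallest eigenvalues form the spectral embedding
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)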

4.2 Sparseness and low rank

4.2.1 Sparse representation

Sparse representation can be expressed as the following convex optimization problem:

\min_{Z,E} \|Z\|_1 + \lambda \|E\|_{2,1}, \quad \text{s.t.} \; X = XZ, \; \mathrm{diag}(Z) = 0 \qquad (4.1)

where \|E\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{m} E_{ij}^2} is the l2,1 norm of the noise matrix E, and the data X itself is chosen as the dictionary. In addition, in order to avoid the trivial solution Z = I, the diagonal elements of the matrix Z are constrained to be 0, that is, diag(Z) = 0. Sparse representation, or sparse coding, can be seen as an automatic feature selection problem closely related to the famous Lasso problem [38]. In the past few years, sparse representation has been successfully applied in many fields, such as signal processing, statistical analysis, computer vision, and pattern recognition. For example, in the field of signal processing, sparse representation is used in signal compression and coding and in image recovery [39]. In the field of image processing, sparse representation has obtained good results in image denoising, restoration, and super-resolution [40]. In the field of computer vision, the sparse subspace segmentation method shows superior performance on the motion segmentation problem. In addition, sparse representation is also widely used in many pattern recognition problems, such as signal and image target classification, face recognition, texture classification, and handwritten numeral recognition.
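As an illustration of the sparse coding idea behind model (4.1) and the Lasso, the following Python sketch recovers a sparse coefficient vector over a dictionary with the iterative soft-thresholding algorithm (ISTA) for the l1-regularized least-squares problem min_s ||x - Ds||² + λ||s||₁. This is a minimal sketch of a generic l1 solver, not the specific algorithm used in the text; the step size derived from the largest singular value of D and the iteration count are assumptions of this example.

import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam=0.1, n_iter=500):
    # D: (m, n) dictionary, x: (m,) signal; returns a sparse code s with x ≈ D s
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ s - x)
        s = soft_threshold(s - step * grad, step * lam)
    return s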

4.2.2 Matrix recovery and completion

The RPCA model can be transformed into the following convex optimization problem:

\min_{Z,E} \|Z\|_* + \lambda \|E\|_1, \quad \text{s.t.} \; X = Z + E \qquad (4.2)

where \|\cdot\|_* is the nuclear norm of a matrix, that is, the sum of its singular values, and \|E\|_1 = \sum_{i,j} |E_{ij}| is the l1 norm of the noise matrix E. As long as the rank of the low-rank matrix is not too high and the noise term is sparse enough, exact recovery can be obtained from the convex envelopes of the l0 norm and the rank function [41]. Moreover, under appropriate conditions, the robust principal component analysis model attains the optimal solution of the above optimization problem. The model has been successfully applied to practical problems such as text data mining [42], video surveillance [43], image alignment, and low-rank textures [44]. The low-rank representation model can be described as the following convex optimization problem:

\min_{Z,E} \|Z\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \; X = XZ + E \qquad (4.3)

where the data X are selected as the dictionary. Liu et al. use the inexact augmented Lagrange multiplier method [45] to solve the above model. The similarity matrix W is then obtained from the optimal solution Z* by the following definition:

W = |Z^*| + |(Z^*)^{\mathrm{T}}| \qquad (4.4)

Lin et al. [46] proposed a fast linearized alternating direction method with partial singular value decomposition to solve the optimization problem efficiently. In addition, the literature [47] presents an LRR model with positive semidefinite constraints. The above LRR model has been successfully applied to motion clustering, face recognition, saliency detection, and image segmentation. Generally speaking, matrix reconstruction is divided into matrix recovery and matrix completion. The former focuses on recovering an accurate matrix when some data are seriously corrupted, as in the RPCA and LRR models mentioned above; the latter focuses on how to fill in missing data when the data are incomplete, as in the Netflix recommendation system [48]. The nuclear norm model of matrix completion usually takes the following form:

\min_{X} \|X\|_*, \quad \text{s.t.} \; P_U(X) = P_U(Z) \qquad (4.5)

Its Lagrangian form is:

\min_{X} \; \mu \|X\|_* + \frac{1}{2} \|P_U(X) - P_U(Z)\|_F^2 \qquad (4.6)

where \mu > 0 is the regularization parameter and P_U denotes the projection onto the set U of observed entries.

Many researchers have provided theoretical results, such as the guarantee that, under appropriate conditions, the low-rank matrix completion problem can be solved exactly via nuclear norm minimization [49]. In particular, Candès and Recht [50] proved that a rank-r matrix Z ∈ R^{n×n} satisfying certain incoherence conditions can be reconstructed with high probability through the nuclear norm model from on the order of C r n^{5/4} log n randomly sampled entries, where r is the rank of the low-rank matrix and C is a constant. Low-rank matrix completion algorithms can be roughly divided into three categories: (1) semidefinite programming-based algorithms, such as CVX [51]; (2) soft-thresholding operator-based algorithms, such as SVT [52], FPCA [53], and APG [54]; and (3) manifold optimization-based algorithms, such as OptSpace [55] and SET [56]. The LRR, RPCA, and MC models above all have to be solved iteratively, and each iteration requires a singular value decomposition (SVD) or eigenvalue decomposition (EVD) of a large matrix, so these algorithms always have high time complexity. Although some algorithms adopt partial singular value or eigenvalue decompositions, they all need an accurate estimate of the rank of the matrix, and estimating the rank of a matrix is still an open problem [57].
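To make the soft-thresholding family of matrix completion algorithms concrete, the following Python sketch implements a basic singular value thresholding iteration for problems (4.5)/(4.6): at each step the singular values of the current iterate are shrunk and the result is pulled toward the observed entries. It is a minimal sketch in the spirit of SVT rather than a faithful reproduction of the SVT algorithm of [52]; the threshold tau, the step size, and the fixed iteration count are assumptions of this example.

import numpy as np

def svt_complete(Z_obs, mask, tau=5.0, step=1.0, n_iter=300):
    # Z_obs: matrix holding the observed entries (arbitrary values elsewhere)
    # mask:  boolean matrix, True where an entry of Z is observed
    Y = np.zeros_like(Z_obs)
    for _ in range(n_iter):
        # shrink the singular values (proximal step for the nuclear norm)
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # gradient step that enforces agreement on the observed entries
        Y += step * (mask * (Z_obs - X))
    return X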

4.3 Semisupervised learning and kernel learning

4.3.1 Semisupervised learning

In recent years, with the continuous progress of machine learning and statistical learning techniques, semisupervised learning (SSL) has made great progress in both theory and application, and a large number of SSL methods have emerged [58,59]. According to their working mode, SSL methods can be divided into generative models, self-training, co-training, transductive SVMs (TSVMs), and graph-based methods. Among these, graph-based semisupervised learning is the most widely studied; it has a rich theoretical foundation, is closely related to kernel methods, sparse representation, and low-rank learning, and has achieved good performance in many fields such as text classification, digit recognition, music classification, and face recognition. According to their purpose, semisupervised learning algorithms can be divided into three categories: semisupervised classification, semisupervised clustering, and semisupervised regression. Semisupervised classification is the most studied problem in SSL; common algorithms include TSVMs [60,61], Gaussian random fields [62], local and global consistency [62], and manifold regularization [63,64]. Semisupervised clustering algorithms can be roughly divided into three categories. (1) Constraint-based semisupervised clustering algorithms, which generally use must-link (ML) and cannot-link (CL) constraints to guide the clustering process; typical algorithms are spectral learning [65] and affinity-propagation constrained clustering [66]. (2) Distance-based semisupervised clustering algorithms, which use pairwise constraints to learn a distance metric, changing the distances between samples so as to favor clustering, such as the distance metric learning method proposed by Xing et al. [67]. (3) Semisupervised clustering algorithms that combine constraints with distance metric learning, such as the method integrating constraints and metric learning proposed by Bilenko et al. [68].
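Since graph-based methods are highlighted above as the most widely studied branch of SSL, the following Python sketch shows a basic label propagation scheme in the spirit of the Gaussian random field and local and global consistency approaches: labels are spread over a Gaussian affinity graph until convergence. It is a minimal sketch; the affinity scale sigma, the propagation coefficient alpha, and the iterative (rather than closed-form) solution are assumptions of this example.

import numpy as np

def label_propagation(X, y, sigma=1.0, alpha=0.9, n_iter=100):
    # X: (n, d) samples; y: (n,) labels in {0..C-1}, with -1 marking unlabeled samples
    n = len(X)
    C = int(y.max()) + 1
    sq = np.sum(X**2, axis=1)
    W = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d) + 1e-12)        # symmetric normalization D^{-1/2} W D^{-1/2}
    Y0 = np.zeros((n, C))
    Y0[y >= 0, y[y >= 0]] = 1.0                    # one-hot matrix of the labeled samples
    F = Y0.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y0       # spread labels, then pull back to the seeds
    return F.argmax(axis=1)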

4.3.2 Nonparametric kernel learning

The kernel method has been one of the most active areas of machine learning, with representatives such as support vector machines (SVMs) [69] and kernel logistic regression (KLR). These methods are widely used in many practical problems and achieve good results. The kernel method uses the Mercer kernel trick to map the original data into a reproducing kernel Hilbert space; it has good generalization ability and strong nonlinear processing ability, and it includes algorithms such as KPCA and KLDA, obtained by kernelizing the linear methods mentioned above. The common kernel functions are the polynomial kernel K(x_i, x_j) = (x_i \cdot x_j + 1)^d and the Gaussian radial basis kernel K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)). Although the kernel method has very successful applications, it has some common disadvantages: it is difficult to select an appropriate kernel function and its corresponding parameters [70], and especially when the labeled samples are limited, cross-validation cannot effectively obtain the optimal parameters. In order to solve these problems, many kernel learning methods have been proposed, falling into two main categories: multiple kernel learning (MKL) and nonparametric kernel learning (NPKL). The former obtains the target kernel as a convex combination of many predefined base kernels, such as semidefinite programming kernel learning [71], two-stage kernel learning, and unsupervised MKL [72]. The latter directly learns a positive semidefinite kernel matrix from a small amount of supervised information such as labels or pairwise constraints, so that the learned kernel better describes the similarity between the data; examples include the order-constrained spectral kernel (OSK) [73], nonparametric kernels, low-rank kernel learning [74], and the transductive spectral kernel (TSK). Although MKL has been widely used in bioinformatics, image target recognition, and text classification, its target kernel can only be expressed as a weighted combination of base kernels, so this kind of method cannot handle problems such as heterogeneous patterns well:

K = \sum_{i=1}^{m} \alpha_i K_i, \quad \alpha_i \geq 0, \quad \sum_{i=1}^{m} \alpha_i = 1 \qquad (4.7)

where \alpha_i is the weighting coefficient of the i-th base kernel and m is the number of base kernels. The nonparametric kernel learning method can provide a more flexible kernel matrix for the data. The time complexity of learning the whole kernel matrix by semidefinite programming based on the standard interior point method is O(n^{6.5}) (where n is the number of data samples) [75,76]. In addition, there is a class of effective nonparametric kernel learning methods, such as OSK and TSK, which are obtained from the spectral embedding of the graph Laplacian. The model can be summarized as follows:

K = \sum_{i=1}^{m} \sigma(\lambda_i) \, \phi_i \phi_i^{\mathrm{T}} \qquad (4.8)

where \phi_i, i = 1, \ldots, m, are the eigenvectors corresponding to the m smallest eigenvalues of the graph Laplacian L, and \sigma(\cdot) is the spectral transformation operator of the kernel matrix K to be solved.

References
[1] Biederman I. Recognition-by-components: a theory of human image understanding. Psychological Review 1987;94(2):115–47.
[2] Ross DA, Zemel RS. Learning parts-based representations of data. Journal of Machine Learning Research 2006;7(Nov):2369–97.
[3] Paatero P, Tapper U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994;5(2):111–26.
[4] Wang YX, Zhang YJ. Nonnegative matrix factorization: a comprehensive review. IEEE Transactions on Knowledge and Data Engineering 2013;25(6):1336–53.
[5] Berry MW, Browne M, Langville AN, et al. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis 2007;52(1):155–73.
[6] Cichocki A, Zdunek R, Phan AH, et al. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons; 2009.
[7] Lin CJ. Projected gradient methods for nonnegative matrix factorization. Neural Computation 2007;19(10):2756–79.
[8] Kim D, Sra S, Dhillon IS. Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In: SIAM International Conference on Data Mining, April 26–28, 2007, Minneapolis, Minnesota, USA; 2007. p. 38–51.
[9] Cai D, He X, Han J, et al. Graph regularized non-negative matrix factorization for data representation. IEEE Transactions on Pattern Analysis & Machine Intelligence 2010;33(8):1548–60.
[10] Ding C, Li T, Peng W, et al. Orthogonal nonnegative matrix tri-factorization for clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD); 2006. p. 126–35.
[11] Liu HF, Wu ZH, Li XL, et al. Constrained nonnegative matrix factorization for image representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2012;34(7):1299–311.
[12] Ke Q, Kanade T. Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE; 2005. p. 739–46.
[13] Kong D, Ding C, Huang H. Robust nonnegative matrix factorization using l21-norm. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM; 2011. p. 673–82.
[14] Zhang S, Wang W, Ford J, et al. Learning from incomplete ratings using non-negative matrix factorization, vol. 6. SDM; 2006. p. 548–52.
[15] Wang H, Nie FP, Huang H, et al. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. IJCAI Proceedings-International Joint Conference on Artificial Intelligence 2011;22(1):1553.
[16] Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Review 2009;51(3):455–500.


[17] Ding CHQ, Li T, Jordan MI. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 2010;32(1):45–55.
[18] Schölkopf B, Smola A, Müller KR. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 1998;10(5):1299–319.
[19] Cai D, He XF, Han JW. Speed up kernel discriminant analysis. The VLDB Journal: The International Journal on Very Large Data Bases 2011;20(1):21–33.
[20] Simard P, LeCun Y, Denker JS. Efficient pattern recognition using a new transformation distance. Advances in Neural Information Processing Systems 1993:50–8.
[21] Vidal R, Ma Y, Sastry S. Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence 2005;27(12):1945–59.
[22] Liu GC, Lin ZC, Yan SC, et al. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2013;35(1):171–84.
[23] Lu L, Vidal R. Combined central and subspace clustering on computer vision applications. In: Proc. 23rd Int'l Conf. Machine Learning (ICML); 2006. p. 593–600.
[24] Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 1981;24(6):381–95.
[25] Derksen H, Ma Y, Hong W, et al. Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007;29(9):1546–62.
[26] Elhamifar E, Vidal R. Sparse subspace clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, 2009. IEEE; 2009. p. 2790–7.
[27] Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science 2000;290(5500):2319–23.
[28] Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science 2000;290(5500):2323–6.
[29] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. News in Physiological Sciences 2001;14:585–91.
[30] Zhang ZY, Zha HY. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. Journal of Shanghai University (English Edition) 2004;8(4):406–24.
[31] Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing 2007;17(4):395–416.
[32] He XF, Cai D, Yan SC, et al. Neighborhood preserving embedding. In: Tenth IEEE International Conference on Computer Vision (ICCV'05), vol. 2. IEEE; 2005. p. 1208–13.
[33] He XF, Niyogi P. Locality preserving projections (LPP). Advances in Neural Information Processing Systems 2002;16(1):186–97.
[34] Zhang DQ, Zhou ZH, Chen SC. Semi-supervised dimensionality reduction. SDM; 2007. p. 629–34.
[35] Nie FP, Xu D, Tsang IWH, et al. Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction. IEEE Transactions on Image Processing 2010;19(7):1921–32.
[36] Li HF, Jiang T, Zhang KS. Efficient and robust feature extraction by maximum margin criterion. IEEE Transactions on Neural Networks 2006;17(1):157–65.
[37] Yan SC, Xu D, Zhang BY, et al. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007;29(1):40–51.
[38] Hesterberg T, Choi NH, Meier L, et al. Least angle and l1 penalized regression: a review. Statistics Surveys 2008;2:61–93.
[39] Dong WS, Zhang L, Shi GM, et al. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing 2011;20(7):1838–57.
[40] Gao XB, Zhang KB, Tao DC, et al. Image super-resolution with sparse neighbor embedding. IEEE Transactions on Image Processing 2012;21(7):3194–205.
[41] Peng Y, Ganesh A, Wright J, et al. RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Transactions on Pattern Analysis and Machine Intelligence 2012;34(11):2233–46.

[42] Min KR, Zhang ZD, Wright J, et al. Decomposing background topics from keywords by principal component pursuit. In: ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October; 2010. p. 269–78.
[43] Zhou TY, Tao DC. GoDec: randomized low-rank & sparse matrix decomposition in noisy case. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11); 2011. p. 33–40.
[44] Zhang ZD, Ganesh A, Liang X, et al. TILT: transform invariant low-rank textures. International Journal of Computer Vision 2012;99(1):1–24.
[45] Lin Z, Chen M, Ma Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055; 2010.
[46] Lin ZC, Liu RS, Su ZX. Linearized alternating direction method with adaptive penalty for low-rank representation. In: Advances in Neural Information Processing Systems; 2011. p. 612–20.
[47] Ni YZ, Sun J, Yuan XT, et al. Robust low-rank subspace segmentation with semidefinite guarantees. In: 2010 IEEE International Conference on Data Mining Workshops. IEEE; 2010. p. 1179–88.
[48] Bennett J, Lanning S. The Netflix prize. In: Proceedings of KDD Cup and Workshop 2007; 2007. p. 35.
[49] Recht B, Fazel M, Parrilo PA. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 2010;52(3):471–501.
[50] Gong PH, Zhang CS. Efficient nonnegative matrix factorization via projected Newton method. Pattern Recognition 2012;45(9):3557–65.
[51] Grant M, Boyd S. CVX users' guide for CVX version 1.22. 2012.
[52] Cai JF, Candès EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 2010;20(4):1956–82.
[53] Ma SQ, Goldfarb D, Chen LF. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming 2011;128(1–2):321–53.
[54] Toh KC, Yun S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization 2010;6(3):615–40.
[55] Keshavan RH, Oh S. A gradient descent algorithm on the Grassmann manifold for matrix completion. arXiv preprint arXiv:0910.5260; 2009.
[56] Dai W, Milenkovic O, Kerman E. Subspace evolution and transfer (SET) for low-rank matrix completion. IEEE Transactions on Signal Processing 2011;59(7):3120–32.
[57] Kim H, Park H, Drake BL. Extracting unrecognized gene relationships from the biomedical literature via matrix factorizations. BMC Bioinformatics 2007;8(9):1.
[58] Chapelle O, Schölkopf B, Zien A, editors. Semi-supervised learning (2006) [Book reviews]. IEEE Transactions on Neural Networks 2009;20(3):542.
[59] Zhu XJ. Semi-supervised learning literature survey. Computer Science 2008;37(1):63–77.
[60] Joachims T. Transductive inference for text classification using support vector machines. In: ICML, vol. 99; 1999. p. 200–9.
[61] Chapelle O, Zien A. Semi-supervised classification by low density separation. In: AISTATS; 2005. p. 57–64.
[62] Zhu XJ, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML, vol. 3; 2003. p. 912–9.
[63] Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 2006;7(Nov):2399–434.
[64] Melacci S, Belkin M. Laplacian support vector machines trained in the primal. Journal of Machine Learning Research 2011;12(Mar):1149–84.
[65] Kamvar SD, Klein D, Manning CD. Spectral learning. In: International Joint Conference on Artificial Intelligence. Stanford InfoLab; 2003. p. 561–6.
[66] Lu Z, Carreira-Perpinan MA. Constrained spectral clustering through affinity propagation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008). IEEE; 2008. p. 1–8.
[67] Xing EP, Ng AY, Jordan MI, et al. Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems 2003;15:505–12.


[68] Bilenko M, Basu S, Mooney RJ. Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning; 2004. p. 81–8.
[69] Vapnik VN. Statistical learning theory. New York: Wiley; 1998.
[70] Zhuang JF, Tsang IW, Hoi SCH. A family of simple non-parametric kernel learning algorithms. Journal of Machine Learning Research 2011;12(Apr):1313–47.
[71] Lanckriet GRG, Cristianini N, Bartlett P, et al. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 2004;5(Jan):27–72.
[72] Zhuang JF, Wang JL, Hoi SCH, et al. Unsupervised multiple kernel learning. Journal of Machine Learning Research: Proceedings Track 2011;20:129–44.
[73] Zhu X, Kandola J, Ghahramani Z, et al. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems 2004:1641–8.
[74] Shang FH, Jiao LC, Wang F. Semi-supervised learning with mixed knowledge information. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2012. p. 732–40.
[75] Li ZG, Liu JZ, Tang XO. Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In: Proceedings of the 25th International Conference on Machine Learning. ACM; 2008. p. 576–83.
[76] Hoi SCH, Jin R, Lyu MR. Learning nonparametric kernel matrices from pairwise constraints. In: Machine Learning, Proceedings of the Twenty-Fourth International Conference; 2007. p. 361–8.

CHAPTER 5

Theoretical basis of compressive sensing

Chapter Outline
5.1 Sparse representation
    5.1.1 Stationary dictionary
    5.1.2 Learning dictionary
5.2 Compressed observation
5.3 Sparse reconstruction
    5.3.1 Relaxation methods
    5.3.2 Greedy methods
    5.3.3 Natural computation methods
    5.3.4 Other methods
References

Compressed sensing is considered a new paradigm for signal acquisition, representation, and processing. It not only makes people re-examine existing signal processing methods and technology, but also brings a wealth of new ideas about signal acquisition and processing, which greatly promotes the combination of mathematical theory and engineering application [1] and plays an important role in the processing of large and complex data. Research on compressed sensing gained attention from the work of Candès, Romberg, and Tao [2,3] and Donoho [4]; the classical compressive sensing theory they put forward shows that signals that are sparse, or can be sparsely represented, can be recovered accurately from small-scale, nonadaptive compressive observations. The compressed sensing framework mainly includes three parts: sparse representation, compressed observation, and reconstruction models and methods. Among them, the sparsity of the signal and its sparse representation are the basic requirements and premises of compressed sensing; compressed observation theory and acquisition technology are its basis; and the reconstruction model and reconstruction method are its core contents.

5.1 Sparse representation

Sparsity and sparse representation are the preconditions and prerequisites of compressed sensing. An n-dimensional signal with sparsity can be expressed as:

x = Ds \qquad (5.1)

or:

\|x - Ds\|_2 \leq \varepsilon \qquad (5.2)

When the signal x is sparse, i.e., x has only K (K << n) nonzero elements or can be approximately represented by its own K nonzero elements, D is the identity matrix and the signal x is called a K-sparse or K-compressible signal. K is called the sparsity of the signal x [5] and is a measure of how sparse the signal is. The classical theory of compressed sensing mainly studies the case in which the signal x is a sparse or compressible signal, D is an orthogonal matrix (D ∈ R^{n×n}), and the representation coefficient vector s is sparse or compressible [3,6,7]. In compressed sensing theory, the information contained in a sparse signal can be measured by its sparsity. Therefore, in applications of compressed sensing, the sparsity is closely related to the sampling rate and to recovery, which differs from traditional sampling, where the sampling rate is tied to the bandwidth of the signal and the Nyquist frequency. In traditional sampling, the higher the highest frequency of the signal, the higher the uniform sampling frequency must be; in compressed sensing, the sparser the signal, the fewer compressive observations are needed to reconstruct it accurately. Therefore, in practical applications of compressed sensing, the first step is to find or obtain the signal's sparsity or sparse representation. Orthogonal transform analysis and sparse dictionaries are commonly used to obtain the sparse representation of a signal. The traditional sparse representation is obtained by decomposing the signal over a complete set of orthogonal basis functions, for example, the Fourier transform, the discrete cosine transform [8], and the wavelet transform [9]. But an orthogonal basis has no redundancy, is unstable, and is sensitive to errors. For sparse representations based on frames, the frame has a certain degree of redundancy and there are correlations between the basis functions, so the computation is relatively stable. However, experiments in signal processing and harmonic analysis show that an overcomplete dictionary can in general achieve better sparsity than a single orthogonal basis or frame [10]. The basic idea of signal sparse representation based on an overcomplete dictionary was first proposed in 1993 by Mallat [11]. Olshausen and Field argued that natural images have sparse structure and that representing images over an overcomplete dictionary accords with the working principle of the V1 region of human visual perception [12,13]. In general, the number of atoms in the dictionary is far larger than the dimension of the signal; that is to say, in Formulas (5.1) and (5.2), the sparse dictionary D is a rectangular matrix, i.e., D ∈ R^{n×n'} with n' >> n.

\beta_k = \begin{cases} (2u)^{1/(\eta_c + 1)}, & \text{if } u(0,1) \leq 0.5 \\ \left[2(1 - u)\right]^{-1/(\eta_c + 1)}, & \text{if } u(0,1) > 0.5 \end{cases} \qquad (6.11)

In Formula (6.11), a_ik and a_jk (i ≠ j, k = 1, ..., n) are the k-th dimensions of individuals i and j, respectively, u and r are random numbers in [0, 1], and η_c is a distribution index.

v'_k = \begin{cases} v_k + \delta (u_k - v_k), & \text{if } r(0,1) \leq 0.5 \\ v_k - \delta (v_k - l_k), & \text{if } r(0,1) > 0.5 \end{cases}, \quad \text{where } \delta = 1 - r^{(1 - it/T)^{\lambda}} \qquad (6.12)

In Formula (6.12), v_k (k = 1, ..., n) is the k-th dimension of an individual, u_k and l_k are the upper and lower boundaries of this dimension, respectively, it is the current generation number, T is the maximum generation number, and λ is a parameter that tunes the area of local search, usually ranging from 2 to 5.
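The following Python sketch shows how the crossover spread factor of Formula (6.11) and the nonuniform mutation of Formula (6.12) can be realized for real-coded individuals. It is a minimal sketch under the usual simulated binary crossover convention that the two offspring are formed as 0.5[(1 + β)a_i + (1 − β)a_j] and 0.5[(1 − β)a_i + (1 + β)a_j]; that combination step, the default parameter values, and the random number generator are assumptions of this example.

import numpy as np

rng = np.random.default_rng()

def sbx_crossover(a_i, a_j, eta_c=15.0):
    # spread factor beta_k of Formula (6.11), drawn independently per dimension
    u = rng.random(len(a_i))
    beta = np.where(u <= 0.5,
                    (2 * u) ** (1.0 / (eta_c + 1)),
                    (2 * (1 - u)) ** (-1.0 / (eta_c + 1)))
    child1 = 0.5 * ((1 + beta) * a_i + (1 - beta) * a_j)
    child2 = 0.5 * ((1 - beta) * a_i + (1 + beta) * a_j)
    return child1, child2

def nonuniform_mutation(v, lower, upper, it, T, lam=3.0):
    # Formula (6.12): the perturbation shrinks as the generation counter it approaches T
    r = rng.random(len(v))
    delta = 1 - r ** ((1 - it / T) ** lam)
    go_up = rng.random(len(v)) <= 0.5
    return np.where(go_up, v + delta * (upper - v), v - delta * (v - lower))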


6.3 Learning simultaneous adaptive clustering and classification via MOEA

In this section, we introduce multiobjective evolutionary algorithms into simultaneous clustering and classification (denoted as MOASCC). The final goal is to enhance the performance of classification through the cooperation of clustering and classification. In order to achieve this goal, two objective functions, a fuzzy clustering connectedness function and the classification error rate, are adopted. Furthermore, a specific mutation operator is designed to make use of the feedback from both clustering and classification. We give a detailed description of MOASCC in this section, including its objective functions, its framework, and its computational complexity and convergence analysis.

6.3.1 Objective functions of MOASCC

In order to optimize clustering learning and classification learning simultaneously, MOASCC uses two objective functions: a clustering objective function and a classification objective function. Given a dataset of size N whose number of classes is M, and assuming that it can be partitioned into K clusters during the optimization process, the adopted objective functions are as follows. In terms of clustering, MOASCC designs an objective function called fuzzy cluster connectedness to measure the quality of clustering. This objective function is based on the assumption that a sample and its neighbors tend to belong to the same cluster, so the connectedness between different clusters should be minimized. In the objective function f1 [see Formula (6.13)], L is a parameter controlling the number of neighbors that contribute to the overall fuzzy connectedness, and nn_ij represents the j-th nearest neighbor of sample x_i. t_{i,nn_ij} is the connectedness between sample x_i and nn_ij; a decreasing value 1/j, which gives more weight to nearer neighbors, is assigned to it if samples x_i and nn_ij lie in different clusters [31]. p(c_k | nn_ij) represents the probability of sample nn_ij belonging to cluster c_k. For each sample x_i, \sum_{j=1}^{L} t_{i,nn_{ij}} \cdot p(c_k | nn_{ij}) is the fuzzy connectedness between sample x_i and the clusters to which x_i does not belong. If all the L nearest neighbors of sample x_i belong to the same cluster as x_i, its fuzzy connectedness is 0; otherwise, 1/j \cdot p(c_k | nn_{ij}) is added for the j-th nearest neighbor as a penalty term.

f_1 = \sum_{i=1}^{N} \left( \sum_{j=1}^{L} t_{i,nn_{ij}} \cdot p(c_k \mid nn_{ij}) \right), \quad \text{where } t_{i,nn_{ij}} = \begin{cases} 0, & \text{if } \exists c_k : x_i \in c_k \wedge nn_{ij} \in c_k \\ 1/j, & \text{otherwise} \end{cases} \qquad (6.13)

In MOASCC, three methods [see Formulas (6.14), (6.15), and (6.16)] are designed to calculate p(c_k | x_i). In Formula (6.14), p(c_k | x_i) is defined as the proportion of the L nearest neighbors of sample x_i that belong to the k-th cluster. This approach makes no strict assumption about the underlying structure of the dataset, which is also the reason why MOASCC uses it in the later experiments.

p(c_k \mid x_i) = \frac{\sum_{j=1}^{L} s(nn_{ij}, c_k)}{L}, \quad \text{where } s(a, c_k) = \begin{cases} 1, & a \in c_k \\ 0, & a \notin c_k \end{cases} \qquad (6.14)

p(c_k \mid x_i) = \frac{1/\min\|x_i - x\|}{\sum_{k=1}^{K} \left(1/\min\|x_i - x\|\right)}, \quad \forall x \in c_k \qquad (6.15)

p(c_k \mid x_i) = \frac{1/\|x_i - \mathrm{center}_k\|}{\sum_{k=1}^{K} \left(1/\|x_i - \mathrm{center}_k\|\right)}, \quad \mathrm{center}_k = \frac{1}{|c_k|} \sum_{j=1}^{|c_k|} x_j, \; x_j \in c_k \qquad (6.16)

In Formulas (6.15) and (6.16), \|x_i - x_j\| denotes the Euclidean distance from sample x_i to x_j, and |c_k| represents the number of samples in cluster c_k. Both methods adopt the Euclidean distance to calculate p(c_k | x_i). In Formula (6.15), p(c_k | x_i) is determined by the minimum Euclidean distance from sample x_i to the samples in cluster c_k, so it is unbiased with respect to the structure of the given dataset. In Formula (6.16), however, p(c_k | x_i) is decided by the Euclidean distance between x_i and the center of cluster c_k (denoted center_k); the downside of this method is that it is biased toward spherically shaped clusters. Note that in Formulas (6.15) and (6.16), if \min\|x_i - x\| = 0 or \|x_i - \mathrm{center}_k\| = 0, then p(c_k | x_i) is set to 1. In terms of classification, MOASCC employs an objective function adopted in Ref. [33] and associates it with clustering through Bayesian theory. f2 [see Formula (6.17)] is the classification objective function; it represents the classification error rate on the training samples.

f_2 = \sum_{i=1}^{N_{tr}} \frac{d(l(x_i), y_i)}{N_{tr}}, \quad \text{where } d(a, b) = \begin{cases} 0, & a = b \\ 1, & a \neq b \end{cases} \qquad (6.17)

In Formula (6.17), N_tr is the number of training samples, y_i is the true class label of sample x_i, and l(x_i) is the predicted class label of x_i. If l(x_i) differs from the true class label, d(l(x_i), y_i) = 1 and the classification error rate increases.

l(x_i) = \arg\max_{1 \leq m \leq M} p(w_m \mid x_i) \qquad (6.18)


Formula (6.18) gives the calculation of l(x_i): the posterior probability p(w_m | x_i) determines the output label of sample x_i. p(w_m | x_i) represents the probability of sample x_i belonging to class w_m and is calculated as in Formula (6.19). Bayesian theory is used to construct the relationship between clustering and classification in Formula (6.19), where p(w_m | c_k) is the probability that samples in cluster c_k belong to class w_m. p(c_k | x_i) is obtained according to Formulas (6.14), (6.15), and (6.16), while p(w_m | c_k) is calculated as in Formula (6.20).

p(w_m \mid x_i) = \sum_{k=1}^{K} p(c_k \mid x_i) \, p(w_m \mid c_k) \qquad (6.19)

p(w_m \mid c_k) = \frac{|c_k \cap w_m|}{|c_k|} \qquad (6.20)

In Formula (6.20), |c_k \cap w_m| denotes the number of samples belonging to both cluster c_k and class w_m. All the p(w_m | c_k) constitute a relation matrix

P = \begin{bmatrix} p(w_1 \mid c_1) & \cdots & p(w_M \mid c_1) \\ \vdots & & \vdots \\ p(w_1 \mid c_K) & \cdots & p(w_M \mid c_K) \end{bmatrix}

whose size is K × M; it plays an important role in discovering the structure of the given dataset. For each row vector of the relation matrix, \sum_{m=1}^{M} p(w_m \mid c_k) = 1, and the row shows how the samples in cluster c_k are distributed over the classes. If and only if a single nonzero value p(w_m | c_k) = 1 exists in a row vector do all the samples in cluster c_k belong to the same class; therefore, the number and values of the nonzero elements reveal the quality of the clustering. For each column vector of the relation matrix, the number of nonzero entries indicates how the corresponding class is distributed: if there is more than one nonzero element, the class scatters across different clusters. Hence, the relation matrix clearly shows the relationship between clustering and classification.
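To tie Formulas (6.14) and (6.18)–(6.20) together, the following Python sketch computes the neighbor-based membership p(c_k | x_i), builds the K × M relation matrix from the labeled samples, and predicts a class label for every sample from the resulting posterior. It is a minimal sketch of the equations as stated, not the authors' implementation; the array layout, the value of L, and the use of -1 to mark unlabeled samples are assumptions of this example.

import numpy as np

def predict_labels(X, cluster_of, y, L=10):
    # X: (N, d) samples; cluster_of: (N,) cluster index of each sample;
    # y: (N,) class labels, with -1 marking unlabeled (test) samples
    N = len(X)
    K = cluster_of.max() + 1
    M = y.max() + 1
    # p(c_k | x_i): fraction of the L nearest neighbors of x_i lying in cluster k   (Formula 6.14)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P_cx = np.zeros((N, K))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:L + 1]
        for k in range(K):
            P_cx[i, k] = np.mean(cluster_of[nbrs] == k)
    # relation matrix p(w_m | c_k), estimated from the labeled samples only          (Formula 6.20)
    P_wc = np.zeros((K, M))
    for i in np.where(y >= 0)[0]:
        P_wc[cluster_of[i], y[i]] += 1
    P_wc /= np.maximum(P_wc.sum(axis=1, keepdims=True), 1)
    # posterior p(w_m | x_i) = sum_k p(c_k | x_i) p(w_m | c_k), then argmax     (Formulas 6.18, 6.19)
    return (P_cx @ P_wc).argmax(axis=1)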

6.3.2 The framework of MOASCC

A number of MOEAs have been proposed for multiobjective optimization problems in recent years. MOASCC chooses NSGA-II to optimize clustering and classification because of its popularity and effectiveness. The whole procedure of MOASCC is summarized in Algorithm 6.1. MOASCC uses the locus-based adjacency representation [52–55] for encoding during the optimization process. In this representation scheme, each individual consists of N genes {g1, g2, ..., gN}, and the value of each gene gi lies in the range {1, 2, ..., N}. If gi is assigned the value j, sample xi is connected to sample xj. In the decoding process, all the connected samples are grouped into one component.

Algorithm 6.1 The pseudocode of MOASCC
Require: The size of the population: pop; the number of evolutionary generations: gen; the probability of crossover: pc; the probability of mutation: pm; the test dataset: dataset.
Ensure:
1: Initialization: select the training samples randomly; generate an MST for the given dataset; apply the initialization scheme to form the initial population G1; decode each individual to find the number of clusters and evaluate the two objective functions.
2: for t = 1:gen do
3:   Apply uniform crossover and the proposed mutation to the current population Gt and generate new individuals: newgeno.
4:   Decode newgeno and evaluate their objective function values.
5:   Combine Gt and newgeno, and perform nondominated sorting to assign a front-level rank to each individual.
6:   Select pop solutions for the next population Gt+1 according to their rank and crowding distance.
7: end for
8: Select the nondominated solutions into non_genotype, and decode these nondominated solutions.
9: Find the solution with the best ARI value among all the nondominated solutions, and select it as the final solution.
10: Assign every test sample a class label according to Formulas (6.18) and (6.19) and calculate the classification accuracy of the final solution.
Output: Classification accuracy: accuracy; number of clusters: clusters.


The number of separated components determines the number of clusters. For example, given an individual encoded as {2, 3, 1, 5, 5}, decoding shows that the samples are divided into two clusters: the first cluster contains samples {1, 2, 3}, and the remaining samples belong to the other cluster. In order to generate a group of high-quality individuals in the initialization step, a minimum spanning tree (MST) is created. The algorithm uses the Euclidean distance to measure the similarity of samples and adopts Prim's algorithm [56] to build the MST, with the cost of an edge defined as the Euclidean distance between the two samples it connects. In the initialization step, removing edges of the MST (by modifying the value of gi to i) produces different partitions, and whether the edge between two samples should be removed depends on its cost. In a population with pop individuals (see Algorithm 6.2), if the number of individuals is less than the number of samples, each individual is represented as a graph obtained by removing the j-th most expensive edge from the MST; otherwise, another edge is removed at random from one of the first N − 1 individuals to obtain a new individual. After decoding, the initial population yields different partitions with at most three clusters, and in the subsequent evolutionary process the crossover and mutation operators generate diverse solutions with different numbers of clusters. The advantages of the locus-based adjacency representation are: (1) it determines the number of clusters automatically instead of requiring it to be set in advance; and (2) it produces a set of individuals with different partitions in a single run of the MOEA. Uniform crossover is adopted to generate new individuals in the crossover step. Suppose A1 = {a11, ..., a1i, ..., a1N} and A2 = {a21, ..., a2i, ..., a2N} are two individuals selected for uniform crossover; the offspring individual B = {b1, ..., bi, ..., bN} is decided by a mask = {m1, ..., mi, ..., mN} with mi ∈ {0, 1}. When mi = 0, bi = a1i; otherwise, bi = a2i. Uniform crossover gives the chosen parents an unbiased chance and produces a new individual that inherits much of the structure of its parents while differing from both of them. A decoding sketch for this representation is given below, before Algorithm 6.2.
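The decoding step of the locus-based adjacency representation amounts to finding connected components in the graph defined by the genes. The following Python sketch decodes an individual with a simple union-find structure; the 1-based gene values follow the book's example {2, 3, 1, 5, 5}, while the union-find implementation itself is an assumption of this sketch.

def decode(genotype):
    # genotype[i] = j (1-based) means that sample i+1 is connected to sample j
    n = len(genotype)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for i, j in enumerate(genotype):
        ri, rj = find(i), find(j - 1)       # convert the gene value to a 0-based index
        if ri != rj:
            parent[ri] = rj
    roots = [find(i) for i in range(n)]
    labels = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [labels[r] for r in roots]       # cluster index of every sample

# The example from the text, {2, 3, 1, 5, 5}, decodes into two clusters:
print(decode([2, 3, 1, 5, 5]))              # [0, 0, 0, 1, 1]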

Algorithm 6.2 Initialization
1: for j = 1:pop do
2:   if j < N then
3:     Remove the j-th most expensive edge from the MST to generate an initial individual genotypej;
4:   else
5:     Select one individual genotypei (i < N) randomly, and remove one edge of genotypei randomly to obtain a new individual genotypej.
6:   end if
7: end for
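As an illustration of Algorithm 6.2, the sketch below builds the MST with SciPy, orients it so that each gene points to its parent, and creates initial individuals by cutting the j-th most expensive edge; it is a simplified reading of the scheme (for instance, it cuts among the N - 1 MST edges rather than testing j < N), and all function names are assumptions rather than the authors' code.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order
from scipy.spatial.distance import cdist

def init_population(data, pop):
    # Euclidean distances serve as edge costs of the complete graph
    n = len(data)
    dist = cdist(data, data)
    mst = minimum_spanning_tree(dist).toarray()
    # orient the MST as a rooted tree: genotype[i] = parent of sample i
    _, parents = breadth_first_order(mst + mst.T, i_start=0, directed=False)
    base = np.where(parents < 0, np.arange(n), parents)    # the root links to itself
    # MST edges (cost, child), sorted from most to least expensive
    edges = sorted(((dist[i, base[i]], i) for i in range(n) if base[i] != i), reverse=True)
    population = []
    for j in range(pop):
        geno = base.copy()
        if j < len(edges):
            geno[edges[j][1]] = edges[j][1]                 # cut the j-th most expensive edge
        else:
            geno = population[np.random.randint(len(edges))].copy()
            i = np.random.randint(n)
            geno[i] = i                                     # cut one more edge at random
        population.append(geno)
    return population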

In order to make use of the feedback drawn from clustering and classification, MOASCC proposes a specific mutation scheme, whose procedure can be seen in Algorithm 6.3. In this scheme, the probability p(ck|xi) is considered to decide whether to mutate gi or not. After decoding, if sample xi is assigned to the cluster cK1 but the probability p(ck|xi) shows that sample xi has the greatest membership in the cluster cK2, then xi will mutate to connect with a random sample in the cluster cK2. Note that the quantity marked with the symbol "*" is calculated in the function evaluation step and does not need to be calculated here again. According to the proposed mutation scheme, if sample xi belongs to the training samples and d(l(xi), yi) = 1, then it will mutate to connect with a training sample with the same label. Since an MOEA obtains a set of solutions with different numbers of clusters, how to select a reasonable solution is a problem to be solved. In Ref. [57], a measurement called the adjusted Rand index (ARI) is proposed; in MOASCC, the authors use it to select the final optimal solution from the Pareto front:

ARI = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \Big[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\Big]\Big/\binom{n}{2}}{\frac{1}{2}\Big[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\Big] - \Big[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\Big]\Big/\binom{n}{2}}   (6.21)

Algorithm 6.3 Mutation
1: for every individual genotypej in the current population do
2:   Generate a uniform random number rand ∈ [0, 1];
3:   if rand < pm then
4:     for i = 1:N do
5:       Find the cluster sample xi belongs to (suppose cK1), calculate p(cK1|xi), and find the cluster with the highest probability (suppose cK2). (*)
6:       if K1 ≠ K2 then
7:         Mutate genotypeij to connect with a random sample in the cluster cK2.
8:       end if
9:       if sample xi is a training sample and d(l(xi), yi) = 1 then
10:        Mutate genotypeij to connect with a randomly selected sample with the same label.
11:      end if
12:    end for
13:  end if
14: end for


This index measures the similarity between two partitions. Suppose there are two different partitions U and V: nij represents the number of samples that are in both the i-th class of partition U and the j-th class of partition V, n_{i·} is the number of samples in the i-th class of partition U, and n_{·j} is the number of samples in the j-th class of partition V. In MOASCC, the two partitions U and V correspond to the real classification of the training examples and the clustering result obtained from MOASCC, respectively. Finally, the solution with the highest similarity between the real partition and the clustering result is selected as the output. Compared with Formula (6.20), nij equals p(wi|cj)·|cj|, which is also the reason why MOASCC chooses this measurement to select the final Pareto optimal solution. Since MOASCC is a simultaneous clustering and classification algorithm, all the samples (both training samples and test samples) are clustered together. It can calculate the fuzzy membership of all the samples and obtain the relation matrix from the training samples. The relation matrix not only reflects the relationship between clustering and classification, but can also be used in the prediction of the test samples. In the classification process, we predict the class labels of the test samples according to Formulas (6.18) and (6.19).
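As a concrete illustration of Formula (6.21), the following sketch computes the ARI from the contingency table n_ij of two label vectors; the function name and the use of Python's math.comb are assumptions made for this example.

import numpy as np
from math import comb

def adjusted_rand_index(labels_u, labels_v):
    # build the contingency table n_ij between partitions U and V
    _, u = np.unique(labels_u, return_inverse=True)
    _, v = np.unique(labels_v, return_inverse=True)
    n_ij = np.zeros((u.max() + 1, v.max() + 1), dtype=int)
    for a, b in zip(u, v):
        n_ij[a, b] += 1
    n = len(labels_u)
    sum_ij = sum(comb(int(c), 2) for c in n_ij.ravel())
    sum_i = sum(comb(int(c), 2) for c in n_ij.sum(axis=1))   # n_i. terms
    sum_j = sum(comb(int(c), 2) for c in n_ij.sum(axis=0))   # n_.j terms
    expected = sum_i * sum_j / comb(n, 2)
    return (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)

print(adjusted_rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]))  # identical partitions -> 1.0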

6.3.3 Computational complexity
Given a dataset with size N and dimension D, the time complexity of evaluating one individual is O(N × max{LK, MK}), in which K ∈ {1, ..., Kmax} is the number of clusters. The complexity of the nondominated sorting is O(pop²). In the worst case, the algorithm requires O(gen × max{N·pop·L·Kmax, N·pop·M·Kmax, pop²}) computations. Note that some one-off computation is also required: before the initialization step, a similarity matrix and the MST are calculated, whose time complexities are O(N²D) and O(N). Finally, the complexity of the final selection is O(N² × nnondom), in which nnondom is the number of nondominated solutions. The authors [46] also gave a convergence analysis of MOASCC; please refer to Ref. [46] for details.

6.4 A sparse spectral clustering framework via MOEA
In this section, we introduce the last algorithm in detail: how to bring sparse representation into spectral clustering via MOEAs (denoted as SRMOSC) and how to extend it to semisupervised clustering, including its mathematical description, the specific operators designed for it, the Laplacian matrix construction method, and the tradeoff point selection phase.


6.4.1 Mathematical description of SRMOSC
For a dataset A = {a1, a2, ..., aN} with N samples to be reconstructed, considering both sparsity and reconstruction error, the similarity matrix construction in spectral clustering can be formulated as

\min_{x} \left\{ \|x\|_0,\; \|Ax - A\|_2^2 \right\}
\text{s.t.}\quad x_{ii} = 0,\quad x_{ij} \in [0, 1]   (6.22)

where x ∈ ℝ^{N×N} is the sparse matrix to be optimized, which is used for constructing the similarity matrix in spectral clustering. Since all the samples in the dataset are reconstructed at the same time, A is not only the overcomplete dictionary but also the measurement matrix. For any sample ai, the authors hope to reconstruct it with Ax_{:i} = \sum_{j=1}^{N} x_{ji} a_j; the constraint x_{ii} = 0 indicates that the sample ai is not used to reconstruct itself. In this way, all the samples in the dataset can be represented by other samples, and a sparsity matrix x is formed to reflect the relationship among all the samples. If x_{ij} is a nonzero entry, samples ai and aj are more likely to be assigned to the same cluster; otherwise, they may be in different clusters. It should be noted that the sparsity matrix x is not symmetric, so it still needs some transformation before it can be used in the spectral clustering algorithm. The procedure of spectral clustering can be seen in Algorithm 6.4. Algorithm 6.5 presents the framework of SRMOSC. Although we do not specify the MOEA in Algorithm 6.5 and

Algorithm 6.4 Unnormalized spectral clustering
Input: Dataset A, number of clusters K.
Begin:
Step 1: Construct the similarity matrix S.
Step 2: Compute the unnormalized Laplacian L.
Step 3: Compute the first K eigenvectors {u1, u2, ..., uK} of L.
Step 4: Construct a matrix Y ∈ ℝ^{N×K} whose column vectors are {u1, u2, ..., uK}. Let vi ∈ ℝ^K be the i-th row vector of Y.
Step 5: Cluster {v1, v2, ..., vN} with k-means into clusters {C1, ..., CK}.
End
Output: Clustering result A1, ..., AK with Ai = { j | vj ∈ Ci }.
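The steps of Algorithm 6.4 map directly onto a few lines of linear algebra; the sketch below uses SciPy and scikit-learn for the eigendecomposition and the k-means step, and is only an illustrative reading of the procedure, not the authors' implementation.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(S, K):
    D = np.diag(S.sum(axis=1))                     # Step 2: degree matrix
    L = D - S                                      # unnormalized Laplacian
    _, U = eigh(L, subset_by_index=[0, K - 1])     # Step 3: first K eigenvectors
    # Step 4: the rows of Y = U are the embedded points v_1, ..., v_N
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)   # Step 5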


Algorithm 6.5 Framework of SRMOSC
Input: dataset A = {a1, a2, ..., aN}; number of clusters: K; population size: pop; maximum number of iterations: gen; crossover probability: pc; mutation probability: pm.
Begin:
1: Step 1 Initialization: Generate the initial population P0 according to the initialization scheme.
2: Step 2 Cycle: Execute the MOEA and generate a set of Pareto solutions Pgen.
3: Step 3 Laplacian matrix construction: Construct a symmetric matrix according to each solution in Pgen and generate the corresponding graph Laplacian matrix L.
4: Step 4 Spectral clustering: Apply Steps 3–5 of Algorithm 6.4 to L.
5: Step 5 Trade-off point selection: Select a trade-off point xTO from the nondominated solutions in Pgen using the proposed selection approach.
End
Output: Clustering result A1, ..., AK with Ai = { j | vj ∈ Ci }.

any state-of-the-art MOEA can be used, such as the Pareto envelope-based selection algorithm II (PESA-II) [58] or the multiobjective evolutionary algorithm based on decomposition (MOEA/D) [38], SRMOSC uses the nondominated sorting genetic algorithm II (NSGA-II) [35] in its framework. Taking into account the nature of the problem to be solved, specific components tailored to it are designed. In particular, the algorithm develops a new initialization scheme, specific crossover and mutation operators, and a rule to choose the tradeoff solution from the final Pareto set in the selection phase. Before the selection phase, a preprocessing step that transforms the sparsity matrices in Pgen into symmetric matrices should be carried out, since the Laplacian matrix L needs to be symmetric in spectral clustering. The details of these components are described in the following sections.
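Before describing those components, the two objectives of model (6.22) can be evaluated with a few lines of code; the sketch below assumes A is stored as a D × N matrix whose columns are the samples, and the function name is an assumption for illustration.

import numpy as np

def srmos_objectives(x, A):
    # x : N x N coefficient matrix with x[i, i] = 0 and entries in [0, 1]
    sparsity = np.count_nonzero(x)            # ||x||_0
    recon_error = np.sum((A @ x - A) ** 2)    # ||Ax - A||_2^2
    return sparsity, recon_error

Both values are minimized simultaneously by the MOEA, which is why no weighting between the two objectives is needed.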

6.4.2 Extension on semisupervised clustering
If some labeled samples are available, the unsupervised clustering problem can be converted into a semisupervised clustering problem; the above model can then be extended to the semisupervised setting by adding constraints derived from the pairwise can-links and cannot-links. Suppose the labels tell us that sample ai ∈ Cm and sample aj ∈ Ck with m ≠ k. Then semisupervised spectral clustering can be modeled as

\min_{x} \left\{ \|x\|_0,\; \|A - Ax\|_2^2 \right\}
\text{s.t.}\quad x_{ii} = 0,\quad \sum_{k=1}^{K}\;\sum_{a_j \in C_k,\, a_i \notin C_k} x_{ij} = 0,\quad x_{ij} \in [0, 1]   (6.23)

where the constraint \sum_{k=1}^{K}\sum_{a_j \in C_k,\, a_i \notin C_k} x_{ij} = 0 guarantees that samples with different labels will not connect with each other. The reason why SRMOSC does not turn all the can-links into constraints is that doing so may add too many nonzero entries, since x is a sparse matrix. The cannot-link constraint may also be too hard to satisfy, in which case it can be relaxed as in the following model (6.24), in which the connection among different clusters is minimized:

\min_{x} \left\{ \|x\|_0,\; \|A - Ax\|_2^2,\; \sum_{k=1}^{K}\;\sum_{a_j \in C_k,\, a_i \notin C_k} x_{ij} \right\}
\text{s.t.}\quad x_{ii} = 0,\quad x_{ij} \in [0, 1]   (6.24)

SRMOSC adopts model (6.23) to optimize the semisupervised clustering problem. The initialization and mutation operators designed for it are applied more strictly so that this problem is solved properly, as will be discussed in the corresponding sections.

6.4.3 Initialization
In order to get a set of high-quality solutions, the authors design an initialization scheme based on the assumption that one sample prefers to be a linear combination of its neighbors. The procedure of the initialization scheme can be seen in Algorithm 6.6, where pop and N are the population size and the number of samples, respectively. For each sample ai, the distances between it and the remaining samples are sorted before initialization; they are called the "neighbor information" of sample ai. For an initialized individual x^l ∈ ℝ^{N×N}, suppose the l-th nearest neighbor of sample aj is am; the corresponding entry in the sparse matrix x^l is x^l_{mj}, and mod(l, N) is the remainder after the division l/N. Two cases are considered according to the sizes of the population and the dataset. Consider the reconstruction error |a_j − Ax_{:j}| = |a_j − \sum_{i=1}^{N} x_{ij} a_i| of sample aj. In the first case, each sample is reconstructed by its l-th nearest neighbor for the l-th individual, and the reconstruction error is |a_j − x_{mj} a_m|. When the population size

Multiobjective evolutionary algorithm (MOEA)-based sparse clustering

147

Algorithm 6.6 Initialization
1: for each individual x^l (l = 1:pop) do
2:   for each column x^l_{:j} of x^l do
3:     if l ≤ N then
4:       Find the l-th nearest neighbor am of sample aj, and set x^l_{mj} = rand (rand is a uniform random number in [0,1]).
5:     else
6:       x^l_{:j} = x^{mod(l,N)}_{:j}.
7:       Generate a uniform random integer r ∈ [1,N] \ {mod(l,N), j}, and set x^l_{rj} = rand.
8:       if aj is a labeled sample then
9:         Find all the labeled samples that have different labels from aj, and set the corresponding entries to 0.
10:        Randomly select a sample with the same label as aj, and set the corresponding entry in x^l to a nonzero value.
11:      end if
12:    end if
13:  end for
14: end for

exceeds the number of samples in the dataset, the l-th (l > N) individual is initialized as a sparse matrix in which each column vector contains two nonzero entries: one is inherited from the mod(l, N)-th individual, and the other is a uniformly and randomly selected entry subject to the constraints mentioned in Algorithm 6.6. Lines 8–11 are designed for semisupervised clustering; for each labeled sample, they mean that: (1) all the entries that reflect the relationship between labeled samples in different clusters are set to 0; and (2) a labeled sample in the same cluster is randomly selected and the corresponding entry in the sparse matrix x is set to a nonzero value. In this way, the authors hope to obtain a set of diverse solutions.

6.4.4 Crossover
Considering the different effects of nondominated and dominated individuals, SRMOSC designs a crossover strategy (Algorithm 6.7) that includes two different cases. Case 1 makes use of the nondominated individuals in the current population and implements a uniform crossover on each column vector of the current individual and a uniformly selected nondominated individual. It should be noted that different column vectors of the newly generated offspring are obtained from different nondominated solutions. With this process we hope to obtain a set of high-quality offspring under the guidance of the

Algorithm 6.7 Crossover
1: for each individual x^l (l = 1, ..., ncr) to implement the crossover operator do
2:   Generate a uniform random number α ∈ [0,1].
3:   if α > 0.5 then
4:     % case 1:
5:     for j = 1:N do
6:       Choose a nondominated solution y in the current population uniformly at random.
7:       Implement uniform crossover on the j-th column vectors x^l_{:j} and y_{:j}.
8:     end for
9:   else
10:    % case 2:
11:    Choose a solution z in the current population uniformly at random and generate a uniform random value β ∈ [0,1].
12:    x^l = βx^l + (1 − β)z;
13:  end if
14: end for

nondominated individuals. Contrary to case 1, case 2 simply uses two individuals to produce an offspring by intermediate crossover; the idea is to preserve the structure of the parents by using the whole information of the current individual and a randomly selected individual. Note that the number of nonzero entries of the offspring is then greater than that of the parents.
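A minimal sketch of the two crossover cases in Algorithm 6.7 is given below; nondominated and population are assumed to be lists of N × N matrices, and the random-number handling is illustrative rather than the authors' exact scheme.

import numpy as np

def crossover(x, nondominated, population, rng=np.random.default_rng()):
    N = x.shape[0]
    if rng.random() > 0.5:
        # case 1: column-wise uniform crossover guided by nondominated solutions
        child = x.copy()
        for j in range(N):
            y = nondominated[rng.integers(len(nondominated))]
            mask = rng.random(N) < 0.5
            child[mask, j] = y[mask, j]
        return child
    # case 2: intermediate crossover with a randomly chosen individual
    z = population[rng.integers(len(population))]
    beta = rng.random()
    return beta * x + (1.0 - beta) * z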

6.4.5 Mutation
Taking into account the sparsity property of x, the mutation operator applies a different strategy to entries that have a value of 0 and to entries that are different from 0. This strategy is based on the same assumption as the initialization. Suppose ai is the k-th (k = 1, ..., N−1) nearest neighbor of aj, and that γ and rand are chosen uniformly at random in [0, 1]. Then if xij is a nonzero entry, the probability that it mutates to zero is set to k/N; otherwise, the zero entry mutates to a nonzero value with probability 1 − k/N. The proposed mutation scheme considers that the nearer sample ai is to aj, the higher the probability that the corresponding entry in x is a nonzero value. For samples that are far away from each other, opportunities are still given to them to reconstruct each other. The procedure of the mutation operator can be seen in Algorithm 6.8. In this scheme, there are nmu individuals that implement this operator, and we can simply execute Formula (6.25) for unsupervised clustering. But for semisupervised clustering, the prior knowledge obtained from labeled data should be taken into consideration. Taking into account that the


Algorithm 6.8 Mutation
1: for l = 1:nmu do
2:   for j = 1:N do
3:     if sample aj is nonlabeled data then
4:       for i = 1:N do
5:         Execute the proposed mutation scheme according to Formula (6.25).
6:       end for
7:     else
8:       Randomly select a sample with the same label, and set the corresponding entry in x^l to rand.
9:       Find all the samples with different labels, and set the corresponding entries in x^l to 0.
10:    end if
11:  end for
12: end for

matrix x to be optimized is a sparse matrix, SRMOSC randomly selects a sample with the same label for each labeled sample and sets the corresponding entry in x to a nonzero value; all the constraints imposed by labeled data with different labels are strictly satisfied in this scheme. In this way, a set of high-quality feasible solutions may be generated.

x_{ij} =
\begin{cases}
0, & rand \le \frac{k}{N} \text{ and } x_{ij} \ne 0 \\
x_{ij}\cdot\gamma, & rand > \frac{k}{N} \text{ and } x_{ij} \ne 0 \\
\gamma, & rand \le 1 - \frac{k}{N} \text{ and } x_{ij} = 0 \\
0, & rand > 1 - \frac{k}{N} \text{ and } x_{ij} = 0
\end{cases}   (6.25)
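The case analysis of Formula (6.25) can be coded entry by entry as in the sketch below, where a_i is assumed to be the k-th nearest neighbor of a_j; the function name is illustrative.

import numpy as np

def mutate_entry(x_ij, k, N, rng=np.random.default_rng()):
    rand, gamma = rng.random(), rng.random()
    if x_ij != 0:
        # a nonzero entry mutates to zero with probability k/N
        return 0.0 if rand <= k / N else x_ij * gamma
    # a zero entry mutates to a nonzero value with probability 1 - k/N
    return gamma if rand <= 1 - k / N else 0.0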

6.4.6 Laplacian matrix construction
The sparse matrix x obtained from the MOEA is not symmetric, so it has to be transformed into a symmetric matrix before the subsequent spectral clustering steps can be applied. A simple method that completes this transformation is

s_{ij} = \max(x_{ij}, x_{ji})   (6.26)

d_{ij} = \begin{cases} 0, & i \ne j \\ \sum_{m=1}^{N} s_{im}, & i = j \end{cases}   (6.27)

where s_{ij} is the corresponding entry of the similarity matrix S ∈ ℝ^{N×N}, and D ∈ ℝ^{N×N} is a diagonal matrix with diagonal element d_{ii}, which is the sum of the i-th column of the similarity matrix S. In this way, it can be ensured that the Laplacian matrix

L = D − S   (6.28)

is symmetric and positive semidefinite.
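Formulas (6.26)–(6.28) amount to one symmetrization and one diagonal sum; a minimal sketch (illustrative names only) is:

import numpy as np

def laplacian_from_sparse_code(x):
    S = np.maximum(x, x.T)        # s_ij = max(x_ij, x_ji), Formula (6.26)
    D = np.diag(S.sum(axis=0))    # d_ii = sum_m s_im, Formula (6.27)
    return D - S                  # L = D - S, Formula (6.28)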

6.4.7 Final solution selection phase
In the final step of the algorithm, a tradeoff point should be selected from the set of Pareto optimal solutions. In Ref. [59], a knee point of the PF, which is fitted by B-splines, is chosen as the final reconstruction result. SRMOSC does not adopt this strategy here, for the reason that the PF is not a smooth curve and there are no obvious knee regions or knee points. Instead, the algorithm uses a measurement called the ratio cut (RC) [60], which is defined as

RC = \sum_{i=1}^{K} \frac{L\left(V_i, \bar{V}_i\right)}{|V_i|}   (6.29)

Suppose a graph G = (V, E), where V is the set of vertices and E is the set of all edges in the graph. Given a partition in which all the vertices of V are divided into K nonempty sets V_1, ..., V_i, ..., V_K with \bigcup_{i=1}^{K} V_i = V and V_i ∩ V_j = ∅ for all i ≠ j, we write \bar{V}_i = V − V_i, and |V_i| is the number of vertices in V_i. L(V_i, \bar{V}_i) is defined as \sum_{i \in V_i,\, j \in \bar{V}_i} s_{ij}.

After implementing Step 4 in Algorithm 6.5, different partitions are obtained from the Pareto optimal solutions. In order to measure which one should be chosen as the final solution, a standard adjacency matrix is needed to calculate L(V_i, \bar{V}_i) in the ratio cut. In the process of constructing the standard adjacency matrix, all the nondominated solutions make the same contribution to it: once the entry x^{PF}_{ij} of any nondominated solution is a nonzero value, the corresponding entry of the standard adjacency matrix is set to 1. The procedure to construct the standard adjacency matrix Adj can be seen in Algorithm 6.9.
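The selection rule can be sketched as follows: build the standard adjacency matrix of Algorithm 6.9 from the nondominated solutions and evaluate Formula (6.29) for each candidate partition; the function names and the label encoding are assumptions made for this illustration.

import numpy as np

def standard_adjacency(pareto_solutions):
    # Algorithm 6.9: an entry is 1 if it is nonzero in any nondominated solution
    adj = np.zeros_like(pareto_solutions[0])
    for x in pareto_solutions:
        adj[x > 0] = 1
    return adj

def ratio_cut(adj, labels, K):
    # Formula (6.29): sum over clusters of the cut weight divided by the cluster size
    rc = 0.0
    for k in range(K):
        inside = labels == k
        if inside.any():
            rc += adj[np.ix_(inside, ~inside)].sum() / inside.sum()
    return rc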

6.4.8 Complexity analysis
1) Space complexity: The memory in our algorithm is used to store the distance ranking among all the samples and the population; their space complexities are O(N²) and O(pop·N²), respectively.


Algorithm 6.9 Standard adjacency matrix construction
1: for each nondominated solution do
2:   for each entry x^{PF}_{ij} do
3:     if x^{PF}_{ij} > 0 then
4:       Adj_{ij} ← 1.
5:     end if
6:   end for
7: end for

2) Time complexity: In this algorithm, the main time cost lies in the working cycle of the MOEA. The time complexities of initialization, crossover, mutation, and evaluation are O(pop·N²), O(ncr·N²), O(nmu·N²), and O(pop·N²), respectively, where ncr and nmu are the numbers of individuals that implement crossover and mutation. The time complexity of the update in each generation depends on the MOEA adopted; in the experiment, the time complexity of this step is O((2pop)²). Before initialization, a distance matrix among all the samples needs to be calculated, and then the distances between each sample and the remaining samples are sorted; the complexity of this step depends on the sorting algorithm. In Step 3 and Step 5 of SRMOSC (Algorithm 6.5), both time complexities are O(N²·nP), where nP is the number of Pareto solutions. The time complexity of Step 4 also depends on the method adopted to compute the first K eigenvectors. Hence, the total time complexity of SRMOSC is simplified as O(pop·gen·N²).

6.5 Experiments
This section presents experiments and analysis of the three algorithms.

6.5.1 The experiments of MOEA on constrained multiobjective optimization problems
As the algorithm in this section is built on the foundation of NSGA-II, with two methods added to handle constraints and to fix infeasible individuals, the contribution of these two components and their overall effect are shown in this section.
6.5.1.1 Experimental setup
In the paper [46], the proposed algorithm is compared with NSGA-II and the algorithm in the literature [51] (referred to as Woldesenbet's algorithm). Fourteen benchmark functions

Table 6.1: The characteristics of the test problems.

Test problem | Objective dims | Decision dims | Inequality | Equality | Linear | Nonlinear | Active
BNH          | 2 | 2  | 2 | 0 | 0 | 2 | 0
SRN          | 2 | 2  | 2 | 0 | 1 | 1 | 0
TNK          | 2 | 2  | 2 | 0 | 0 | 2 | 1
CONSTR       | 2 | 2  | 2 | 0 | 2 | 0 | 1
OSY          | 2 | 6  | 2 | 0 | 4 | 2 | 3
CTP1         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP2         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP3         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP4         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP5         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP6         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP7         | 2 | 10 | 1 | 0 | 0 | 1 | 0
CTP8         | 2 | 10 | 2 | 0 | 0 | 1 | 1
Welded beam  | 2 | 4  | 4 | 0 | 1 | 3 | 0

are adopted to test the performance of the proposed algorithm: BNH [61], SRN [62], TNK [63], CONSTR [64], OSY [65], Welded Beam [66], and CTP1–CTP8 [67]. The characteristics of these problems are summarized in Table 6.1. All the algorithms are run 30 times on the adopted test problems with a population size of 100, crossover rate 0.8, mutation rate 0.2, distribution index η_C = 15, and λ = 2. For a fair comparison, all the algorithms use an archive to store Pareto optimal solutions, and the archive size is set to 100. In order to select an appropriate number of evaluations, we chose CTP2 as a representative and examined how the IGD values change with the number of evaluations in Fig. 6.2, where the number of evaluations ranges from 10,000 to 100,000 in steps of 10,000.
6.5.1.2 Performance metrics
6.5.1.2.1 IGD
IGD [68] is a performance metric that measures both the convergence and diversity of the nondominated fronts obtained from an algorithm. Assume P is a set of uniformly distributed solutions on the true Pareto front (PF) and A is the solution set obtained from the optimization algorithm. IGD is defined as the average distance from P to A:

IGD(A, P) = \frac{\sum_{v \in P} d(v, A)}{|P|}   (6.30)


Figure 6.2 The average value of IGD metric changing with evaluation times on CTP2.

where d(v, A) is the Euclidean distance from v to the nearest point in A. The lower IGD(A, P) is, the better A approximates the true PF.
6.5.1.2.2 Minimal spacing

Minimal spacing [69] is an enhanced uniformity metric modified from spacing. The calculation of the minimal spacing (Sm) of a set A is described as follows.
Step 1 Normalize all the solutions in A.
Step 2 Separate the solutions in A into two parts: the calculated set Ac and the uncalculated set Au. Put all the solutions of A into Au, and randomly mark one solution with "true" and the rest with "false."
Step 2.1 Move the "true" solution from Au to Ac, and calculate the minimal distance from this solution to Au. The nearest solution in Au is marked with "true."
Step 2.2 Repeat Step 2.1 until Au = ∅.

S_m(A) = \sqrt{\frac{\sum_{i=1}^{|A|-1} \left(d_i - \bar{d}\right)^2}{|A| - 1}}   (6.31)

where d_i is the minimal Euclidean distance obtained in Step 2.1 and \bar{d} is the average value of the d_i. If S_m(A) = 0, the solutions in A are distributed uniformly.
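A minimal sketch of the IGD and minimal spacing computations is given below, assuming A and P are NumPy arrays whose rows are objective vectors; the greedy chain in minimal_spacing follows Step 2, and the function names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist

def igd(A, P):
    # Formula (6.30): mean distance from each reference point in P to its nearest point in A
    return cdist(P, A).min(axis=1).mean()

def minimal_spacing(A):
    A = (A - A.min(axis=0)) / (np.ptp(A, axis=0) + 1e-12)   # Step 1: normalize
    dist = cdist(A, A)
    remaining = set(range(len(A)))
    current = remaining.pop()                                # an arbitrary starting solution
    d = []
    while remaining:
        nearest = min(remaining, key=lambda j: dist[current, j])
        d.append(dist[current, nearest])                     # Step 2.1: minimal distance to A_u
        remaining.remove(nearest)
        current = nearest
    d = np.array(d)
    return np.sqrt(np.sum((d - d.mean()) ** 2) / (len(A) - 1))  # Formula (6.31)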

6.5.1.2.3 Coverage of two sets (ς)

\varsigma(A_1, A_2) = \frac{\left|\{a'' \in A_2 \mid \exists\, a' \in A_1 : a' \succeq a''\}\right|}{|A_2|}   (6.32)

ς(A1, A2) [70] ranges from 0 to 1. When ς(A1, A2) = 1, all the solutions in A2 are dominated by some solutions in A1, while ς(A1, A2) = 0 means that no solution in A1 dominates any solution in A2. It should be mentioned that ς(A1, A2) is not related to ς(A2, A1), so it is necessary to calculate both of them. In the experiments, we use these three performance metrics to measure the quality of the proposed algorithm compared with NSGA-II and Woldesenbet's algorithm. However, the PFs of test problems CTP3 and CTP4 are sets of discrete points, and the PF of CTP5 is a disjoint curve together with some discrete points. It is not reasonable to calculate uniformity when the true PF is distributed nonuniformly, so the minimal spacing values of these problems are omitted.
6.5.1.3 Comparison experiment results
In order to compare the performance of the three algorithms in a condensed way, we give the simulation results and performance metrics obtained by the three algorithms on all the selected test problems. Before showing the experimental results, we classify the CTP problems according to the characteristics of their PFs [25]; the other problems are not classified since they are not very complicated. As mentioned in Ref. [25], the classification is as follows. Group 1: CTP1 and CTP6, since they both have continuous PFs; group 2: CTP2, CTP7, and CTP8, since the PFs of these problems are a finite number of disconnected regions; group 3: CTP3, CTP4, and CTP5, since the PFs of these problems consist of a finite number of discrete points. Before a detailed analysis is presented, an illustration of the figures is necessary. Figs. 6.3, 6.5, 6.7, and 6.9 show the simulation results obtained from the three algorithms; each plot shown in these figures corresponds to the best IGD value over 30 runs. In the figures, plots marked with "Proposed," "NSGA-II," and "Woldesenbet" are the simulation results obtained from the proposed algorithm, NSGA-II, and Woldesenbet's algorithm, respectively. True Pareto fronts are marked with red solid lines, the Pareto optimal solutions obtained from the three CMOEAs are marked with small blue circles, and the feasible objective spaces of the CTP problems are shaded. In Figs. 6.4, 6.6, 6.8, and 6.10, box plots of the performance metrics for these problems are shown. The box plots marked with "IGD" and "Sm" are the IGD and minimal spacing metrics of the three compared CMOEAs, respectively, and "1, 2, 3" represents the proposed algorithm, NSGA-II, and


Figure 6.3 The simulation results of three algorithms on BNH, SRN, TNK, CONSTR, OSY, and welded beam.

Woldesenbet's algorithm in that order. In the figures marked with "ς of Pro and A," "1" is ς(proposed algorithm, A) and "2" is ς(A, proposed algorithm). Fig. 6.3 shows the simulation results obtained from the three algorithms on test problems BNH, SRN, TNK, CONSTR, OSY, and welded beam. We can see that the proposed algorithm achieves better performance than the other two algorithms on TNK, CONSTR, OSY, and welded beam. Nondominated solutions obtained from NSGA-II and Woldesenbet's algorithm are not distributed well enough in the smooth part of the PF on TNK, while the proposed algorithm achieves a better spread in the nondominated optimal set. For test problem


CONSTR, neither NSGA-II nor Woldesenbet's algorithm can attain Pareto optimal solutions covering the whole true PF. For test problem OSY, neither NSGA-II nor Woldesenbet's algorithm can find the overall PF; in addition, NSGA-II cannot converge to the true PF. Fig. 6.4 shows the box plots of performance metrics for test problems BNH, SRN, TNK, CONSTR, OSY, and welded beam. In these plots, we can observe that the three algorithms have comparable performance on SRN and TNK, but the proposed algorithm still has a slight advantage in IGD values. For CONSTR, lower IGD values and almost equal ς values prove the advantage of the proposed algorithm in the diversity of the optimal solutions. For OSY, the box plots of both IGD and ς show the superiority of the proposed algorithm in


Figure 6.4 Box plots of performance metrics for BNH, SRN, TNK, CONSTR, OSY, and welded beam.

diversity and convergence, which agrees with the visual appearance. It can be seen from the box plots of Sm that the optimal solutions from the three compared algorithms have similar uniformity on these test problems. Fig. 6.5 shows the simulation results obtained from the three algorithms on the group 1 test problems. Both problems have continuous PFs, and the shaded regions in the figure are the feasible objective spaces. From the simulation results with the best IGD values, we can only conclude that the three algorithms have comparable performance on these problems. In order to make a further comparison among the three algorithms, box plots of the performance metrics on the group 1 test problems are shown in Fig. 6.6. According to the box plots for CTP1, it is obvious that the proposed algorithm has better convergence and diversity than the other two algorithms. The feasible objective space of CTP6 has a banded distribution, so it is easy to become trapped in a local optimum for this problem. The IGD box plots of CTP6 show that the proposed algorithm can converge to the true Pareto front, while NSGA-II is usually trapped in a local optimum, and Woldesenbet's algorithm has worse convergence than the proposed algorithm. As can be


seen from the box plots of Sm, the solutions from the three compared algorithms have similar uniformity on CTP1 and CTP6. Fig. 6.7 shows the simulation results obtained from the three algorithms on the group 2 test problems, from which we can clearly see that the group 2 problems have disconnected PFs. Comparing the three algorithms on CTP2, no clear superiority can be observed. For CTP7, the banded distribution of the feasible objective space means that it is not easy to find the overall PF. From Fig. 6.7, we can see that the proposed algorithm and Woldesenbet's algorithm give comparable performance, while NSGA-II misses part of the disconnected optimal solutions. For CTP8, the feasible objective space is distributed in blocks, which means that it is easy not only to miss part of the PF but also to become trapped in a local optimum. However, we cannot distinguish the three algorithms using only the simulation results with the best IGD values on CTP8, since all the algorithms performed well.


Figure 6.5 Simulation results of the three algorithms on CTP1 and CTP6.


Figure 6.6 Box plots of performance metrics for CTP1 and CTP6.


Figure 6.7 Simulation results of the three algorithms on CTP2, CTP7, and CTP8.

Fig. 6.8 shows the box plots of the performance metrics on the group 2 test problems, from which we can see the superiority of the proposed algorithm. Lower IGD values and better ς values on CTP2 show a slight advantage of the proposed algorithm. For CTP7 and CTP8, the superiority of the proposed algorithm is obvious: the high IGD values and low ς values show the disadvantage of the other two algorithms in the convergence and diversity of nondominated solutions. We can see that the proposed algorithm strictly dominates the other two algorithms on CTP7 and CTP8, since ς(pro, NSGA-II) ≈ 1 and ς(NSGA-II, pro) ≈ 0, which proves that the proposed algorithm can find more nondominated solutions on or near the true PF.


Figure 6.8 Box plots of performance metrics for CTP2, CTP7, and CTP8.

As mentioned in Ref. [25], measuring the diversity performance of an algorithm is not suitable for the group 2 test problems because of the property of their PFs, so we adopt the number of disconnected regions found to evaluate it. From Table 6.2, it can be seen that the proposed algorithm has a slight advantage over the other two algorithms on CTP2 and CTP7: it finds all the disconnected regions in each of the 30 runs, which confirms that our algorithm can obtain well-distributed and convergent solutions. All the disconnected regions can also be found by the proposed algorithm on CTP8, while NSGA-II and Woldesenbet's algorithm easily become trapped in local optima, so they cannot find the correct PFs. As shown in Fig. 6.9, for the group 3 test problems an infeasible tunnel needs to be traversed when searching for the discrete Pareto optimal points at the end of the feasible tunnel. The narrower and longer the tunnel is, the more difficult the search becomes. In order to find all the discrete feasible points, some infeasible tunnels must be passed through. The optimal solutions obtained by the proposed algorithm have better convergence and diversity than those of the other two algorithms on the group 3 test problems, especially CTP4. On CTP5, the Pareto optimal solutions found by the proposed algorithm are closer to the true PF and more complete than those of the other two algorithms, but a discrete point near f1 = 0 is still missed.


Figure 6.9 Simulation results of the three algorithms on CTP3, CTP4, and CTP5.

Fig. 6.10 shows the box plots of the performance metrics on the group 3 test problems, from which we can see the superiority of the proposed algorithm. Lower IGD values and better ς values indicate that the proposed algorithm has better convergence and diversity on these problems than the other two algorithms. Especially for CTP4, the box plots of ς values prove its capacity for searching discrete points. As mentioned above, it is not reasonable to calculate the uniformity of problems CTP3, CTP4, and CTP5, so the experiment reports the number of discrete points found by the algorithms instead of box plots of Sm. In Table 6.3, we can observe that the number of discrete points found by the proposed algorithm is greater than that found by the other algorithms, which proves the effectiveness of the proposed algorithm. It is worth noticing that the PF of CTP5 consists of a disconnected region and a set of discrete points, but only the set of discrete points is taken into account in Table 6.3. Bold values represent the best results.



Figure 6.10 Box plots of performance metrics for CTP3, CTP4, and CTP5.

Table 6.2: Statistics of the number of disconnected regions found by the three algorithms on group 2 test problems.

Test problem | Algorithm               | Mean     | S.D.
CTP2         | Proposed algorithm      | 13       | 0
CTP2         | NSGA-II                 | 12.5     | 0.776819
CTP2         | Woldesenbet's algorithm | 12.46667 | 0.776079
CTP7         | Proposed algorithm      | 7        | 0
CTP7         | NSGA-II                 | 6.066667 | 0.253708
CTP7         | Woldesenbet's algorithm | 6.433333 | 0.568321
CTP8         | Proposed algorithm      | 3        | 0
CTP8         | NSGA-II                 | 0.433333 | 1.04004
CTP8         | Woldesenbet's algorithm | 0.966667 | 1.351457

Table 6.3: Statistics of the number of discrete points found by the three algorithms on group 3 test problems.

Test problem | Algorithm               | Mean     | S.D.
CTP3         | Proposed algorithm      | 12.76667 | 0.504007
CTP3         | NSGA-II                 | 10.8     | 2.265179
CTP3         | Woldesenbet's algorithm | 11       | 1.618854
CTP4         | Proposed algorithm      | 11.6     | 1.220514
CTP4         | NSGA-II                 | 7.666667 | 2.170862
CTP4         | Woldesenbet's algorithm | 8.8      | 1.689726
CTP5         | Proposed algorithm      | 13.06667 | 1.142693
CTP5         | NSGA-II                 | 12.26667 | 2.887946
CTP5         | Woldesenbet's algorithm | 12.56667 | 2.095699

6.5.2 The experiments of MOEA on clustering learning and classification learning
In this section, detailed comparison experiments against other algorithms are shown, including MOASCC, MSCC [71], SVM [72], RBFNN [73], MOCK [31], and semi-MOCK [74]. MSCC is a simultaneous clustering and classification learning algorithm using MOPSO. SVM is a state-of-the-art classifier. RBFNN is a radial basis function neural network model that handles clustering and classification sequentially. MOCK and semi-MOCK are unsupervised and semisupervised multiobjective evolutionary clustering algorithms, respectively. The algorithms mentioned above are first tested on synthetic datasets to show the efficiency of MOASCC. To give a further analysis of MOASCC, the experiments also apply it to real-life datasets, including parameter analysis, the benefit of MOEA, the convergence of MOASCC, and the comparison results on these datasets.
6.5.2.1 Experiment setup
For a dataset of size N, we randomly select N/2 samples as training samples and the rest as test samples for all the supervised and semisupervised learning algorithms. The nature-inspired algorithms MOASCC, MSCC, MOCK, and semi-MOCK share the same values of pop and gen, which are set to 100 and 50 for the synthetic datasets, and 100 and 100 for the UCI datasets, respectively. The probabilities of crossover (pc) and mutation (pm) in MOASCC, MOCK, and semi-MOCK are set to 0.7 and 0.3, respectively. For MSCC and RBFNN, the number of clusters K ranges from C to Cmax, where C is the true number of classes and Cmax is set to √N according to Ref. [71]. λ is the scale


factor in the Gaussian kernel function, with λ ∈ {0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15}. All the combinations of K and λ are tested over 30 independent runs, and the one with the best classification accuracy is reported in the experiments. In SVM, K is set to the real number of clusters, the regularization parameter is selected from {2^1, 2^0, 2^3, 2^5, 2^7, 2^9}, and the scale factor λ ∈ {0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15}; the combination of parameters with the best classification accuracy is selected to undertake the prediction task.
6.5.2.2 Experiment on synthetic datasets
This experiment adopts four synthetic datasets with different structures, ASD_11_2, 2moons, eyes, and spiral (see Fig. 6.11), to test the effectiveness of MOASCC by comparing it with MSCC, SVM, RBFNN, MOCK, and semi-MOCK. ASD_11_2 consists of 515 two-dimensional samples distributed in 11 spherical clusters. 2moons consists of 200 two-dimensional samples distributed in 2 moon-shaped clusters. Eyes is a synthetic dataset of 238 two-dimensional samples with 1 ring-shaped cluster and 2 square clusters. Spiral consists of 1000 two-dimensional samples distributed in 2 spiral-shaped clusters. The classification results of these four synthetic datasets obtained from MOASCC, MSCC, SVM, RBFNN, MOCK, and semi-MOCK are shown in Table 6.4. In Table 6.4, the classification accuracy over 30 runs is written in the form mean accuracy (standard deviation); K is determined adaptively in MOASCC, MOCK, and semi-MOCK, while it is specified before execution in MSCC, SVM, and RBFNN. As can be seen from this table, MOASCC gets better classification results than MSCC and RBFNN, especially on the dataset spiral, for which all the samples are classified into the correct category by MOASCC. Notice that in SVM the scale factor λ has multiple values when it achieves the best classification performance. Comparing MOASCC with SVM and semi-MOCK, they give neck-and-neck performance, so further comparison on real-life datasets is needed.

Figure 6.11 Synthetic dataset ASD_11_2 (A), 2moons (B), eyes (C), and spiral (D).

Table 6.4: Parameter setting and classification accuracy on synthetic datasets.

Dataset  | Algorithm | K         | λ                            | Maximum accuracy (%) | Mean accuracy (standard deviation) (%)
ASD_11_2 | MOASCC    | 11        | —                            | 100   | 100 (0)
ASD_11_2 | MSCC      | 11        | 0.01                         | 99.42 | 95.71 (0.69)
ASD_11_2 | SVM       | 11        | 0.05, 0.1, 0.5, 1, 5, 10, 15 | 100   | 100 (0)
ASD_11_2 | RBFNN     | 16        | 1                            | 100   | 98.91 (1.12)
ASD_11_2 | MOCK      | 9, 10, 11 | —                            | 96.12 | 88.71 (1.99)
ASD_11_2 | Semi-MOCK | 11        | —                            | 100   | 100 (0)
2moons   | MOASCC    | 2         | —                            | 100   | 100 (0)
2moons   | MSCC      | 10        | 0.001                        | 100   | 99.15 (0.41)
2moons   | SVM       | 2         | 1, 5, 10, 15                 | 100   | 100 (0)
2moons   | RBFNN     | 10        | 1                            | 100   | 99.07 (0.73)
2moons   | MOCK      | 2, 4, 5   | —                            | 68.50 | 65.20 (1.44)
2moons   | Semi-MOCK | 2, 3      | —                            | 100   | 100 (0)
Eyes     | MOASCC    | 3         | —                            | 100   | 100 (0)
Eyes     | MSCC      | 10        | 0.001                        | 99.16 | 98.49 (1.24)
Eyes     | SVM       | 2         | 0.1, 0.5, 1, 5, 15           | 100   | 100 (0)
Eyes     | RBFNN     | 11        | 1                            | 100   | 98.71 (1.08)
Eyes     | MOCK      | 2, 3      | —                            | 84.03 | 77.77 (2.44)
Eyes     | Semi-MOCK | 3, 4      | —                            | 100   | 98.71 (1.08)
Spiral   | MOASCC    | 2         | —                            | 100   | 100 (0)
Spiral   | MSCC      | 22        | 0.001                        | 87.10 | 85.36 (2.5)
Spiral   | SVM       | 2         | 5, 10, 15                    | 100   | 100 (0)
Spiral   | RBFNN     | 22        | 1                            | 97.10 | 94.75 (3.44)
Spiral   | MOCK      | 4, 5      | —                            | 100   | 98.92 (3.44)
Spiral   | Semi-MOCK | 2         | —                            | 100   | 100 (0)

Taking dataset ASD_11_2 as an example, we analyze the relation matrix and the classification accuracy obtained from MOASCC and MSCC, shown in Table 6.5. Here, we do not compare MOASCC with SVM, RBFNN, MOCK, and semi-MOCK, since SVM, MOCK, and semi-MOCK do not have a relation matrix and the relation matrix in RBFNN does not carry an intuitive meaning. Each row vector of the relation matrix satisfies \sum_{m=1}^{M} p(w_m|c_j) = 1 and shows the distribution of the samples in the cluster cj. Taking the relation matrix obtained from MSCC as an example, p(w3|c2) and p(w9|c2) are nonzero entries, which indicates that the samples in the cluster c2 are distributed over the classes w3 and w9. If there exists a value p(wm|cj) = 1, then all the training samples in this cluster have the same class label. When all the nonzero values equal 1, as in the relation matrix of MOASCC, the underlying structure of the given dataset is correctly detected by clustering

Table 6.5: Relation matrix obtained from MOASCC and MSCC on ASD_11_2. [The relation matrices themselves are not reproduced here owing to extraction damage; in the MOASCC matrix every nonzero entry equals 1, whereas the MSCC matrix contains fractional entries such as 0.82, 0.93, 0.07, and 0.09. Classification accuracy: MOASCC 100%, MSCC 99.42%.]

learning. From this table, we can see that MOASCC works better than MSCC, and the relation matrix shows that MOASCC captures a clearer relationship between the clusters and the given classes.
6.5.2.3 Experiment on real-life datasets
In this section, the experiment selects 19 real-life datasets from the University of California at Irvine (UCI) Machine Learning Repository [75] to test the efficiency and accuracy of MOASCC. Four datasets, glass, vowel, ecoli, and lung_cancer, are selected as examples for a further analysis of MOASCC, including parameter analysis, the benefit of MOEA, and the convergence of MOASCC. Fig. 6.12 gives an analysis of the parameters pc (pm = 1 − pc) and L in MOASCC. We take four UCI datasets as examples in each experiment; the number of samples, attributes, and categories of each dataset is written in the form dataset (samples × attributes × categories). According to Fig. 6.12A, the classification accuracy is not sensitive to the value of pc. When pc = 0.7, MOASCC performs slightly better than with the other values, so we choose pc = 0.7 in the experiments. It is noticed that the objective function values of MOASCC are related to the parameter L: usually the higher L is, the higher the clustering objective function value. As recommended in Ref. [31], L ∈ {5, ..., 10}, and Fig. 6.12B gives an experimental analysis of the effect of L. We found that the classification accuracy is not sensitive to the value of L, so the experiment chooses a consistent value of 10 for all the datasets. Next, we give a brief discussion of the effect of multiobjective optimization. In order to see how multiobjective optimization affects the result of classification, an experiment


Figure 6.12 Parameter analysis [glass (214 × 9 × 6), vowel (528 × 10 × 11), ecoli (336 × 7 × 8), lung_cancer (32 × 56 × 3)].


Figure 6.13 Number of clusters obtained during the optimization process on dataset glass, vowel, ecoli, and lung_cancer.

about the effect of single-objective and multiobjective optimization was carried out and is shown in Fig. 6.13. "MOASCC + strategy1" represents the algorithm that replaces the objective functions of MOASCC with a single objective function (the classification error rate), "MOASCC + strategy2" is MOASCC with the initialization strategy adopted in MOCK, and "MOASCC + strategy3" represents "MOASCC + strategy1" with the initialization strategy adopted in MOCK. MOCK adopts two schemes to generate initial individuals: half of them are derived from the MST, and the rest are generated from k-means (these solutions are converted to MST-based individuals). Taking the UCI datasets "glass," "vowel," "ecoli," and "lung_cancer" as examples in this figure, we can clearly see that using multiobjective optimization makes the number of clusters decrease or increase to a value close to the real number of clusters no matter which initialization scheme is used. With single-objective optimization, the number of clusters is easily affected by the initialization scheme, because the quality of the clustering cannot be guaranteed without a clustering objective function. We can also see that under multiobjective optimization the number of clusters has little to do with the initialization strategy, which is also the reason why MOASCC uses a simpler initialization scheme. In order to show the experimental result intuitively, Fig. 6.14 gives the Pareto front obtained from MOASCC on the datasets glass, vowel, ecoli, and lung_cancer. For a coordinate Si(x, y) in this figure, x is the number of clusters and y is the classification accuracy. The


Figure 6.14 Pareto front obtained from MOASCC on datasets Glass, Vowel, Ecoli, and Lung_cancer.

symbol "o" marked in red represents the Pareto optimal solution with the best ARI value. From Fig. 6.14, we can see that MOASCC is able to obtain a set of solutions with different numbers of clusters, and the solution with a relatively low classification error rate on the training samples gives a high accuracy on the test samples. Note that on the dataset "vowel," MOASCC obtained the optimal solution; however, it is a difficult task to find the optimal solution for all the tested datasets. Another observation is that there are no solutions whose number of clusters is far larger than the real number of classes, because such solutions are dominated by the Pareto solutions in the evolutionary process. To verify the convergence of MOASCC, an intuitive experiment is shown in Fig. 6.15, which shows how the classification accuracies obtained from MOASCC, MSCC, and semi-MOCK change during the evolutionary process on the datasets glass, vowel, ecoli, and lung_cancer. In this experiment, gen is set to 100 and the classification accuracy is calculated every five generations from the first to the 100th generation, except that the first interval is set to 4. The results show that the classification accuracies obtained from all the algorithms increase with the generation count in the early stage and then converge to a stable status in the later stage. This indicates that the Pareto optimal solutions are superior to the dominated solutions and rules out the possibility of overtraining. In the later stage, MOASCC achieves a relatively higher classification accuracy than MSCC and semi-MOCK except on lung_cancer. This indicates that the two objective functions in MOASCC are reasonable and efficient in solving classification problems.


Figure 6.15 The classification accuracies of MOASCC, MSCC, and semi-MOCK obtained from different generations on datasets glass, vowel, ecoli, and lung_cancer.

Moreover, MOASCC starts to converge around the 40th generation, so it seems suitable to set the parameter gen = 100 in our algorithm. In conclusion, this experiment proves not only the convergence of MOASCC but also its efficiency. As a combination of clustering and classification, the performance of MOASCC usually relies on the following aspects: (1) the difficulty of the given dataset; (2) the ability of the clustering scheme; and (3) the effect of the cooperation between clustering and classification. In order to discuss these issues, the overall comparison results of MOASCC against the other algorithms on the UCI datasets are presented in this subsection. Table 6.6 shows the detailed experimental results obtained from MOASCC, MSCC, SVM, RBFNN, MOCK, and semi-MOCK. In Table 6.6, the best or comparable classification results are marked in bold. First, we compare the two simultaneous clustering and classification algorithms, MOASCC and MSCC. MOASCC adopts a locus-based adjacency representation encoding scheme so that the number of clusters can be determined adaptively, which reduces the time spent searching for the best combination of the parameters K and λ. We can see that MOASCC achieves better performance than MSCC on all the datasets, especially glass, sonar, vowel,

Table 6.6: The experiment results obtained from MOASCC, MSCC, SVM, RBFNN, MOCK, and semi-MOCK on real-life datasets [the classification accuracy over 30 runs is written in the following form: mean (standard deviation)]. Datasets (#Samples × #dim × #class): Wine (178 × 13 × 3), Glass (214 × 9 × 6), Lenses (24 × 4 × 3), Iris (150 × 4 × 3), Wdbc (569 × 30 × 2), Heart disease (270 × 13 × 2), Soybean (small) (47 × 35 × 4), Balance scale (625 × 4 × 3), Sonar (208 × 60 × 2), Vowel (528 × 10 × 11), Thyroid (215 × 5 × 3), Lung_cancer (32 × 56 × 3), Pima Indians diabetes (768 × 8 × 2), Bupa (345 × 6 × 2), Vote (435 × 26 × 2), Vehicle (846 × 18 × 4), Ecoli (336 × 7 × 8), Image segmentation (2310 × 19 × 7), Waveform (5000 × 21 × 3)

MOASCC Accuracy (%)

MSCC K

Accuracy (%)

K

SVM l

Accuracy (%)

RBFNN l

Accuracy (%)

MOCK l

K

Accuracy (%)

Semi-MOCK K

Accuracy (%)

K

97.81 79.37 100 97.13 97.71 87.83 100

(0.58) (2.99) (0) (0.32) (0.44) (0.39) (0)

3,4,5 5,6,7 3 3,4 2 2e4 4

95.79 (1.50) 65.98 (2.67) 87.29 (10.60) 96.63 (1.43) 94.38 (1.52) 83.44(1.52) 86.27 (9.45)

3 20 3 3 2 2 4

0.001 0.01 0.1 0.01 0.05 0.01 0.1

98.05 (0.87) 77.54 (2.40) 94.86 (7.56) 96.93 (0.73) 94.80 (0.28) 86.22 (1.42) 76.88 (8.56)

1 0.1 15 0.1 10 5 15

97.55 52.18 97.64 97.07 94.69 84.89 72.27

(0.88) (5.38) (4.05) (0.90) (1.11) (1.38) (19.10)

6 6 6 9 4 12 4

1 0.05 0.05 5 1 0.1 5

68.65 44.16 68.96 90.10 94.52 80.91 42.91

(3.62) (4.46) (3.94) (0.73) (0.12) (0.52) (6.78)

3, 4 5e8 3,4 3,4 2,3 3e6 4

97.36 66.05 100 97.73 96.70 82.93 100

(0.55) (2.40) (0) (0.34) (0.58) (0.41) (0)

3,4 5e7 3 3,4 2,3 3e6 4

89.41 88.13 99.05 97.10 67.85 79.69

(0.67) (2.45) (0) (1.12) (2.74) (0.88)

3 2e4 11e14 3e5 2e6 2

89.48 (1.32) 67.90 (4.49) 40.83 (2.27) 95.70(1.92) 48.13 (8.32) 75.50(1.86)

15 9 16 11 3 2

1 0.001 0.001 0.1 0.001 0.05

92.97 (0.97) 86.70 (2.39) 93.21 (0.42) 96.09 (1.45) 66.98 (5.27) 83.51 (1.33)

0.1 15 0.1 0.1 0.01 0.1

91.26 71.74 46.35 91.06 53.64 77.32

(0.63) (5.97) (8.32) (1.78) (6.09) (0.74)

15 6 11 5 4 9

0.01 0.01 0.5 10 0.01 0.01

54.88 56.88 48.61 73.53 48.44 68.07

(5.23) (1.52) (4.95) (2.77) (1.90) (2.50)

3,4 3e5 9e11 3,4 3e5 2,3

72.74 75.75 56.41 96.00 75.16 74.40

(4.02) (1.74) (1.73) (0.61) (3.58) (0.83)

3, 4 2,3,4 9e12 3,4 3e5 2,3

69.20 94.52 44.67 76.39 62.18

(2.17) (0.61) (8.31) (11.82) (9.16)

8 14 6 8 7

0.01 0.05 0.5 15 15

58.46 65.08 44.49 64.06 57.84

(0.49) (0.92) (2.98) (0.65) (4.95)

2e4 2e4 4e6 6e11 7,8

64.52 91.53 56.35 85.74 83.30

(1.74) (0.55) (1.71) (1.58) (2.57)

2 3 4,5 5e8 7,8

17

15

69.21 (2.35)

3e5

85.33 (0.56)

3,4

72.90(1.31) 95.27 (0.67) 83.74 (2.97) 89.97 (1.12) 97.82 (0.36)

2 2 4e6 5e8 7e9

64.19 92.22 45.42 79.20 85.66

(2.60) (1.56) (4.60) (3.23) (2.26)

4 3 7 12 7

0.1 0.001 1 0.05 0.1

81.19 (2.06) 92.76 (0.79) 82.27 (1.74) 87.68 (1.45) 95.87 (0.40)

0.1 10 0.5 0.5 0.5

88.51 (0.32)

3,4

81.39 (2.91)

50

0.01

87.63 (0.92)

0.5

86.78 (0.26)

Multiobjective evolutionary algorithm (MOEA)-based sparse clustering

173

lung_cancer, pima_indians_diabetes, bupa, vehicle, and ecoli. On large-scale datasets image segmentation and waveform, MOASCC also shows efficiency. Second, we compare MOASCC with another hybrid clustering and classification learning model, RBFNN, which shows that MOASCC is superior to RBFNN in most of the real-life datasets, except that they get comparative results on dataset wine, lenses, iris, and vote. The better performance of MOASCC over RBFNN comes from its effective cooperation between clustering and classification. Third, the state-of-the-art classifier SVM is further compared in this part to see whether simultaneous clustering and classification can enhance the classification performance. The experiment shows that MOASCC is better than SVM on most datasets except for wine, balance_scale, pima indians diabetes, and bupa. Finally, a comparison between MOASCC and different multiobjective clustering algorithms MOCK and semi-MOCK is discussed. Although they use the same representation scheme, MOASCC still shows its superiority on most of the UCI datasets. From Table 6.6, we can derive the conclusions that: (1) the value of K determined adaptively by MOASCC is close to the true number of clusters; and (2) MOASCC can improve the performance of both clustering and classification. In real life, there are many datasets which are difficult to deal with. As a simultaneous clustering and classification algorithm, the result of clustering usually has a great effect on the classification performance. However, the comparison between MOASCC and other clustering/ classification algorithms in Table 6.6 indicates that clustering and classification can benefit from their cooperation. On the one hand, MOASCC is not strict to the underlying structure of the given dataset considering the clustering objective function, which can be demonstrated by the comparison between MOASCC and MSCC. On the other hand, MOASCC is based on multiobjective optimization, which demands that only the individuals with both better clustering quality and classification quality can be selected to replace the original individuals. What’s more, a mutation operator which is related to the feedback from classification is designed to guide the search. This scheme also improves the performance of clustering. Unfortunately, many of the features and attributes of the real-life datasets are redundant, noisy, or irrelevant to the clustering and classification task. It is difficult for most clustering and classification algorithms and even the ensemble algorithms to deal with such a task. According to [76,77], we can apply feature selection to clustering and classification to improve the performance of data mining. It is also one of our current efforts to use multiobjective optimization for subspace learning. To further analyze the effect of different MOEAs on MOASCC, we selected three state-ofthe-art MOEAs, MOEA/D [38], SPEA2 [36], and NSGA-II [35], to carry out this experiment (see Fig. 6.16). In this experiment, MOEA/D, SPEA2, and NSGA-II share the same values on parameters pop, gen, pc, and L. For the remaining parameters T (the


Figure 6.16 The classification result obtained from three state-of-the-art multiobjective evolutionary algorithms: MOEA/D, SPEA2, and NSGA-II.

For the remaining parameters, T (the number of weight vectors in the neighborhood of each weight vector) in MOEA/D and the archive size in SPEA2 are set to 20 and 100, respectively. Fig. 6.16 shows that MOEA/D, SPEA2, and NSGA-II have similar performance on most of the tested datasets. Since these algorithms adopt different nondominated solution reservation strategies, each algorithm gains its own advantage on different datasets. This experiment also demonstrates the efficiency of MOEAs in solving clustering/classification problems.

6.5.3 The experiments of MOEA on sparse spectral clustering

The experiments are mainly carried out on the basis of NSGA-II, and this section is divided into two parts. The first part presents a detailed analysis of SRMOSC on the basis of NSGA-II; five experiments are carried out, covering the parameter settings, the sparsity of the Pareto optimal solutions, the effectiveness of the final solution selection strategy, the proposed initialization, crossover, and mutation schemes, and the benefit of MOEAs in solving spectral clustering. The second part gives experimental results on real-life datasets. The proposed algorithms based on NSGA-II and MOEA/D are compared with four other similarity matrix construction methods and two multiobjective clustering algorithms, covering both unsupervised and semisupervised clustering. The four commonly used similarity matrix construction methods discussed in [47,78–81] are used for comparison.


In addition, multiobjective clustering with automatic k-determination (MOCK) and the multiobjective genetic algorithm optimizing p and sep [MOGA(p, sep)] are also compared. MOCK [31] is a graph representation-based multiobjective clustering algorithm that uses overall deviation and connectivity as objective functions to reflect cluster compactness and connectedness, respectively. MOGA(p, sep) [82] is a prototype representation-based multiobjective fuzzy clustering algorithm. In the experiments, we use supervised classification datasets; therefore, the number of clusters in all the algorithms is fixed to the number of classes, the clustering accuracy is measured as the percentage of instances that are correctly classified, and the clustering result with the highest accuracy is considered the best result. The parameters pop, gen, pc, and pm are set to 50, 50, 0.7, and 0.3, respectively, for SRMOSC, MOCK, and MOGA(p, sep). When constructing the similarity matrix with the fully connected, kNN, and mutual kNN construction methods [79–81], the Gaussian kernel K(x, y) = exp(−||x − y||² / (2σ²)) is adopted to calculate the similarity. We carry out experiments with the values {0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15} for σ, choosing the one that yields the best clustering result as the final value. For the kNN and mutual kNN methods, k is set to log(N). ε is set to the value with the best clustering result from {0.2, 0.3, 0.4, 0.5, 0.6} in the ε-neighborhood method [47,79].
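As a concrete reference for these baselines, the following minimal sketch (Python/NumPy) shows one way the fully connected, kNN, mutual kNN, and ε-neighborhood similarity matrices can be built with the Gaussian kernel. The function names and the ε thresholding convention are illustrative assumptions, not the exact routines used in the experiments.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    # Fully connected graph: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W

def knn_similarity(X, sigma, k=None, mutual=False):
    # kNN / mutual kNN graph weighted by the Gaussian kernel; k defaults to log(N).
    n = X.shape[0]
    k = int(np.ceil(np.log(n))) if k is None else k
    W = gaussian_similarity(X, sigma)
    keep = np.zeros_like(W, dtype=bool)
    rows = np.arange(n)[:, None]
    keep[rows, np.argsort(-W, axis=1)[:, :k]] = True   # k most similar neighbors per sample
    keep = (keep & keep.T) if mutual else (keep | keep.T)
    return np.where(keep, W, 0.0)

def epsilon_similarity(X, sigma, eps):
    # ε-neighborhood graph: here pairs are kept when their kernel similarity exceeds eps
    # (an assumed convention; thresholding on distance is an equally common variant).
    W = gaussian_similarity(X, sigma)
    return np.where(W > eps, W, 0.0)
```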

6.5.3.1 Detailed analysis of SRMOSC

In order to evaluate the effectiveness of SRMOSC, a detailed analysis is presented in this section.

1. Parameter analysis: Fig. 6.17 gives a detailed analysis of the parameter settings, carried out on the dataset "wine" for illustration. Two observations can be made from this figure: (1) SRMOSC is not sensitive to the parameter pc, especially when pop ≥ 40; and (2) when gen ≥ 20, SRMOSC converges to a stable state. Taking the stability and time complexity of SRMOSC into consideration, the parameters pop, gen, pc, and pm are set to 50, 50, 0.7, and 0.3, respectively.

2. Sparsity of the Pareto optimal solutions: This experiment examines the sparsity of the Pareto optimal solutions and whether they can exactly describe the relationship among samples. We use the UCI dataset [83] "wine" for illustration because the relationship between the different clusters is very clear. The wine dataset has 178 samples, 13 attributes, and three categories, with samples 1–59 belonging to category "1," samples 60–130 belonging to category "2," and the rest to category "3." In Fig. 6.18, the sparse matrices that correspond to some Pareto optimal solutions found by one run of SRMOSC, including the solution with the best ratio cut value, are visualized to show to what extent the sparse matrices can reveal the relationship among different clusters. All the nonzero entries in x are represented with black pixels; the weights are not considered in order to get a clearer picture.

Figure 6.17 Parameter analysis. (A) Effect of the parameters pc and pop on the clustering accuracy (gen is a constant set to 200). (B) Convergence of SRMOSC when pop is set to different values (pc is set to 0.7, gen runs from 1 to 200, and the results are recorded every five generations except for the first interval, which is 4). All the results are the average clustering accuracies over 20 independent runs, and pm = 1 − pc.


Figure 6.18 Visualization of sparse matrices in the PF. The marked points are Pareto optimal solutions obtained from one run of SRMOSC, and the one highlighted in red is the best ratio cut solution. Five Pareto optimal solutions are selected, and the corresponding sparse matrices and clustering accuracies are shown. (A) Accuracy: 59.55%. (B) Accuracy: 95.51%. (C) Accuracy: 96.07%. (D) Accuracy: 94.94%. (E) Accuracy: 95.51%.


We can see from the sparse matrices in Fig. 6.18 that they have an obvious property: most of the nonzero entries are distributed within the same cluster and rarely across different clusters, no matter how sparse the solution is. In this sense, they can reveal the relationship among samples. In order to further evaluate the effect of the sparse matrix in spectral clustering, Fig. 6.19 visualizes the similarity matrices, the corresponding eigenvalues, and the eigenvectors constructed by SRMOSC and several conventional methods. The first row of Fig. 6.19 shows the similarity matrix with the best ratio cut value obtained from SRMOSC, and the results from the other methods are shown in the remaining rows. Note that in the case of the fully connected construction method, the similarity matrix is not sparse. Unlike Fig. 6.18, the similarity matrices here are symmetric and the weights are taken into account; for visualization purposes, the maximum and minimum weights are represented with white and black pixels, respectively. We can see that the similarity matrix obtained from SRMOSC has the following properties: (1) the number of nonzero entries is quite low in contrast with that of zero entries; (2) the nonzero entries are mostly distributed as intraclass connections, which means they provide more discriminative information for clustering; (3) when interclass connections exist, their values are much smaller than those of intraclass connections; and (4) the values of the nonzero entries differ considerably from those obtained by other methods: the visualization of SRMOSC shows a high variance of gray levels, while most nonzero pixels in the other graphs share similar gray levels. These four properties demonstrate that the similarity matrix obtained from SRMOSC reveals the relationship between samples more clearly than the other methods. As shown in Algorithm 6.4, the eigenvectors obtained from the Laplacian of the similarity matrix are ultimately responsible for the clustering result. In the cases of SRMOSC (Fig. 6.19A) and kNN (Fig. 6.19C), eigenvector 1 cannot provide exact discriminating information for the clustering task; however, eigenvectors 2 and 3 can mostly classify the samples into three different clusters with k-means. In the cases of Fig. 6.19B, D, and E, it is clearly much harder to divide the samples into different clusters using the eigenvectors of the fully connected, mutual kNN, and ε-neighborhood similarity matrices.
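To make the role of these eigenvectors concrete, the sketch below shows the standard unnormalized spectral clustering pipeline (similarity matrix, graph Laplacian, smallest eigenvectors, k-means) that this discussion refers to; it is a generic illustration rather than a transcription of Algorithm 6.4.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(W, n_clusters):
    # W: symmetric nonnegative similarity matrix of shape (n, n).
    D = np.diag(W.sum(axis=1))                     # degree matrix
    L = D - W                                      # unnormalized graph Laplacian
    # Eigenvectors of the smallest eigenvalues form the spectral embedding.
    _, U = eigh(L, subset_by_index=[0, n_clusters - 1])
    # Each row of U embeds one sample; cluster the rows with k-means.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```

For a connected graph, the first eigenvector of the unnormalized Laplacian is (nearly) constant, which is consistent with the observation above that eigenvector 1 carries little discriminating information.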

3. Efficiency of the final solution selection method: In this part, we describe why SRMOSC adopts the ratio cut as the measurement to select the final solution and how efficient it is. Fig. 6.20 plots, for several UCI datasets, the relationship between the sparsity and the measurement error ||Ax − A||₂², the ratio cut, and the clustering accuracy. All the results are from one execution. In order to put the ratio cut, the clustering accuracy, and the objective function measurement error ||Ax − A||₂² in one plot, we normalized the objective function values into [0, 1].

Figure 6.19 Visualization of similarity matrices (column 1), eigenvalues (column 2), and eigenvectors (columns 3–5) obtained from five different methods; rows from top to bottom: SRMOSC (accuracy: 96.07%), fully connected (accuracy: 63.48%), kNN (accuracy: 96.07%), mutual kNN (accuracy: 57.87%), and ε-neighborhood (accuracy: 62.92%).

Figure 6.20 Relationship between the objective functions, the ratio cut, and the clustering accuracy on (A) Wine, (B) Glass, (C) Heart disease, (D) Thyroid, (E) Zoo, and (F) Iris.


Additionally, given that our algorithm tries to select, as the final solution, a solution that has the minimal ratio cut value while keeping a high clustering accuracy among all the Pareto solutions, we also normalize the negative ratio cut (−RC) into [0, 1] in order to see the relationship between the ratio cut and the clustering accuracy more clearly in Fig. 6.20. To show the details, an enlarged view of some panels is included in Fig. 6.20. We can clearly see from the nondominated solutions obtained by SRMOSC that there are no obvious knee regions or knee points. Even though we could use B-splines to fit the PF, the clustering accuracy of the solutions in the knee region is not stable; hence, we cannot use such a criterion to select a solution. For the solutions in the PF, the cases we tested mostly have the following property: once the sparsity ||x||₀ grows beyond a certain level, the clustering accuracy cannot keep increasing. This property increases the difficulty of selecting the final solution. We can see from Fig. 6.20 that the clustering accuracy changes in step with −RC: the solutions with better clustering accuracies usually have better RC values, although the best RC solution is not always the most accurate one. Therefore, it is reasonable and effective to use the ratio cut as the selection measurement. In Fig. 6.21, three experiments are carried out, covering clustering and semisupervised clustering with 10% and 20% labeled samples. In each experiment, we show the accuracies of two solutions, the one with the best ratio cut value and the one with the highest accuracy, selected from the Pareto optimal solutions over 20 runs. Two conclusions can be drawn from these boxplots: (1) it is appropriate for clustering or semisupervised clustering to use the ratio cut as the measurement to select the final solution, although the result is not always the best; and (2) the method used to extend clustering to semisupervised clustering is effective, since the results improve with the guidance of labeled data.

4. Effect of the specific evolutionary operators: First, we discuss the effect of the designed initialization and mutation schemes against random initialization and mutation schemes. The designed schemes are based on the assumption that a sample prefers to reconstruct itself with its neighbors; by taking the distance between samples into account, this neighbor information reduces the "blind search" in such a huge search space. We compare the clustering accuracy of the proposed schemes and the random schemes in Fig. 6.22. In this figure, we can see that the proposed schemes, which make use of neighbor information, significantly outperform the random schemes. In addition, the PFs obtained from the random scheme and the proposed scheme are shown in the supplementary material. To further discuss the benefit of the proposed mutation scheme, a comparison between an expansion of the classic polynomial mutation and the proposed mutation is shown in Fig. 6.23.

Figure 6.21 Boxplot of clustering accuracy comparison between the best ratio cut and the best accuracy solutions on the PF obtained from 20 runs, on (A) Wine, (B) Glass, (C) Heart disease, (D) Thyroid, (E) Zoo, and (F) Iris. "RC" and "best" represent the result of clustering with the best ratio cut value and the best clustering accuracy, respectively, and "RC a%" and "best a%" represent the corresponding results of semisupervised clustering with a% labeled data.
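As one concrete reading of this selection principle, the sketch below computes the standard ratio cut [60] of the partition induced by each candidate similarity matrix and keeps the Pareto solution that minimizes it. It uses scikit-learn's spectral clustering (which employs a normalized Laplacian) purely for brevity, so it is an illustrative reconstruction rather than the exact SRMOSC selection routine.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def ratio_cut(W, labels):
    # Standard ratio cut: sum over clusters of cut(C, complement of C) / |C|.
    value = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        value += W[in_c][:, ~in_c].sum() / in_c.sum()
    return value

def select_by_ratio_cut(pareto_matrices, n_clusters):
    # Cluster with each candidate (symmetric, nonnegative) similarity matrix
    # and keep the one whose partition has the smallest ratio cut.
    best_W, best_labels, best_rc = None, None, np.inf
    for W in pareto_matrices:
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="precomputed").fit_predict(W)
        rc = ratio_cut(W, labels)
        if rc < best_rc:
            best_W, best_labels, best_rc = W, labels, rc
    return best_W, best_labels, best_rc
```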

Figure 6.22 Boxplot of the clustering accuracy obtained from the proposed schemes and the random schemes on (A) Wine, (B) Glass, (C) Heart disease, (D) Thyroid, (E) Zoo, and (F) Iris ("proposed" and "random" represent the proposed schemes and the random schemes, respectively).

In polynomial mutation, we take each column vector of an individual as a basic unit to execute the classic mutation scheme. This comparison shows that the performance of the proposed mutation scheme is slightly better than that of the classic polynomial mutation. Furthermore, the effect of the proposed crossover scheme is discussed.

Figure 6.23 Boxplot of the clustering accuracy obtained from the proposed mutation and the expansion of the polynomial mutation on (A) Wine, (B) Glass, (C) Heart disease, (D) Thyroid, (E) Zoo, and (F) Iris.

In the proposed crossover scheme (Algorithm 6.7), two cases are considered, whose effects are shown in Fig. 6.24. Four crossover schemes are compared in this experiment: the proposed crossover, an expansion of simulated binary crossover (SBX), case 2 of the proposed crossover (denoted "case2"), and "DE/rand/1" (denoted DE). All the results are obtained from 20 independent executions. Taking the properties of the individuals into account, we carry out SBX on each individual by taking each column vector as a basic element. The results show that the proposed crossover performs better than SBX and DE/rand/1. What is more, using the nondominated solutions to guide the search shows a slight advantage over case 2.

5. Benefit of multiobjective optimization: In SRMOSC, spectral clustering is formulated as the multiobjective optimization problem (6.22). In order to discuss the benefit of the MOEA in solving this problem, a single objective optimization model [formulated as (6.33)] is compared in this part.

Figure 6.24 Boxplot of the clustering accuracy obtained from the proposed crossover and other crossover schemes on (A) Wine, (B) Glass, (C) Heart disease, (D) Thyroid, (E) Zoo, and (F) Iris.

\[
\min_{x}\ \| Ax - A \|_2^2 + \gamma \| x \|_0 \quad \text{s.t.}\ x_{ii} = 0,\ x_{ij} \in [0, 1] \tag{6.33}
\]

In Formula (6.33), the most difficult problem is how to select the value of the parameter γ. Referring to Fig. 6.25, ||Ax − A||₂² ≪ ||x||₀, which means that γ tends to be a small value (γ > 0, with an upper bound that depends on the problem), as confirmed in the experiment (see Fig. 6.25). Taking the datasets "wine" and "thyroid" as examples, we can see that: (1) the best γ values are different for different datasets, and it is time consuming to find a suitable value for each problem; we cannot obtain a satisfying clustering result by simply sampling a few values of γ and running a few times; and (2) the single objective model performs worse than SRMOSC (refer to Table 6.7). In the authors' view, adaptive schemes for choosing γ during the optimization process may improve its performance.
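To make the role of γ concrete, the minimal sketch below evaluates the two terms of (6.33) separately, as the multiobjective model does, and as the γ-weighted sum of the single objective model; the names are illustrative and the snippet is not part of SRMOSC.

```python
import numpy as np

def objectives(A, x):
    # The two objectives kept separate in the multiobjective model.
    reconstruction_error = np.linalg.norm(A @ x - A) ** 2   # ||Ax - A||_2^2
    sparsity = np.count_nonzero(x)                          # ||x||_0
    return reconstruction_error, sparsity

def scalarized(A, x, gamma):
    # Single-objective relaxation (6.33): the trade-off now hinges on a single
    # gamma, and the "right" gamma changes from dataset to dataset.
    err, nnz = objectives(A, x)
    return err + gamma * nnz
```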

Figure 6.25 Clustering accuracies under different γ values on (A) Wine and (B) Thyroid. For each γ value, the result is the average clustering accuracy of 10 runs.

6.5.3.2 Experimental comparison between SRMOSC and other algorithms

This section presents the experimental comparison among SRMOSC, unnormalized spectral clustering algorithms based on conventional similarity matrix construction methods, and the well-known multiobjective clustering evolutionary algorithms MOCK [31] and MOGA(p, sep) [82]. These experiments are extended to semisupervised clustering, comparing SRMOSC against the traditional methods and semi-MOCK [74] with 10% and 20% labeled data. Furthermore, SRMOSC is also implemented on the basis of MOEA/D in order to show the flexibility of the proposed framework. Twelve UCI datasets are adopted to test performance, and all the results are the average of 20 independent runs of each algorithm on all the datasets. The parameter settings of the other algorithms are the same as in the previous experiments. Note that in the MOEA/D-based SRMOSC, T is set to pop − 1, and only case 2 of the proposed crossover scheme can be used.

Tables 6.7–6.9 give the experimental results with no labeled data, 10% labeled data, and 20% labeled data, respectively. In these three tables, two experimental comparisons are shown. The first is the comparison among the algorithms that construct a similarity matrix, for which the best result is written in bold. The second is the comparison between SRMOSC, MOCK, and MOGA(p, sep); the result of MOCK or MOGA(p, sep) is marked in bold italics if it is better than that of SRMOSC. In order to assess the statistical significance of the results, we carried out a Kruskal–Wallis statistical test (α = 0.05) between each algorithm and the one that reaches the best result on each dataset. When an algorithm shows no statistically significant difference from the best one, its results are marked with the symbol "*" in the tables.
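For reference, this significance test can be reproduced with SciPy's implementation of the Kruskal–Wallis H-test; the accuracy lists below are placeholder values rather than entries from the tables.

```python
from scipy.stats import kruskal

# Accuracies from independent runs of the best algorithm and a competitor (placeholder values).
best_runs = [96.1, 95.8, 96.3, 95.9, 96.0]
other_runs = [95.6, 95.9, 95.4, 96.2, 95.7]

stat, p_value = kruskal(best_runs, other_runs)
# With alpha = 0.05, p_value >= 0.05 means no statistically significant difference,
# i.e., the competitor would be marked with "*" in the tables.
no_significant_difference = p_value >= 0.05
```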

Table 6.7 shows the experimental results obtained from all the algorithms.

Table 6.7: Clustering accuracy comparison obtained from SRMOSC against other algorithms on real-life datasets.

Datasets | SRMOSC (NSGA-II) | SRMOSC (MOEA/D) | Fully connected | kNN | Mutual kNN | ε-neighborhood | MOCK | MOGA(p, sep)
Wine | 95.90 ± 0.93* | 95.54 ± 1.08* | 63.48 ± 0.00 | 96.07 ± 0.00* | 54.61 ± 6.38 | 59.97 ± 6.93 | 68.65 ± 3.62 | 95.03 ± 1.34*
Glass | 62.45 ± 2.23 | 57.78 ± 5.29 | 47.03 ± 0.94 | 56.50 ± 5.12 | 44.01 ± 3.33 | 49.95 ± 1.33 | 44.16 ± 4.46 | 60.65 ± 2.92
Iris | 92.50 ± 2.46 | 91.44 ± 2.82 | 66.80 ± 0.00 | 88.36 ± 2.58 | 53.60 ± 7.46 | 68.00 ± 0.00 | 90.10 ± 0.73 | 91.33 ± 1.65
Wdbc | 94.37 ± 0.63* | 92.07 ± 2.73 | 65.40 ± 2.23 | 94.90 ± 0.00* | 62.74 ± 0.00 | 69.40 ± 3.60 | 94.52 ± 0.12* | 93.87 ± 0.46
Heart disease | 68.54 ± 5.60* | 65.43 ± 6.15 | 56.41 ± 1.21 | 66.30 ± 0.00* | 56.04 ± 1.06 | 56.24 ± 2.08 | 80.91 ± 0.52 | 81.61 ± 4.38
Balance scale | 66.72 ± 1.88 | 69.24 ± 4.61 | 64.74 ± 2.39 | 65.44 ± 1.89 | 61.30 ± 3.98 | 65.41 ± 6.72 | 54.88 ± 5.23 | 75.11 ± 4.25
Vote | 88.28 ± 0.60* | 88.85 ± 0.88* | 63.37 ± 1.54 | 88.05 ± 0.00* | 62.49 ± 1.65 | 77.46 ± 10.62 | 65.08 ± 0.92 | 87.79 ± 0.75
Ecoli | 80.64 ± 2.42 | 80.24 ± 1.53* | 64.43 ± 1.20 | 78.85 ± 2.54 | 68.36 ± 5.22 | 69.80 ± 4.81 | 64.06 ± 0.65 | 80.82 ± 1.08*
Thyroid | 92.84 ± 1.09* | 84.30 ± 4.03 | 73.77 ± 1.98 | 94.18 ± 1.04 | 75.19 ± 3.65 | 71.88 ± 0.56 | 73.53 ± 2.77 | 87.44 ± 2.04*
Zoo | 90.79 ± 2.55 | 87.97 ± 3.98 | 42.67 ± 1.54 | 83.32 ± 8.17 | 59.11 ± 5.56 | 50.40 ± 4.04 | 50.50 ± 0.00 | 88.05 ± 1.83
Image | 70.22 ± 2.89* | 69.56 ± 3.12* | 37.62 ± 0.00 | 65.26 ± 3.09 | 54.19 ± 4.73 | 53.53 ± 3.53 | 57.84 ± 4.95 | 67.37 ± 3.67
Waveform | 63.90 ± 5.40 | 65.47 ± 5.09 | 63.86 ± 0.00 | 52.04 ± 0.00 | 34.03 ± 0.10 | 40.38 ± 6.09 | 69.21 ± 2.35 | 63.96 ± 5.07

Table 6.8: Semisupervised clustering with 10% labeled data obtained from SRMOSC against other algorithms on real-life datasets. The compared methods are SRMOSC (based on NSGA-II and MOEA/D), fully connected, kNN, mutual kNN, ε-neighborhood, and MOCK, evaluated on Wine (178 × 13 × 3), Glass (214 × 9 × 6), Iris (150 × 4 × 3), Wdbc (569 × 30 × 2), Heart disease (270 × 13 × 2), Balance scale (625 × 4 × 3), Vote (435 × 26 × 2), Ecoli (336 × 7 × 8), Thyroid (215 × 5 × 3), Zoo (101 × 16 × 7), Image segmentation (2310 × 19 × 7), and Waveform (5000 × 21 × 3).

Table 6.9: Semisupervised clustering with 20% labeled data obtained from SRMOSC against other algorithms on real-life datasets. The compared methods and datasets are the same as in Table 6.8.


SRMOSC significantly outperforms fully connected, mutual kNN, and ε-neighborhood similarity matrix-based spectral clustering on all the tested data. In addition, SRMOSC achieves a better performance than kNN on most of the tested datasets. Comparing SRMOSC with MOCK, SRMOSC works much better on the tested datasets except on the dataset "heart." Note that kNN performs much better than mutual kNN on all the tested data: in both cases the parameter k is set to the same value, but the similarity matrix obtained from kNN has more nonzero entries than that of mutual kNN, which also demonstrates the importance of the parameter k. Meanwhile, the problem of deciding the value of k is overcome in SRMOSC.

The experimental results of semisupervised clustering with 10% and 20% labeled data are shown in Tables 6.8 and 6.9. Semi-MOCK handles the semisupervised information with a third objective function, the adjusted Rand index [57], which is an external measure of clustering quality. In the traditional similarity matrix construction methods, all the entries of labeled samples with the same label are set to the maximum value of the similarity matrix, and the corresponding entries with different labels are set to 0. In Table 6.8, SRMOSC works better on most of the tested datasets than the other traditional methods and semi-MOCK. When the percentage of labeled data increases to 20%, SRMOSC also shows its efficiency against the other algorithms. Note that kNN performs well on some of the datasets, especially with 10% labeled data. As mentioned, selecting a value of k for finite data is a difficult problem, and an additional experiment in the supplementary material shows how k affects the clustering result. For the other traditional spectral clustering methods, even when the best parameter value is chosen, the results are still quite poor.

Fig. 6.26 gives a time evaluation of SRMOSC based on different MOEAs and of the other algorithms under the same experimental conditions.

Figure 6.26 Time evaluation (in seconds) of SRMOSC and the other algorithms on the wine, glass, heart, thyroid, zoo, and iris datasets (SRMOSC based on NSGA-II and MOEA/D are represented as NSGA-II and MOEA/D, respectively).

It shows that: (1) in contrast to conventional spectral clustering algorithms, the time cost of SRMOSC is higher because SRMOSC is a multiobjective clustering algorithm; although SRMOSC costs more time than the conventional spectral clustering algorithms, it overcomes the difficulty of selecting a suitable parameter value for constructing the similarity matrix; (2) in contrast to the multiobjective clustering algorithms, its time cost is higher than that of MOGA(p, sep) (prototype-based representation) but lower than that of MOCK (graph-based representation), which indicates that the time complexity of multiobjective clustering algorithms is closely related to the cluster representation method; and (3) the time cost of SRMOSC based on MOEA/D is lower than that based on NSGA-II, which shows that the time complexity of the underlying MOEA has a great effect on SRMOSC.

6.6 Summary

This chapter has presented three MOEA-based methods, addressing constrained multiobjective optimization problems (CMOPs), simultaneous adaptive clustering and classification, and sparse spectral clustering, respectively [28,46,84].

The first method, a modified objective function method with a feasible-guiding strategy, is introduced to solve CMOPs. The modified objective function method allows the search for Pareto optimal individuals to exploit both feasible and infeasible regions. In this method, constraint violation and objective function values are both considered when selecting infeasible individuals: only those with low constraint violations and better objective function values survive the selection. The feasibility ratio of the current population decides the contribution of these two parts, which guides the evolution either toward less-violated infeasible individuals with better objective function values or toward better nondominated feasible individuals. Even when there are no feasible individuals in the current population, the two parts are still considered together, in case the search becomes trapped in finding individuals that are feasible but not sufficiently optimal. The feasible-guiding strategy lets selected feasible individuals guide the evolution of infeasible individuals, and the cooperation between feasible and infeasible regions drives the search more effectively. Both components are implemented on the basis of NSGA-II because of the popularity of this algorithm; of course, they can easily be extended to other constrained MOEAs. The experimental results on test problems indicate that the presented algorithm is able to find well-distributed Pareto optimal solutions that spread evenly on or near the true PF, which demonstrates its capability.

The second method, MOASCC, is an algorithm that learns simultaneous clustering and classification adaptively via an MOEA. Its main contributions can be summarized as follows. First, MOASCC adopts a graph-based representation scheme to generate a set of individuals with various partitions and different numbers of clusters.


Second, a new clustering objective function is designed to make MOASCC more robust to the underlying structure of the given dataset. Multiobjective optimization not only guarantees the quality of both clustering and classification but also restricts the number of clusters to a certain range. Third, a specific mutation scheme is designed to make use of the feedback drawn from the classification process, which enhances the classification performance. In addition, an experimental analysis of the convergence of MOASCC is given to demonstrate its efficiency.

SRMOSC is introduced in the final part and makes several contributions. The principal one is a framework based on sparse representation via an MOEA for constructing the similarity matrix for spectral clustering: SRMOSC models the similarity matrix construction process in spectral clustering as a constrained multiobjective problem and solves it with EAs. It overcomes the difficulty of parameter setting that commonly exists in traditional methods, and the experiments also demonstrate that the multiobjective evolutionary sparse representation model is efficient in solving the spectral clustering problem. Second, SRMOSC is extended to semisupervised spectral clustering by modeling the semisupervised information as a constraint to satisfy and by using it to guide the search in the initialization and mutation processes. Third, a selection principle is designed that adopts the ratio cut as the measurement to select the final solution from all the Pareto optimal solutions, based on a standard adjacency matrix constructed from all the nondominated solutions; detailed experiments show that a satisfying solution can be obtained in this way. Fourth, specific initialization, crossover, and mutation schemes are designed for solving sparse representation-based spectral clustering with constrained MOEAs. Additionally, the model that constructs the similarity matrix in SRMOSC can easily be extended to other graph-related problems, such as subspace learning. All these contributions help SRMOSC achieve a more satisfying performance than other conventional methods or multiobjective clustering algorithms. However, owing to its coding scheme, its space complexity is high, especially when solving large-scale problems.

References
[1] Hsieh MN, Chiang TC, Fu LC. A hybrid constraint handling mechanism with differential evolution for constrained multiobjective optimization. In: Evolutionary computation (CEC), 2011 IEEE congress on. IEEE; 2011. p. 1785–92.
[2] Michalewicz Z, Schoenauer M. Evolutionary algorithms for constrained parameter optimization problems. Evolutionary Computation 1996;4(1):1–32.
[3] Coello CAC, Carlos A. A survey of constraint handling techniques used with evolutionary algorithms. Lania-RI-99-04, Laboratorio Nacional de Informática Avanzada; 1999.
[4] Davis L. Handbook of genetic algorithms. 1991.
[5] Michalewicz Z. Genetic algorithms + data structures = evolution programs. New York: Springer-Verlag; 1996.
[6] Ray T, Singh HK, Isaacs A, et al. Infeasibility driven evolutionary algorithm for constrained optimization. In: Constraint-handling in evolutionary optimization. Berlin, Heidelberg: Springer; 2009. p. 145–65.

[7] Dasgupta D, Michalewicz Z. Evolutionary algorithms in engineering applications. International Journal of Evolution Optimization 1999;1:93–4.
[8] Koziel S, Michalewicz Z. A decoder-based evolutionary algorithm for constrained parameter optimization problems. In: International conference on parallel problem solving from nature. Berlin, Heidelberg: Springer; 1998. p. 231–40.
[9] Koziel S, Michalewicz Z. Evolutionary algorithms, homomorphous mappings, and constrained parameter optimization. Evolutionary Computation 1999;7(1):19–44.
[10] Michalewicz Z, Nazhiyath G. Genocop III: a co-evolutionary algorithm for numerical optimization problems with nonlinear constraints. In: Evolutionary computation, 1995, IEEE international conference on, vol. 2. IEEE; 1995. p. 647–51.
[11] Michalewicz Z. Evaluation of paths in evolutionary planner/navigator. In: Proceedings of the international workshop on biologically inspired evolutionary systems; 1995.
[12] Xiao J, Michalewicz Z, Zhang L, et al. Adaptive evolutionary planner/navigator for mobile robots. IEEE Transactions on Evolutionary Computation 1997;1(1):18–28.
[13] Xiao J, Michalewicz Z, Zhang L. Evolutionary planner/navigator: operator performance and self-tuning. In: Evolutionary computation, 1996, proceedings of IEEE international conference on. IEEE; 1996. p. 366–71.
[14] Sathya SS, Kuppuswami S. Gene silencing – a genetic operator for constrained optimization. Applied Soft Computing 2011;11(8):5801–8.
[15] Runarsson TP, Yao X. Stochastic ranking for constrained evolutionary optimization. IEEE Transactions on Evolutionary Computation 2000;4(3):284–94.
[16] Deb K. An efficient constraint handling method for genetic algorithms. Computer Methods in Applied Mechanics and Engineering 2000;186(2–4):311–38.
[17] Takahama T, Sakai S. Constrained optimization by applying the α constrained method to the nonlinear simplex method with mutations. IEEE Transactions on Evolutionary Computation 2005;9(5):437–51.
[18] Takahama T, Sakai S. Constrained optimization by the ε constrained differential evolution with an archive and gradient-based mutation. In: Evolutionary computation (CEC), 2010 IEEE congress on. IEEE; 2010. p. 1–9.
[19] Paredis J. Co-evolutionary constraint satisfaction. In: International conference on parallel problem solving from nature. Berlin, Heidelberg: Springer; 1994. p. 46–55.
[20] Singh HK, Ray T, Smith W. Performance of infeasibility empowered memetic algorithm for CEC 2010 constrained optimization problems. In: Evolutionary computation (CEC), 2010 IEEE congress on. IEEE; 2010. p. 1–8.
[21] Venkatraman S, Yen GG. A generic framework for constrained optimization using genetic algorithms. IEEE Transactions on Evolutionary Computation 2005;9(4):424–35.
[22] Mallipeddi R, Suganthan PN. Evaluation of novel adaptive evolutionary programming on four constraint handling techniques. In: Evolutionary computation, 2008. CEC 2008 (IEEE world congress on computational intelligence). IEEE congress on. IEEE; 2008. p. 4045–52.
[23] Mallipeddi R, Suganthan PN. Ensemble of constraint handling techniques. IEEE Transactions on Evolutionary Computation 2010;14(4):561–79.
[24] Mallipeddi R, Suganthan PN. Differential evolution with ensemble of constraint handling techniques for solving CEC 2010 benchmark problems. In: Evolutionary computation (CEC), 2010 IEEE congress on. IEEE; 2010. p. 1–8.
[25] Xiao H, Zu JW. A new constrained multiobjective optimization algorithm based on artificial immune systems. In: Mechatronics and automation, 2007. ICMA 2007. International conference on. IEEE; 2007. p. 3122–7.
[26] Zhang Z. Immune optimization algorithm for constrained nonlinear multiobjective optimization problems. Applied Soft Computing 2007;7(3):840–57.


[27] Karaboga D, Akay B. A modified artificial bee colony (ABC) algorithm for constrained optimization problems. Applied Soft Computing 2011;11(3):3021–31.
[28] Jiao L, Luo J, Shang R, et al. A modified objective function method with feasible-guiding strategy to solve constrained multiobjective optimization problems. Applied Soft Computing 2014;14:363–80.
[29] Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(1):4–37.
[30] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(12):1650–4.
[31] Handl J, Knowles J. An evolutionary approach to multiobjective clustering. IEEE Transactions on Evolutionary Computation 2007;11(1):56–76.
[32] Duda RO, Hart PE, Stork DG. Pattern classification. New York: Wiley; 1973.
[33] Cai W, Chen S, Zhang D. A multiobjective simultaneous learning framework for clustering and classification. IEEE Transactions on Neural Networks 2010;21(2):185–200.
[34] Coello CAC. Evolutionary multiobjective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 2006;1(1):28–36.
[35] Deb K, Pratap A, Agarwal S, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182–97.
[36] Zitzler E, Laumanns M, Thiele L. SPEA2: improving the strength Pareto evolutionary algorithm. TIK-Report. 2001. p. 103.
[37] Coello CAC, Pulido GT, Lechuga MS. Handling multiple objectives with particle swarm optimization. IEEE Transactions on Evolutionary Computation 2004;8(3):256–79.
[38] Zhang Q, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 2007;11(6):712–31.
[39] Garcia-Piquer A, Fornells A, Bacardit J, et al. Large-scale experimental evaluation of cluster representations for multiobjective evolutionary clustering. IEEE Transactions on Evolutionary Computation 2014;18(1):36–53.
[40] Mukhopadhyay A, Maulik U, Bandyopadhyay S, et al. A survey of multiobjective evolutionary algorithms for data mining: Part I. IEEE Transactions on Evolutionary Computation 2014;18(1):4–19.
[41] Mukhopadhyay A, Maulik U, Bandyopadhyay S, et al. Survey of multiobjective evolutionary algorithms for data mining: Part II. IEEE Transactions on Evolutionary Computation 2014;18(1):20–35.
[42] Qasem SN, Shamsuddin SM. Memetic elitist Pareto differential evolution algorithm based radial basis function networks for classification problems. Applied Soft Computing 2011;11(8):5565–81.
[43] Qasem SN, Shamsuddin SM, Zain AM. Multiobjective hybrid evolutionary algorithms for radial basis function neural network design. Knowledge-Based Systems 2012;27:475–97.
[44] Qasem SN, Shamsuddin SM, Hashim SZM, et al. Memetic multiobjective particle swarm optimization-based radial basis function network for classification problems. Information Sciences 2013;239:165–90.
[45] Bharill N, Tiwari A. An improved multiobjective simultaneous learning framework for designing a classifier. In: Recent trends in information technology (ICRTIT), 2011 international conference on. IEEE; 2011. p. 737–42.
[46] Luo J, Jiao L, Shang R, et al. Learning simultaneous adaptive clustering and classification via MOEA. Pattern Recognition 2016;60:37–50.
[47] Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing 2007;17(4):395–416.
[48] Wright J, Ma Y, Mairal J, et al. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 2010;98(6):1031–44.
[49] Vidal R. Subspace clustering. IEEE Signal Processing Magazine 2011;28(2):52–68.
[50] Zelnik-Manor L, Perona P. Self-tuning spectral clustering. In: Advances in neural information processing systems; 2005. p. 1601–8.
[51] Woldesenbet YG, Yen GG, Tessema BG. Constraint handling in multiobjective evolutionary optimization. IEEE Transactions on Evolutionary Computation 2009;13(3):514–25.

[52] Park YJ, Song MS. A genetic algorithm for clustering problems. In: Proceedings of the third annual conference on genetic programming; 1998. p. 568–75.
[53] Good BH, de Montjoye YA, Clauset A. Performance of modularity maximization in practical contexts. Physical Review E 2010;81(4):046106.
[54] Matake N, Hiroyasu T, Miki M, et al. Multiobjective clustering with automatic k-determination for large-scale data. In: Proceedings of the 9th annual conference on genetic and evolutionary computation. ACM; 2007. p. 861–8.
[55] Pizzuti C. A multiobjective genetic algorithm to find communities in complex networks. IEEE Transactions on Evolutionary Computation 2012;16(3):418–30.
[56] Wilson RJ, Watkins JJ. Graphs: an introductory approach: a first course in discrete mathematics. John Wiley & Sons Inc.; 1990.
[57] Hubert L, Arabie P. Comparing partitions. Journal of Classification 1985;2(1):193–218.
[58] Corne DW, Jerram NR, Knowles JD, et al. PESA-II: region-based selection in evolutionary multiobjective optimization. In: Proceedings of the 3rd annual conference on genetic and evolutionary computation. Morgan Kaufmann Publishers Inc.; 2001. p. 283–90.
[59] Li L, Yao X, Stolkin R, et al. An evolutionary multiobjective approach to sparse reconstruction. IEEE Transactions on Evolutionary Computation 2014;18(6):827–45.
[60] Wei YC, Cheng CK. Ratio cut partitioning for hierarchical designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 1991;10(7):911–21.
[61] Binh TT, Korn U. MOBES: a multiobjective evolution strategy for constrained optimization problems. In: The third international conference on genetic algorithms (Mendel 97). 25; 1997. p. 27.
[62] Srinivas N, Deb K. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 1994;2(3):221–48.
[63] Tanaka M, Watanabe H, Furukawa Y, et al. GA-based decision support system for multicriteria optimization. In: Systems, man and cybernetics, 1995. Intelligent systems for the 21st century, IEEE international conference on, vol. 2. IEEE; 1995. p. 1556–61.
[64] Deb K. Multiobjective optimization using evolutionary algorithms. John Wiley & Sons; 2001.
[65] Osyczka A, Kundu S. A new method to solve generalized multicriteria optimization problems using the simple genetic algorithm. Structural Optimization 1995;10(2):94–9.
[66] Ray T, Tai K. An evolutionary algorithm with a multilevel pairing strategy for single and multiobjective optimization. Foundations of Computing and Decision Sciences 2001;26(1):75–98.
[67] Deb K, Pratap A, Meyarivan T. Constrained test problems for multiobjective evolutionary optimization. In: International conference on evolutionary multi-criterion optimization. Berlin, Heidelberg: Springer; 2001. p. 284–98.
[68] Zhang Q, Zhou A, Zhao S, et al. Multiobjective optimization test instances for the CEC 2009 special session and competition. University of Essex, Colchester, UK and Nanyang Technological University, Singapore; 2008. p. 264. Special session on performance assessment of multiobjective optimization algorithms, technical report.
[69] Bandyopadhyay S, Pal SK, Aruna B. Multiobjective GAs, quantitative indices, and pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2004;34(5):2088–99.
[70] Zitzler E, Deb K, Thiele L. Comparison of multiobjective evolutionary algorithms: empirical results. Evolutionary Computation 2000;8(2):173–95.
[71] Cai W, Chen S, Zhang D. A simultaneous learning framework for clustering and classification. Pattern Recognition 2009;42(7):1248–59.
[72] Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2011;2(3):27.
[73] Oyang YJ, Hwang SC, Ou YY, et al. Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Transactions on Neural Networks 2005;16(1):225–36.
[74] Handl J, Knowles J. On semi-supervised clustering via multiobjective optimization. In: Proceedings of the 8th annual conference on genetic and evolutionary computation. ACM; 2006. p. 1465–72.


[75] Blake C. UCI repository of machine learning databases. 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[76] Dash M, Liu H. Feature selection for clustering. In: Pacific-Asia conference on knowledge discovery and data mining. Berlin, Heidelberg: Springer; 2000. p. 110–21.
[77] Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 2005;17(4):491–502.
[78] Zhang X, Li J, Yu H. Local density adaptive similarity measurement for spectral clustering. Pattern Recognition Letters 2011;32(2):352–8.
[79] Hamad D, Biela P. Introduction to spectral clustering. In: Information and communication technologies: from theory to applications, 2008. ICTTA 2008. 3rd international conference on. IEEE; 2008. p. 1–6.
[80] Maier M, Hein M, von Luxburg U. Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science 2009;410(19):1749–64.
[81] Chen WY, Song Y, Bai H, et al. Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(3):568–86.
[82] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Multiobjective genetic algorithm-based fuzzy clustering of categorical attributes. IEEE Transactions on Evolutionary Computation 2009;13(5):991–1005.
[83] Lichman M. UCI machine learning repository. Irvine, CA, USA: School Inf. Comput. Sci., Univ. California Irvine; 2013. Available: http://archive.ics.uci.edu/ml.
[84] Luo J, Jiao L, Lozano JA. A sparse spectral clustering framework via multiobjective evolutionary algorithm. IEEE Transactions on Evolutionary Computation 2016;20(3):418–33.

CHAPTER 7

MOEA-based community detection

Chapter Outline
7.1 Introduction
7.2 Multiobjective community detection based on affinity propagation
7.2.1 Background to APMOEA
7.2.1.1 Affinity propagation method
7.2.1.2 Multiobjective optimization
7.2.2 Objective functions
7.2.3 The selection method for nondominated solutions
7.2.4 Preliminary partition by the AP method
7.2.5 Further search using multiobjective evolutionary algorithm
7.2.5.1 Representation and initialization
7.2.5.2 Genetic operators
7.2.6 Elitist strategy of the external archive
7.3 Multiobjective community detection based on similarity matrix
7.3.1 Background of GMOEA-net
7.3.1.1 Structural balance theory
7.3.1.2 Tchebycheff approach
7.3.2 Objective functions
7.3.3 The construction of similarity matrix and k-nodes update policy
7.3.3.1 The function of node similarity
7.3.3.2 The k-nodes update policy
7.3.4 Evolutionary operators
7.3.4.1 The cross-merging operator based on local node sets
7.3.4.2 The mutation operator based on similarity matrix
7.3.5 The whole framework of GMOEA-net
7.4 Experiments
7.4.1 Evaluation index
7.4.2 Networks for simulation
7.4.2.1 Computer-generated networks
7.4.2.2 Real-world networks
7.4.3 Comparison algorithms and parameter settings
7.4.3.1 Comparison algorithms
7.4.3.2 Parameter settings
7.4.4 Experiments on computer-generated networks
7.4.4.1 Experiments on APMOEA
7.4.4.2 Experiments on GMOEA-net


7.4.5 Experiments on real-world networks
7.5 Summary
References

7.1 Introduction

Complex networks are widely used to model complex systems in the real world. Nodes in a network represent concrete members of a complex system, while the associations between these members are abstracted as edges. Both the node attributes and the edge attributes generally carry a large amount of useful information, which is complicated and diverse [1]. For instance, in a social network [2], a node may simply represent a user, or it may further contain the user's attributes, such as gender, position, age, and interests. Analogously, an edge may describe a relationship between users, or it may express the strength, direction, and kind of the relationship, such as hostility or friendliness. In addition to this visible knowledge about nodes and edges, complex networks also conceal a number of topological properties, including the small-world property [3], scale-freeness [4], structural balance [5], and community structure [6]. These properties are both a focus of research and a source of immense challenges. In particular, a topic of current interest in the realm of complex networks is to explore and exploit network clustering. Recently, many state-of-the-art approaches have been created to identify clusters accurately in complex networks [7–13].

Communities, also known as clusters or modules [14], refer to collections of nodes with a certain topological structure. To our knowledge, the definition of community structure is not unique, but the widely accepted view is that the connections between nodes within a community are more compact than their connections with other nodes [15,16]. Some other distinctive definitions of communities, for example from the perspective of probability, can be found in Refs. [17,18]. Accordingly, community detection algorithms, also known as graph partitioning or network clustering [18], are designed to reveal such topological structures in complex networks. A common practice is to abstract this kind of issue as an optimization problem: by establishing an appropriate model, the optimal solution or some near-optimal solutions can be obtained by optimizing the model. In the past few decades, numerous algorithms have been proposed since the significance of mining the community structure of complex networks was realized. Example methods include GN [7], proposed by Girvan and Newman, which is one of the most classical hierarchical clustering algorithms [13,19,20]. Graph partitioning methods, with the Kernighan–Lin algorithm [21] and the spectral bisection algorithm [22] as their representatives, try to divide the entire network into a few subgraphs. However, most optimization models of community detection have been shown to be NP-hard [23], and it is difficult to find the optimal solution to the optimization problem directly.

Since evolutionary algorithms (EAs) are less sensitive to the differentiability of optimization models, EAs have become an important branch of network clustering for solving such optimization problems and have found extensive applications [24] in the field of artificial intelligence [25]. For instance, Pizzuti designed a community detection algorithm (GA-net) [26] based on a genetic algorithm [27] and a multiobjective community detection algorithm (MOGA-net) [11] based on the nondominated sorting genetic algorithm (NSGA-II) [28]. In 2011, Gong et al. devised a hybrid genetic algorithm, called the Memetic-net algorithm [29], to uncover communities hidden in unsigned networks. To improve the accuracy of identifying communities, the authors also established MOEA/D-net [30] to cope with unsigned networks by using the multiobjective evolutionary algorithm based on decomposition (MOEA/D) [31]. Furthermore, Liu et al. devised a metaheuristic method (MEAs-SN) [32] to handle signed networks under the framework of MOEA/D.

This chapter presents two MOEA-based algorithms for identifying communities in complex networks: the multiobjective evolutionary algorithm based on affinity propagation (APMOEA) [33] and the multiobjective evolutionary algorithm based on similarity matrix (GMOEA-net) [34]. First, APMOEA uses the affinity propagation (AP) method to obtain a preliminary partition of the network; to speed up convergence, the multiobjective evolutionary algorithm selects the nondominated solutions from these preliminary partition results as its initial population. Second, by repeatedly selecting nondominated solutions from the population during the iterations, the multiobjective evolutionary algorithm finds solutions that approximate the true Pareto-optimal front, which counteracts the tendency of the data clustering method to fall into local optima. Finally, APMOEA uses an elitist strategy called "external archiving" to prevent degradation during the multiobjective evolutionary search: the preliminary partition results obtained by AP are archived and participate in the final selection of the Pareto-optimal solutions. Experiments on benchmark data, including computer-generated networks and eight real-world networks, show that compared with seven other state-of-the-art algorithms, APMOEA obtains more accurate results and converges faster.

GMOEA-net establishes a generalized similarity function, constructs a similarity matrix, and then proposes a presegmentation strategy based on this matrix. The presegmentation strategy only considers nodes with high similarity, which avoids the interference of noisy nodes in the label-updating stage; in this way, strongly connected nodes are quickly aggregated into subcommunities in the initial stage of the algorithm. GMOEA-net then designs a crossover operator, called the cross-merging operator, to merge the subcommunities generated by the presegmentation technique.

On this basis, a mutation operator based on the node similarity matrix is proposed to adjust the boundary nodes connecting different communities. Finally, in order to deal with different types of networks, GMOEA-net proposes a new multiobjective optimization model. Extensive and rigorous experiments on unsigned and signed social networks show that the algorithm can effectively mine communities.

7.2 Multiobjective community detection based on affinity propagation

The main parts of APMOEA [33] are the choice of objective functions, the selection method for nondominated solutions, the use of AP to obtain preliminary partitions of networks, and the genetic operators of the multiobjective evolutionary algorithm. These parts are introduced in detail in the following sections. The procedure of APMOEA is shown in Table 7.1.

7.2.1 Background to APMOEA

7.2.1.1 Affinity propagation method

In 2007, Frey and Dueck proposed a powerful clustering method called affinity propagation [35], which has shown high efficiency in various fields. It offers not only a low error rate and strong stability but also a short running time. Furthermore, AP does not require the number of clusters to be specified in advance. The basic idea of AP is relatively simple. It takes negative real-valued similarities between pairs of data points as input, where s(i, k) indicates how appropriate it is for data point k to be the exemplar for data point i. The algorithm considers all data points as potential exemplars at the beginning and transmits messages between data points until a set of high-quality exemplars and the corresponding clusters gradually emerge. There are two types of messages. One is the "responsibility" r(i, k), reflecting how well-suited point k is to serve as the exemplar for point i.

Table 7.1: The procedure for APMOEA.
Algorithm 7.1: APMOEA
Input: Affinity matrix of network: A; population size of parameter P: NumP; crossover probability: pc; mutation probability: pm; maximum number of iterations: Gmax;
Output: A set of Pareto-optimal solutions;
1: Get the preliminary partitions Cpre by using the AP method; archive Cpre; loop := 1;
2: Chromosomes Cchild <- Genetic operation(Cpre, pc, pm);
3: f1(Cchild), f2(Cchild) <- Objective functions f1, f2 of Cchild; update the Pareto-optimal front Coptimal by selecting nondominated solutions from Cchild;
4: If loop = Gmax, go to Step 5; otherwise, loop := loop + 1 and return to Step 2;
5: Select nondominated solutions from Coptimal and Cpre as the final Pareto-optimal solutions and output them.

appropriate it is for point i to choose point k as its exemplar. The overall process of message transmission can be expressed by the following formulae:

r^{(t+1)}(i,k) = (1-\lambda)\big[s(i,k) - \max_{k' \neq k}\{a(i,k') + s(i,k')\}\big] + \lambda r^{(t)}(i,k)    (7.1)

a^{(t+1)}(i,k) = (1-\lambda)\min\big\{0,\ r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\big\} + \lambda a^{(t)}(i,k)    (7.2)

a^{(t+1)}(k,k) = (1-\lambda)\sum_{i' \neq k} \max\{0, r(i',k)\} + \lambda a^{(t)}(k,k)    (7.3)

where the parameter λ is a damping factor [35] used to prevent numerical oscillations, whose value lies between 0 and 1. Before the iterations, the values of "responsibility" and "availability" are set to zero, i.e., r^{(0)}(i,k) = 0 and a^{(0)}(i,k) = 0. AP takes as input the value of s(k,k) for every data point to weight how likely it is to be chosen as an exemplar. These values are known as "preferences" (P). As all data points can be regarded as potential exemplars during initialization, the preferences share a common value, which is usually the median or the minimum of the negative similarity matrix S.
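To make the message passing of Eqs. (7.1)-(7.3) concrete, here is a minimal NumPy sketch of one damped AP iteration; it assumes a full similarity matrix S, and the names R, A_avail, and lam are illustrative choices of this example rather than part of the original text.

```python
import numpy as np

def ap_iteration(S, R, A_avail, lam=0.5):
    """One damped update of responsibilities R and availabilities A_avail
    for affinity propagation over similarity matrix S (Eqs. 7.1-7.3)."""
    n = S.shape[0]
    # Responsibility: r(i,k) <- s(i,k) - max_{k' != k} {a(i,k') + s(i,k')}
    AS = A_avail + S
    idx = np.argmax(AS, axis=1)
    first_max = AS[np.arange(n), idx]
    AS[np.arange(n), idx] = -np.inf
    second_max = AS.max(axis=1)
    R_new = S - first_max[:, None]
    R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
    R = (1 - lam) * R_new + lam * R
    # Availability: a(i,k) <- min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
    # and a(k,k) <- sum_{i' != k} max(0, r(i',k))
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, R.diagonal())       # keep r(k,k) unclipped in the column sums
    A_new = Rp.sum(axis=0)[None, :] - Rp     # remove the contribution of i' = i
    diag = A_new.diagonal().copy()
    A_new = np.minimum(A_new, 0)
    np.fill_diagonal(A_new, diag)
    A_avail = (1 - lam) * A_new + lam * A_avail
    return R, A_avail

# After convergence, the exemplar of point i is argmax_k (a(i,k) + r(i,k)).
```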

7.2.1.2 Multiobjective optimization
A multiobjective optimization problem with q objectives can be defined as [31,36]:

\max F(x) = \{f_1(x), f_2(x), \ldots, f_q(x)\}    (7.4)

where x = (x_1, x_2, ..., x_n) ∈ Z is the decision vector, and Z is the feasible region in decision space. Given two decision vectors x, x* ∈ Z, x* is said to dominate x (denoted as x* ≻ x) if and only if:

(\forall i \in \{1,2,\ldots,q\}: f_i(x^*) \geq f_i(x)) \wedge (\exists j \in \{1,2,\ldots,q\}: f_j(x^*) > f_j(x))    (7.5)

If there exists no decision vector x in the feasible region Z such that x ≻ x*, we call x* a Pareto-optimal solution or nondominated solution. All Pareto-optimal solutions compose the Pareto-optimal set, and its image in the objective space is called the Pareto-optimal front. Thus, the goal of multiobjective optimization is to find a set of solutions approximating the true Pareto-optimal front. Different from single-objective optimization, multiobjective optimization achieves a group of nondominated solutions in a single run and reveals the hierarchical structure of networks, meeting different needs for network division. Note that the optimal solutions found by single-objective optimization are usually included in the Pareto-optimal set [37]. In the

following sections, we describe experiments to illustrate the advantages of multiobjective optimization algorithms over single-objective optimization algorithms.
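As a small illustration of the dominance relation in Eq. (7.5), the sketch below filters a set of objective vectors (both objectives maximized) down to its nondominated subset; the function names are illustrative and are not taken from the original algorithms.

```python
import numpy as np

def dominates(fa, fb):
    """True if objective vector fa dominates fb under maximization (Eq. 7.5)."""
    fa, fb = np.asarray(fa), np.asarray(fb)
    return bool(np.all(fa >= fb) and np.any(fa > fb))

def nondominated(F):
    """Indices of the nondominated rows of an (N, q) objective matrix."""
    return [i for i, fi in enumerate(F)
            if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i)]

# Example: nondominated([[1.0, 0.2], [0.8, 0.9], [0.7, 0.1]]) -> [0, 1]
```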

7.2.2 Objective functions
Objective functions that are commonly used in community detection include modularity Q [7], modularity density D [38], community score CS [26], and community fitness CF [39]. Modularity Q is a widely used criterion put forward by Girvan and Newman, and a solution with a higher value of Q indicates a better partitioning of a network. The definition of modularity Q can be formulated as follows:

Q = \sum_{s=1}^{K} \Big[\frac{l_s}{m} - \Big(\frac{d_s}{2m}\Big)^2\Big]    (7.6)

where l_s represents the number of edges connecting all nodes in community s, m is the total number of edges in the network, and d_s is the sum of the degrees of all the nodes in community s. The higher the value of Q, the denser the connections within communities. Although many optimization algorithms based on modularity Q have emerged recently [37], they suffer from a resolution limit problem, such that small clusters often fail to be separated from larger clusters. To avoid this problem, the proposed algorithm adopts modularity density D, which yields a significant improvement over modularity Q, as an objective function. Consider an undirected network G = (V, E) with vertex set V and edge set E. Its adjacency matrix is A. If there exists a connection between node i and node j, A_{ij} = 1; otherwise A_{ij} = 0. If V_1 and V_2 are two disjoint subsets of V, then L(V_1, V_2) = \sum_{i \in V_1, j \in V_2} A_{ij} and L(V_1, \bar{V}_1) = \sum_{i \in V_1, j \in \bar{V}_1} A_{ij}, where \bar{V}_1 = V - V_1. For a given

partition U = {V_1, V_2, ..., V_m}, V_i is the vertex set of subgraph G_i. For i = 1, 2, ..., m, the modularity density D can be expressed as:

D = \sum_{i=1}^{m} \frac{L(V_i, V_i) - L(V_i, \bar{V}_i)}{|V_i|}    (7.7)

In APMOEA, this equation is divided into two parts that serve as two objectives for optimization. The first part, known as the ratio association [38], indicates how closely nodes connect with each other in the same community. The second part, known as the ratio cut [40], indicates how closely nodes connect with nodes in different communities. Maximizing the modularity density D can find communities with dense intraconnections and sparse

interconnections, which suggests an optimal partition of the network. Thus, the two-objective optimization problem can be formulated as a maximization problem:

\max f_1(x) = \sum_{i=1}^{m} \frac{L(V_i, V_i)}{|V_i|}
\max f_2(x) = -\sum_{i=1}^{m} \frac{L(V_i, \bar{V}_i)}{|V_i|}    (7.8)
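The two objectives of Eq. (7.8) can be evaluated directly from the adjacency matrix and an integer label vector, as in the following sketch (the function and variable names are illustrative, not part of APMOEA's published code):

```python
import numpy as np

def ratio_objectives(A, labels):
    """Return (f1, f2) of Eq. (7.8): the ratio association and the negated
    ratio cut of the partition described by `labels`."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    f1 = f2 = 0.0
    for c in np.unique(labels):
        members = labels == c
        size = members.sum()
        l_in = A[np.ix_(members, members)].sum()      # L(Vi, Vi)
        l_out = A[np.ix_(members, ~members)].sum()    # L(Vi, V - Vi)
        f1 += l_in / size
        f2 -= l_out / size
    return f1, f2

# The modularity density D of Eq. (7.7) is simply f1 + f2 for the same partition.
```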

7.2.3 The selection method for nondominated solutions
In APMOEA, the method proposed in NSGA-II [28] is employed to select the nondominated solutions. It consists of two parts: a fast nondominated sorting approach and a crowded-comparison approach. First, APMOEA employs the fast nondominated sorting approach to sort the population S_g into different nondomination levels and keeps only the individuals of the first nondominated front. The updated population is recorded as S_g-Pareto. Then, to obtain a better spread of the Pareto-optimal front, the solutions are screened again by the crowded-comparison approach [36]. For a given individual g ∈ S_g-Pareto, its crowding distance can be measured by the following formula [36]:

d(g, S_{g\text{-Pareto}}) = \sum_{k=1}^{q} \frac{d_k(g, S_{g\text{-Pareto}})}{f_k^{\max} - f_k^{\min}}    (7.9)

where f_k^max and f_k^min represent the maximum and minimum values of the k-th objective, respectively, and q stands for the number of objective functions. d_k(g, S_g-Pareto) can be expressed as:

d_k(g, S_{g\text{-Pareto}}) = \begin{cases} \infty, & \text{if } f_k(g) = M \text{ or } m \\ \min\{f_k(g_j) - f_k(g_i)\}, & \text{otherwise} \end{cases}    (7.10)

where M and m are the maximum and minimum values of the k-th objective found in S_g-Pareto, and g_i, g_j are subject to {f_k(g_i) < f_k(g) < f_k(g_j) | g_i, g_j ∈ S_g-Pareto}. From formula (7.9) we can see that solutions with a greater crowding distance contribute more to the diversity of the population. Hence, according to their crowding-distance values, the solutions are updated by removing individuals that are too crowded on the Pareto-optimal front.
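A compact sketch of the crowding-distance computation of Eqs. (7.9)-(7.10): objectives are handled one at a time, boundary solutions receive an infinite distance, and interior solutions accumulate the normalized gap between their nearest neighbors (all names are illustrative):

```python
import numpy as np

def crowding_distance(F):
    """Crowding distance of each row of the (N, q) objective matrix F."""
    F = np.asarray(F, dtype=float)
    N, q = F.shape
    d = np.zeros(N)
    for k in range(q):
        order = np.argsort(F[:, k])
        fmin, fmax = F[order[0], k], F[order[-1], k]
        span = fmax - fmin if fmax > fmin else 1.0
        d[order[0]] = d[order[-1]] = np.inf        # boundary solutions (Eq. 7.10)
        for pos in range(1, N - 1):
            d[order[pos]] += (F[order[pos + 1], k] - F[order[pos - 1], k]) / span
    return d

# Solutions with a larger crowding distance are kept when the front is truncated.
```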


7.2.4 Preliminary partition by the AP method
Data clustering methods such as K-means [41] have fast convergence speeds, but are very sensitive to the choice of initial clustering centers and require prior knowledge about the number of clusters, which is typically unavailable in real-world community detection problems. Compared to K-means, the AP clustering method is more precise and stable. More importantly, the AP method does not need to know the number of clusters in advance. Since it was first proposed in 2007, several scholars have applied the AP algorithm to community detection [42-45]. Community detection can be regarded as a graph clustering problem [26,46], in which a network is viewed as a large graph made up of several subgraphs, and connections are much denser within the same subgraph than between different subgraphs. In data clustering, the comparison between two samples actually means the comparison between the attributes they share. In community detection as a graph clustering problem, however, only the topological information of the network is available. It is therefore critical to choose a high-quality similarity measure to transform community detection into a data clustering problem. In light of the comparative experimental results in the literature [47], the proposed algorithm employs a similarity measure based on the signaling process [48], which has proven to be highly accurate. The similarity measure based on the signaling process was proposed by Hu et al. in 2008 [48]. Its essential principle is to regard a network with n nodes as a signal transmission system, in which every node can send, receive, and record signals. After a period of transmission, the signal distributions over the whole network produced by vertices in the same community will be similar. The signaling process can be expressed as:

W = (I_n + A)^t    (7.11)

where I_n is an n-dimensional identity matrix and t is the transmission time, which takes a value of 3 in APMOEA. For an undirected network with n nodes and adjacency matrix A, we first compute the signal transmission matrix W = (w_1, w_2, ..., w_k, ..., w_n)^T, where w_k = (w_{k1}, w_{k2}, ..., w_{kn}), k = 1, 2, ..., n. Here w_k indicates the effect on the n nodes produced by the k-th node after t steps. In order to obtain comparable results, every row vector of matrix W should be normalized. Different from the original normalization method mentioned in Ref. [48], the matrix after normalization is recorded as U = (u_1, u_2, ..., u_k, ..., u_n)^T, where u_k = (u_{k1}, u_{k2}, ..., u_{kn}), k = 1, 2, ..., n, and u_{kl} is given by:

u_{kl} = w_{kl} \Big/ \sqrt{\sum_{j=1}^{n} w_{kj}^2}    (7.12)

where l = 1, 2, ..., n. Following these procedures, we can transform the topological information of the network into geometrical information of vectors in an n-dimensional Euclidean space. It is worth noting that, in order to apply the AP algorithm for clustering, we have to compute the negative Euclidean distances between pairs of the n vectors u_1, u_2, ..., u_n to obtain the negative similarity matrix S. According to the descriptions above, the detailed procedure of using the AP method for the preliminary partitioning of networks is shown in Table 7.2.
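The whole chain from adjacency matrix to the negative similarity matrix handed to AP (Eqs. 7.11-7.12 followed by negative Euclidean distances) fits in a few lines of NumPy; this is a sketch with illustrative names and t = 3 as stated above:

```python
import numpy as np

def signaling_similarity(A, t=3):
    """Negative Euclidean-distance similarity matrix S built from the
    signaling matrix W = (I_n + A)^t with row normalization (Eqs. 7.11-7.12)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    W = np.linalg.matrix_power(np.eye(n) + A, t)              # Eq. (7.11)
    U = W / np.sqrt((W ** 2).sum(axis=1, keepdims=True))      # Eq. (7.12)
    diff = U[:, None, :] - U[None, :, :]
    return -np.sqrt((diff ** 2).sum(axis=-1))                 # negative distances
```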

7.2.5 Further search using a multiobjective evolutionary algorithm
In order to obtain solutions approximating the true Pareto-optimal front and to converge to the global optimum, APMOEA takes a multiobjective evolutionary algorithm (MOEA) as the means for a further search. Through crossover and mutation on the preliminary partitioning results obtained by the AP method, the diversity of the solution space is greatly increased, which helps to avoid local optima. According to the number of objective functions, evolutionary algorithms can be divided into two categories: single-objective and multiobjective evolutionary algorithms. Compared to a multiobjective evolutionary algorithm, a single-objective evolutionary algorithm obtains only one definite solution rather than a group of solutions in one run, which is not conducive to finding the true partitions. Thus, APMOEA adopts the multiobjective evolutionary algorithm as the further search method.
7.2.5.1 Representation and initialization
For each partition of a network with n nodes, we use a string of n integer numbers as its representation, such as a partition x:

x = (x_1, x_2, \ldots, x_i, \ldots, x_n)    (7.13)

Table 7.2: The preliminary partitioning of networks by the AP method.
Algorithm 7.2: The preliminary partitioning of networks by the AP method
Input: Affinity matrix of network: A; population size of parameter P: NumP; maximum size of dominant population: Nmax;
Output: The preliminary partitioning results Cpre;
Step 1: Negative similarity matrix S <- Signal similarity(A);
Step 2: Population CP <- Initialize parameter P(NumP);
Step 3: Population CAP <- Affinity propagation(S, CP);
Step 4: f1(CAP), f2(CAP) <- Objective functions f1, f2 of CAP;
Step 5: Cpre <- Selection(CAP, f1(CAP), f2(CAP), Nmax); output Cpre.

Here x_i is a class label that represents the cluster to which node i belongs. Nodes in the same cluster have the same label. For example, if nodes 1 and 2 are in the same cluster, then x_1 = x_2. In community detection problems, population initialization is typically performed by generating a group of partitions randomly. Although this approach is simple and fast, it takes many iterations for the algorithm to converge to the optimal results. In the proposed algorithm, we employ a set of good partitioning results obtained by the AP method as the initial population of the evolutionary algorithm, which greatly enhances the quality of the initial population and thus promotes rapid convergence to the optimal solution.
7.2.5.2 Genetic operators
For the sake of increasing the diversity of the solution space and finding solutions approximating the true Pareto-optimal front, APMOEA uses crossover and mutation operations in the process of evolution. They are introduced next.
Crossover: Conventional methods such as one-point or two-point crossover are simple to operate but, considering the phenotypic characteristics of the chromosomes, they are not suitable for the proposed algorithm, as they may destroy useful genetic information inherited from the parents. To generate offspring carrying features common to their parents, APMOEA employs a two-way crossover operation [28]. For example, for a network of five nodes, two chromosomes r_a = [1 2 1 1 3] and r_b = [2 3 3 4 2] are selected randomly from the parent population, and their corresponding offspring generated by the crossover operation are r_c and r_d. If we select one-point crossover and choose the third node as the crossover point, then all the genes in chromosomes r_a and r_b are exchanged after that point. As shown in Table 7.3, the offspring are r_c = [2 3 1 1 3] and r_d = [1 2 3 4 2]. The circled numbers represent the genes changed in this step of the operation. It is clear that nodes 1, 3, and 4 were originally in the same community in chromosome r_a; however, they are assigned to totally different communities in chromosome r_d, which destroys the original information of the parent. If we instead choose the two-way crossover, with the third node still being the crossover point, the genes whose values are the same as that of the third node are retained, namely the first, third, and fourth genes in r_a, and the second and third genes in r_b. The rest of the genes are swapped. This process is shown in Table 7.4. The results r_c = [1 3 1 1 2] and r_d = [1 3 3 1 3] successfully inherit effective information from their parents.

Table 7.3: One-point crossing.

Table 7.4: Two-way crossing.

Mutation: APMOEA adopts the following mutation mode: randomly select a gene of a chromosome and change its value to an integer in the set {1, 2, ..., L}, where L is the largest class label in that chromosome. This mode is easy to operate and helps increase the diversity of the population. In addition, invalid mutations are effectively avoided by limiting the scope of mutation. For each chromosome to be mutated, 20% of the genes are selected for mutation. For example, if the chromosome r_m = [1 2 1 1 3] is selected, then L equals 3. Select 20% of the genes (namely one gene) randomly; assuming it is the fourth gene, its value can then be changed to any one of 1, 2, and 3. This procedure is shown in Table 7.5.

Table 7.5: Mutation operation (the selected gene of r_m = [1 2 1 1 3] is changed to another label in {1, 2, 3}).
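The two-way crossover and the restricted mutation described above can be sketched as follows; the example call reproduces the offspring given in the text, while the function names and the handling of the 20% mutation rate are illustrative:

```python
import random

def two_way_crossover(ra, rb, pos):
    """Two-way crossover: in each parent, keep the genes sharing the label found
    at position `pos` and take the remaining genes from the other parent."""
    la, lb = ra[pos], rb[pos]
    rc = [ga if ga == la else gb for ga, gb in zip(ra, rb)]
    rd = [gb if gb == lb else ga for ga, gb in zip(ra, rb)]
    return rc, rd

def mutate(rm, rate=0.2):
    """Mutate about 20% of the genes to a random label in {1, ..., L}."""
    L = max(rm)
    child = list(rm)
    for i in random.sample(range(len(child)), max(1, round(rate * len(child)))):
        child[i] = random.randint(1, L)
    return child

# two_way_crossover([1, 2, 1, 1, 3], [2, 3, 3, 4, 2], pos=2)
# -> ([1, 3, 1, 1, 2], [1, 3, 3, 1, 3]), matching the example in the text.
```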

7.2.6 Elitist strategy of the external archive
An elitist strategy known as the external archive is used here to counteract the degradation problem that may emerge in the evolutionary algorithm. The external archive is similar to the elitist strategy proposed in Ref. [49]. As the preliminary partitions obtained by the AP method are a group of superior solutions, they are additionally archived as the elitists. After a new set of nondominated solutions is found by the further search of the evolutionary algorithm, these are merged with the archived solutions, and the final Pareto-optimal set is selected from the merged set. This ensures the dominance of the solutions and prevents the degradation of the final results to a certain extent.

7.3 Multiobjective community detection based on similarity matrix
Recently, research on signed networks has attracted increasing attention. Signed networks, also known as signed social networks [50], are currently adopted to abstract social networks. Compared with unsigned networks, signed networks are constructed from both positive relations and negative relations, since the relations between people or between organizations invariably display double-sided natures. Generally speaking, connections in signed networks with positive values can be depicted as "friendly," "like," etc., while negative connections are usually described as "hostile," "dislike," and so on. Therefore, to extend the community definition of unsigned networks to signed networks, it is essential to consider the connection density and the connection signs simultaneously. In order to cope with such mixed-structure networks, a number of approaches have been proposed in recent years. For instance, the FEC algorithm [12] adopts an agent-based heuristic method which is capable of giving nearly optimal solutions. Moreover, two algorithms, called MEAs-SN [32] and SNMOGA [51], leverage the framework of MOEA/

MOEA-based community detection 209 D and that of NSGAII to excavate communities in signed networks, respectively. In this section, another algorithm called GMOEA-net [34] will be introduced, which is combined with MOEA/D to deal with unsigned networks and signed networks.

7.3.1 Background of GMOEA-net
Given a network G = (V, E), V is the set of nodes. Without loss of generality, E = (PE, NE) represents the set of edges, where PE is the set of positive edges and NE is the set of negative edges in the network. In particular, NE = ∅ when the network is an unsigned network. Normally, A represents the adjacency matrix of the network, which contains the prior information of the input network. The elements of A represent the connection weights between nodes, where A_{ij} = 1 indicates that there is a positive edge between the i-th and j-th nodes, A_{ij} = -1 indicates that there is a negative relationship between them, and otherwise A_{ij} = 0. Accordingly, PE = {(v_i, v_j) | A_{ij} = 1} and NE = {(v_i, v_j) | A_{ij} = -1}.
7.3.1.1 Structural balance theory
The structural balance theory [5], also known as Heider's balance theory, was originally proposed by Heider in 1944 to account for the balance of social signed networks. Heider's balance theory holds that interpersonal networks tend to form a balanced structure, in which a friend of my friend is my friend and an enemy of my enemy is also my friend. Therefore, for a basic triangular network structure, Fig. 7.1 graphically illustrates the substance of Heider's balance theory. For a complete network, the graphs in Fig. 7.1 can adequately represent all the relationships among the nodes v_1, v_2, and v_3. Furthermore, Heider's balance theory considers that the first two states of the triad are balanced, whereas the latter two are unbalanced states. The following section will formulate the similarity function in signed

Figure 7.1 The schematic illustration of the structural balance theory. (The solid lines and the broken lines represent positive relationships and negative relationships, respectively.) (A) The nodes v1, v2, and v3 are friends; (B) v1 is a friend of v2, and they have a common enemy v3; (C) the nodes v2 and v3 have a common friend, although they are enemies; and (D) the nodes v1, v2, and v3 are enemies.

networks according to Heider's balance theory. Afterward, the k-nodes update policy is recommended to prepartition networks.
7.3.1.2 Tchebycheff approach
Normally, MOEA/D is required to convert the problem of approximating the Pareto front (PF) into a series of scalar optimization problems. There are two commonly used decomposition methods [31], called the weighted sum approach and the Tchebycheff approach. Since it is hard to determine whether the PF of the maximization model is concave or nonconcave, and the weighted sum approach is suitable only when the PF is concave, we select the Tchebycheff approach as our transformation method. The decomposition formula is written as:

\min g^{te}(ind \mid \lambda, z^*) = \max_{i=1,\ldots,m} \{\lambda_i |f_i(ind) - z_i^*|\}, \quad \text{subject to } ind \in pop    (7.14)

where m represents the number of objective functions, which is equal to 2 in this chapter. The notation ind represents an individual in the population pop. Moreover, MOEA/D uses z* = (z_1*, z_2*)^T as the ideal reference point of the solutions, i.e., the optimal value of each objective function generated during the evolution of the population pop.
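A one-function sketch of the Tchebycheff scalarization in Eq. (7.14), which is how an offspring is compared with a neighbor for a given weight vector (names are illustrative):

```python
import numpy as np

def tchebycheff(f, lam, z_star):
    """g^te(ind | lambda, z*) = max_i lambda_i * |f_i(ind) - z_i*|;
    a smaller value means a better individual for this weight vector."""
    f, lam, z_star = (np.asarray(v, dtype=float) for v in (f, lam, z_star))
    return float(np.max(lam * np.abs(f - z_star)))

# Example: tchebycheff([0.6, 0.4], [0.5, 0.5], [1.0, 0.8]) == 0.2
```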

7.3.2 Objective functions
Due to the differences between the structures of signed and unsigned networks, different objective functions should be adopted for them. For unsigned networks, GMOEA-net takes formula (7.8) as its objective function. In order to build a multiobjective optimization model for signed networks, we first describe the first objective function in detail. In unsigned networks, the internal edge density and the external edge density can be used to shed light on the degree of connectivity within and between communities [18]. For signed networks, however, we require that both the positive edges within communities and the negative edges between communities are close-knit. Based on this, we choose the positive edge density within communities and the negative edge density between communities to evaluate the degree of connectivity within and between communities. Thus, a function, called the edge density (ED), is presented to measure the degree of connectivity of a community C in signed networks:

ED_C = \frac{k_C^{in\_pos}}{n_C (n_C - 1)} + \frac{k_C^{ext\_neg}}{n_C (n - n_C)}    (7.15)

where k_C^{in_pos} = \sum_{i,j \in C} 1\{A_{ij} > 0\} represents the sum of the positive degrees of the nodes in community C, and 1\{A_{ij} > 0\} is an indicator function that equals 1 when the condition in braces is true and 0 otherwise. Analogously, k_C^{ext_neg} = \sum_{i \in C, j \notin C} 1\{A_{ij} < 0\} represents the number of negative edges between communities. In addition, n_C represents the number of nodes in community C, and n is the number of nodes in the network G. Then, the first objective function is formulated as follows:

ED = \frac{1}{l} \sum_{i=1}^{l} ED_{C_i}    (7.16)

where l is the number of communities. Furthermore, we choose the signed modularity [52], denoted by SQ, as the second objective function. SQ is an extension of the modularity Q [7] proposed by Newman for handling unsigned networks. The formula is defined as:

SQ = \frac{1}{2m^+ + 2m^-} \sum_{i,j \in V} \Big( A_{ij} - \Big( \frac{d_i^+ d_j^+}{2m^+} - \frac{d_i^- d_j^-}{2m^-} \Big) \Big) \delta(C_i, C_j)    (7.17)

where m^+ and m^- represent the number of positive edges and the number of negative edges in the signed network, respectively. In addition, d_i^+ (or d_i^-) represents the positive (or negative) degree of the i-th node, with d_i^+ = \sum_{j \in V} 1\{A_{ij} > 0\}. δ(C_i, C_j) is the Kronecker function, which equals 1 if and only if C_i = C_j, and otherwise δ(C_i, C_j) = 0. Maximizing formulas (7.16) and (7.17), the multiobjective optimization model for signed networks is obtained as follows:

\max f_1 = ED = \frac{1}{l} \sum_{i=1}^{l} ED_{C_i}
\max f_2 = SQ = \frac{1}{2m^+ + 2m^-} \sum_{i,j \in V} \Big( A_{ij} - \Big( \frac{d_i^+ d_j^+}{2m^+} - \frac{d_i^- d_j^-}{2m^-} \Big) \Big) \delta(C_i, C_j)    (7.18)
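The two signed-network objectives of Eq. (7.18) can be computed from a signed adjacency matrix (entries in {-1, 0, 1}) and a label vector as in the sketch below; the small guards against empty denominators and all names are assumptions of this example rather than part of the original formulation:

```python
import numpy as np

def signed_objectives(A, labels):
    """Return (ED, SQ) of Eqs. (7.16)-(7.17) for a signed adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    n = A.shape[0]
    Ap, An = (A > 0).astype(float), (A < 0).astype(float)
    # Edge density ED (Eqs. 7.15-7.16)
    ED, communities = 0.0, np.unique(labels)
    for c in communities:
        m_c = labels == c
        n_c = m_c.sum()
        k_in_pos = Ap[np.ix_(m_c, m_c)].sum()
        k_ext_neg = An[np.ix_(m_c, ~m_c)].sum()
        ED += k_in_pos / max(n_c * (n_c - 1), 1) + k_ext_neg / max(n_c * (n - n_c), 1)
    ED /= len(communities)
    # Signed modularity SQ (Eq. 7.17)
    dp, dn = Ap.sum(axis=1), An.sum(axis=1)
    m_pos, m_neg = dp.sum() / 2.0, dn.sum() / 2.0
    expected = np.zeros_like(A)
    if m_pos > 0:
        expected += np.outer(dp, dp) / (2.0 * m_pos)
    if m_neg > 0:
        expected -= np.outer(dn, dn) / (2.0 * m_neg)
    same = (labels[:, None] == labels[None, :]).astype(float)
    SQ = ((A - expected) * same).sum() / (2.0 * m_pos + 2.0 * m_neg)
    return ED, SQ
```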

7.3.3 The construction of the similarity matrix and the k-nodes update policy
7.3.3.1 The function of node similarity
Similarity measurement plays a crucial role in clustering problems. Taking a similarity index into consideration for network clustering makes it possible to predivide networks effectively. Therefore, to measure the similarity between connected nodes,

numerous metrics have been devised by researchers, such as the Salton index [53], the Jaccard index [54], and the function of common neighbors (CN) [55,56]. In the initialization phase, GMOEA-net chooses the Salton index as the measure of node similarity. The reason is that the Salton index takes the degree of nodes into account, and the literature [57] has illustrated that using the Salton index as the similarity measure enables algorithms to yield good solutions. Given any two nodes v_i and v_j in an unsigned network, with (v_i, v_j) ∈ E, the Salton index is:

S(v_i, v_j) = \frac{|\Gamma(v_i) \cap \Gamma(v_j)|}{\sqrt{d(v_i) \cdot d(v_j)}}    (7.19)

where Γ(v_i) indicates the set of neighbor nodes that share an edge with the node v_i, together with the node v_i itself. It is obvious that the numerator in formula (7.19) is the number of common neighbors held by the two nodes. In addition, d(v_i) is the degree of the node v_i. Nevertheless, formula (7.19) applies only to unsigned networks. For a signed network, any two connected nodes may possess a positive edge or a negative edge. As a consequence, three cases exist for the states of common neighbors in terms of one relationship between v_i and v_j, which are shown in Fig. 7.2A and B.

Figure 7.2 The cases of common neighbors between the node vi and the node vj. (A) vi and vj have a positive relationship; (B) vi and vj have a negative relationship.

In terms of the first two cases (Fig. 7.2A), the black nodes should be regarded as common neighbors of v_i and v_j, since these two cases belong to the balanced structure. However, the third case in Fig. 7.2A is an unbalanced network structure, in which the black node is a friend of v_i but an enemy of v_j; this case should therefore not be counted by the similarity index. As for the negative relationship between the nodes v_i and v_j in Fig. 7.2B, only the third case is a balanced network structure, but we cannot regard this black node as a common neighbor, mainly because a black node in this position obviously cannot evaluate the degree of similarity between v_i and v_j. In real social networks, the elements in communities are frequently accompanied by negative relationships. In order to handle such circumstances, we cannot simply ignore the negative relationships between the nodes within a community. Therefore, in spite of the unbalanced status of the first case (Fig. 7.2B), the function tolerates this unbalanced status locally as long as the network presents a balanced status on the whole. Thereupon, the following formula is proposed to evaluate the degree of similarity between nodes in signed networks:

S(v_i, v_j) = \begin{cases} \dfrac{|\Gamma^+(v_i) \cap \Gamma^+(v_j)| + |\Gamma^-(v_i) \cap \Gamma^-(v_j)| + 1}{\sqrt{d(v_i) \cdot d(v_j)}}, & (v_i, v_j) \in PE \\ \dfrac{|\Gamma^+(v_i) \cap \Gamma^+(v_j)|}{\sqrt{d(v_i) \cdot d(v_j)}}, & (v_i, v_j) \in NE \end{cases}    (7.20)

where Γ^+(v_i) (or Γ^-(v_i)) represents the neighbor nodes that have a positive (or negative) relationship with the node v_i. Thus, Γ(v_i) = Γ^+(v_i) ∪ Γ^-(v_i) ∪ {v_i}. When (v_i, v_j) ∈ PE, in order to keep consistent with the Salton index in unsigned networks, the formula adds 1 to the numerator. Without loss of generality, if the network degrades to an unsigned network, Eq. (7.20) reduces to Eq. (7.19).

7.3.3.2 The k-nodes update policy
In the previous section, we introduced a similarity index to construct the similarity matrix S. Now, a prepartitioning strategy, called the k-nodes update policy, is presented based on the matrix S. First, the strategy acquires the first k nearest neighbors of a pending node according to the similarities in S. After that, the label shared by the majority of these k nearest neighbors is used to label the pending node. The update rule is presented as:

label(v_i) = \arg\max_r \{count(label(V_{k\text{-}neighbors}))\}, \quad V_{k\text{-}neighbors} \subseteq \Gamma(v_i)    (7.21)

where V_k-neighbors is the set of the first k neighbors of v_i with the highest similarities. In addition, the portion in the curly brackets counts the labels, and r is an integer label. Therefore, formula (7.21) indicates the operation of counting the labels of the nodes in V_k-neighbors

Table 7.6: The procedure of the k-nodes update policy.
Algorithm 7.3: The k-nodes update policy
Parameters: Population size: popsize; dimension of each individual: n; running times: runtimes.
Input: Population: pop; similarity matrix: S.
Output: The population pop preprocessed by the k-nodes update policy.
Sort S in descending order;
if handling unsigned networks
    Vk_neighbors takes the first half of the neighbor nodes whose similarities are greater than zero;
elseif handling signed networks
    Vk_neighbors takes the neighbor nodes whose similarities are greater than zero;
end if
while (runtimes ~= 0)
    for i = 1:popsize
        for j = 1:n
            The label of the node vj is updated by using Eq. (7.21);
        end for
    end for
    runtimes = runtimes - 1;
end while

and obtains the label r owned by the majority of those nodes. Finally, it utilizes the label r to update the label of v_i. The pseudocode of the k-nodes update policy is shown in Table 7.6.

Since Algorithm 7.3 converges very fast, runtimes is recommended to be set to 3-5.
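The node similarity of Eq. (7.20) and one sweep of the k-nodes update rule of Eq. (7.21) can be sketched as follows; the matrix-product trick for counting common neighbors and all function names are assumptions of this example rather than the authors' implementation:

```python
import numpy as np
from collections import Counter

def signed_salton(A):
    """Node-similarity matrix S of Eq. (7.20) for a signed adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    P, N = (A > 0).astype(float), (A < 0).astype(float)
    deg = (P + N).sum(axis=1)
    common_pp = P @ P.T                          # |Gamma+(i) n Gamma+(j)|
    common_nn = N @ N.T                          # |Gamma-(i) n Gamma-(j)|
    denom = np.sqrt(np.outer(deg, deg)) + 1e-12
    S = np.zeros_like(A)
    S[A > 0] = ((common_pp + common_nn + 1.0) / denom)[A > 0]   # positive edges
    S[A < 0] = (common_pp / denom)[A < 0]                       # negative edges
    return S

def k_nodes_update(labels, S, k):
    """One sweep of the k-nodes update policy (Eq. 7.21): each node takes the
    majority label among its k most similar neighbors with positive similarity."""
    labels = list(labels)
    for i in range(len(labels)):
        neighbors = [j for j in np.argsort(-S[i]) if S[i, j] > 0][:k]
        if neighbors:
            labels[i] = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
    return labels
```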

7.3.4 Evolutionary operators
7.3.4.1 The cross-merging operator based on local node sets
In the prepartitioning phase of GMOEA-net, a fast clustering of the population has already been carried out. The primary advantage of the prepartitioning is that it enables the algorithm to rapidly and accurately aggregate densely connected nodes into subcommunities. In order to merge subcommunities and obtain the right number of communities, a crossover operator called the cross-merging operator is presented as follows. First, we randomly select two different individuals; second, we pick a random gene on each individual and denote their labels by l_i and l_j; third, on the first individual we use the label l_i to relabel the genes carrying the label l_j, whereas on the other individual we use l_j to relabel the genes carrying the label l_i. Fig. 7.3 presents a schematic of the operation. Assume two individuals are coded as ind1 = {1 1 1 2 2 4 3 3} and ind2 = {1 1 2 2 2 3 4 4}. First, we randomly select the fifth gene, labeled '2', on the individual ind1, which is denoted by l_1 = 2; analogously, we randomly pick the second gene, labeled '1', on the individual ind2, which is denoted by l_2 = 1. Second, we traverse all the


Figure 7.3 A specific instance of the cross-merging operation.

genes on ind1 to find the genes with the label l_2. Finally, we relabel those selected nodes with the label l_1. The same approach is applied to the individual ind2. It is clear from Fig. 7.3 that the two subcommunities {v_1, v_2, v_3} and {v_4, v_5} are merged into a larger community {v_1, v_2, v_3, v_4, v_5}.
7.3.4.2 The mutation operator based on similarity matrix
The cross-merging operator alone, however, converges too fast and neglects misclassified nodes. In order to settle these issues effectively, a mutation operator based on the similarity matrix is proposed in GMOEA-net to correct the misclassified nodes. First, the algorithm extracts a similarity vector at random from the similarity matrix S. Next, it removes the zero elements in the selected similarity vector and adopts the roulette wheel method to pick a similarity value. Finally, the label of the pending node is updated with the label of the node corresponding to that similarity value. The merits of the mutation operator are as follows. On the one hand, by eliminating the zero elements in similarity vectors, the operator not only mitigates the interference of noise nodes, but also avoids useless exploration of the search space. On the other hand, instead of simply assigning the pending node the label of the neighbor with the highest similarity, the label is chosen in proportion to the similarity values, so that nodes with small similarities also have an opportunity to update the pending node. This avoids overconvergence and increases the local search ability of

Table 7.7: The procedure of the mutation operator.
Algorithm 7.4: The mutation operator based on similarity matrix
Input: The randomly selected individual: ind; similarity matrix: S.
Output: The individual after mutation: ind'.
1: Vboundary_nodes = find_boundary_nodes(ind);
2: Randomly select a node, denoted by vsp, from Vboundary_nodes;
3: Select the similarity vector Sv corresponding to the node vsp from S, and perform the zero-eliminating operation on Sv;
4: Select a similarity value, denoted by sim, from Sv by roulette wheel selection, and find its corresponding node vsim;
5: Use the label of vsim to update the label of the node vsp, i.e., label(vsp) = label(vsim).

GMOEA-net in the search space. Since operating on an internal node (i.e., a node whose neighbors are all located in the same community) makes no sense, the mutation operator operates only on boundary nodes. Table 7.7 presents the pseudocode of this mutation operation.
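A minimal sketch of the similarity-matrix-based mutation of Algorithm 7.4: a boundary node is picked at random, and its new label is drawn from its neighbors by roulette-wheel selection over the nonzero similarities; the helper names and the random-number handling are assumptions of this example:

```python
import numpy as np

def boundary_nodes(labels, A):
    """Nodes having at least one neighbor in a different community."""
    labels, A = np.asarray(labels), np.asarray(A)
    return [i for i in range(len(labels))
            if any(labels[j] != labels[i] for j in np.flatnonzero(A[i]))]

def similarity_mutation(labels, S, A, rng=None):
    """One application of the mutation operator of Algorithm 7.4."""
    rng = rng or np.random.default_rng()
    labels = list(labels)
    candidates = boundary_nodes(labels, A)
    if not candidates:
        return labels
    v_sp = int(rng.choice(candidates))                 # random boundary node
    sims = np.asarray(S, dtype=float)[v_sp].copy()
    sims[v_sp] = 0.0
    nz = np.flatnonzero(sims > 0)                      # zero-eliminating operation
    if nz.size == 0:
        return labels
    v_sim = int(rng.choice(nz, p=sims[nz] / sims[nz].sum()))   # roulette wheel
    labels[v_sp] = labels[v_sim]                       # label(v_sp) = label(v_sim)
    return labels
```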

7.3.5 The whole framework of GMOEA-net
In the foregoing sections, we have described the initialization of the population, the k-nodes update policy, and the operators of GMOEA-net in detail. Finally, a summary of the whole GMOEA-net procedure is given in Table 7.8.

7.4 Experiments
7.4.1 Evaluation index
Comparing the clustering results with the ground truth divisions enables us to evaluate the pros and cons of an algorithm. Consequently, in order to appraise the performance of the proposed algorithm, the well-known normalized mutual information (NMI) [22] is selected as an evaluation index. When the ground truth division of an input network is known, NMI is a very frequently used evaluation index. It works as follows: let A be the ground truth partition of a network and B the partition detected by an algorithm. The confusion matrix H is constructed jointly from the partitions A and B. Then, NMI is formulated as follows:

NMI(A, B) = \frac{-2 \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} H_{ij} \log( H_{ij} \cdot n / (H_{i\cdot} H_{\cdot j}) )}{\sum_{i=1}^{N_A} H_{i\cdot} \log(H_{i\cdot}/n) + \sum_{j=1}^{N_B} H_{\cdot j} \log(H_{\cdot j}/n)}    (7.22)

where N_A and N_B denote the number of communities in partition A and in partition B, respectively, and n is the number of nodes in the network. In addition, the element H_{ij} of the confusion matrix H gives the number of nodes shared by community i in A and

Table 7.8: The procedure of GMOEA-net.
Algorithm 7.5: The overall description of GMOEA-net
Parameters: Population size: popsize; maximum iterations: maxgen; crossover probability: pc; mutation probability: pm; the number of weight vectors in the neighborhood of each weight vector: NT.
Input: The adjacency matrix A of a network; the evenly distributed weight vectors (λ1^1, λ2^1), (λ1^2, λ2^2), ..., (λ1^popsize, λ2^popsize).
Output: The optimal partitions of the input network.
Initialization:
    Calculate the Euclidean distances between any two weight vectors, and store the NT weight vectors closest to each weight vector. That is, for i = 1, 2, ..., popsize, the indexes of the NT weight vectors closest to the i-th weight vector are N(i) = {i1, i2, ..., iNT}, so (λ1^{i1}, λ2^{i1}), (λ1^{i2}, λ2^{i2}), ..., (λ1^{iNT}, λ2^{iNT}) are the NT weight vectors closest to the vector (λ1^i, λ2^i);
    Initialize the population pop, and then apply the k-nodes update policy to process the population;
    Initialize the reference point z* = (z1, z2)^T; here z1 = max{f1} and z2 = max{f2}, because the optimization model is maximized.
Update: generations = 1;
while (generations <= maxgen)
    for i = 1:popsize
        Randomly select two indexes p, q from N(i), and get two individuals indp, indq ∈ pop;
        if rand_number < pc
            Apply the cross-merging operator to indp and indq, generating two offspring: child1, child2;
        elseif rand_number < pm
            Duplicate the individual indi and apply the mutation operator to it, generating a child denoted by child3;
        end if
        child = [child1, child2, child3];
        Calculate the values of the objective functions of these offspring, and update the reference point z*;
        for j = 1:length(child)
            For each index m ∈ N(i), calculate gte(indm | λ^m, z*) and gte(childj | λ^m, z*) by using Eq. (7.14) and compare them. If gte(childj | λ^m, z*) < gte(indm | λ^m, z*), then set indm = childj to update the neighbors of indi;
        end for
    end for
    generations = generations + 1;
end while

community j in B, while H_{i·} (H_{·j}) represents the sum of the elements in the i-th row (in the j-th column). NMI(A, B) lies in the range [0, 1]. In general, a higher value of NMI means a higher detection precision; when NMI = 1, the communities detected by an algorithm are the same as the ground truth communities, i.e., A and B are completely identical. Recently, however, Romano et al. [58] pointed out that the NMI index has a severe selection bias. Concretely speaking, it is inclined to select a clustering result that possesses

more clusters than the ground truth communities. In other words, when an algorithm uncovers far more communities than the true number, its NMI value may be higher than the NMI value of a result whose number of communities is close to the number of ground truth communities. The following experiments shed light on this conclusion. To address this deficiency, an adjusted NMI index [51,59], called the weighted normalized mutual information (WNMI), was proposed by Amelio and Pizzuti. Specifically, these authors introduce a negative exponential weight as a penalty factor in the NMI index to alleviate the selection bias. When the number of detected communities differs markedly from the true number of communities, the corresponding NMI is assigned a small weight, resulting in a poor WNMI. Therefore, WNMI can properly evaluate the difference between the detected communities and the ground truth communities, and it is calculated as:

WNMI(A, B) = NMI(A, B) \cdot \exp\Big( -\frac{|N_A - N_B|}{N_A} \Big)    (7.23)

From Eq. (7.23), we easily conclude that if N_A = N_B, then WNMI = NMI; otherwise, 0 ≤ WNMI < NMI. Even when the mixing parameter exceeds 0.5, GMOEA-net and MODPSO still maintain a good performance, with both the maximum NMI and the maximum WNMI surpassing 0.5. It is worth mentioning that, although both MOEA/D-net and GMOEA-net utilize MOEA/D to detect communities, GMOEA-net is significantly superior to MOEA/D-net on LFR networks, which directly illustrates the superiority brought by its prepartitioning strategy and evolutionary operators. Next, we apply GMOEA-net to SLFR networks. The maximum values of NMI and WNMI obtained by GMOEA-net are plotted in Fig. 7.6. From Fig. 7.6, we can conclude that GMOEA-net is relatively insensitive to P+ but is strongly influenced by the parameter P-. More specifically, when changing P- from 0.0 to 0.5, GMOEA-net always presents stable results with high accuracy. However, for 0.5 < P- ≤ 1.0, both the maximum NMI and the maximum WNMI perform unsatisfactorily and exhibit instabilities on these networks. It should be noted that, for P- > 0.5, such signed networks have, to a certain extent, lost their practical significance. For instance, if the majority of people within a community of a social network are hostile to each other, such a network should be considered impractical. Then, we take SNMOGA and MEAs-SN as comparison algorithms on SLFR networks with g = 0.5, because when g = 0.5 the communities in SLFR networks are difficult to identify. Experimental results are plotted in Fig. 7.7, in which NC represents the number of communities in the testing network, and the black dotted line represents the number of real communities in the network. In Fig. 7.7, when the community structure in the networks is quite ambiguous, i.e., g = 0.5, GMOEA-net still presents an apparent superiority. Overall, the detection accuracy of GMOEA-net seems to be less affected by P+, whereas both the mixing parameter g and the parameter P- show a significant impact on it. We can also observe that, even though MEAs-SN performs better than SNMOGA on the NMI index, it gives the worst performance on the NC index and the WNMI index. Especially when the mixing parameter g gets larger, the selection bias becomes more and more evident. In addition,


Figure 7.6 The results detected by GMOEA-net on SLFR networks.


Figure 7.7 Comparison of results on the SLFR networks.

MEAs-SN shows a severe selection bias. Specifically, the NC and WNMI values reported are those corresponding to the maximum NMI over 10 runs; however, both the NC value and the WNMI value corresponding to the optimal NMI are far from optimal. As a consequence, MEAs-SN tends to select network partitions whose number of communities far exceeds the number of ground truth communities. Finally, SNMOGA appears to be more easily affected by the parameters P+, P-, and the mixing parameter g.

7.4.5 Experiments on real-world networks
Table 7.12 shows the best NMI values obtained in 30 runs by the FastNewman (Alg1), Infomap (Alg2), GA (Alg3), Meme-Net (Alg4), MIGA (Alg5), MOEA/D-Net (Alg6), MODPSO (Alg7), APMOEA (Alg8), MOGA-net (Alg9), and GMOEA-net (Alg10) algorithms on the first four real-world networks, whose true partitions are known. Table 7.13 shows the best values of Q obtained in 30 runs on the remaining networks, whose true partitions are unknown, by the algorithms above except for MIGA, MOGA-net, and GMOEA-net, which need prior information. Here a blank entry means the algorithm has no detection result, and the symbol "d" means the algorithm cannot give a result even after many iterations. It can be seen from Table 7.12 that APMOEA finds the true partition on the karate and dolphin networks. Although the result on the football network obtained by APMOEA is slightly worse than that of MOEA/D-Net, it is better than the others. Furthermore, the solution obtained on the polbooks network is much better than those of the other algorithms. This suggests that the proposed algorithm has better performance, especially on networks with fuzzy community structure.

Table 7.12: The best values of NMI obtained by ten algorithms in 30 runs.

Network     Alg1    Alg2    Alg3    Alg4    Alg5    Alg6    Alg7    Alg8    Alg9    Alg10
Karate      0.837   0.699   0.699   0.699   1       1       1       1       1       1
Dolphins    0.814   0.587   0.667   0.687   0.814   1       1       1       1       1
Football    0.710   0.924   0.881   0.911   0.916   0.937   0.927   0.927   0.825   0.937
Polbooks    0.588   0.537   0.575   0.554   0.585   0.621   0.598   0.659   0.602   0.621

Table 7.13: The best values of Q obtained by seven algorithms in 30 runs.
Network       Alg1    Alg2    Alg3    Alg4    Alg5    Alg6    Alg7    Alg8
SFI           0.734   0.733   0.587   0.710           0.731   0.748   0.739
Netscience    d       0.931   0.858   d               0.914   0.950   0.923
Power grid    d       0.830   0.666   d               0.688   0.842   0.858
PGP           d       0.813   0.645   d               0.676   0.335   0.726
Internet      d       0.576   0.454   d               d       d       0.516

Table 7.14: The detection results on four signed real-world networks (each cell gives the maximum/average/standard deviation of the index).
Network   Index   MODPSO                  MEAs-SN                 SNMOGA                  GMOEA-net
IS1       NMI     1.0000/0.9877/0.0388    0.4830/0.4830/0.0000    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
IS1       WNMI    1.0000/0.9629/0.1175    0.0001/0.0001/0.0000    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
IS2       NMI     0.9223/0.8811/0.0241    0.4299/0.4299/0.0000    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
IS2       WNMI    0.6609/0.5968/0.0851    0.0003/0.0003/0.0000    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
SPP       NMI     1.0000/0.9847/0.0483    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
SPP       WNMI    1.0000/0.9514/0.1538    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
GGS       NMI     1.0000/1.0000/0.0000    0.6883/0.5276/0.1866    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000
GGS       WNMI    1.0000/1.0000/0.0000    0.5437/0.4089/0.1533    1.0000/1.0000/0.0000    1.0000/1.0000/0.0000

It can also be seen that GMOEA-net and MODPSO obtain the best NMIs on the karate, dolphin, and football networks, although they are slightly worse than APMOEA on the polbooks network. In summary, APMOEA, MODPSO, and GMOEA-net are excellent multiobjective community detection algorithms. Table 7.13 shows that APMOEA achieves the best value of modularity Q on the power grid network, but fails to exceed Infomap on the Internet and PGP networks. However, APMOEA still obtains better results than most of the other algorithms, especially as some of them are incapable of producing results after many iterations. Generally speaking, APMOEA detects better results on the eight real-world networks. Next, Table 7.14 gives the experimental results of the GMOEA-net algorithm on signed real-world networks, i.e., the IS1, IS2, SPP, and GGS networks. Since MODPSO can deal with small signed networks, it is used here as a comparison algorithm. Due to the small scale and the easily identified structure of these four networks, all the algorithms except MEAs-SN perform well in accuracy. In particular, GMOEA-net and

SNMOGA give completely correct network partitions. In addition, the performance of MODPSO is slightly inferior to that of SNMOGA and GMOEA-net.

7.5 Summary
This chapter has presented two community detection algorithms based on MOEAs, i.e., APMOEA and GMOEA-net [33,34]. The first algorithm, APMOEA, uses affinity propagation to solve community detection problems. First, the algorithm employs a similarity measure based on signal transmission to transform the graph clustering problem into a data clustering problem, and uses the AP method to obtain a set of preliminary partitions of the network. As the AP method has high accuracy and fast clustering speed, it can provide satisfactory preliminary partition results within a few steps. Next, these AP solutions are taken as the initial population of the multiobjective evolutionary algorithm, in which the set of Pareto-optimal solutions is updated by constantly selecting the nondominated ones from the population after crossover and mutation. Through these steps, the diversity of the population is increased, thereby improving the likelihood of obtaining better overall partition results. Finally, the two sets of solutions are merged into one, from which the final Pareto-optimal solutions are chosen. The proposed method not only takes advantage of the AP method to quickly find a set of superior initial solutions, but also uses the multipoint-search characteristic of the multiobjective evolutionary algorithm to carry out a further search toward the global optimum. Through the effective combination of these two components, the AP clustering method and a multiobjective evolutionary algorithm, the network can be quickly pretreated by data clustering and then searched by the evolutionary algorithm for globally optimal solutions. Experimental results have shown that, on most of the networks, APMOEA has a faster convergence rate as well as more accurate detection results compared with other algorithms. The second algorithm, GMOEA-net, is a more generic multiobjective network clustering technique and can identify communities in both unsigned and signed networks. In particular, GMOEA-net takes the decomposition-based multiobjective evolutionary algorithm as its framework and establishes the similarity matrix of nodes based on the structural balance theory. To provide a good initial population for the evolution phase, we propose a prepartitioning strategy based on the similarity matrix, called the k-nodes update policy, which is conducive to the evolution of the population. In addition, the design of the cross-merging operator and the construction of the mutation operator based on the similarity matrix are also novelties of this chapter. These two operators are well targeted, and not only accelerate the convergence of the proposed algorithm but also improve its accuracy. The overall framework of GMOEA-net is straightforward and explicit. Although both MOEA/D-net and MEAs-SN adopt the same

multiobjective evolutionary framework as GMOEA-net, MOEA/D-net is designed only for unsigned networks, while MEAs-SN is tailored to signed networks. The experimental results have shown that, in terms of accuracy and stability, MEAs-SN and MOEA/D-net are clearly inferior to GMOEA-net on the testing networks. Furthermore, we have conducted extensive experiments on signed and unsigned networks for GMOEA-net, and discussed the validity of its prepartitioning. The experiments demonstrate that GMOEA-net is indeed an excellent multiobjective network clustering technique.

References [1] Xie FD, Ji M, Zhang Y, Huang D. The detection of community structure in network via an improved spectral method. Physica A 2009;388(15e16):3268e72. [2] Wasserman S, Faust K. Social network analysis: methods and applications. Contemporary Sociology 1994;91(435):219e20. [3] Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature 1998;393(6684):440e2. [4] Baraba´si AL, Albert R. Emergence of scaling in random networks. Science 1999;286(5439):509e12. [5] Heider F. Social perception and phenomenal causality. Psychological Review 1944;51(6):358. [6] Girvan M, Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America 2002;99(12):7821e6. [7] Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Physical Review E 2004;69(2):026113. [8] Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America 2008;105(4):1118e23. [9] Shang RH, Luo S, Li YY, Jiao LC, Stolkin R. Large-scale community detection based on node membership grade and sub-communities integration. Physica A 2015;428:279e94. [10] Zhang JR, Wu YH, Guo YR, Wang B, Wang HY, Liu HD. A hybrid harmony search algorithm with differential evolution for day-ahead scheduling problem of a microgrid with consideration of power flow constraints. Applied Energy 2016;183:791e804. [11] Pizzuti C. A multiobjective genetic algorithm to find communities in complex networks. IEEE Transactions on Evolutionary Computation 2012;16(3):418e30. [12] Yang B, Cheung W, Liu J. Community mining from signed social networks. IEEE Transactions on Knowledge and Data Engineering 2007;19(10):1333e48. [13] Zhang DW, Xie FD, Zhang Y, Dong FY, Hirota K. Fuzzy analysis of community detection in complex networks. Physica A 2010;389(22):5319e27. [14] Fortunato S. Community detection in graphs. Physics Reports 2010;486(3e5):75e174. [15] Luccio F, Sami M. On the decomposition of networks in minimally interconnected sub-networks. IEEE Transactions on Circuit Theory 1969;16(2):184e8. [16] Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America 2004;101(9):2658e63. [17] Hu G, Heitmann JA, Rojas OJ. Feedstock pretreatment strategies for producing ethanol from wood, bark, and forest residues. BioResources 2008;3(1):270e94. [18] Fortunato S, Hric D. Community detection in networks: a user guide. Physics Reports 2016;659:1e44. [19] Wu JS, Hou YT, Jiao Y, Li Y, Li XX, Jiao LC. Density shrinking algorithm for community detection with path based similarity. Physica A 2015;433:218e28.

230 Chapter 7 [20] Fortunato S, Latora V, Marchiori M. A method to find community structures based on information centrality. Physical Review E 2004;70:056104. [21] Kernighan BM, Lin S. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 1970;49(2):2912307. [22] Wu F, Huberman BA. Finding communities in linear time: a physics approach. European Physical Journal B 2004;38:331e8. [23] Brandes U, Delling D, Gaertler M, Goerke R, Hoefer M, Nikoloski Z, Wagner D. Maximizing modularity is hard. arXiv: physics/0608255vol. 2. 2006. [24] Zhang J, Tang Q, Li P, Deng D, Chen Y. A modified MOEA/D approach to the solution of multiobjective optimal power flow problem. Applied Soft Computing 2016;47(C):494e514. [25] Ma TH, Zhou JJ, Tang ML, Tian Y, Al-Dhelaan A, Al-Rodhaan M, Lee S. Social network and tag sources based augmenting collaborative recommender system. IEICE Transactions on Information and Systems 2015;98(4):902e10. [26] Pizzuti C. Ga-net: a genetic algorithm for community detection in social networks. In: International conference on parallel problem solving from nature: PPSN X. Heidelberg: Springer-Verlag Berlin; 2008. p. 1081e90. [27] Holland JH. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press; 1992. [28] Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182e97. [29] Gong MG, Fu B, Jiao LC, Du HF. Memetic algorithm for community detection in networks. Physical Review E 2011;84(5):056101. [30] Gong MG, Ma LJ, Zhang QF, Jiao LC. Community detection in networks by using multiobjective evolutionary algorithm with decomposition. Physica A 2012;391(15):4050e60. [31] Zhang QF, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 2007;11(6):712e31. [32] Liu CL, Liu J, Jiang ZZ. A multiobjective evolutionary algorithm based on similarity for community detection from signed social networks. IEEE Transactions on Cybernetics 2014;44(12):2274e87. [33] Shang RH, Luo S, Zhang WT, et al. A multiobjective evolutionary algorithm to find community structures based on affinity propagation[J]. Physica A: Statistical Mechanics and Its Applications 2016;453:203e27. [34] Shang RH, Liu H, Jiao LC. Multiobjective clustering technique based on k-nodes update policy and similarity matrix for mining communities in social networks[J]. Physica A: Statistical Mechanics and Its Applications 2017;486:1e24. [35] Frey BJ, Dueck D. Clustering by passing messages between data points. Science 2007;315:972e6. [36] Gong M, Jiao L, Du H, Bo L. Multiobjective immune algorithm with nondominated neighbor-based selection. Evolutionary Computation, MIT Press 2008;16(2):225e55. [37] Fortunato S, Barthelemy M. Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America 2007;104:36e41. [38] Li Z, Zhang S, Wang RS, Zhang XS, Chen L. Quantitative function for community detection. Physical Review E 2008;77:036109. [39] Pizzuti C. A multiobjective genetic algorithm for community detection in networks. In: Proceedings of the 21st IEEE international conference on tools with artificial intelligence. New Jersey, USA: Newark; 2009. p. 379e86. [40] Angelini L, Boccaletti S, Marinazzo D, Pellicoro M, Stramaglia S. 

CHAPTER 8

Evolutionary computation-based multiobjective capacitated arc routing optimizations

Chapter Outline
8.1 Introduction
8.2 Multipopulation cooperative coevolutionary algorithm
8.2.1 Related works
8.2.1.1 The model of MO-CARP
8.2.1.2 The description of direction vector
8.2.2 Initial population and subpopulations partition
8.2.3 The fitness evaluation in each subpopulation
8.2.4 The elitism archiving mechanism
8.2.4.1 The external elitism archive
8.2.4.2 The internal elitism archive
8.2.5 The cooperative coevolutionary process
8.2.5.1 Construct evolutionary pool for each subregion
8.2.5.2 Crossover
8.2.5.3 Local search
8.2.5.4 The selection of offspring solutions and diversity preservation mechanism
8.2.6 The processing flow of MPCCA
8.3 Immune clonal algorithm via directed evolution
8.3.1 Antibody initialization
8.3.2 Immune clonal operation
8.3.3 Immune gene operations
8.3.3.1 The decomposition operation of the population
8.3.3.2 Gene recombination operator
8.3.3.3 Gene mutation operator
8.3.3.4 Directed comparison operator
8.3.3.5 Clonal selection operator
8.3.4 The processing flow of DE-ICA
8.4 Improved memetic algorithm via route distance grouping
8.4.1 Solutions for the timely replacement of IRDG-MAENS
8.4.2 Determine the regions which individuals belong to
8.4.3 The processing flow of IRDG-MAENS
8.5 Experiments
8.5.1 Test problems and experimental setup
8.5.1.1 MPCCA
8.5.1.2 DE-ICA
8.5.1.3 IRDG-MAENS
8.5.2 The performance metrics
8.5.2.1 The distance to the reference set (ID)
8.5.2.2 Purity
8.5.2.3 Hypervolume (HV)
8.5.3 Wilcoxon signed rank test
8.5.4 Comparison of the evaluation metrics
8.5.4.1 MPCCA
8.5.4.2 DE-ICA
8.5.4.3 IRDG-MAENS
8.5.5 Comparison of nondominant solutions
8.5.5.1 MPCCA
8.5.5.2 DE-ICA
8.5.5.3 IRDG-IDMAENS
8.6 Summary
References

8.1 Introduction
The arc routing problem (ARP) has widespread uses, including snow removal in winter, urban rubbish collection, and sprinkler path planning [1]. ARP is one of the classic combinatorial optimization problems and has many derived models. One of the most important models is the capacitated ARP (CARP), which is the closest to real life [2]. In CARP, vehicles start from a depot to serve predetermined tasks under the condition of meeting the capacity of the vehicles and finally return to the depot, aiming to complete their routes with the minimum total cost [3]. A CARP with one objective (total cost) is described as a single-objective CARP, but the single-objective CARP differs significantly from real applications. In practical applications, the relevant departments not only want to minimize the total cost but also need to consider other factors. For example, in the rubbish collection example in Troyes, France, described in the literature [4], the relevant departments not only hope to minimize the total cost but also want to complete the rubbish clean-up as soon as possible in order to assign other tasks to workers. Considering this, Lacomme et al. set up a corresponding model, which minimizes both the total cost and the makespan (the cost of the longest circuit) [4]. We consider a CARP with two objectives as a multiobjective CARP (MO-CARP). It is apparent that the two objectives conflict and cannot achieve optimal results at the same time. Therefore, there is no unique global optimal solution when solving MO-CARP, and one usually retains solutions that strike a good balance between the two objectives.

So far, researchers have proposed many effective algorithms to solve basic CARPs, including heuristic algorithms and metaheuristic algorithms. Heuristic algorithms mainly include the path-scanning algorithm [5], augment assignment algorithm [6], and Ulusoy-Split algorithm [7]. Heuristic algorithms can converge to local optimal solutions in a relatively short time, so they are effective for relatively small-scale instances. However, for large-scale instances, it is difficult for these algorithms to escape local optima, and so they are unable to achieve ideal solutions. Scholars have proposed advanced metaheuristic algorithms to address this problem. Typical metaheuristic algorithms are the simulated annealing algorithm for salting routes in winter [8], the Tabu search algorithm [9], the guided local search algorithm [10], the memetic algorithm [11], the memetic algorithm based on extended neighborhood search [12], and cooperative coevolution with route distance grouping for large-scale CARPs [13]. These metaheuristic algorithms show advantages in efficiency, solution quality, and stability when solving basic CARPs. However, solving MO-CARP is much more difficult because it has a higher degree of complexity and its solutions are more diverse. Overall, relatively few algorithms have been put forward to solve MO-CARP. In 2006, Lacomme et al. first proposed an effective genetic algorithm (LMOGA) for these problems. LMOGA combines fast nondominated sorting with a selection strategy based on the crowding distance and takes this as one of the important steps of the algorithm. Lacomme et al. compared the algorithm with single-objective algorithms in terms of the quality of the achieved solutions and the computation speed [4]. In 2011, Mei et al. put forward a more effective algorithm, namely the decomposition-based memetic algorithm (D-MAENS). It adopts the framework of the multiobjective evolutionary algorithm based on problem decomposition and embeds the MAENS algorithm for the single-objective CARP. Experimental results show that D-MAENS is better than LMOGA [14]. Recently, Shang et al. improved the D-MAENS algorithm in terms of offspring update and offspring distribution, and proposed an improved decomposition-based memetic algorithm (ID-MAENS). Moreover, ID-MAENS adds an elite strategy which is conducive to the retention of good solutions [15]. The experimental results show that ID-MAENS can obtain better nondominated solutions than other existing algorithms when solving MO-CARP. At the same time, Shang et al. proposed a multipopulation cooperative coevolutionary algorithm for MO-CARP. This algorithm applies a variety of elite storage mechanisms and adopts an evolutionary strategy and a local search strategy based on an extended neighborhood. Experimental results show that this algorithm has better performance and faster convergence speed [16]. In this chapter, we present three nature-inspired algorithms for multiobjective capacitated arc routing problems. These are the multipopulation cooperative coevolutionary algorithm for MO-CARP (MPCCA) [16], the immune clonal algorithm based on directed evolution for MO-CARP (DE-ICA) [17], and the improved memetic algorithm based on route distance grouping for MO-CARP (IRDG-MAENS) [18].

MPCCA uses a divide-and-conquer method to decompose the whole population into multiple subpopulations according to different direction vectors. These subpopulations evolve separately in each generation, and adjacent subpopulations can share their individuals in the form of cooperative subpopulations. Second, multiple subpopulations are used to search different objective subregions at the same time, so the individuals in each subpopulation have different fitness functions, and each subregion can be modeled as a single-objective CARP (SO-CARP). The improved MAENS method is used to search each objective subregion for each SO-CARP. Third, an internal elitism archive is used for each subregion to build an evolutionary pool, which greatly speeds up convergence. Finally, the fast nondominated sorting and crowding distance of NSGA-II are used to select the offspring and maintain diversity.

DE-ICA first adopts the framework of an immune clonal algorithm. DE-ICA expands the scale of the initial antibody population in the initialization process and increases the diversity of antibodies. Second, in the immune gene operations, DE-ICA is combined with a decomposition strategy. Antibodies are classified to perform immune gene operations, which helps the antibody subpopulations share neighborhood information in a timely manner. Third, DE-ICA uses a new directed comparison operator to construct the total population, so that the population evolves toward better antibodies and the quality of antibodies improves. The experimental results show that the DE-ICA algorithm can obtain better nondominated solutions, especially on large-scale instances.

The last approach is based on the cooperative coevolution (CC) algorithm with route distance grouping (RDG-MAENS) recently proposed by Mei et al. [13]. Although the method of Mei et al. has proved to be superior to previous algorithms, IRDG-MAENS identifies several remaining disadvantages and proposes solutions to overcome them. First, while route distance grouping (RDG) is used to find potentially better solutions, the solution obtained from the decomposed problem in each generation is not necessarily the best one, and the best solution found so far is not used when solving the current generation. Second, determining which subpopulation an individual belongs to only by distance can cause an imbalance in the number of individuals in different subpopulations and in the allocation of computing resources. Third, the method of Mei et al. is only used to solve the single-objective CARP. To overcome the above problems, this chapter proposes to improve RDG-MAENS by immediately updating the solutions and applying them, through the shared regions, to solve the current decomposed problems, and a fast and simple allocation scheme based on the route direction vectors is proposed to determine the decomposed problem to which each route belongs. Finally, the improved algorithm is combined with the improved decomposition algorithm to solve the multiobjective large-scale CARP (LSCARP). The experimental results show that the improved RDG-MAENS can achieve better results on both the single-objective LSCARP and the multiobjective LSCARP.


8.2 Multipopulation cooperative coevolutionary algorithm
8.2.1 Related works
8.2.1.1 The model of MO-CARP
CARP was proposed by Golden and Wong in 1981 [19]. Given an undirected and connected graph, including a series of task edges and a special vertex called the depot, several vehicles with the same capacity start from the depot to service those task edges and then return to the same depot. The goal of CARP is to determine a reasonable scheme with the minimum total cost on the conditions that all task edges should be serviced and each task edge should only be serviced once by one vehicle. Fig. 8.1 shows a simple scheme for CARP including three vehicle routes in total, in which the straight line indicates the task edge, the dotted line indicates the nontask edge, the arrow direction is the vehicle traveling direction, the red node represents the depot, and the black node represents the intersection point between different edges. In many real-world applications, however, some other factors must be taken into account in addition to the total route consumption. In the garbage cleaning example of Troyes city in Ref. [4], all the vehicles have to leave the depot at the same time. In order to improve efficiency, the sanitation department wants the entire garbage clean-up to end as early as possible. With the above considerations, the authors ignored the parameter of vehicle speed in the modeling process and used a second objective, the makespan (the cost of the longest route), to reduce the duration, based on the assumption that the route consumption is proportional to the time [4]. For easy understanding, we define the following functions and symbols in an MO-CARP model, given a graph G(V, E):
1. V = {v0, v1, ..., vn} is the collection of vertices in the graph G, where v0 is the depot and the remaining nodes represent the cross-points.
2. E = {(vi, vj) | vi ∈ V, vj ∈ V, i ≠ j} is the set of edges. For each e ∈ E, there are three non-negative attributes: the demand d(e), the service cost s(e), and the traveling cost c(e).
3. ER = {e ∈ E | d(e) > 0} is the set of task edges.

Figure 8.1 A simple scheme for CARP and its coding.

4. For each edge task (vi, vj), two positive integers t1 and t2 are assigned, one for each direction.
5. A CARP solution can be represented as a set of routes x = (T1, T2, ..., Tm), where Tk = (0, tk1, tk2, ..., tk|Tk|, 0) is the task sequence of the k-th route, 0 denotes the depot, and |Tk| denotes the total number of task edges serviced by the k-th vehicle.
6. For each route Tk, its total cost and total demand can be calculated as follows:

$$\operatorname{cost}(T_k)=\sum_{i=1}^{|T_k|} s\!\left(t_k^i\right)+\sum_{i=0}^{|T_k|}\operatorname{dist}\!\left(t_k^i,\,t_k^{i+1}\right),\qquad d(T_k)=\sum_{i=1}^{|T_k|} d\!\left(t_k^i\right) \tag{8.1}$$

where tk0 = tk(|Tk|+1) = 0 denotes the depot and the function dist(t1, t2) is the shortest distance (with the minimum traveling cost) from t1 to t2. At this point, according to the above definitions, the MO-CARP model can be established:

$$\begin{cases}
\min\ f_1(x)=\sum_{k=1}^{m}\operatorname{cost}(T_k)\\
\min\ f_2(x)=\max_{1\le k\le m}\operatorname{cost}(T_k)\\
\text{s.t.}\quad \sum_{i=1}^{m}|T_i|=|E_R|\\
\qquad\ \ T_i\cap T_j=\varnothing,\ \ i\ne j,\ \ i,j=1,2,\ldots,m\\
\qquad\ \ d(T_k)\le Q,\ \ 1\le k\le m
\end{cases} \tag{8.2}$$

where f1 is the total cost of all vehicle routes and f2 is the makespan, |ER| is the total number of task edges, and Q is the capacity of each vehicle. In Fig. 8.1, each task edge is assigned two integers, one for each direction. The solution of the scheme in Fig. 8.1 can be denoted as x = (0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0).
8.2.1.2 The description of direction vector
The direction vector of x can be denoted by λ = [λ1, λ2, ..., λn]T, where λi is the cosine of the angle δi formed by the solution vector of x in objective space with the i-th axis. R = [f1min, f2min, ..., fnmin]T is the reference point, where fimin represents the minimum value over all individuals in the i-th dimension of the objective space. Obviously, the direction vector is a unit vector and the sum of the squares of its components is equal to 1. The direction vector of x can be expressed as follows:

$$\lambda_i=\frac{\left|f_i(x)-f_i^{\min}\right|}{\sqrt{\sum_{j=1}^{n}\left(f_j(x)-f_j^{\min}\right)^2}},\qquad 1\le i\le n \tag{8.3}$$
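To make the model concrete, the following is a minimal Python sketch of how a candidate solution could be evaluated under Eqs. (8.1)-(8.3). It is an illustration only: the solution representation (a list of routes, each a list of task IDs) and the helper tables serv_cost, demand, and dist are hypothetical stand-ins for the instance data, not part of the algorithms described in this chapter.

```python
import math

def route_cost(route, serv_cost, dist, depot=0):
    """Eq. (8.1): service cost of the tasks plus deadheading between consecutive stops."""
    stops = [depot] + list(route) + [depot]
    service = sum(serv_cost[t] for t in route)
    travel = sum(dist[(stops[i], stops[i + 1])] for i in range(len(stops) - 1))
    return service + travel

def evaluate(solution, serv_cost, demand, dist, capacity):
    """Eq. (8.2): returns (total cost f1, makespan f2) and checks the capacity constraint."""
    costs = [route_cost(r, serv_cost, dist) for r in solution]
    feasible = all(sum(demand[t] for t in r) <= capacity for r in solution)
    return sum(costs), max(costs), feasible

def direction_vector(f, f_min):
    """Eq. (8.3): unit direction vector of a solution relative to the reference point."""
    diffs = [abs(fi - fmi) for fi, fmi in zip(f, f_min)]
    norm = math.sqrt(sum(d * d for d in diffs)) or 1.0
    return [d / norm for d in diffs]
```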

The direction vector is an important concept in analytic geometry. There already exist approaches using various types of direction vectors, hyperspheres, or spherical coordinates in the literature [20-22]. In 2003, Messac et al. proposed the normal constraint (NC) method for MOPs [22]. In this NC method, the so-called "Utopia line" is drawn between two anchor points in objective space, and then the "Utopia line" is divided into multiple segments, resulting in multiple points. One of the generic points intersecting the segments is used to define a normal to the "Utopia line," which is called the normal line. This normal line is used to reduce the feasible space. By translating the normal line, a corresponding set of solutions will be generated. Hughes proposed a multiple single-objective Pareto sampling method (MSOPS) and its improved version MSOPS-II in 2007 [21]. MSOPS first generates a set of a-priori direction vectors, and then each individual in the population is evaluated under each direction vector, which indicates how well the population member satisfies the range of target conditions. The key advantage of this algorithm is that it does not rely on Pareto ranking to provide selective pressure. In 2009, Kramer and Koch proposed a novel evolutionary optimization technique with a geometric-based selection scheme [23]. This scheme, called rake selection, is designed to produce approximately equidistant solutions on the Pareto front. The rakes lie equidistantly in the objective space and guide the evolutionary selection process. In 2010, Meza et al. used the multiobjective differential evolution algorithm with spherical pruning to design continuous controllers [24]. The idea of spherical pruning can be understood as follows: if the decision maker encounters any desired solution, the algorithm searches for the nearest nondominated solution in the PF in any possible direction using discrete arc increments. In 2011, Batista et al. proposed the concept of cone ε-dominance [20], which is a variant of ε-dominance. Depending on the hypergrid, several viable solutions may be lost in ε-dominance. However, cone ε-dominance uses a mechanism to control the hypervolume dominated by a specific cone. Experimental validation of the proposed cone ε-dominance shows a significant improvement in the diversity of solutions over both the regular Pareto-dominance and ε-dominance. In 2007, Zhang and Li proposed a multiobjective evolutionary algorithm based on decomposition (MOEA/D) [25], in which a set of weight vectors is used to decompose an MOP into a number of scalar optimization subproblems, which are optimized simultaneously. Direction vectors and weight vectors both assign weights to all the objective functions, so they have the same functionality in aggregation methods. However, their physical meanings are different. A weight vector gives the weights of a weighted sum, whereas a direction vector denotes the direction of the solution vector. In addition, weight vectors are distributed on a hyperplane, but direction vectors are distributed on a hypersphere. Recently, a coevolutionary multiobjective optimization algorithm based on direction vectors (DVCMOA) was proposed for MOPs. The main idea of DVCMOA is to solve MOPs by dividing the entire population into several subpopulations on the basis of the initial direction vectors in the objective space, and individuals are classified according to different direction vectors. An example of an MOP based on the direction vector is shown in Fig. 8.2.

Figure 8.2 An illustration of MOP based on the direction vector.

In Fig. 8.2, the yellow points are the initial individuals, the purple points are the final nondominated solutions, and R = [min(f1(x)), min(f2(x))]T is the reference point. Supposing there are an infinite number of direction vectors, the objective space is divided into infinite subregions. For each direction vector through R, there is always one point which is the closest to R in this objective subregion. In Fig. 8.2, the three purple points are, respectively, the closest in directions λi-1, λi, and λi+1. All the purple points form a set J. Obviously, the nondominated set in J is the PF [26]. Similarly, for maximization problems, the reference point R can be selected as [max(f1(x)), max(f2(x))]T. The goal of MOP based on direction vectors is to find the optimal solution along each direction vector.

8.2.2 Initial population and subpopulations partition
Based on the model of CA, MPCCA uses a set of direction vectors to divide the entire population into different subpopulations. The size of the population is twice the number of subpopulations, so two individuals in the current population correspond to each subpopulation. MPCCA assigns fitness to different subpopulations, and an individual's fitness is linked with a reference point in an implicit way. These subpopulations evolve separately in each generation and merge together to reassign representatives to each subpopulation before starting each generation. In particular, the adjacent subpopulations can share their individuals in the form of cooperative subpopulations. Incorporated with various features like multielite archiving, neighbor sharing, and the fast nondominated sorting and crowding distance approach of NSGA-II, MPCCA is capable of maintaining archive diversity and guaranteeing fast convergence in the evolution.

In the process of population initialization, 2N nonclone individual solutions are generated and inserted into population X. As described in the problem definition of MO-CARP, each individual is denoted as x = (T1, T2, ..., Tm), where Tk = (0, tk1, tk2, ..., tk|Tk|, 0), and t is the integer number assigned to the task edge. Most MOEAs start from an initial population of random individuals. Including a few good individuals, however, can help to accelerate the convergence speed. Therefore, MPCCA uses three heuristic methods to produce three elite individuals during initialization: path-scanning [27], augment-merge [8], and Ulusoy's heuristic [7]. The remaining individuals are all generated randomly.
In MPCCA, the whole objective space is divided into a number of subregions. By assigning different fitness, the algorithm can conduct local search in different subregions of the objective space. MPCCA maintains multiple subpopulations, each for a separate objective subregion. To construct the N subpopulations, we need N uniformly distributed direction vectors. Because MO-CARP is a two-objective optimization problem, the initialization procedure of direction vectors for two objectives is given here. In the initialization program, the angle between the direction vector λ1 and the axis f1 is 0, and the angle between the direction vector λN and the axis f1 is π/2. We define the progressive angle δ = 0.5π/(N-1), so the direction vectors increase uniformly by the fixed angle δ. The detailed initialization steps are given in Table 8.1.

Table 8.1: Algorithm: The production of uniformly distributed direction vectors.
Input: The number of direction vectors N.
Output: A set of uniformly distributed direction vectors {λ1, λ2, ..., λN}, where λi = [λi1, λi2]T.
Set u1 = 0, δ = 0.5*π/(N-1);
For i = 1 to N do
  Set u2 = 0.5*π - u1, λi1 = cos u1, λi2 = cos u2;
  Set u1 = u1 + δ;
End for
End

Through the N evenly distributed direction vectors, the entire objective space has been divided into N subregions. Next, we need to assign these 2N individuals to N different subpopulations according to the above subregions. The ideal allocation is to determine the attribution of individual xi by the closeness between the solution vector of xi and the N evenly distributed direction vectors. With this strategy, however, the individuals may be distributed nonuniformly. For example, this situation occurred in DVCMOA, where the individuals in intensive subregions are retained and the individuals in sparse subregions are deemphasized. Therefore, this ideal subpopulation partition mechanism is not suitable. In order to ensure that each subpopulation has the same number of individuals, MPCCA designs a fast and simple allocation scheme. According to the definition of these N direction vectors, as i increases, the angles between the direction vectors and the axis f1 increase, while the angles with f2 decrease. In other words, subpopulation 1 focuses on the area with a low f2, and subpopulation N focuses on the area with a low f1.

Table 8.2: Algorithm: The assignment of individuals to different subpopulations.
Input: An unsorted population X = {x1, x2, ..., x2N}.
Output: A sorted population Y = {y1, y2, ..., y2N}.
For i = 1 to 2N-1 do
  For j = i+1 to 2N do
    If $\dfrac{f_1(x_j)-f_1^{\min}}{\sqrt{\left(f_1(x_j)-f_1^{\min}\right)^2+\left(f_2(x_j)-f_2^{\min}\right)^2}} > \dfrac{f_1(x_i)-f_1^{\min}}{\sqrt{\left(f_1(x_i)-f_1^{\min}\right)^2+\left(f_2(x_i)-f_2^{\min}\right)^2}}$ then
      Swap xi and xj;
    End if
  End for
End for
For i = 1 to 2N do
  Set yi = xi;
End for
End

Based on this idea, MPCCA uses the algorithm shown in Table 8.2 to sort the 2N individuals based on the angles between the individuals and the axis f1. After the sorting is completed, the (2i-1)-th and 2i-th individuals in the sorted population are assigned to subpopulation i. In Fig. 8.3, there are five evenly distributed direction vectors λ1-λ5, and the population X includes 10 individuals A-J. The distribution of solutions in MPCCA is: B, (F) → subpop 1; D, (G) → subpop 2; A, (C) → subpop 3; H, (I) → subpop 4; E, (J) → subpop 5.

Figure 8.3 The division of subpopulations in MPCCA.

In this subpopulation partition mechanism, the computing resources are evenly distributed. It increases the diversity of the PF at the cost of slowing down the convergence speed. Hence, we use two elitism archive strategies to speed up the convergence. The elitism archiving mechanism is given in Section 8.2.4.
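The following Python sketch illustrates, under simplifying assumptions, the two steps just described: generating the N uniformly distributed direction vectors (Table 8.1) and sorting the 2N individuals by their angle with the f1 axis so that each subpopulation receives exactly two of them (Table 8.2). The objective values and the reference point are hypothetical inputs, not part of the original pseudocode.

```python
import math

def make_direction_vectors(n):
    """Table 8.1: n unit vectors whose angles with the f1 axis grow from 0 to pi/2."""
    step = 0.5 * math.pi / (n - 1)
    return [(math.cos(i * step), math.cos(0.5 * math.pi - i * step)) for i in range(n)]

def partition_population(f_values, f_min, n_subpops):
    """Table 8.2: sort indices by decreasing cosine with the f1 axis, then give
    two consecutive individuals to each subpopulation."""
    def cos_with_f1(f):
        d1, d2 = f[0] - f_min[0], f[1] - f_min[1]
        norm = math.hypot(d1, d2) or 1.0
        return d1 / norm

    order = sorted(range(len(f_values)), key=lambda i: cos_with_f1(f_values[i]), reverse=True)
    return [order[2 * k: 2 * k + 2] for k in range(n_subpops)]

# Tiny usage example with made-up objective values for 2N = 10 individuals.
vectors = make_direction_vectors(5)
subpops = partition_population(
    [(10, 90), (20, 80), (35, 60), (50, 50), (60, 40),
     (70, 35), (80, 30), (90, 25), (95, 20), (99, 15)],
    f_min=(10, 15), n_subpops=5)
```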

8.2.3 The fitness evaluation in each subpopulation
As MPCCA adopts multiple subpopulations, the individuals in different subpopulations have different fitness functions. According to Wiegand [28], the fitness of an individual mainly depends on its ability to collaborate with other subpopulations in CA. There are many interactions among subpopulations, and changes in one subpopulation may cause changes in other subpopulations or even the entire population. Based on the above ideas, the PBI (penalty-based boundary intersection) [25] is used as the standard of evaluation. In MPCCA, the reference point R is updated by every individual. If R is changed, the fitness of all individuals within the population will be affected. As shown in Fig. 8.4, λ is the direction vector of the subpopulation to which individual A belongs, d1 is the projection of F(xA) - R onto the direction λ, and d2 is the offset distance between F(xA) - R and λ. The fitness of individual A in the direction of λ can be formulated as follows:

$$\operatorname{fitness}(x_A)=d_1+\sigma\,d_2,\qquad d_1=\left\|\left(F(x_A)-R\right)^{T}\lambda\right\|,\qquad d_2=\left\|F(x_A)-\left(R+d_1\lambda\right)\right\| \tag{8.4}$$

where σ is the penalty parameter and its value is usually 5. From Formula (8.4), we can see that the fitness of one individual depends on the other individuals implicitly. Once R changes, all individuals' direction vectors will change accordingly.

Figure 8.4 The fitness assignment in the direction of λ.

Moreover, different individuals located in different subpopulations are not comparable, and the fitness function reflects the interaction among these subpopulations.
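As a rough illustration of Eq. (8.4), the following sketch computes the PBI value of an individual given its objective vector, the current reference point, and the unit direction vector of its subpopulation. The vector layout and the helper name pbi_fitness are illustrative assumptions, not the book's code.

```python
import math

def pbi_fitness(f, ref, direction, sigma=5.0):
    """Penalty-based boundary intersection (Eq. 8.4).
    d1: length of the projection of F(x) - R onto the unit direction vector.
    d2: distance from F(x) to the line R + d1 * direction."""
    diff = [fi - ri for fi, ri in zip(f, ref)]
    d1 = abs(sum(di * li for di, li in zip(diff, direction)))
    proj = [ri + d1 * li for ri, li in zip(ref, direction)]
    d2 = math.sqrt(sum((fi - pi) ** 2 for fi, pi in zip(f, proj)))
    return d1 + sigma * d2

# Example: an individual with objectives (320, 180), reference point (300, 150),
# evaluated along the 45-degree direction vector.
value = pbi_fitness((320.0, 180.0), (300.0, 150.0),
                    (math.sqrt(0.5), math.sqrt(0.5)))
```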

8.2.4 The elitism archiving mechanism
The elitism archiving mechanism is an evolutionary strategy commonly used in EAs. By retaining the best individual in the current population, the elitism archiving mechanism can accelerate the convergence of the algorithm [14,29]. This suggests that these high-fitness individuals (elite individuals) play an important role in the evolution of the population. In MPCCA, two elitism archiving mechanisms are used: the external elitism archive and the internal elitism archive. The difference between the external elitism archive and the internal elitism archive is whether the elitism archive participates in the evolution. More details about these two archiving mechanisms are given next.
8.2.4.1 The external elitism archive
Stored externally, the external elitism archive does not participate in evolution, and the external elitism archive is mainly used to save the current nondominated individuals during the evolutionary process. In the initialization phase, the external elitism archive X* is an empty set. Whenever a new individual x is generated, we first determine whether x dominates any individuals of the current external elite population. If this situation exists, the individuals dominated by x will be removed from X*, and x will be added to X*. If there is no individual in X* dominated by x, then we determine whether there are individuals in X* that dominate x. If there is no individual dominating x, then x and all individuals in X* are mutually nondominated. x will then become a new elite individual and will be added to X*. At the end of the algorithm, X* is just the Pareto-optimal set. The detailed steps are shown in Table 8.3.

Table 8.3: Algorithm: The external elitism archive.
Input: An individual x.
Output: An external elitism archive X*.
If (|X*| == 0) then
  Add x to X*;
Else if (x dominates any archive member) then
  Delete dominated members in X*;
  Add x to X*;
Else if (x is dominated by any archive member) then
  Exit;
Else
  Add x to X*;
End if
End
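A minimal Python sketch of the archive update in Table 8.3 is given below, assuming a two-objective minimization problem and a simple list-based archive; the dominance helper is an illustrative assumption.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_external_archive(archive, x):
    """Table 8.3: keep only mutually nondominated solutions, discarding x if dominated."""
    if any(dominates(member, x) for member in archive):
        return archive                      # x is dominated, archive unchanged
    survivors = [m for m in archive if not dominates(x, m)]
    survivors.append(x)                     # x is nondominated, so it joins the archive
    return survivors

# Example: archive holds objective vectors (total cost, makespan).
archive = [(400, 120), (380, 150)]
archive = update_external_archive(archive, (390, 110))
```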

8.2.4.2 The internal elitism archive
Stored internally, the internal elitism archive participates in evolution. It is mainly used to speed up the convergence of the algorithm. The size of the internal elitism archive is fixed at N, and the i-th individual in the internal elitism archive Z* corresponds to the current best individual in the direction λi. In MPCCA, at the beginning of each iteration, these 2N individuals are resorted according to their direction vectors and evenly reassigned to N subpopulations. Although the computing resources are evenly distributed, this increases the diversity of the PF at the cost of slowing down the convergence speed. Hence, MPCCA uses the internal elitism archive to participate in the evolution process. In this way, when searching the i-th subregion of λi, the i-th individual in the internal elitism archive Z* can merge with the original individuals of subpopulation i and some other "adjacent" individuals to construct an evolutionary pool for the i-th objective subregion. The internal elite individuals play a guiding role in searching the different objective subregions. In the initialization phase, the internal elitism archive Z* is also an empty set. Whenever a new individual x is generated, we calculate the direction vector of x. Next, according to its direction vector, the objective subregion j to which x belongs is found. Making a comparison between x and the j-th individual in Z*, the winner is kept as the current best individual in the j-th objective subregion. Table 8.4 gives the details of the internal elitism archive.

Table 8.4: Algorithm: The internal elitism archive.
Input: An individual x.
Output: An internal elitism archive Z*.
Calculate the direction vector of individual x in the objective space:
$$\lambda_x=\left[\frac{\left|f_1(x)-f_1^{\min}\right|}{\sqrt{\sum_{j=1}^{2}\left(f_j(x)-f_j^{\min}\right)^2}},\ \frac{\left|f_2(x)-f_2^{\min}\right|}{\sqrt{\sum_{j=1}^{2}\left(f_j(x)-f_j^{\min}\right)^2}}\right]^{T};$$
For i = 1 to N do
  Calculate the size of |λi - λx|;
End for
The j-th subregion of the objective space to which individual x belongs has the minimum value of |λi - λx|;
If (the j-th individual in Z* is empty) then
  Add x to the j-th individual in Z*;
Else if (x is better than the j-th individual in Z*) then
  Add x to the j-th individual in Z*;
End if
End
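The internal archive of Table 8.4 can be sketched as follows: each slot i holds the best individual found so far for direction λi. The table only says "better", so tying "better" to a scoring function such as the PBI fitness of Eq. (8.4) is an assumption of this sketch, as are the data structures used.

```python
import math

def update_internal_archive(archive, x, f_x, directions, f_min, fitness_fn):
    """Table 8.4 sketch: archive[i] caches the best (individual, objectives) for direction i.
    fitness_fn(f, ref, direction) scores an objective vector along a direction (e.g., PBI)."""
    diffs = [abs(fi - mi) for fi, mi in zip(f_x, f_min)]
    norm = math.sqrt(sum(d * d for d in diffs)) or 1.0
    lam_x = [d / norm for d in diffs]                      # direction vector of x, Eq. (8.3)
    # Subregion whose direction vector is closest to that of x.
    j = min(range(len(directions)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(directions[i], lam_x)))
    if archive[j] is None or fitness_fn(f_x, f_min, directions[j]) < fitness_fn(
            archive[j][1], f_min, directions[j]):
        archive[j] = (x, f_x)                              # x becomes the best in subregion j
    return archive
```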


8.2.5 The cooperative coevolutionary process
MPCCA uses multiple subpopulations to search different objective subregions simultaneously. The individuals in each subpopulation have a special fitness function, which can be modeled as an SO-CARP. For each SO-CARP, a special evolutionary pool is first built, and then some evolutionary operators are carried out on this evolutionary pool. The first cooperative coevolutionary mechanism is to construct the evolutionary pool for each SO-CARP (or each subregion). In the framework of MOP based on the direction vector, the optimal individual in the i-th objective subregion should be close to that in the j-th objective subregion if λi is close to λj. Hence, the individuals in those subregions whose direction vectors are close to that of the current subregion should be helpful for searching the current subregion. The second cooperative coevolutionary mechanism is the fitness evaluation in local search among different subregions. As stated in Section 8.2.3, the fitness functions of different subpopulations have a weak coevolutionary relationship, because once R is updated, all individuals' fitness will change. In other words, the fitness of an individual depends on all the populations, which is a cooperative coevolution between populations. Next, we briefly describe the cooperative coevolutionary process.
8.2.5.1 Construct evolutionary pool for each subregion
The coevolution between subpopulations is emphasized in MPCCA. The entire objective space is divided into N subregions by a set of uniformly distributed direction vectors, and the 2N individuals are evenly assigned to N subpopulations according to different direction vectors. The i-th subpopulation is the representative of the i-th objective subregion. The first cooperative coevolutionary mechanism is to build an evolutionary pool for each subregion. In MPCCA, the evolutionary pool of the current objective subregion is composed of the subpopulations and current best elite individuals of the five closest objective subregions (including its own). In Fig. 8.5, the (i-2)-th, (i-1)-th, (i+1)-th, and (i+2)-th subregions are the four closest objective subregions of the i-th subregion. When we search the i-th subregion, the (i-2)-th, (i-1)-th, i-th, (i+1)-th, and (i+2)-th subpopulations and the current best individuals in subregions i-2, i-1, i, i+1, and i+2 are merged together to construct the evolutionary pool. By neighbor coevolution, the evolution of one subregion or subpopulation will result in the response of the others, which is the emphasis in cooperative CAs.
8.2.5.2 Crossover
When we search each subregion independently, an evolutionary pool associated with this subregion is built. For the i-th evolutionary pool, MPCCA applies the sequence-based crossover (SBX) operator in the evolution process.

Figure 8.5 The illustration of multisubpopulations' neighbor coevolution in MPCCA.

First, two routes T1 and T2 are randomly selected from two parents x1 and x2, one route from each parent. Second, both T1 and T2 are split into two parts randomly, for example T1 = (T11, T12) and T2 = (T21, T22). Finally, a new individual x3 is obtained through the following three changes: (1) replace T12 with T22; (2) remove the duplicated tasks; and (3) reinsert the missing tasks into the new individual.
8.2.5.3 Local search
In 2009, an MA with extended neighborhood search (MAENS) was proposed by Tang et al. [12], which is novel in terms of the utilization of a large-step local search operator, namely Merge-Split. In general, any local search method for SO-CARP can be embedded into MPCCA. MPCCA uses the Merge-Split operator due to its excellent performance. Single Insertion, Double Insertion, and Swap are the three traditional move operators for local search, which are widely used in CARP. However, these three local search operators have "small" search steps and thus are only capable of searching within a "small" neighborhood. The Merge-Split operator has an extended step size, which can easily jump out of local optimum solutions. Because CARP has a large solution space and the capacity constraints are tight, we use the local search strategy of MAENS to search each subregion in MPCCA. The detailed procedure of local search is shown in Fig. 8.6. When searching each subregion of the objective space, it is inevitable to encounter infeasible solutions. The total constraint violation of an individual x can be calculated through the following equation:

$$\operatorname{tcv}(x)=\sum_{k=1}^{m}\max\big(d(T_k)-Q,\ 0\big) \tag{8.5}$$
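The route-level crossover and the constraint violation of Eq. (8.5) can be sketched as follows. The solution encoding (a list of routes, each a list of task IDs) and the demand table are illustrative assumptions, and the repair step shown here simply appends missing tasks to the modified route, which is a simplification of the reinsertion described above.

```python
import random

def sequence_based_crossover(parent1, parent2, rng=random):
    """Recombine one route from each parent: swap route tails, drop duplicates,
    then reinsert any tasks that went missing (simplified repair)."""
    child = [list(r) for r in parent1]
    r1 = rng.randrange(len(child))
    t1 = child[r1]
    t2 = list(rng.choice(parent2))
    cut1, cut2 = rng.randint(0, len(t1)), rng.randint(0, len(t2))
    merged = t1[:cut1] + t2[cut2:]                       # replace T12 with T22
    all_tasks = {t for r in parent1 for t in r}
    seen = set()
    child[r1] = [t for t in merged if t in all_tasks and not (t in seen or seen.add(t))]
    for r_idx, route in enumerate(child):                # remove duplicates across routes
        if r_idx != r1:
            child[r_idx] = [t for t in route if not (t in seen or seen.add(t))]
    child[r1] += [t for t in all_tasks if t not in seen] # reinsert missing tasks
    return [r for r in child if r]

def total_constraint_violation(solution, demand, capacity):
    """Eq. (8.5): summed excess demand over all routes."""
    return sum(max(sum(demand[t] for t in route) - capacity, 0) for route in solution)
```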

Figure 8.6 The local search of MAENS.

When comparing two individuals, we first judge whether they are feasible. If tcv > 0, the individual is infeasible; otherwise, the individual is feasible. In MPCCA, feasible individuals take priority over infeasible ones. Given two feasible candidate solutions, the fitness evaluation along the corresponding direction vector is used as the criterion. Given two infeasible candidate solutions, the individual with the smaller tcv is considered to be the better individual.
8.2.5.4 The selection of offspring solutions and diversity preservation mechanism
In MPCCA, the correspondences between the individuals and the subpopulations are not static. Before the beginning of each iteration, the 2N parent individuals and the N offspring individuals found in the last iteration are merged together. Then the algorithm picks 2N outstanding individuals, reassigns them to N subpopulations, and goes to the next generation. In this step, MPCCA uses NSGA-II [30] as the selection mechanism for offspring solutions. Proposed by Deb in 2002, NSGA-II is one of the best MOEAs developed so far. The main idea of fast nondominated sorting is: first, sort all the solutions in the population according to the relations of domination, and assign the nondominated solutions in the first front level 1. Second, solutions dominated by fewer solutions rank before solutions dominated by more solutions, and the remaining solutions are assigned increasing levels accordingly. A lot of previous work in numerical optimization has proven that this selection mechanism based on fast nondominated ranking is better in the uniformity and broadness of the PF. Therefore, it can be applied to MO-CARP. Diversity is another important performance aspect of an MOEA. There are three common existing strategies for diversity preservation: the niching technique proposed by Srinivas and Deb [31], the cell-based method proposed by Knowles and Corne [32], and the crowding distance method proposed by Deb et al. [30]. The popular niching technique uses the sharing parameter σshare as the threshold to evaluate the distance between different solutions, and solutions with a distance less than this threshold are punished. In other words, the solutions in intensive areas face a more severe punishment than the solutions in sparse subregions. By dividing the whole objective space into many cells of the same size, the cell-based method can control the distribution of solutions by limiting the number of

solutions in each cell. In the crowding distance method, the distance between each solution and its two nearest solutions in objective space is first calculated; these distances are then normalized and summed. The larger the value, the greater the chance that the solution will be retained. Among the methods described above, the performance of the niching technique largely depends on the sharing parameter σshare, and the cell-based method largely depends on the cell size. Different from these two methods, the crowding distance method has wider applicability because it has no user-defined parameter. In summary, after the fast nondominated sorting procedure and the crowding distance method of NSGA-II, the best 2N individuals are kept to form the population X and MPCCA enters the next generation.
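For reference, a small sketch of NSGA-II's crowding distance computation (as used in the selection step above) is shown below; it assumes a list of objective vectors for one nondominated front and is not tied to the CARP encoding.

```python
def crowding_distance(front):
    """NSGA-II crowding distance for a list of objective vectors in one front.
    Boundary solutions get infinite distance; larger values mean less crowded."""
    n = len(front)
    distance = [0.0] * n
    if n == 0:
        return distance
    n_obj = len(front[0])
    for m in range(n_obj):
        order = sorted(range(n), key=lambda i: front[i][m])
        distance[order[0]] = distance[order[-1]] = float("inf")
        span = front[order[-1]][m] - front[order[0]][m]
        if span == 0:
            continue
        for k in range(1, n - 1):
            distance[order[k]] += (front[order[k + 1]][m] - front[order[k - 1]][m]) / span
    return distance

# Example: three solutions with objectives (total cost, makespan).
d = crowding_distance([(400, 120), (390, 135), (380, 150)])
```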

8.2.6 The processing flow of MPCCA
Generally speaking, this chapter studies the MO-CARP model, in which both the total cost of the routes and the cost of the longest trip (makespan) need to be optimized. Inspired by the divide-and-conquer strategy in CA, a multipopulation cooperative coevolutionary algorithm for MO-CARP is proposed. In this algorithm, the whole objective space is divided into multiple subregions by a set of uniformly distributed direction vectors, and different subregions correspond to different subpopulations. At the beginning of each iteration, all the individuals in the current population are sorted according to different direction vectors and then assigned to N subpopulations evenly. These subpopulations evolve separately, while the adjacent subgroups can share their individuals in the form of cooperative subgroups. By referencing some other evolutionary strategies, such as elitism archiving, NSGA-II, and the MAENS for SO-CARP, MPCCA shows good diversity and fast convergence. The detailed steps of MPCCA are given in Table 8.5.

8.3 Immune clonal algorithm via directed evolution
The immune clonal algorithm based on directed evolution (DE-ICA) adopts the process of the immune clonal algorithm as a framework and draws on the effective decomposition algorithm. Meanwhile, DE-ICA analyzes and improves some defects of the current algorithms for MO-CARP. Compared with other algorithms, the immune clonal algorithm has the advantages of fast convergence and global optimization [33,34]. The immune clonal algorithm uses a heuristic algorithm to obtain the initial antibody population, then evaluates the fitness of the initial antibodies and determines the clonal ratio of the antibodies by calculating the affinities between antibodies and antigens [35]. Next, the immune clonal algorithm performs immune gene operations including gene recombination and gene mutation. Finally, the immune clonal algorithm selects offspring for the next iteration according to certain principles through the clonal selection operation [30].

Table 8.5: Algorithm: A multipopulation cooperative coevolutionary algorithm for MO-CARP.
Input: An instance of MO-CARP s, the number of subpopulations N, a set of uniformly distributed direction vectors λ1, ..., λN, the maximum number of generations Gmax.
Output: A set of nondominated solutions X*.
Initialize a population X = {x1, ..., x2N}; set the external elitism archive X* = Ø and the internal elitism archive Z* = Ø;
Use the direction vector generating mechanism to generate N uniformly distributed direction vectors λ1, ..., λN;
Set it = 0;
While (it < Gmax) do
  According to the subpopulations partition mechanism, the 2N individuals are assigned to N subpopulations evenly;
  For i = 1 to N do
    Construct an evolutionary pool for the i-th SO-CARP (or the i-th subregion);
    Randomly select two individuals from the pool and apply the crossover and local search operators of MAENS to find the offspring yi;
    Update the archives Z* and X*;
  End for
  P = X ∪ Y, where Y = {y1, ..., yN} are the offspring solutions;
  Sort the individuals in P by the fast nondominated sorting procedure and crowding distance approach of NSGA-II; then let X be the top 2N solutions in the sorted P;
  Set it = it + 1;
End while
Export X*;
End

Here, the antigen represents the objective function and the constraint conditions, and the antibody is a solution that satisfies the objective function and the constraint conditions. Specific to MO-CARP, DE-ICA first initializes the antibody population. Compared with the existing algorithms for MO-CARP, DE-ICA enlarges the size of the initial population to increase the diversity of antibodies. Next, DE-ICA directly expands the population size according to the characteristics of MO-CARP, which is helpful in improving the quality of solutions and in converging to the Pareto-optimal solutions. Then, DE-ICA is combined with the decomposition algorithm. Antibodies in the population are divided into antibody subpopulations to perform the immune gene operations, which facilitates rapid convergence on the two goals. At the same time, DE-ICA puts forward a novel directed comparison operator to filter the antibodies produced in the previous process, and the selected antibodies are added into the total population as candidates for the clonal selection. Finally, DE-ICA applies the fast nondominated sorting and crowding distance method to evaluate the candidate antibodies and select offspring, which can ensure both the quality of solutions and the diversity of the antibody population. It is a very effective strategy for choosing offspring. We introduce each part of the DE-ICA algorithm below. In the following parts, the total population represents the enlarged candidate population for the next iteration.


8.3.1 Antibody initialization
The antibody initialization operation in DE-ICA adopts the path-scanning algorithm, a classic heuristic algorithm proposed by Golden in 1983 [5]. The basic principle of the path-scanning algorithm for solving CARP can be described as follows: we first establish an empty route and then insert tasks into the route according to certain principles. If the total demand of the route would exceed the capacity Q after inserting some task, then we give up trying to insert the task and the vehicle returns to the depot directly from the end of the last inserted task. The size of the population has a great influence on the solutions [33], and also results in different performances of algorithms. We describe the influence of the population size on the quality of solutions below in detail. In Fig. 8.7(1), each subproblem is assigned one solution (A, B, C, or D). According to the theory of the decomposition algorithm, A should produce an offspring solution (denoted E) with an adjacent solution. In a similar fashion, B, C, and D produce offspring solutions (sequentially denoted F, G, and H) with their adjacent solutions. By contrast, in Fig. 8.7(2), there are two representative solutions in each subproblem (A, A′, B, B′, C, C′, and D, D′). We can use the roulette method to select individuals from the representative solutions as parents, and the selected individuals produce offspring with their adjacent solutions. Obviously, it is beneficial for the algorithm to find better solutions quickly because of the selectivity of solutions in each subproblem. On the other hand, enlarging the size of the population can increase the diversity of the population to some extent, which is beneficial in improving the quality of the solutions. In practical applications, MO-CARP is usually a medium-scale or large-scale problem, and algorithms can obtain many solutions.

Figure 8.7 Influence of the population size on the solutions.

According to the principle in Fig. 8.7, one of the important factors influencing how close the obtained solutions are to the Pareto front is the population size. The existing algorithms set the population size to 60 in the process of antibody initialization. The algorithms can achieve the ideal nondominated solutions easily when solving small-scale CARP. However, for medium-scale and large-scale CARP, the quantity of solutions increases obviously and the solution space expands quickly. If the initial population size is still 60, the diversity of the initial solutions obtained by these algorithms is greatly limited and the search scope is very small, so it is difficult for these algorithms to find the nondominated solutions close to the Pareto front. As a result, DE-ICA tries to expand the scale of the initial population to maintain diversity during the process of evolution. However, if the population size is too large, it will require more computing resources. Therefore, the initial population size is set to 120 in DE-ICA to achieve better results.
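A rough sketch of the path-scanning construction described at the start of this subsection is given below. The greedy rule used here (always take the nearest feasible task) is only one of the "certain principles" mentioned in the text; the task and distance structures are illustrative assumptions.

```python
def path_scanning(tasks, demand, dist, capacity, depot=0):
    """Greedily build routes: keep adding the nearest unserved task that still fits;
    when no task fits, close the route at the depot and start a new one."""
    unserved = set(tasks)
    routes = []
    while unserved:
        route, load, position = [], 0, depot
        while True:
            feasible = [t for t in unserved if load + demand[t] <= capacity]
            if not feasible:
                break                                   # vehicle returns to the depot
            nxt = min(feasible, key=lambda t: dist[(position, t)])
            route.append(nxt)
            load += demand[nxt]
            position = nxt
            unserved.remove(nxt)
        if not route:          # safety guard: a task larger than the capacity cannot be served
            raise ValueError("task demand exceeds vehicle capacity")
        routes.append(route)
    return routes
```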

8.3.2 Immune clonal operation
In the theory of the artificial immune system, the clonal operation is carried out by reproducing the antibodies in the population according to a certain proportion. The clonal operation makes various gene operations possible and facilitates antibodies sharing information [36,37]. The immune clonal algorithm usually regards the total cost of the whole solution as a key evaluation index of the affinity between antibody and antigen when solving single-objective CARP. For instance, we usually define the affinity between the antibody si and the antigen as:

$$\operatorname{Aff}(s_i)=\left(\frac{\text{lower\_bound}}{\text{total\_cost}(s_i)}\right)^{3} \tag{8.6}$$

where lower_bound represents the lower bound of the test instance and can be obtained from the reference literature. The larger the affinity value, the smaller the total cost of the solution. The clonal proportion depends not only on the affinity between antigen and antibody, but also on the affinity between antibodies. The greater the affinity between antibodies, the higher the similarity between antibodies and the easier it is for the antibodies to restrain each other. The immune clonal algorithm usually sets the clonal proportion based on the two above affinity values. In the process of solving MO-CARP, the goals are to minimize the total cost and the cost of the longest circuit at the same time. The calculation of affinities is very complex, so the calculation of the antibody clonal ratio is also complex, especially when solving large-scale CARP. Therefore, in order to guarantee the speed and the simplicity of the calculation in DE-ICA, we directly clone the nondominated solutions in the initial population at the ratio 3, and then the clonal individuals are added into the initial population. This increases the proportion of the good solutions in the initial population, which is beneficial in improving the quality of the solutions.
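A minimal sketch combining the single-objective affinity of Eq. (8.6) with the simplified cloning rule used by DE-ICA (copy each nondominated antibody a fixed number of times into the population) follows; the clone ratio of 3 is taken from the text, everything else is an illustrative assumption.

```python
def affinity(total_cost, lower_bound):
    """Eq. (8.6): affinity grows as the solution's total cost approaches the lower bound."""
    return (lower_bound / total_cost) ** 3

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def clone_nondominated(population, objectives, clone_ratio=3):
    """Append clone_ratio copies of every nondominated antibody to the population."""
    nondominated = [ind for ind, f in zip(population, objectives)
                    if not any(dominates(g, f) for g in objectives if g is not f)]
    return list(population) + [ind for ind in nondominated for _ in range(clone_ratio)]
```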


8.3.3 Immune gene operations
Immune gene operations usually include gene recombination and mutation. Immune gene operations can increase the diversity of the population, decrease the affinity between antibodies, and improve the quality of the solutions. The decomposition algorithm is an effective method for MO-CARP [13], so DE-ICA uses this effective strategy as a reference framework in the immune gene operations.
8.3.3.1 The decomposition operation of the population
The cooperative coevolution algorithm was first put forward by Potter et al. [38]. The main idea is to divide a problem into many subproblems and then solve the subproblems independently. The application of the decomposition strategy in DE-ICA takes examples from the literature [14,25]. Uniformly distributed weight vectors λ1, ..., λR decompose the MO-CARP with two goals into R single-objective subproblems. The function expression for the i-th weight vector is:

$$F_i(x)=\lambda_{i1}\, f_1(x)+\lambda_{i2}\, f_2(x),\qquad 1\le i\le R \tag{8.7}$$

In this formula, R is set to 60 and both f1(x) and f2(x) are normalized [14]. Obviously, λi is a two-dimensional vector, which represents the weight vector of the i-th subproblem and can be expressed as λi = ((i-1)/(R-1), 1-(i-1)/(R-1)). When solving each of the single-objective subproblems, DE-ICA first assigns the new population, composed of the cloned nondominated solutions and the initial solutions, to the corresponding subproblems according to a certain principle. Considering the expression of Formula (8.7), the principle by which DE-ICA assigns the individuals is to sort the individuals in ascending order of the second objective function. For the first iteration, DE-ICA assigns the (2i-1)-th and 2i-th sorted individuals to the i-th subproblem. Because the nondominated solutions have been cloned, identical individuals exist in the new population. DE-ICA assigns identical individuals to the same subproblem to increase their probability of being selected as parents, which is also helpful in improving the quality of solutions. Table 8.6 gives the pseudocode of the decomposition operator in DE-ICA.
8.3.3.2 Gene recombination operator
Just like the MPCCA algorithm, the DE-ICA algorithm selects the effective SBX from the current recombination operators [12]. Parents selected through certain rules can produce new antibodies by the gene recombination operation. Further details are presented in Section 8.2.5.
8.3.3.3 Gene mutation operator
Gene mutation provides the possibility of obtaining various types of antibodies. Using gene mutation benefits the improvement of the quality of antibodies and helps the algorithm to jump out of local optima, because the search range of a single gene recombination operator is small and gene mutation is very helpful to search the related area comprehensively [12].

Table 8.6: Algorithm: The pseudocode of the decomposition operator.
Begin
  Sort the antibodies in the initial population in ascending order according to the second goal;
  for (i = 0; i < popsize; i++)
    Assign the i-th sorted antibody to the corresponding subproblem (two antibodies per subproblem);
  end for
  Match the cloned antibodies into the corresponding subproblems;
  for (j = 0; j < R; j++)
    Select antibodies in the j-th subproblem by the roulette method to perform the immune gene operations and obtain a new antibody;
  end for
  Update the antibodies in the population;
  The new antibody population forms the candidates for the clonal selection;
end
Where R means that the MO-CARP is decomposed into R single-objective subproblems, and popsize is the size of the initial antibody population. In DE-ICA, R = 60 and popsize = 120.
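The following sketch shows, under simplifying assumptions, the decomposition step of Eq. (8.7) and Table 8.6: building the R weight vectors and handing two consecutive antibodies (sorted by the second objective) to each subproblem. Normalization of the objectives is omitted here for brevity.

```python
def weight_vectors(r):
    """Eq. (8.7) weights: lambda_i = ((i-1)/(R-1), 1-(i-1)/(R-1)) for i = 1..R."""
    return [((i - 1) / (r - 1), 1 - (i - 1) / (r - 1)) for i in range(1, r + 1)]

def scalarized(f, w):
    """Weighted-sum value F_i(x) of Eq. (8.7) for one subproblem."""
    return w[0] * f[0] + w[1] * f[1]

def assign_to_subproblems(population, objectives, r):
    """Table 8.6 sketch: sort antibodies by the second objective (ascending) and
    give the (2i-1)-th and 2i-th sorted antibodies to subproblem i."""
    order = sorted(range(len(population)), key=lambda k: objectives[k][1])
    return [[population[order[2 * i]], population[order[2 * i + 1]]]
            if 2 * i + 1 < len(order) else [population[order[2 * i]]]
            for i in range(r) if 2 * i < len(order)]
```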

DE-ICA performs the gene mutation operation with a probability of 0.2 on the basis of gene recombination. The algorithm adopts four traditional gene mutation operators, namely Single-Insertion, Double-Insertion, Swap, and 2-opt. Here, we briefly introduce these four operators.
1) Single-Insertion: The operator randomly selects a route and then randomly selects a task from that route. Next, the task is reinserted into another location or directly connected with the depot to build a new route. If the task is an edge task, then the case in which the task is inserted in the opposite direction should also be considered, and the one of the two resulting antibodies with the smaller total cost will be kept.
2) Double-Insertion: The principle of this operator is similar to that of Single-Insertion. The difference is that Double-Insertion randomly selects a route and then randomly selects two consecutive tasks in that route. The two consecutive tasks are then reinserted into other locations. Similarly, if the selected tasks are edge tasks, the case in which the tasks are inserted in the opposite direction should also be considered.
3) Swap: The operator randomly selects two different tasks in the sequences of an antibody and swaps the locations of the two tasks.
4) 2-opt: There are two kinds of 2-opt: one for single routes and the other for double routes [33]. DE-ICA applies the 2-opt operator for double routes.


Figure 8.8 A simple example where 2-opt works on the double routes.

The operator first randomly selects two routes (denoted K1 and K2) in an antibody. Next, K1 and K2 are randomly decomposed into two parts each (respectively denoted K11, K12 and K21, K22). Finally, two candidate antibodies are formed by reconnecting the four subroutes through different connection methods. One of the two antibodies is made up of K11, K22 and K12, K21. The other antibody is made up of K11 and the opposite direction of K21, and K22 and the opposite direction of K12. Fig. 8.8 describes a simple example where 2-opt works on the double routes. In Fig. 8.8, the solid lines show the task lines and the dotted lines show the travel lines. Arrows indicate the directions in which the vehicles serve the tasks. Each task has two serial numbers: the number outside the parentheses represents the current driving direction, and the number inside the parentheses represents the opposite of the current driving direction. S denotes the two circuits before the action of the 2-opt operator, expressed as S = (0,1,2,3,4,0,5,6,7,0), where 0 represents the depot and the two circuits are K1 = (0,1,2,3,4,0) and K2 = (0,5,6,7,0). The previous routes are divided into four subroutes after the 2-opt operator acts on them, namely K11 = (0,1,2,3,0), K12 = (0,4,0), K21 = (0,5,0), and K22 = (0,6,7,0). Two different antibodies are obtained according to the connection methods above, respectively expressed as S1 = (0,1,2,3,6,7,0,4,5,0) and S2 = (0,1,2,3,13,0,6,7,12,0). The antibody with the smaller total cost is output as the result of the 2-opt operator. These gene mutation operators are simple and effective. In DE-ICA, with a probability of Pm the algorithm selects one of the four operators to obtain a new antibody, and with a probability of (1-Pm) it applies the four operators at the same time to get four new antibodies and then selects the antibody with the smallest total cost as the output antibody.

Table 8.7: Algorithm: The pseudocode of the gene mutation operator in DE-ICA.
Begin
  Randomly generate a number v between 0 and 1;
  if (v < Pm)
    Randomly select one of the four operators to obtain a new antibody, denoted Bout;
  else
    Achieve a new antibody by Single-Insertion, denoted A;
    Achieve a new antibody by Double-Insertion, denoted B;
    Achieve a new antibody by Swap, denoted C;
    Achieve a new antibody by 2-opt, denoted D;
    Compare the four antibodies (A, B, C, D) and select one according to certain rules, denoted Bout;
  end if
  Bout is the final antibody produced in this process;
end
Where Pm represents the probability of obtaining the new antibody by a randomly selected operator. In DE-ICA, Pm = 0.6.
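A minimal Python sketch of the selection logic of Table 8.7 is given below. The four operator functions and the comparison rule are placeholders supplied by the caller; only the probabilistic choice between "one random operator" and "all four operators, keep the best" follows the pseudocode.

```python
import random

def gene_mutation(antibody, operators, better, pm=0.6):
    """Table 8.7 selection logic (sketch): with probability pm apply one randomly
    chosen operator, otherwise apply all four and keep the best candidate."""
    if random.random() < pm:
        op = random.choice(list(operators.values()))
        return op(antibody)
    candidates = [op(antibody) for op in operators.values()]
    best = candidates[0]
    for cand in candidates[1:]:
        if better(cand, best):   # e.g. lower total cost, ties by longest circuit
            best = cand
    return best

# Illustrative call: the four operators and the comparison rule are placeholders.
ops = {"single_insertion": lambda a: a, "double_insertion": lambda a: a,
       "swap": lambda a: a, "two_opt": lambda a: a}
print(gene_mutation([0, 1, 2, 0], ops, better=lambda x, y: False))
```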

Choosing one of the four operators randomly helps to increase the diversity of antibodies, while applying all four operators to obtain candidate antibodies improves the affinity of antibodies. Pm is set to 0.6 based on this consideration: on the one hand, it is beneficial for keeping antibodies with high affinity; on the other hand, it helps the algorithm to maintain the diversity of antibodies and jump out of local optima. The pseudocode of the gene mutation operator in DE-ICA is shown in Table 8.7.
8.3.3.4 Directed comparison operator
The current algorithms for MO-CARP usually add the individuals obtained by gene recombination, or the individuals obtained by gene mutation applied with a probability of 0.2, into the total population during the process of local search. However, evolutionary algorithms involve considerable randomness. Suppose that a good individual has been obtained after gene recombination of the parents; if gene mutation is then applied to it, we may end up with an individual that is worse than the one before the mutation, so such an operation is not beneficial for the retention of good solutions and the fast convergence of the algorithm. In order to make up for the drawbacks of this method, DE-ICA compares the individual before gene mutation with the individual after gene mutation and adds the better one into the total population. For MO-CARP, the process of individual selection is usually more complicated than for single-objective CARP. In this chapter, MO-CARP needs to optimize both the total cost and the cost of the longest circuit. Considering the objective function of the single-objective CARP, we usually pay more attention to the optimization of the total cost when solving practical problems. Also, because the cost of the longest circuit is constrained by the capacity of the vehicles, the potential for optimizing the total cost is greater. Therefore, for

simplicity, DE-ICA compares the total cost of the individual before gene mutation with that of the individual after gene mutation and first selects the one with the smaller total cost. If the total costs of the two individuals are equal, DE-ICA then compares the cost of the longest circuit of the two individuals, and the individual with the smaller cost of the longest circuit becomes the first choice. This forms a state of directed evolution in which the whole population evolves rapidly in the direction of reducing the total cost, and it helps DE-ICA to converge to the global optimum quickly. Compared with the current effective algorithms, DE-ICA can achieve better solutions and improve efficiency. The pseudocode of the directed comparison operator in DE-ICA is shown in Table 8.8.
8.3.3.5 Clonal selection operator
The process of clonal selection is considered to be the reverse process of clonal proliferation. The purpose of the clonal selection operation is to select the antibodies with higher affinity from the offspring obtained by cloning and proliferation. For single-objective CARP, the random sorting method is one of the most effective methods [39]; it sorts the antibodies stochastically according to their affinity and their violation of the vehicle capacity constraint. For MO-CARP, considering that the directed comparison operator always uses the total cost as the primary criterion for selecting new individuals, DE-ICA applies the fast nondominated sorting and crowding distance method described in the literature [30] to avoid trapping in local optima. The fast nondominated sorting and crowding distance method ensures not only the quality of the solutions but also the diversity of the antibody population. It is a very effective strategy for choosing offspring and plays a very important role in the DE-ICA algorithm.
Table 8.8: Algorithm: The pseudocode of the directed comparison operator.
Begin
  for (i = 0; i < R; i++)
    Mark the antibody obtained by gene recombination as Aout;
    Randomly generate a number w between 0 and 1;
    if (w < Pn)
      Randomly select one of the four gene mutation operators to obtain an antibody, denoted Bout;
    else
      Achieve a new antibody by Single-Insertion, denoted A;
      Achieve a new antibody by Double-Insertion, denoted B;
      Achieve a new antibody by Swap, denoted C;
      Achieve a new antibody by 2-opt, denoted D;
      Compare the four antibodies (A, B, C, D) and select one according to certain rules, denoted Bout;
    end if
    Compare the total cost of Aout and Bout and add the one with the smaller total cost into the total population; if the total costs of the two antibodies are equal, add the individual with the smaller cost of the longest circuit into the total population;
    Put the individual Aout into the total population;
  end for
end
Where Pn means the probability of gene mutation; in DE-ICA, Pn = 0.2.
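The comparison rule used inside Table 8.8 can be written compactly as follows; total_cost and longest_circuit_cost are assumed to be the two MO-CARP objective evaluators supplied by the surrounding algorithm.

```python
def directed_compare(a_out, b_out, total_cost, longest_circuit_cost):
    """Return the individual to keep: the one with the smaller total cost;
    if the total costs are equal, the one with the smaller longest-circuit cost."""
    if total_cost(b_out) < total_cost(a_out):
        return b_out
    if (total_cost(b_out) == total_cost(a_out)
            and longest_circuit_cost(b_out) < longest_circuit_cost(a_out)):
        return b_out
    return a_out
```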


8.3.4 The processing flow of DE-ICA
This section presents a novel immune clonal algorithm based on directed evolution to solve MO-CARP. The algorithm follows the main idea of the immune clonal algorithm: it first initializes the antibody population and then carries out the immune clonal operation. Next, the algorithm performs the immune gene operations, borrowing from the framework of an effective decomposition algorithm during this process; at the same time, a directed comparison operator is added to the algorithm. Finally, it performs the clonal selection operation. The main steps of DE-ICA are described in Table 8.9.

8.4 Improved memetic algorithm via route distance grouping
For large-scale CARP (LSCARP) [4,20], an improved RDG-MAENS [13] (IRDG-MAENS) [18] is presented here. In IRDG-MAENS, it first decomposes a large-scale

Table 8.9: Algorithm: The overall structure of DE-ICA.
Begin
  Set the termination condition and initialize the iteration ite = 0, then set the size of the initial antibody population to Psize;
  Initialize the initial population P by the path-scanning algorithm;
  while (ite ...

    if \sqrt{(f_1(c_j') - f_{1min})^2 + (f_2(c_j') - f_{2min})^2} > \sqrt{(f_1(s_i) - f_{1min})^2 + (f_2(s_i) - f_{2min})^2}
      then swap c_j' and s_i. In this process, the routes s_i in S\C are evaluated in terms of the new objective value (the minimal one), which is used to update C;
    end if
  end for
end for
for i = 1 to g do
  z_i = {c_i};
  Use the membership of s_i to c_i to obtain the fuzzy distance and update the value; after that, assign all the nonmedoid routes to the groups and obtain the route group z_i (the detailed algorithm is Mei's Algorithm 3);
  The decomposition of the task set Z can be obtained directly from the grouping of the routes: Z_i = {};
  for s_k in z_i do        // each route of z_i
    for z in s_k do        // each task in the route
      Replace solution Z_i and use it to solve adjacent decomposed problems, Z_i <- Z_i U {z};
    end for
  end for
end for
Return (Z_1, ..., Z_g);
end procedure
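Although the decomposition procedure above is only partially reproduced here, its grouping idea can be sketched as follows: each nonmedoid route is assigned to the group of its closest medoid route, and the tasks of each group form one decomposed subproblem. The route_distance argument is a stand-in for the (fuzzy) route-to-route distance of Mei's Algorithm 3, so this is an illustrative simplification rather than the exact RDG strategy.

```python
def group_routes(routes, medoid_idx, route_distance):
    """Assign every nonmedoid route to its closest medoid route and return the
    task set of each group as one decomposed subproblem."""
    groups = {m: list(routes[m]) for m in medoid_idx}
    for i, route in enumerate(routes):
        if i in medoid_idx:
            continue
        closest = min(medoid_idx, key=lambda m: route_distance(route, routes[m]))
        groups[closest].extend(route)
    return [set(tasks) for tasks in groups.values()]
```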


8.4.3 The processing flow of IRDG-MAENS
The improved decomposition-based memetic algorithm (ID-MAENS) has been proven to be superior to other algorithms in solving multiobjective CARP [15]. However, for LSCARP, ID-MAENS has some limitations: for multiobjective LSCARP it is unable to perform the decomposition under the direction vectors directly, and the space to be searched grows rapidly as the scale of the problem increases, which makes it more difficult to find potential solutions in such a large solution space. Previous algorithms largely ignore this problem of scalability. From the description of the two improvements in IRDG-MAENS, it can be seen that IRDG-MAENS increases the convergence speed, immediately updates better solutions so that they participate in solving the current cycle and the other subproblems, and is consistent with the theory of coevolution. This approach enhances the sharing of search areas and the search for potentially better solutions, and also helps to maintain the diversity of the populations. Therefore, IRDG-MAENS is combined with ID-MAENS to solve multiobjective LSCARP. The details of IRDG-MAENS are shown in Table 8.12.

Table 8.12: Algorithm: Improved RDG-MAENS combined with ID-MAENS for multiobjective LSCARP.
Begin
  Initialize population P(Z);
  s(Z) = argmin_{s(Z) in P(Z)} tc(s(Z));
  t <- 1;
  repeat
    (Z_1, ..., Z_g) = Decompose(Z);   // the step of IRDG
    for i = 1 to g do                 // begin the ID-MAENS process
      P(Z_i) = Pop2subpop(P(Z), Z_i);
      Generate N evenly distributed weight vectors lambda_1, ..., lambda_N randomly, and find the T weight vectors closest to each lambda_i according to the Euclidean distance between each pair of weight vectors;
      According to lambda_1, ..., lambda_N, decompose the multiobjective CARP into N single-objective CARPs (g_1, ..., g_N);
      for (i = 1 to N; i++) do
        s_g(g_i) = argmin_{s_g(g_i) in P(Z_i)} T(s_g(g_i));
        Choose the optimal solution of the ith decomposed problem, replace the solution, and use it to solve the adjacent decomposed problems;
        (s_g(g_i), P(g_i)) = Evolve(s_g(g_i), P(g_i));
      end for
      P(Z) = subpop2Pop(P(Z_1), ..., P(Z_g));
      s^(t)(Z) = (s(Z_1), ..., s(Z_g));
      Use the fast nondominated sorting and crowding method to sort Z and select the solutions to save;
    end for
    t <- t + 1;
  until t reaches a predefined upper bound;
  Return s(Z);
end
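The weight-vector step of Table 8.12 can be illustrated for the two objectives used in this chapter (total cost and cost of the longest circuit). The sketch below generates N evenly spread weight vectors and, for each one, the indices of its T closest neighbors by Euclidean distance; N and T are experiment settings, and the even spread is just one simple way to realize "evenly distributed" vectors.

```python
import numpy as np

def weight_neighbourhoods(n_weights, t_size):
    """N evenly spread weight vectors for two objectives and, for each vector,
    the indices of its T closest neighbours by Euclidean distance."""
    w = np.linspace(0.0, 1.0, n_weights)
    weights = np.stack([w, 1.0 - w], axis=1)            # lambda_1, ..., lambda_N
    d = np.linalg.norm(weights[:, None, :] - weights[None, :, :], axis=2)
    neighbours = np.argsort(d, axis=1)[:, :t_size]      # each row includes itself
    return weights, neighbours

weights, nb = weight_neighbourhoods(n_weights=10, t_size=3)
print(weights[0], nb[0])
```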


8.5 Experiments
8.5.1 Test problems and experimental setup
8.5.1.1 MPCCA
For MPCCA, the experiments use four benchmark problem sets: gdb, val, egl, and EGL-G. gdb contains 23 small instances [40]. val contains 34 instances in 10 groups based on 10 different graphs; in each graph, different instances are generated by changing the capacity of the vehicle. Based on the data from a winter gritting application in Lancashire, egl contains 24 instances on two graphs [8], with each graph corresponding to 12 instances. These three groups of instances are commonly used for assessing algorithms that solve CARP. Because real-life CARP instances are usually ultra-large-scale, EGL-G provides 10 ultra-large-scale instances generated from the traffic topology of Lancashire [41]; its graph includes 255 vertices and 375 edges. In order to verify the effectiveness of the algorithm, we make a comparison among MPCCA, DVCMOA [42], and D-MAENS [14]. Since MPCCA is different from DVCMOA, it is also necessary to compare it with DVCMOA; for this reason, we also hybridize the framework of DVCMOA with the single-objective CARP algorithm MAENS and consider it as a reference. The comparison is carried out on the four benchmark test sets of CARP instances. For a fair comparison, MPCCA and DVCMOA adopt the same parameters as D-MAENS. For all three methods, the maximum number of iterations Gmax = 200, the size of the offspring population osize = 60, and the local search probability pls = 0.1. Because MPCCA and DVCMOA emphasize the coevolution between different subpopulations, the size of their parent population is set to 120; it is set to 60 in D-MAENS. All the algorithms are run 51 times independently.
8.5.1.2 DE-ICA
This experiment uses three sets of test problems: Beullens [10], egl [8], and EGL-G [41]. The Beullens test set is built from data of the Belgian Flanders intercity road network and includes four groups of instances (C, D, E, and F), each covering 25 different situations with 28 to 121 tasks. D and F are based on the same network graphs as C and E, respectively, but with larger vehicle capacities. In summary, the test sets cover CARP instances of differing sizes. In order to evaluate the effectiveness of DE-ICA, the experiment chooses D-MAENS [14] and ID-MAENS [15] as the compared algorithms. The main parameters in DE-ICA are set as follows: the maximum iteration number Gmax is 200, the probability of local search Pls is 0.1, the clone ratio q is 3, and the initial population size is 120. In order to make the comparison between DE-ICA and the compared algorithms fairer, the three algorithms are run 30 times independently on the same computing platform.

8.5.1.3 IRDG-MAENS
As for IRDG-MAENS, Beullens [10], egl [8], and EGL-G [41] are again selected as test problems. The performance of the CC framework combined with the RDG decomposition strategy in RDG-MAENS mainly depends on two parameters, g and a, when solving LSCARP. In order to choose good parameter values and obtain good results, the parameters are varied as g = 2, 3 and a = 1, 5, 10. In IRDG-MAENS and in its combination with ID-MAENS for solving LSCARP, the parameters are set as g = 2 and a = 5. For a fair comparison, we set the same parameters for all algorithms: the maximum number of iterations Gmax = 500, the population size psize = 30, the probability of local search pls = 0.2, and the number of cycles is 50. We have performed two types of experiments, as follows. In order to verify the effectiveness of IRDG-MAENS, we make a comparison between IRDG-MAENS and RDG-MAENS.

8.5.2 The performance metrics
We select indicators to measure the performance of the algorithms for MO-CARP. There are three kinds of measures [43]: (1) measures of the convergence of the algorithms; (2) measures of the diversity of the nondominant solutions obtained by the algorithms; and (3) measures of both the convergence and the diversity of the nondominant solutions. Convergence refers to the proximity between the nondominant set obtained by the algorithm and the Pareto-optimal set, while diversity refers to the distribution of the nondominant solutions. Because the solution space of CARP is discrete, the Pareto front of the test instances is not evenly distributed, so measuring only the diversity of the nondominant solutions has no practical significance [44]. We adopt three criteria to measure the performance of the algorithms for MO-CARP.
8.5.2.1 The distance to the reference set (ID)
This measure was first proposed in the literature [45]. ID is defined as follows:

I_D(X) = \frac{\sum_{j=1}^{N} \min_{1 \leq i \leq M} d(x_i, y_j)}{|S|}    (8.8)

where x_1, ..., x_M are the points in the test set X, y_1, ..., y_N are the points in the reference set S, and d(x_i, y_j) is the Euclidean distance. ID(X) is the average distance between the points in the reference set S and their closest points in X; the smaller the distance, the closer the test set X is to the reference set S. It is difficult to obtain the exact Pareto-optimal set when solving practical MO-CARP, so we select a new nondominant set as the reference set S: we first merge the nondominant sets obtained by the three algorithms and then take the resulting nondominant set as the reference set S.
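For concreteness, Eq. (8.8) can be computed directly from two arrays of objective vectors, as in the following sketch.

```python
import numpy as np

def distance_to_reference(test_set, reference_set):
    """ID of Eq. (8.8): for every reference point y_j, take the Euclidean distance
    to its closest point in the test set X, then average over the reference set S."""
    X = np.asarray(test_set, dtype=float)       # shape (M, number of objectives)
    S = np.asarray(reference_set, dtype=float)  # shape (N, number of objectives)
    d = np.linalg.norm(S[:, None, :] - X[None, :, :], axis=2)   # (N, M) distances
    return d.min(axis=1).sum() / len(S)

# Small illustration with made-up objective vectors.
print(distance_to_reference([[1.0, 4.0], [2.0, 2.0]], [[1.0, 4.0], [2.0, 1.0]]))
```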

8.5.2.2 Purity
Purity was first proposed in the literature [46] and is defined as follows:

Purity(X) = \frac{|X \cap S|}{|X|}    (8.9)

where X is the nondominant set obtained by the test algorithm, |X| is the total number of solutions in X, S is the reference set, and |X ∩ S| is the number of solutions that appear in both X and S. Purity is a ratio, and the greater its value, the better the convergence of the algorithm.
8.5.2.3 Hypervolume (HV)
HV describes the region in the objective space covered by the nondominant solutions of the test algorithm [47]:

HV(X) = vol\Big(\bigcup_{i=1}^{N} x_i\Big)    (8.10)

A reference point is required for the calculation; we select as the reference point the point formed by the two maximal objective function values over all the nondominant solutions obtained by the three algorithms. HV reflects well the proximity between the Pareto front and the obtained nondominant set: the greater the HV, the closer the nondominant set obtained by the test algorithm is to the Pareto front. If the value of HV is zero, there is only one nondominant solution.
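The two remaining metrics can be computed as follows for the biobjective minimization case considered here; the hypervolume routine uses the standard two-dimensional sweep and the reference point described above.

```python
def purity(test_set, reference_set):
    """Eq. (8.9): share of the test algorithm's nondominant solutions that also
    belong to the reference set S (points compared as tuples)."""
    X = set(map(tuple, test_set))
    S = set(map(tuple, reference_set))
    return len(X & S) / len(X)

def hypervolume_2d(front, ref_point):
    """Eq. (8.10) for two minimized objectives: area dominated by the front and
    bounded by the reference point (the componentwise maxima, see text)."""
    pts = sorted(p for p in front
                 if p[0] <= ref_point[0] and p[1] <= ref_point[1])
    area, prev_f2 = 0.0, ref_point[1]
    for f1, f2 in pts:                 # sweep by increasing first objective
        if f2 < prev_f2:               # ignore dominated points
            area += (ref_point[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return area

front = [(100.0, 40.0), (120.0, 30.0), (150.0, 25.0)]
print(purity(front, front), hypervolume_2d(front, ref_point=(150.0, 40.0)))
```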

8.5.3 Wilcoxon signed rank test
In order to compare the performance of the different algorithms on CARP, the Wilcoxon signed rank test was used to assess the results on each dataset and to verify the effectiveness of the method [48]. This test is a distribution-free counterpart of the paired t-test: it does not require the paired data to follow a normal distribution, a symmetric distribution of the paired differences being sufficient. The paired differences are tested to determine whether they are symmetrically distributed around zero [49]. The Wilcoxon signed rank test was used in the experiments with a significance level of 0.05.
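The test can be reproduced, for example, with SciPy; the per-run values below are synthetic and stand in for the 30 or 51 independent runs of two algorithms on one instance.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-run metric values of two algorithms on the same instance
# (synthetic numbers used purely for illustration).
rng = np.random.default_rng(0)
alg_a = rng.normal(100.0, 5.0, size=30)
alg_b = alg_a + rng.normal(1.0, 2.0, size=30)

stat, p = wilcoxon(alg_a, alg_b)   # distribution-free paired test
h = int(p < 0.05)                  # h = 1 marks a significant median difference
print(f"p = {p:.4f}, h = {h}")
```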

8.5.4 Comparison of the evaluation metrics
8.5.4.1 MPCCA
The following tables show the experimental results of MPCCA, D-MAENS, and DVCMOA. All the experiments were conducted for 51 independent runs. The mean and standard deviation values of each metric are included in the tables. For a statistical

understanding, the Wilcoxon signed rank test [48] has been carried out at a significance level of 0.05 between MPCCA and the other algorithms. Tables 8.13 to 8.20 show the statistical test results. In the following analysis, some notations are used for clarity:
1) V, T, and E denote, respectively, the number of vertices, the number of tasks, and the total number of edges in the instance. The edges in gdb and val are all tasks, while egl and EGL-G contain both task edges and nontask edges.
2) For the Wilcoxon signed rank test, p is the probability of the hypothesis of equal medians for two paired samples, and h is the result of the test: if the median of the differences between MPCCA and a compared algorithm is zero, h = 0; otherwise, if there is a significant difference, h = 1.
3) The best average values and standard deviations are indicated in bold for each test instance. The winner among the algorithms is the one with the best mean for each metric (the smallest for ID, the largest for HV).
4) Most ID values obtained by the three algorithms are reported to four decimal places, while HV values are reported to one decimal place.
As seen in Table 8.13, the gdb test set contains 23 small-scale instances; the number of vertices varies from 7 to 27, and the number of tasks is up to 46. The table shows that MPCCA obtains better results than the other two algorithms in 20 out of the 23 gdb instances, while D-MAENS and DVCMOA are each the winner in only one gdb instance. Moreover, it is noteworthy that MPCCA not only has the smallest means but also obtains the smallest standard deviations on most of the gdb instances. As for the Wilcoxon signed rank test, on most test instances the median of the differences between MPCCA and the compared algorithms is significant. Summing up, in the light of ID, MPCCA has a stronger capability than D-MAENS and DVCMOA of reaching the area closer to the true Pareto front.
Table 8.14 shows the statistical results of the performance indicator HV among the three algorithms on the small-scale test set gdb. Because the reference point selected for calculating HV differs in each independent run, the HV results vary greatly; as shown in Table 8.14, the standard deviations of HV change greatly and do not reflect the stability of the algorithms. However, to some extent the average values of HV still represent the diversity and convergence of the three algorithms. As can be seen in Table 8.14, on most gdb instances the means of MPCCA for HV are the largest. As for the Wilcoxon signed rank test, in most test instances the median of the differences between MPCCA and the compared algorithms is significant.
Table 8.15 shows the statistical results of the performance indicator ID among the three algorithms on the medium-scale test set val. The val test set contains a total of 34 medium-scale


Table 8.13: The statistical results of ID on the gdb benchmark test set. (The rows of values below list, in order, for gdb instances 1 to 23: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

V

T

E

Mean

12 12 12 11 13 12 12 27 27 12 22 13 10 7 7 8 8 9 8 11 11 11 11

22 26 22 19 26 22 22 46 51 25 45 23 28 21 21 28 28 36 11 22 33 44 55

22 26 22 19 26 22 22 46 51 25 45 23 28 21 21 28 28 36 11 22 33 44 55

0 0.4482 0.1373 0 0.1343 0.0482 0 3.3327 0.6727 0.2055 2.2848 0.7579 12.1311 0 0.0098 0.2052 0.7457 3.3915 0 0.6920 1.2453 0.4422 1.0077

D-MAENS Std

0 0.6164 0.4752 0 0.1998 0.3442 0 0.8426 0.9508 0.4283 1.8376 0.3959 0 0 0.0396 0.2319 0.2827 1.6823 0 0.6239 0.8422 0.3686 0.1909

Mean 0.0784 2.4786 0.1855 0.1503 1.0119 0.6869 0.0392 3.3283 3.6258 1.7958 7.8611 0.9924 11.6975 1.8189 0.4620 1.6257 2.6243 6.5813 0 1.3733 2.4881 1.6175 10.9106

Std 0.2169 1.3102 0.5061 0.3848 0.5937 0.8366 0.0802 0.9364 0.8797 1.7703 3.2722 0.9953 1.7624 2.2124 0.2962 1.0922 0.9217 2.5215 0 0.4927 1.0228 0.4994 46.3249

DVCMOA p 0.0313 7.0e-10 0.6865 0.0156 3.4e-9 3.4e-5 0.002 0.5300 7.5e-10 2.6e-9 8.8e-10 0.2165 1 3.8e-6 1.7e-9 6.9e-10 5.3e-10 9.6e-8 1 8.5e-7 6.4e-7 6.1e-10 5.1e-10

h 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1

Mean 0.1307 2.4152 0.6025 0.0131 1.2141 1.0460 0.0235 3.3228 3.7786 2.1253 7.3531 1.2714 11.9719 1.5013 0.5047 2.0823 2.6768 6.8393 0 1.2225 2.4494 1.8371 1.6812

Std 0.2673 1.0212 0.7959 0.0934 0.7281 1.0193 0.0651 0.8860 0.8724 1.6581 3.3839 1.2200 1.1366 2.1047 0.2517 1.2834 0.8509 2.5748 0 0.5789 0.9857 0.5148 0.2407

p 0.0020 9.7e-10 0.0049 1 1.1e-9 2.3e-6 0.0313 0.8734 5.1e-10 9.3e-10 1.2e-9 0.0047 0.25 1.3e-6 5.8e-10 1.6e-9 4.7e-10 1.8e-7 1 1.4e-4 3.7e-7 5.5e-10 6.5e-10

h 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1

Table 8.14: The statistical results of HV on the gdb benchmark test set. (The rows of values below list, in order, for gdb instances 1 to 23: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

T

E

Mean

12 12 12 11 13 12 12 27 27 12 22 13 10 7 7 8 8 9 8 11 11 11 11

22 26 22 19 26 22 22 46 51 25 45 23 28 21 21 28 28 36 11 22 33 44 55

22 26 22 19 26 22 22 46 51 25 45 23 28 21 21 28 28 36 11 22 33 44 55

112.0 690.8 267.9 294.9 878.8 425.6 420.0 301.1 169.9 3358.7 6835.4 95.6 0.0 158.8 40.0 240.3 25.5 481.8 0.0 116.7 412.0 98.1 64.8

Std 0.0 88.9 13.9 73.9 291.6 167.9 0.0 111.5 88.2 172.9 1181.6 86.8 0.0 22.9 6.4 34.8 4.9 88.7 0.0 43.5 83.2 26.4 36.5

Mean 108.7 674.2 266.9 293.1 859.6 417.8 414.9 274.9 129.6 3312.8 6548.5 94.3 0.0 157.0 38.2 230.6 18.2 439.4 0.0 114.9 399.2 86.9 50.3

Std 9.1 83.5 15.2 69.4 283.1 160.9 10.4 104.0 74.5 178.2 1168.3 87.5 0.0 23.2 6.0 33.9 4.7 84.5 0.0 41.7 79.1 23.8 36.2

DVCMOA p

0.0313 8.3e-9 0.2075 0.0156 1.2e-7 0.0015 0.0020 3.4e-9 1.1e-9 4.9e-9 6.1e-10 0.1446 1 2.4e-4 6.6e-8 3.4e-9 3.2e-9 5.1e-10 1 4.2e-4 1.7e-8 1.1e-9 1.5e-7

h 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1

Mean 106.5 674.7 264.3 294.7 852.6 407.6 416.9 269.7 125.3 3303.4 6485.5 90.7 3.4118 156.4 37.7 227.6 17.3 439.2 0.0 114.7 399.6 84.4 52.3

Std 11.2 86.2 15.5 73.6 270.2 149.2 8.5 107.6 71.4 173.4 1119.6 72.2 24.3 22.4 5.8 31.4 4.5 79.2 0.0 41.2 79.2 23.2 34.0

p 0.0020 3.8e-9 1.9e-4 1 6.2e-9 4.5e-5 0.0313 2.1e-9 5.1e-10 1.5e-9 5.5e-10 0.0038 1 1.6e-4 9.2e-9 2.7e-9 6.3e-10 7.1e-10 1 0.0023 2.2e-9 6.6e-10 3.5e-9

h 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

V

D-MAENS

MPCCA val

V

T

E

Mean

1A 1B 1C 2A 2B 2C 3A 3B 3C 4A 4B 4C 4D 5A 5B 5C 5D 6A 6B 6C 7A 7B 7C 8A 8B 8C 9A 9B 9C 9D 10A 10B 10C 10D

24 24 24 24 24 24 24 24 24 41 41 41 41 34 34 34 34 31 31 31 40 40 40 30 30 30 50 50 50 50 50 50 50 50

39 39 39 34 34 34 35 35 35 69 69 69 69 65 65 65 65 50 50 50 66 66 66 63 63 63 92 92 92 92 97 97 97 97

39 39 39 34 34 34 35 35 35 69 69 69 69 65 65 65 65 50 50 50 66 66 66 63 63 63 92 92 92 92 97 97 97 97

0.7347 0.3169 0 2.0143 1.4359 0.3333 0.1046 0.0609 0 2.2892 1.8104 2.8570 3.3622 7.3482 4.9838 4.8847 5.1665 0.5451 0.5815 0.7440 1.7721 2.2281 0.7734 4.8313 4.2505 5.6968 4.8723 3.8456 3.0642 2.2825 4.2827 7.0197 4.2129 9.1537

D-MAENS Std

0.2399 0.4355 0 0.3104 0.5600 1.3515 0.1245 0.1117 0 0.7232 0.6475 0.3942 1.0986 1.4432 1.1593 1.3528 1.9545 0.5290 0.4350 0.4243 0.9027 1.1430 0.7086 1.3814 0.8620 1.5593 1.0923 1.0268 0.6392 0.5508 1.2032 2.9660 1.3248 2.7449

Mean 1.5033 2.2771 0.5980 2.6054 2.0857 3.5249 0.2778 0.0788 0 4.8302 3.9310 4.6028 6.9950 10.2012 8.2064 7.1668 6.8559 2.5362 2.6359 3.7135 5.1106 4.4361 1.8326 6.0429 6.3853 9.4106 8.7414 6.6365 5.8339 4.0440 12.4337 22.8875 8.4750 14.9530

Std 0.5141 0.5148 0.7452 0.6675 0.8298 3.4506 0.2463 0.1674 0 0.5724 0.4954 1.0646 3.0449 1.8477 1.6260 1.1533 1.8516 0.5720 0.8980 1.4936 1.2917 1.2064 0.7575 1.0333 1.3347 2.1106 2.1818 1.5565 1.3928 0.8158 4.9076 8.0258 2.7262 5.8712

DVCMOA p

4.4e-9 6.1e-10 1.8e-5 6.2e-6 6.5e-5 1.5e-5 1.7e-4 0.5442 1 5.5e-10 5.1e-10 1.2e-9 1.0e-8 3.7e-9 1.2e-9 9.3e-10 3.7e-5 7.8e-10 6.1e-10 1.5e-9 6.9e-10 5.7e-8 1.4e-7 3.2e-6 1.2e-9 3.7e-9 9.9e-10 5.2e-9 5.5e-10 5.5e-10 5.1e-10 5.1e-10 3.7e-9 1.3e-6

h 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Mean 1.7059 2.2742 0.6739 2.7963 2.1876 3.0997 0.4297 0.2473 0.0196 5.3878 4.3662 5.3038 8.7749 10.3154 8.6129 7.7889 8.6582 2.8089 3.0997 4.1556 5.5599 4.9894 2.7528 6.5455 6.8588 9.6163 8.7541 7.0523 6.6473 4.5067 11.9129 21.4672 9.9041 14.1472

Std 0.5517 0.6172 0.7529 0.6000 0.7298 2.8601 0.3023 0.2848 0.1400 0.7233 0.6860 1.1189 3.5113 1.5779 1.7630 1.5778 2.4944 0.6033 0.6713 1.5542 1.1842 1.2725 1.4495 1.2592 1.4776 2.0671 1.7669 1.4506 1.0276 0.8574 4.4633 6.3699 3.9178 4.9689

p 8.8e-10 5.1e-10 5.1e-6 4.4e-9 8.6e-7 4.9e-5 8.0e-8 2.1e-5 1 5.1e-10 5.5e-10 5.1e-10 1.2e-9 1.2e-8 5.1e-10 9.3e-10 2.0e-7 5.1e-10 5.1e-10 5.5e-10 5.5e-10 2.6e-9 1.3e-8 4.5e-7 5.8e-10 3.5e-9 6.2e-10 1.8e-9 5.1e-10 5.1e-10 5.1e-10 8.3e-10 1.8e-9 2.3e-7

h 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


Table 8.15: The statistical results of ID on the val benchmark test set. (The rows of values above list, in order, for val instances 1A to 10D: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

Table 8.16: The statistical results of HV on the val benchmark test set. (The rows of values below list, in order, for val instances 1A to 10D: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

T

E

Mean

1A 1B 1C 2A 2B 2C 3A 3B 3C 4A 4B 4C 4D 5A 5B 5C 5D 6A 6B 6C 7A 7B 7C 8A 8B 8C 9A 9B 9C 9D 10A 10B 10C 10D

24 24 24 24 24 24 24 24 24 41 41 41 41 34 34 34 34 31 31 31 40 40 40 30 30 30 50 50 50 50 50 50 50 50

39 39 39 34 34 34 35 35 35 69 69 69 69 65 65 65 65 50 50 50 66 66 66 63 63 63 92 92 92 92 97 97 97 97

39 39 39 34 34 34 35 35 35 69 69 69 69 65 65 65 65 50 50 50 66 66 66 63 63 63 92 92 92 92 97 97 97 97

626.7 873.8 4.5 4684.6 3599.6 1.7 221.5 36.8 0 6893.5 3624.6 1455.3 87.9 1.6eþ4 9095.6 6077.5 1027.6 2106.9 1187.7 78.1 5646.9 2073.5 460.2 1.4eþ4 7077.3 986.3 6938.4 4203.0 2838.1 363.1 2.3eþ4 1.0eþ4 6226.8 2498.3

Std 69.2 256.4 8.4 133.6 329.0 5.2 26.8 8.2 0 1335.8 1218.9 622.3 73.7 0.2eþ4 1787.9 1556.2 321.8 353.0 165.3 30.9 863.6 419.5 91.5 0.1eþ4 1210.3 372.8 1195.2 794.9 653.1 176.5 0.4eþ4 0.2eþ4 1011.1 1112.1

Mean 594.5 829.5 1.2 4621.9 3560.8 0.1 219.1 36.7 0 6561.5 3367.0 1335.4 43.0 1.5eþ4 8624.5 5780.5 921.1 2047.7 1143.8 48.3 5462.5 1976.8 446.8 1.3eþ4 6559.0 813.4 6236.8 3737.3 2467.1 280.4 2.1eþ4 0.9eþ4 5591.0 2095.1

Std 64.1 239.7 4.9 133.2 329.6 0.4 25.5 8.4 0 1295.4 1211.7 599.2 32.9 0.2eþ4 1727.7 1577.1 301.0 343.9 159.2 29.5 845.8 416.5 88.7 0.1eþ4 1157.0 346.6 1192.2 752.8 629.4 158.7 0.3eþ4 0.1eþ4 989.8 923.1

DVCMOA p

6.5e-10 1.0e-9 2.4e-4 7.5e-10 3.7e-8 0.0313 5.2e-6 0.3352 1 5.1e-10 5.5e-10 1.4e-9 3.8e-8 6.1e-10 5.8e-10 5.1e-10 8.3e-10 1.8e-9 7.3e-10 3.7e-9 5.1e-10 6.5e-10 9.7e-10 5.5e-10 5.1e-10 7.8e-10 5.1e-10 7.3e-10 5.1e-10 5.1e-10 5.1e-10 5.5e-10 5.1e-10 5.5e-10

h 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Mean 583.9 823.3 1.4 4604.2 3555.3 1.0 216.5 36.0 0 6472.3 3335.6 1278.6 31.1 1.5eþ4 8547.7 5710.9 897.8 2030.8 1130.4 42.5 5443.3 1939.7 438.2 1.3eþ4 6483.5 786.9 6133.6 3664.0 2400.2 257.8 2.1eþ4 0.9eþ4 5474.7 2073.4

Std 66.7 237.0 4.1 141.6 335.5 4.1 26.5 7.6 0 1314.3 1199.9 570.9 39.7 0.2eþ4 1761.8 1513.7 298.3 346.4 156.5 25.8 852.1 409.6 83.6 0.1eþ4 1147.8 322.1 1165.6 727.8 620.6 147.6 0.3eþ4 0.1eþ4 1012.4 921.0

p 5.1e-10 8.8e-10 9.8e-4 5.4e-10 2.5e-8 0.2500 1.8e-8 3.4e-4 1 5.1e-10 6.1e-10 6.1e-10 2.9e-9 5.1e-10 5.1e-10 5.8e-10 5.5e-10 5.1e-10 5.1e-10 5.1e-10 5.1e-10 5.1e-10 2.2e-9 5.1e-10 5.8e-10 5.1e-10 5.1e-10 5.8e-10 5.5e-10 5.1e-10 5.5e-10 5.1e-10 5.1e-10 5.1e-10

h 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


val

D-MAENS


Table 8.17: The statistical results of ID on the egl benchmark test set. (The rows of values below list, in order, for egl instances E1-A to S4-C: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

V

T

E

E1-A E1-B E1-C E2-A E2-B E2-C E3-A E3-B E3-C E4-A E4-B E4-C S1-A S1-B S1-C S2-A S2-B S2-C S3-A S3-B S3-C S4-A S4-B S4-C

77 77 77 77 77 77 77 77 77 77 77 77 140 140 140 140 140 140 140 140 140 140 140 140

51 51 51 72 72 72 87 87 87 98 98 98 75 75 75 147 147 147 159 159 159 190 190 190

98 98 98 98 98 98 98 98 98 98 98 98 190 190 190 190 190 190 190 190 190 190 190 190

Mean 6.4871 4.0821 37.3669 19.1655 25.0122 19.1610 44.2949 46.2759 39.8912 50.0045 28.0476 56.6145 49.6998 37.9986 15.2154 74.0856 63.2305 55.3809 108.4623 60.1846 48.9827 25.7847 55.0863 132.2291

D-MAENS Std

1.3813 7.3082 4.6199 18.7182 5.0953 5.8300 14.1913 6.6165 15.5999 7.7156 5.0138 40.9926 10.4353 4.6869 7.5144 14.3440 15.3847 34.8551 40.8624 19.5550 42.1472 17.0052 69.0522 113.4577

Mean 7.3649 14.1238 52.8392 52.6231 42.1731 39.1281 76.3673 57.0486 58.3136 73.7701 66.4717 156.5018 61.2755 65.9309 67.8097 98.1120 108.8641 244.2146 109.9562 93.7351 413.4969 166.1725 366.7897 178.9059

Std 2.2318 6.4036 12.4102 16.2404 6.2965 11.6592 18.0139 11.3180 22.3946 31.1991 20.3543 30.4954 14.3539 16.1134 23.2621 21.5575 34.1791 76.7322 25.9053 28.3941 101.2926 57.0508 68.9704 105.0823

DVCMOA p 0.0127 4.5e-7 2.8e-8 9.1e-9 5.1e-10 6.2e-10 6.2e-9 4.2e-7 3.2e-7 5.9e-6 5.5e-10 1.2e-9 8.3e-5 8.8e-10 5.5e-10 0.5178 2.8e-9 5.1e-10 0.7500 3.0e-8 5.1e-10 5.1e-10 5.1e-10 0.0236

h 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1

Mean 8.0623 17.5598 54.5251 62.6490 45.5208 48.1858 82.8305 61.3614 77.9532 86.6980 87.8175 171.7349 70.4663 75.1488 80.0359 114.9619 127.9945 259.8283 121.8685 109.7255 441.7966 213.7841 406.5299 231.7199

Std 1.9873 5.8098 14.9810 14.7111 8.4915 13.1415 20.4652 12.3690 29.5696 22.4186 24.9333 36.7921 18.5674 15.8693 25.3047 26.6838 39.6104 101.5085 26.3229 33.9281 120.5941 69.4853 75.8416 101.7923

p 5.1e-5 2.7e-8 4.0e-8 6.2e-10 6.5e-10 6.2e-10 4.4e-9 1.3e-8 9.1e-9 2.4e-9 5.1e-10 7.3e-10 2.9e-7 6.9e-10 5.1e-10 8.9e-5 1.7e-9 5.5e-10 0.0916 3.0e-9 5.1e-10 5.1e-10 5.1e-10 2.8e-4

h 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1

Table 8.18: The statistical results of HV on the egl benchmark test set. (The rows of values below list, in order, for egl instances E1-A to S4-C: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

T

E

Mean

E1-A E1-B E1-C E2-A E2-B E2-C E3-A E3-B E3-C E4-A E4-B E4-C S1-A S1-B S1-C S2-A S2-B S2-C S3-A S3-B S3-C S4-A S4-B S4-C

77 77 77 77 77 77 77 77 77 77 77 77 140 140 140 140 140 140 140 140 140 140 140 140

51 51 51 72 72 72 87 87 87 98 98 98 75 75 75 147 147 147 159 159 159 190 190 190

98 98 98 98 98 98 98 98 98 98 98 98 190 190 190 190 190 190 190 190 190 190 190 190

24,697.1 950.9 691.7 14.7eþ4 2.4eþ4 4764.0 1.6eþ5 2.0eþ4 1.3eþ4 1.1eþ5 4.1eþ4 8093.8 3.7eþ5 1.1eþ5 6.0eþ4 2.2eþ5 7.6eþ4 3.7eþ4 2.9eþ5 8.9eþ4 4.8eþ4 3.2eþ4 9081.3 5227.5

Std 3987.0 783.2 1209.2 2.6eþ4 1.2eþ4 2180.2 0.4eþ5 1.1eþ4 0.3eþ4 0.2eþ5 1.3eþ4 5735.7 1.4eþ5 0.4eþ5 0.5eþ4 0.7eþ5 2.8eþ4 1.3eþ4 0.8eþ5 3.1eþ4 1.7eþ4 1.7eþ4 8974.5 3772.2

Mean 23,605. 4 606.0 715.9 14.1eþ4 2.0eþ4 3291.1 1.5eþ5 1.7eþ4 1.0eþ4 1.0eþ5 3.1eþ4 2550.6 3.5eþ5 1.0eþ5 5.0eþ4 2.0eþ5 5.7eþ4 1.4eþ4 2.6eþ5 6.9eþ4 1.8eþ4 1.9eþ4 1774.3 1663.7

Std 4044.0 590.5 914.9 2.5eþ4 1.1eþ4 1900.9 0.4eþ5 1.2eþ4 0.3eþ4 0.2eþ5 1.1eþ4 2718.5 1.3eþ5 0.4eþ5 0.5eþ4 0.6eþ5 2.3eþ4 0.8eþ4 0.8eþ5 2.4eþ4 1.1eþ4 1.3eþ4 2782.0 2192.8

DVCMOA p 0.0024 1.6e-5 0.9043 8.2e-9 5.1e-10 5.1e-10 1.6e-9 6.5e-9 7.3e-9 5.5e-10 1.8e-9 5.1e-10 5.1e-10 5.1e-10 5.1e-10 5.8e-10 5.1e-10 5.5e-10 5.1e-10 5.1e-10 5.1e-10 5.5e-10 7.6e-9 3.5e-8

h 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Mean 23,473. 2 417.8 832.3 13.9eþ4 1.9eþ4 2696.9 1.4eþ5 1.6eþ4 0.7eþ4 1.0eþ5 2.8eþ4 1837.4 3.4eþ5 1.0eþ5 5.0eþ4 1.8eþ5 5.3eþ4 1.2eþ4 2.5eþ5 6.4eþ4 1.4eþ4 1.4eþ4 688.3 619.8

Std 3864.4 389.9 1135.3 2.5eþ4 0.9eþ4 1600.5 0.4eþ5 1.1eþ4 0.4eþ4 0.2eþ5 1.1eþ4 2197.4 1.3eþ5 0.3eþ5 0.5eþ4 0.6eþ5 2.1eþ4 0.9eþ4 0.8eþ5 2.4eþ4 1.0eþ4 1.1eþ4 1808.7 1187.6

p 7.5e-6 4.1e-7 0.1738 5.1e-10 5.1e-10 5.1e-10 5.1e-10 6.5e-10 2.8e-9 5.5e-10 5.8e-10 5.1e-10 5.1e-10 5.1e-10 5.1e-10 5.1e-10 5.1e-10 5.1e-10 5.5e-10 5.1e-10 5.1e-10 7.3e-10 7.6e-9 2.8e-8

h 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


egl

D-MAENS


Table 8.19: The statistical results of ID on the EGL-G benchmark test set. (The rows of values below list, in order, for EGL-G instances 1-A to 1-E and 2-A to 2-E: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

V

T

E

Mean

255 255 255 255 255 255 255 255 255 255

347 347 347 347 347 375 375 375 375 375

375 375 375 375 375 375 375 375 375 375

0.9eþ4 1.9eþ4 1.4eþ4 1.9eþ4 1.4eþ4 1.1eþ4 1.4eþ4 0.1eþ5 1.6eþ4 1.3eþ4

D-MAENS Std

0.5eþ4 1.1eþ4 1.3eþ4 0.9eþ4 0.5eþ4 0.9eþ4 0.7eþ4 0.1eþ5 0.3eþ4 1.0eþ4

Mean 2.3eþ4 4.0eþ4 3.1eþ4 3.4eþ4 2.7eþ4 3.0eþ4 2.8eþ4 0.2eþ5 3.3eþ4 4.3eþ4

Std 0.6eþ4 0.6eþ4 0.4eþ4 0.4eþ4 0.6eþ4 0.5eþ4 0.4eþ4 0.1eþ5 0.7eþ4 0.9eþ4

DVCMOA p

5.8e-10 1.6e-9 5.2e-8 2.8e-9 5.1e-10 3.3e-9 1.4e-9 7.3e-9 5.1e-10 6.2e-10

h 1 1 1 1 1 1 1 1 1 1

Mean 2.8eþ4 4.4eþ4 3.7eþ4 3.5eþ4 2.6eþ4 3.1eþ4 2.8eþ4 1.4eþ5 3.6eþ4 4.1eþ4

Std 0.6eþ4 0.5eþ4 0.6eþ4 0.6eþ4 0.5eþ4 0.6eþ4 0.5eþ4 3.9eþ5 0.7eþ4 1.1eþ4

p 5.1e-10 9.9e-10 2.3e-9 7.8e-10 5.5e-10 1.9e-9 9.9e-10 4.2e-9 5.1e-10 1.7e-9

h 1 1 1 1 1 1 1 1 1 1

Table 8.20: The statistical results of HV on the EGL-G benchmark test set. (The rows of values below list, in order, for EGL-G instances 1-A to 1-E and 2-A to 2-E: V, T, and E; the Mean and Std of MPCCA; the Mean, Std, and Wilcoxon p and h values of D-MAENS versus MPCCA; and the Mean, Std, p, and h values of DVCMOA versus MPCCA.)

T

E

Mean

255 255 255 255 255 255 255 255 255 255

347 347 347 347 347 375 375 375 375 375

375 375 375 375 375 375 375 375 375 375

1.4eþ9 6.9eþ8 5.2eþ8 1.8eþ8 3.6eþ7 9.1eþ8 4.9eþ8 1.8eþ8 4.3eþ7 3.6eþ7

D-MAENS Std

0.5eþ9 4.7eþ8 3.0eþ8 2.6eþ8 3.8eþ7 4.4eþ8 2.7eþ8 1.3eþ8 4.5eþ7 4.2eþ7

Mean 1.2eþ9 5.2eþ8 3.2eþ8 1.0eþ8 1.1eþ7 7.4eþ8 3.1eþ8 1.0eþ8 1.4eþ7 0.9eþ7

Std 0.3eþ9 3.4eþ8 2.4eþ8 1.5eþ8 1.7eþ7 3.8eþ8 1.8eþ8 0.7eþ8 2.1eþ7 2.0eþ7

DVCMOA p

3.7e-9 3.9e-9 2.7e-8 1.3e-9 1.3e-9 1.4e-9 7.8e-10 6.5e-9 5.1e-10 4.4e-9

h 1 1 1 1 1 1 1 1 1 1

Mean 1.1eþ9 4.9eþ8 3.0eþ8 1.0eþ8 0.9eþ7 6.9eþ8 2.8eþ8 79.5eþ8 1.6eþ7 1.4eþ7

Std 0.4eþ9 3.5eþ8 2.2eþ8 1.5eþ8 1.6eþ7 3.7eþ8 1.8eþ8 272.4eþ8 2.0eþ7 3.0eþ7

p 6.2e-10 6.5e-10 1.1e-9 2.2e-9 6.2e-10 6.5e-10 5.1e-10 5.6e-5 9.9e-10 3.3e-9

h 1 1 1 1 1 1 1 1 1 1


1-A 1-B 1-C 1-D 1-E 2-A 2-B 2-C 2-D 2-E

V

instances. The number of vertices varies from 24 to 50 and the number of tasks from 34 to 97. As can be seen from Table 8.15, MPCCA performs significantly better than D-MAENS and DVCMOA in 33 of the 34 val instances, and on the remaining instance, val3C, MPCCA also has obvious advantages. Moreover, MPCCA shows better stability, having the smallest standard deviations on most of the val instances. Only in a few instances are the median differences between MPCCA and D-MAENS or DVCMOA not significant.
Table 8.16 shows the statistical results of the performance indicator HV among the three algorithms on test set val. As can be seen from Table 8.16, MPCCA finds the largest average value of HV in most val instances, which shows that MPCCA has a stronger ability to maintain diversity and converge to the true Pareto-optimal front. As described above, the standard deviations of HV follow no clear pattern. As for the Wilcoxon signed rank test, the median results among the three algorithms are significantly different in most val instances.
Table 8.17 shows the statistical results of the performance indicator ID among the three algorithms on test set egl. The egl set, generated by Eglese, is a large-scale test set for CARP. Based on the data from a winter gritting application in Lancashire, it includes 24 instances on two graphs, and the number of tasks varies from 51 to 190. ID reflects the distribution of the nondominated solutions and the extent of the convergence to the true Pareto front. In Table 8.17, it can be observed that MPCCA performs significantly better than the others on all 24 egl instances. Moreover, in most of the egl instances, D-MAENS obtains a better result than DVCMOA. In addition, in egl-s2-A and egl-s3-A, the Wilcoxon signed rank test between MPCCA and D-MAENS returns h = 0, and in egl-e3-A the test between MPCCA and DVCMOA returns h = 0; on the other test instances, the tests return h = 1.
Table 8.18 shows the statistical results of the performance indicator HV among the three algorithms on test set egl. MPCCA is the winner in 23 instances; only in egl-e1-C does MPCCA not win. In almost all the test instances, the median differences between MPCCA and the compared algorithms are significant.
Table 8.19 shows the statistical results of the performance indicator ID among the three algorithms on test set EGL-G. EGL-G has 255 vertices and 375 edges, the largest among all these test sets. As can be seen from Table 8.19, MPCCA has obvious advantages over the other two algorithms in all 10 EGL-G instances; as for the mean of ID, D-MAENS is not as good as MPCCA. Meanwhile, D-MAENS has slightly better stability than the other two algorithms because it obtains the smallest standard

deviations in five of the 10 EGL-G instances. In all the EGL-G test instances, the results of the Wilcoxon signed rank test return h = 1, which indicates that the median differences between MPCCA and the compared algorithms are significant.
Table 8.20 shows the statistical results of the performance indicator HV among the three algorithms on the test set EGL-G. In Table 8.20, MPCCA performs significantly better than the others in nine EGL-G instances, and the results of the Wilcoxon signed rank test also demonstrate the advantage of MPCCA. DVCMOA is significantly better in one EGL-G instance, and D-MAENS fails to be the best in any of the EGL-G instances.
8.5.4.2 DE-ICA
We select ID, Purity, and hypervolume (HV) to measure the performance of DE-ICA and its comparison algorithms for MO-CARP. During the comparison, if an algorithm is better than the other two on two or all three of the indicators, it becomes the winner.
Table 8.21 shows the performance of the three algorithms on Beullens C. We can see from Table 8.21 that DE-ICA performs better than D-MAENS and ID-MAENS in 14 instances. Meanwhile, on C04 the performance of DE-ICA is the same as that of ID-MAENS and better than that of D-MAENS. DE-ICA is inferior to the other two algorithms in only three instances. Thus, DE-ICA has certain advantages on Beullens C.
Table 8.22 shows the performance of the three algorithms on Beullens D, where the advantage of DE-ICA is not particularly evident; the three algorithms show different advantages on different instances. DE-ICA performs better in six instances, ID-MAENS in four instances, and D-MAENS in three instances. Therefore, the performance of DE-ICA is slightly better than that of D-MAENS and ID-MAENS as a whole.
Table 8.23 shows the performance of the three algorithms on Beullens E. In Table 8.23, DE-ICA performs best in 12 instances and ID-MAENS is the winner in four instances, while D-MAENS fails to be the winner on any instance. Thus, DE-ICA has certain advantages on Beullens E.
Table 8.24 shows the performance of the three algorithms on Beullens F. In Table 8.24, DE-ICA works better in nine instances and ID-MAENS in only one instance; D-MAENS works better than DE-ICA and ID-MAENS in two instances. For about half of the 25 instances, all three algorithms work equally well. Thus, compared with D-MAENS and ID-MAENS, DE-ICA has certain advantages on Beullens F. Overall, the performance of DE-ICA has certain advantages over D-MAENS and ID-MAENS on


Table 8.21: The performance of D-MAENS, ID-MAENS, and DE-ICA on Beullens C. (The rows of values below list, in order, for instances C01 to C25: the ID, Purity, and HV of D-MAENS; the ID, Purity, and HV of ID-MAENS; the ID, Purity, and HV of DE-ICA; and the winner of each instance.)

ID 13.3 0 0 1.7 14.9 6.7 0 3.3 35.4 5 13.8 0 7.6 0 7.5 0 6.1 4.5 0.6 0 0 0 13.8 3.2 0

Purity 0.5 1 0.3 0.7 0.3 0.8 1 0.7 0.3 0.5 0.4 1 0.875 1 0 1 0.8 0 0.9 0.9 1 1 0.3 0.7 1

ID-MAENS HV

1950 0 0 2462.5 83,050 8237.5 0 22,750 4850 2900 41,288 700 84,463 185,740 425 2487.5 117,660 1362.5 54,488 31,400 0 94,700 3362.5 46,000 7387.5

ID 8.9 0 0 0 11.3 5.1 0 3.3 25 5 19.5 0 0.6 0 2.5 0 6.7 0 5 0 0 0 9.7 3.2 0

Purity 0.3 1 0.3 1 0.6 0.7 1 0.5 0.5 0.3 0.5 1 0.9 0.8 0.5 1 1 1 0.9 0.9 1 1 0 0.7 1

DE-ICA HV

2362.5 0 0 2775 99,175 7925 0 24,725 4425 5612.5 40,425 700 85,388 185,740 425 2487.5 128,075 0 52,750 31,400 0 94,700 5725 46,050 7387.5

ID 3.5 0 0 0 3 1.7 0 0 0 0 9.1 0 3.8 0 7.5 0 1.5 4.5 0.6 0 0 0 6.3 8.7 0

Purity 1 1 1 1 0.8 0.8 1 1 1 1 0.8 1 0.8 1 0.5 1 0.9 0 0.9 1 1 1 0.7 0.5 1

HV 2212.5 0 237.5 2775 109,790 8437.5 0 23,000 4237.5 3100 33,013 700 85,613 185,740 437.5 2487.5 135,600 487.5 54,488 31,400 0 94,700 5475 48,575 7387.5

Winner DE-ICA Both DE-ICA ID-MAENS, DE-ICA DE-ICA DE-ICA Both DE-ICA DE-ICA DE-ICA DE-ICA Both ID-MAENS DE-ICA DE-ICA Both DE-ICA MAENS DE-ICA DE-ICA Both Both DE-ICA D-MAENS, ID-MAENS Both

Table 8.22: The performance of D-MAENS, ID-MAENS, and DE-ICA on Beullens D. (The rows of values below list, in order, for instances D01 to D25: the ID, Purity, and HV of D-MAENS; the ID, Purity, and HV of ID-MAENS; the ID, Purity, and HV of DE-ICA; and the winner of each instance.)

7.2 0 0 4.4 9.9 2.6 0 5.1 18.2 0 114.9 1.1 0 0 6.3 0 0 15.2 4 0 0 0 11 10 0

Purity 0.8 1 1 0.7 0.6 0.9 1 0.7 0.3 1 0.6 0.8 1 1 0.3 1 1 0.1 0.9 1 1 1 0.3 0.5 1

HV 38,050 44,125 84,675 78,450 293,740 208,275 45,575 202,650 184,375 35,900 113,925 48,150 238,050 477,490 36,838 14,150 268,825 69,500 220,760 138,940 3225 432,640 65,163 314,040 125,175

ID 25.7 0 0 7.5 6.9 1.6 0 4.8 9.6 0 38.8 2.0 1.0 2 1.1 0 0 7.5 16.9 0 0 0 1.7 10 0

Purity 0.7 1 1 1 0.8 1 1 0.9 0.4 1 0.4 0.8 1 1 0.7 1 0.9 0.4 0.9 1 1 1 0.8 0.9 1

DE-ICA HV

38,038 44,125 84,675 80,363 294,100 208,940 45,575 190,325 181,740 35,900 289,825 47,663 236,900 477,490 35,813 14,150 268,825 62,425 220,750 138,940 3225 432,640 57,750 317,790 125,175

ID 7 0 0 3.1 7.8 0 0 3.6 4.6 0 14.3 0.9 0.5 0.5 3.7 0 0 68.2 53.3 0 0 0 4.4 6.2 0

Purity 1 1 1 0.7 0.8 1 1 0.9 0.9 1 0.9 0.9 1 1 0.7 1 1 0.9 1 1 1 1 0.8 1 1

HV 39,313 44,125 84,675 78,313 295,175 208,275 45,575 200,040 176,490 35,900 289,750 48,075 237,800 477,440 36,400 14,150 268,825 64,100 220,890 138,940 3225 432,640 56,613 318,060 125,175

Winner DE-ICA Both Both ID-MAENS ID-MAENS ID-MAENS, DE-ICA Both DE-ICA DE-ICA Both DE-ICA DE-ICA D-MAENS D-MAENS ID-MAENS Both D-MAENS, DE-ICA Both D-MAENS Both Both Both ID-MAENS DE-ICA Both


D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 D14 D15 D16 D17 D18 D19 D20 D21 D22 D23 D24 D25

ID

ID-MAENS


Table 8.23: The performance of D-MAENS, ID-MAENS, and DE-ICA on Beullens E. (The rows of values below list, in order, for instances E01 to E25: the ID, Purity, and HV of D-MAENS; the ID, Purity, and HV of ID-MAENS; the ID, Purity, and HV of DE-ICA; and the winner of each instance.)

ID 19.8 11 0 40.0 5.3 1.4 0 0 32.0 0 24.0 25.7 6.4 14.3 0 14.3 7.5 6.5 3.2 0 20.0 3.8 13.8 20.6 0

Purity 0 0.6 1 0.7 1 1 1 1 0.3 1 0.5 0.6 0.8 0.8 0.5 0.7 0.8 0.7 0.5 1 0.7 0.8 0.5 0.2 1

ID-MAENS HV

4337.5 20,088 108,700 2550 54,113 27,025 8100 33,575 88,725 103,860 1925 13,013 101,910 216,325 375 1275 172,425 12,300 110,790 850 182,225 63,113 49,025 8162.5 24,600

ID 12.8 25 0 82.6 16.3 1.4 0 0 15 0 25.4 4.1 0 11.6 0 11.6 3.7 2.5 2.2 0 29.0 2.0 0 33.3 0

Purity 0.2 0.8 1 0 0.4 1 1 1 0.4 1 0.5 1 1 0.8 0.5 1 1 0.7 0.6 1 0.3 0.8 1 0.3 1

DE-ICA HV

2950 19,325 108,700 5337.5 65,238 27,025 8100 33,575 83,788 103,860 925 13,388 93,388 258,250 375 1200 174,640 17,950 88,188 850 227,390 62,425 50,600 11,550 24,600

ID 2.5 3.6 0 13.0 2.5 0 0 0 9.2 0 4.0 33 5.6 0.4 0 0.4 36 2.5 0.6 0 2.3 2.4 11.3 13 0

Purity 0.8 1 1 1 0.9 1 1 1 0.4 1 0 0.3 0.8 0.9 0.5 0.8 0.7 0.8 0.8 1 1 1 0.5 0.5 1

HV 3550 24,650 108,700 5312.5 59,900 27,025 8100 33,575 92,063 103,860 4675 11,475 95,725 249,275 375 1175 199,010 13,063 88,000 850 163,260 63,425 49,300 8300 24,600

Winner DE-ICA DE-ICA Both DE-ICA Both DE-ICA Both Both DE-ICA Both DE-ICA ID-MAENS ID-MAENS DE-ICA Both Both ID-MAENS DE-ICA DE-ICA Both DE-ICA DE-ICA ID-MAENS DE-ICA Both

Table 8.24: The performance of D-MAENS, ID-MAENS, and DE-ICA on Beullens F. (The rows of values below list, in order, for instances F01 to F25: the ID, Purity, and HV of D-MAENS; the ID, Purity, and HV of ID-MAENS; the ID, Purity, and HV of DE-ICA; and the winner of each instance.)

14.5 6.5 0 8.9 4.7 0 0 3.1 24.8 0.8 10.0 2.8 1.8 4.6 18.3 0 0 17.1 12.8 0 20.6 2.4 15.3 15.8 0

Purity 0.6 0.6 1 0.5 0.7 1 1 0.9 0.1 1 0.2 1 1 0.9 0.6 1 1 0.2 0.4 1 0.6 1 0.4 0.1 1

HV 169,850 353,725 188,600 140,010 464,690 170,140 182,150 571,650 597,940 267,410 143,490 125,000 223,175 711,950 52,400 83,625 490,290 145,000 397,140 97,263 303,690 183,475 287,140 211,010 100,260

ID 12.8 2.9 0 19.7 0 0.8 0 9.4 23.2 0.8 3.4 3.4 1.8 2.6 14.8 6.4 0 13.1 6.1 0 7.9 1.5 12.1 8.1 0

Purity 0.9 0.8 1 0.9 1 1 1 0.8 0.4 1 0.5 0.8 1 1 0.7 0.9 1 0.5 0.8 1 0.7 1 0.7 0.1 1

DE-ICA HV 163,250 321,490 188,600 104,575 465,750 170,090 182,150 569,750 618,210 267,410 141,400 122,775 223,175 654,750 65,938 86,088 490,290 145,000 394,475 97,263 277,350 183,375 290,040 194,890 100,260

ID 5 0 0 0 2.0 0 0 0.9 2.8 0.6 2.6 7.8 0 2.4 6.1 4.1 0 8.1 1.0 0 15.0 0 0 0 0

Purity 0.7 1 1 1 0.8 1 1 0.9 0.9 1 0.6 0.5 1 0.9 0.5 0.7 1 0.9 1 1 0.8 1 1 1 1

HV 166,940 323,010 188,600 135,850 464,710 170,140 182,150 571,225 564,210 265,210 133,590 119,410 222,650 667,960 68,488 83,738 490,290 143,925 394,460 97,263 301,610 180,400 271,900 188,640 100,260

Winner Both DE-ICA Both DE-ICA ID-MAENS D-MAENS,DE-ICA Both D-MAENS,DE-ICA DE-ICA Both DE-ICA D-MAENS Both Both DE-ICA D-MAENS Both DE-ICA DE-ICA Both Both D-MEANS,DE-ICA DE-ICA DE-ICA Both


F01 F02 F03 F04 F05 F06 F07 F08 F09 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 F23 F24 F25

ID

ID-MAENS

the small-scale set Beullens. This shows that DE-ICA can obtain more satisfactory nondominant sets when solving small-scale MO-CARP.
Next we study the performance of the three algorithms on the medium-scale set egl. In Table 8.25, the better results are displayed in bold. We can see that DE-ICA performs better than the other two algorithms in 22 instances. As for the ID and Purity columns in Table 8.25, DE-ICA reaches the theoretical optimum of 0 for ID in 13 instances and the theoretical optimum of 1 for Purity in 15 instances, which illustrates that DE-ICA steadily obtains all the optimal nondominant solutions. The performance of DE-ICA is far better than that of D-MAENS and ID-MAENS on the medium-scale set egl, which proves the effectiveness of DE-ICA when solving medium-scale MO-CARP.
When assessing the pros and cons of an algorithm, we should pay attention not only to small-scale and medium-scale MO-CARP but also to the difficult large-scale MO-CARP. As shown in Table 8.26, we study the performance of the three algorithms on the large-scale set EGL-G, with the better results displayed in bold. In Table 8.26, DE-ICA performs better than D-MAENS and ID-MAENS in all the instances. Furthermore, DE-ICA reaches the theoretical optimum of 0 for ID in seven instances and the theoretical optimum of 1 for Purity in all instances, which shows that DE-ICA steadily obtains all the optimal nondominant solutions. The experimental results show that DE-ICA has obvious advantages over D-MAENS and ID-MAENS. To conclude, DE-ICA has certain advantages on small-scale MO-CARP, powerful advantages on medium-scale MO-CARP, and full advantages on large-scale MO-CARP.
8.5.4.3 IRDG-MAENS
Table 8.27 lists the results of IRDG-MAENS and RDG-MAENS on single-objective LSCARP, together with the significance test values for the differences between the two algorithms. When ranking the algorithms on an instance, the one with the lowest mean value is named the winner, and the best results are displayed in bold. In Table 8.27, on the 25 Beullens C instances IRDG-MAENS finds 10 better solutions, and only one of its solutions is equal to that of RDG-MAENS. On the other 25 instances, those of Beullens D, IRDG-MAENS obtains 15 better solutions than RDG-MAENS, one of which is the same as that of RDG-MAENS. It can be clearly seen from Table 8.27 that the IRDG-MAENS algorithm can find better solutions than the original algorithm. For the Wilcoxon signed rank test, instances C05, C09, C11, C13, C16, C19, C22, D03, D17, D18, D19, D23, and D25 get h = 1, which shows that the results of IRDG-MAENS are significantly better than those of RDG-MAENS.

Table 8.25: The performance of D-MAENS, ID-MAENS, and DE-ICA on egl. (The rows of values below list, in order, for the egl instances e1-A to s4-C: the ID, Purity, and HV of D-MAENS; the ID, Purity, and HV of ID-MAENS; the ID, Purity, and HV of DE-ICA; and the winner of each instance.)

0 7.2 22 0.9 38.0 5.0 14.6 24.4 20.2 51.0 36.0 130 26.7 19.2 44.8 81.0 80.1 130.7 53.1 70.5 259.1 92.4 126.9 399

Purity 1 1 0 0.4 0 0.3 0.1 0 0.3 0 0 0 0.5 0.3 0 0 0 0 0 0 0 0 0 0

HV 26,228 142.5 448 104,439 15,542 3313.5 154,422 9498.5 265.5 135,780 13,080 0 197,130 60,322 47,694 97,846 35,152 7852.5 157,560 57,581 7845 4366 10 0

ID 0.9 0 2.5 3.6 16.1 5.7 22.6 12.4 44.0 8.0 32.0 106.8 22.4 19.2 23.5 84.9 72.6 71.0 35.7 29.2 98.0 16.6 55 108

Purity 1 1 0.5 0.4 0.3 0.3 0.3 0.2 0.3 0.6 0 0 0.4 0.6 0.7 0.1 0 0 0.1 0.5 0 0.4 0 0

DE-ICA HV

26,491 32 656 102,920 15,526 3756 195,380 11,137 449.5 115,140 20,468 5599 198,090 62,496 46,137 106,840 34,866 6817.5 177,570 52,420 12,223 5696.5 0 0

ID 7.4 0 0 0 2.5 0 1.0 2.8 2.1 7.7 0 0 1.1 0 0 5.5 0 0 4.8 11.1 0 7.8 0 0

Purity 0.6 1 1 0.9 0.7 1 0.7 1 0.3 0.6 1 1 0.9 1 1 1 1 1 0.9 0.6 1 1 1 1

HV 26,596 43 656 104,430 15,437 4034 142,900 11,852 580 123,231 12,483 4018 195,745 59,986 46,509 98,628 37,432 12,611 122,092 52,235 15,113 6099.5 0 0

Winner D-MAENS Both DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA


e1-A e1-B e1-C e2-A e2-B e2-C e3-A e3-B e3-C e4-A e4-B e4-C s1-A s1-B s1-C s2-A s2-B s2-C s3-A s3-B s3-C s4-A s4-B s4-C

ID

ID-MAENS


Table 8.26: The performance of D-MAENS, ID-MAENS, and DE-ICA on EGL-G. (The rows of values below list, in order, for EGL-G instances G1-A to G1-E and G2-A to G2-E: the ID, Purity, and HV of D-MAENS; the ID, Purity, and HV of ID-MAENS; the ID, Purity, and HV of DE-ICA; and the winner of each instance.)

ID 22,411 35,050 19,935 11,818 10,163 27,601 25,815 5114.5 27,211 37,024

Purity 0 0 0 0 0 0 0 0 0 0

ID-MAENS HV

875,609,980 103,484,856 144,284,011 23,275,162 9,843,456 600,100,000 314,080,000 3,347,904 4,444,384 7,112,672

ID 9312.4 12,299 15,243 9345.1 8584.2 10,898 5926.6 5075 7683.5 7308.4

Purity 0.1 0 0 0.3 0 0 0 0 0 0

DE-ICA HV

518,990,000 130,740,512 205,990,000 27,814,031 13,204,800 578,420,298 213,579,084 2,911,776 8,843,968 13,771,520

ID 36.7 0 0 766 0 0 0 0 0 0

Purity 1 1 1 1 1 1 1 1 1 1

HV 754,420,000 214,402,598 96,504,232 64,570,424 14,034,944 954,960,976 365,219,647 4,420,864 12,163,648 16,136,288

Winner DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA DE-ICA

Evolutionary computation-based multiobjective 285 Table 8.27: The simulation results of two algorithms on Beullens C, D. Name (V, E, T, s) C01(69,98,79,9) C02(48,66,53,7) C03(46,64,51,6) C04(60,84,72,8) C05(56,79,65,10) C06(38,55,51,6) C07(54,70,52,8) C08(66,88,63,8) C09(76,117,97,12) C10(60,82,55,9) C11(83,118,94,10) C12(62,88,72,9) C13(40,60,52,7) C14(58,79,57,8) C15(97,140,107,11) C16(32,42,32,3) C17(43,56,42,7) C18(93,133,121,11) C19(62,84,61,6) C20(45,64,53,5) C21(60,84,76,8) C22(56,76,43,4) C23(78,109,92,8) C24(77,115,84,7) C25(37,50,38,5) D01(69,98,79,5) D02(48,66,53,4) D03(46,64,51,3) D04(60,84,72,4) D05(56,79,65,5) D06(38,55,51,3) D07(54,70,52,4) D08(66,88,63,4) D09(76,117,97,6) D10(60,82,55,5) D11(83,118,94,5) D12(62,88,72,5) D13(40,60,52,4) D14(58,79,57,4) D15(97,140,107,6) D16(32,42,32,2) D17(43,56,42,4) D18(93,133,121,6)

RDG-MAENS 3232.8 2528.5 2082.5 2786.3 3949.7 2171.0 3152.3 3070.5 4127.8 3352.3 3767.8 3327.2 2539.3 3295.0 4001.5 1278.5 2620.0 4194.2 2410.0 1905.5 3097.5 1938.7 3148.0 2738.3 1823.7 3242.0 2537.2 2080.3 2785.0 3945.0 2170.3 3171.3 3079.2 4126.3 3349.0 3767.2 3329.2 2540.2 3290.0 4011.0 1272.0 2626.3 4189.0

IRDG-MAENS 3234.3 2529.7 2082.0 2802.5 3943.3 2167.0 3162.3 3080.3 4130.7 3351.7 3763.5 3330.2 2537.0 3290.7 4012.0 1266.3 2620.0 4194.0 2404.2 1918.7 3099.7 1924.2 3153.2 2744.5 1825.7 3244.0 2534.0 2073.5 2785.0 3942.0 2168.2 3164.3 3076.2 4137.2 3363.7 3773.0 3331.3 2541.0 3293.3 4007.7 1274.7 2620.0 4183.8

p 0.2500 0.1484 1 0.2500 0.0325 0.1816 0.0059 0.0012 0.0325 1 0.0001 0.7813 0.0156 0.2500 0.1563 0.00006 1 1 0.00097 0.0049 1 0.00047 0.00015 0.00097 0.2344 0.5000 0.4375 0.0175 1 0.5000 0.5632 0.3906 0.1875 0.5000 0.0020 0.0050 0.3125 0.00001 0.0225 0.8281 0.0313 0.5000 0.0078

h 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1

Winner RDG-MAENS RDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS Both IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS Both IRDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS Continued

Table 8.27 (cont'd): The simulation results of two algorithms on Beullens C, D.
Name (V, E, T, s): D19(62,84,61,3) D20(45,64,53,3) D21(60,84,76,4) D22(56,76,43,2) D23(78,109,92,4) D24(77,115,84,4) D25(37,50,38,3)

RDG-MAENS 2406.2 1944.0 3098.8 1932.3 3158.2 2740.3 1840.3

IRDG-MAENS 2402.7 1935.5 3098.7 1931.8 3150.3 2740.8 1834.3

p 0.0078 0.1801 0.9795 0.7178 0.0314 0.8242 0.0050

h 1 0 0 0 1 0 1

Winner IRDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS

Table 8.28: The simulation results of two algorithms on Beullens E, F. (The rows of values below list, for each instance, the mean results of RDG-MAENS and IRDG-MAENS, the Wilcoxon signed rank test p and h values, and the winner.)
Name (V, E, T, s): E01(73,105,85,10) E02(58,81,58,8) E03(46,61,47,5) E04(70,99,77,9) E05(68,94,61,9) E06(49,66,43,5) E07(73,94,50,8) E08(74,98,59,9) E09(91,141,103,12) E10(56,76,49,7) E11(80,113,94,10) E12(74,103,67,9) E13(49,73,52,7) E14(53,72,55,8) E15(85,126,107,9) E16(60,80,54,7) E17(38,50,36,5) E18(78,110,88,8) E19(77,103,66,6) E20(56,80,63,7) E21(57,82,72,7) E22(54,73,44,5) E23(93,130,89,8)

RDG-MAENS 4064.0 3321.5 1686.5 3508.0 3608.8 1886.0 3393.2 3712.7 4796.0 2940.0 3870.2 3454.3 2860.3 3383.5 4228.0 2738.3 2055.0 3844.8 2526.7 2460.0 3797.3 2079.0 3747.2

IRDG-MAES 4073.0 3320.3 1682.2 3504.2 3612.3 1890.3 3380.0 3711.8 4794.7 2935.7 3866.7 3457.7 2865.3 3383.5 3572.5 2746.3 2055.0 3125.2 2527.2 2458.0 2949.8 2080.2 3016.7

p 0.0032 0.7813 0.0313 0.8438 0.1250 0.1250 0.00049 0.3750 0.3867 1 0.0225 0.1548 0.0625 1 0.000001 0.1328 1 0.000001 1 0.8789 0.000002 0.0313 0.000002

h 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1

Winner RDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS Both IRDG-MAENS RDG-MAENS Both IRDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS

Evolutionary computation-based multiobjective 287 Table 8.28: The simulation results of two algorithms on Beullens E, F.dcont’d Name (V, E, T, s) E24(97,142,86,8) E25(26,35,28,4) F01(73,105,85,5) F02(58,81,58,4) F03(46,61,47,3) F04(70,99,77,5) F05(68,94,61,5) F06(49,66,43,3) F07(73,94,50,4) F08(74,98,59,5) F09(91,141,103,6) F10(56,76,49,4) F11(80,113,94,5) F12(74,103,67,5) F13(49,73,52,4) F14(53,72,55,4) F15(85,126,107,5) F16(60,80,54,4) F17(38,50,36,3) F18(78,110,88,4) F19(77,103,66,3) F20(56,80,63,4) F21(57,82,72,4) F22(54,73,44,3) F23(93,130,89,4) F24(97,142,86,4) F25(26,35,28,2)

RDG-MAENS 4075.5 1652.0 4060.5 3311.3 1696.2 3507.5 3605.3 1909.8 3393.5 3712.3 4801.8 2936.5 3864.8 3462.3 2860.0 3398.8 3575.7 2741.7 2058.0 3126.3 2527.0 2450.3 2943.3 2077.8 3015.7 3253.5 1510.2

IRDG-MAES 3253.3 1527.3 4068.5 3318.5 1693.7 3507.2 3616.3 1882.7 3386.8 3718.0 4797.5 2944.0 3867.5 3462.3 2860.2 3384.8 3575.5 2759.0 2055.0 3124.0 2525.3 2452.8 2952.3 2078.0 3015.0 3251.5 1485.7

p 0.000001 0.00002 0.0034 0.0078 0.6797 0.0337 0.0068 0.0088 0.0381 0.0313 0.1846 0.1250 0.2132 0.9668 0.8960 0.0213 0.8398 0.0625 1 1 0.2500 0.0156 0.0078 1 0.0679 0.0474 0.0002

h 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 1

Winner IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS Both RDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS RDG-MAENS RDG-MAENS RDG-MAENS IRDG-MAENS IRDG-MAENS IRDG-MAENS

Table 8.29 presents the test results of IRDG-MAENS and RDG-MAENS on the egl test set. Compared with RDG-MAENS, IRDG-MAENS obtains 18 better solutions over the 24 egl instances. The Wilcoxon signed rank test shows that 11 of these 18 better solutions also have h = 1, which confirms their significance. This indicates that IRDG-MAENS can achieve better solutions than RDG-MAENS on most large-scale egl instances.

Table 8.29: The simulation results of two algorithms on egl.

Name (V, E, T, s) | RDG-MAENS | IRDG-MAENS | p | h | Winner
e1-A (77,98,51,5) | 3556.7 | 3552.0 | 0.5000 | 0 | IRDG-MAENS
e1-B (77,98,51,7) | 4530.7 | 4530.4 | 0.5625 | 0 | IRDG-MAENS
e1-C (77,98,51,10) | 5621.4 | 5617.8 | 0.0005 | 1 | IRDG-MAENS
e2-A (77,98,72,7) | 5026.8 | 5022.2 | 0.2500 | 0 | IRDG-MAENS
e2-B (77,98,72,10) | 6344.7 | 6340.5 | 0.0021 | 1 | IRDG-MAENS
e2-C (77,98,72,14) | 8358.1 | 8358.8 | 0.5703 | 0 | RDG-MAENS
e3-A (77,98,87,8) | 5913.5 | 5910.4 | 0.000001 | 1 | IRDG-MAENS
e3-B (77,98,87,12) | 7817.8 | 7814.4 | 0.0877 | 0 | IRDG-MAENS
e3-C (77,98,87,17) | 10,327.9 | 10,322.6 | 0.0125 | 1 | IRDG-MAENS
e4-A (77,98,98,9) | 6479.8 | 6470.2 | 0.000002 | 1 | IRDG-MAENS
e4-B (77,98,98,14) | 9028.4 | 9029.1 | 0.0409 | 1 | RDG-MAENS
e4-C (77,98,98,19) | 11,654.5 | 11,648.6 | 0.1528 | 0 | IRDG-MAENS
s1-A (140,190,75,7) | 5059.5 | 5048.8 | 0.00097 | 1 | IRDG-MAENS
s1-B (140,190,75,10) | 6424.5 | 6428.0 | 0.7803 | 0 | RDG-MAENS
s1-C (140,190,75,14) | 8541.9 | 8533.4 | 0.00024 | 1 | IRDG-MAENS
s2-A (140,190,147,14) | 10,000.9 | 9987.9 | 0.0324 | 1 | IRDG-MAENS
s2-B (140,190,147,20) | 13,203.5 | 13,200.5 | 0.6883 | 0 | IRDG-MAENS
s2-C (140,190,147,27) | 16,488.6 | 16,490.0 | 0.3806 | 0 | RDG-MAENS
s3-A (140,190,159,15) | 10,288.5 | 10,292.3 | 0.0113 | 1 | RDG-MAENS
s3-B (140,190,159,22) | 13,814.1 | 13,797.2 | 0.0194 | 1 | IRDG-MAENS
s3-C (140,190,159,29) | 17,288.7 | 17,287.7 | 0.3279 | 0 | IRDG-MAENS
s4-A (140,190,190,19) | 12,388.6 | 12,396.5 | 0.0010 | 1 | RDG-MAENS
s4-B (140,190,190,27) | 16,407.7 | 16,394.8 | 0.0014 | 1 | IRDG-MAENS
s4-C (140,190,190,35) | 20,672.1 | 20,661.5 | 0.0098 | 1 | IRDG-MAENS

In summary, IRDG-MAENS can find a better solution for large-scale single-objective CARP than RDG-MAENS, with a faster convergence rate, on the majority of the tested instances. Table 8.30 shows the results of IRDG-MAENS and RDG-MAENS on the EGL-G test data. Compared with RDG-MAENS, IRDG-MAENS finds seven better solutions over the 10 EGL-G instances.

Table 8.30: The simulation results of two algorithms on EGL-G.

Name (V, E, T, s) | RDG-MAENS | IRDG-MAENS | p | h | Winner
G1-A (255,375,347,20) | 1,008,717.5 | 1,007,977.1 | 0.0016 | 1 | IRDG-MAENS
G1-B (255,375,347,25) | 1,126,652.7 | 1,125,763.6 | 0.0009 | 1 | IRDG-MAENS
G1-C (255,375,347,30) | 1,254,743.4 | 1,255,674.1 | 0.0003 | 1 | RDG-MAENS
G1-D (255,375,347,35) | 1,388,719.2 | 1,388,277.5 | 0.5170 | 0 | IRDG-MAENS
G1-E (255,375,347,40) | 1,533,089.5 | 1,528,397.0 | 0.0000 | 1 | IRDG-MAENS
G2-A (255,375,375,22) | 1,108,472.7 | 1,108,959.5 | 0.2536 | 0 | RDG-MAENS
G2-B (255,375,375,27) | 1,223,670.2 | 1,223,541.5 | 0.9426 | 0 | IRDG-MAENS
G2-C (255,375,375,32) | 1,354,538.8 | 1,353,653.7 | 0.0333 | 1 | IRDG-MAENS
G2-D (255,375,375,37) | 1,493,660.2 | 1,495,822.2 | 0.9590 | 0 | RDG-MAENS
G2-E (255,375,375,42) | 1,637,388.9 | 1,636,473.4 | 0.0256 | 1 | IRDG-MAENS

The Wilcoxon signed rank test shows that, of the seven instances on which IRDG-MAENS obtains better solutions, five also have h = 1. The results on most of the large EGL-G instances are clearly improved by IRDG-MAENS, which shows the effectiveness of IRDG-MAENS in solving single-objective LSCARP.
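The p and h columns of Tables 8.27-8.30 follow the usual paired Wilcoxon signed rank protocol over repeated independent runs. The sketch below shows how such values could be produced with SciPy; the per-run cost lists and the 0.05 significance level are assumptions for illustration, not the data behind the tables.

from scipy.stats import wilcoxon

def compare_runs(costs_a, costs_b, alpha=0.05):
    """Paired Wilcoxon signed rank test between two algorithms on one instance.

    costs_a, costs_b: best costs of repeated independent runs (hypothetical
    values; the book reports only the resulting p and h). Returns (p, h, winner)
    in the spirit of the tables: h = 1 means the difference is significant at
    level alpha, and the lower mean cost wins (CARP is a minimization problem).
    """
    stat, p = wilcoxon(costs_a, costs_b)
    h = 1 if p < alpha else 0
    mean_a = sum(costs_a) / len(costs_a)
    mean_b = sum(costs_b) / len(costs_b)
    winner = "A" if mean_a < mean_b else "B"
    return p, h, winner

# Hypothetical per-run costs for one instance (30 runs would be used in practice).
p, h, winner = compare_runs([3556.2, 3557.0, 3556.9, 3555.8, 3556.4],
                            [3551.8, 3552.3, 3552.0, 3552.6, 3551.9])
print(p, h, winner)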

8.5.5 Comparison of nondominated solutions

8.5.5.1 MPCCA
In order to comprehensively evaluate the performance of MPCCA, D-MAENS, and DVCMOA, the nondominated solutions obtained by them on some instances are plotted in the objective space. For the small- and medium-scale instances (gdb and val), these nondominated solutions overlap heavily and are difficult to distinguish, so it is more informative to visualize the PF for the larger-scale instances. In practice, for each instance, all the solutions obtained throughout the multiple runs are first combined; the nondominated solutions are then identified and selected for display. Owing to limited space, we take the egl set as representative and select a few instances to illustrate in Fig. 8.11. As can be seen from Fig. 8.11, on egl-e3-A, MPCCA finds all the nondominated solutions in the intermediate region (makespan from 830 to 870 and total-cost from 6300 to 6500). On egl-e2-B, MPCCA also has the advantage of strong convergence in the intermediate region (6350 < total-cost < 6650 and 825 < makespan < 855). On egl-e4-B and egl-s1-C, MPCCA pays more attention to the low total-cost region, while D-MAENS and DVCMOA do not have this advantage. On egl-e3-B, MPCCA converges better than the others, and the solutions found by the other algorithms are mostly dominated by those of MPCCA. In addition, on egl-s2-C, egl-s3-C, and egl-s4-A, MPCCA finds almost all the nondominated solutions, and its PF is closer to the true PF. For the remaining egl test instances, MPCCA still has a stronger capability of converging to the true PF than the other two algorithms.

8.5.5.2 DE-ICA
We present the nondominated solutions of the three algorithms, selected according to the following rule. First, one group of nondominated solutions is acquired from each independent run; after 30 runs, there are 30 groups of nondominated solutions. Finally, these 30 groups are pooled, and the nondominated solutions of the pooled set are selected for presentation (a sketch of this filtering step is given below). Owing to limited space, some small-scale and some large-scale instances are selected for Fig. 8.12 to show the nondominated solutions. In Fig. 8.12, the first two rows are small-scale instances, where DE-ICA converges better than the other two algorithms on two of them; on the other four small instances, DE-ICA is no worse than D-MAENS and ID-MAENS.
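As described above, the plotted fronts are obtained by pooling the solutions of all independent runs and keeping only the nondominated ones. The following minimal sketch performs that filtering for the two minimized objectives used here (total-cost and makespan); the pooled solution list is hypothetical.

def nondominated(solutions):
    """Keep the nondominated (total_cost, makespan) pairs, both minimized."""
    front = []
    for s in solutions:
        # s is dominated if some other solution is no worse in both objectives
        # and differs from s in at least one of them.
        dominated = any(d[0] <= s[0] and d[1] <= s[1] and d != s for d in solutions)
        if not dominated:
            front.append(s)
    return front

# Hypothetical pooled solutions from 30 runs of one algorithm on one instance.
pool = [(6400, 850), (6350, 860), (6500, 840), (6400, 845), (6600, 835)]
print(nondominated(pool))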

Figure 8.11 The PF of the three algorithms (MPCCA, D-MAENS, and DVCMOA) in some instances in egl: egl-e2-B, egl-e3-A, egl-e3-B, egl-e4-B, egl-s1-C, egl-s2-B, egl-s2-C, egl-s3-A, egl-s3-B, egl-s3-C, egl-s4-A, and egl-s4-B (makespan plotted against total-cost).

Figure 8.12 The nondominated solutions obtained by the three algorithms in some small- and large-scale instances.

In general, DE-ICA shows a slight advantage on the small instances. The reason is that small instances have a smaller solution space and are easier to solve, so the two comparison algorithms are also able to handle them and obtain performance similar to that of DE-ICA. In the last two rows, DE-ICA has the best convergence on five instances (e4-A, s1-A, s4-A, G1-A, and G2-E). On these instances, DE-ICA can reach the regions where both the total cost and the makespan are low.

Therefore, the nondominated solutions obtained by DE-ICA almost completely dominate those obtained by D-MAENS and ID-MAENS. In the other cases, the convergence of DE-ICA is no worse than that of the other two algorithms. In short, DE-ICA shows obvious advantages in convergence on these large instances, and the results show that DE-ICA is more suitable for large-scale problems.

8.5.5.3 IRDG-IDMAENS
The following figures show the experimental results of IRDG-IDMAENS, which combines IRDG-MAENS and IDMAENS, for solving multiobjective LSCARP. In order to show the distribution of the nondominated solutions of IRDG-IDMAENS in the objective space, Fig. 8.13 plots the results of IRDG-IDMAENS and IDMAENS on four test problems. The horizontal axis represents the total cost of all routes, and the vertical axis denotes the maximum cost of a single route. The symbol "o" indicates IDMAENS, and "*" represents IRDG-IDMAENS. Fig. 8.13 shows that IRDG-IDMAENS has better convergence than IDMAENS on the gdb5 and gdb10 instances, which is consistent with the purity values according to their definition.

Figure 8.13 The PF of the two algorithms in some instances in gdb.

However, because the scale of the gdb test set is small, the superiority of IRDG-IDMAENS for solving multiobjective CARP is not obvious, and the performances of the two algorithms on this test set are similar. The distributions of the nondominated solutions obtained by IRDG-IDMAENS and IDMAENS on val1B, val4A, val5A, and val7A are shown in Fig. 8.14. It can be seen from Fig. 8.14 that the ability of IRDG-IDMAENS to find the optimal solution, and its convergence rate, are visibly better than those of IDMAENS. IRDG-IDMAENS can find more nondominated solutions than IDMAENS, which demonstrates that its ability to search for solutions is stronger. Using the adjacent shared areas not only accelerates the convergence rate but also increases the diversity of the solutions. For the medium-sized data set val, the advantages of IRDG-IDMAENS over IDMAENS are significantly greater than those observed on the small-scale data set. In order to see the distribution of the nondominated solutions more clearly, Fig. 8.15 plots the results of the two algorithms on the egl test set.

Figure 8.14 The PF of the two algorithms in some instances in val.


Figure 8.15 The PF of the two algorithms in some instances in egl.

It can be seen from Fig. 8.15 that both the ability to find better solutions and the convergence of IRDG-IDMAENS are stronger than those of IDMAENS. IRDG-IDMAENS performs well both in finding better solutions in the middle region and in finding the front of the multiobjective problem, and it finds more nondominated solutions than IDMAENS.

The front found by IRDG-IDMAENS is significantly better than that of IDMAENS; IRDG-IDMAENS is effective in searching for solutions and is suitable for solving multiobjective LSCARP. In order to show the distribution of the nondominated solutions in the objective space, the results generated by both algorithms on EGL-G are shown in Fig. 8.16. Fig. 8.16 shows that IRDG-IDMAENS can find better optimal solutions, which converge better to the true front than those of IDMAENS. In summary, IRDG-IDMAENS finds a significantly better front than IDMAENS when handling multiobjective LSCARP; its convergence is significantly better, and its diversity is also better in most instances. As the scale of the data grows, the advantages of IRDG-IDMAENS become increasingly apparent, as can be seen by comparing the results on the small-scale, medium-scale, and large-scale test data.

Figure 8.16 The PF of the two algorithms in some instances in EGL-G.

This is because IRDG-IDMAENS uses the RDG decomposition procedure to solve large-scale problems and dynamically allocates the solutions of each decomposed problem on the basis of the current population information. In addition, it updates the optimal solutions of the decomposed problems in a timely manner and enables the replaced solutions to participate in the next cycle of the solution process. IRDG-IDMAENS not only retains the characteristics of RDG-MAENS but also retains the optimal solution of each decomposed problem. Furthermore, IRDG-IDMAENS adopts a fast and simple allocation scheme according to the magnitude of the route-direction vectors, thereby distributing the computing resources evenly. Overall, the results suggest that IRDG-IDMAENS is effective for solving multiobjective LSCARP.

8.6 Summary
This chapter has presented three algorithms based on nature-inspired computation for the multiobjective capacitated arc routing problem (MO-CARP), i.e., MPCCA, DE-ICA, and IRDG-MAENS [16-18]. The first algorithm, MPCCA, investigates MO-CARP within the framework of CA. First, a set of uniformly distributed direction vectors is generated to divide the whole objective space into multiple subregions. The individuals in different subregions form different subpopulations, which are not static: before each iteration, all the individuals in the current population are sorted according to the different direction vectors and then assigned evenly to N subpopulations. These subpopulations evolve separately, while adjacent subpopulations can share their individuals in the form of cooperative subgroups. By referencing other evolutionary strategies, such as the elitism archiving mechanism, NSGA-II, and MAENS for SO-CARP, a multipopulation CA for MO-CARP (MPCCA) is proposed. Compared with the state-of-the-art algorithms D-MAENS and DVCMOA, MPCCA shows better diversity and faster convergence, especially on large-scale instances.

The second algorithm, DE-ICA, increases the scale of the initial population and then performs the clonal operation on the nondominated solutions. DE-ICA also draws lessons from the effective decomposition operation, and it proposes a novel directed comparison operator that constantly improves the quality of the nondominated solutions while increasing the diversity of the population. Simulation results show that DE-ICA is competitive in improving the quality of the nondominated solutions and performs better than D-MAENS and ID-MAENS. Meanwhile, the distribution of the Pareto fronts of the three algorithms shows that DE-ICA has good convergence. As a result, DE-ICA is very competitive in solving MO-CARP.

Finally, IRDG-MAENS was presented, which analyzes and improves the RDG-MAENS algorithm previously used for solving single-objective LSCARP. Better solutions are generated by IRDG-MAENS than by RDG-MAENS when solving single-objective LSCARP.

This is due to the algorithm using the RDG decomposition procedure to solve large-scale problems and dynamically allocating the solution of each decomposed problem based on the current population information. IRDG-MAENS benefits from updating the optimal solution of each decomposed problem in time and enabling it to participate in the subsequent solution cycles. The improved algorithm not only retains the advantages of RDG-MAENS but also retains the optimal solution of each decomposed problem. Experimental results show that IRDG-MAENS outperforms RDG-MAENS on most test instances [50].

References
[1] Feng L, Ong YS, Lim MH, et al. Memetic search with interdomain learning: a realization between CVRP and CARP. IEEE Transactions on Evolutionary Computation 2015;19(5):644-58.
[2] Chen X, Ong YS, Lim MH, et al. Cooperating memes for vehicle routing problems. International Journal of Innovative Computing, Information and Control 2011;7(11):1-10.
[3] Chen X, Feng L, Soon Ong Y. A self-adaptive memeplexes robust search scheme for solving stochastic demands vehicle routing problem. International Journal of Systems Science 2012;43(7):1347-66.
[4] Lacomme P, Prins C, Sevaux M. A genetic algorithm for a bi-objective capacitated arc routing problem. Computers and Operations Research 2006;33(12):3473-93.
[5] Ong YS, Lim MH, Zhu N, et al. Classification of adaptive memetic algorithms: a comparative study. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2006;36(1):141-52.
[6] Ong YS, Lim MH, Chen X. Memetic computation - past, present & future [research frontier]. IEEE Computational Intelligence Magazine 2010;5(2):24-31.
[7] Ulusoy G. The fleet size and mix problem for capacitated arc routing. European Journal of Operational Research 1985;22(3):329-37.
[8] Eglese RW. Routeing winter gritting vehicles. Discrete Applied Mathematics 1994;48(3):231-44.
[9] Hertz A, Laporte G, Mittaz M. A tabu search heuristic for the capacitated arc routing problem. Operations Research 2000;48(1):129-35.
[10] Beullens P, Muyldermans L, Cattrysse D, et al. A guided local search heuristic for the capacitated arc routing problem. European Journal of Operational Research 2003;147(3):629-43.
[11] Lacomme P, Prins C, Ramdane-Cherif W. Competitive memetic algorithms for arc routing problems. Annals of Operations Research 2004;131(1-4):159-85.
[12] Tang K, Mei Y, Yao X. Memetic algorithm with extended neighborhood search for capacitated arc routing problems. IEEE Transactions on Evolutionary Computation 2009;13(5):1151-66.
[13] Mei Y, Li X, Yao X. Cooperative coevolution with route distance grouping for large-scale capacitated arc routing problems. IEEE Transactions on Evolutionary Computation 2014;18(3):435-49.
[14] Mei Y, Tang K, Yao X. Decomposition-based memetic algorithm for multiobjective capacitated arc routing problem. IEEE Transactions on Evolutionary Computation 2011;15(2):151-65.
[15] Shang RH, Wang J, Jiao L, et al. An improved decomposition-based memetic algorithm for multiobjective capacitated arc routing problem. Applied Soft Computing 2014;19:343-61.
[16] Shang R, Wang Y, Wang J, et al. A multi-population cooperative coevolutionary algorithm for multiobjective capacitated arc routing problem. Information Sciences 2014;277:609-42.
[17] Shang RH, Du BQ, Ma HN, et al. Immune clonal algorithm based on directed evolution for multiobjective capacitated arc routing problem. Applied Soft Computing 2016;49:748-58.
[18] Shang RH, Dai KY, Jiao LC, et al. Improved memetic algorithm based on route distance grouping for multiobjective large scale capacitated arc routing problems. IEEE Transactions on Cybernetics 2016;46(4):1000-13.
[19] Golden BL, Wong RT. Capacitated arc routing problems. Networks 1981;11(3):305-15.
[20] Batista LS, Campelo F, Guimarães FG, et al. Pareto cone ε-dominance: improving convergence and diversity in multiobjective evolutionary algorithms. In: International conference on evolutionary multicriterion optimization. Berlin, Heidelberg: Springer; 2011. p. 76-90.
[21] Hughes EJ. MSOPS-II: a general-purpose many-objective optimiser. In: Evolutionary computation, 2007. CEC 2007. IEEE congress on. IEEE; 2007. p. 3944-51.
[22] Messac A, Ismail-Yahaya A, Mattson CA. The normalized normal constraint method for generating the Pareto frontier. Structural and Multidisciplinary Optimization 2003;25(2):86-98.
[23] Kramer O, Koch P. Rake selection: a novel evolutionary multiobjective optimization algorithm. In: Annual conference on artificial intelligence. Berlin, Heidelberg: Springer; 2009. p. 177-84.
[24] Reynoso-Meza G, Sanchis J, Blasco X, et al. Design of continuous controllers using a multiobjective differential evolution algorithm with spherical pruning. In: European conference on the applications of evolutionary computation. Berlin, Heidelberg: Springer; 2010. p. 532-41.
[25] Zhang QF, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 2007;11(6):712-31.
[26] Chakraborty P, Das S, Roy GG, et al. On convergence of the multiobjective particle swarm optimizers. Information Sciences 2011;181(8):1411-25.
[27] Golden BL, DeArmon JS, Baker EK. Computational experiments with algorithms for a class of routing problems. Computers and Operations Research 1983;10(1):47-59.
[28] Wiegand RP. An analysis of cooperative coevolutionary algorithms. George Mason University; 2003.
[29] Hillis WD. Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D: Nonlinear Phenomena 1990;42(1-3):228-34.
[30] Deb K, Pratap A, Agarwal S, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182-97.
[31] Singh HK, Ray T, Smith W. C-PSA: Constrained Pareto simulated annealing for constrained multiobjective optimization. Information Sciences 2010;180(13):2499-513.
[32] Knowles J, Corne D. The Pareto archived evolution strategy: a new baseline algorithm for Pareto multiobjective optimisation. In: Evolutionary computation, 1999. CEC 99. Proceedings of the 1999 congress on, vol. 1. IEEE; 1999. p. 98-105.
[33] Shang R, Jiao L, Liu F, et al. A novel immune clonal algorithm for MO problems. IEEE Transactions on Evolutionary Computation 2012;16(1):35-50.
[34] Zhang J, Zhou Q. Study on the optimization of logistics distribution VRP based on immune clone algorithm. Journal of Hunan University 2004;5:013.
[35] Shang R, Ma H, Wang J, et al. Immune clonal selection algorithm for capacitated arc routing problem. Soft Computing 2016;20(6):2177-204.
[36] Gong M, Jiao L, Zhang L. Baldwinian learning in clonal selection algorithm for optimization. Information Sciences 2010;180(8):1218-36.
[37] Jiao L, Wang L. A novel genetic algorithm based on immunity. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 2000;30(5):552-61.
[38] Potter MA, De Jong KA. A cooperative coevolutionary approach to function optimization. In: International conference on parallel problem solving from nature. Berlin, Heidelberg: Springer; 1994. p. 249-57.
[39] Runarsson TP, Yao X. Stochastic ranking for constrained evolutionary optimization. IEEE Transactions on Evolutionary Computation 2000;4(3):284-94.
[40] DeArmon JS. A comparison of heuristics for the capacitated Chinese postman problem. University of Maryland; 1981.
[41] Brandão J, Eglese R. A deterministic tabu search algorithm for the capacitated arc routing problem. Computers and Operations Research 2008;35(4):1112-26.
[42] Jiao LC, Wang H, Shang RH, et al. A co-evolutionary multiobjective optimization algorithm based on direction vectors. Information Sciences 2013;228:90-112.
[43] Tan KC, Khor EF, Lee TH. Multiobjective evolutionary algorithms and applications. Springer Science & Business Media; 2006.
[44] Dantzig GB, Ramser JH. The truck dispatching problem. Management Science 1959;6(1):80-91.
[45] Czyżak P, Jaszkiewicz A. Pareto simulated annealing - a metaheuristic technique for multiple-objective combinatorial optimization. Journal of Multi-Criteria Decision Analysis 1998;7(1):34-47.
[46] Bandyopadhyay S, Pal SK, Aruna B. Multiobjective GAs, quantitative indices, and pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2004;34(5):2088-99.
[47] Zitzler E, Laumanns M, Thiele L. SPEA2: improving the strength Pareto evolutionary algorithm. TIK-report; 2001. p. 103.
[48] Gibbons JD, Chakraborti S. Nonparametric statistical inference. In: International encyclopedia of statistical science. Springer Berlin Heidelberg; 2011. p. 977-9.
[49] Hollander M, Wolfe DA, Chicken E. Nonparametric statistical methods. John Wiley & Sons; 2013.
[50] Tan KC, Yang YJ, Goh CK. A distributed cooperative coevolutionary algorithm for multiobjective optimization. IEEE Transactions on Evolutionary Computation 2006;10(5):527-49.

CHAPTER 9

Multiobjective optimization algorithm-based image segmentation

Chapter Outline
9.1 Introduction
9.2 Multiobjective evolutionary fuzzy clustering with MOEA/D
9.2.1 Fuzzy-C means clustering algorithms with local information
9.2.2 Framework of MOEFC
9.2.3 Opposition-based learning operator
9.2.4 Mixed population initialization
9.2.5 The time complexity analysis
9.3 Multiobjective immune algorithm for SAR image segmentation
9.3.1 Definitions of AIS-based, multiobjective optimization
9.3.2 The stage of features extraction and preprocessing
9.3.2.1 Watershed raw segmentation
9.3.2.2 Feature extraction using Gabor filters and GLCP
9.3.3 The immune multiobjective framework for SAR imagery segmentation
9.4 Experiments
9.4.1 The MOEFC experiments
9.4.1.1 Experimental setting of MOEFC
9.4.1.2 Segmentation results on synthetic images
9.4.1.3 Segmentation results on natural images
9.4.1.4 Segmentation results on medical images
9.4.1.5 Segmentation results on SAR images
9.4.2 The IMIS experiments
9.4.2.1 IMIS experimental settings
9.4.2.2 Analysis of experimental results
9.5 Summary
References

9.1 Introduction
A synthetic aperture radar (SAR) is a kind of active microwave instrument producing high-resolution imagery of the Earth's surface in all weathers. It has been widely used in environmental monitoring, mapping of Earth resources, and military systems. An important issue in SAR image applications is the correct segmentation and identification of the objects of interest in the images, which is essential for understanding them clearly.

The major purpose of image segmentation is to partition an image into regions of different characteristics such that the pixels in the same group are more similar to each other than to pixels in different groups. The difficulties in SAR image segmentation are the highly overlapped pixels and the large amount of unpredictable and inestimable speckle noise in this kind of image. The existence of noise seriously deteriorates the quality of SAR images and can conceal important details, leading to the loss of objects of interest. To date, many SAR image segmentation methods have been proposed and studied, including clustering-based methods [1-3], graph-partitioning methods [4], morphologic methods [5], and model-based methods [6,7]. In this chapter, we introduce two methods based on nature-inspired algorithms to handle the issues of SAR image segmentation: multiobjective evolutionary fuzzy clustering with MOEA/D [8] (MOEFC), and artificial immune multiobjective SAR image segmentation [9] (IMIS). MOEFC converts the fuzzy clustering problem in image segmentation into a multiobjective problem, which is optimized with the multiobjective evolutionary algorithm based on decomposition. The decomposition strategy projects the multiobjective problem into multiple subproblems, each of which represents a fuzzy clustering problem with local information. Opposition-based learning is used to improve the search ability of MOEFC. In order to further improve the performance of the algorithm, two problem-specific components, an adaptive weighted fuzzy factor and a mixed population initialization, are introduced. The experimental results on synthetic and real images show that MOEFC can achieve a trade-off between preserving image details and removing noise in image segmentation. The other method, IMIS, is an artificial immune multiobjective optimization framework applied to SAR image segmentation. The important innovations of the framework are as follows: (1) an efficient and robust immune multiobjective optimization algorithm is proposed, which has the features of adaptive rank clones and diversity maintenance by a K-nearest-neighbor list; (2) two conflicting fuzzy clustering validity indices are incorporated into this framework and optimized simultaneously; and (3) an effective fused feature set for texture representation and discrimination is constructed, which utilizes both the Gabor filter's ability to precisely extract texture features in the low- and mid-frequency components and the gray-level co-occurrence probability's (GLCP) ability to measure information at high frequencies. Two experiments with synthetic texture images and SAR images are implemented to evaluate the performance of IMIS in comparison with five other clustering algorithms: fuzzy C-means (FCM), a single-objective genetic algorithm [10] (SOGA), the self-organizing map [11] (SOM), wavelet-domain hidden Markov models [12] (HMTseg), and spectral clustering ensemble [13] (SCE).

Experimental results show that IMIS obtains better performance in segmenting SAR images than the other five algorithms and is insensitive to speckle noise.

9.2 Multiobjective evolutionary fuzzy clustering with MOEA/D

9.2.1 Fuzzy-C means clustering algorithms with local information
In recent years, many improved FCM algorithms [14-20] have been proposed for image segmentation that incorporate local image information into the original FCM energy function. Some parameters, generally chosen by experience or by trial-and-error experiments, are used to control the influence of the local information, and they have a crucial impact on the performance of these improved FCM algorithms. It is not easy to find the optimal parameters that lead to the best segmentation of the observed image. To deal with this drawback, FLICM [18] defined a fuzzy factor to replace the parameters of the above algorithms and applied it to the original observed images. Let {x_i}_{i=1}^N be the observed image, where x_i is the i-th pixel and its value equals the gray level of the i-th pixel, and N is the total number of pixels. If the number of clusters is c, the energy function of FLICM for partitioning the image {x_i}_{i=1}^N into c clusters is defined by

J_m = \sum_{i=1}^{N}\sum_{p=1}^{c}\left[u_{ip}^{m}\,\|x_i - z_p\|^2 + G_{ip}\right] = \sum_{i=1}^{N}\sum_{p=1}^{c} u_{ip}^{m}\,\|x_i - z_p\|^2 + \sum_{i=1}^{N}\sum_{p=1}^{c} G_{ip}    (9.1)

where G_{ip} = \sum_{j \in N_i,\, j \neq i} \frac{1}{d_{ij}+1}\,(1 - u_{jp})^{m}\,\|x_j - z_p\|^2, z_p is the p-th cluster center, u_ip is the fuzzy membership of pixel x_i in the p-th cluster, and m is the weighting parameter on each fuzzy membership. N_i, a 3 × 3 square window centered at pixel x_i, represents the neighborhood of pixel x_i, and pixel x_j is one of the neighbor pixels in N_i. d_ij stands for the spatial Euclidean distance between pixel x_i and pixel x_j, as shown in Fig. 9.1. FLICM works by updating the fuzzy memberships and the cluster centers as follows:

u_{ip} = \frac{1}{\sum_{q=1}^{c}\left(\frac{\|x_i - z_p\|^2 + G_{ip}}{\|x_i - z_q\|^2 + G_{iq}}\right)^{1/(m-1)}}    (9.2)

z_p = \frac{\sum_{i=1}^{N} u_{ip}^{m}\, x_i}{\sum_{i=1}^{N} u_{ip}^{m}}    (9.3)
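A compact numerical sketch of the FLICM updates of Eqs. (9.2) and (9.3) is given below for a 1-D signal of gray levels; the neighborhood handling, the toy data, and all parameter values are illustrative assumptions rather than the settings used later in this chapter.

import numpy as np

def flicm_updates(x, u, z, m=2.0):
    """One FLICM iteration (Eqs. 9.2 and 9.3) on a 1-D gray-level signal.

    x: (N,) pixel gray levels; u: (N, c) memberships; z: (c,) cluster centers.
    The neighborhood N_i is reduced to the two adjacent pixels (spatial distance 1),
    so the weight 1/(d_ij + 1) is 0.5; a 2-D image would use a 3x3 window instead.
    """
    N, c = u.shape
    d = (x[:, None] - z[None, :]) ** 2                 # ||x_i - z_p||^2
    # Fuzzy factor G_ip built from the neighbors' memberships and distances.
    G = np.zeros_like(d)
    for i in range(N):
        for j in (i - 1, i + 1):
            if 0 <= j < N:
                G[i] += 0.5 * (1 - u[j]) ** m * d[j]
    # Eq. (9.2): membership update.
    a = d + G
    u_new = 1.0 / np.sum((a[:, :, None] / a[:, None, :]) ** (1 / (m - 1)), axis=2)
    # Eq. (9.3): center update.
    z_new = (u_new ** m).T @ x / np.sum(u_new ** m, axis=0)
    return u_new, z_new

# Tiny illustrative run with random initial memberships and two clusters.
rng = np.random.default_rng(0)
x = np.array([10.0, 12.0, 11.0, 200.0, 205.0, 198.0])
u = rng.dirichlet(np.ones(2), size=x.size)
z = np.array([20.0, 180.0])
for _ in range(10):
    u, z = flicm_updates(x, u, z)
print(z)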

Figure 9.1 The spatial Euclidean distance between pixel x_i and its neighbor pixel x_j in a 3 × 3 square window (distance 1 to the four axial neighbors and 1.414 to the four diagonal neighbors).

Employing Lagrange multipliers on the energy function of FLICM in Eq. (9.1) and setting m to 2 [18], the cluster centers and the fuzzy memberships obtained by alternating optimization are as follows [21,22]:

u_{ip} = \frac{con}{1 + con} + \frac{1 + con - c \cdot con}{1 + con} \cdot \frac{\|x_i - z_p\|^{-2}}{\sum_{q=1}^{c}\|x_i - z_q\|^{-2}}    (9.4)

z_p = \frac{\sum_{i=1}^{N}\left[u_{ip}^{m} + (1 - u_{ip})^{m} \cdot con\right] x_i}{\sum_{i=1}^{N}\left[u_{ip}^{m} + (1 - u_{ip})^{m} \cdot con\right]}    (9.5)

where con = \sum_{j \in N_i,\, j \neq i} \frac{1}{d_{ij}+1}. It is obvious that the cluster centers and the fuzzy memberships computed by Eqs. (9.4) and (9.5) differ from those of Eqs. (9.2) and (9.3) proposed by FLICM. Let J_{Sm} = \sum_{i=1}^{N}\sum_{p=1}^{c} u_{ip}^{m}\|x_i - z_p\|^2 and J_{Cm} = \sum_{i=1}^{N}\sum_{p=1}^{c} G_{ip}. Then the energy function J_m of FLICM in Eq. (9.1) can be written as the sum of J_{Sm} and J_{Cm}. J_{Sm} is the same as the original FCM energy function and preserves image details, while J_{Cm} is the neighborhood term designed with spatial information to restrain noise. Obviously, J_{Sm} is an increasing function of u_ip, while the opposite trend holds for J_{Cm}; the trend of J_{Cm} is thus likely to conflict with that of J_{Sm}, as also pointed out in Refs. [21,22]. Furthermore, if the difference between J_{Sm} and J_{Cm} is large, the influence of the smaller term may be suppressed by the larger one in the clustering process. The same may happen in other improved FCM approaches that define the energy as the sum of the original FCM objective function and a local information term. Thus, it is still a difficult task to incorporate local information into the fuzzy clustering process effectively and adaptively. To address this problem, this chapter converts the fuzzy clustering problem for image segmentation into an MOP. The original FCM energy function, which preserves image details, and the function based on local information, which restrains noise, are defined as the two objective functions and minimized simultaneously. To avoid removing significant image details, the function based on local information is defined on the original image pixels and designed with local spatial and gray information. To make use of the local information more adaptively and effectively, an adaptive weighted fuzzy factor is introduced in the noise-restraining function.


9.2.2 Framework of MOEFC
In this chapter, MOEFC is presented to achieve a trade-off between preserving image details and restraining noise in image segmentation. Let an image X = {x_i}_{i=1}^N be observed, where x_i is the i-th pixel and its value equals the gray level of the i-th pixel, and N is the total number of pixels. MAX and MIN are the maximum and minimum of the pixels' gray levels in the image X, respectively. Let c be the number of clusters. The fuzzy clustering problem is converted into an MOP defined as follows:

\min\; F(z) = [f_1(z), f_2(z)]^T, \quad z = (z_1, z_2, \ldots, z_c)^T, \; x_i \in X, \; i = 1, 2, \ldots, N    (9.6)

\min\; g^{ws}(z \mid \boldsymbol{\lambda}) = \lambda f_1(z) + (1 - \lambda) f_2(z)    (9.7)

f_1(z) = \sum_{i=1}^{N}\sum_{p=1}^{c} u_{ip}^{m}\, D(x_i, z_p)    (9.8)

f_2(z) = \sum_{i=1}^{N}\sum_{p=1}^{c} u_{ip}^{m} \sum_{j \in N_i,\, j \neq i} \omega_{ij}^{p}\, D(x_j, z_p)    (9.9)

where f_1 is the original FCM objective function, which preserves image details, and f_2 is designed with local information to restrain noise. The individual z = (z_1, z_2, ..., z_c)^T represents the candidate cluster centers, and g^{ws} is the subproblem with weight vector [23] λ = (λ, 1 − λ)^T. u_ip is the fuzzy membership of pixel x_i in the p-th cluster, and m is the weighting parameter on each fuzzy membership, set to 2 here. N_i is a 3 × 3 square window with pixel x_i at its center, and pixel x_j is one of the neighbor pixels of x_i in N_i. ω_ij^p represents the adaptive weighted fuzzy factor, and D(x_i, z_p) is the similarity measure between pixel x_i and cluster center z_p. Considering the weakness of the Euclidean metric for image segmentation [15,20], the Gaussian radial basis function (GRBF), which is widely used among kernel functions [24,25], is utilized as the similarity measure in MOEFC. The definition of D(x_i, z_p) can be written as D(x_i, z_p) = 1 − exp(−‖x_i − z_p‖²/σ), where σ is the bandwidth. Similar to the work in Ref. [20], the bandwidth is defined as the distance standard deviation of the image pixels, which reflects the distribution of the gray levels: σ = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(dis_i - \overline{dis}\right)^2} with \overline{dis} = \frac{1}{N}\sum_{i=1}^{N} dis_i, where dis_i = \left|x_i - \frac{1}{N}\sum_{j=1}^{N} x_j\right| and x_i equals the gray level of the i-th pixel in the observed image. N is the total number of pixels.
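A small numpy sketch of the kernel-induced distance and the bandwidth described above follows; the toy image and the candidate cluster centers are placeholders, not settings from the experiments.

import numpy as np

def grbf_distance(img, centers):
    """Kernel distance D(x_i, z_p) = 1 - exp(-||x_i - z_p||^2 / sigma), with sigma
    taken as the distance standard deviation of the image pixels described in the text."""
    x = img.astype(float).ravel()
    dis = np.abs(x - x.mean())              # dis_i = |x_i - mean gray level|
    sigma = dis.std()                       # distance standard deviation (bandwidth)
    diff2 = (x[:, None] - np.asarray(centers, float)[None, :]) ** 2
    return 1.0 - np.exp(-diff2 / sigma)

img = np.array([[10, 12, 200], [11, 199, 201]], dtype=float)
D = grbf_distance(img, centers=[20.0, 180.0])
print(D.shape)   # (6, 2): one row per pixel, one column per candidate center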

The adaptive weighted fuzzy factor ω_ij^p in Eq. (9.9) is an adaptive penalty from the neighbor pixel x_j to the center pixel x_i. The definition of ω_ij^p can be written as ω_ij^p = (1 − u_jp^{(t−1)})^m · w_sc · w_gc, where u_jp^{(t−1)} is the fuzzy membership of pixel x_j in the p-th cluster in the previous generation and t is the current generation number. w_sc and w_gc represent the spatial constraint and the gray constraint, respectively. Fig. 9.2 shows the computation procedure of ω_ij^p. In Fig. 9.2, d_ij represents the spatial Euclidean distance between pixel x_i and pixel x_j, and its computation is shown in Fig. 9.1. The definition of the gray constraint w_gc is similar to Ref. [20]. In Fig. 9.2, C_j is the variation of pixel x_j computed from the observed image as C_j = var(x)/[mean(x)]², where var(x) and mean(x) are the variance and mean of the pixels in the set x. Here, x is the pixel set consisting of pixel x_j and its neighbor pixels in the 3 × 3 square window. \overline{C} is the mean of the variations in N_i and is computed by \overline{C} = \sum_{k \in N_i} C_k / n_i, where n_i is the cardinality of N_i. ξ_ij represents the local coefficient of pixel x_j with respect to pixel x_i, which is calculated by projecting the variation C_j into a kernel space. The gray constraint w_gc is then computed by normalizing the local coefficient ξ_ij and projecting the normalized coefficient onto the interval [1,3]. If u_jp^{(t−1)} has a high value, the value of ω_ij^p will be low because of the small value of 1 − u_jp^{(t−1)}. In contrast, if the value of u_jp^{(t−1)} is small, ω_ij^p will have a high value in this generation. In this way, the influence from pixel x_j on pixel x_i changes with the fuzzy memberships of pixel x_j. Hence, the adaptive weighted fuzzy factor incorporates the fuzzy memberships of the pixels in the previous generation into the function based on local information to restrain noise.

Figure 9.2 The computation procedure of ω_ij^p.

This is helpful for utilizing local information more effectively and adaptively in the fuzzy clustering process. The fuzzy memberships of the subproblem g^{ws} can be obtained by employing Lagrange multipliers on g^{ws}; the derivation is described in detail in the Supplementary Materials. The fuzzy memberships of g^{ws} can be calculated by:

u_{ip} = \frac{1}{\sum_{q=1}^{c} \frac{\lambda D(x_i, z_p) + (1-\lambda)\sum_{j \in N_i,\, j \neq i} \omega_{ij}^{p} D(x_j, z_p)}{\lambda D(x_i, z_q) + (1-\lambda)\sum_{j \in N_i,\, j \neq i} \omega_{ij}^{q} D(x_j, z_q)}}    (9.10)

where λ = (λ, 1 − λ)^T is the weight vector of subproblem g^{ws}. To deal with different subproblems, λ also changes: a different constant weight vector is produced for each subproblem in the initialization step [26]. The framework of MOEFC is given in Algorithm 9.1. In the initialization step, the generation of the well-distributed weight vectors W = {λ^1, ..., λ^{NM}} and of the neighborhood B(i) = {i_1, ..., i_T} of the i-th individual follows Ref. [26]. In the reproduction step, a differential evolution (DE) [27] operator and a Gaussian mutation operator [28] are utilized in the crossover and mutation steps, respectively, to produce new individuals. In the DE step, each dimension element y_{i,k} of a new individual y_i is generated by the DE strategy as follows:

y_{i,k} = \begin{cases} z_{i,k} + F \cdot (z_{l_1,k} - z_{l_2,k}) & \text{if } rand \le CR \\ z_{i,k} & \text{otherwise} \end{cases}    (9.11)

where F and CR are the DE mutation constant and the crossover probability, respectively, rand is a uniform random number over the interval [0, 1], and l_1 and l_2 are two different indexes selected from B(i). In the Gaussian mutation step, a candidate y_i' is generated from y_i as follows:

y_{i,k}' = \begin{cases} N\left(y_{i,k},\, (MAX - MIN)/20\right) & \text{if } rand \le P_m \\ y_{i,k} & \text{otherwise} \end{cases}    (9.12)

where N(y_{i,k}, (MAX − MIN)/20) is a Gaussian distribution with mean y_{i,k} and standard deviation (MAX − MIN)/20, P_m is the mutation probability, and rand is a uniform random number over [0, 1]. MIN and MAX represent the minimum and maximum of the pixels' gray levels in the observed image.
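A minimal sketch of the two reproduction operators of Eqs. (9.11) and (9.12) for a single individual follows; the parameter values and the parent vectors below are illustrative only.

import numpy as np

rng = np.random.default_rng(1)

def de_crossover(z_i, z_l1, z_l2, F=0.5, CR=0.9):
    """Eq. (9.11): per-dimension DE recombination of the individual with two neighbors."""
    mask = rng.random(z_i.size) <= CR
    return np.where(mask, z_i + F * (z_l1 - z_l2), z_i)

def gaussian_mutation(y, MIN=0.0, MAX=255.0, Pm=0.1):
    """Eq. (9.12): per-dimension Gaussian perturbation with std (MAX - MIN)/20."""
    mask = rng.random(y.size) <= Pm
    mutated = rng.normal(y, (MAX - MIN) / 20.0)
    # Simple bound handling; Algorithm 9.1 instead resets out-of-range values randomly.
    return np.clip(np.where(mask, mutated, y), MIN, MAX)

# Three parent individuals (candidate cluster centers) with c = 3 clusters.
z_i = np.array([30.0, 120.0, 210.0])
z_l1 = np.array([25.0, 130.0, 200.0])
z_l2 = np.array([40.0, 110.0, 220.0])
y = de_crossover(z_i, z_l1, z_l2)
print(gaussian_mutation(y))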

Algorithm 9.1 Framework of MOEFC
Parameters: maximum generation number gmax; population size NM; neighborhood size T; crossover probability CR; DE mutation constant F; mutation probability Pm; OBL jumping rate Jr.
Input: the observed image X = {x_i}_{i=1}^N; the number of clusters c; the maximum and minimum gray levels of the image X: MAX and MIN.
Output: the obtained Pareto solutions and the corresponding fuzzy membership matrices. Each solution is a candidate segmentation of the observed image.
Step 1) Initialization
Step 1.1) Set t = 0. Generate an initial population P = {z_1, ..., z_NM}.
Step 1.2) Generate the well-distributed weight vectors W = {λ^1, ..., λ^NM}. For each i = 1, 2, ..., NM, initialize the neighborhood B(i) = {i_1, ..., i_T} of each individual, where λ^{i_1}, ..., λ^{i_T} are the T weight vectors closest to λ^i in the Euclidean distance.
Step 1.3) Compute the corresponding fuzzy membership matrix and the objective functions by Eqs. (9.8) and (9.10) for each individual.
Step 1.4) Apply the OBL operator to the individuals in P.
Step 2) Reproduction. For i = 1, 2, ..., NM, do
Step 2.1) Crossover: randomly select two different indexes l_1 and l_2 from B(i), and generate a new individual y_i from z_{l_1} and z_{l_2} by the DE strategy.
Step 2.2) Mutation: generate a candidate y_i' by applying the Gaussian mutation operator to the new individual y_i obtained in Step 2.1.
Step 2.3) Repair: if an element of the new candidate y_i' is larger than MAX or lower than MIN, its value is reset randomly inside the boundary.
Step 2.4) Fitness computation: compute the corresponding fuzzy membership matrix and objective functions by Eqs. (9.8) and (9.10) for the candidate y_i'.
Step 2.5) Update
Step 2.5.1) Selection: pick a random index l ∈ B(i); if g^{ws}(y_i' | λ^l) ≤ g^{ws}(z_l | λ^l), then set z_l = y_i' and F(z_l) = F(y_i').
Step 2.5.2) OBL jumping: apply the OBL operator to the individual z_l, and go to Step 3.
Step 3) Stop criterion: if t > gmax, stop and output. Otherwise, set t = t + 1 and go to Step 2.

9.2.3 Opposition-based learning operator
In the framework of MOEFC, the OBL operator is utilized in the initialization and update steps. The opposite of the individual z is defined by:

\breve{z}_i = MIN + MAX - z_i, \quad z_i \in [MIN, MAX], \; i = 1, \ldots, c; \qquad z = (z_1, z_2, \ldots, z_c)^T, \quad \breve{z} = (\breve{z}_1, \breve{z}_2, \ldots, \breve{z}_c)^T    (9.13)

where z̆ represents the opposite of z, and c is the number of clusters. MIN and MAX are the minimum and maximum of the pixels' gray levels in the observed image, respectively. In evolutionary optimization, the search procedure stops when the termination criteria are satisfied, and the computation time is related to the distances between the optimal solutions and the solutions found so far. With an OBL operator, whichever of the original individual and its opposite is closer to the optimal solution is selected as the current solution, which helps to accelerate the convergence of the search procedure. The detailed procedure of the OBL operator is summarized in Algorithm 9.2, where Jr is the jumping rate that controls the OBL operator.

Algorithm 9.2 Procedure of the OBL operator
if rand ≤ Jr then
  for i = 1; i < c + 1; i++ do
    z̆_i = MIN + MAX − z_i
  end for
  g^{ws}(z | λ) = λ f_1(z) + (1 − λ) f_2(z)
  g^{ws}(z̆ | λ) = λ f_1(z̆) + (1 − λ) f_2(z̆)
  if g^{ws}(z̆ | λ) ≤ g^{ws}(z | λ) then
    z = z̆; F(z) = F(z̆);
  end if
end if

9.2.4 Mixed population initialization
In evolutionary optimization, generating the initial individuals randomly is the most common method of population initialization. However, it is helpful to start with an improved initial population when searching for the optimal solutions: less time is spent in the search procedure if the initial individuals are generated close to the optimal solutions. In MOEFC, besides random generation, which keeps the diversity and randomness of the population, some initial individuals are also generated by simple and effective image segmentation methods. Considering the trade-off between performance and complexity, K-means [29], FCM [30,31], and Normalized cut (Ncut) [32] are used in this chapter. K-means and FCM are two classic clustering algorithms, and Ncut is a state-of-the-art algorithm that can achieve an acceptable segmentation result in little time. Starting from acceptable segmentation results, it is effective to search for the optimal cluster centers. Moreover, an OBL operator is applied to the initial population to ensure that the algorithm starts with more potential solutions. A minimal sketch of such a mixed initialization is given below.
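The sketch below seeds part of the population with K-means centers and fills the rest randomly, assuming scikit-learn's KMeans as the informed seeding method (FCM and Ncut seeds would be added analogously); the pixel data and parameter values are placeholders.

import numpy as np
from sklearn.cluster import KMeans

def init_population(pixels, c, pop_size, MIN, MAX, seed=0):
    """Return pop_size candidate center vectors: one informed seed, the rest random."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=c, n_init=10, random_state=seed).fit(pixels.reshape(-1, 1))
    informed = np.sort(km.cluster_centers_.ravel())     # K-means centers as one individual
    population = [informed]
    while len(population) < pop_size:
        population.append(rng.uniform(MIN, MAX, size=c))  # random individuals
    return np.array(population)

# Hypothetical two-mode gray-level data standing in for an image.
pixels = np.concatenate([np.full(50, 30.0), np.full(50, 200.0)]) + np.random.randn(100)
print(init_population(pixels, c=2, pop_size=5, MIN=0, MAX=255))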


9.2.5 The time complexity analysis
Suppose NM is the population size and l is the length of an individual; the time cost of the initialization step is O(NM × l), which is the same as that consumed by the reproduction step in each generation. Here, l equals the number of clusters c. If there are N pixels in the observed image, the total time spent on the fitness computation step is O(NM × l × N), and the time consumed by the selection step is O(NM). Let G be the maximum generation number; the computational complexity of MOEFC is then O(NM × l × N × G).

9.3 Multiobjective immune algorithm for SAR image segmentation
In this section, we introduce an effective immune multiobjective framework for SAR image segmentation (called IMIS for short). The simultaneous optimization of multiple objectives differs from single-objective optimization in that there is no unique optimal solution to a multiobjective problem. Multiobjective optimization usually involves many conflicting, incomparable, and noncommensurable objectives; therefore, a set of optimal trade-off solutions, known as the Pareto-optimal solutions, can be obtained. The optimal or user-preferred partitions of the image data can be selected from these trade-off Pareto-optimal solutions. The novelty of IMIS lies in the following issues: (1) an effective immune multiobjective SAR image segmentation method is presented, which uses the strategies of adaptive rank clones and diversity maintenance by a K-nearest-neighbor list; (2) two conflicting fuzzy clustering validity indices are incorporated into the method and optimized simultaneously, and the search process regulated by the two indices can explore wider feasible regions, which is beneficial for identifying promising solutions; and (3) in order to obtain sufficient information for SAR image feature representation and discrimination, the authors construct a fused feature set from both the Gabor filter [33] and the GLCP [34], which provides complementary advantages in bandwidth responses at different frequencies. Finally, the authors compare the method with five classification algorithms in segmenting synthetic texture and SAR images to validate the performance of IMIS. The experimental results show that IMIS obtains encouraging and impressive performance in partitioning the five images.

9.3.1 Definitions of AIS-based, multiobjective optimization
In IMIS, the following type of multiobjective optimization is considered:

\min\; F(x) = \left(f_1(x), f_2(x), \ldots, f_K(x)\right)^T    (9.14)

subject to x ∈ U, where x is a decision variable vector and U is the search space. F: x → R^K is the map from the decision variable space to the space of K objectives. The objectives in multiobjective optimization usually conflict with each other, and no single solution can

optimize all the objectives simultaneously. In this framework, two conflicting and complementary indices, called XB [35] and Jm [36], are used as the optimization objectives; they are defined as follows:

XB = \frac{\sum_{p=1}^{k}\sum_{j=1}^{N} u_{pj}^{2}\,\|x_j - z_p\|^2}{N \cdot \min_{i,j}\|z_i - z_j\|^2}, \qquad J_m = \sum_{j=1}^{N}\sum_{p=1}^{k} u_{pj}^{2}\,\|x_j - z_p\|^2, \qquad \text{where } u_{pj} = \left[\sum_{i=1}^{k}\left(\frac{\|x_j - z_p\|}{\|x_j - z_i\|}\right)^{2}\right]^{-1}    (9.15)

Note that x_j, j = 1, 2, ..., N is a data sample in the clustering data set and z_p, p = 1, 2, ..., k is a cluster center. N is the total number of samples and k is the number of categories. U_{k×N} is the fuzzy membership matrix. XB is formulated as the ratio of the summation of the variation to the minimum separation; lower values of XB indicate a better partition of the data set. J_m is defined by minimizing the global fuzzy squared distance. Bandyopadhyay et al. [10] noted that the two indices reveal contradictory characteristics and can provide a rich set of alternative partitions for remote sensing data. Significantly, a desirable search process should guide the population from multiple directions, since we have no prior knowledge about the locations of the optimal solutions. Two conflicting indices can guide the population into a wider feasible area than the traditional manner with only one index; thus, multiobjective optimization algorithms can find the optimal partition with a larger probability than single-objective optimization algorithms. Domain knowledge is often utilized to select the final optimal solution in multiobjective optimization, because there is no explicit and general methodology for classifying images of all styles. However, such a priori knowledge might not be available for complicated and tight classification tasks. Hence, a trade-off index (called PBM) is employed to select the final optimal solution. This index has demonstrated performance superior to three other frequently used indices [36]. The index is defined in the following way:

PBM(k) = \left(\frac{1}{k} \cdot \frac{E_1}{E_k} \cdot D_k\right)^{2}, \qquad \text{where } E_k = \sum_{p=1}^{k}\sum_{j=1}^{N} u_{pj}\,\|x_j - z_p\| \;\text{ and }\; D_k = \max_{i,j=1}^{k}\|z_i - z_j\|    (9.16)

The definitions of the symbols x_j, z_p, and u_pj in Eq. (9.16) are the same as in Eq. (9.15). The PBM index consists of three items: k, E_k, and D_k. k is the number of clusters, E_k is the total variation of a partition, and D_k measures the maximum separation among clusters. These three factors are designed to compete with each other to achieve a proper partitioning.
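The three validity indices of Eqs. (9.15) and (9.16) can be computed directly from the data, the membership matrix, and the cluster centers; the numpy sketch below uses illustrative data and an FCM-style membership matrix as placeholders.

import numpy as np

def validity_indices(X, U, Z):
    """XB and Jm (Eq. 9.15) and PBM (Eq. 9.16) for data X (N,d), memberships U (k,N),
    and cluster centers Z (k,d)."""
    k, N = U.shape
    d2 = ((X[None, :, :] - Z[:, None, :]) ** 2).sum(-1)        # ||x_j - z_p||^2, shape (k, N)
    jm = np.sum(U ** 2 * d2)
    sep = min(((Z[i] - Z[j]) ** 2).sum() for i in range(k) for j in range(k) if i != j)
    xb = jm / (N * sep)
    e_k = np.sum(U * np.sqrt(d2))                              # E_k with first-power distances
    e_1 = np.sum(np.sqrt(((X - X.mean(0)) ** 2).sum(-1)))      # E_1: single cluster at the data mean
    d_k = max(np.sqrt(((Z[i] - Z[j]) ** 2).sum()) for i in range(k) for j in range(k))
    pbm = ((1.0 / k) * (e_1 / e_k) * d_k) ** 2
    return xb, jm, pbm

X = np.array([[0.0], [1.0], [10.0], [11.0]])
Z = np.array([[0.5], [10.5]])
d2 = ((X[None] - Z[:, None]) ** 2).sum(-1)
U = (1.0 / d2) / (1.0 / d2).sum(0)                              # FCM-style memberships with m = 2
print(validity_indices(X, U, Z))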

The maximization of the PBM index could lead to a partition with the benefit of a small number of compact clusters and a large separation between at least two clusters. Some notations of AIS-based multiobjective optimization are presented in this paragraph. An antigen in AIS is usually defined as the searching problem [37]; here, the MOP defined by Eq. (9.14) can be seen as the antigen of the multiobjective optimization. The candidate solutions of Eq. (9.14) are named antibodies in the immune system, and the binding intensity between antigen and antibody is called the antigen-antibody affinity, which corresponds to the values of the objective functions in Eq. (9.15). In the following sections, the candidate solutions are called antibodies and their objective values are called affinities, because an immune-inspired algorithm is discussed here. The concept of dominance in multiobjective optimization is described as follows. An antibody u = (u_1, u_2, ..., u_K) is said to dominate another antibody v = (v_1, v_2, ..., v_K) (denoted by u ≺ v) if and only if u is partially less than v, which can be defined by the following expression:

\forall i \in \{1, \ldots, K\}:\; u_i \le v_i \;\;\wedge\;\; \exists j \in \{1, \ldots, K\}:\; u_j < v_j    (9.17)

where ∧ is a logic symbol denoting the conjunction of the two terms. Furthermore, an antibody x* ∈ U is said to be Pareto-optimal, or a nondominated solution of the problem defined by Eq. (9.14), if there does not exist another antibody x ∈ U such that x ≺ x*. All the nondominated antibodies in the decision space make up the antibodies of the best rank. Once the nondominated antibodies are removed from the current population, the nondominated antibodies among the remaining population are called the next-best-rank antibodies. By repeating this operation, all the antibodies in the population can be assigned to different ranks. The rank-based adaptive selection and clones are the main characteristics of the proposed AIS-based multiobjective segmentation algorithm, which is discussed in a subsequent section.
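The dominance relation of Eq. (9.17) reduces to a handful of comparisons; a minimal sketch for minimized objectives follows.

def dominates(u, v):
    """Return True if antibody u dominates v (Eq. 9.17): u is no worse in every
    objective and strictly better in at least one (all objectives minimized)."""
    return all(ui <= vi for ui, vi in zip(u, v)) and any(ui < vi for ui, vi in zip(u, v))

print(dominates((0.4, 2.1), (0.5, 2.1)))   # True: better XB, equal Jm
print(dominates((0.4, 2.5), (0.5, 2.1)))   # False: the two antibodies are incomparable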

9.3.2 The stage of features extraction and preprocessing
In IMIS, the process of SAR image segmentation is divided into two stages. As shown in Fig. 9.1, the first stage is characterized by initially segmenting the SAR image and extracting features. Evolutionary computation with population iteration at the level of pixels is usually very time consuming, because the number of pixels in a SAR image is very large even for small images of moderate resolution. Therefore, a preprocessing stage is required to oversegment the original image into nonoverlapping small patches, or "superpixels." Additionally, proper feature extraction techniques are necessary to differentiate the land covers in SAR images accurately. Here, two complementary feature extraction methods are employed and combined for image representation and discrimination. Afterward, the fused features are mapped by the oversegmented results of the original image so as to obtain the oversegmented extracted image features. In other words, if the pixels

are in the same patch of the oversegmented image, their features are correspondingly classified into the same group. In the second stage, an AIS-based multiobjective algorithm is proposed, and a fine classification of the oversegmented results is carried out by the algorithm. The preprocessing techniques in the first stage are discussed as follows.

9.3.2.1 Watershed raw segmentation
The well-known watershed transformation [38] is used to partition an image into nonoverlapping and homogeneous regions. The watershed transformation is defined by the following equation:

WT(I) = \frac{1}{N}\sum_{i=1}^{N}\left[\left(I \oplus B_i - I \ominus B_i\right) \ominus B_{i-1}\right]    (9.18)

where ⊕ and ⊖ denote the dilation and erosion operations of mathematical morphology, respectively, B_i is the structural window with size (2i − 1) × (2i + 1), and I is the original image. It is far from an easy task to extract suitable watersheds on SAR images. If the watershed regions are too large, a basin may contain more than one object of interest; on the contrary, a single object may not lie in the same local patch, and the available algorithms for computing the watershed transformation may be excessively slow. In IMIS, the dilation and erosion operations are both on a 3 × 3 window, and about 1000 small patches are obtained for a 256 × 256 image. A minimal sketch of such a multiscale morphological computation is given below.
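The sketch below illustrates the multiscale combination of dilations and erosions behind Eq. (9.18) using SciPy's gray-scale morphology; the square structuring windows, the number of scales, and the toy image are assumptions for illustration, and a complete implementation would follow this gradient with an actual watershed flooding step.

import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def multiscale_gradient(img, N=3):
    """Average of morphological gradients at N scales (cf. Eq. 9.18)."""
    acc = np.zeros_like(img, dtype=float)
    for i in range(1, N + 1):
        size = 2 * i + 1                                   # structural window B_i (assumed square)
        grad = grey_dilation(img, size=size) - grey_erosion(img, size=size)
        if i > 1:                                          # smooth with the previous window B_{i-1}
            grad = grey_erosion(grad, size=2 * (i - 1) + 1)
        acc += grad
    return acc / N

img = np.zeros((32, 32))
img[:, 16:] = 100.0                                        # toy two-region image
print(multiscale_gradient(img).max())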

The discrimination abilities of these methods were compared and optimal combinations of the texture features were discussed. Importantly, Clausi et al. presented a design-based texture feature fusion using Gabor filters and GLCP [45]. The classification performance of the fused features and of the individual ones was investigated, and impressive experimental results for unsupervised texture classification were obtained when both methods were utilized. Appropriate combinations of features from different methodologies can provide superior performance to individual ones. Therefore, Gabor filters and GLCP are both used here for feature extraction, because together they cover a wider range of frequency components of SAR images.

Gabor filters: Gabor filters directly measure local frequency components by acting as multichannel, band-pass filters centered on the frequencies and orientations of interest. Clausi et al. have shown that Gabor filters are suitable for texture representation and discrimination because they are mathematically tractable, simple to implement, and have optimal joint spatial-frequency resolution [46]. The Gabor function is a Gaussian-modulated complex sinusoid in the spatial domain. The 2-D Gabor function is defined by

g(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + j 2\pi F (x\cos\theta + y\sin\theta) \right]    (9.19)

where F is the modulation frequency and \theta specifies the orientation of the normal to the parallel stripes of the Gabor function. The raw filtering responses at different frequencies and orientations are the features extracted by the Gabor multichannel filters. Commonly, Gaussian filters are employed to smooth the raw Gabor magnitude response because they generate the preferred segmentation results [46]. The Gaussian post filters have the same shape as the corresponding channel filters but with greater spatial extents, which are controlled by \gamma in g(\gamma x, \gamma y). A \gamma of two-thirds is recommended by Bovik et al. [47].

GLCP: Gray-level co-occurrence texture measurements have been widely utilized in image texture analysis since they were proposed by Haralick et al. [34]. The GLCP is a joint probability matrix describing how often two combinations of pixel brightness values occur in a local window when the two pixels are separated by a certain distance d and direction \alpha. If an image is denoted by I(x, y) with size N_x x N_y and the number of gray levels is N_G, the items of the GLCP matrix are defined as follows:

P(i, j) = \frac{p(i, j \mid d, \alpha)}{\sum_{i,j}^{W} p(i, j \mid d, \alpha)}    (9.20)

where p(i, j | d, α) is the frequency of occurrence of gray levels i and j separated by distance d along direction α, and W is the total number of pixel pairs in the image window. Once the GLCP is calculated, several statistical parameters can be extracted from it.

Algorithm 9.3 The stage of feature extraction and preprocessing.
Input parameters:
  Watershed transformation: input image I, dilation and erosion operations;
  Gabor filters: center frequencies F, orientations θ;
  GLCP: local window size w, interpixel distance d, directions α, quantization level q.
Output result:
  Raw segmentation results of the fused features: RSFeatures.
Step 1. Watershed raw segmentation
  WI = Watershed(I);        % segment input image I by the watershed transformation in Eq. (9.18)
  maxWI = max(WI);          % build the watershed mapping
  for i = 0 : maxWI
      index = Find(WI == i);
      IMapping = [IMapping index];    % add datum index to IMapping
  end for
Step 2. Feature extraction and fusion using Gabor filters and GLCP
  for each f ∈ F do
      for each th ∈ θ do
          GF = Gabor-filter(I, f, th);    % Gabor multichannel filtering using Eq. (9.19)
          GF = Gauss-filter(GF, γ);
          GaborFeature = [GaborFeature GF];
      end for
  end for
  for each al ∈ α do
      GF = GLCP(I, w, d, q, al);          % calculate the statistical parameters of the GLCP by Eq. (9.20)
      GF = Gauss-filter(GF, γ);
      GLCPFeature = [GLCPFeature GF];
  end for
  FusedFeatures = [GaborFeature GLCPFeature];    % combine the Gabor features and the GLCP features
  FusedFeatures = (FusedFeatures - min(FusedFeatures)) / (max(FusedFeatures) - min(FusedFeatures));
Step 3. Raw segmentation of the fused features by mapping the results of the watershed segmentation
  for i = 1 to |IMapping|          % |IMapping| denotes the cardinality of IMapping
      index = IMapping(i);         % take out the i-th item of IMapping created in Step 1
      RSFeatures(i) = mean(FusedFeatures(index));    % mean value of FusedFeatures over the patch
  end for

There are a total of 14 statistical parameters for the GLCP in the original paper [34]; commonly, only three (contrast, entropy, and correlation) are recommended for SAR image classification because of their ability to extract independent features and to create the preferred discrimination. The procedure of the preprocessing stage is shown in Algorithm 9.3.
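A compact sketch of this feature-extraction stage is given below, assuming scikit-image is available (in older releases the co-occurrence functions are named greycomatrix/greycoprops instead). The frequencies and orientations are illustrative placeholders rather than the tuned settings reported later, the input image is assumed to be a float image in [0, 1], and the per-patch averaging of Step 3 is shown instead of a dense per-pixel GLCP window to keep the example short.

    import numpy as np
    from scipy import ndimage as ndi
    from skimage.filters import gabor
    from skimage.feature import graycomatrix, graycoprops
    from skimage.util import img_as_ubyte

    def gabor_features(img, frequencies=(0.1, 0.2, 0.3),
                       thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
        """Smoothed Gabor magnitude responses, one channel per (frequency, orientation) pair."""
        channels = []
        for f in frequencies:
            for th in thetas:
                real, imag = gabor(img, frequency=f, theta=th)
                mag = np.hypot(real, imag)
                channels.append(ndi.gaussian_filter(mag, sigma=(2.0 / 3.0) / f))  # Gaussian post filter
        return np.stack(channels, axis=-1)

    def glcp_stats(patch, levels=64, distance=1,
                   angles=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
        """Contrast, entropy and correlation of the co-occurrence matrix of one patch (Eq. 9.20)."""
        q = img_as_ubyte(patch) // (256 // levels)      # quantize to `levels` gray levels
        glcm = graycomatrix(q, [distance], list(angles), levels=levels,
                            symmetric=True, normed=True)
        contrast = graycoprops(glcm, "contrast").ravel()
        correlation = graycoprops(glcm, "correlation").ravel()
        entropy = np.array([-np.sum(p * np.log(p + 1e-12))
                            for p in np.moveaxis(glcm[:, :, 0, :], -1, 0)])
        return np.concatenate([contrast, entropy, correlation])

    def patch_features(img, labels):
        """Step 3 of Algorithm 9.3: average the fused features over each watershed patch."""
        gf = gabor_features(img)
        feats = {}
        for lab in np.unique(labels):
            mask = labels == lab
            ys, xs = np.nonzero(mask)
            window = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # bounding box of the patch
            feats[lab] = np.concatenate([gf[mask].mean(axis=0), glcp_stats(window)])
        return feats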

9.3.3 The immune multiobjective framework for SAR imagery segmentation
In this section, the overall AIS-based multiobjective image segmentation framework is discussed. A flow chart of the framework is shown in Fig. 9.3. The framework consists of two stages: the stage of preprocessing and feature extraction, and the stage of unsupervised classification and comparison. The watershed initial segmentation in the first stage is a local and coarse classifier, which combines local pixels under the condition that the pixels in the same water basin are spatially contiguous. The immune multiobjective optimization algorithm (IMIS) in the second stage is a global fine classifier. The fused features of the pixels are used for merging the local patches produced by the watershed transformation, and there is no restriction on spatial adjacency in the second stage. If a segmentation task has few categories and a simple data distribution, the watershed transformation alone may solve it satisfactorily in the first stage. Therefore, the superior performance of IMIS can only be demonstrated in segmenting image data with a large number of true classes and a complicated distribution. This inference will be validated in the following experiments. In previous work [48], the authors carried out in-depth research on the theory of AIS and AIS-based multiobjective optimization [49]. Based on extensive summarization and comparison within the current evolutionary multiobjective optimization community, the authors believe that AIS-based multiobjective algorithms provide superior selection pressure because of their use of the clonal selection principle and immune response operations, which can accelerate the evolutionary process in the population.

Figure 9.3 The basic image segmentation framework using immune multiobjective optimization and three other classical algorithms.

Additionally, the proposed algorithm puts forward an adaptive hybrid model, which utilizes the online-discovered nondominated solutions to adaptively regulate the search process [50,51]. In IMIS, the authors introduce an efficient and robust AIS-based multiobjective optimization with adaptive rank clones and a diversity maintenance technique based on a K-nearest-neighbor list. The adaptive rank clones select the online-discovered solutions at different ranks to implement the clone operator. The dynamic information of the online antibody population is efficiently exploited and the selected antibodies are proliferated, which introduces robustness and adaptability into the multiobjective search process. Furthermore, the K-nearest-neighbor list is established and maintained to update the solutions in the archive population for diversity maintenance. The K nearest neighbors of each antibody are found and stored in a list in memory. Once the antibody with the minimal product of K-nearest-neighbor distances is deleted, the neighborhood relations of the remaining antibodies in the list are updated. The diversity maintenance ability of IMIS is thereby significantly improved compared with the crowding distance used in NNIA [49]. Another important feature of IMIS is that the online-discovered nondominated antibodies are selected to update the population. If there are many nondominated antibodies in the current generation, a few excellent antibodies located in less-crowded parts of the Pareto-optimal front are selected for updating. If there are only a few online-discovered nondominated antibodies, the solutions in the cloning pool are used to update the population. The adaptive selection operator is presented in Step 6 of Algorithm 9.4, and the whole procedure is described in Algorithm 9.4.
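The K-nearest-neighbor bookkeeping can be illustrated with the short sketch below; it recomputes the neighbor list on every pass rather than updating it incrementally, so it is a readable approximation of the archive truncation described above rather than the exact bookkeeping used in IMIS.

    import numpy as np
    from scipy.spatial.distance import cdist

    def knn_truncate(objectives, max_size, k=3):
        """Repeatedly delete the solution whose product of distances to its K nearest
        neighbors (in objective space) is smallest, i.e. the most crowded one,
        until at most max_size solutions remain."""
        objs = [np.asarray(o, dtype=float) for o in objectives]
        while len(objs) > max_size:
            pts = np.vstack(objs)
            d = cdist(pts, pts)
            np.fill_diagonal(d, np.inf)
            knn = np.sort(d, axis=1)[:, :k]      # the K smallest distances for every solution
            crowding = np.prod(knn, axis=1)      # product of K-nearest-neighbor distances
            objs.pop(int(np.argmin(crowding)))   # drop the most crowded solution
        return np.vstack(objs)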

9.4 Experiments
9.4.1 The MOEFC experiments
9.4.1.1 Experimental setting of MOEFC
To study the performance of MOEFC, synthetic images, natural images, medical images, and synthetic aperture radar (SAR) images are adopted in the experiments. Table 9.1 lists the observed images along with their types, sizes, and numbers of segments. In MOEFC, the population size is 100 and the maximum generation number is 30. The neighborhood size is 20. The crossover probability and the mutation constant of the DE operator, two parameters for generating new solutions, are set to 1 and 0.5, respectively. The mutation probability, which should be small to prevent the search from degenerating into a primitive random search, is defined as the inverse of the length of an individual. In MOEFC, an individual represents a set of candidate cluster centers, so the length of an individual equals the number of clusters. Hence, the mutation probability equals the inverse of the number of clusters.

Algorithm 9.4 The immune multiobjective optimization for SAR imagery segmentation.
Input parameters:
  IMIS: input image I, number of clusters k, number of objectives K, number of iterations Gmax, population scale N, size of clone pool c, crossover probability pc, and mutation probability pm.
Output result:
  Final segmentation result of image I.
Step 1. Call Algorithm 9.3: obtain the raw segmentation results of the fused features, RSFeatures.
Step 2. Initialization
  Pt = rand(N, BoundariesOfRSFeatures);   % generate the population randomly within the determined boundaries
  Fitt = FitnessCalculation(Pt);          % calculate the XB and JM indices by Eq. (9.15) and assign them to Fitt
  NAt = FindNondominatedAntibodies(Fitt, Pt);   % find the nondominated antibodies using Eq. (9.17)
  Set the iteration counter t = 0;
Step 3. Adaptive selection from different ranks
  i = 0; rs = []; PNN = [];
  while i < c do
      Find the K nearest neighbors of each antibody in NAt and denote them by NNi.
      Calculate the product of the K-nearest-neighbor distances for each item in NAt and denote them by PNNi.
      if i + |NNi| < c
          ACt = [ACt NAt];   % add NAt into ACt
      else
          InsufficientNumber = c - i;
          Take the InsufficientNumber antibodies in NAt with smaller PNNi and add them into ACt.
      end if
      i = i + |NNi|; rs = [rs i]; PNN = [PNN PNNi];
      NAt = FindNextBestNondominatedAntibodies(Pt);
  end while
Step 4. Perform cloning on the selected c antibodies at different ranks
  for i = 1 to |rs|
      rn = N * rs / sum(rs);
      for j = 1 to rs(i)
          cn = rn(i) * PNNi / sum(PNNi);   % take out the PNNi of rank i stored in PNN
          tempAC = [ACt(i, j), ACt(i, j), ..., ACt(i, j)];   % cn(j) copies of ACt(i, j)
          Ct = [Ct tempAC];
      end for
  end for
Step 5. The affinity maturation operators
  Ct' = SBXCrossover(Ct, pc); Ct'' = PolynomialMutation(Ct', pm);
  NFitt = FitnessCalculation(Ct'');   % calculate the indices in Eq. (9.15) and assign them to NFitt
Step 6. Adaptive selection and diversity maintenance by the K-nearest-neighbor list
  MACt = ACt ∪ Ct'';
  NMACt = FindNondominatedAntibodies(MACt);   % find the nondominated antibodies using Eq. (9.17)
  Find the K nearest neighbors of each antibody in NMACt and denote them by NNi.
  Calculate the product of the K-nearest-neighbor distances for each item in NMACt and denote them by PNNi.
  if |NMACt| > N
      while |NMACt| > N
          NMACt = DeleteSolution(NMACt, PNNi);   % remove the item in NMACt with the minimal PNNi
          PNNi = Update(NMACt, PNNi);            % update PNNi for the remaining antibodies in NMACt
      end while
  else if |NMACt| > c and |NMACt| < N
  else if |NMACt| ≤ c
  end if
Step 7. Stopping criterion
  If t > Gmax is satisfied, calculate the PBM-index of the antibodies in Pt+1 and export the antibody with the maximum PBM-index as the output of the algorithm; otherwise, set t = t + 1 and go to Step 3.
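As a small illustration of the cloning step, the sketch below distributes a clone budget in proportion to the K-nearest-neighbor distance products, so that more isolated antibodies receive more copies; the numbers are toy values and the rounding rule is one simple choice, not the exact scheme of Algorithm 9.4.

    import numpy as np

    def allocate_clones(crowding_products, total_clones):
        """Share total_clones among antibodies in proportion to their
        K-nearest-neighbor distance products (larger product = less crowded = more clones)."""
        p = np.asarray(crowding_products, dtype=float)
        share = p / p.sum()
        counts = np.floor(share * total_clones).astype(int)
        counts[np.argmax(share)] += total_clones - counts.sum()  # hand the rounding remainder to the largest share
        return counts

    # Toy example: three antibodies with crowding products 0.9, 0.3, 0.1 and a clone pool of 20.
    print(allocate_clones([0.9, 0.3, 0.1], 20))   # -> [15  4  1]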

The jumping rate Jr is set to 1 in the initialization step and 0.3 in the update step. We compare MOEFC with seven popular algorithms that achieve good performance in image segmentation. These comparison algorithms consist of five representative FCM-based algorithms and two state-of-the-art evolutionary fuzzy clustering algorithms. To reduce the instability caused by initialization, each algorithm performs 20 independent runs. The final result of each algorithm is reported as the statistical mean and standard deviation of the numerical results obtained from the 20 independent runs. The five FCM-based comparison algorithms are FLICM, RFLICM, KWFLICM, KFCM_S1, and KFCM_S2 [15].

Table 9.1: Image properties.

Type                                                 Name        Size       Number of segments
Synthetic images                                     SI1         128 x 128  3
                                                     SI2         244 x 244  4
                                                     SI3         256 x 256  4
Natural images from the Berkeley database            NI1         481 x 321  3
                                                     NI2         481 x 321  3
Natural images without reference segmented images    Flower      128 x 128  3
                                                     Coins       308 x 242  3
                                                     Cameraman   256 x 256  3
Simulated MR images from BrainWeb                    SMR1        181 x 217  4
                                                     SMR2        181 x 217  4
MR images without reference segmented images         MR1         256 x 256  4
                                                     MR2         256 x 256  4
SAR image data sets (each data set includes two      Bern        301 x 301  2
images)                                              Ottawa      290 x 350  2
                                                     Italy       412 x 300  2

To make the comparison fair, the maximum iteration number of these FCM algorithms is set equal to the population size of MOEFC. Each FCM algorithm is repeated five times in each run, and the best of the five results is selected as the result of that run. Furthermore, a 3 x 3 square window is chosen to gather local information for each FCM algorithm. For KFCM_S1 and KFCM_S2, the parameter α, which controls the local information term, and the kernel width are set to 4.2 and 150, respectively [15]. The two comparison algorithms based on evolutionary fuzzy clustering are the kernel-induced fuzzy clustering with an improved differential evolution algorithm (KFNDE) [52] and the multiobjective spatial fuzzy clustering algorithm (MSFCA) [53]. These two evolutionary fuzzy clustering algorithms are automatic clustering methods, so they may not determine the correct cluster number of the observed image. Hence, for KFNDE, the best segmentation result with the correct class number is reported as its final result. Unlike KFNDE, MSFCA always determined the correct cluster number of the observed image; hence, for MSFCA, the statistical mean and standard deviation of the results obtained from the 20 independent runs are reported as its final result. For some observed images, however, KFNDE and MSFCA could not acquire a segmentation result with the correct class number after 20 independent runs. In this case, the lengths of all the individuals are set equal to the correct cluster number for these two automatic clustering methods. The following subsections present the performances of the eight contestant algorithms on synthetic images, natural images, medical images, and SAR images, respectively.

Tables 9.2-9.8 list the statistical means and standard deviations of the numerical results produced by the eight contestant algorithms on the observed images over 20 independent runs, where the numbers in brackets are the standard deviations and the better numerical results are marked in bold. Figs. 9.4-9.18 show the segmentation results produced by the eight contestant algorithms on the observed images.
9.4.1.2 Segmentation results on synthetic images
In this experiment, different levels of Gaussian and Salt & Pepper noise are added to the three synthetic images SI1, SI2, and SI3. The accurate rate (AR) and the adjusted rand index (ARI) [54] are used to evaluate the numerical results produced by the eight algorithms against the ground truths. Table 9.2 lists the statistical means and standard deviations of the numerical results produced by the eight contestant algorithms on SI1 with Gaussian noise (15%, 20%, 30%) and Salt & Pepper noise (15%, 20%, 30%). It can be seen that MOEFC obtains larger values of the metrics, especially ARI, than the comparison techniques. The segmentation results on SI1 corrupted by 30% Gaussian noise are shown in Fig. 9.4. Visually, the segmentation results of MOEFC have clear edges and remove almost all of the noise. In addition, Table 9.3 shows the statistical means and standard deviations of the numerical results produced by the algorithms on SI2 and SI3 corrupted by 30% Gaussian noise and 30% Salt & Pepper noise. It can be seen that MOEFC obtains larger values of AR and ARI than the comparison methods. The segmentation results are shown in Figs. 9.5 and 9.6, respectively. In Fig. 9.5, it is observed that KFCM_S1, KWFLICM, and MOEFC achieve better results on SI2 with Salt & Pepper noise, but the result of KFCM_S1 on SI3 with Gaussian noise includes a great deal of noise in Fig. 9.6C. Also, the result of KWFLICM is corrupted by noise in Fig. 9.6G. It is obvious that MOEFC performs better on the images with different levels of noise and removes almost all of the noise. Because of space limitations, only some experimental results are given here; the detailed experimental results can be found in Ref. [8].
9.4.1.3 Segmentation results on natural images
In many real-world situations, image segmentation has to be carried out without ground truths. Thus, in this experiment, both natural images with reference segmented images and natural images without reference segmented images are adopted to validate the performance of MOEFC. Two natural images, NI1 and NI2, are selected from the Berkeley database [55]. To evaluate the segmentation results against sets of ground truths quantitatively, the probabilistic rand (PR) index [56] and the variation of information (VI) [57] are adopted.
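For reference, the ARI reported in these tables can be computed from two label maps with scikit-learn, as in the short sketch below; the file names are hypothetical, AR additionally needs a class-matching step that is not shown, and the tables report ARI scaled by 100.

    import numpy as np
    from sklearn.metrics import adjusted_rand_score

    gt = np.load("si1_ground_truth.npy")     # hypothetical ground-truth label map
    pred = np.load("si1_segmentation.npy")   # hypothetical segmentation label map, same shape

    ari = adjusted_rand_score(gt.ravel(), pred.ravel())
    print(f"ARI = {100 * ari:.2f}")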

Table 9.2: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on the first synthetic image (SI1) with different noises over 20 independent runs (standard deviations in parentheses).

Gaussian noise:
Method    Gau 15% AR      Gau 15% ARI     Gau 20% AR      Gau 20% ARI     Gau 30% AR      Gau 30% ARI     Average AR      Average ARI
KFCM-S1   96.36 (0.000)   87.57 (0.000)   93.54 (0.000)   78.77 (0.000)   90.77 (0.000)   70.89 (0.000)   93.56 (0.000)   79.07 (0.000)
KFCM-S2   96.72 (0.000)   88.80 (0.000)   94.35 (0.000)   81.26 (0.000)   91.52 (0.000)   72.88 (0.000)   94.20 (0.000)   80.98 (0.000)
FLICM     98.61 (0.000)   95.21 (0.000)   97.47 (0.000)   91.37 (0.000)   95.45 (0.000)   84.80 (0.000)   97.18 (0.000)   90.46 (0.000)
RFLICM    98.69 (0.000)   95.48 (0.000)   97.72 (0.000)   92.24 (0.000)   95.98 (0.000)   86.60 (0.000)   97.46 (0.000)   91.44 (0.000)
KWFLICM   99.26 (0.000)   97.42 (0.000)   98.75 (0.000)   95.72 (0.000)   97.56 (0.000)   91.78 (0.000)   98.52 (0.000)   94.97 (0.000)
KFNDE     77.39           43.62           73.18           37.28           68.87           31.23           73.15           37.38
MSFCA     98.79 (0.104)   95.75 (0.036)   97.72 (0.051)   92.11 (0.172)   93.87 (0.086)   79.92 (0.251)   96.80 (0.049)   89.26 (0.153)
MOEFC     99.29 (0.016)   97.54 (0.066)   98.88 (0.029)   96.18 (0.108)   97.99 (0.086)   93.24 (0.285)   98.72 (0.044)   95.65 (0.153)

Salt & Pepper noise:
Method    S&P 15% AR      S&P 15% ARI     S&P 20% AR      S&P 20% ARI     S&P 30% AR      S&P 30% ARI     Average AR      Average ARI
KFCM-S1   99.93 (0.000)   99.82 (0.000)   99.90 (0.000)   99.74 (0.000)   99.90 (0.000)   99.71 (0.000)   99.91 (0.000)   99.76 (0.000)
KFCM-S2   99.30 (0.000)   97.58 (0.000)   99.12 (0.000)   96.92 (0.000)   98.30 (0.000)   94.23 (0.000)   98.91 (0.000)   96.25 (0.000)
FLICM     97.57 (0.000)   91.86 (0.000)   96.72 (0.000)   89.05 (0.000)   94.56 (0.000)   82.27 (0.000)   96.28 (0.000)   87.73 (0.000)
RFLICM    97.48 (0.000)   91.54 (0.000)   96.62 (0.000)   88.71 (0.000)   94.17 (0.000)   81.04 (0.000)   96.09 (0.000)   87.10 (0.000)
KWFLICM   99.95 (0.000)   99.86 (0.000)   99.95 (0.000)   99.85 (0.000)   99.96 (0.000)   99.87 (0.000)   99.95 (0.000)   99.86 (0.000)
KFNDE     98.97           97.18           98.77           96.65           97.88           94.17           98.54           96.00
MSFCA     99.39 (0.000)   98.23 (0.000)   99.20 (0.000)   97.67 (0.000)   98.55 (0.003)   96.04 (0.012)   99.05 (0.001)   97.31 (0.004)
MOEFC     99.998 (0.003)  99.99 (0.014)   99.998 (0.003)  99.99 (0.014)   99.997 (0.003)  99.99 (0.011)   99.998 (0.003)  99.99 (0.013)

Table 9.3: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on the other two synthetic images (SI2 and SI3) with different noises over 20 independent runs.

SI2 with Gau 30%:
Method    AR              ARI
KFCM-S1   97.50 (0.000)   93.58 (0.000)
KFCM-S2   97.82 (0.000)   94.37 (0.002)
FLICM     98.79 (0.000)   96.84 (0.000)
RFLICM    98.82 (0.000)   96.94 (0.000)
KWFLICM   99.68 (0.000)   99.17 (0.000)
KFNDE     75.29           49.66
MSFCA     99.26 (0.008)   98.08 (0.022)
MOEFC     99.70 (0.007)   99.22 (0.017)

SI2 with S&P 30%:
Method    AR              ARI
KFCM-S1   99.93 (0.000)   99.82 (0.000)
KFCM-S2   98.82 (0.000)   96.92 (0.000)
FLICM     97.45 (0.000)   93.43 (0.000)
RFLICM    97.70 (0.000)   94.05 (0.000)
KWFLICM   99.97 (0.000)   99.93 (0.000)
KFNDE     52.14           48.90
MSFCA     98.94 (0.000)   97.24 (0.001)
MOEFC     99.98 (0.002)   99.96 (0.004)

SI3 with Gau 30%:
Method    AR              ARI
KFCM-S1   92.96 (0.000)   80.25 (0.000)
KFCM-S2   93.51 (0.000)   81.95 (0.000)
FLICM     97.55 (0.001)   93.30 (0.001)
RFLICM    97.72 (0.000)   93.83 (0.000)
KWFLICM   99.47 (0.000)   98.44 (0.000)
KFNDE     66.29           31.35
MSFCA     97.05 (0.186)   91.35 (0.527)
MOEFC     99.75 (0.009)   99.27 (0.029)

SI3 with S&P 30%:
Method    AR              ARI
KFCM-S1   99.94 (0.000)   99.86 (0.000)
KFCM-S2   97.59 (0.000)   93.29 (0.000)
FLICM     96.63 (0.001)   90.94 (0.001)
RFLICM    96.76 (0.000)   91.30 (0.000)
KWFLICM   99.95 (0.000)   99.88 (0.000)
KFNDE     97.51           93.92
MSFCA     98.85 (0.000)   96.92 (0.000)
MOEFC     99.97 (0.003)   99.93 (0.010)

The segmentation performance is better when the value of PR is larger and the value of VI is smaller. Table 9.4 lists the statistical means and standard deviations of the quantitative metrics obtained by the eight algorithms on these two natural images. It can be seen that MOEFC outperforms the other seven approaches, with the largest PR values and the smallest VI values. Although the PR values of MSFCA are close to MOEFC's in Table 9.4, the VI values of MSFCA are much larger. Figs. 9.7 and 9.8 show the segmentation results. It can be observed that the results of MOEFC preserve image details and have clear edges.
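The variation of information can be computed from the joint label histogram; the sketch below follows the standard definition VI = H(A) + H(B) - 2 I(A; B) and assumes the formulation in Ref. [57] matches it. Labels are expected to be nonnegative integers.

    import numpy as np

    def variation_of_information(a, b):
        """VI(A, B) = H(A) + H(B) - 2 I(A; B); lower is better, 0 means identical partitions."""
        a, b = np.ravel(a).astype(int), np.ravel(b).astype(int)
        joint = np.zeros((a.max() + 1, b.max() + 1))
        np.add.at(joint, (a, b), 1.0)                  # joint label histogram
        p = joint / a.size
        pa, pb = p.sum(axis=1), p.sum(axis=0)          # marginal label distributions
        ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
        hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
        nz = p > 0
        mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz]))
        return ha + hb - 2.0 * mi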

Table 9.4: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on two natural images (NI1 and NI2) over 20 independent runs.

NI1:
Method    PR                VI
KFCM-S1   0.9814 (0.0000)   0.7245 (0.0000)
KFCM-S2   0.9811 (0.0000)   0.7221 (0.0000)
FLICM     0.9828 (0.0000)   0.6898 (0.0000)
RFLICM    0.9829 (0.0000)   0.6895 (0.0000)
KWFLICM   0.9792 (0.0000)   1.1275 (0.0001)
KFNDE     0.9838            0.3402
MSFCA     0.9840 (0.0008)   0.7172 (0.0334)
MOEFC     0.9896 (0.0011)   0.1624 (0.0016)

NI2:
Method    PR                VI
KFCM-S1   0.9648 (0.0000)   1.3066 (0.0000)
KFCM-S2   0.9663 (0.0000)   1.2943 (0.0000)
FLICM     0.9741 (0.0000)   1.2437 (0.0001)
RFLICM    0.9749 (0.0000)   1.2400 (0.0000)
KWFLICM   0.9772 (0.0000)   1.2632 (0.0004)
KFNDE     0.9496            1.2844
MSFCA     0.9792 (0.0005)   1.1673 (0.0159)
MOEFC     0.9820 (0.0012)   1.0585 (0.0012)

Table 9.5: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on the natural images without reference segmented images over 20 independent runs.

Flower:
Method    Hr                Hl                E
KFCM-S1   1.6774 (0.0000)   0.4379 (0.0000)   2.1153 (0.0000)
KFCM-S2   1.6812 (0.0000)   0.4385 (0.0000)   2.1197 (0.0000)
FLICM     1.6906 (0.0000)   0.4415 (0.0000)   2.1321 (0.0000)
RFLICM    1.6919 (0.0000)   0.4416 (0.0000)   2.1335 (0.0000)
KWFLICM   1.6907 (0.0000)   0.4329 (0.0000)   2.1236 (0.0000)
KFNDE     1.5889            0.4474            2.0362
MSFCA     1.6785 (0.0010)   0.4318 (0.0015)   2.1103 (0.0025)
MOEFC     1.6840 (0.0025)   0.4323 (0.0016)   2.1163 (0.0018)

Coins:
Method    Hr                Hl                E
KFCM-S1   1.2165 (0.0000)   0.3064 (0.0000)   1.5229 (0.0000)
KFCM-S2   1.2185 (0.0000)   0.3121 (0.0000)   1.5305 (0.0000)
FLICM     1.2240 (0.0000)   0.3707 (0.0000)   1.5947 (0.0000)
RFLICM    1.2244 (0.0000)   0.3711 (0.0000)   1.5955 (0.0000)
KWFLICM   1.2214 (0.0000)   0.2947 (0.0000)   1.5161 (0.0000)
KFNDE     1.2306            0.2569            1.4876
MSFCA     1.2172 (0.0000)   0.3132 (0.0001)   1.5304 (0.0001)
MOEFC     1.2188 (0.0043)   0.3056 (0.0055)   1.5243 (0.0039)

Cameraman:
Method    Hr                Hl                E
KFCM-S1   1.7064 (0.0000)   0.4591 (0.0000)   2.1655 (0.0000)
KFCM-S2   1.7278 (0.0000)   0.4602 (0.0000)   2.1880 (0.0000)
FLICM     1.7376 (0.0000)   0.4612 (0.0000)   2.1988 (0.0000)
RFLICM    1.7390 (0.0000)   0.4611 (0.0000)   2.2001 (0.0000)
KWFLICM   1.7247 (0.0000)   0.4591 (0.0000)   2.1838 (0.0000)
KFNDE     1.8121            0.2932            2.1054
MSFCA     1.7039 (0.0001)   0.4599 (0.0001)   2.1638 (0.0000)
MOEFC     1.6966 (0.0045)   0.4591 (0.0035)   2.1557 (0.0029)

Table 9.6: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on the two simulated MR images (SMR1 and SMR2) over 20 independent runs.

SMR1:
            CSF                                GM                                 WM                                 Average
Method      JS              DC                 JS              DC                 JS              DC                 JS              DC
KFCM-S1     60.91 (0.000)   75.71 (0.000)      72.82 (0.000)   84.27 (0.000)      83.81 (0.000)   91.19 (0.000)      72.51 (0.000)   83.72 (0.000)
KFCM-S2     62.14 (0.000)   76.65 (0.000)      71.04 (0.000)   83.07 (0.000)      83.75 (0.000)   91.15 (0.000)      72.31 (0.000)   83.63 (0.000)
FLICM       59.18 (0.000)   74.35 (0.000)      72.43 (0.002)   84.01 (0.001)      83.21 (0.002)   90.83 (0.001)      71.60 (0.001)   83.06 (0.001)
RFLICM      58.84 (0.000)   74.09 (0.000)      72.26 (0.000)   83.90 (0.000)      83.17 (0.000)   90.81 (0.000)      71.42 (0.000)   82.93 (0.000)
KWFLICM     55.08 (0.012)   71.03 (0.010)      73.56 (0.007)   84.77 (0.005)      83.52 (0.003)   91.02 (0.002)      70.72 (0.007)   82.27 (0.006)
KFNDE       56.74           72.40              68.89           81.58              81.60           89.87              69.08           81.28
MSFCA       55.20 (0.701)   71.13 (0.583)      70.60 (0.380)   82.77 (0.262)      81.12 (0.268)   89.58 (0.163)      68.97 (0.450)   81.16 (0.336)
MOEFC       66.08 (0.732)   79.58 (0.532)      71.11 (1.004)   83.11 (0.687)      83.33 (0.686)   90.91 (0.408)      73.51 (0.807)   84.53 (0.543)

SMR2:
            CSF                                GM                                 WM                                 Average
Method      JS              DC                 JS              DC                 JS              DC                 JS              DC
KFCM-S1     62.63 (0.000)   77.02 (0.000)      71.32 (0.000)   83.26 (0.000)      79.96 (0.000)   88.87 (0.000)      71.30 (0.000)   83.05 (0.000)
KFCM-S2     63.46 (0.000)   77.65 (0.000)      70.40 (0.000)   82.63 (0.000)      79.74 (0.000)   88.73 (0.000)      71.20 (0.000)   83.00 (0.000)
FLICM       62.91 (0.000)   77.23 (0.000)      70.62 (0.000)   82.78 (0.000)      79.80 (0.000)   88.77 (0.000)      71.11 (0.000)   82.93 (0.000)
RFLICM      62.87 (0.000)   77.20 (0.000)      70.58 (0.003)   82.75 (0.002)      79.74 (0.003)   88.73 (0.002)      71.06 (0.002)   82.89 (0.001)
KWFLICM     61.76 (0.013)   76.36 (0.010)      71.65 (0.006)   83.48 (0.004)      80.07 (0.003)   88.93 (0.002)      71.16 (0.007)   82.92 (0.005)
KFNDE       61.49           76.16              69.56           82.05              78.51           87.96              69.86           82.06
MSFCA       62.64 (0.176)   77.03 (0.133)      70.15 (0.168)   82.46 (0.116)      78.69 (0.074)   88.07 (0.046)      70.49 (0.140)   82.52 (0.099)
MOEFC       62.57 (0.841)   76.97 (0.638)      72.19 (0.569)   83.85 (0.383)      80.85 (0.745)   89.41 (0.455)      71.87 (0.718)   83.41 (0.492)

Table 9.7: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on two MR images (MR1 and MR2) without reference segmented images over 20 independent runs.

MR1:
Method    Hr                Hl                E
KFCM-S1   1.8526 (0.0000)   0.5130 (0.0000)   2.3655 (0.0000)
KFCM-S2   1.8539 (0.0000)   0.5159 (0.0000)   2.3698 (0.0000)
FLICM     1.8703 (0.0000)   0.5117 (0.0000)   2.3820 (0.0000)
RFLICM    1.8747 (0.0000)   0.5113 (0.0000)   2.3860 (0.0000)
KWFLICM   1.8747 (0.0000)   0.5121 (0.0000)   2.3868 (0.0000)
KFNDE     1.7604            0.5021            2.2624
MSFCA     1.8590 (0.0028)   0.5102 (0.0060)   2.3692 (0.0035)
MOEFC     1.8388 (0.0074)   0.5152 (0.0063)   2.3540 (0.0045)

MR2:
Method    Hr                Hl                E
KFCM-S1   1.7781 (0.0000)   0.3716 (0.0000)   2.1497 (0.0000)
KFCM-S2   1.7805 (0.0000)   0.3735 (0.0000)   2.1540 (0.0000)
FLICM     1.7894 (0.0018)   0.3672 (0.0063)   2.1567 (0.0046)
RFLICM    1.7912 (0.0000)   0.3683 (0.0000)   2.1595 (0.0000)
KWFLICM   1.7851 (0.0021)   0.4818 (0.0391)   2.2669 (0.0000)
KFNDE     1.7396            0.3363            2.0759
MSFCA     1.7792 (0.0019)   0.3637 (0.0042)   2.1429 (0.0025)
MOEFC     1.7964 (0.0061)   0.3298 (0.0081)   2.1262 (0.0034)

For image segmentation without reference segmented images, there is no index that measures which solution in the obtained nondominated solution set is the most suitable for the observed image. The knee region, the most interesting region along the PF in an MOP, is preferred in this experiment. It is composed of solutions for which a small improvement in one objective would cause a severe deterioration in the other. In real-world situations, the obtained PF cannot converge to the true PF completely, and the location of the knee region is disturbed by the different magnitudes of the two objectives [58,59]. To address these problems, Li and Yao [59] proposed normalizing the obtained nondominated solution set and interpolating it with B-splines. After resampling from the smooth spline, the knee points can be located effectively by the angle-based method [58]. This method is also suitable for MOEFC. Hence, the experiment utilizes the knee-based method proposed in Ref. [59] to seek the knee region of the obtained PF and takes the solution closest to the knee region as the final result. To compare the performances quantitatively, the entropy-based evaluation function (E) [60] is introduced to assess image segmentation without ground truths; it is calculated by E = Hr + Hl, where Hr and Hl represent the expected region entropy and the layout entropy, respectively.
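A minimal sketch of E is given below, assuming the usual definitions behind this kind of entropy-based evaluation: Hr is taken as the area-weighted entropy of the gray levels inside each segmented region and Hl as the layout entropy of the relative region sizes; the exact formulation in Ref. [60] should be consulted for the experiments themselves.

    import numpy as np

    def entropy_E(image, labels):
        """Entropy-based evaluation E = Hr + Hl for a label map over a gray-level image (lower is better)."""
        image = np.ravel(image)
        labels = np.ravel(labels)
        n = labels.size
        hr = hl = 0.0
        for lab in np.unique(labels):
            mask = labels == lab
            w = mask.sum() / n                          # relative region size S_j / S_I
            _, counts = np.unique(image[mask], return_counts=True)
            p = counts / counts.sum()
            hr += w * (-np.sum(p * np.log(p)))          # expected region entropy term
            hl += -w * np.log(w)                        # layout entropy term
        return hr + hl, hr, hl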

Table 9.8: The statistical means and standard deviations of numerical results produced by eight contestant algorithms on SAR images over 20 independent runs.

Bern:
Method    FP               FN                OE                PCC               KC
KFCM-S1   84.0 (0.00)      231.0 (0.00)      315.0 (0.00)      0.9965 (0.0000)   0.8526 (0.0000)
KFCM-S2   183.0 (0.00)     141.0 (0.00)      324.0 (0.00)      0.9964 (0.0000)   0.8604 (0.0000)
FLICM     68.0 (0.00)      236.0 (0.00)      304.0 (0.00)      0.9966 (0.0000)   0.8564 (0.0000)
RFLICM    67.0 (0.00)      238.0 (0.00)      305.0 (0.00)      0.9966 (0.0000)   0.8557 (0.0000)
KWFLICM   35.8 (0.43)      328.0 (0.00)      363.8 (0.43)      0.9960 (0.0000)   0.8177 (0.0002)
KFNDE     108.0            648.0             756.0             0.9917            0.5691
MSFCA     31.9 (0.30)      374.35 (4.34)     406.25 (4.12)     0.9955 (0.0001)   0.7913 (0.0026)
MOEFC     185.4 (19.69)    109.5 (19.81)     294.8 (3.85)      0.9968 (0.0001)   0.8664 (0.0029)

Ottawa:
Method    FP               FN                OE                PCC               KC
KFCM-S1   429.0 (0.00)     2379.0 (0.00)     2808.0 (0.00)     0.9723 (0.0000)   0.8908 (0.0000)
KFCM-S2   514.0 (0.00)     2166.0 (0.00)     2680.0 (0.00)     0.9736 (0.0000)   0.8966 (0.0000)
FLICM     211.0 (0.00)     2586.1 (0.22)     2797.1 (0.22)     0.9724 (0.0000)   0.8900 (0.0000)
RFLICM    221.0 (0.00)     2571.0 (0.00)     2792.0 (0.00)     0.9725 (0.0000)   0.8903 (0.0000)
KWFLICM   112.0 (0.00)     2736.0 (0.00)     2848.0 (0.00)     0.9719 (0.0000)   0.8872 (0.0000)
KFNDE     1000.0           3845.0            4845.0            0.9523            0.8070
MSFCA     125.85 (4.50)    2860.55 (55.94)   2986.4 (51.49)    0.9706 (0.0005)   0.8814 (0.0023)
MOEFC     730.9 (135.16)   1074.1 (151.41)   1804.9 (26.40)    0.9822 (0.0003)   0.9327 (0.0013)

Italy:
Method    FP               FN                OE                PCC               KC
KFCM-S1   796.0 (0.00)     1263.0 (0.00)     2059.0 (0.00)     0.9833 (0.0000)   0.8601 (0.0000)
KFCM-S2   733.0 (0.00)     1322.0 (0.00)     2055.0 (0.00)     0.9834 (0.0000)   0.8614 (0.0000)
FLICM     810.0 (0.00)     1074.0 (0.00)     1884.0 (0.00)     0.9848 (0.0000)   0.8704 (0.0000)
RFLICM    815.0 (0.00)     1051.0 (0.00)     1866.0 (0.00)     0.9849 (0.0000)   0.8715 (0.0000)
KWFLICM   855.0 (0.00)     1018.0 (0.00)     1873.0 (0.00)     0.9849 (0.0000)   0.8704 (0.0000)
KFNDE     847.0            1625.0            2472.0            0.9800            0.8351
MSFCA     799.35 (1.01)    1215.8 (1.94)     2015.15 (1.11)    0.9837 (0.0000)   0.8627 (0.0001)
MOEFC     989.5 (38.63)    694.5 (49.09)     1684.1 (11.41)    0.9864 (0.0001)   0.8802 (0.0003)


Figure 9.4 Segmentation results on the first synthetic image (SI1) corrupted by Gaussian noise (30%): (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

Figure 9.5 Segmentation results on the second synthetic image (SI2) corrupted by Salt & Pepper noise (30%): (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

The details of Hr and Hl can be found in Ref. [60]; the segmentation performance is better when the value of E is smaller. In this experiment, three natural images are adopted: Flower corrupted by 20% Gaussian noise, and Coins and Cameraman corrupted by 20% Salt & Pepper noise. Table 9.5 gives the statistical means and standard deviations of the numerical results produced by the algorithms on these three natural images.


Figure 9.6 Segmentation results on the third synthetic image (SI3) corrupted by Gaussian noise (30%): (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

Figure 9.7 Segmentation results on the first natural image (NI1): (A) original image; (B) KFCM_S1 result; (C) KFCM_S2 result; (D) FLICM result; (E) RFLICM result; (F) KWFLICM result; (G) KFNDE result; (H) MSFCA result; (I) MOEFC result.

The segmentation results on the natural images are shown in Figs. 9.9-9.11, respectively. It is observed that KWFLICM and MOEFC remove most of the noise, but in Fig. 9.10G the result of KWFLICM on the Coins image misses a number of coin details. In Fig. 9.10J, MOEFC preserves the coin details and obtains smooth regions. In Table 9.5, the E values of KFNDE on the three natural images are smaller than those of MOEFC; however, it is observed that the results of KFNDE are seriously corrupted by noise in Figs. 9.9H, 9.10H, and 9.11H. For the natural image Flower, KFCM_S1 and MSFCA have smaller E values in Table 9.5, but their segmentation results include much noise in Figs. 9.9C and I.


Figure 9.8 Segmentation results on the second natural image (NI2): (A) original image; (B) KFCM_S1 result; (C) KFCM_S2 result; (D) FLICM result; (E) RFLICM result; (F) KWFLICM result; (G) KFNDE result; (H) MSFCA result; (I) MOEFC result.

Figure 9.9 Segmentation results on the Flower image corrupted by 20% Gaussian noise: (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

From the above analyses, it can be seen that segmentation results with low E values may still be seriously corrupted by noise, and results corrupted by noise do not achieve good segmentation performance. In Figs. 9.9J, 9.10J, and 9.11J, the results of MOEFC have clearer edges and smoother regions while removing noise. Furthermore, compared with the algorithms that obtain results with less noise in Figs. 9.9-9.11, MOEFC acquires lower E values in Table 9.5.


Figure 9.10 Segmentation results on the Coins image corrupted by 20% Salt & Pepper noise: (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

Figure 9.11 Segmentation results on the Cameraman image corrupted by 20% Salt & Pepper noise: (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

9.4.1.4 Segmentation results on medical images
Besides medical images with reference segmented images, images without reference segmented images are also utilized in this experiment. We adopt two simulated MR images, SMR1 and SMR2, selected from BrainWeb [61]. As shown in Figs. 9.12A and 9.13A, these two simulated MR images with 9% Rician noise (l = 9) [62] are the 90th and 180th brain region slices in the axial plane of the high-resolution T1-weighted phantom with 1 mm slice thickness, respectively.


Figure 9.12 Segmentation results on the first simulated MR image (SMR1) corrupted by 9% Rician noise (l = 9): (A) original image; (B) KFCM_S1 result; (C) KFCM_S2 result; (D) FLICM result; (E) RFLICM result; (F) KWFLICM result; (G) KFNDE result; (H) MSFCA result; (I) MOEFC result.

Figure 9.13 Segmentation results on the second simulated MR image (SMR2) corrupted by 9% Rician noise (l = 9): (A) original image; (B) KFCM_S1 result; (C) KFCM_S2 result; (D) FLICM result; (E) RFLICM result; (F) KWFLICM result; (G) KFNDE result; (H) MSFCA result; (I) MOEFC result.

The observed images include four clusters: background, cerebral spinal fluid (CSF), gray matter (GM), and white matter (WM). To calculate the statistical accuracies of the segmentation results with ground truths, the Jaccard similarity (JS) [63,64] and the Dice coefficient (DC) [65-67] are utilized.


Figure 9.14 Segmentation results on the first MR image (MR1) corrupted by 20% Rician noise (l = 20): (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

Figure 9.15 Segmentation results on the second MR image (MR2) corrupted by 20% Rician noise (l = 20): (A) original image; (B) noisy image; (C) KFCM_S1 result; (D) KFCM_S2 result; (E) FLICM result; (F) RFLICM result; (G) KWFLICM result; (H) KFNDE result; (I) MSFCA result; (J) MOEFC result.

Larger values of JS and DC indicate that the segmentation result is more similar to the ground truths. Table 9.6 gives the statistical means and standard deviations of the CSF, GM, and WM metrics produced by the algorithms. Figs. 9.12 and 9.13 show the segmentation results on these two slices, respectively. Visually, the result of KFNDE includes noise in Fig. 9.12G.


Figure 9.16 Segmentation results on the Bern data set: (A) image in April 1999; (B) image in May 1999; (C) ground truth image; (D) KFCM_S1 result; (E) KFCM_S2 result; (F) FLICM result; (G) RFLICM result; (H) KWFLICM result; (I) KFNDE result; (J) MSFCA result; (K) MOEFC result.

MOEFC preserves the most details of CSF in Fig. 9.12I, and it achieves a larger value of the CSF accuracy in Table 9.6. KWFLICM achieves a larger value of the GM accuracy, and KFCM_S1 achieves a larger value of the WM accuracy in Table 9.6, but the CSF accuracies of KWFLICM and KFCM_S1 are much smaller than MOEFC's. In Fig. 9.12I, it can be seen that the segmentation results of MOEFC have clear edges and smooth regions in both GM and WM while restraining noise. In the experiment on SMR2, MOEFC obtains larger accuracies of GM and WM than the other approaches. Although MSFCA acquires a greater accuracy of CSF, it does not perform well in the segmentation of GM and WM; its accuracies on GM and WM are lower in Table 9.6. It can also be seen in Table 9.6 that KFCM_S2 performs the best on the CSF segment; compared with KFCM_S2, MOEFC performs better on the GM and WM segments. Moreover, the average accuracies of MOEFC are greater than those of the other approaches in Table 9.6.


Figure 9.17 Segmentation results on the Ottawa data set: (A) image in May 1997; (B) image in August 1997; (C) ground truth image; (D) KFCM_S1 result; (E) KFCM_S2 result; (F) FLICM result; (G) RFLICM result; (H) KWFLICM result; (I) KFNDE result; (J) MSFCA result; (K) MOEFC result.

It can also be observed that MOEFC obtains clearer edges and smoother regions in Fig. 9.13I. In the next experiment, two MR images corrupted by 20% Rician noise (l = 20), MR1 and MR2, are adopted without ground truths. To compare the performances quantitatively, the entropy-based evaluation function (E) [60], the same metric used in the experiments on the natural images without ground truths, is utilized to assess the performance of the algorithms. Figs. 9.14 and 9.15 show the segmentation results on these two MR images, respectively. Table 9.7 gives the statistical means and standard deviations of the numerical results produced by the algorithms on these two MR images.


Figure 9.18 Segmentation results on the Italy data set: (A) image in September 2010; (B) image in June 2011; (C) ground truth image; (D) KFCM_S1 result; (E) KFCM_S2 result; (F) FLICM result; (G) RFLICM result; (H) KWFLICM result; (I) KFNDE result; (J) MSFCA result; (K) MOEFC result.

It can be seen that the E values of KFNDE on the MR images are smaller than MOEFC's; however, in Figs. 9.14H and 9.15H, it is obvious that the results of KFNDE include much noise. Compared with the other comparison algorithms, MOEFC achieves lower E values in Table 9.7. In Figs. 9.14J and 9.15J, the segmentation results obtained by MOEFC have clearer edges and smoother regions while removing noise. This reflects that MOEFC can preserve significant image details while removing noise in medical image segmentation.
9.4.1.5 Segmentation results on SAR images
In this experiment, three SAR image data sets, Bern, Ottawa, and Italy, are utilized. Each SAR image data set includes two SAR images acquired over the same geographical area at different times. The difference image (DI) is generated from the two SAR images; here, the DI is generated by the log-ratio operator [19]. The goal of segmenting the DI into two classes, changed and unchanged, is to measure the difference between the two SAR images and thereby analyze the changes in the same area at different times.
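A two-line sketch of the log-ratio operator is shown below; the absolute log-ratio is one common form of the operator of Ref. [19], and the small epsilon is only there to avoid division by zero on dark pixels.

    import numpy as np

    def log_ratio_di(img1, img2, eps=1e-6):
        """Difference image between two co-registered SAR images; large values mark changed pixels."""
        return np.abs(np.log((img2 + eps) / (img1 + eps)))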

To evaluate results against the available ground truths quantitatively, the percentage correct classification (PCC) [68], the Kappa coefficient (KC), and the overall error (OE) are defined as follows:

PCC = \frac{TP + TN}{N}    (9.21)

KC = \frac{PCC - PRE}{1 - PRE}    (9.22)

OE = FP + FN    (9.23)

where PRE = \frac{(TP + FP)(TP + FN) + (FN + TN)(TN + FP)}{N^2}. TP and TN reflect the changed area and the unchanged area, respectively, labeled identically in the segmented result and the ground truth. FP reflects the area labeled as changed in the segmentation but unchanged in the ground truth, and FN reflects the area labeled as unchanged in the segmentation but changed in the ground truth. N is the number of pixels. Both PCC and KC fall into the interval [0, 1], and larger values indicate better performance of the algorithm. Table 9.8 contains the statistical means and standard deviations of the numerical results on these three SAR image data sets obtained by the algorithms over 20 independent runs. It can be seen that MOEFC outperforms the other algorithms, with the largest values of PCC and KC on all three SAR data sets. In the experiment on the Bern data set, the FP value of MSFCA and the FN value of KFCM_S2 are smaller than those of MOEFC; however, the FN value of MSFCA and the FP value of KFCM_S2 are much larger, which leads to the larger OE values of these two comparison algorithms. Thus, MSFCA and KFCM_S2 obtain smaller values of PCC and KC on the Bern data set. For all three SAR image data sets, MOEFC achieves the lowest OE values in Table 9.8, and it achieves the highest values of PCC and KC. Figs. 9.16-9.18 show the segmentation results of the algorithms on the three SAR data sets. Visually, the results of MOEFC are the most similar to the ground truths.
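These three scores follow directly from the confusion counts, as the short sketch below shows with toy numbers (the counts are invented for illustration only).

    def change_detection_scores(tp, tn, fp, fn):
        """PCC, KC and OE of Eqs. (9.21)-(9.23) from the confusion counts of a change map."""
        n = tp + tn + fp + fn
        pcc = (tp + tn) / n
        pre = ((tp + fp) * (tp + fn) + (fn + tn) * (tn + fp)) / n ** 2   # expected agreement
        kc = (pcc - pre) / (1 - pre)
        oe = fp + fn
        return pcc, kc, oe

    # Toy example: 800 changed and 9000 unchanged pixels correctly labeled, 120 false alarms, 80 misses.
    print(change_detection_scores(tp=800, tn=9000, fp=120, fn=80))   # -> PCC = 0.98, KC ~ 0.88, OE = 200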

9.4.2 The IMIS experiments
9.4.2.1 IMIS experimental settings
To study the performance of this framework, IMIS is tested on both synthetic texture images and real SAR images in this section. In addition, five other algorithms, FCM, SOGA [10], SOM [11], HMTseg [12], and SCE [37], are used for comparison with IMIS. The source codes of the first three algorithms were programmed by the authors in MATLAB 7.01, SCE was provided by Dr. Zhang ([email protected]), and the HMTseg procedure can be downloaded from the website (http://www.dsp.rice.edu/software). All the experiments were implemented on an HP Workstation xw9300 (2.19 GHz, 16 GB RAM; Hewlett-Packard, Palo Alto, CA).

The optimal parameter settings for Gabor filters and GLCP are difficult to determine. Chang and Kuo [41] have noted that the most significant information in texture images usually appears in the intermediate frequency bands. Clausi verified that the low and middle frequencies of texture information can be accurately acquired using Gabor multichannel filters, and that GLCP provides stable texture discrimination ability in the higher-frequency components [45]. Considering these propositions and conclusions, the parameters used in IMIS are as follows. The Gabor multichannel filters are created with six center frequencies (F = 6.1876, 4.3878, 3.9135, 3.6751, 3.3991, and 2.9551 pixels per cycle) and six orientations (θ = 0°, 30°, 60°, 90°, 120°, and 150°). The parameters involved in designing the GLCP are the local window size w, the quantization level q, the interpixel distance d, and the direction α. A 9 x 9 local window is adopted to estimate the texture information for each pixel, with 64-level quantization, interpixel distance d = 1, and directions α = 0°, 45°, 90°, and 135°. We obtain 12-dimensional GLCP feature vectors by calculating the three statistics (contrast, entropy, and correlation) from Eq. (9.20). In the experiments, a total of 48-dimensional feature vectors were used for subsequent segmentation, combining the 36-dimensional feature vectors created by the Gabor filters with the 12-dimensional GLCP feature vectors.

What are the parameter settings of the six algorithms? The specific experimental settings are presented here for a fair comparison. FCM: the maximum number of iterations was 100 and the fuzzy exponent was 2.0; FCM was repeated five times in each run, and the best result was selected to reduce the instability caused by initialization. SOM: the dimensions of the feature map were 5 x 5. SCE: the scaling parameter σ for each component of spectral clustering with the Nyström method (SC_Nys) was distributed randomly in the interval [10, 46]; two hundred feature vectors were sampled randomly from the features in SCE for each component of SC_Nys, and thirty components of SC_Nys were employed and combined for the final ensemble learning. HMTseg: a Haar wavelet with three levels of decomposition was adopted for the wavelet transformation; 64 x 64 image blocks were used for model training, and the sizes of the raw HMT-based multiscale classification windows were 8 x 8, 4 x 4, and 2 x 2, as suggested in Ref. [12]. Parameter settings for IMIS and SOGA are presented in Table 9.9.
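For convenience, the feature-extraction settings just listed can be gathered in one place, as in the plain restatement below; the structure of the dictionaries is ours, while the values are exactly those given in the text.

    # Feature-extraction settings used for IMIS in this section.
    GABOR = {
        "center_frequencies": [6.1876, 4.3878, 3.9135, 3.6751, 3.3991, 2.9551],  # pixels per cycle
        "orientations_deg": [0, 30, 60, 90, 120, 150],
    }
    GLCP = {
        "window": (9, 9),
        "quantization_levels": 64,
        "interpixel_distance": 1,
        "directions_deg": [0, 45, 90, 135],
        "statistics": ["contrast", "entropy", "correlation"],
    }
    # 6 frequencies x 6 orientations = 36 Gabor features; 4 directions x 3 statistics = 12 GLCP features; 48 in total.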

Table 9.9: Specific parameter settings in IMIS and SOGA.

Method   Population size   Number of generations   Crossover probability   Mutation probability   Optimized index   Size of clone pool
IMIS     40                20                      0.8                     0.1                    XB and JM         20
SOGA     40                40                      0.8                     0.1                    XB                -

Additionally, simulated binary crossover (SBX) and the polynomial mutation operator have been adopted many times in the current EMOA literature [49,51,69,70]; here, the experiments continue to use them for the affinity maturation operation. As the stopping criterion for IMIS and SOGA, the maximal number of function evaluations was set to 800 for IMIS and 400 for SOGA, because there are two optimization indices in IMIS [49,71]. Which methods are used to assess the segmentation results? An intuitive way is to inspect the partitioning results by eye; a more effective and convincing manner is to calculate the statistical results of the six algorithms over multiple runs. The statistical indices are divided into two categories: internal and external. The internal indices are used to evaluate the goodness of partitions of data sets without true class labels; XB, JM, and PBM are representative internal indices. The external indices require the true class labels and include the accurate rate (AR), the rand index [54], and the adjusted rand index (ARI) [54]. Here, five indices are used to evaluate the performance of the final segmentation results: XB, JM, PBM, AR, and ARI. The detailed definitions of AR and ARI can be found in Ref. [54], and the three other indices have been presented in Section 9.3.1.
9.4.2.2 Analysis of experimental results
In this section, we investigate the performance of the AIS-based multiobjective optimization algorithm compared to five other pattern classification algorithms in segmenting three synthetic texture images and two real SAR images. The three synthetic texture images have two, four, and five categories and are composed from the Brodatz album of the University of Southern California. In the following experiments, 30 independent runs on each test image were performed to evaluate and compare the robustness of the algorithms. The statistical results (mean and standard deviation) of the selected metrics are shown in Tables 9.10 and 9.11. It is noteworthy that the real class labels of the three synthetic texture images are easily obtained; therefore, their segmentation results can be evaluated by the AR and ARI indices. However, the ground truth class labels of the SAR images are difficult to acquire; thus, AR and ARI cannot be used to evaluate the segmentation results of the two SAR images. Figs. 9.19-9.21 show the segmentation results of the synthetic texture images with two, four, and five categories using FCM, SOGA, SOM, IMIS, HMTseg, and SCE, respectively. Table 9.10 lists the statistical means and standard deviations of the segmentation results over 30 independent runs. Fig. 9.19 shows that all six algorithms can obtain good segmentation results for the synthetic texture image with two categories. The reason may be that, once an effective and appropriate feature extraction technique is designed, the difficulty of the classification task is simplified enormously. Table 9.10 shows that FCM and SOGA perform the best in segmenting this synthesized texture image, followed by IMIS. The statistical results of HMTseg and SCE are inferior to the other four algorithms in terms of the AR and ARI indices.

Table 9.10: The statistical means and standard deviations of the segmentation results of the three synthesized texture images over 30 independent runs.

Texture2:
Method    XB                JM                   PBM               AR                  ARI
FCM       0.1261 (0.0002)   297.6128 (0.0021)    1.9703 (0.0008)   99.4098 (0.0011)    0.9765 (0.0089)
SOGA      0.1149 (0.0002)   364.5140 (0.0022)    1.5403 (0.0006)   99.3998 (0.0012)    0.9761 (0.0001)
SOM       0.1413 (0.0001)   265.7456 (0.0021)    1.4558 (0.0008)   99.3607 (0.0022)    0.9745 (0.0001)
IMIS      0.1406 (0.0002)   280.3019 (0.0022)    1.5529 (0.0007)   99.3792 (0.0010)    0.9754 (0.0000)
HMTseg    -                 -                    -                 98.7793 (0.2256)    0.9517 (0.0088)
SCE       -                 -                    -                 98.7281 (0.0024)    0.9498 (0.0001)

Texture4:
Method    XB                JM                   PBM               AR                  ARI
FCM       0.2924 (0.0164)   111.9748 (8.3792)    0.7136 (0.0588)   86.4814 (13.4882)   0.8755 (0.1607)
SOGA      0.2343 (0.0132)   155.7982 (9.5098)    0.7297 (0.2604)   81.1752 (14.8180)   0.8758 (0.1701)
SOM       0.4099 (0.0140)   108.2735 (8.4029)    0.4307 (0.0823)   76.3222 (12.2265)   0.7845 (0.1337)
IMIS      0.2443 (0.0130)   125.8723 (5.9111)    0.8792 (0.0111)   98.6142 (0.0029)    0.9635 (0.0001)
HMTseg    -                 -                    -                 97.3498 (0.0897)    0.9309 (0.0023)
SCE       -                 -                    -                 96.8168 (0.0098)    0.9178 (0.0002)

Texture5:
Method    XB                JM                   PBM               AR                  ARI
FCM       0.7209 (0.4050)   98.2155 (21.0367)    2.7655 (0.4237)   84.8603 (21.5286)   0.8956 (0.0721)
SOGA      0.1219 (0.0172)   110.1688 (9.6690)    2.6989 (0.3892)   86.6389 (6.3854)    0.8996 (0.0571)
SOM       0.1376 (0.0278)   103.9490 (1.4234)    2.3325 (0.3741)   77.1517 (3.0085)    0.7975 (0.0144)
IMIS      0.1991 (0.0235)   102.1979 (1.1285)    3.0279 (0.1347)   96.4925 (2.1729)    0.9251 (0.0208)
HMTseg    -                 -                    -                 93.9183 (0.6288)    0.8595 (0.0111)
SCE       -                 -                    -                 92.7210 (3.5862)    0.8484 (0.0452)

Table 9.11: The statistical means and standard deviations of the segmentation results of the two SAR images over 30 independent runs.

Real SAR1:
Method    XB                JM                   PBM
FCM       0.3196 (0.0112)   170.7836 (0.1777)    0.8499 (0.0042)
SOGA      0.1669 (0.0071)   175.1025 (0.4184)    0.9040 (0.0094)
SOM       0.3042 (0.0112)   141.2703 (0.8591)    0.8267 (0.0211)
IMIS      0.2510 (0.0062)   171.0940 (0.1765)    1.1348 (0.0024)

Real SAR2:
Method    XB                JM                   PBM
FCM       0.4256 (0.0757)   131.6360 (4.7731)    0.4584 (0.0959)
SOGA      0.1983 (0.0242)   162.4240 (1.1779)    0.4237 (0.0429)
SOM       0.6445 (0.3352)   118.1468 (0.7821)    0.4326 (0.0424)
IMIS      0.2053 (0.0784)   128.3862 (0.3017)    0.8499 (0.0220)

Figure 9.19 The segmentation results of the synthesized texture image with two categories: (A) the original image (256 x 256 pixels); (B) the segmentation model; and (C)-(H) the segmentation results by FCM, SOGA, SOM, IMIS, HMTseg, and SCE, respectively.

In order to further compare the six algorithms, more difficult classification tasks should be designed and tested. Once the true number of clusters increases, the statistical classification performance of the algorithms without global search ability degrades. In Figs. 9.20D and E, some local patches are misclassified as other categories within homogeneous regions. Thus, SOGA and SOM demonstrate relatively poor performance in segmenting the synthetic texture image with four categories.


Figure 9.20 The segmentation results of the synthesized texture image with four categories: (A) the original image (256 x 256 pixels); (B) the segmentation model; and (C)-(H) the segmentation results using FCM, SOGA, SOM, IMIS, HMTseg, and SCE, respectively.

Figure 9.21 The segmentation results of the synthesized texture image with five categories: (A) the original image (256 x 256 pixels); (B) the segmentation model; and (C)-(H) the segmentation results using FCM, SOGA, SOM, IMIS, HMTseg, and SCE, respectively.

On the contrary, IMIS and HMTseg obtain better segmentation results in region homogeneity, followed by SCE and FCM. Furthermore, the category boundaries in the middle of the partitioning results of HMTseg are not well divided. For SCE, there are two obviously misclassified spots at the lower right of its segmentation results. The statistical results in Table 9.10 also agree with the visual inspection in Fig. 9.20. SOGA aims to minimize the XB index; thus, the optimal value of the XB index can be obtained by this algorithm. Nevertheless, it deteriorates greatly with respect to the other index, the JM index. The opposite case holds for FCM and SOM, since both of them try to optimize the JM index. Table 9.10 shows that SOGA, FCM, and SOM cannot provide the best statistical values of PBM, AR, and ARI even though they perform the best in optimizing the XB index or the JM index. In IMIS, the two indices are incorporated and optimized simultaneously; the algorithm obtains the best segmentation results both in visual inspection and in the statistical classification accuracy, despite not obtaining the best XB or JM values. Therefore, we believe that the evolutionary process regulated by the two conflicting objectives is beneficial for searching for the optimal partition, and the trade-off solutions provided by them are useful for discovering the complicated relationships within image data. Fig. 9.21 shows the segmentation results of the synthetic texture image with five categories using the six algorithms, respectively. Most real SAR images are contaminated by a lot of speckle noise, which can conceal and destroy regions or objects of interest and cause difficulty and uncertainty for further discrimination and understanding. To simulate the effect of speckle noise on real SAR images, we added a high level of noise to the synthetic texture image with five categories. The added speckle noise has a mean of zero and a standard deviation of 35. Fig. 9.21 and Table 9.10 indicate that IMIS is the best both in visual inspection and in the statistical results of PBM, AR, and ARI. FCM and SOGA obtain the next-best one-off segmentation results in Fig. 9.21, but they show high instability over multiple runs in Table 9.10. To the best of the authors' knowledge, FCM is the most widely used method for data clustering and image classification. The algorithm is easy to implement and gives reasonable results in most cases. Unfortunately, the algorithm makes only local changes to the original partition; thus, it often gets stuck at a local minimum in the early iterations. Here, the segmentation results of the synthetic texture images with four and five categories have verified this conclusion. For SCE and HMTseg, the category boundaries of their segmentation results are not well defined. Considering the segmentation results of the six algorithms on the three synthesized texture images, we can conclude that IMIS obtains impressive and encouraging partitioning results. Additionally, in order to maintain the original intentions of the HMTseg and SCE algorithms, a multiscale Haar wavelet is employed in HMTseg, and gray-level co-occurrence matrix-based statistics and energy features from the undecimated wavelet decomposition are adopted in SCE. Therefore, the magnitudes of the features in HMTseg and SCE are different from the features used in the other four algorithms in the framework, and it is meaningless to compare their XB, JM, and PBM indices.


Figure 9.22 The segmentation results of the SAR image with three categories: (A) the original image (256 × 256 pixels) and (B)–(G) the segmentation results using FCM, SOGA, SOM, IMIS, HMTSeg, and SCE, respectively.

to compare their XB, JM, and PBM indices. Therefore, these three items for HMTseg and SCE are not presented in Table 9.10. The next two experiments were implemented using two real SAR images. Fig. 9.22 shows the segmentation results of a four-look SAR image provided by the second European Remote Sensing satellite (ERS-2). The original image is shown in Fig. 9.22A with three types of crops in black, gray, and white. It can be seen that the gray crop in the middle of the image is seriously misclassified as black regions in Figs. 9.22B, C, and D, whereas the segmentation results in Figs. 9.22E and G are better than those of the above three algorithms. The difficulty of segmenting this SAR image lies in the highly overlapping crops. Taking the regions at the upper right corner of the image as an example, the gray crops and dark crops are mixed together, which can induce ambiguity for accurate and clear partitioning. Similar regions are located at the upper middle, upper right, and middle right of the original image. Table 9.11 presents the statistical segmentation results of the SAR image over 30 independent runs. As can be seen in Fig. 9.22 and Table 9.11, IMIS and SCE have obtained more satisfying segmentation results than the other four algorithms. The last experiment was carried out on an SAR image of a certain open field in western China. The image consists of four typical ground objects: three types of crops and several regions of water. Visually, the difficulty of the classification task lies in how to distinguish


Figure 9.23 The segmentation results of the SAR image with four categories: (A) the original image (256 × 256 pixels) and (B)–(G) the segmentation results by FCM, SOGA, SOM, IMIS, HMTSeg, and SCE, respectively.

the light gray crops, dark gray crops, and black water clearly and accurately. The segmentation results by FCM and SOM are shown in Figs. 9.23B and D. They have the same trouble in separating the light gray crops from the dark gray crops in the upper left corner of the image. In addition, some dark gray crops are misclassified as water regions. Fig. 9.23C shows the segmentation results by SOGA. Most of the light gray crops and the dark gray crops are mixed together, and SOGA cannot produce clear partitioning results for this SAR image. For SCE, some dark gray crops are misclassified as light gray ones. In Figs. 9.23E and F, IMIS and HMTseg have obtained better partitions than FCM, SOGA, and SCE. In particular, IMIS seems to present relatively better results not only in region consistency but also in boundary localization. The statistical means and standard deviations of the segmentation results of the two SAR images over 30 independent runs are presented in Table 9.11. The statistical results of XB, JM, and PBM for HMTseg and SCE are not presented in this study because the features used in HMTseg and SCE are different from the features used in the other four algorithms. Table 9.11 shows that the two complementary optimization indices in IMIS can produce more suitable segmentation results despite not being at their optimal values.


9.5 Summary

Two methods are introduced in this chapter for image segmentation, based on the multiobjective evolutionary fuzzy clustering technique and the artificial immune multiobjective optimization technique, respectively [72,73]. The two algorithms are denoted MOEFC and IMIS. In the first method, a multiobjective evolutionary fuzzy clustering algorithm is proposed, which converts the fuzzy clustering problem into a multiobjective optimization problem. To achieve a trade-off between preserving significant image details and removing noise, the original FCM energy function, which preserves image details, and a function based on local information, which restrains noise, are minimized by MOEA/D simultaneously. The decomposition strategy is utilized to project the multiobjective fuzzy clustering problem into a number of subproblems. Each subproblem represents a fuzzy clustering problem with a different balance between maintaining image details and suppressing noise. To improve the performance of the proposed algorithm, two problem-specific techniques, an adaptive weighted fuzzy factor and a hybrid population initialization method, are introduced. With the adaptive weighted fuzzy factor, local information can be incorporated into multiobjective fuzzy clustering more effectively and adaptively. With the hybrid population initialization, MOEFC can start from promising segmentation results. Furthermore, OBL is adopted to improve the search capability of the algorithm. To study the performance of MOEFC, it is compared with seven state-of-the-art approaches in experiments on synthetic and real images. The experimental results show that MOEFC not only removes noise during image segmentation but also maintains significant image details. Also, it has no preference for noise type and operates on the original images without removing image details.

The other algorithm, called IMIS, is an efficient artificial immune multiobjective image segmentation framework. Two conflicting indices are selected and optimized simultaneously in this framework. This AIS-based multiobjective optimization algorithm is designed by incorporating an adaptive selection scheme, an adaptive rank clone, and a diversity maintenance technique based on a dynamic K-nearest-neighbor list, which improve the robustness, adaptability, and diversity maintenance of IMIS accordingly. Through the systematic experimental study of segmenting three synthetic texture images and two complicated SAR images, one can see that IMIS obtains good and encouraging partitioning results.
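As a rough illustration of the two conflicting objectives that MOEFC minimizes with MOEA/D, the sketch below evaluates an FCM-style energy on the raw image and on a locally smoothed image; the uniform local average is only a stand-in for the chapter's adaptive weighted fuzzy factor, and all names are hypothetical.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fcm_energy(img, U, V, m=2.0):
    """Objective 1: FCM energy on raw pixel intensities (detail preserving)."""
    # img: (H, W) image, U: (c, H, W) memberships, V: (c,) gray-level cluster centers
    d2 = (img[None, :, :] - V[:, None, None]) ** 2
    return float(((U ** m) * d2).sum())

def local_fcm_energy(img, U, V, m=2.0, size=3):
    """Objective 2: the same energy on locally averaged intensities (noise suppressing).
    A uniform filter is used here only as a simple proxy for local spatial information."""
    smoothed = uniform_filter(img.astype(float), size=size)
    d2 = (smoothed[None, :, :] - V[:, None, None]) ** 2
    return float(((U ** m) * d2).sum())
```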

References [1] Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(1):4e37. [2] Leung SH, Wang SL, Lau WH. Lip image segmentation using fuzzy clustering incorporating an elliptic shape function. IEEE Transactions on Image Processing 2004;13(1):51e62.

Multiobjective optimization algorithm-based image segmentation 347 [3] Zhou H, Schaefer G, Sadka AH, et al. Anisotropic mean shift based fuzzy C-means segmentation of dermoscopy images. IEEE Journal of Selected Topics in Signal Processing 2009;3(1):26e34. [4] Lee CH, Zaı¨ane OR, Park HH, et al. Clustering high dimensional data: a graph-based relaxed optimization approach. Information Sciences 2008;178(23):4501e11. [5] Lemarechal C, Fjortoft R, Marthon P, et al. SAR image segmentation by morphological methods[C]//SAR Image Analysis, Modeling, and Techniques. International Society for Optics and Photonics 1998;3497:111e22. [6] Dong Y, Forster BC, Milne AK. Comparison of radar image segmentation by Gaussian-and gammaMarkov random field models. International Journal of Remote Sensing 2003;24(4):711e22. [7] Zeng J, Feng W, Xie L, et al. Cascade Markov random fields for stroke extraction of Chinese characters. Information Sciences 2010;180(2):301e11. [8] Zhang M, Jiao L, Ma W, et al. Multiobjective evolutionary fuzzy clustering for image segmentation with MOEA/D. Applied Soft Computing 2016;48:621e37. [9] Yang D, Jiao L, Gong M, et al. Artificial immune multiobjective SAR image segmentation with fused complementary features. Information Sciences 2011;181(13):2797e812. [10] Bandyopadhyay S, Maulik U, Mukhopadhyay A. Multiobjective genetic clustering for pixel classification in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2007;45(5):1506e11. [11] Zhong Y, Zhang L, Gong J, et al. A supervised artificial immune classifier for remote-sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2007;45(12):3957e66. [12] Choi H, Baraniuk RG. Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Transactions on Image Processing 2001;10(9):1309e21. [13] Zhang X, Jiao L, Liu F, et al. Spectral clustering ensemble applied to SAR image segmentation. IEEE Transactions on Geoscience and Remote Sensing 2008;46(7):2126e36. [14] Ahmed MN, Yamany SM, Mohamed N, et al. A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Transactions on Medical Imaging 2002;21(3):193e9. [15] Chen S, Zhang D. Robust image segmentation using FCM with spatial constraints based on new kernelinduced distance measure. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2004;34(4):1907e16. [16] Szilagyi L, Benyo Z, Szila´gyi SM, et al. MR brain image segmentation using an enhanced fuzzy c-means algorithm[C]//Engineering in Medicine and Biology Society. In: Proceedings of the 25th annual international conference of the IEEE. vol. 1. IEEE; 2003. p. 724e6. 2003. [17] Cai W, Chen S, Zhang D. Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognition 2007;40(3):825e38. [18] Krinidis S, Chatzis V. A robust fuzzy local information C-means clustering algorithm. IEEE Transactions on Image Processing 2010;19(5):1328e37. [19] Gong M, Zhou Z, Ma J. Change detection in synthetic aperture radar images based on image fusion and fuzzy clustering. IEEE Transactions on Image Processing 2012;21(4):2141e51. [20] Gong M, Liang Y, Shi J, et al. Fuzzy c-means clustering with local information and kernel metric for image segmentation. IEEE Transactions on Image Processing 2013;22(2):573e84. [21] Celik T, Lee HK. Comments on “A robust fuzzy local information C-means clustering algorithm”. IEEE Transactions on Image Processing 2013;22(3):1258e61. [22] Szila´gyi L. 
Lessons to learn from a mistaken optimization. Pattern Recognition Letters 2014;36:29e35. [23] Miettinen K. Nonlinear multiobjective optimization, volume 12 of international series in operations research and management science. 1999. [24] Muller KR, Mika S, Ratsch G, et al. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 2001;12(2):181e201. [25] Fan J, Wang J, Han M. Cooperative coevolution for large-scale optimization based on kernel fuzzy clustering and variable trust region methods. IEEE Transactions on Fuzzy Systems 2014;22(4):829e39. [26] Zhang Q, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 2007;11(6):712e31.

348 Chapter 9 [27] Price K, Storn RM, Lampinen JA. Differential evolution: a practical approach to global optimization[M]. Springer Science & Business Media; 2006. [28] Yang X, Zhang G, Lu J, et al. A kernel fuzzy C-means clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises. IEEE Transactions on Fuzzy Systems 2011;19(1):105e15. [29] MacQueen J. Some methods for classification and analysis of multivariate observations[C]. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1967;1(14):281e97. [30] Dunn JCA. Fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics 1974;3:32e57. [31] Bezdek JC. Pattern recognition with fuzzy objective function algorithms[M]. US: Springer; 1981. [32] Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(8):888e905. [33] Jain AK, Farrokhnia F. Unsupervised texture segmentation using Gabor filters. Pattern Recognition 1991;24(12):1167e86. [34] Haralick RM, Shanmugam K. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics 1973;(6):610e21. [35] Xie XL, Beni G. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 1991;13(8):841e7. [36] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(12):1650e4. [37] De Castro L, Timmis J. Artificial immune systems: a new computational approach, September 2002. 2002. [38] Vincent L, Soille P. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 1991;(6):583e98. [39] Deng H, Clausi DA. Unsupervised image segmentation using a simple MRF model with a new implementation scheme. Pattern Recognition 2004;37(12):2323e35. [40] Randen T, Husoy JH. Filtering for texture classification: a comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence 1999;21(4):291e310. [41] Chang T, Kuo CCJ. Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing 1993;2(4):429e41. [42] Clausi DA. Comparison and fusion of co-occurrence, Gabor and MRF texture features for classification of SAR sea-ice imagery. Atmosphere-Ocean 2001;39(3):183e94. [43] Ruiz LA, Fdez-Sarrı´a A, Recio JA. Texture feature extraction for classification of remote sensing data using wavelet decomposition: a comparative study[C]. 20th ISPRS Congress 2004;35:1109e14 (part B). [44] Solberg AHS, Jain AK. Texture fusion and feature selection applied to SAR imagery. IEEE Transactions on Geoscience and Remote Sensing 1997;35(2):475e9. [45] Clausi DA, Deng H. Design-based texture feature fusion using Gabor filters and co-occurrence probabilities. IEEE Transactions on Image Processing 2005;14(7):925e36. [46] Clausi DA, Jernigan ME. Designing Gabor filters for optimal texture separability. Pattern Recognition 2000;33(11):1835e49. [47] Bovik AC, Clark M, Geisler WS. Multichannel texture analysis using localized spatial filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 1990;12(1):55e73. [48] Jiao L, Wang L. A novel genetic algorithm based on immunity. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 2000;30(5):552e61. 
[49] Gong M, Jiao L, Du H, et al. Multiobjective immune algorithm with nondominated neighbor-based selection. Evolutionary Computation 2008;16(2):225e55. [50] Yang D, Jiao L, Gong M. Adaptive multiobjective optimization based on nondominated solutions. Computational Intelligence 2009;25(2):84e108. [51] Yang D, Jiao L, Gong M, et al. Adaptive ranks clone and k-nearest neighbor list-based immune multiobjective optimization. Computational Intelligence 2010;26(4):359e85.

Multiobjective optimization algorithm-based image segmentation 349 [52] Das S, Sil S. Kernel-induced fuzzy clustering of image pixels with an improved differential evolution algorithm. Information Sciences 2010;180(8):1237e56. [53] Zhao F, Liu H, Fan J. A multiobjective spatial fuzzy clustering algorithm for image segmentation. Applied Soft Computing 2015;30:48e57. [54] Hubert L, Arabie P. Comparing partitions. Journal of Classification 1985;2(1):193e218. [55] Martin D, Fowlkes C, Tal D, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE international conference on, vol. 2. IEEE; 2001. p. 416e23. [56] Unnikrishnan R, Hebert M. Measures of similarity. In: Application of computer vision, 2005. WACV/ MOTIONS’05 volume 1. Seventh IEEE workshops on, vol. 1. IEEE; 2005. 394-394. [57] Meilǎ M. Comparing clusterings: an axiomatic view. In: Proceedings of the 22nd international conference on machine learning. ACM; 2005. p. 577e84. [58] Branke J, Deb K, Dierolf H, et al. Finding knees in multiobjective optimization. In: International conference on parallel problem solving from nature. Berlin, Heidelberg: Springer; 2004. p. 722e31. [59] Li L, Yao X, Stolkin R, et al. An evolutionary multiobjective approach to sparse reconstruction. IEEE Transactions on Evolutionary Computation 2014;18(6):827e45. [60] Zhang H, Fritts JE, Goldman SA. An entropy-based objective evaluation method for image segmentation. In: Storage and Retrieval Methods and Applications for Multimedia 2004, vol. 5307. International Society for Optics and Photonics; 2003. p. 38e50. [61] Cocosco CA, Kollokian V, Kwan RKS, et al. Brainweb: online interface to a 3D MRI simulated brain database. NeuroImage 1997. [62] MathWorks, Image processing toolbox, natick, ma [online], 2011, http://www.mathworks.com/ matlabcentral/fileexchange/14237/. [63] Shattuck DW, Sandor-Leahy SR, Schaper KA, et al. Magnetic resonance image tissue classification using a partial volume model. NeuroImage 2001;13(5):856e76. [64] Van Leemput K, Maes F, Vandermeulen D, et al. Automated model-based tissue classification of MR images of the brain. IEEE Transactions on Medical Imaging 1999;18(10):897e908. [65] Ashburner J, Friston KJ. Unified segmentation. NeuroImage 2005;26(3):839e51. [66] Johnston B, Atkins MS, Mackiewich B, et al. Segmentation of multiple sclerosis lesions in intensity corrected multispectral MRI. IEEE Transactions on Medical Imaging 1996;15(2):154e69. [67] Liew AWC, Yan H. An adaptive spatial fuzzy clustering algorithm for 3-D MR image segmentation. IEEE Transactions on Medical Imaging 2003;22(9):1063e75. [68] Rosin PL, Ioannidis E. Evaluation of global image thresholding for change detection. Pattern Recognition Letters 2003;24(14):2345e56. [69] Coello CAC. Evolutionary multiobjective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 2006;1(1):28e36. [70] Yang D, Jiao L, Gong M. Adaptive multi-objective optimization based on nondominated solutions. Computational Intelligence 2009;25(2):84e108. [71] Deb K, Pratap A, Agarwal S, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182e97. [72] Zhang M, Jiao L, Ma W, et al. Multi-objective evolutionary fuzzy clustering for image segmentation with MOEA/D. Applied Soft Computing 2016;48:621e37. [73] Yang D, Jiao L, Gong M, et al. 
Artificial immune multi-objective SAR image segmentation with fused complementary features. Information Sciences 2011;181(13):2797e812.

C H A P T E R 10

Graph-regularized feature selection based on spectral learning and subspace learning

Chapter Outline
10.1 Nonnegative spectral learning and subspace learning-based graph-regularized feature selection
    10.1.1 Dual-graph nonnegative spectral learning
    10.1.2 Dual-graph sparse regression
    10.1.3 Feature selection
    10.1.4 Optimization
    10.1.5 Local structure preserving
    10.1.6 Update rules for SGFS
10.2 Experiments of spectral learning and subspace learning methods for feature selection
    10.2.1 Experiments and analysis of NSSRD
        10.2.1.1 Experimental settings
        10.2.1.2 Simple illustrative example problem
        10.2.1.3 Evaluating the effectiveness of NSSRD
        10.2.1.4 Clustering results and analysis
    10.2.2 Experiments and analysis of SGFS
        10.2.2.1 Experimental setting
        10.2.2.2 Convergence test
        10.2.2.3 AT&T face dataset example
        10.2.2.4 Experimental results and analysis
        10.2.2.5 Robustness test of algorithms
        10.2.2.6 Parameter sensitivity analysis
References

Feature selection is an important approach for reducing the dimension of high-dimensional data. In recent years, many feature selection algorithms have been proposed [1,2]. However, most of them exploit information only from the data space. They often neglect useful information contained in the feature space, and typically do not exploit information about the underlying geometry of the data. To overcome these problems, we introduce new unsupervised feature selection methods based on the feature selection framework of joint embedding learning, sparse regression, and subspace learning, and extend the framework by introducing the feature graph.


10.1 Nonnegative spectral learning and subspace learning-based graph-regularized feature selection

Dealing with high-dimensional data is a difficult problem in data mining, pattern recognition, machine learning, and other fields [3]. Often, only a small subset of the features is important or useful in dealing with these data [4], while the vast majority of features are redundant or artifacts of noise [5], which can interfere with processing of the data. Therefore, it is often necessary to reduce the dimension of high-dimensional data. Feature selection and feature extraction are the two main dimension reduction methods [6–8]. Feature selection chooses a subset of the original features that is representative of the original data. In contrast, feature extraction transforms the original data from a high-dimensional space to a low-dimensional space by merging the original features into new types of features to represent the original data. Compared to feature extraction, feature selection preserves the physical meaning of the original data, which is often more convenient for subsequent data analysis.

In this section, we introduce our proposed algorithms: nonnegative spectral learning and sparse regression-based dual-graph regularized feature selection (NSSRD), and feature selection of graph regularization based on subspace learning (SGFS). The framework of NSSRD comprises four main parts: dual-graph nonnegative spectral learning, dual-graph sparse regression, feature selection, and optimization. Inspired by the idea of dual-graph regularized algorithms, we introduce the feature graph into an unsupervised feature selection framework of joint embedding learning and sparse regression. By making full use of the underlying information of the feature manifold and the advantages of this framework, we obtain a more efficient unsupervised feature selection algorithm. However, this also increases the computational complexity of the algorithm, because both the data graph and the feature graph need to be constructed. SGFS is built on the feature selection framework of subspace learning and extends it with the idea of graph regularization, constructing a feature mapping on the feature space in order to preserve the geometric structure information of the feature manifold. The framework of SGFS comprises three main parts: sparse subspace learning, local structure preserving, and update rules.
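The distinction between the two dimension reduction strategies can be made concrete with a small NumPy sketch; the index set and the use of PCA for extraction are illustrative choices, not the methods proposed in this chapter.

```python
import numpy as np

X = np.random.randn(100, 50)            # 100 samples, 50 original features

# Feature selection: keep a subset of the original columns, so each retained
# dimension still corresponds to a physical feature of the data.
selected = [3, 17, 42]
X_sel = X[:, selected]

# Feature extraction: project onto new axes (here, the top principal components);
# each resulting dimension is a mixture of the original features.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_ext = Xc @ Vt[:3].T
```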

10.1.1 Dual-graph nonnegative spectral learning

Spectral theory has been successfully applied in a number of fields [9–13]. Among these applications, the spectral clustering method uses graph theory to describe the underlying manifold structure of the data in order to achieve effective clustering. Using spectral graph theory, high-dimensional data can be embedded into a low-dimensional space, which effectively eliminates redundant features or noise and facilitates subsequent analysis. Therefore, this

advantage can also be applied to feature selection. In recent years, several researchers have suggested that the manifold information of data is distributed not only in the data space but also in the feature space [14,15]. We construct nearest-neighbor graphs in both the data space and the feature space. We first construct a k-nearest neighbor graph G = (V, E) in the data space, where V denotes the vertex set {X_{:,1}, ..., X_{:,n}} and E denotes the weights of the edges between pairs of points, which represent their similarity. We choose a Gaussian function and a parameter-free method [16] as weight measures, respectively. The Gaussian function is defined as follows:

$$
W^{S}_{ij}=
\begin{cases}
\exp\!\left(-\dfrac{\|X_{:,i}-X_{:,j}\|_2^2}{\sigma^2}\right), & \text{if } X_{:,i}\in N(X_{:,j}) \text{ or } X_{:,j}\in N(X_{:,i}),\\[1ex]
0, & \text{otherwise},
\end{cases}
\tag{10.1}
$$

where i, j = 1, ..., n, X_{:,i} denotes the i-th column of the data matrix, which represents the i-th data point, N(X_{:,i}) denotes the k-nearest-neighbor set of X_{:,i}, and σ is the bandwidth parameter of the Gaussian function. The parameter-free method is defined as follows:

$$
W^{S}_{ij}=
\begin{cases}
\dfrac{e_{i,k+1}-e_{i,j}}{k\,e_{i,k+1}-\sum_{h=1}^{k} e_{i,h}}, & \text{if } X_{:,i}\in N(X_{:,j}) \text{ or } X_{:,j}\in N(X_{:,i}),\\[1ex]
0, & \text{otherwise},
\end{cases}
\tag{10.2}
$$

where k is the number of neighbors and $e_{i,j}=\|X_{:,i}-X_{:,j}\|_2^2$.

The graph Laplacian matrix of the data graph is $L^{S}=D^{S}-W^{S}$, where $D^{S}$ is a diagonal matrix with $[D^{S}]_{ii}=\sum_{j}W^{S}_{ij}$. Similarly, we construct a k-nearest neighbor graph in the feature space. The vertex set of this graph is the feature set $\{X_{1,:}^{T},\ldots,X_{d,:}^{T}\}$. The Gaussian function is defined as follows:

$$
W^{P}_{ij}=
\begin{cases}
\exp\!\left(-\dfrac{\|X_{i,:}^{T}-X_{j,:}^{T}\|_2^2}{\sigma^2}\right), & \text{if } X_{i,:}^{T}\in N(X_{j,:}^{T}) \text{ or } X_{j,:}^{T}\in N(X_{i,:}^{T}),\\[1ex]
0, & \text{otherwise},
\end{cases}
\tag{10.3}
$$

The parameter-free method is defined as follows:

$$
W^{P}_{ij}=
\begin{cases}
\dfrac{e_{i,k+1}-e_{i,j}}{k\,e_{i,k+1}-\sum_{h=1}^{k} e_{i,h}}, & \text{if } X_{i,:}^{T}\in N(X_{j,:}^{T}) \text{ or } X_{j,:}^{T}\in N(X_{i,:}^{T}),\\[1ex]
0, & \text{otherwise},
\end{cases}
\tag{10.4}
$$

where $e_{i,j}=\|X_{i,:}^{T}-X_{j,:}^{T}\|_2^2$, i, j = 1, ..., d, and X_{i,:} denotes the i-th row of the data matrix, which represents the i-th feature.
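A minimal NumPy sketch of the data-graph construction described by Eqs. (10.1) and (10.2) might look as follows; applying the same routine to the transposed data matrix would give the feature graph of Eqs. (10.3) and (10.4). The neighborhood size and bandwidth are illustrative defaults.

```python
import numpy as np

def data_graph(X, k=5, sigma=1.0, parameter_free=False):
    """Build the k-NN weight matrix W^S over the columns of X (each column is a sample),
    and the corresponding graph Laplacian L^S = D^S - W^S."""
    n = X.shape[1]
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # (n, n) squared distances e_ij
    order = np.argsort(D2, axis=1)                            # row i: samples sorted by distance to i
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = order[i, 1:k + 1]                              # k nearest neighbors, skipping i itself
        if parameter_free:
            e_k1 = D2[i, order[i, k + 1]]                     # distance to the (k+1)-th neighbor
            denom = k * e_k1 - D2[i, nbrs].sum()
            W[i, nbrs] = (e_k1 - D2[i, nbrs]) / max(denom, 1e-12)   # Eq. (10.2)
        else:
            W[i, nbrs] = np.exp(-D2[i, nbrs] / sigma ** 2)    # Gaussian weights, Eq. (10.1)
    W = np.maximum(W, W.T)                                    # keep an edge if either point is a neighbor
    L = np.diag(W.sum(axis=1)) - W                            # L^S = D^S - W^S
    return W, L
```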

which represents the i-th feature. The graph Laplacian matrix of the feature graph is P LP ¼ DPWP, where DP is a diagonal matrix, and ½DP ii ¼ j ½WP ij . We obtain the similarity matrix and Laplacian matrix of the data space and the feature space. Next, we use these matrices to carry out dual-graph nonnegative spectral learning, which means that we need to embed the data from the high-dimensional data and feature spaces into low-dimensional spaces. More specifically, we transform the original data X:,i ˛ > eij  p ffiffiffiffiffiffi p ffiffiffiffiffiffi min þ m K  kFi  Yi k2 :   > 1 > : F 2 i;j¼1  Dii Djj  i¼1

(11.5)

In other words, the following two stages can be used to solve the problem effectively: the first stage, which involves the single variable U, computes the desired enhancement matrix U from the given pairwise constraints; the second stage uses the similarity matrix K̃ learned in the first stage to carry out classification and label assignment.

11.1.3 Modified fixed point algorithm

In the first stage, the optimization problem is as follows:

$$
\min_{U}\ \mu\|U\|_{*}+\frac{1}{2}\left\|M\odot\left(QUQ^{T}-Z\right)\right\|_{F}^{2},\qquad \text{s.t. } U\succeq 0.
\tag{11.6}
$$
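Assuming the masked-residual reading of Eq. (11.6) used above (M acting entrywise on the fitting term), the objective can be evaluated as in this short sketch:

```python
import numpy as np

def objective(U, Q, Z, M, mu):
    """g(U) = mu * ||U||_* + 0.5 * ||M o (Q U Q^T - Z)||_F^2, with o the entrywise product."""
    nuclear = np.abs(np.linalg.eigvalsh(U)).sum()   # nuclear norm of a symmetric matrix
    resid = M * (Q @ U @ Q.T - Z)                   # residual restricted to the constrained entries
    return mu * nuclear + 0.5 * (resid ** 2).sum()
```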

The nuclear norm minimization problem can be converted into a semidefinite programming (SDP) problem. However, the time complexity of each iteration of a standard SDP solver based on the interior point method is at least O(m^6) [8]. Many first-order algorithms have been developed to solve such problems, for example FPCA [9], a fixed point continuation algorithm, which provably converges to the global optimal solution and outperforms SDP solvers in terms of matrix recoverability. Ni et al. [10] proposed an augmented Lagrange multiplier (ALM) method to solve the low-rank representation problem with PSD constraints. Our model is also a nuclear norm minimization problem with a PSD constraint. In this part, we propose a modified fixed point continuation (MFPC) algorithm with an eigenvalue thresholding operator to learn the enhanced matrix U. The MFPC algorithm reduces the number of auxiliary variables used in the ALM algorithm [11], which accelerates convergence. Inspired by the fixed point continuation algorithm proposed by Ma et al., which has been applied to multilabel transductive learning [12], we develop a modified fixed point iterative algorithm with an EVT operator to solve the proposed nuclear norm minimization problem.

Let $g(U):=\mu\|U\|_{*}+\frac{1}{2}\|M\odot(QUQ^{T}-Z)\|_{F}^{2}$. The subdifferential of g(·) with respect to U is given by

$$
\partial g=\mu\,\partial\|U\|_{*}+H,
$$

where $\partial\|U\|_{*}$ is the set of subgradients of the nuclear norm and $H=h(U):=Q^{T}\big(M\odot(QUQ^{T}-Z)\big)Q$. Following Ref. [13], an explicit expression for the subdifferential of the nuclear norm at a symmetric matrix is given by the following lemma.

Lemma 1. Let $U\in\mathbb{R}^{m\times m}$ be a real symmetric matrix. Then

$$
\partial\|U\|_{*}=\left\{V^{(1)}V^{(1)T}-V^{(2)}V^{(2)T}+S:\ V^{(1)T}S=0,\ V^{(2)T}S=0,\ \|S\|_{2}\le 1\right\},
$$

where $V^{(1)}$ and $V^{(2)}$ are the orthogonal eigenvectors associated with the positive and negative eigenvalues of U, respectively, and $\|\cdot\|_{2}$ denotes the spectral norm of a matrix. In addition, the following optimality condition from Ref. [14] can be adopted for the proposed nuclear norm minimization problem.

Theorem 1. Let g(·) be a convex function. Then U* is an optimal solution of the problem if and only if $U^{*}\succeq 0$ and there exists a matrix $E\in\partial g(U^{*})$ such that

$$
\langle E,\ F-U^{*}\rangle \ge 0 \quad \text{for all } F\succeq 0.
$$

Based on the above theorem, we can develop a modified fixed point iterative scheme for solving the problem by adopting the operator splitting technique.

The operator T(·) is defined as

$$
T(\cdot):=\tau\mu\,\partial\|\cdot\|_{*}+\tau h(\cdot),
$$

where τ > 0. T(·) can be split into two parts,

$$
T(\cdot)=T_{1}(\cdot)-T_{2}(\cdot),
$$

where $T_{1}(\cdot)=\tau\mu\,\partial\|\cdot\|_{*}+I(\cdot)$, $T_{2}(\cdot)=I(\cdot)-\tau h(\cdot)$, and I(·) is the identity operator. Let $Y=T_{2}(U)$; then $T(U)=\tau\mu\,\partial\|U\|_{*}+U-Y$ with $U\succeq 0$. To tackle the proposed model, we need to solve the following nuclear norm minimization problem:

$$
\min_{U\succeq 0}\ \tau\mu\|U\|_{*}+\frac{1}{2}\|U-Y\|_{F}^{2}.
\tag{11.7}
$$

This convex optimization problem has a closed-form optimal solution, given by the eigenvalue thresholding (EVT) operator defined below:

$$
U^{*}=\mathrm{EVT}_{\tau\mu}(Y).
$$

Thus, our modified fixed point scheme for solving the problem can be expressed by the following two-step iteration:

$$
\begin{cases}
Y^{k}=U^{k}-\tau\,h(U^{k}),\\
U^{k+1}=\mathrm{EVT}_{\tau\mu}(Y^{k}).
\end{cases}
\tag{11.8}
$$

Definition 1 (eigenvalue thresholding (EVT) operator). Assume $U=U^{T}\succeq 0$ and that its eigenvalue decomposition is given by $U=V\,\mathrm{diag}(\lambda)V^{T}$, where $V\in\mathbb{R}^{m\times r}$ and $\lambda\in\mathbb{R}_{+}^{r}$. Given ν > 0, EVT_ν(·) is defined as

$$
\mathrm{EVT}_{\nu}(U):=V\,\mathrm{diag}\big(\max\{\lambda-\nu,\,0\}\big)V^{T},
\tag{11.9}
$$

where max{·,·} should be understood element-wise.

Theorem 2. Suppose a symmetric matrix $U^{*}\succeq 0$ satisfies

$$
\text{1. } \|M\odot(QU^{*}Q^{T}-Z)\|_{F}^{2}<\bar{\mu}/\mu \ \text{ for a small positive constant } \bar{\mu}; \qquad
\text{2. } U^{*}=\mathrm{EVT}_{\tau\mu}\big(U^{*}-\tau h(U^{*})\big).
\tag{11.10}
$$

Then U* is the unique optimal solution of the problem.

Proof. Please refer to Refs. [9,14].
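A direct NumPy realization of the EVT operator in Eq. (11.9) is sketched below; it simply shrinks the eigenvalues of a symmetric matrix and clips them at zero.

```python
import numpy as np

def evt(U, nu):
    """Eigenvalue thresholding: EVT_nu(U) = V diag(max(lambda - nu, 0)) V^T for symmetric U."""
    lam, V = np.linalg.eigh(U)                 # eigendecomposition of the symmetric matrix
    lam = np.maximum(lam - nu, 0.0)            # shrink eigenvalues and clip at zero
    return (V * lam) @ V.T                     # reassemble a positive semidefinite matrix
```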


11.1.4 Implementation

We develop a modified fixed point iterative scheme to learn the enhanced matrix U with a PSD constraint. As suggested in Refs. [9,14], a continuation technique can accelerate the convergence of the fixed point iterative method, where the parameter β determines the rate of reduction between consecutive values of μ_k:

$$
\mu_{k+1}=\max\{\mu_{k}\beta,\ \bar{\mu}\},
\tag{11.11}
$$

where μ̄ is a moderately small constant. Thus, the continuation strategy is also adopted by our modified fixed point algorithm, which solves a sequence of problems, from easy to difficult, corresponding to a sequence of values of μ_k from large to small.

In Ref. [9], the parameter τ is always set to 1. In contrast, for the proposed fixed point continuation algorithm it is set to $\tau\in(0,\,2/\|J\|_{2})$ so that the convergence of our algorithm is guaranteed, where $J:=(Q^{T}\otimes Q^{T})\,I_{\Omega}\,(Q\otimes Q)$, ⊗ denotes the Kronecker product of two matrices, and $I_{\Omega}\in\mathbb{R}^{n^{2}\times n^{2}}$ is a diagonal matrix whose entries associated with the constrained (observed) positions are set to 1, and 0 otherwise. There are many ways to select the parameter τ to accelerate the convergence of gradient algorithms for compressive sensing tasks. We now specify a strategy based on the Barzilai-Borwein (BB) method [15] for choosing the parameter τ_k. The shrinkage iteration first takes a gradient descent step with step size τ_k along the negative gradient direction $h^{k}$ of the smooth function $\|M\odot(QUQ^{T}-Z)\|_{F}^{2}/2$, and then applies the EVT operator EVT_ν(·) to accommodate the nonsmooth term $\|U\|_{*}$. Therefore, it is natural to choose the parameter τ_k based on the function $\|M\odot(QUQ^{T}-Z)\|_{F}^{2}/2$ alone.

Let $H^{k}=Q^{T}\big(M\odot(QU^{k}Q^{T}-Z)\big)Q$, $\Delta U=U^{k}-U^{k-1}$, and $\Delta h=H^{k}-H^{k-1}$. Then the BB step is defined by

$$
\tau_{k}=\frac{\langle \Delta U,\ \Delta h\rangle}{\langle \Delta h,\ \Delta h\rangle}
\quad\text{or}\quad
\tau_{k}=\frac{\langle \Delta U,\ \Delta U\rangle}{\langle \Delta U,\ \Delta h\rangle}.
$$

To avoid the BB step size τ_k being either too small or too large, we take

$$
\tau_{k}=\max\{\tau_{\min},\ \min\{\tau_{k},\ \tau_{\max}\}\},
\tag{11.12}
$$

where $0<\tau_{\min}<\tau_{\max}<\infty$ are two constants. Because our ultimate goal is to learn the enhanced matrix U, an exact solution of the problem is not required. Therefore, we use the following criterion as a stopping rule:

$$
\frac{\|U^{k+1}-U^{k}\|_{F}}{\max\{1,\ \|U^{k}\|_{F}\}}<\mathrm{tol},
\tag{11.13}
$$

where tol is a small positive number. Experiments show that $\mathrm{tol}=10^{-4}$ is good enough for obtaining the optimal matrix U*. Based on the previous analysis, we develop a modified fixed point continuation (MFPC) algorithm to learn the enhanced matrix U, as listed in Algorithm 1.

Theorem 3. The sequence {U^k} generated by our modified fixed point iterations with $\tau\in(0,\,2/\|J\|_{2})$ converges to some U ∈ Γ, where Γ is the set of optimal solutions of the problem.

Proof. Please refer to Refs. [9,14]. Thus, our modified fixed point continuation algorithm converges to an optimal solution of the problem.

Algorithm 1: MFPC algorithm
Input: A data set of n instances X = {x_1, x_2, ..., x_l, x_{l+1}, ..., x_n}, where X_L = {x_i}_{i=1}^{l} are labeled and X_U = {x_i}_{i=l+1}^{n} are unlabeled; ML = {(x_i, x_j)}, the set of must-link constraints; CL = {(x_i, x_j)}, the set of cannot-link constraints; the number of nearest neighbors T; and the constant μ̄.
Output: The enhanced matrix U.
Initialize: Given M, U^0, μ̄, β, and tol, select μ_1 > μ_2 > ... > μ_L = μ̄ > 0.
1. Construct the T-NN graph and compute the normalized graph Laplacian L = I − D^{-1/2} W D^{-1/2}.
2. Compute the m eigenvectors φ_1, ..., φ_m of L associated with the first m smallest eigenvalues, and form the spectral representation matrix Q = [φ_1, ..., φ_m] ∈ ℝ^{n×m}.
for μ_k = μ_1, μ_2, ..., μ_L do
    while not converged do
        1. Choose the BB step size τ_k.
        2. Update Y^k: H^k = Q^T (M ⊙ (Q U^k Q^T − Z)) Q, Y^k = U^k − τ_k H^k.
        3. Update U^{k+1}: U^{k+1} = EVT_{τ_k μ_k}(Y^k).
        Stop condition: ‖U^{k+1} − U^k‖_F / max{1, ‖U^k‖_F} < tol.
    end while
end for
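A compact sketch of the MFPC loop, under the assumptions made above (entrywise mask M, BB step size clipped as in Eq. (11.12), illustrative defaults), is given below; it is meant only to show the structure of Algorithm 1, not to reproduce the authors' implementation.

```python
import numpy as np

def evt(A, nu):
    """Eigenvalue thresholding operator of Eq. (11.9)."""
    lam, V = np.linalg.eigh(A)
    return (V * np.maximum(lam - nu, 0.0)) @ V.T

def mfpc(Q, Z, M, mu_seq, tol=1e-4, max_iter=500, s_min=1e-4, s_max=1e4):
    """Structure of Algorithm 1: BB gradient step on the masked fitting term,
    eigenvalue thresholding, and continuation over a decreasing mu sequence."""
    m = Q.shape[1]
    U = np.zeros((m, m))
    U_prev, H_prev, tau = None, None, 1.0
    grad = lambda A: Q.T @ (M * (Q @ A @ Q.T - Z)) @ Q            # h(U) in the text
    for mu in mu_seq:                                             # mu_1 > mu_2 > ... > mu_L
        for _ in range(max_iter):
            H = grad(U)
            if H_prev is not None:                                # BB step size, Eq. (11.12)
                dU, dH = U - U_prev, H - H_prev
                tau = float(np.clip((dU * dH).sum() / max((dH * dH).sum(), 1e-12), s_min, s_max))
            U_prev, H_prev = U, H
            U = evt(U - tau * H, tau * mu)                        # shrink eigenvalues of the step
            if np.linalg.norm(U - U_prev, 'fro') / max(1.0, np.linalg.norm(U_prev, 'fro')) < tol:
                break
    return U
```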

11.1.5 Label propagation

The enhanced spectral kernel K has been constructed using the proposed MFPC algorithm, and we now take advantage of it to predict the labels of the unlabeled instances. We also present a semisupervised learning method with the enhanced spectral kernel, called ESK, as shown in Algorithm 2. Here, our iteration equation can be written as follows:

$$
F^{t+1}=\alpha P F^{t}+(1-\alpha)Y,
\tag{11.14}
$$

where $P=\tilde{D}^{-1/2}K\tilde{D}^{-1/2}$ and $\tilde{D}$ is the diagonal degree matrix of K. We use this equation to update the labels of each data point until convergence. We give a toy example to illustrate how our ESK algorithm works. At first glance, the toy data consist of three separate groups, composed of a mixture of Gaussian-like and curve-like groups, as shown in Fig. 11.1A. Moreover, we also present a comparison between the similarity matrix in the input space and the kernel matrices learned by OSK, TSK, and MFPC, where the data are ordered such that all the instances in the two Gaussian-like groups appear first and all the instances in the curve-like group appear second. It can be clearly observed that the enhanced kernel matrix learned by our MFPC algorithm exhibits two clear block structures, so that the two classes form well-separated groups. In addition, we can draw a similar conclusion as in Ref. [16]: the kernel matrices learned by OSK and TSK have some uninformative eigenvectors even though they are optimally combined according to their own optimization criteria, and they fail to classify the data points into the proper classes.

Algorithm 2: ESK algorithm
Input: The enhanced matrix U and the constant α.
Output: The assigned labels of all the data points.
1. Obtain the enhanced matrix U by solving the problem via the proposed MFPC algorithm.
2. Construct the enhanced spectral kernel matrix K = QUQ^T, and iterate Eq. (11.14) until convergence.
3. Let F* be the limit of the sequence {F^t}, and assign the label of each data point x_i by $y(x_i)=\arg\max_{k\le c} F^{*}_{ik}$.
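The propagation step of Algorithm 2, i.e., iterating Eq. (11.14) with the symmetrically normalized enhanced kernel, can be sketched as follows; the value of α and the iteration count are illustrative.

```python
import numpy as np

def esk_propagate(K, Y, alpha=0.99, n_iter=200):
    """Iterate F <- alpha * P F + (1 - alpha) * Y, with P the symmetrically normalized kernel."""
    d = np.maximum(K.sum(axis=1), 1e-12)
    P = K / np.sqrt(np.outer(d, d))                 # D^{-1/2} K D^{-1/2}
    F = Y.astype(float).copy()                      # Y: (n, c) one-hot rows for labeled points
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1 - alpha) * Y       # Eq. (11.14)
    return F.argmax(axis=1)                         # predicted class index for every point
```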

11.1.6 Valid kernel

Theorem 4. If a normalized graph Laplacian $L\in\mathbb{R}^{n\times n}$ has first m eigenvectors $\phi_1,\ldots,\phi_m$ corresponding to the m smallest eigenvalues, and the enhanced matrix U obtained by solving the problem for ESK is symmetric positive semidefinite, then the matrix $K=QUQ^{T}$ is a valid kernel matrix.

Proof: Because $U=U^{T}\succeq 0$, we have $U=U^{1/2}(U^{1/2})^{T}$ and $K=QUQ^{T}=(QU^{1/2})(QU^{1/2})^{T}$, so the enhanced spectral kernel matrix K is positive semidefinite and thus a valid kernel matrix.

Remark: Similar to existing spectral kernel learning approaches such as OSK and TSK, the kernel matrix K learned by the proposed MFPC algorithm is a nonparametric spectral kernel derived from the graph Laplacian L, and is referred to as the enhanced spectral kernel. Hence, the enhanced spectral kernel can be used in traditional kernel machines such as SVMs.
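In practice, the validity of the learned kernel can be checked directly, as in this small sketch (names are illustrative):

```python
import numpy as np

def enhanced_kernel(Q, U):
    """K = Q U Q^T; positive semidefinite whenever U is symmetric PSD (Theorem 4)."""
    K = Q @ U @ Q.T
    assert np.linalg.eigvalsh((K + K.T) / 2).min() > -1e-8   # numerical PSD check
    return K
```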


Figure 11.1 Classification results on the toy data set. (A) Toy data set with two labeled points and 1 ML constraint. (B)–(E) Classification results using LGC with σ = 0.2, OSK, TSK, and the proposed ESK algorithm with only one iteration. (F) Similarity matrix for the toy data set in the input space. (G)–(I) Learned kernel matrices by OSK, TSK, and MFPC with the m = 5 smoothest eigenvectors of graph Laplacians and a neighborhood size T = 7. The brighter a pixel, the greater the similarity it represents.

11.2 Experiments and analysis

In this section, we present a set of experiments on a synthetic data set and a number of real-world data sets under transductive settings.

11.2.1 Compared algorithms and parameter settings

We compared the performance of the proposed ESK algorithm with existing state-of-the-art SSL algorithms and related SSL methods (Fig. 11.2).

• We use one-versus-rest SVMs (SVM) [17] as the baseline. The width of the RBF kernel for SVM is set using fivefold cross-validation.
• GRF [18] and LGC [19]. The affinity matrix is constructed by a Gaussian function whose width is set by fivefold cross-validation.
• LapSVM [20]: The base kernel is also selected to be Gaussian, whose width is set by fivefold cross-validation, and all of the other hyperparameters are set by grid search.
• TSK [21] and OSK [3]. For TSK, the decay factor γ is set to 2, and the other parameters are set as in our ESK algorithm. For OSK, all parameters are set as in our ESK algorithm, and the bandwidth of the RBF kernel is learned using fivefold cross-validation.


Figure 11.2 Classification results on the two-intersecting-spiral data set: (A) GRF; (B) LGC; (C) LapSVM; (D) OSK; (E) TSK; (F) GTG; (G) ESK.

• GTG [22]. For GTG, its kernel width is searched from the set linspace(0.1r, r, 5) ∪ linspace(r, 10r, 5), with r being the average distance from each data point to its 20th nearest neighbor, and linspace(a, b, t) denoting the set of t linearly equally spaced numbers between a and b. Its other parameters are set to their defaults.
• The proposed ESK algorithm. In all of the experiments, we set the constant to 1, and the number of nearest neighbors is set by grid search.

11.2.2 Synthetic data

In this part, we use a toy example to illustrate the effectiveness of the proposed ESK algorithm (Fig. 11.2). The original data set is a two-intersecting-spiral data set used in Ref. [23]. Two data points are labeled initially. In addition, there is one must-link constraint and one cannot-link constraint, denoted by the solid and dashed blue lines, respectively. We can see that all of the other methods wrongly predict the labels of some data points; four of the methods, GRF, LGC, LapSVM, and OSK, cannot leverage the given pairwise constraints. Although TSK and GTG can take advantage of pairwise constraints, their classification performance is also relatively worse than that of the proposed ESK algorithm. The proposed ESK algorithm yields significantly better classification results than the other existing methods on this toy data set, where the neighborhood size k is set to 11 and the number of smoothest eigenvectors of the graph Laplacian m is set to 5. Moreover, we demonstrate the efficiency of the proposed ESK algorithm by gradually increasing the number of data points. For comparison, we also illustrate the


Figure 11.3 Comparison of the time consumption for GRF, LGC, LapSVM, OSK, TSK, GTG, and the proposed ESK algorithm on the two-intersecting-spiral data set.

average time consumption of GRF, LGC, LapSVM, OSK, TSK, and GTG. As the sample size increases, the proposed ESK algorithm, OSK, and TSK have similar time costs and are much faster than GRF, LGC, and LapSVM.

11.2.3 Real-world data sets

In our experiments, three types of real-world data sets were selected to cover a wide range of attributes. Specifically, these data sets include:

• UCI data. We perform experiments on five UCI data sets, Ionosphere, Balance, Sonar, Iris, and Glass, and an artificial data set, G50c.
• Image data. We perform experiments on five image data sets: MNIST [24], USPS, COIL20 [25], ORL, and YaleB3 [26]. For the MNIST0123 data set, the test subset of the well-known MNIST handwritten digit data was chosen, and we selected digits 0, 1, 2, and 3 as four classes with 980, 1135, 1032, and 1010 instances, respectively, for a total of 4157. For the USPS0123 data set, digits 0, 1, 2, and 3 were chosen as four classes with 1194, 1005, 731, and 658 examples, respectively, for a total of 3588. For the USPS1479 data set, we selected digits 1, 4, 7, and 9 as four classes with 1005, 652, 645, and 644 examples, respectively, for a total of 2946. For the YaleB3 data set, a subset of the Yale Face Database B was chosen [27]; we used the images of individuals 2, 5, and 10 and downsampled each image to 30 × 40 pixels, which gives 1755 images with 1200 dimensions.
• Text data. We also perform experiments on two text data sets: 20-Newsgroup and WebKB. For the 20-News data set, the topic rec, which contains autos, motorcycles, baseball, and hockey, from the version 20-News-18,828 was chosen. For the Text1 data set, we chose the topics mac and ms-windows of the 20-Newsgroup data set, preprocessed as in Ref. [28]. For the WebKB data set, a subset consisting of about 6000 Web pages from the computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin) was chosen.

The basic information from those data sets together with additional randomly chosen pairwise constraints is summarized in Table 11.1, where we randomly generate two or five must-link constraints for each class and two or five cannot-link constraints for every two classes, respectively.
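The constraint-generation protocol described above can be sketched as follows; the function name and the defaults are illustrative.

```python
import numpy as np

def sample_constraints(labels, n_ml_per_class=2, n_cl_per_pair=2, rng=None):
    """Draw random must-link pairs within each class and cannot-link pairs across class pairs."""
    rng = np.random.default_rng(rng)
    classes = np.unique(labels)
    ml, cl = [], []
    for c in classes:                                     # must-links inside each class
        idx = np.flatnonzero(labels == c)
        for _ in range(n_ml_per_class):
            ml.append(tuple(rng.choice(idx, size=2, replace=False)))
    for a in range(len(classes)):                         # cannot-links for every pair of classes
        for b in range(a + 1, len(classes)):
            ia = np.flatnonzero(labels == classes[a])
            ib = np.flatnonzero(labels == classes[b])
            for _ in range(n_cl_per_pair):
                cl.append((int(rng.choice(ia)), int(rng.choice(ib))))
    return ml, cl
```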

11.2.4 Transduction classification results

Tables 11.2–11.4 show the performance of the existing state-of-the-art SSL methods and of the proposed ESK algorithm on these real-world data sets and the G50c data set, where the best performance on each data set is shown in bold. Here, we only use

Table 11.1: Descriptions of three categories of data sets.

Category | Data       | Class | Feature | Labeled | Size | Num_M | Num_C
---------|------------|-------|---------|---------|------|-------|------
UCI      | G50c       | 2     | 50      | 20      | 550  | 10    | 5
UCI      | Ionosphere | 2     | 33      | 20      | 351  | 10    | 5
UCI      | Sonar      | 2     | 60      | 20      | 208  | 10    | 5
UCI      | Balance    | 3     | 4       | 20      | 625  | 15    | 15
UCI      | Iris       | 3     | 4       | 20      | 150  | 15    | 15
UCI      | Glass      | 6     | 9       | 30      | 214  | 12    | 30
Images   | MNIST0123  | 4     | 784     | 8       | 4157 | 8     | 12
Images   | USPS0123   | 4     | 256     | 8       | 3588 | 8     | 12
Images   | USPS1479   | 4     | 256     | 8       | 2946 | 8     | 12
Images   | COIL20     | 20    | 1024    | 40      | 1440 | 40    | 100
Images   | ORL        | 40    | 1024    | 80      | 400  | 80    | 200
Images   | Yale3      | 3     | 1200    | 3       | 1755 | 6     | 6
Text     | 20-News    | 4     | 8014    | 40      | 3970 | 20    | 30
Text     | Text1      | 2     | 7511    | 20      | 1946 | 20    | 10
Text     | WK-CL      | 7     | 4134    | 70      | 827  | 14    | 42
Text     | WK-TX      | 7     | 4029    | 70      | 814  | 14    | 42
Text     | WK-WT      | 7     | 4165    | 70      | 1166 | 14    | 42
Text     | WK-WC      | 7     | 4189    | 70      | 1210 | 14    | 42

Note that num_M and num_C denote the numbers of randomly chosen must-link constraints and cannot-link constraints, respectively.

the given labeled data (denoted ESK_L, where the pairwise constraints are derived from the given labeled data) to compare the performance of the proposed ESK algorithm with the five existing SSL methods and SVM. In addition, we use the sparse labeled data together with additional pairwise constraints (denoted ESK_LC) to report the classification results of the proposed ESK algorithm. Using SVM as the final classifier and following Ref. [21], we also use the given labeled data to compare the enhanced spectral kernel with two competitive spectral kernels, OSK and TSK. From these tables, we can observe the following:

• GTG and LapSVM usually outperform SVM, GRF, and LGC, especially on the image data sets, since there are clear nonlinear underlying manifolds behind those data sets, and the LapSVM algorithm can make use of both the labeled data and the geometrical structure information contained in the data.
• TSK and the proposed ESK algorithm are often better than SVM, GRF, LGC, GTG, and LapSVM, since the flexible kernel from spectral transforms is more data-driven than a standard kernel, e.g., the Gaussian kernel. TSK and GTG have been shown to be very effective for text data sets [22].
• The proposed ESK algorithm usually performs as well as the best of the other algorithms. On the text data sets, ESK usually outperforms all the other state-of-the-art SSL algorithms. Additional pairwise constraints have been shown to consistently improve the

Table 11.2: Classification accuracies (mean and standard deviation, %) on UCI data sets.

Data | SVM | GRF | LGC | LapSVM | GTG | TSK | ESK_L | ESK_LC | OSK+SVM | TSK+SVM | ESK+SVM
G50c | 85.36 ± 2.46 | 57.36 ± 9.36 | 86.78 ± 2.44 | 86.65 ± 3.22 | 83.70 ± 9.78 | 92.69 ± 2.13 | 94.57 ± 0.25 | 94.64 ± 0.27 | 91.79 ± 3.25 | 93.09 ± 4.40 | 95.06 ± 0.21
Ionosphere | 74.54 ± 6.79 | 78.54 ± 6.84 | 83.10 ± 4.39 | 82.95 ± 1.84 | 78.89 ± 5.22 | 76.10 ± 7.19 | 84.72 ± 1.52 | 85.69 ± 1.33 | 82.43 ± 3.40 | 86.59 ± 3.15 | 87.43 ± 2.01
Sonar | 66.24 ± 4.83 | 60.82 ± 6.35 | 62.18 ± 6.03 | 68.24 ± 1.28 | 63.84 ± 4.31 | 64.31 ± 5.02 | 66.86 ± 1.43 | 67.54 ± 1.75 | 64.57 ± 1.66 | 69.68 ± 2.97 | 69.22 ± 2.13
Balance | 71.39 ± 6.14 | 67.46 ± 6.93 | 70.03 ± 8.19 | 63.86 ± 7.43 | 65.52 ± 5.04 | 68.41 ± 4.19 | 71.70 ± 4.06 | 72.55 ± 3.87 | 67.58 ± 8.64 | 72.82 ± 4.05 | 73.64 ± 5.61
Iris | 94.35 ± 2.42 | 93.81 ± 2.46 | 93.11 ± 2.40 | 95.42 ± 1.83 | 95.39 ± 4.67 | 93.89 ± 3.88 | 96.10 ± 1.36 | 96.74 ± 1.40 | 93.83 ± 4.29 | 94.16 ± 1.35 | 96.58 ± 1.45
Glass | 57.38 ± 3.64 | 57.81 ± 5.32 | 56.25 ± 3.92 | 60.11 ± 4.98 | 53.80 ± 8.36 | 57.80 ± 4.02 | 60.87 ± 3.63 | 62.21 ± 3.38 | 58.70 ± 5.94 | 61.96 ± 8.05 | 63.49 ± 4.62

Table 11.3: Classification accuracies (mean and standard deviation, %) on image data sets.

Data | SVM | GRF | LGC | LapSVM | GTG | TSK | ESK_L | ESK_LC | OSK+SVM | TSK+SVM | ESK+SVM
MNIST0123 | 74.06 ± 4.20 | 68.94 ± 6.03 | 80.13 ± 3.32 | 88.03 ± 1.69 | 91.55 ± 6.02 | 93.49 ± 0.73 | 95.56 ± 2.15 | 95.90 ± 1.86 | 81.35 ± 6.72 | 94.08 ± 4.23 | 96.17 ± 1.65
USPS0123 | 84.35 ± 4.27 | 77.09 ± 7.54 | 90.14 ± 4.14 | 87.88 ± 5.75 | 91.82 ± 5.24 | 95.25 ± 1.79 | 95.82 ± 0.93 | 96.45 ± 0.74 | 90.57 ± 4.11 | 95.79 ± 1.26 | 96.92 ± 3.08
USPS1479 | 75.54 ± 5.48 | 56.82 ± 8.28 | 77.15 ± 3.50 | 79.84 ± 4.15 | 82.25 ± 4.27 | 83.39 ± 2.61 | 84.73 ± 1.49 | 84.95 ± 1.36 | 78.72 ± 5.34 | 84.07 ± 1.93 | 85.35 ± 1.41
COIL20 | 74.96 ± 2.11 | 82.36 ± 2.76 | 80.38 ± 2.10 | 86.58 ± 1.53 | 78.54 ± 0.66 | 83.19 ± 1.29 | 87.75 ± 1.26 | 88.66 ± 2.45 | 86.52 ± 4.51 | 90.00 ± 3.93 | 90.39 ± 2.67
ORL | 76.82 ± 2.71 | 76.92 ± 2.77 | 76.40 ± 2.39 | 77.34 ± 2.60 | 76.98 ± 4.07 | 76.13 ± 3.02 | 78.91 ± 2.31 | 83.44 ± 2.27 | 76.34 ± 4.22 | 82.22 ± 8.63 | 83.51 ± 4.82
YALE3 | 91.00 ± 7.22 | 95.18 ± 6.92 | 94.68 ± 5.47 | 96.95 ± 0.51 | 91.79 ± 2.46 | 93.35 ± 3.92 | 96.87 ± 1.13 | 97.35 ± 1.22 | 95.14 ± 2.69 | 95.37 ± 7.75 | 97.60 ± 1.64

Table 11.4: Classification accuracies (mean and standard deviation, %) on text data sets.

Data | SVM | GRF | LGC | LapSVM | GTG | TSK | ESK_L | ESK_LC | OSK+SVM | TSK+SVM | ESK+SVM
20-News | 58.88 ± 8.17 | 71.63 ± 1.83 | 73.99 ± 2.48 | 74.36 ± 0.18 | 83.54 ± 4.04 | 86.07 ± 3.24 | 89.16 ± 0.74 | 89.32 ± 0.67 | 86.19 ± 4.55 | 89.30 ± 2.90 | 91.57 ± 0.76
Text1 | 76.05 ± 5.18 | 77.55 ± 9.79 | 74.89 ± 9.91 | 80.72 ± 1.51 | 79.35 ± 8.95 | 87.91 ± 2.93 | 89.65 ± 2.63 | 89.77 ± 2.25 | 82.74 ± 4.82 | 85.76 ± 3.63 | 89.93 ± 1.69
WK-CL | 73.00 ± 0.46 | 73.26 ± 0.36 | 73.15 ± 0.41 | 74.62 ± 0.80 | 74.61 ± 5.44 | 75.28 ± 2.62 | 79.53 ± 3.18 | 79.74 ± 2.85 | 76.63 ± 2.32 | 77.25 ± 1.67 | 81.76 ± 2.96
WK-TX | 71.92 ± 0.53 | 72.20 ± 0.41 | 71.86 ± 0.31 | 72.50 ± 0.52 | 70.75 ± 7.21 | 74.98 ± 3.99 | 78.60 ± 4.21 | 79.16 ± 3.42 | 75.50 ± 3.36 | 76.77 ± 2.82 | 79.61 ± 3.69
WK-WT | 79.48 ± 0.25 | 79.57 ± 0.28 | 79.40 ± 0.26 | 80.18 ± 0.23 | 79.76 ± 6.74 | 80.39 ± 1.92 | 82.93 ± 1.57 | 83.46 ± 1.60 | 81.48 ± 0.76 | 82.04 ± 2.63 | 83.78 ± 1.34
WK-WC | 75.36 ± 0.22 | 75.56 ± 0.29 | 75.40 ± 0.24 | 76.25 ± 0.28 | 74.43 ± 5.82 | 75.79 ± 1.70 | 80.49 ± 1.63 | 81.35 ± 1.49 | 77.86 ± 1.68 | 79.93 ± 0.45 | 82.26 ± 1.55

performance of the proposed ESK algorithm on all data sets, since ESK can take advantage of both the given labeled data and pairwise constraints together.
• ESK+SVM often significantly outperforms the other two learned spectral kernel machines, OSK+SVM and TSK+SVM.

In the second part of these experiments, we illustrate the classification accuracies of the proposed ESK algorithm on the G50c, USPS0123, and 20-News data sets with the number of randomly labeled points varying from 2 to 20, from 4 to 40, and from 4 to 40, respectively, and against a number of randomly chosen pairwise constraints with only one labeled data point in each class (Figs. 11.4–11.9). In the figures, the abscissa denotes the number of randomly labeled data points or chosen pairwise constraints (we guarantee that there is at least one labeled point in each class), and the ordinate is the classification accuracy averaged


Figure 11.4 Classification results of different algorithms against a number of randomly labeled data points: (A) the G50c data set, (B) the USPS0123 data set, and (C) the 20-News data set.


Figure 11.5 Classification results of the proposed ESK approach and LRK against a small number of randomly chosen pairwise constraints with only one labeled data point in each class: (A) the G50c data set, (B) the USPS0123 data set, and (C) the 20-News data set.

over 50 independent runs. For comparison, the classification results of five state-of-the-art SSL algorithms and SVM are also plotted in the corresponding figure. It can be clearly observed that the proposed ESK algorithm is very stable, that is, even when we only label a very small fraction of the data, it can still get high classification accuracies and consistently outperforms the other five algorithms with the same amount of labeled data. Moreover, as the number of sparse pairwise constraints grows, the classification accuracy of the proposed ESK algorithm and the graph Laplacian regularized kernel (LRK) method [5,29] can be considerably improved, and they both perform significantly better than the approach of manifold regularization with dissimilarity (MRD) [30]. In the third part of these experiments, we study the stability of the proposed ESK algorithm on three chosen datasets with respect to their parameters. In the proposed ESK


Figure 11.6 Parameter stability testing results on the G50c data set. (A) Classification results versus the parameter m (with the neighborhood size k = 45); (B) classification results versus the neighborhood size k. In these two figures, different lines represent the results of the proposed ESK algorithm with three different numbers of randomly labeled instances: 20, 10, and 5.


Figure 11.7 Parameter stability testing results on the USPS0123 data set. (A) Classification results versus the parameter m (with the neighborhood size k = 20); (B) classification results versus the neighborhood size k (with the parameter m = 5). In these two figures, different lines represent the results of the proposed ESK algorithm with three different numbers of randomly labeled instances: 40, 20, and 10.

algorithm, there are mainly two parameters: the neighborhood size k and the number of eigenvectors of the graph Laplacian in the matrix Q, m. We have also conducted a set of experiments to test the stability of our ESK algorithm. In the following figures, the ordinate represents the classification results averaged over 50 independent runs. From these figures, we can clearly see that:

• The proposed ESK algorithm is stable if there are adequately many labeled instances and the number of chosen eigenvectors of the graph Laplacian, m, is not too large. This


Figure 11.8 Parameter stability testing results on the 20-News data set. Classification results versus the parameter m (with the neighborhood size k = 9). In the figure, different lines represent the results of the proposed ESK algorithm with three different numbers of randomly labeled instances: 40, 20, and 10.


Figure 11.9 The average transductive and inductive classification accuracies on three data sets: (A) the G50c data set, (B) the USPS0123 data set, and (C) the 20-News data set.


also reveals that the graph Laplacian kernel is essentially low rank, and the remaining eigenvectors can be seen as a noise component [29]. Doubtless, m is easy to tune since it is selected from positive integers.
• The proposed ESK algorithm is stable if the neighborhood size k is not too large [31]. The graph Laplacian is more likely to fail to discover the underlying manifold structure as k increases. Similarly, k is easy to tune.

References

[1] Li Z, Liu J, Tang X. Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In: ICML; 2008. p. 576–83.
[2] Hoi S, Jin R, Lyu M. Learning nonparametric kernel matrices from pairwise constraints. In: ICML; 2007. p. 361–8.
[3] Zhu X, Kandola JS, Ghahramani Z, Lafferty JD. Nonparametric transforms of graph kernels for semi-supervised learning. In: NIPS; 2005. p. 1641–8.
[4] Liu W, Qian B, Cui J, Liu J. Spectral kernel learning for semi-supervised classification. In: IJCAI; 2009. p. 1150–5.
[5] Wu X-M, So A, Li Z, Li S. Fast graph Laplacian regularized kernel learning via semidefinite-quadratic-linear programming. In: NIPS; 2009. p. 1964–72.
[6] Weinberger KQ, Sha F, Zhu Q, Saul LK. Graph Laplacian regularization for large-scale semidefinite programming. In: NIPS; 2007. p. 1489–96.
[7] Shang F, Liu Y, Wang F. Learning spectral embedding for semisupervised clustering. In: ICDM; 2011. p. 597–606.
[8] Liu Z, Vandenberghe L. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications 2010;31(3):1235–56.
[9] Ma S, Goldfarb D, Chen L. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming 2011;128(1):321–53.
[10] Ni Y, Sun J, Yuan X, Yan S, Cheong L. Robust low-rank subspace segmentation with semidefinite guarantees. In: ICDM; 2010. p. 1179–88.
[11] Lin Z, Chen M, Wu L. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Submitted to Mathematical Programming; 2009.
[12] Goldberg B, Zhu X, Recht B, Xu J, Nowak R. Transduction with matrix completion: three birds with one stone. In: NIPS; 2010.
[13] Watson G. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications 1992;170:33–45.
[14] Ma Y, Zhi L. The minimum-rank Gram matrix completion via modified fixed point continuation method. In: ISSAC; 2011. p. 241–8.
[15] Barzilai J, Borwein J. Two-point step size gradient methods. IMA Journal of Numerical Analysis 1988;8:141–8.
[16] Hu E, Chen S, Zhang D, Yin X. Semisupervised kernel matrix learning by kernel propagation. IEEE Transactions on Neural Networks 2010;21(11):1831–41.
[17] Chang C, Lin C. LIBSVM: a library for support vector machines. 2001. Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[18] Zhu X, Ghahramani Z, Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: Proc. 20th int'l conf. machine learning; 2003. p. 912–9.
[19] Zhou D, Bousquet O, Lal T, Weston J, Schölkopf B. Learning with local and global consistency. In: Advances in neural information processing systems; 2004. p. 321–8.
[20] Melacci S, Belkin M. Laplacian support vector machines trained in the primal. Journal of Machine Learning Research 2011;12:1149–84.
[21] Liu W, Qian B, Cui J, Liu J. Spectral kernel learning for semi-supervised classification. In: Proc. 21st int'l joint conf. artificial intelligence; 2009. p. 1150–5.
[22] Erdem MP. Graph transduction as a non-cooperative game. Neural Computation 2012;24(3):700–23.
[23] Souvenir R, Pless R. Manifold clustering. In: Proc. 10th int'l conf. computer vision; 2005. p. 648–54.
[24] LeCun Y, Cortes C. The MNIST database of handwritten digits. 2009. Available at: http://yann.lecun.com/exdb/mnist/.
[25] Nene SA, Nayar SK, Murase J. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96. Columbia Univ.; 1996.
[26] Vidal R, Ma Y, Piazzi J. A new GPCA algorithm for clustering subspaces by fitting, differentiating and dividing polynomials. In: Proc. IEEE conf. computer vision and pattern recognition; 2004. p. 510–7.
[27] Georghiades AS, Belhumeur PN, Kriegman DJ. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001;23(6):643–60.
[28] Szummer M, Jaakkola T. Partially labeled classification with Markov random walks. In: Advances in neural information processing systems; 2002. p. 945–52.
[29] Li Z, Liu J, Tang X. Constrained clustering via spectral regularization. In: Proc. IEEE conf. computer vision and pattern recognition; 2009. p. 421–8.
[30] Goldberg, Zhu X, Wright S. Dissimilarity in graph-based semi-supervised classification. In: Proc. 16th int'l conf. artificial intelligence and statistics; 2007. p. 155–62.
[31] Shang F, Jiao LC, Liu Y, et al. Semi-supervised learning with nuclear norm regularization. Pattern Recognition 2013;46(8):2323–36.

C H A P T E R 12

Fast clustering methods based on learning spectral embedding

Chapter Outline
12.1 Learning spectral embedding for semisupervised clustering
    12.1.1 Graph construction and spectral embedding
        12.1.1.1 Symmetry-favored graph
        12.1.1.2 Spectral embedding of graph Laplacian
    12.1.2 Problem formulation
        12.1.2.1 The unit hypersphere
        12.1.2.2 Squared loss model
        12.1.2.3 Hinge loss model
        12.1.2.4 Clustering
    12.1.3 Algorithm
    12.1.4 Experiments
        12.1.4.1 Parameter selection
        12.1.4.2 Vector-based clustering
        12.1.4.3 Graph-based clustering
12.2 Fast semisupervised clustering with enhanced spectral embedding
    12.2.1 Problem formulation
        12.2.1.1 Objective function
        12.2.1.2 Solving the objective function
        12.2.1.3 Clustering
    12.2.2 Algorithm
        12.2.2.1 Experimental results
        12.2.2.2 Parameter selection
        12.2.2.3 Toy examples
        12.2.2.4 Vector-based clustering
        12.2.2.5 Graph-based clustering
References

Learning data representation is a fundamental problem in data mining and machine learning. In recent years, semisupervised clustering (SSC) has aroused considerable interest from the machine-learning and data-mining communities. Spectral embedding is a popular method for learning effective data representations. In this book, we introduce learning spectral embedding via iterative eigenvalue thresholding and apply it to semisupervised clustering [1,2].


12.1 Learning spectral embedding for semisupervised clustering

In this chapter, a semisupervised clustering method based on enhanced spectral embedding (ESE) is proposed, which not only takes into account the structural information contained in the data set, but also makes use of prior information (such as pairwise constraints). In particular, we first construct a symmetry-favored k-NN graph, which is robust to noisy objects and can reflect the underlying manifold structure of the data. Then, we learn an enhanced spectral embedding toward an ideal representation that conforms to the pairwise constraints as much as possible. Finally, by using the Laplacian regularization method, learning the spectral representation is formulated as semidefinite quadratic linear programming (SQLP) under the squared loss function, and as small semidefinite programs (SDPs) under the hinge loss function. Both can be solved effectively.

12.1.1 Graph construction and spectral embedding

Graph-based semisupervised methods presume that data points are represented in the form of undirected or directed graphs. Although graph-based semisupervised clustering has been extensively studied, it often lacks sufficient robustness in real-world clustering tasks [3]. Given a data set of n objects $X = \{x_1, \ldots, x_n\}$, we aim to implement semisupervised clustering taking as input an undirected weighted graph $G = (V, E, W)$, where the set of nodes V represents the n data points X, E is a set of edges connecting adjacent data points, and W contains the weights that capture pairwise similarities between data points.

12.1.1.1 Symmetry-favored graph

For convenience, we adopt the local scaling parameter trick proposed in Ref. [4] to construct a graph G such as a k-nearest neighbor (k-NN) graph. Let us define a selecting function of the local scale
$$h(x) = \left\| x - x^{(T)} \right\|, \qquad (12.1)$$
where $x^{(T)}$ is the T-th nearest neighbor of x in X. The asymmetric weight matrix $A \in \mathbb{R}^{n \times n}$ associated with G is subsequently formed as
$$A_{ij} = \begin{cases} \exp\!\left( -\dfrac{\|x_i - x_j\|^2}{h(x_i)\,h(x_j)} \right), & \text{if } x_j \in N(x_i), \\[4pt] 0, & \text{otherwise,} \end{cases} \qquad (12.2)$$
where $N(x_i)$ denotes the set of k-nearest neighbors of $x_i$.

Based on the matrix A, we can define the weighted adjacency matrix of G as follows:
$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N(x_j) \text{ and } x_j \in N(x_i), \\ A_{ji}, & \text{if } x_i \in N(x_j) \text{ and } x_j \notin N(x_i), \\ A_{ij}, & \text{otherwise.} \end{cases} \qquad (12.3)$$
Obviously, W is symmetric. Note that we set $W_{ii} = 0$ to avoid self-loops. We further denote the diagonal degree matrix $D \in \mathbb{R}^{n \times n}$ whose entries are given by $D_{ii} = \sum_j W_{ij}$. As done in Eq. (12.3), the weights of symmetric edges are explicitly enhanced and those of asymmetric edges are relatively reduced, considering that two points connected by a symmetric edge are prone to be on the same submanifold, as shown in Fig. 12.1. Therefore, we also call the k-NN graph constructed using Eq. (12.3) the symmetry-favored k-NN graph. The proposed symmetry-favored k-NN graph is highly robust to noise and outliers, and can reflect the underlying manifold structure of the data.

12.1.1.2 Spectral embedding of graph Laplacian

Suppose $f: V \to \mathbb{R}$ is a label prediction function; we measure the smoothness of the function f on the graph G by
$$S(f) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \frac{f(x_i)}{\sqrt{D_{ii}}} - \frac{f(x_j)}{\sqrt{D_{jj}}} \right\|^2. \qquad (12.4)$$
The graph Laplacian L of G is defined as $L = D - W$, and the normalized graph Laplacian $\mathcal{L}$ is defined as
$$\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \qquad (12.5)$$
where I is an identity matrix. In graph-based semisupervised learning, the graph Laplacian L (or $\mathcal{L}$) plays an essential role.
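The graph construction of Eqs. (12.1)–(12.3) is easy to prototype. The following is a minimal NumPy/SciPy sketch, not the authors' implementation; the function name and the default choices k = 10 and T = 7 (a common setting in the local-scaling literature) are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def symmetry_favored_knn_graph(X, k=10, T=7):
    """Sketch of the symmetry-favored k-NN graph of Eqs. (12.1)-(12.3)."""
    n = X.shape[0]
    dist = cdist(X, X)                      # pairwise Euclidean distances
    order = np.argsort(dist, axis=1)        # order[i, 0] == i (the point itself)
    h = dist[np.arange(n), order[:, T]]     # local scale: distance to the T-th neighbor, Eq. (12.1)

    # Asymmetric weights A_ij over each point's k nearest neighbors, Eq. (12.2)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = order[i, 1:k + 1]
        A[i, nbrs] = np.exp(-dist[i, nbrs] ** 2 / (h[i] * h[nbrs]))

    # Symmetry-favored weights W, Eq. (12.3): 1 for mutual neighbors,
    # otherwise the single nonzero asymmetric weight (max picks it out)
    mutual = (A > 0) & (A.T > 0)
    W = np.where(mutual, 1.0, np.maximum(A, A.T))
    np.fill_diagonal(W, 0.0)                # no self-loops
    return W
```

For large n, a sparse k-NN search (for example with a KD-tree) would replace the dense distance matrix, but the logic of the weighting scheme stays the same.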


Figure 12.1 Comparison of the k-NN and the symmetry-favored k-NN graphs (thicker edges represent larger edge weights): (A) k-NN graph; (B) symmetry-favored k-NN graph.

Let us denote the eigenvalues of $\mathcal{L}$ by $\lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$, and the corresponding orthonormal set of eigenvectors by $\phi_0, \phi_1, \ldots, \phi_{n-1}$. Therefore the spectral decomposition of the Laplacian $\mathcal{L}$ is given as
$$\mathcal{L} = \sum_{i=0}^{n-1} \lambda_i \phi_i \phi_i^T. \qquad (12.6)$$
For a discussion of the mathematical aspects of this decomposition, one can refer to Ref. [5]. We briefly summarize the following three relevant properties.

Property 1. (Properties of the graph Laplacian $\mathcal{L}$ [6]) Let $\mathcal{L}$ be the graph Laplacian of a connected graph; then we have:
1. For every $f \in \mathbb{R}^n$,
$$f^T \mathcal{L} f = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \frac{f(x_i)}{\sqrt{D_{ii}}} - \frac{f(x_j)}{\sqrt{D_{jj}}} \right\|^2. \qquad (12.7)$$
2. $\mathcal{L}$ is symmetric and positive semidefinite.
3. $\mathcal{L}$ has one and only one eigenvalue equal to 0, and $n-1$ positive eigenvalues: $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$.

According to the above analysis, we can reformulate Eq. (12.4) as
$$S(f) = f^T \mathcal{L} f, \qquad (12.8)$$
where $f = (f(x_1), \ldots, f(x_n))^T$. The smaller $S(f)$ is, the smoother f is [7]. In particular, the smoothness of an eigenvector is
$$S(\phi_i) = \phi_i^T \mathcal{L} \phi_i = \lambda_i. \qquad (12.9)$$
Thus, eigenvectors associated with smaller eigenvalues are smoother. Let $F_i = (\phi_1, \ldots, \phi_i)$, and call $F_i$ the i-th order spectral embedding of the data points, with row j of $F_i$ being taken as the new representation of data point $x_j$.
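As a concrete illustration of Eqs. (12.5)–(12.9), the m-th order spectral embedding can be computed from the symmetry-favored graph in a few lines. The sketch below assumes a dense weight matrix W (for example, the output of the graph construction above) and is not the chapter's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding(W, m):
    """Return the m smallest eigenvalues of the normalized Laplacian and F_m."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_norm = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt   # Eq. (12.5)
    eigvals, eigvecs = eigh(L_norm)          # ascending order: smoothest eigenvectors first, Eq. (12.9)
    return eigvals[:m], eigvecs[:, :m]       # F_m = (phi_1, ..., phi_m), rows are new representations
```

For a large sparse graph, `scipy.sparse.linalg.eigsh` (a Lanczos-type solver) would replace the dense `eigh` call, which is also what the chapter later suggests for computing the first m eigenvectors efficiently.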

12.1.2 Problem formulation

Given a data set X, and two sets of pairwise must-link and cannot-link constraints, denoted respectively by $M = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ should be in the same class, and $C = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ should be in different classes. In this section, we show how to learn an enhanced spectral embedding that is as consistent with the prior pairwise constraints as possible. We first convert learning the enhanced spectral embedding into a binary classification problem on a desired kernel matrix. Taking advantage of Laplacian regularization, we then formulate it as semidefinite quadratic linear programs (SQLPs) under the squared loss function or small semidefinite programs (SDPs) under the hinge loss function. Finally, we efficiently solve the proposed SQLP and small SDP problems.

12.1.2.1 The unit hypersphere

Motivated by the mapping onto a unit hypersphere in the implementation of spectral clustering, we have an ideal kernel matrix K for the data points as follows [8]:
$$K_{ij} = \begin{cases} 1, & l(x_i) = l(x_j), \\ 0, & l(x_i) \ne l(x_j), \end{cases} \qquad (12.10)$$
where $l(x_i)$ denotes the cluster label of $x_i$. Thus, our goal is to learn the enhanced spectral embedding with respect to the pairwise constraints. Taking advantage of Laplacian regularization [9], we let the desired representation $Y = \{y_1, \ldots, y_n\}^T = \bar{F}_m Q$ satisfy the prior pairwise constraints as
$$y_i^T y_j = F_{i,:}\, Q Q^T F_{j,:}^T = 1, \quad \forall (x_i, x_j) \in M; \qquad y_i^T y_j = F_{i,:}\, Q Q^T F_{j,:}^T = 0, \quad \forall (x_i, x_j) \in C, \qquad (12.11)$$

where $Q \in \mathbb{R}^{m \times m}$ is the desired matrix to enhance the spectral embedding $F_m$. Therefore, we convert learning the enhanced spectral embedding into a binary classification problem on a desired kernel matrix. Through Laplacian regularization, the learning of K is then reduced to the learning of a much smaller $U = Q Q^T$ ($m \ll n$). In the following parts, we adopt two loss functions for the binary classification: the squared loss and the hinge loss.

12.1.2.2 Squared loss model

Under the squared loss function, we present the following objective function:
$$L(Y) = \sum_{i=1}^{n} \left( y_i^T y_i - 1 \right)^2 + \sum_{(x_i, x_j) \in M} \left( y_i^T y_j - 1 \right)^2 + \sum_{(x_i, x_j) \in C} \left( y_i^T y_j - 0 \right)^2. \qquad (12.12)$$
Let $S = \{(x_i, x_j, t_{ij})\}$ be the set of pairwise constraints, and let $t_{ij}$ be a binary variable that takes the value 1 or 0 to denote whether $x_i$ and $x_j$ belong to the same cluster or not. Then the optimization problem can be formulated as
$$\min_{Y} \; L(Y) = \sum_{(x_i, x_j, t_{ij}) \in S} \left( y_i^T y_j - t_{ij} \right)^2, \qquad (12.13)$$
or equivalently, with respect to Q,
$$\min_{Q} \; L(Q) = \sum_{(x_i, x_j, t_{ij}) \in S} \left( F_{i,:}\, Q Q^T F_{j,:}^T - t_{ij} \right)^2. \qquad (12.14)$$

This is not a convex biquadratic optimization problem with respect to Q in general. However, it can be relaxed to a convex one that can be solved optimally and efficiently, as shown in the following. Let $U = Q Q^T$, and let $u = \mathrm{vec}(U)$ denote the vector obtained by concatenating all the columns of U. To solve the above unconstrained optimization problem, we first relax it to a convex quadratic semidefinite program (QSDP) form:
$$\min_{u} \; u^T A u + b^T u, \qquad (12.15)$$
where $A = \sum_{(i,j,t_{ij})} h_{ij} h_{ij}^T \succeq 0$ with $h_{ij} = \mathrm{vec}\!\left( F_{j,:}^T F_{i,:} \right)$, and $b = -2 \sum_{(i,j,t_{ij})} t_{ij} h_{ij}$.

Li et al. [10] proposed a Schur complement-based SDP optimization approach to solve the above QSDP problem. However, its time complexity could be as high as $O(m^9)$. Let r be the rank of A. With Cholesky factorization, we can obtain an $r \times m^2$ matrix B such that $A = B^T B$, where A is a symmetric positive semidefinite matrix with rank(A) = r [11]. Following Ref. [9], let $z = B u$; then the QSDP problem is equivalent to
$$\min_{u, z, \lambda} \; \lambda + b^T u \quad \text{s.t.} \quad z = B u, \; z^T z \le \lambda, \; U \succeq 0. \qquad (12.16)$$
Let $\mathcal{K}^{l}$ be the second-order cone with dimension l, i.e., $\mathcal{K}^{l} = \{ (x_0, x) \in \mathbb{R}^{l} : x_0 \ge \| x \| \}$, where $\|\cdot\|$ denotes the standard Euclidean norm. Let $v = \left( (1+\lambda)/2, \; (1-\lambda)/2, \; z^T \right)^T$, let $e_i$ ($i = 1, \ldots, r+2$) be the i-th basis vector, and let $C = (\mathbf{0}_{r \times 2}, I_{r \times r})$. Then the following lemma holds.

Lemma 1. $z^T z \le \lambda$ if and only if $v \in \mathcal{K}^{r+2}$.

We have $(e_1 - e_2)^T v = \lambda$, $(e_1 + e_2)^T v = 1$, and $z = C v$. Then by Lemma 1, the QSDP is equivalent to
$$\min_{u, v} \; (e_1 - e_2)^T v + b^T u \quad \text{s.t.} \quad (e_1 + e_2)^T v = 1, \; B u - C v = 0, \; v \in \mathcal{K}^{r+2}, \; U \succeq 0. \qquad (12.17)$$
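Before deriving the SQLP form, the convex relaxation itself (replace $QQ^T$ by a positive semidefinite matrix U in Eq. (12.14)) can be prototyped directly with a generic conic modeler. The CVXPY sketch below is only an illustration of that relaxation, not the SDPT3-based SQLP solver used in the chapter; it assumes the spectral embedding matrix `F` (rows $F_{i,:}$, shape n x m) and a constraint list `S` of triples (i, j, t_ij) are already available, and the function name is our own.

```python
import cvxpy as cp

def solve_relaxed_squared_loss(F, S):
    """Sketch of the convex relaxation behind Eq. (12.15): learn U >= 0 from pairwise constraints."""
    m = F.shape[1]
    U = cp.Variable((m, m), PSD=True)               # U stands in for Q Q^T
    # squared loss (F_{i,:} U F_{j,:}^T - t_ij)^2 summed over the constraint set S
    loss = sum(cp.square(F[i] @ U @ F[j] - t) for (i, j, t) in S)
    cp.Problem(cp.Minimize(loss)).solve()
    return U.value
```

For the small m used later in the experiments (m = 15), such a generic formulation is already tractable; the SQLP reformulation (12.17) mainly matters when the number of constraints or the embedding dimension grows.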

The optimization problem is a semidefinite quadratic linear program (SQLP), and can therefore be solved using a standard software package such as SDPT3 [12]. The main procedure of semisupervised clustering with enhanced spectral embedding under squared loss (ESES) is summarized in Algorithm 1.

Algorithm 1: ESES and ESEH algorithms
Input: A data set of n instances $X = \{x_1, \ldots, x_n\}$, the set of must-link constraints $M = \{(x_i, x_j)\}$, the set of cannot-link constraints $C = \{(x_i, x_j)\}$, the number of nearest neighbors k, and the constant m.
Output: Cluster labels of all the data points.
1. Construct the k-NN graph and compute the weight matrix W by Eq. (12.3).
2. Form the normalized graph Laplacian $\mathcal{L} = I - D^{-1/2} W D^{-1/2}$, where $D = \mathrm{diag}(d_i)$ is the diagonal degree matrix with $d_i = \sum_j w_{ij}$.
3. Compute the m eigenvectors $\phi_1, \ldots, \phi_m$ of $\mathcal{L}$ corresponding to the m smallest eigenvalues, and form the matrix $\bar{F}_m \in \mathbb{R}^{n \times m}$ from $F_m = [\phi_1, \ldots, \phi_m] \in \mathbb{R}^{n \times m}$ by normalizing each row to have unit norm.
4. Obtain the enhanced spectral embedding $\bar{F}_m (U^*)^{1/2}$ by solving the SQLP problem (12.17) for ESES or the small SDP problem (12.20) for ESEH.
5. Apply K-means to the rows of $\bar{F}_m (U^*)^{1/2}$ to form C clusters.

12.1.2.3 Hinge loss model

Under the hinge loss function, we present the following objective function:
$$L(Y) = \sum_{(x_i, x_j) \in (M \cup C)} \left[ 1 - T_{ij} Z_{ij} \right]_{+}, \qquad (12.18)$$
where $[x]_{+} = \max(0, x)$ denotes the hinge loss as used in SVMs, $Z = 2 Y Y^T - e e^T$ with $e = [1, \ldots, 1]^T \in \mathbb{R}^{n \times 1}$, and the matrix $T \in \mathbb{R}^{n \times n}$ represents all the pairwise constraints as follows:
$$T_{ij} = \begin{cases} +1, & (x_i, x_j) \in M, \\ -1, & (x_i, x_j) \in C, \\ 0, & \text{otherwise.} \end{cases} \qquad (12.19)$$
Let $U = Q Q^T$; then we reformulate Eq. (12.18) as
$$\min_{U, \varepsilon} \; \sum_{(x_i, x_j) \in (M \cup C)} \varepsilon_{ij} \quad \text{s.t.} \quad T_{ij} Z_{ij} \ge 1 - \varepsilon_{ij}, \; \varepsilon_{ij} \ge 0, \; \forall (x_i, x_j) \in (M \cup C), \; Z = 2 F U F^T - e e^T, \; U \succeq 0. \qquad (12.20)$$
The above optimization problem is a small semidefinite program (SDP) that learns the smaller matrix U in place of the entire kernel matrix K, and it can therefore be solved efficiently using a standard software package such as SeDuMi/YALMIP [13].
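For completeness, the hinge-loss problem (12.20) can also be prototyped with a generic conic modeler; the CVXPY sketch below mirrors the structure of the small SDP (the chapter itself uses SeDuMi/YALMIP), assuming `F` holds the rows $F_{i,:}$ of the spectral embedding and `T_pairs` is a list of triples (i, j, +-1) encoding the ML/CL constraints. Forming the full n x n expression Z is only sensible for small n; the point is to show the constraint structure.

```python
import cvxpy as cp
import numpy as np

def solve_hinge_sdp(F, T_pairs):
    """Sketch of the small SDP (12.20): hinge loss on constrained entries of Z = 2 F U F^T - ee^T."""
    n, m = F.shape
    U = cp.Variable((m, m), PSD=True)
    Z = 2 * F @ U @ F.T - np.ones((n, n))
    # one hinge term per constrained pair (equivalent to the slack variables eps_ij)
    loss = sum(cp.pos(1 - t * Z[i, j]) for (i, j, t) in T_pairs)
    cp.Problem(cp.Minimize(loss)).solve()
    return U.value
```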

The main procedure of semisupervised clustering with enhanced spectral embedding under hinge loss (ESEH) is also summarized in Algorithm 1.

12.1.2.4 Clustering

Let $U^*$ be the matrix obtained by solving the SQLP or the small SDP. Then we apply the K-means algorithm on the enhanced spectral embedding $\bar{F}_m (U^*)^{1/2}$ to form C clusters. In addition, we can obtain the learned kernel matrix as $K^* = \bar{F}_m U^* (\bar{F}_m)^T$.

12.1.3 Algorithm

Based on the previous analysis, we propose the semisupervised clustering approach listed in Algorithm 1, which we call semisupervised clustering with enhanced spectral embedding (ESE). At first glance, the toy data set consists of three separate groups, composed of a mixture of Gaussian-like and curve-like groups (Fig. 12.2). In addition, we present a comparison between the similarity matrix in the input space and the enhanced kernel matrix achieved by our proposed approach, where the data are ordered such that all instances in the two Gaussian-like groups appear first, and all instances in the curve-like group appear second.

12.1.3 Algorithm Based on the previous analysis, we propose the semisupervised clustering approach listed in Algorithm 1, which we called the semisupervised clustering with enhanced spectral embedding (ESE). At first glance, the toy data consists of three separate groups, and is composed of a mixture of Gaussian-like and curve-like groups (Fig. 12.2). In addition, we present the comparison between the similarity matrix in the input space and the enhanced kernel matrix achieved by our proposed approach, where the data are ordered such that all instances in the two Gaussian-like groups appear first, and all instances in the curve-like group appear second. We can see that the proposed ESE approach can address the

(A)

(B)

(C)

1.6

1.6

1.6

1.4

1.4

1.4

1.2

1.2

1.2

1

1

1

0.8

0.8

0.6

0.6

0.8

Data ML CL

0.6 0.4

0

0.5

1

1.5

2

(D)

0.4

0

0.5

1

1.5

2

0.4

0

0.5

1

1.5

2

(E)

Figure 12.2 Clustering results on a toy data set. (A) Toy data set with one ML and one CL constraints; (B) clustering results produced by spectral clustering; (C) clustering results produced by ESE; (D) affinity matrix for the toy data set in the input space; (E) kernel matrix achieved by the proposed ESE. For illustration purposes, the data are arranged such that points within a cluster appear consecutively. The brighter a pixel, the larger is the similarity the pixel represents.

We can see that the proposed ESE approach can address the multimanifold problem, and the enhanced affinity matrix achieved by the proposed ESE approach exhibits two clear block structures, implying that the two clusters are well-separated groups.

12.1.4 Experiments

In this section, we conduct a set of semisupervised clustering experiments on a variety of data sets, including a synthetic data set, several real-world data sets, an image object recognition data set, and a scene category recognition data set. For comparison, we present the results of the four most related SSC algorithms: spectral learning (SL) [14], constrained clustering through affinity propagation (CCAP) [15], constrained clustering via spectral regularization (CCSR), and the semisupervised kernel K-means algorithm (SSKK) [16]. The results of spectral clustering (SC) are shown for reference. The proposed ESE approach, SL, CCAP, CCSR, and SSKK directly address multiclass clustering problems and exploit both ML and CL constraints. In contrast, the recently proposed SSC algorithm of Ref. [17] makes sense only for two-class problems, and Yu and Shi's approach handles only ML constraints [18].

12.1.4.1 Parameter selection

In the implementation of SL, we modify the similarities to 1's and 0's corresponding to the ML and CL constraints, respectively. In SL, SC, CCSR, and SSKK, as commonly done, we construct several candidate graphs and select the one that achieves the best result. In particular, the size of the neighborhood t is tuned from {10, 15, ..., 50}. The similarities between pairwise points are computed using the standard Gaussian function, $W(x_i, x_j) = \exp\!\left( -\|x_i - x_j\|^2 / 2\sigma^2 \right)$, where the width $\sigma$ of the Gaussian function is searched from the set linspace(0.1r, r, 5) ∪ linspace(r, 10r, 5), with r being the average distance from each data point to its 20th nearest neighbor, and linspace(r1, r2, t) denoting the set of t linearly equally spaced numbers between r1 and r2. For the proposed ESE approach, we also tune the size of the neighborhood t from {20, 25, ..., 50}, and set m = 15, the same as in CCSR.

12.1.4.2 Vector-based clustering

We use two categories of real-world data sets in our experiments. These data sets include:

• UCI data. We perform experiments on four UCI data sets, including Wine, Balance, Iris, and WDBC, and an artificial data set, G50c.
• Image data. We perform experiments on four image data sets: MNIST digits [19], USPS digits, COIL20 [20], and YaleB3 [21].

The basic information of these real-world data sets is summarized in Table 12.1.

Table 12.1: A summary of data sets.

Data set     Samples   Features          Clusters
G50c         550       50                2
WDBC         569       30                2
Balance      625       4                 3
Iris         150       4                 3
Wine         178       13                3
USPS-test    2007      16x16 = 256       10
USPS-train   7291      16x16 = 256       10
MNIST0123    24,754    28x28 = 784       4
YaleB3       1755      30x40 = 1200      3
COIL20       1440      32x32 = 1024      20

In these experiments, we set the number of clusters equal to the true number of classes for each data set, and use the Rand index [15] to evaluate the accuracy of the resultant clustering for all the clustering algorithms (Figs. 12.3–12.5). In addition, we generate a varying number of pairwise constraints randomly for each data set. For a data set of C clusters, we randomly generate j ML constraints for each cluster and j CL constraints for every two clusters, with a total of j(C + C(C-1)/2) constraints for each j; a small sketch of this constraint-generation scheme is given after the list of observations below. We can observe the following:

• In most cases, these sophisticated SSC approaches outperform the spectral clustering (SC) baseline, except CCAP on a few data sets. In particular, the proposed ESE, CCSR, and SSKK improve the performance of spectral clustering by taking advantage of the pairwise constraints.
• In most cases, the performance of the SSC methods consistently increases as more constraints are added, again except CCAP on a few data sets. This suggests that pairwise constraints can be utilized to reduce the semantic gap between high-level semantic concepts and low-level data features.
• CCAP and SSKK usually perform worse than CCSR and the proposed ESE approaches, and they cannot handle large-scale problems (e.g., the USPS-train and MNIST0123 data sets). Clustering performances of both SL and SSKK are hardly improved as the number of pairwise constraints grows. In addition, SSKK applies a kernel K-means-like algorithm directly to the modified kernel matrix, which is not guaranteed to be positive semidefinite.
• The proposed ESE and CCSR approaches often outperform the other three SSC methods (SL, CCAP, and SSKK). Furthermore, the proposed ESE approach (including the two proposed algorithms, ESES and ESEH) consistently performs better than CCSR under almost all settings of pairwise constraints. This observation confirms that the proposed symmetry-favored k-NN graph can reflect the underlying manifold structures of the data sets, and that the proposed ESE approach can also handle large-scale SSC problems.
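The constraint-generation scheme described above is straightforward to reproduce from ground-truth labels. The sketch below is an illustrative implementation (the function and variable names are our own, not from the chapter) that draws j ML constraints per cluster and j CL constraints per pair of clusters, for a total of j(C + C(C-1)/2) constraints.

```python
import itertools
import random

def generate_constraints(labels, j, seed=0):
    """Randomly draw j ML constraints per cluster and j CL constraints per cluster pair."""
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    must_link, cannot_link = [], []
    for members in clusters.values():                    # j ML constraints inside each cluster
        for _ in range(j):
            must_link.append(tuple(rng.sample(members, 2)))
    for a, b in itertools.combinations(clusters, 2):     # j CL constraints for every two clusters
        for _ in range(j):
            cannot_link.append((rng.choice(clusters[a]), rng.choice(clusters[b])))
    return must_link, cannot_link
```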


Figure 12.3 Clustering results on a noisy two-moon data set: (A) the noisy two-moon data set with one ML and one CL constraints; (B) clustering results produced by SL; (C) CCAP; (D) CCSR; (E) SSKK; (F) the proposed ESES; (G) the proposed ESEH.

We also look at the computational costs of the different algorithms. For example, the times taken by the proposed ESES and ESEH algorithms on the Balance data set with 300 pairwise constraints are about 4 and 10 s, respectively, while SL, CCAP, CCSR, and SSKK take about 7, 15, 60, and 12 s, respectively.

12.1.4.3 Graph-based clustering

In this section, we perform experiments on an image object recognition data set and a scene category recognition data set.


Figure 12.4 Clustering results on UCI data sets: Rand index versus the number of constraints. (A) G50c; (B) WDBC; (C) Balance; (D) Iris; (E) Wine.

We experiment with the Caltech-4 data set [22], which contains 1155 automobile, 1074 airplane, 450 face, and 800 motorcycle images, and the Scene-8 data set [23], which contains eight categories of natural scenes with a total of 2688 images (Figs. 12.6 and 12.7). Here, we use uniform grid cells and SIFT descriptors extracted from a uniformly spaced grid on each image. Then the pyramid match kernel [24], a kernel function between images, is used to generate a kernel matrix from the input sets of image features of each pair of images (the sets may have different cardinality).


Figure 12.5 Clustering results on image data sets: Rand index versus the number of constraints. (A) USPS-train; (B) MNIST0123; (C) USPS-test; (D) YaleB3; (E) COIL20.


Figure 12.6 Example images from two image data sets: (A) Caltech-4 data set; (B) Scene-8 data set.


Figure 12.7 Clustering results on two image data sets: Rand index versus the number of constraints. (A) Caltech-4; (B) Scene-8.

There is no explicit vector representation of these data sets; the kernel function defines a matrix of kernel function evaluations, which is then viewed as a dense graph. We compare the graph-based clustering performance of the proposed ESE approach against various existing state-of-the-art methods: SL, SSKK, and CCSR. From the results of graph-based clustering on the Caltech-4 and Scene-8 data sets, we can see that the performance of all four SSC methods consistently increases as the number of constraints grows. This also suggests that pairwise constraints can be utilized to reduce the semantic gap between high-level semantic concepts and low-level data features. Both the proposed ESE and CCSR methods take advantage of the prior pairwise constraints and the structure information contained in the data sets for clustering tasks. Thus, they consistently outperform SL and SSKK in terms of clustering quality. Furthermore, the proposed ESE approach (including the two proposed algorithms, ESES and ESEH) performs best on these two data sets under all settings of pairwise constraints.

12.2 Fast semisupervised clustering with enhanced spectral embedding

In this section, we propose a new SSC method based on enhanced spectral embedding (ESE). It not only takes into account the structural information contained in the data set, but also makes use of prior information (such as pairwise constraints). In particular, we first construct a symmetry-favored k-NN graph, which is robust to noisy objects and can reflect the underlying manifold structure of the data set. Then, we learn an enhanced spectral embedding toward an ideal representation that conforms to the pairwise constraints as much as possible. Finally, we formulate learning the new spectral representation as a semidefinite quadratic linear program (SQLP) problem, which can be efficiently solved. We adopt the local scaling parameter trick to construct a graph G such as a k-nearest neighbor (k-NN) graph; details can be found in Section 12.1.

12.2.1 Problem formulation

Given a data set X, two sets of pairwise must-link and cannot-link constraints are also provided, denoted respectively by $M = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ should be in the same cluster, and $C = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ should be in different clusters. In this section, we introduce how to learn an enhanced spectral embedding from the prior pairwise constraints. We also present an objective function for the above problem and formulate it as a semidefinite quadratic linear programming (SQLP) problem. Note that the unconstrained clustering problem degenerates to ordinary spectral clustering.

12.2.1.1 Objective function

Motivated by the mapping onto a unit hypersphere in the implementation of spectral clustering, we have the ideal affinity matrix K for data points as follows [8]:
$$K_{ij} = \begin{cases} 1, & l(x_i) = l(x_j), \\ 0, & l(x_i) \ne l(x_j), \end{cases} \qquad (12.21)$$
where $l(x_i)$ denotes the cluster label of $x_i$. Thus, our goal is to learn the enhanced spectral embedding with pairwise constraints, and let the desired representation $Y = \{y_1, \ldots, y_n\}^T = F_m Q$ satisfy the prior pairwise constraints as
$$y_i^T y_j = F_{i,:}\, Q Q^T F_{j,:}^T = 1, \quad \forall (x_i, x_j) \in M; \qquad y_i^T y_j = F_{i,:}\, Q Q^T F_{j,:}^T = 0, \quad \forall (x_i, x_j) \in C, \qquad (12.22)$$
where $Q \in \mathbb{R}^{m \times m}$ is the desired matrix to enhance the spectral embedding. Based on the above analysis, we present the following objective function:
$$L(Y) = \sum_{i=1}^{n} \left( y_i^T y_i - 1 \right)^2 + \sum_{(x_i, x_j) \in M} \left( y_i^T y_j - 1 \right)^2 + \sum_{(x_i, x_j) \in C} \left( y_i^T y_j - 0 \right)^2. \qquad (12.23)$$
Let $S = \{(x_i, x_j, t_{ij})\}$ be the set of pairwise constraints, and let $t_{ij}$ be a binary variable that takes the value 1 or 0 to denote whether $x_i$ and $x_j$ belong to the same cluster or not. Then the optimization problem is reformulated as
$$\min_{Y} \; L(Y) = \sum_{(x_i, x_j, t_{ij}) \in S} \left( y_i^T y_j - t_{ij} \right)^2, \qquad (12.24)$$
which can be written with respect to Q as
$$\min_{Q} \; L(Q) = \sum_{(x_i, x_j, t_{ij}) \in S} \left( F_{i,:}\, Q Q^T F_{j,:}^T - t_{ij} \right)^2. \qquad (12.25)$$

This is not a convex biquadratic optimization problem with respect to Q in general [10]. However, it can be relaxed to a convex one that can be solved optimally and efficiently, as shown in the next subsection.

12.2.1.2 Solving the objective function

Let $U = Q Q^T$, and let $u = \mathrm{vec}(U)$ denote the vector obtained by concatenating all the columns of U. To solve the above unconstrained optimization problem (12.25), we first relax it to a convex quadratic semidefinite program (QSDP) form:
$$\min_{u} \; u^T A u + b^T u, \qquad (12.26)$$
where $A = \sum_{(i,j,t_{ij})} v_{ij} v_{ij}^T \succeq 0$ with $v_{ij} = \mathrm{vec}\!\left( F_{j,:}^T F_{i,:} \right)$, and $b = -2 \sum_{(i,j,t_{ij})} t_{ij} v_{ij}$. Li et al. proposed a Schur complement-based SDP optimization approach to solve the above QSDP problem; however, its time complexity could be as high as $O(m^9)$ [9].

Let r be the rank of A. With Cholesky factorization, we can obtain an $r \times m^2$ matrix B such that $A = B^T B$, where A is a symmetric positive semidefinite matrix with rank(A) = r [11]. Let $z = B u$; then the QSDP problem is equivalent to
$$\min_{u, z, \lambda} \; \lambda + b^T u \quad \text{s.t.} \quad z = B u, \; z^T z \le \lambda, \; U \succeq 0. \qquad (12.27)$$
Let $\mathcal{K}^{l}$ be the second-order cone of dimension l, i.e.,
$$\mathcal{K}^{l} = \{ (x_0, x) \in \mathbb{R}^{l} : x_0 \ge \| x \| \}, \qquad (12.28)$$
where $\|\cdot\|$ denotes the standard Euclidean norm. Let $v = \left( (1+\lambda)/2, \; (1-\lambda)/2, \; z^T \right)^T$, let $e_i$ ($i = 1, \ldots, r+2$) be the i-th basis vector, and let $C = (\mathbf{0}_{r \times 2}, I_{r \times r})$. Then the following lemma holds.

Lemma 2. $z^T z \le \lambda$ if and only if $v \in \mathcal{K}^{r+2}$.

We have $(e_1 - e_2)^T v = \lambda$, $(e_1 + e_2)^T v = 1$, and $z = C v$. Then by Lemma 2, the QSDP in Eq. (12.26) is equivalent to
$$\min_{u, v} \; (e_1 - e_2)^T v + b^T u \quad \text{s.t.} \quad (e_1 + e_2)^T v = 1, \; B u - C v = 0, \; v \in \mathcal{K}^{r+2}, \; U \succeq 0. \qquad (12.29)$$

The optimization problem is a semidefinite quadratic linear program (SQLP), and therefore can be solved by using the standard software package, such as SDPT3 [12]. 12.2.1.3 Clustering Let U* be the matrix obtained by solving the SQLP problem. Then we apply the K-means algorithm on the enhanced spectral embedding Fm ðU  Þ1=2 to form C clusters. In addition, we can achieve the enhanced affinity matrix by K  ¼ Fm U  FmT .

12.2.2 Algorithm

On the basis of the previous analysis, we propose an efficient SSC method, listed in Algorithm 1 (Table 12.2), which we call semisupervised clustering with enhanced spectral embedding (ESE). Fig. 12.8 shows an intuitive illustration of our ESE method. At first glance, the toy data set is made up of three separate groups: two Gaussian-like groups and one curve-like group. In addition, we give a comparison between the similarity matrix in the input space and the enhanced similarity matrix obtained by the proposed ESE method, in which the data are ordered such that all instances in the two Gaussian-like groups appear first.


Figure 12.8 Clustering results on a toy data set: (A) toy data set with one ML and one CL constraints; (B) clustering results produced by normalized cut [25]; (C) ESE; (D) affinity matrix for the toy data set in the input space; (E) affinity matrix achieved by ESE.

All instances in the curve-like group appear second. Figs. 12.8B–E show that the proposed ESE method can solve the multimanifold problem, and the enhanced affinity matrix produced by our ESE method exhibits two clear block structures (Table 12.2). This means that the two clusters are well-separated groups.

The main running time of the proposed ESE approach is consumed by constructing the symmetry-favored k-NN graph and computing the enhanced spectral embedding. The time complexity of computing the enhanced matrix by solving the SQLP problem (12.29) is $O(m^{6.5})$. The total time complexity of ESE is $O(n^2 + m^{6.5} + n m^2 + tnc)$, together with memory requirements of O(kn) (k is the number of nearest neighbors, $kn \ll n^2$), where t is the number of iterations of the K-means procedure. In our algorithm, computing the first m eigenvectors of the sparse matrix $\mathcal{L}$ can be performed efficiently using the Lanczos algorithm [11].

Table 12.2: The ESE algorithm.
Input: A data set of n instances $X = \{x_1, \ldots, x_n\}$, the set of must-link constraints $M = \{(x_i, x_j)\}$, the set of cannot-link constraints $C = \{(x_i, x_j)\}$, the number of nearest neighbors k, and the constant m.
Output: Cluster labels of all the data points.
Step 1. Construct the k-NN graph and compute the weight matrix W by Eq. (12.3).
Step 2. Form the normalized graph Laplacian $\mathcal{L} = I - D^{-1/2} W D^{-1/2}$, where $D = \mathrm{diag}(d_i)$ is the diagonal degree matrix with $d_i = \sum_j w_{ij}$.
Step 3. Compute the m eigenvectors $\phi_1, \ldots, \phi_m$ of $\mathcal{L}$ corresponding to the m smallest eigenvalues, and form the matrix $\bar{F}_m \in \mathbb{R}^{n \times m}$ from $F_m = [\phi_1, \ldots, \phi_m] \in \mathbb{R}^{n \times m}$ by normalizing each row to have unit norm.
Step 4. Obtain the enhanced spectral embedding $\bar{F}_m (U^*)^{1/2}$ by solving the SQLP problem (12.29).
Step 5. Apply K-means to the rows of $\bar{F}_m (U^*)^{1/2}$ to form C clusters.

12.2.2.1 Experimental results

In this section, we conduct a set of SSC experiments on many data sets, including two synthetic data sets, many real-world data sets, and an image object recognition data set and a scene category recognition data set. For comparison, we present the results of the four most related SSC algorithms: spectral learning (SL) [14], constrained clustering through affinity propagation (CCAP) [15], constrained clustering via spectral regularization (CCSR) [10], and the semisupervised kernel K-means algorithm (SSKK) [16]. The results of spectral clustering (SC) (here we adopt the normalized cut as the representative) are shown for reference. Our ESE approach, SL, CCAP, CCSR, and SSKK directly address multiclass clustering problems and can exploit both ML and CL constraints. In contrast, the recently proposed SSC algorithm of Ref. [17] makes sense only for two-class problems, and Yu and Shi's approach handles only ML constraints.

12.2.2.2 Parameter selection

In the implementation of SL, we modify the similarities to 1's and 0's corresponding to the ML and CL constraints, respectively. In SL, SC, CCSR, and SSKK, as commonly done, we construct several candidate graphs. In particular, the size of the neighborhood t is tuned from {20, 25, ..., 50}. The similarities between pairwise points are computed using the standard Gaussian function, $W(x_i, x_j) = \exp\!\left( -\|x_i - x_j\|^2 / 2\sigma^2 \right)$, where the width $\sigma$ of the Gaussian function is searched from the set linspace(0.1r, r, 5) ∪ linspace(r, 10r, 5), with r being the average distance from each data point to its 20th nearest neighbor, and linspace(r1, r2, t) denoting the set of t linearly equally spaced numbers between r1 and r2.
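The candidate set for the Gaussian width described above can be generated in a couple of lines. The sketch below is illustrative only (the function name is our own) and uses scikit-learn's nearest-neighbor search to estimate r.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sigma_candidates(X, kth=20):
    """Candidate Gaussian widths: linspace(0.1r, r, 5) union linspace(r, 10r, 5)."""
    nn = NearestNeighbors(n_neighbors=kth + 1).fit(X)   # +1 because the closest "neighbor" is the point itself
    dists, _ = nn.kneighbors(X)
    r = dists[:, kth].mean()                             # average distance to the 20th nearest neighbor
    return np.union1d(np.linspace(0.1 * r, r, 5), np.linspace(r, 10 * r, 5))
```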

In the proposed ESE approach, we also tune the size of the neighborhood t from {20, 25, ..., 50}, and set m = 15 as in CCSR.

12.2.2.3 Toy examples

We conduct experiments on two synthetic data sets, as shown in Figs. 12.9 and 12.10A. The first one, the noisy two-moon data set, consists of 400 points and 100 uniformly distributed noisy points. The second one, the noisy squared-circled-cluster data set, composed of two squared clusters (each with 1000 points), one circled cluster (1000 points), and 300 noisy points, is more challenging. We illustrate the clustering results of the four related SSC methods and the proposed ESE approach in Figs. 12.9 and 12.10, from which we can see that SL, CCAP, CCSR, and SSKK all make some mistakes, while our ESE approach gives better clustering results. Therefore, we claim that the existing SSC methods exhibit worse results because the noisy points essentially destroy the manifold structure, while the proposed ESE approach with a symmetry-favored k-NN graph is highly robust to noise.

12.2.2.4 Vector-based clustering

We use two categories of real-world data sets in our experiments. These data sets include:

• UCI data. We perform experiments on four UCI data sets, including Wine, Balance, Iris, and WDBC, and an artificial data set, G50c.
• Image data. We perform experiments on four image data sets: MNIST digits [19], USPS digits, COIL20 [20], and YaleB3 [21].


Figure 12.9 Clustering results on the noisy two-moon data set: (A) the original data set with one ML and one CL constraint; (B–F) clustering results produced by SL, CCAP, CCSR, SSKK, and the proposed ESE approach, respectively.


Figure 12.10 Clustering results on the noisy squared-circled-cluster data set: (A) the original data set with three CL constraints; (B–F) clustering results produced by SL, CCAP, CCSR, SSKK, and the proposed ESE approach, respectively.

For the MNIST0123 data set, we choose the training subset of the well-known MNIST data set on handwritten digit recognition, and select digits 0, 1, 2, and 3 as four classes with 5923, 6742, 5958, and 6131 examples, respectively, for a total of 24,754. The USPS data set consists of a training set with 7291 images and a test set with 2007 images. The COIL20 data set consists of grayscale images of 20 objects; for each object, there are 72 images of size 32x32. For the YaleB3 data set, we choose a subset of the Yale Face Database B, use the images of individuals 2, 5, and 10, and down-sample each image to 30x40 pixels. This gives us 1755 images with 1200 dimensions to work with. The basic information of these real-world data sets is summarized in Table 12.3.

Table 12.3: A summary of data sets.

Data set     Samples   Features          Clusters
G50c         550       50                2
WDBC         569       30                2
Balance      625       4                 3
Iris         150       4                 3
Wine         178       13                3
USPS-test    2007      16x16 = 256       10
USPS-train   7291      16x16 = 256       10
MNIST0123    24,754    28x28 = 784       4
YaleB3       1755      30x40 = 1200      3
COIL20       1440      32x32 = 1024      20

In all experiments, we set the number of clusters equal to the true number of classes for each data set, and use the Rand index [15] to evaluate the accuracy of the resultant clustering for all the clustering algorithms. In addition, we generate a varying number of pairwise constraints randomly for each data set. For a data set of C clusters, we randomly generate j ML constraints for each cluster and j CL constraints for every two clusters, with a total of j(C + C(C-1)/2) constraints for each j. To evaluate the SSC approaches under different settings of pairwise constraints, the clustering results of these SSC algorithms on the real-world data sets are shown in Fig. 12.11, and the results are reported averaged over 50 independent runs. Moreover, we present the time consumption of our ESE approach for each data set, as listed in Tables 12.4 and 12.5. For comparison, we also report the time cost of the other four SSC algorithms. From the results, we can observe the following:

• In most cases, with the exception of CCAP on a few data sets, these sophisticated SSC methods are superior to spectral clustering as the baseline method. In particular, the proposed ESE method, CCSR, and SSKK use the given pairwise constraints to improve the performance of spectral clustering (Figs. 12.12 and 12.13).
• In most cases, the performance of the SSC methods continues to improve as more constraints are added, again with the exception of CCAP on a few data sets. This shows that pairwise constraints can be used to narrow the semantic gap between high-level semantic concepts and low-level data features.


Figure 12.11 Clustering results on UCI and image data sets: Rand index via the number of constraints. (A) G50c; (B) WDBC; (C) Balance; (D) Iris; (E) Wine; (F) USPS-test; (G) USPS-train; (H) MNIST0123; (I) YaleB3; (J) COIL20.


Figure 12.11 Continued.

Table 12.4: Comparison of computational time (seconds) on the UCI data sets (Num: number of constraints).

Data   G50c (Num 150)   Cancer (Num 150)   Balance (Num 300)   Iris (Num 60)   Wine (Num 60)
SL     2.79             2.49               3.32                0.36            0.23
CCAP   4.46             12.89              14.47               0.72            0.96
CCSR   69.48            70.63              59.76               59.15           56.12
SSKK   4.37             6.68               11.02               4.79            5.46
ESE    2.90             2.72               3.63                2.01            2.13

Table 12.5: Comparison of computational time (seconds) on the image data sets (Num: number of constraints).

Data   USPS-test (Num 550)   USPS-train (Num 550)   MNIST0123 (Num 100)   YaleB3 (Num 60)   COIL20 (Num 2100)
SL     11.75                 139.67                 1083.50               10.63             11.43
CCAP   --                    --                     --                    --                --
CCSR   79.85                 212.59                 1428.79               79.55             70.11
SSKK   58.35                 --                     --                    33.20             51.93
ESE    13.98                 145.73                 1248.12               12.41             12.66

No result ("--") is reported for these algorithms as they do not work on the corresponding data sets.


Figure 12.12 Clustering results of CCSR and the proposed ESE approach via the number of dimensions: (A) G50c; (B) USPS-test.


Figure 12.13 Clustering results of CCSR and the proposed ESE approach via the neighborhood size: (A) G50c; (B) USPS-test.

• The performances of CCAP and SSKK are usually worse than those of CCSR and the proposed ESE method, and they cannot deal with large-scale problems (such as the USPS-train and MNIST0123 data sets). In addition, SSKK applies a kernel K-means-like algorithm directly to the modified kernel matrix, which is not necessarily positive semidefinite [16].
• The proposed ESE method and CCSR are usually superior to the other three SSC methods (SL, CCAP, and SSKK). In terms of both clustering quality and clustering speed, the proposed ESE method is consistently superior to CCSR. This confirms that the proposed symmetry-favored k-NN graph can reflect the underlying manifold structure of the data set, and that the proposed ESE method can also deal with large-scale SSC problems. In addition, ESE is usually much faster than CCAP, SSKK, and CCSR.

In the second part of the experiments, we study the stability of the proposed ESE approach with respect to its two parameters, the number of dimensions m and the neighborhood size k, and conduct experiments on two chosen data sets, G50c and USPS-test, with 150 and 550 pairwise constraints, respectively. We illustrate the clustering results of the proposed ESE approach; for comparison, the clustering results of CCSR on these two data sets are also plotted in the corresponding figures. We can see that CCSR and the proposed ESE approach are very stable as long as the number of dimensions m and the neighborhood size k are neither too large nor too small. Doubtless, the two parameters are easy to tune since they are selected only from positive integers. Moreover, the proposed ESE approach consistently outperforms CCSR on these two data sets under all settings of the two parameters.

In the third part of the experiments, we evaluate the robustness of the proposed ESE approach against noisy pairwise constraints, again on the G50c and USPS-test data sets with 150 and 550 pairwise constraints, respectively. We demonstrate the clustering results of the proposed ESE approach with noisy constraints at different noise levels, 2%–10%, as shown in Fig. 12.14. For comparison, the clustering results of SL, CCAP, SSKK, and CCSR on these two data sets are also plotted in the corresponding figure. In the figure, the abscissa represents the rate of noisy pairwise constraints, and the ordinate is the clustering accuracy. It is clear that the clustering accuracies of all these SSC approaches degrade overall as more noisy pairwise constraints are added, and that the proposed ESE approach is very robust to noisy pairwise constraints.

12.2.2.5 Graph-based clustering

In this section, we perform experiments on an image object recognition data set and a scene category recognition data set: the Caltech-4 data set [22], containing 1155 automobile, 1074 airplane, 450 face, and 800 motorcycle images, and the Scene-8 data set [23], containing eight categories of natural scenes with a total of 2688 images.


Figure 12.14 Clustering results of all these SSC approaches via the rate of noisy constraints: (A) G50c; (B) USPS-test.

Here, we use uniform grid cells and SIFT descriptors extracted from a uniformly spaced grid on each image. Then the pyramid match kernel, a kernel function between images, is used to generate a kernel matrix from the input sets of image features of each pair of images (the sets may have different cardinality). For more details we refer the reader to Ref. [24]. There is no explicit vector representation of these data sets; the kernel function defines a matrix of kernel function evaluations, which is then viewed as a dense graph. We compare the graph-based clustering results of the proposed ESE approach against various existing state-of-the-art methods: SL, SSKK, and CCSR. The results of graph-based clustering on the Caltech-4 and Scene-8 data sets are shown in Fig. 12.15.


Figure 12.15 Clustering results on two image data sets: Rand index versus the number of constraints. (A) Caltech-4 data set; (B) Scene-8 data set.


Figure 12.16 Distance matrices of the low-dimensional data representations obtained by (from left to right) SC, SL, CCSR, and ESE on two image data sets (first row: Caltech-4 data set; second row: Scene-8 data set).

Both ESE and CCSR take advantage of the given pairwise constraints and the structure information contained in the data sets for the clustering tasks. Thus, they consistently outperform SL and SSKK in terms of clustering quality. Furthermore, the proposed ESE approach performs best on the two data sets under almost all settings of pairwise constraints. Moreover, we illustrate the affinity matrices of the low-dimensional data embeddings obtained by SC, SL, CCSR, and ESE in Fig. 12.16, from which we can see that the block structure of the affinity matrices obtained by CCSR and the proposed ESE approach on these two image data sets is significantly more obvious than that obtained by SC and SL. This means that each cluster associated with the new data embeddings obtained by CCSR and ESE becomes more compact, and different clusters form well-separated groups.

References

[1] Shang F, Liu Y, Wang F. Learning spectral embedding for semi-supervised clustering. In: 2011 IEEE 11th international conference on data mining. IEEE; 2011. p. 597–606.
[2] Jiao LC, Shang F, Wang F, et al. Fast semi-supervised clustering with enhanced spectral embedding. Pattern Recognition 2012;45(12):4358–69.
[3] Liu W, Chang S. Robust multi-class transductive learning with graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2009. p. 381–8.
[4] Zelnik-Manor L, Perona P. Self-tuning spectral clustering. Advances in Neural Information Processing Systems 2004;16:1601–8.
[5] Chung F. Spectral graph theory. American Mathematical Society; 1997.
[6] von Luxburg U. A tutorial on spectral clustering. Statistics and Computing 2007;17(4):395–416.
[7] Chapelle O, Schölkopf B, Zien A, editors. Semi-supervised learning. Cambridge, MA: The MIT Press; 2006.
[8] Kwok JT, Tsang IW. Learning with idealized kernels. In: Proceedings of the 20th international conference on machine learning; 2003. p. 400–7.
[9] Wu X-M, So A, Li Z, Li S. Fast graph Laplacian regularized kernel learning via semidefinite-quadratic-linear programming. Advances in Neural Information Processing Systems 2009;21:1964–72.
[10] Li Z, Liu J, Tang X. Constrained clustering via spectral regularization. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2009. p. 421–8.
[11] Golub GH, Van Loan CF. Matrix computations. 3rd ed. Baltimore, Maryland: The Johns Hopkins University Press; 1996.
[12] Tütüncü RH, Toh KC, Todd MJ. Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming 2003;95:189–217.
[13] Löfberg J. YALMIP: a toolbox for modeling and optimization in MATLAB. In: The CACSD conference; 2004.
[14] Kamvar S, Klein D, Manning C. Spectral learning. In: Proceedings of the 18th international joint conference on artificial intelligence; 2003. p. 561–6.
[15] Lu Z, Carreira-Perpiñán MÁ. Constrained spectral clustering through affinity propagation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2008. p. 848–55.
[16] Kulis B, Basu S, Dhillon I, Mooney R. Semi-supervised graph clustering: a kernel approach. Machine Learning 2009;74(1):1–22.
[17] Coleman T, Saunderson J, Wirth A. Spectral clustering with inconsistent advice. In: Proceedings of the 25th international conference on machine learning; 2008. p. 152–9.
[18] Yu SX, Shi J. Segmentation given partial grouping constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004;26(2):173–83.
[19] LeCun Y, Cortes C. The MNIST database of handwritten digits. 2009. http://yann.lecun.com/exdb/mnist/.
[20] Nene SA, Nayar SK, Murase J. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96. Columbia Univ.; February 1996.
[21] Georghiades AS, Belhumeur PN, Kriegman DJ. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001;23(6):643–60.
[22] Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: Proceedings of the IEEE conference on computer vision and pattern recognition, workshop on generative model based vision; 2004.
[23] Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 2001;42(3):145–75.
[24] Grauman K, Darrell T. The pyramid match kernel: discriminative classification with sets of image features. In: Proceedings of the IEEE international conference on computer vision; 2005. p. 1458–65.
[25] Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(8):888–905.

C H A P T E R 13

Fast clustering methods based on affinity propagation and density weighting

Chapter Outline
13.1 The framework of fast clustering methods based on affinity propagation and density weighting
    13.1.1 Related works
        13.1.1.1 AP clustering
        13.1.1.2 Spectral clustering
        13.1.1.3 Nyström method
        13.1.1.4 Local length and global distance
    13.1.2 Fast AP algorithm
        13.1.2.1 Coarsening phase
        13.1.2.2 Exemplar-clustering phase
        13.1.2.3 Refinement phase
    13.1.3 Fast two-stage spectral clustering framework
        13.1.3.1 Fast two-stage AP algorithm
        13.1.3.2 Determine the number of representative exemplars
        13.1.3.3 Sampling phase
        13.1.3.4 Fast-weighted approximation spectral clustering phase
        13.1.3.5 Robustness
        13.1.3.6 Fast nearest-neighbors research
13.2 Experiments and analysis
    13.2.1 Experiments on the method based on affinity propagation
        13.2.1.1 Synthetic data sets
        13.2.1.2 Compared algorithms and parameter settings
        13.2.1.3 Vector-based clustering
        13.2.1.4 Evaluation metrics
        13.2.1.5 Experimental results
        13.2.1.6 Graph-based clustering
    13.2.2 Experiments on the method based on density-weighting
        13.2.2.1 Intertwined spirals data set
        13.2.2.2 Real-world data sets
        13.2.2.3 Compared algorithms
        13.2.2.4 Algorithm performances
        13.2.2.5 Spectral embedding
References


In this section, we introduce a novel fast affinity propagation (FAP) clustering approach. FAP simultaneously considers both the local and the global structure information contained in data sets, and is a high-quality multilevel graph-partitioning method that can implement both vector-based and graph-based clustering. Although spectral clustering can produce high-quality clusterings on small data sets, its computational cost makes clustering large data sets unfeasible. The limitation of affinity propagation (AP) is that it is difficult to determine the "preference" parameter value that leads to the optimal clustering solution. These problems limit the scope of application of these two methods. Therefore, we also develop a fast two-stage spectral clustering framework with local and global consistency. In this framework, we introduce a fast density-weighted low-rank approximate spectral clustering (FWASC) algorithm to address the above problems. The algorithm is a high-quality graph-partitioning method that simultaneously considers the local and global structure information contained in the data set. Specifically, we first propose a new fast two-stage AP (FTSAP) algorithm, which coarsens the input sparse graph and generates a small number of representative exemplars; this is a simple and effective sampling scheme.

13.1 The framework of fast clustering methods based on affinity propagation and density weighting

13.1.1 Related works

Before we go into the details of our FAP approach, we first briefly review some works that are closely related to this chapter.

13.1.1.1 AP clustering

AP takes as input a collection of real-valued similarities among all data points, where the similarity s(i, k) indicates how well the data point x_k is suited to be the cluster center for the data point x_i. The similarity of each pair of data points is set to the negative squared Euclidean distance: for points x_i and x_k,

s(i, k) = −‖x_i − x_k‖².   (13.1)

AP can be viewed as searching over valid configurations of the exemplars Z = {z(x_1), …, z(x_n)} to maximize the sum of similarities between each data point and its exemplar:

arg max_Z Σ_{i=1}^{n} s{x_i, z(x_i)}.   (13.2)
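As a rough illustration of Eqs. (13.1)–(13.2), the following minimal sketch runs AP with negative squared Euclidean similarities and the median-of-similarities preference. It uses scikit-learn's AffinityPropagation on toy random data rather than the authors' implementation; the data and parameter values are illustrative assumptions.

```python
# Minimal sketch of AP clustering with the similarity/preference setup of Eqs. (13.1)-(13.2).
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(300, 2)        # toy data: 300 points in 2-D

# s(i, k) = -||x_i - x_k||^2, Eq. (13.1)
S = -pairwise_distances(X, metric="sqeuclidean")

# the preference is commonly set to the median of the pairwise similarities
preference = np.median(S)

ap = AffinityPropagation(affinity="precomputed", preference=preference,
                         damping=0.9, random_state=0).fit(S)
exemplars = ap.cluster_centers_indices_          # exemplars maximizing Eq. (13.2)
labels = ap.labels_
print(len(exemplars), "exemplars found")
```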

The preferences P are important parameters in AP and influence the final number of clusters: the larger P is, the more exemplars are identified, and vice versa. The input preferences are usually set to the median of the pairwise similarities. However, in many cases these values only lead to a suboptimal clustering solution, since the underlying manifold structure of the data set is not considered.

13.1.1.2 Spectral clustering

Spectral clustering is a class of methods based on eigendecompositions of graph affinity matrices [1], and it can stably detect nonconvex patterns and linearly nonseparable problems. Let the matrix W ∈ ℝ^{n×n} denote the affinity matrix of the graph G = (V, E), with nodes V representing the n data points and edges E whose weights capture pairwise similarities between data points. Let D be a diagonal matrix whose i-th diagonal element d_i is the degree of vertex v_i ∈ V,

d_i = Σ_{j=1}^{n} w_{i,j}.   (13.3)

The goal of spectral clustering is to partition the data points into k disjoint clusters such that each point x_i belongs to one and only one cluster. Ng et al. [2] provided a k-way partitioning method: the leading eigenvectors of the normalized affinity matrix induce an embedding of the data points in a low-dimensional subspace, and K-means is then used to assign the labels of all data points.

13.1.1.3 Nyström method

In this chapter, we focus on a class of sampling-based approximation techniques that have been widely used in machine learning and data mining. Among them, a well-known approach is the Nyström method, which was presented to speed up kernel machines [3] and has been used in applications ranging from manifold learning to spectral clustering. The Nyström method originates from the numerical treatment of the integral equation

∫₀¹ p(y) k(x, y) φ(y) dy = λ φ(x),   (13.4)

where k(·,·) is the kernel function, which is usually positive semidefinite; p(·) is the underlying probability density function; and λ and φ(·) are the eigenvalue and eigenfunction of the kernel matrix K, respectively. Given landmark points Z = {z_i}_{i=1}^{m}, the integral can be approximated by

(1/m) Σ_{j=1}^{m} W(z_i, z_j) φ(z_j) = λ φ(z_i),   i = 1, 2, …, m,   (13.5)

where W(z_i, z_j) is the kernel matrix of the landmark points and φ(z_i) ∈ ℝ^m is the corresponding eigenvector. These may then be used to form an approximation φ(x) to the eigenfunctions of K as follows:

φ(x) ≈ (1/(mλ)) Σ_{j=1}^{m} k(x, z_j) φ(z_j).   (13.6)

Let K = U_K Σ_K U_K^T, where Σ_K contains the eigenvalues of K and U_K the associated eigenvectors. Suppose we randomly and uniformly sample m ≪ n columns of K. Let W be the m×m matrix consisting of the intersection of these m columns with the corresponding m rows of K, and let K be partitioned as

K = [ W  A^T ;  A  B ],   and   C = [ W ;  A ].   (13.7)

The Nyström method uses W and C to approximate K as follows:

K ≈ K̃ = C W_l^+ C^T,   (13.8)

where W_l is the best rank-l approximation of W, that is, W_l = arg min_{rank(S)=l} ‖W − S‖_F, and W_l^+ denotes the pseudo-inverse of W_l. Let W = U_W Σ_W U_W^T; the approximate eigenvalues and eigenvectors of K are [3]

Σ_K ≈ (n/m) Σ_W,   and   U_K ≈ √(m/n) C U_W Σ_W^{−1}.   (13.9)

If l ≤ m eigenvalues and eigenvectors are needed, the time complexity of this approach is O(m³ + nml).

13.1.1.4 Local length and global distance

A meaningful measure of distance between pairs of data points plays an important role in clustering approaches. The idea of incorporating both local and global information into label prediction is inspired by recent work on semisupervised learning [4], which rests on two assumptions: (1) nearby points are likely to have the same label; and (2) points on the same structure (usually referred to as a cluster) are likely to have the same label. Based on low-density separation in semisupervised classification [5], we present a local length and a new global distance.

Definition 1. The pairwise local length between two data points of X is defined as

D_L(x_i, x_j) ≜ e^{ρ·dist(x_i, x_j)} − 1,   (13.10)

where dist(x_i, x_j) is the Euclidean distance between x_i and x_j, and ρ > 0 is the flexing factor. The local length between two points can be elongated or shortened by adjusting the flexing factor ρ. Based on the local length, we also define a new distance metric, called the global distance, which measures the distance between a pair of points by searching for the shortest path in the sparse graph.

Definition 2. The pairwise global distances among the sampling exemplars Y are defined as

D_G(y_i, y_j) = min_{p ∈ P_{i,j}} Σ_{k=1}^{|p|−1} (D_L(p_k, p_{k+1}) + 1),   i, j = 1, 2, …, m_1,   (13.11)

where D_L(p_k, p_{k+1}) is the local length between nodes p_k and p_{k+1}, P_{i,j} is the set of all paths connecting nodes y_i and y_j in the sparse graph of all data points, and m_1 is the number of sampling exemplars Y. The global distance satisfies the properties of a distance metric, i.e., D(x_i, x_j) = D(x_j, x_i); D(x_i, x_j) ≥ 0; D(x_i, x_j) ≤ D(x_i, x_k) + D(x_k, x_j) for all x_i, x_j, x_k; and D(x_i, x_j) = 0 if and only if x_i = x_j. As a result, the global distance can measure the geodesic distance along the manifold: it relatively elongates the distances among data points lying on different manifolds while relatively shortening the distances among data points lying on the same manifold. Moreover, the global distance is robust against noise and outliers and can reflect the underlying manifold structures of data sets.
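As a rough illustration, the sketch below computes local lengths on a t-nearest-neighbor graph and then global distances as shortest paths over that graph. It computes the distances among all points rather than only among the sampled exemplars, and the choices of t, ρ, and the SciPy/scikit-learn routines are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the local length (Eq. 13.10) and global distance (Eq. 13.11) on a sparse t-NN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def global_distances(X, t=7, rho=8.0):
    # t-nearest-neighbor graph with Euclidean edge lengths
    G = kneighbors_graph(X, n_neighbors=t, mode="distance", include_self=False)
    G = G.maximum(G.T)                          # symmetrize the sparse graph
    # edge weight D_L + 1 = (exp(rho * dist) - 1) + 1 = exp(rho * dist)
    G.data = np.exp(rho * G.data)
    # global distance = shortest path over the sparse graph (inf if disconnected)
    return shortest_path(G, method="D", directed=False)

X = np.random.RandomState(1).rand(200, 2)       # toy data
D_G = global_distances(X)                       # pairwise global distances
```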

13.1.2 Fast AP algorithm

In this section, we propose a novel fast affinity propagation (FAP) algorithm, which is a high-quality multilevel graph-partitioning method that considers both the local and the global information contained in data sets. The framework of our FAP approach is similar to the multilevel algorithms of Karypis and Kumar [6], Dhillon et al. [7], and Wang and Zhang [8]. Fig. 13.1 shows a graphical overview of our multilevel framework. Below, we present the FAP algorithm in terms of its three phases: coarsening, exemplar-clustering, and refinement.

13.1.2.1 Coarsening phase

We propose a fast sampling algorithm as the coarsening phase of our FAP algorithm. The original AP algorithm takes the full similarity matrix to perform the information propagation. At each iteration step, there are generally n² data pairs whose responsibility and availability values need to be calculated, resulting in a computational complexity of O(Tn²), where T is the number of iterations. This greatly affects the computational cost of the algorithm, especially when the number of data points is large.


Figure 13.1 Three phases of our FAP algorithm (for k = 3).

In our work, we first construct a sparse graph G = (V, E), where the vertices V denote the data points and the edges E contain part of the pairwise edges between data points. It has been pointed out in Ref. [9] that the sparsity of the constructed graph leads to faster calculation, since the information propagation only needs to be performed on the existing edges.

13.1.2.1.1 Fast sampling algorithm

In this part, a fast sampling algorithm based on two observations is proposed. First, we assume that two data points that are far apart will not choose each other as exemplars, so adding or omitting an edge between such points does not change the final result; the algorithm can therefore be accelerated by running on a sparse graph. Second, data points that act as good exemplars locally are natural candidates for global exemplars [10]. We therefore adopt a two-stage strategy to boost the original AP algorithm. In Stage I, we use a coarsening procedure to quickly obtain local exemplars in the sparse graph with the sparse AP algorithm, and we empirically set the number of iterations to T_1 = 20. In Stage II, we only consider the exemplars from Stage I as candidates for the final representative exemplars. We should highlight that the pairwise distances among data points in Stage I are the local lengths, so only the local structure information contained in the data set is considered, whereas in Stage II the pairwise distances among the first-stage exemplars Y are the global distances, which reflect the underlying geometric structure of the data manifold. The complete fast sampling algorithm (FS) is listed in Algorithm 1. We apply the proposed FS algorithm to a simple toy data set to illustrate its efficiency. This toy data set consists of 3000 data points, as shown in Fig. 13.2A, and we aim to find 35 final representative exemplars among them. The classical AP algorithm using the full similarity matrix took 261.42 seconds to achieve the final result, as shown in Fig. 13.2B.

Algorithm 1. Fast sampling algorithm (FS)
Input: data points X = {x_1, …, x_n}, the preferences P_2 ∈ ℝ^{m_1×1} for Stage II, the size of the neighborhood t, and the flexing factor ρ.
Output: final representative exemplars Z = {z_1, …, z_{m_2}}.
1. Construct a t-nearest-neighbor sparse graph.
2. In Stage I, coarsen the sparse graph using the sparse AP algorithm and identify m_1 exemplars Y, where the preferences P_1^0 are set to the median of the sparse pairwise similarities.
3. In Stage II, first compute the global distances among the m_1 exemplars Y by Eq. (13.11).
4. Refine the candidate exemplars to obtain the final m_2 representative exemplars using the classical AP algorithm, where the parameters P_2 are initialized with the median of the similarities among the candidate exemplars, P_2^0.
5. If m_2 is too large, let P_2 ← P_2 + P_2^0 and rerun step 4, until m_2 reaches a relatively moderate value.
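The sketch below outlines the two-stage sampling idea of Algorithm 1: Stage I clusters on plain negative squared Euclidean similarities (local information only), and Stage II reruns AP on the Stage-I exemplars using the global distances D_G of Eq. (13.11). scikit-learn's dense AffinityPropagation stands in for the sparse AP solver used in the text, and taking the negative global distance as the Stage-II similarity is an assumption made for illustration.

```python
# Rough sketch of the two-stage sampling behind Algorithm 1 (FS).
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

def two_stage_sampling(X, D_G):
    # Stage I: local exemplars from Euclidean-based similarities
    S1 = -pairwise_distances(X, metric="sqeuclidean")
    ap1 = AffinityPropagation(affinity="precomputed", preference=np.median(S1),
                              damping=0.9, random_state=0).fit(S1)
    candidates = ap1.cluster_centers_indices_
    # Stage II: refine the candidates with manifold-aware (global-distance) similarities
    S2 = -D_G[np.ix_(candidates, candidates)]
    ap2 = AffinityPropagation(affinity="precomputed", preference=np.median(S2),
                              damping=0.9, random_state=0).fit(S2)
    exemplars = candidates[ap2.cluster_centers_indices_]
    return candidates, exemplars
```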

Figure 13.2 A toy data set. "*" indicates identified exemplars and colors indicate clusters: (A) original data; (B) AP, 261.42 s; (C) Stage I of FS, 13.09 s; (D) Stage II of FS, 8.12 s.

Our FS algorithm with initial neighborhood size t = 50 finds 351 candidates in Stage I in 13.09 seconds, and 35 final representative exemplars in Stage II in 8.12 seconds, as shown in Fig. 13.2C and D, respectively. The proposed FS algorithm is thus nearly 10 times faster than the classical AP algorithm. However, both Stage II of FS and the AP algorithm share the limitation that it is hard to determine the value of the "preference" parameter, which can lead to a suboptimal clustering solution.

13.1.2.1.2 Determine the number of representative exemplars

The number of exemplars Y identified in Stage I, m_1, may be relatively large. How to determine the final representative exemplars from these candidates is very important, because the number of final representative exemplars m_2 affects the ultimate clustering performance. If m_2 is too small, the resulting FS algorithm may not be accurate; if m_2 is too large, the local structures of the data set are hidden and the results are also poor. Therefore, a relatively moderate value of m_2 gives better results. Here, we apply a grid-search style, scanning the search space of the parameters P_2 to find a suboptimal number. The method only scans the search space of P_2 over a small range of values, generally {9·P_2^0, 8·P_2^0, …, P_2^0}, where P_2^0 is the median of the pairwise similarities among the final representative exemplars.

13.1.2.2 Exemplar-clustering phase

During the coarsening phase, we run the proposed FS algorithm to obtain a small number of final representative exemplars, which form a smaller graph G_2 for the exemplar-clustering phase at the third level of FAP. In this phase, we propose a new density-weighted spectral clustering method to cluster the final representative exemplars [2]. Since the number of samples in each coarsened group represented by the corresponding exemplar is different, original spectral clustering methods are no longer appropriate.

Definition 3. The density-weighted affinity matrix is defined as

W(i, j) = (|S_i||S_j| / n²) · exp(−d_G²(z_i, z_j) / (2σ_i σ_j)),   i, j = 1, …, m_2,   (13.12)

where d_G(z_i, z_j) is the global distance between the representative exemplars z_i and z_j in the original sparse graph of all data points, |S_i| (i = 1, …, m_2) is the size of the group corresponding to the representative exemplar z_i, and σ_i is the local scale,

σ_i = d_G(z_i, s_K),   i = 1, …, m_2,   (13.13)

where s_K is the K-th neighbor of exemplar z_i. In this work, we use the single value K = ⌊m_2/k²⌋, where k is the number of clusters [11]. The complete density-weighted spectral clustering (DWSC) method is shown in the exemplar-clustering phase of Algorithm 2. As shown later, the resulting FAP algorithm is very robust against noise.

Algorithm 2. Fast AP clustering algorithm (FAP)
Input: data points X = {x_1, …, x_n}, the preferences P_2 ∈ ℝ^{m_1×1} for Stage II of the FS algorithm, the size of the neighborhood t, and the flexing factor ρ.
Output: cluster set C = {C_1, …, C_k}.
1. Coarsening phase: apply Algorithm 1 to identify the final representative exemplars Z = {z_1, …, z_{m_2}}, and count the cluster sizes |S_i|, i = 1, …, m_2, of the groups corresponding to the representative exemplars.
2. Exemplar-clustering phase (DWSC):
   - Compute the density-weighted affinity matrix W ∈ ℝ^{m_2×m_2} for the m_2 representative exemplars Z, given in Eq. (13.12), and the degree matrix D, given in Eq. (13.3).
   - Conduct the eigenvalue decomposition D^{−1/2} W D^{−1/2} φ_Z = λ_Z φ_Z to find the eigenvectors φ_Z ∈ ℝ^{m_2×k} corresponding to the k largest eigenvalues λ_Z, and form the matrix U ∈ ℝ^{m_2×k} by normalizing each row vector of φ_Z.
   - Execute the K-means algorithm on the m_2 row vectors of U, and assign z_i to cluster C_l iff the i-th row vector of U is in the l-th cluster.
3. Refinement phase: obtain the assignments of all data points through the labels of their corresponding exemplars.
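A compact sketch of the DWSC phase follows: it builds the density-weighted affinity of Eq. (13.12) from the exemplars' global distances, group sizes, and local scales, embeds with the normalized affinity D^{−1/2}WD^{−1/2}, and runs K-means on the row-normalized eigenvectors. Variable names and the NumPy/scikit-learn routines are illustrative assumptions rather than the authors' code.

```python
# Sketch of the exemplar-clustering (DWSC) phase of Algorithm 2.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def dwsc(DG_Z, sizes, k):
    """DG_Z: m2 x m2 global distances between exemplars; sizes: |S_i| for each exemplar."""
    m2, n = len(sizes), np.sum(sizes)
    K_nn = max(int(m2 / k ** 2), 1)                          # K-th neighbor for the local scale
    sigma = np.sort(DG_Z, axis=1)[:, K_nn]                   # sigma_i = d_G(z_i, s_K), Eq. (13.13)
    W = (np.outer(sizes, sizes) / n ** 2) * np.exp(-DG_Z ** 2 / (2 * np.outer(sigma, sigma)))
    np.fill_diagonal(W, 0.0)                                 # Eq. (13.12)
    d = W.sum(axis=1)
    M = (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]      # D^{-1/2} W D^{-1/2}
    _, vecs = eigh(M)                                        # eigenvalues in ascending order
    U = vecs[:, -k:]                                         # eigenvectors of the k largest ones
    U /= np.linalg.norm(U, axis=1, keepdims=True)            # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```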

13.1.2.3 Refinement phase

During the final phase of the proposed FAP approach, all samples are assigned labels through their corresponding representative exemplars. The clustering in G_i induces a clustering in G_{i−1} as follows: if an exemplar in G_i is in cluster C_j, then all samples in G_{i−1} formed from that exemplar are in cluster C_j. A key aspect of our FAP approach is how to select the subset of the data set; FAP transfers the main computational burden from a kernel eigen-analysis to a combinatorial task of data sampling [12]. The complete fast AP algorithm (FAP) is listed in Algorithm 2, and Fig. 13.3 gives an intuitive illustration of the FAP approach.

Figure 13.3 Clustering results on a two-moon data set with two bridge points using the proposed FAP. (A) A toy data set with two bridge points, which are marked by red rectangles; (B) the 7-nearest-neighbor graph of the data set; (C) the final representative exemplars identified by the fast sampling algorithm, where "*" indicates identified exemplars and colors indicate clusters; (D) clustering results produced by the proposed FAP approach with the flexing factor ρ = 8.

13.1.3 Fast two-stage spectral clustering framework

In this section, we present a novel fast two-stage spectral clustering framework with local and global consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (usually referred to as a cluster) are likely to have the same label. The idea of incorporating both local and global information into label prediction is inspired by recent work on semisupervised learning. Below, we describe our fast spectral clustering algorithm in terms of its two phases: sampling and fast density-weighted Nyström approximation spectral clustering (DWASC).

13.1.3.1 Fast two-stage AP algorithm

In this part, we propose a fast two-stage AP sampling algorithm (FTSAP) based on two observations. First, we assume that two data points that are far apart will not choose each other as exemplars, so whether or not an edge is added between them does not change the final result; the algorithm can therefore be accelerated by running on a sparse graph. Second, data points that serve as good exemplars locally are natural candidates for exemplars globally. We therefore use a two-stage strategy to boost the original AP algorithm.

In Stage I, we adopt a coarsening procedure to quickly obtain local exemplars on the sparse graph using the sparse AP algorithm; in Stage II, we only consider the exemplars from Stage I as candidates for the final representative exemplars. We should highlight that the pairwise distances between all data points in Stage I are the Euclidean distances, whereas in Stage II the pairwise distances between the first-stage exemplars Y are the global distances, which better reflect the underlying geometric structure of the data manifold embedded in the high-dimensional space.

Definition 1. The global pairwise distances between the candidate exemplars Y identified in the first stage and all data points X are defined as

d_G(y_i, x_j) = min_{p ∈ P_{i,j}} Σ_{k=1}^{|p|−1} d(p_k, p_{k+1}),   i = 1, …, m_1;  j = 1, …, n,   (13.14)

where d(p_k, p_{k+1}) is the Euclidean distance between nodes p_k and p_{k+1}, and P_{i,j} is the set of all paths connecting nodes y_i and x_j in the sparse graph of all data points. Global distances sometimes fail because of points that connect different data clusters (usually called bridge points) [13]. In the next section, we apply a method that preprocesses the data set by eliminating these bridge points. Below, we present a nonuniform sampling algorithm, listed in Algorithm 3, which we call the fast two-stage AP algorithm (FTSAP). Here, we apply a simple toy data set to illustrate the efficiency of the proposed FTSAP algorithm. A total of 3000 data points are randomly sampled from a 2-D rectangle, as shown in Fig. 13.4A. We aim at finding 18 final representative exemplars among them. The classical AP algorithm using the full similarity matrix took 363.93 seconds to achieve the final result, as shown in Fig. 13.4B. The proposed FTSAP algorithm with initial neighborhood size t = 50 produced 285 candidates in Stage I in 22.35 seconds, and 18 final representative exemplars in Stage II in 9.50 seconds, as shown in Fig. 13.4C and D, respectively. The proposed FTSAP algorithm is thus much faster than the classical AP algorithm.

Figure 13.4 A toy data set. "*" indicates identified exemplars and colors indicate clusters: (A) original data; (B) AP, 263.93 s; (C) FTSAP Stage I, 22.35 s; (D) FTSAP Stage II, 9.50 s.

13.1.3.2 Determine the number of representative exemplars

The number of candidate exemplars identified in the first stage, m_1, may be relatively large, and those candidate exemplars reflect the intrinsic local structure information. How to determine the number of final representative exemplars from those candidates is very important, as this value affects the ultimate clustering performance. If m_2 is too small, the resulting FTSAP algorithm may not be accurate; if m_2 is too large, the local structure of the data set is hidden and the results are also poor. Therefore, a relatively moderate value of m_2 gives better results. Here we apply a grid-search style, scanning the search space of the parameters P_2 to find the optimal number. The method only scans the search space of P_2 over a small range of values, generally {9·P_2^0, 8·P_2^0, …, P_2^0}, where P_2^0 is the median of the pairwise similarities between the final representative exemplars (Fig. 13.4).

13.1.3.3 Sampling phase

During the sampling phase, the FTSAP algorithm is used to produce a small number of representative exemplars, which form a much smaller subgraph. Since the local information is mainly considered in Stage I of the FTSAP algorithm, we apply the common Euclidean distance there; in Stage II, the pairwise distances between the candidate exemplars identified in the first stage are global distances, which reflect the underlying geometric structure of the data set.

13.1.3.4 Fast-weighted approximation spectral clustering phase

The Nyström method is an effective way to generate low-rank matrix approximations and has been applied in many large-scale learning applications [14]. A key aspect of the Nyström method is how to select the subset of a data set, and the Nyström extension

would transfer the main computational burden from a kernel eigen-analysis to a combinatorial task of data sampling [12]. In this part, the Nyström method is extended to a more general case based on the integral equation. The original Nyström method assigns equal importance to all the chosen exemplars. Here, we explicitly introduce the density function p(·) evaluated at the representative exemplars Z = {z_i}_{i=1}^{m_2}, which are chosen by the proposed FTSAP algorithm (Algorithm 3):

p(z_i) = |S_i| / n,   i = 1, …, m_2,   (13.15)

where |S_i| (i = 1, …, m_2) is the size of the group corresponding to the representative exemplar z_i.

Then the integral equation can be approximated as

λφ(x) = ∫₀¹ p(y) k(x, y) φ(y) dy ≈ Σ_{i=1}^{m_2} p(z_i) k(x, z_i) φ(z_i).   (13.16)

Here the overbar denotes entities corresponding to this weighted version. By choosing x at the representative exemplars, we have

(1/m_2) Σ_{j=1}^{m_2} W̄(z_i, z_j) φ̄(z_j) = λ̄ φ̄(z_i),   i = 1, …, m_2,   (13.17)

Algorithm 3. Fast two-stage AP algorithm (FTSAP)
Input: data points X = {x_1, …, x_n}, the preferences P_1^0 ∈ ℝ^{n×1} for Stage I, the number of iterations T_1 = 20, and the preferences P_2^0 ∈ ℝ^{m_1×1} for Stage II.
Output: final representative exemplars Z = {z_1, …, z_{m_2}}.
1. Construct a sparse graph: a t-nearest-neighbor graph or an ε-neighborhood graph.
2. In Stage I, quickly coarsen the sparse graph using the sparse AP algorithm and identify m_1 candidate exemplars Y, where the parameters P_1^0 are computed from the median of the sparse pairwise similarities.
3. In Stage II, compute the global distances among the m_1 candidate exemplars Y by Eq. (13.14).
4. Refine the candidate exemplars to obtain the final m_2 representative exemplars using the classical AP algorithm, where the parameters P_2^0 are initialized with the median of the similarities among the candidate exemplars Y.
5. If m_2 is too large, let P_2 ← P_2 + P_2^0 and rerun step 4, until m_2 reaches a relatively moderate value.

where W̄ ∈ ℝ^{m_2×m_2} is the density-weighted affinity matrix computed at the representative exemplars.

Definition 2. The density-weighted affinity matrix is defined as

W̄(i, j) = (|S_i||S_j| / n²) · exp(−d_G²(z_i, z_j) / (2σ_i σ_j)),   i, j = 1, …, m_2,   (13.18)

where d_G(z_i, z_j) is the global distance between the representative exemplars z_i and z_j on the sparse graph of all data points, and σ_i is the local scale,

σ_i = d_G(z_i, s_T),   i = 1, …, m_2,   (13.19)

where s_T is the T-th neighbor of exemplar z_i. In this work, we use the single value T = ⌊m_2/k²⌋, where k is the number of clusters [12]. The extrapolation matrix Ā is given by

Ā_{i,j} = p(z_j) · exp(−d_G²(x_i, z_j) / (2σ_i σ_j)),   i = m_2+1, …, n;  j = 1, …, m_2.   (13.20)

To obtain the diagonal degree matrix D̄, we apply the following procedure [18]:

K̄1 = [ W̄1_{m_2} + Ā^T 1_{n−m_2} ;  Ā1_{m_2} + ĀW̄^{−1}Ā^T 1_{n−m_2} ] = [ w̄ + ā_1 ;  ā_2 + ĀW̄^{−1}ā_1 ],   (13.21)

where w̄, ā_1, and ā_2 represent the row sums of W̄, Ā^T, and Ā, respectively, and 1 is a column vector of ones. Then

D̄ = diag([ w̄ + ā_1 ;  ā_2 + ĀW̄^{−1}ā_1 ]).   (13.22)

Perform the eigenvalue decomposition

D̄_{1:m_2,1:m_2}^{−1/2} W̄ D̄_{1:m_2,1:m_2}^{−1/2} φ_Z = λ_Z φ_Z,   (13.23)

where φ_Z ∈ ℝ^{m_2} and λ_Z are the eigenvector and corresponding eigenvalue, respectively. Find (φ_Z)_1, (φ_Z)_2, …, (φ_Z)_k, the eigenvectors of D̄_{1:m_2,1:m_2}^{−1/2} W̄ D̄_{1:m_2,1:m_2}^{−1/2} associated with the k largest eigenvalues, form the matrix Ū_Z = [(φ_Z)_1 (φ_Z)_2 … (φ_Z)_k] ∈ ℝ^{m_2×k} by stacking the eigenvectors in columns, and let L = diag((λ_Z)_1, …, (λ_Z)_k).

The eigenvectors of the full data set X can thus be approximated as

U_K = [ D̄_{1:m_2,1:m_2}^{−1/2} W̄ D̄_{1:m_2,1:m_2}^{−1/2} ;  D̄_{m_2+1:n,m_2+1:n}^{−1/2} Ā D̄_{1:m_2,1:m_2}^{−1/2} ] Ū_Z L^{−1} = [ E ;  F ] Ū_Z L^{−1},   (13.24)

where E = D̄_{1:m_2,1:m_2}^{−1/2} W̄ D̄_{1:m_2,1:m_2}^{−1/2} and F = D̄_{m_2+1:n,m_2+1:n}^{−1/2} Ā D̄_{1:m_2,1:m_2}^{−1/2}. Although the Nyström eigenvectors U_K can be approximated by extrapolating the eigenvectors of E, they are not orthonormal. Let

R = E + E^{−1/2} F^T F E^{−1/2},   (13.25)

with eigendecomposition R = Ū_R L_R Ū_R^T. It can be proved that if E is positive definite, then

V = [ E ;  F ] E^{−1/2} (Ū_R)_{:,1:k} (L_R)_{1:k,1:k}^{−1/2}   (13.26)

has orthonormal columns (i.e., V^T V = I).
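A compact numerical sketch of Eqs. (13.21)–(13.26) follows. It assumes the exemplar affinity W̄ and the extrapolation matrix Ā have already been built as in Eqs. (13.18)–(13.20), and that E (and hence R) is positive definite, as stated in the text; the names and routines are illustrative.

```python
# Sketch of the density-weighted Nystrom approximation with orthogonalization (Eqs. 13.21-13.26).
import numpy as np
from scipy.linalg import eigh, sqrtm, pinv

def nystrom_embedding(W_bar, A_bar, k):
    # approximate degrees of the full affinity matrix, Eqs. (13.21)-(13.22)
    d1 = W_bar.sum(1) + A_bar.sum(0)
    d2 = A_bar.sum(1) + A_bar @ pinv(W_bar) @ A_bar.sum(0)
    Dw, Da = 1.0 / np.sqrt(d1), 1.0 / np.sqrt(d2)
    E = Dw[:, None] * W_bar * Dw[None, :]                    # top block of Eq. (13.24)
    F = Da[:, None] * A_bar * Dw[None, :]                    # bottom block of Eq. (13.24)
    E_isqrt = np.real(pinv(sqrtm(E)))                        # E^{-1/2}
    R = E + E_isqrt @ (F.T @ F) @ E_isqrt                    # Eq. (13.25)
    lam, U_R = eigh(R)
    idx = np.argsort(lam)[::-1][:k]                          # k largest eigenvalues
    V = np.vstack([E, F]) @ E_isqrt @ U_R[:, idx] @ np.diag(lam[idx] ** -0.5)  # Eq. (13.26)
    V /= np.linalg.norm(V, axis=1, keepdims=True)            # row-normalize for the embedding
    return V                                                 # n x k spectral embedding
```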

Under the fast two-stage spectral clustering framework with local and global consistency, we design a two-stage spectral clustering approach, listed in Algorithm 4, which we call the fast density-weighted low-rank approximation spectral clustering (FWASC). Figs. 13.5 and 13.6 give an intuitive illustration of the proposed FWASC algorithm.

13.1.3.5 Robustness

In Stage I of the proposed FTSAP algorithm, the computation of the global distance can be topologically unstable, depending on the neighborhood size used to construct the neighborhood sparse graph.

Figure 13.5 Overview of the two-stage spectral clustering: n input points → FTSAP sampling (m_2 exemplars) → DWASC algorithm (k clusters of the original points).

Figure 13.6 Clustering results on the two-moon data set with two bridge points using the proposed FWASC. (A) Toy data set with two bridge points, which are marked by red rectangles; (B) the 7-nearest-neighbor graph of the data set; (C) the final representative exemplars identified by the FTSAP algorithm; (D) clustering result of the proposed FWASC algorithm. The bridge points cause a poor clustering result.

A relatively large neighborhood size might result in short-circuiting that destroys the manifold structure of the data points (see, for example, Fig. 13.6B). Therefore, the existence of bridge points biases the final clustering results, and it is better to preprocess the data set by discarding these bridge points. Since bridge points usually reside in low-density regions, the local shape and size of their neighborhoods differ from those of the other samples, and they are unlikely to be identified as final exemplars. Here, we apply the following confusion rate of all data points to discriminate the bridge points from the others. The confusion rate is defined as

γ_i = Σ_{j=1}^{min(dim(x_i), t)} λ_i^j / λ̄_i^j,   i = 1, …, n,   (13.27)

where λ_i^j is the j-th eigenvalue of the local covariance matrix C_i (Eq. 13.28) constructed from the t nearest neighbors of the data point x_i.

Algorithm 4. Fast density-weighted low-rank approximation spectral clustering (FWASC)
Input: data points X = {x_1, …, x_n}, the number of clusters k < m_2, and the number of representative exemplars m_2 ≪ n.
Output: cluster set C = {C_1, …, C_k}.
Sampling phase:
- Compute the confusion rate of all data points to detect bridge nodes; if there are bridge nodes, discard them from the sparse graph.
- Apply the FTSAP algorithm to produce the final representative exemplars Z = {z_1, …, z_{m_2}}, and count the cluster sizes |S_i|, i = 1, …, m_2, of the groups corresponding to the representative exemplars.
Density-weighted approximation spectral clustering (DWASC) phase:
1. Compute the density-weighted affinity matrix W̄ ∈ ℝ^{m_2×m_2} for the m_2 representative exemplars Z, given in Eq. (13.18), the corresponding extrapolation matrix Ā, given in Eq. (13.20), and the degree matrix D̄, given in Eqs. (13.21) and (13.22).
2. Calculate E = D̄_{1:m_2,1:m_2}^{−1/2} W̄ D̄_{1:m_2,1:m_2}^{−1/2} and F = D̄_{m_2+1:n,m_2+1:n}^{−1/2} Ā D̄_{1:m_2,1:m_2}^{−1/2}.
3. Construct R = E + E^{−1/2} F^T F E^{−1/2}, given in Eq. (13.25).
4. Perform the eigenvalue decomposition R = Ū_R L_R Ū_R^T, ensuring that the eigenvalues in L_R are in decreasing order.
5. Compute V using Eq. (13.26), and form the matrix U from V by renormalizing each row of V to obtain the spectral embedding.
6. Execute K-means clustering on the n row vectors of U, and assign x_i to cluster C_l iff the i-th row vector of U is in the l-th cluster.

The local covariance matrix is

C_i = Σ_{x_j ∈ N(x_i)} (x_j − x_i)(x_j − x_i)^T,   (13.28)

where N(x_i) denotes the t nearest neighbors of x_i, λ̄_i^j = (Σ_{x_k ∈ N(x_i)} λ_k^j) / t, and dim(x_i)

is the dimensionality of x_i. In real-world applications, a threshold δ is determined, the points satisfying γ_i > δ are discarded, and the remaining points are used for clustering. Here, we use three-quarters of the largest confusion rate as the threshold. We must then recompute the neighborhood graph after eliminating the few nodes with extremely large confusion rates. For example, Fig. 13.7A shows the confusion rates of the nodes of the above two-moon data set with seven nearest neighbors.
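The sketch below implements one reading of the confusion-rate test of Eqs. (13.27)–(13.28). The source is garbled about whether the ratio is λ_i^j/λ̄_i^j or its reciprocal, so the orientation used here, the value t = 7, and the 3/4-of-maximum threshold from the text are assumptions for illustration.

```python
# Sketch of the confusion rate (Eqs. 13.27-13.28) used to flag bridge points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def confusion_rates(X, t=7):
    n, dim = X.shape
    _, idx = NearestNeighbors(n_neighbors=t + 1).fit(X).kneighbors(X)   # idx[:, 0] is the point itself
    p = min(dim, t)
    lam = np.zeros((n, p))
    for i in range(n):
        Ni = X[idx[i, 1:]] - X[i]                 # t nearest neighbors, centered at x_i
        C_i = Ni.T @ Ni                           # local covariance matrix, Eq. (13.28)
        lam[i] = np.sort(np.linalg.eigvalsh(C_i))[::-1][:p]
    lam_bar = lam[idx[:, 1:]].mean(axis=1)        # neighborhood-averaged eigenvalues
    return (lam / np.maximum(lam_bar, 1e-12)).sum(axis=1)   # gamma_i, Eq. (13.27)

gamma = confusion_rates(np.random.RandomState(2).rand(500, 2))
bridge = gamma > 0.75 * gamma.max()               # candidate bridge points to discard
```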

Figure 13.7 Clustering results on the two-moon data set using the proposed FWASC after preprocessing. (A) The confusion rates of the data points, where the abscissa represents the data indices and the black line denotes the threshold δ = 8.995; (B) clustering result of our FWASC algorithm after eliminating the bridge points.

13.1.3.6 Fast nearest-neighbors research

The cost of naive nearest-neighbor computation is O(n²), so this computation is prohibitively expensive for large-scale, high-dimensional data sets. For such sets, we adopt a combination of random projections and spill trees to obtain approximate neighbors (Liu et al., 2005). Random projection readily handles high-dimensional data sets. In particular, the Johnson–Lindenstrauss lemma (Johnson and Lindenstrauss, 1984), stated below, shows that one can embed a data set of n points in a subspace of dimension O(log n) with little distortion of the pairwise distances. Random projection is a linear transform from x_i ∈ ℝ^D to x̄_i ∈ ℝ^d by a random matrix Q ∈ ℝ^{d×D}:

x̄_i = Q x_i,   i = 1, …, n,

(13.29)
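A minimal sketch of Eq. (13.29) with a Gaussian random matrix is given below; the 1/√d scaling, the seed, and the example dimensions are illustrative choices.

```python
# Sketch of the random projection of Eq. (13.29).
import numpy as np

def random_project(X, d, seed=0):
    D = X.shape[1]
    # Gaussian random matrix Q in R^{d x D}; the 1/sqrt(d) scaling preserves expected norms
    Q = np.random.RandomState(seed).normal(0.0, 1.0 / np.sqrt(d), size=(d, D))
    return X @ Q.T                                 # x_bar_i = Q x_i for every row x_i

X_low = random_project(np.random.rand(1000, 512), d=32)   # e.g., 512-D -> 32-D
```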

where Q is a random matrix whose entries q_{i,j} are independent random variables drawn from a specific distribution such as the normal distribution.

Theorem 1 (Johnson–Lindenstrauss lemma). Given 0 < ε < 1, a set X of n points in ℝ^D, and a target dimension d = O(ε^{−2} log n), there is a linear map f: ℝ^D → ℝ^d such that (1 − ε)‖x_i − x_j‖² ≤ ‖f(x_i) − f(x_j)‖² ≤ (1 + ε)‖x_i − x_j‖² for all x_i, x_j ∈ X.

g(d) = { 0.85,  0.2 < d ≤ 0.5;   0.5,  d < 0.2 },   (14.8)

where l(H, γ) is the normalized inertia [13], V(H) is the size of the region H, x̂ is the mean of x, and L_i is the i-th order normalized inertia of spheres. In SIMPLIcity, the color and texture features are emphasized rather than the shape features; hence the function g(d) is defined so that the shape distance in SIMPLIcity serves only as a "bonus." Note that, owing to the special characteristics of SAR images, the color feature in our method is the average pixel value. Consequently, the region distance should be changed to d_t(r_i, r_j′) = Σ_{k=1}^{4} ω_k (f_k − f_k′)²,

where f_1 is the mean pixel value of the region and f_2, f_3, and f_4 are the mean energy components of the region.

14.1.1.3.3 Improved IRM scheme

The IRM distance, successful as it is, does not perform as well as expected in measuring the similarity of SAR images because of their special characteristics, such as strong noise. The reasons are as follows. First, the consistency of the block-wise segmentation of SAR images is not very good, as shown in Fig. 14.1. In other words, the segmentation results of SAR images are too scattered for computing the region-based similarity, even though the IRM measure is robust to segmentation. In a natural image, the background and foreground can be separated easily; in an SAR image, however, there is no clear distinction between background and foreground, and the objects are diverse in type and huge in quantity. The goal of block-wise segmentation in SIMPLIcity is to separate the objects from the background, which suits natural images better than SAR images. Second, the shape features in SIMPLIcity are calculated from the block-wise segmentation; if the segmentation is not good enough, the shape features may play a negative role in computing similarities. Moreover, the color and texture features, rather than the shape features, are considered more important in SIMPLIcity, which may lead to an incomplete representation of an SAR image.


To overcome the above limitations of the IRM measure on SAR images, in this section we propose a new similarity distance based on the IRM measure, named improved integrated region matching (IIRM). Edge features are incorporated into IIRM to counteract the negative effect of poor block-wise segmentation, and these features can also serve as shape descriptors. The details are discussed below.

14.1.1.3.4 Edge regions calculation

An image I can be represented by two kinds of regions, i.e., texture regions R = {r_1, r_2, …, r_m} and edge regions R^E = {re_1, re_2, …, re_n}. The texture regions R are obtained by adaptive k-means using the color and texture features mentioned above. The edge regions R^E are obtained by binary segmentation using edge features, which are acquired with the Prewitt method [14]. The Prewitt operators used are

g^y = [ −1 ⋯ −1  0  1 ⋯ 1 ;  ⋮ ;  −1 ⋯ −1  0  1 ⋯ 1 ]_{l×(2l+1)},
g^x = [ −1 ⋯ −1 ;  ⋮ ;  −1 ⋯ −1 ;  0 ⋯ 0 ;  1 ⋯ 1 ;  ⋮ ;  1 ⋯ 1 ]_{(2l+1)×l},   (14.9)

x

where g and g are the vertical and horizontal edge detection operators. To maintain fairness we choose the same scale as extracting texture features for designing the Prewitt operator, i.e., l ¼ 4. After the convolution between image I and gy or gx the margin images of vertical or horizontal direction will be acquired. To eliminate the effect of direction factor, we combine the vertical and horizontal values for the final edge features. The formulas of what we discussed are shown, where fiE is the edge feature of pixel i. GEy  I5gy ; Gx  I5gx qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi E fi ¼ ðGy Þ2 þ ðGx Þ2

(14.10)

The binary segmentation method [15] is used here for segmenting images with edge features FE. The examples are shown in Fig. 14.2. The first row shows the original images, and the second row shows the binary segmentation results. The number of regions after binary segmentation is two. The edge region RE can be acquired so far. Note that we use

486 Chapter 14

Figure 14.2 Binary segmentation using edge features. First row: original images; second row: segmentation result.

the mean and variance of the edge features of region rei to compute the edge IRM distance. 14.1.1.3.5 IIRM computation

Suppose that images I1 and I2 have already been represented by texture region sets R ¼   {r1, r2, /rm} and R2 ¼ r10 ; r20 ; /rn0 after adaptive segmentation, and by edge region sets   RE1 ¼ {re1, re2} and RE2 ¼ re01 ; re02 after binary segmentation, respectively. The texture IRM distance and edge IRM distance can be computed as follows, dTðR1 ; R2 Þ ¼

X si;j di;j ;

4   X  2 d ri ; rj0 ¼ utk fk  fk0

i;j

dEðRE1 ; RE2 Þ ¼

X sei;j dei;j ;

(14.11)

k¼1 2  X   2 de rei ; re0j ¼ uek fek  fe0k

i;j

(14.12)

k¼1

where f1 is the mean pixel value of the texture region, f2, f3, and f4 are the mean energy components of the texture region, si,j is the significance score of texture region, fe1 and fe2 are the mean and variance value of edge region, sei,j is the significance score of edge region. The significance score here is calculated according to the most similar highest priority (MSHP) principle. Considering the contribution of both texture and edge the IRM distance can be computed as, dIIRM ¼ u1  dT þ u2  dE;

u1 þ u2 ¼ 1

(14.13)

IIRM is summarized in Table 14.4. The experimental results show that the IIRM distance clearly performs better than IRM distance in SAR images.

SAR image processing based on similarity measures and discriminant feature learning

487

Table 14.4: IIRM algorithm. Procedure: Input: Image I1 and I2. Step 1. Partition the images into block with 4  4 pixel, and extract the color and wavelet energy features, respectively. Step 2. Use the adaptive k-means method to segment  the images  for texture region sets R1 ¼ {r1, r2, /rm} and R2 ¼ r10 ; r20 ; /rn0 . n o Step 3. Compute the texture region feature sets ffi ; i ¼ 1; /4g; fj 0 ; j ¼ 1; /4 and the region area percentage pi ; p0j . Step 4. Calculate texture significance credit si, j between a pair of segments according to MSHP, and then determine the texture IRM distance defined. Step 5. Extract the edge features. Step 6. Use the binary segmentation method to segment the images for edge region sets RE1 ¼ {re1, re2}   and RE2 ¼ re01 ; re02 . n o Step 7. Compute the edge region feature sets ffei ; i ¼ 1; 2g; fe0j ; j ¼ 1; 2 , and the region area percentage pei ; pe0j . Step 8. Calculate edge significance credit sei, j between a pair of segments according to MSHP, and then determine the edge IRM distance defined. Step 9. Compute the IIRM distance between two images. Output: IIRM distance between images I1 and I2.

14.1.1.4 Methodology summary We now summarize the SAR retrieval method proposed in this chapter. Suppose a set of raw SAR images is given. Steps 1e5 accomplish the offline process, which aims at building a labeled SAR image patches database. The remainder of the steps belong to the online process, which focuses on retrieving the most similar patches to the query from the labeled database. By adding the improvements into traditional CBIR techniques, i.e., semantic categorization by SSL, classification error recovery scheme, and IIRM measure oriented toward SAR image, our SAR retrieval method obtains encouraging results. Based on these encouraging results, we believe our SAR retrieval method could be helpful in EO mining missions. 14.1.1.4.1 Off-line process

Input: a set of raw SAR images Ii, i ¼ 1, /M. 1) Divide the raw images Ii, i ¼ 1, /M into equal-sized nonoverlapping rectangular patches pi, i ¼ 1, /, N of size X  Y to build the SAR database. 2) Define the semantic categories and select the representative patches. The number of patches NT could be small. In our experiments, the percentage of representative patches is 5%. 3) Extract the classification feature of all patches.

488 Chapter 14 4) Classify all patches in SAR database by semisupervised learning. 5) Compute the empirical confusion matrix using the method described in Section 14.1.1.2. Output: labeled SAR database, train samples for SSL, and empirical confusion matrix. 14.1.1.4.2 On-line process

Input: query patch q, labeled SAR database, train samples for SSL, and the empirical confusion matrix. 1) Extract the classification feature of query patch q. 2) Check the query patch q exists in the database or not. If it does not exist in the database, its semantic label should be obtained by the same SSL method as the off-line process. Otherwise the label of query could be acquired from the database directly. 3) Expand the query’s label into label set by the classification recovery scheme. 4) Calculate the IIRM distances between the query patch q and relevant image patches in database. Output: the sorted IIRM distances between query patch q and relevant image patches. 14.1.1.5 Experiment Table 14.5 summarizes the details of the chosen SAR products, including the place, covered area, resolution, and semantic labels. The covered area is approximately calculated by the longitude and latitude contained in the product metadata, the resolution is a parameter recorded in the metadata, and the semantic labels are given by several dedicated examiners. Those SAR scenes are nonoverlapped and divided into equal-sized image patches first for the building database. The superiorities of the patch-based method have been presented in many works, for example, to extract the features which could capture the local properties in a patch, the selected patch size is 256  256 pixels in Table 14.5: The details of the test data. Satellite RadarSat-2

TerraSAR

Place North of Hong Kong, Shenzhen, China 04,457 West of Hong Kong, China 10,652 East of Hong Kong, China 10,874 Tokyo, Japan 10,758 Beijing, China 13,668 Shanghai, China 13,892 Washington, US 13669

Area (km2)

Resolution

Labels

1245.9

3.00

7

503.0

3.00

4

478.5

3.00

6

298.8 1378.8 2033.3 883.4

3.00 3.00 2.99 3.00

5 5 7 7


Ref. [16]. In Refs. [17,18], the sizes of patches are 200  200 and 16  16 or 128  128 for better performance in their own applications. In the literature [19], to get better retrieval behaviors, the size of patches is 160  160. The patches for retrieval should be large enough to capture the visual features of a semantic category, and small enough to remain one semantic category at the same time [20]. As is Ref. [16], the size of image patches is 256  256 in our experiments. The total number of image patches is 69,647. Note that the specific formats and resolutions used in this chapter are not the limitations of the mining approach proposed. Since the database should be labeled, 10 land cover semantic categories are of interest, namely Mountain, Ocean, Port, High-Density Residential, Medium-Density Residential, Low-Density Residential, Farm, Plant, Mixed-Forest, and Water-Bodies. The examples of 10 semantics are shown in Fig. 14.3. The definition of semantic category is based on the area coverage, i.e., patch pi belongs to a semantic category ck if and only if the area coverage of ck in pi is over 50% roughly. We admit that the semantics defined in this chapter are simple, and the semantic categories could be more diverse in practice. However, the semantic annotation is not the main work in this chapter, and this open, tough research topic could be studied in depth in the future. Because the texture features are widespread in SAR image processing [21,22], the energy of frequency bands of the Daubechies wavelet decomposition is extracted in this chapter to accomplish the semantic categorization. To accomplish the systematic evaluation, we choose 15,728 patches which are easy to identify from the database as the established data set. The selection of the data

Figure 14.3 Examples of semantic categories: (A) Mountain, (B) Ocean, (C) Port, (D) High-Density Residential, (E) Medium-Density Residential, (F) Low-Density Residential, (G) Farm, (H) Plant, (I) Mixed-Forest, (J) Water-Bodies.

490 Chapter 14 Table 14.6: Sematic categories distribution in the data set and database. ID 1 2 3 4 5 6 7 8 9 10

Category name Mountain Ocean Port High-Density Residential Medium-Density Residential Low-Density Residential Farm Plant Mixed-Forest Water-Bodies Total

Data set number

Database number

1886 2117 853 3657 1105 1003 1535 1130 1654 788 15,728

8188 8368 3134 10,041 8734 7584 5582 8149 5431 4418 69,647

set and the task of semantic definition were completed by several dedicated examiners, and the number of image patches corresponding to each category in the data set is shown in Table 14.6. The number of each semantic category in the database is also shown in Table 14.4, which is obtained by the LGC method and the trainset is 5% of the total number in the database. In the categorization error recovery scheme, the threshold of filtering the probabilities should be predefined, and this empirical value is defined as 0.008. In the IIRM calculation, the weights of the texture IRM distance and of the edge IRM distance are 0.4 and 0.6, respectively. Although the color-texture-edge signatures are chosen to represent an SAR image in this chapter, other types of features can also be added. We adopt this kind of signature since it is simple to compute and robust to perform. Furthermore, some popular high-level features, such as bags of SIFT features [23], are more appealing for recognizing objects, and they do not perform well when the background is noisy [24]. Deriving image signatures which are robust and effective for retrieval is itself an open and deep research task that we could be studied in the future. 14.1.1.5.1 Performance of improved integrated region matching (IIRM) measure

This section studies the improved integrated region matching (IIRM) measure introduced in Section 14.1.3.1.1. We compare this with some other similarity measures, including IRM, UFM [25], and D2-Distance. •

IRM: The Integrated Region Matching measure was proposed based on region segmentation. It considered all segment regions between two images, and its robustness was proved by applying it to over one million images.

SAR image processing based on similarity measures and discriminant feature learning •



491

UFM: The Unified Feature Matching measure characterized each segment region by a fuzzy feature to deal with the problems of blurry boundaries. The similarity of two images was then defined as the overall resemblance between two sets of fuzzy features. D2-Distance: Discrete Distribution distance for image annotation. Images were first described by the discrete distributions, and the distance between two images was defined as the sum of squared Mallows distances [26,27] between individual distributions.

The experiments are designed as follows. Pick up an image patch from the data set randomly, and then compute the different similarity measures between the selected patch and the rest of patches in the data set. The results are placed in ordered lists. Patches having the same semantics as the selected patch are deemed as correct and vice versa. The precision and recall are chosen to evaluate the performance of different similarity measures. Suppose the number of result patches is ns, the number of patches in the same semantic with selected patch is nt, and the number of overlapped patches between two sets is nc (that is, the number of correct patches). Precision is defined as nc=n and recall is s nc=n .The comparison is provided in Fig. 14.4A. As a whole, IIRM outperforms the other t

three measures for both precision and recall. The precision and recall percentages achieved by IIRM are 1.27, 1.58, and 1.48 times as high as by D2-Distance, IRM and UFM, respectively. Note that, D2-Distance performs best when ns is small. However, with ns increasing its behavior drops dramatically. Nevertheless, IIRM performs more stable than D2-Distance where the recalleprecision curve of IIRM is flatter than that of D2-Distance. This shows that the proposed similarity measure is more robust than D2-Distance. Furthermore, because the linear programming is involved in computation of the Mallows distance, the calculation of D2-Distance is more costly than that of other three matchingbased measures. X 1 rði; jÞ rðiÞ ¼ CðiÞ 1jT;IDðjÞ¼IDðiÞ 11=2 X 1 sðiÞ ¼ @ ½rði; jÞ  rðiÞ2 A CðiÞ 1jT;IDðjÞ¼IDðiÞ 0

rt ¼ pðiÞ ¼

X X 1 1 rðiÞ;st ¼ sðiÞ CðiÞ 1iT;IDðiÞ¼t CðiÞ 1iT;IDðiÞ¼t

X X 1 1 1; pt ¼ pðiÞ CðiÞ ijT;rði;jÞ;IDðjÞ¼IDðiÞ CðiÞ 1iT;IDðiÞ¼t

(14.14)

(14.15) (14.16)

492 Chapter 14

Figure 14.4 Comparing the performance of IIRM, D2-Distance, IRM, and UFM measures using the image patches in the data set. For average precision, the larger numbers denote better results. For average rank and deviation, the lower numbers denote better results. (A) Recalleprecision curve; (B) average precision; (C) average mean rank; (D) average standard deviation.

Besides the precision and recall, we also adopt the average precision, the average mean rank, and the average standard deviation to assess the behavior of IIRM. The Category ID of patch qi is denoted as ID(i). For a query patch qi, r(i, j) is the rank of patch qj, i.e., the position of patch qj in retrieval results for patch qi, which is an integer between 1 and C(i) [the number of patches in ID(i)]. The comparisons are displayed in Fig. 14.4BeD. It is clear that the IIRM measure outperforms the other three approaches in most categories. The highest improvement reaches as high as 6.73, 5.70, and 3.56 in average precision, average mean rank, and average standard deviation. From these assessment criteria, we can see that the behavior of IIRM is better than that of D2-Distance.

SAR image processing based on similarity measures and discriminant feature learning

493

Figure 14.5 Query examples of the proposed method. Query patch (left element in each part) and retrieved patches (right elements in each part) are presented. The patches in the red frame are incorrect retrieved results decided by several dedicated examiners. (A) Farm examples; (B) High-Density Residential examples; (C) Ocean examples; (D) Medium-Density Residential examples; (E) LowDensity Residential examples; (F) Water-Bodies examples.

14.1.1.5.2 Query example (proposed method, IRM, one of the latest retrieval methods)

Here, we show the retrieval results by query examples which are shown in Fig. 14.5. We choose some image patches from the database randomly containing a semantic label (Farm, High-Density Residential, Ocean, Medium-Density Residential, Low-Density Residential, and Water-Bodies). Due to space limitations, only the top 10 matches corresponding to each query are shown, and based on these retrieval results we give the correct and incorrect marks, which are completed by several dedicated examiners, depending on the relevance of image semantics. The incorrect results are tabbed by the red frames. Note that, because the relevance of image semantics depends on the standpoint of the user, the relevance criteria used in this chapter, especially in Fig. 14.5, may be different from those used by a user of this method. From the figures, it is clear that the results of our method are acceptable in Farm, High-Density Residential, and Ocean (see Figs. 14.5AeC), whose numbers of correct results in top 10 patches are 10, 9, and 8, respectively. For the other three semantic categories, however, the retrieval results are not good enough, especially the Water-Bodies (see Figs. 14.5DeF). How to enhance the retrieval precision of these semantic categories could be our future work.

494 Chapter 14

Figure 14.6 Semantic content of the RadarSAT-2 over the north of Hong Kong.

14.1.1.5.3 Land cover statistical analysis

Besides the fast retrieval, our method also could be used to analyze the land cover statistically. For the deep study image content, we choose the patches from one SAR scene to be queried, and then the retrieval process was implemented by these patches in the labeled database. One patch belongs to one semantic category if and only if the average IIRM distance between it and all the patches within this semantic category is the smallest among the relevant categories. Also, the relevant categories are decided by the LGC and empirical confusion matrix. Fig. 14.6 shows the distribution of the semantic categories corresponding to RadarSAT-2 scene over the north of Hong Kong. Here, we observe seven different semantic classes existed in this scene, i.e., Mountain, Ocean, Port, MediumDensity Residential, Low-Density Residential, Farm, and Water-Bodies, and we could find the contributions of Mountain, Ocean, Port, Medium-Density Residential, Low-Density Residential, Farm, and Water-Bodies in this scene are 49.23%, 26.57%, 9.66%,2.90%, 11.06%, 0.14%, and 0.43%, respectively.

14.1.2 Fusion similarity-based reranking for SAR image retrieval 14.1.2.1 Fusion similarity-based reranking The framework of the proposed image reranking method FSR is shown in Fig. 14.7. When the user inputs a query SAR image q, the initial retrieval results d can be obtained by any RS/SAR image retrieval method. Then the top-ranked SAR images I ¼ fI1 ; I2 ; /; In g in d can be picked for reranking. To represent the SAR image from different aspects and suppress the speckle noise, we extract several SAR-oriented visual features from the SAR images within I to describe them simultaneously. In addition, their relevance scores [28]

SAR image processing based on similarity measures and discriminant feature learning

495

Figure 14.7 Framework of fusion similarity-based reranking.

can be estimated under different visual modalities. After that, a modal-image matrix is constructed by the estimated scores. To combine the effects of different modalities, a new resemblance measure named fusion similarity is defined using the modal-image matrix to weigh the relationships between SAR images. Finally, the reranking results Y are acquired by an existing reranking function using the estimated relevance scores and the obtained fusion similarities. 14.1.2.1.1 Preprocessing

There are two steps in the preprocessing part, including multiple SAR-oriented visual features extraction and initial relevance scores estimation. For the feature extraction, we construct two bag-of-visual-words (BOVW) features for the SAR images first. Generally speaking, the BOVW features can be extracted as follows: (1) find the interest points of images using scale-invariant feature transform (SIFT), (2) generate the codebook using those interest points, and (3) obtain the BOVW features by the histogram of the code words. Since the original SIFT algorithm [23] does not consider the speckle noise within the SAR images, we adopt SAR-SIFT [29] and ROEWA-based SIFT (R-SIFT) [30] in this work to reduce its negative influence. We name these two BOVW features SBOVW and RBOVW, respectively. The length of those two BOVW features is 1500 in this chapter. In addition, another SAR-oriented feature, local gradient ratio pattern histogram (LGRPH) [31], is also selected to represent the SAR images. There are three parameters in the formula for the calculation of the LGRPH feature for an SAR image, including the height and the width of the SAR image N and M, and the maximum value of LGRPK. Here, we set N ¼ M ¼ 256 since the size of selected SAR images for testing our method is 256  256, and we set K ¼ 255 in accordance with the original literature. Consequently, the dimension of this feature is 256 in this chapter.

496 Chapter 14 For the initial relevance scores estimation, we select the algorithm proposed in Ref. [32] to s i . In this algorithm, a map an SAR image’s ranking position si into the relevance score b large number of queries is used to investigate the relationship between si and b s i . The estimation function is formulated as b s i ¼ Eq ˛ Q ½sðq; si Þ, where Q is the set of queries, Eq ˛ Q denotes the expectation over Q, and sðq; si Þ indicates the relevance score of the i-th retrieved image to q. The value of the score is 1 or 0 in this chapter which means the i-th retrieved image is similar or dissimilar compared to the query. The mean squared loss criterion method is used to smooth the estimation. In general, whether a retrieved SAR image is similar to the query or not should be judged by users. However, since the test data used to verify our method are a ground truth SAR image archive, we specify that the retrieved SAR image is similar to the query if and only if they belong to the same category. The details of test data are discussed in Section 14.1.2.2. Note that, the two preprocessing steps discussed above can be accomplished offline for a fixed SAR image archive. 14.1.2.1.2 Reranking 14.1.2.1.2.1 Modal-image matrix construction and fusion similarity calculation When the set of b is estimated under different visual modalities, the modal-image initial relevance scores Q

Assume that there are m modalities available, and the number of images for visual reranking is n. Thus, the $m \times n$ modal-image matrix R can be constructed, where $r_{i,j}$ denotes the relevance score of image j under visual modality i. Now, an SAR image can be described by an m-dimensional vector whose elements are the estimated relevance scores; in other words, image j can be represented by the j-th column of R. Then, the similarities between images can be calculated from these vectors. In this chapter, the cosine-based similarity is adopted to measure the resemblance between two vectors u and v, defined as

$$\mathrm{sim}(u, v) = \cos(u, v) = \frac{u \cdot v}{\|u\|_2 \|v\|_2}. \quad (14.17)$$
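The construction of the modal-image matrix and the cosine fusion similarity of Eq. (14.17) reduce to a few matrix operations. The following is a minimal NumPy sketch; the function name and the toy dimensions are illustrative assumptions, not part of the original method:

```python
import numpy as np

def fusion_similarity(R):
    """Compute the n x n fusion similarity matrix from an m x n modal-image
    matrix R, where R[i, j] is the estimated relevance score of image j
    under visual modality i (cosine similarity between columns, Eq. 14.17)."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)   # 1 x n column norms
    U = R / np.maximum(norms, 1e-12)                   # normalized columns
    W = U.T @ U                                        # pairwise cosine similarities
    return np.clip(W, 0.0, 1.0)                        # scores are nonnegative

# Toy example: 3 modalities (e.g., SBOVW, RBOVW, LGRPH scores), 5 images.
R = np.random.rand(3, 5)
W = fusion_similarity(R)
print(W.shape)  # (5, 5)
```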

So far, the fusion similarities between SAR images have been calculated, and the value of the similarity ranges from 0 to 1. It is apparent that the resemblance of two SAR images is high if the scores of these two SAR images in the various visual modalities are close. Since the relevance scores are estimated under different visual feature spaces, the relationship between two score vectors can also reflect the similarity between two SAR images visually.

14.1.2.1.2.2 Reranking function and solution

After the fusion similarities between SAR images are obtained, the next step is to rerank the images according to these similarities by a reranking function. In this chapter, we introduce a graph-based reranking function into our method. The definition of this function is

$$\min_y Q(y, \bar{y}, \mathcal{I}) = y^T \tilde{L} y + \lambda \|y - \bar{y}\|_2^2, \quad (14.18)$$

where y indicates the reranked relevance scores, $\bar{y}$ means the initial relevance scores, $\mathcal{I}$ denotes the image set for reranking, $\tilde{L}$ is the normalized graph Laplacian, and $\lambda$ is used to control the proportion of the two terms. The first term is a graph regularizer, which ensures that visually similar images are close to each other, while the second term is a loss function which assures the reranked results do not change too much compared to the initial list. Assume that the initial relevance scores of the SAR images within $\mathcal{I}$ have been estimated under the various modalities, and the $n \times n$ fusion similarity matrix W has also been calculated by the method displayed in the previous section. We do the following operations to properly use the reranking function displayed in Eq. (14.18). First, the different sets of estimated initial relevance scores are linearly combined by the CombSUM algorithm [33] to obtain the initial score $\bar{y}$. Then, the normalized graph Laplacian $\tilde{L}$ can be acquired by $\tilde{L} = I - D^{-1/2} W D^{-1/2}$, where D is a diagonal matrix and the value of $d_{i,i}$ is the sum of the i-th row of W. The reranked score y can be solved directly through an easy derivation, that is,

$$y = \left(I + \frac{1}{\lambda}\tilde{L}\right)^{-1} \bar{y}. \quad (14.19)$$

However, the computational complexity of the inversion is $O(n^3)$, which becomes time-consuming as n increases. In this study, we adopt the gradient descent method to update the reranked scores to decrease the time cost. Note that the derivative of Q with respect to y can be easily derived as

$$\frac{\partial Q}{\partial y} = 2\tilde{L} y + 2\lambda (y - \bar{y}). \quad (14.20)$$
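As a rough illustration of this update, the sketch below builds the normalized Laplacian from the fusion similarity matrix and applies the gradient of Eq. (14.20); the step size, the iteration count, and the function names are assumptions of the example rather than values taken from the chapter:

```python
import numpy as np

def rerank(W, y_bar, lam=1.0, lr=0.01, n_iter=200):
    """Minimize Q(y) = y^T L~ y + lam*||y - y_bar||^2 by gradient descent
    (Eqs. 14.18 and 14.20), avoiding the O(n^3) inversion of Eq. (14.19).
    W: n x n fusion similarity matrix; y_bar: initial CombSUM scores."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])  # normalized Laplacian
    y = y_bar.copy()
    for _ in range(n_iter):
        grad = 2.0 * L @ y + 2.0 * lam * (y - y_bar)   # Eq. (14.20)
        y -= lr * grad
    return y

# Usage: rerank the images and sort by descending score, e.g.
#   y = rerank(W, R.mean(axis=0)); order = np.argsort(-y)
```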

When we obtain the reranked scores y, the ranks of the SAR images within $\mathcal{I}$ are adjusted by the descending order of y.

14.1.2.2 Experiments and discussion
14.1.2.2.1 Experiment settings

The experiments are completed using MATLAB 2012, installed on a Windows PC with an Intel Core i7 processor, a 2.90 GHz CPU, and 8 GB of DDR3 memory. To test our method, an established ground truth SAR image archive proposed in our previous work [1] is adopted. The archive was constructed from seven raw SAR scenes (HH polarization) with a spatial resolution of 3 m. These SAR scenes cover six cities, including Beijing (China), Hong Kong (China), Shanghai (China), Shenzhen (China), Tokyo (Japan), and Washington (USA). The area covered reaches 6821 km². The seven SAR scenes were produced by RadarSat-2 and TerraSAR-X. For the retrieval task, the image should be large enough

Table 14.7: Categories distribution within the archive.

ID  Category                     Number    ID  Category                     Number
1   Mountain                     1886      6   Low-Density Residential      1003
2   Ocean                        2117      7   Farm                         1535
3   Port                         853       8   Plant                        1130
4   High-Density Residential     3657      9   Mixed Forest                 1654
5   Medium-Density Residential   1105      10  Water Bodies                 788

for extracting its visual feature, and should be small enough to mainly contain one land-cover category. As a result, the seven raw SAR scenes are divided into 15,728 nonoverlapping SAR images with a size of 256 × 256. These SAR images are classified into 10 land-cover categories manually. The categories and the numbers of SAR images within each category are summarized in Table 14.7. We have to admit that the semantics defined in this archive might not be the optimal choice, and the semantic categories could be more diverse in practice. How to define more accurate semantic categories for each SAR image within the archive is an open, tough research topic, which is also planned by the authors. There are two parameters that should be set in advance, i.e., the number of SAR images for reranking n, and the positive parameter λ that is used in the reranking function. In the following experiments, n and λ are set to 500 and 1, respectively, unless stated otherwise, and their influence is discussed in Section 14.1.2.3. The retrieval precision and recall are selected to evaluate the reranking performance. For a query SAR image q, we assume that the number of retrieval results is nr, the number of SAR images within the archive that have the same category as query q is nt, and the number of overlapping SAR images between the two sets is nc. The retrieval precision is defined as nc/nr, while the retrieval recall is defined as nc/nt.

14.1.2.2.2 Numerical assessment
14.1.2.2.2.1 Based on different retrieval methods

To validate the effectiveness of FSR, we add it after three different existing RS/SAR retrieval methods, which were proposed in the literature [34,35]. The first two were introduced for SAR image retrieval, and we abbreviate them as FCD14 and SR15. The last one was presented for RS image retrieval, and we record it as KSH16 for convenience. The results (counted by the top 100 retrieval/reranking images) are summarized in Table 14.8. The parameters of the different retrieval methods are the same as the values from the original literature. Note that the supervised hashing method [33] is used here to accomplish KSH16. In addition, the hash bit is set to 48, and the SBOVW feature is selected to represent the SAR images. It is apparent that the performance of the three RS/SAR retrieval methods is enhanced to different degrees by FSR. The encouraging results validate that our reranking method is effective in improving the performance of SAR image retrieval.


Table 14.8: Average retrieval performance of three different RS/SAR retrieval methods, and average reranking performance of FSR based on the three retrieval methods.

Method  Metric     Before reranking (%)  After reranking (%)
FCD14   Precision  47.07                 61.31
        Recall     2.27                  3.23
SR15    Precision  60.66                 80.15
        Recall     3.30                  4.56
KSH16   Precision  66.04                 76.72
        Recall     3.45                  4.19

14.1.2.2.2.2 Compared with different reranking algorithms

Two reranking methods proposed in the RS community (i.e., Refs. [36,37]) are selected to evaluate the behavior of FSR. Both are active learning-based RF methods, and we record them as RF07 and RF15, respectively. Because each of them is run with each of the three features used in this work (SBOVW, RBOVW, and LGRPH), we name the comparisons RF07S, RF07R, RF07L, RF15S, RF15R, and RF15L. We selected LibSVM [38] to train the support vector machine (SVM) classifier with the radial basis function (RBF) kernel, and the parameters of the SVM are selected by five-fold cross-validation. Note that, to maintain fairness with FSR, only the top 500 retrieval results are used to accomplish the different comparisons, and the number of RF iterations is set to 10. In addition, we chose SR15 as the initial retrieval method, and the obtained retrieval results are regarded as the baseline. The results are displayed in Fig. 14.8. From the observation of the precision-recall curve (Fig. 14.8A), it is apparent that all reranking methods can improve the initial retrieval. For the RF15 methods, RF15S achieves the best performance compared with RF15R and

Figure 14.8 Evaluation criteria values obtained by the different reranking methods: (A) precision-recall curve; (B) precision across different categories.

RF15L. The precision of RF15S is even higher than that of our method when the recall is small. For RF15R and RF15L, their performance drops dramatically at the beginning, and then becomes steady as the recall increases. The same situation can be found in the RF07 methods. The performance of RF07S is stronger than that of RF07R and RF07L, and the weakest performance is obtained by RF07L. Although the comparisons have already achieved positive results, our FSR still outperforms them. For precision, the highest enhancements obtained by FSR over the other comparisons are 22.80% (baseline), 18.36% (RF07L), 15.18% (RF07R), 11.17% (RF07S), 15.16% (RF15L), 8.85% (RF15R), and 5.92% (RF15S). For recall, the largest improvements of FSR are 1.26% (baseline), 0.63% (RF07L), 0.86% (RF07R), 0.81% (RF07S), 0.51% (RF15L), 0.56% (RF15R), and 0.49% (RF15S). Note that, since only the top 100 retrieval/reranking results are counted and the number of images within each category is far more than 100 (Table 14.7), the value of recall is small here. The average retrieval precision obtained by the different reranking methods across different categories is displayed in Fig. 14.8B. Our method outperforms the others in most categories, which further proves the usefulness of our method for SAR image reranking.

14.1.2.3 Influence of different parameters

Now, we discuss the influence of the two parameters n and λ. First, we fix λ = 1, and then vary n from 200 to 1000 to study its impact on our reranking method. The results are exhibited in Fig. 14.9A. It is clear that the performance of FSR is enhanced as n

Figure 14.9 Reranking precision of our method using different parameters: (A) the number of SAR images for reranking n; (B) the positive parameter l.


Table 14.9: Reranking speed of different methods (unit: second).

n     RF07L   RF07R  RF07S  RF15L  RF15R  RF15S  FSR
500   0.3829  1.490  1.340  0.401  1.525  1.362  0.898
600   0.5049  2.080  1.881  0.510  2.099  1.911  1.139
700   0.6333  2.726  2.512  0.644  2.761  2.537  1.424
800   0.7825  3.475  3.227  0.792  3.517  3.254  1.756
900   0.9448  4.308  4.033  0.961  4.360  4.063  2.082
1000  1.1243  5.232  4.919  1.138  5.296  4.978  2.543

increases. This demonstrates that if more images are used, a better performance will be achieved. However, the larger n is, the more computation time is required. Considering the efficiency of our method, we choose n = 500 here. Second, λ is tuned from 0.001 to 1 with fixed n, and the results are displayed in Fig. 14.9B. The reranking behavior is enhanced as λ increases, and the performance remains almost the same when λ > 0.1. This result demonstrates that the loss term in the reranking function [Eq. (14.18)] plays an important role. Note that all the results are counted by the top 100 retrieval/reranking images, and SR15 is selected to acquire the initial retrieval results here.

14.1.2.4 Reranking efficiency

We increase the number of SAR images n for reranking to study the reranking speed of FSR. In addition, the other comparisons' time costs are also counted for reference. We vary n from 500 to 1000, and the time costs of the various reranking methods are displayed in Table 14.9. It is apparent that all reranking methods' time costs are acceptable. For the RF07 and RF15 methods, the reranking speed is proportional to the feature length. For our FSR (which is a multiple modalities-based method), the time cost is longer than for RF07L and RF15L, but shorter than for the others. This demonstrates that our method is efficient.

14.1.2.5 Reranking examples

The reranking examples are exhibited in Fig. 14.10. Due to space limitations, we select four SAR images from the archive randomly to be the queries (which are located in the top-left corner of each block), and only the top nine retrieval/reranking results are displayed. The incorrect results are tagged in red, and the number of correct images within the top 35 results is provided as well. It is clear that FSR improves the performance of the initial retrieval significantly. The number of correct SAR images in the top 35 is increased by five ("Farm"), seven ("Mountain"), 14 ("Water Bodies"), and 17 ("Plant"), respectively, after the reranking. Note that SR15 is adopted in this section to accomplish the initial retrieval.


Figure 14.10 Reranking examples. The top-left corner images in each block are queries. The incorrect results are tagged in red. (A) Query image belongs to "Mountain"; (B) "Farm"; (C) "Water Bodies"; (D) "Plant."

14.1.3 SAR image content retrieval based on fuzzy similarity and relevance feedback

We now discuss our proposed content-based SAR image retrieval method as depicted in Fig. 14.11. We use $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$ to represent the SAR image patch data set, and assume that the query image patch q has been provided by the user. First, similarities between the query patch q and the image patches in the data set are calculated using the RFM criterion. The initial retrieval results Y are obtained by the order of the RFM similarities. Second, to improve the retrieval performance, we carry out the MRF scheme. With the help of various AL algorithms' efforts and users' opinions, the initial ranked list


Figure 14.11 Framework of proposed content-based SAR image retrieval method.

can be refined to achieve a better performance. We now discuss the details of the RFM algorithm and the MRF scheme.

14.1.3.1 Region-based fuzzy matching
14.1.3.1.1 Introduction to the improved integrated region matching algorithm

The prototype of the RFM is the improved integrated region matching (IIRM) algorithm [1], which we briefly introduce. To measure the similarity between two SAR image patches I1 and I2, IIRM segments them in different feature spaces, i.e., the brightness-texture and edge feature spaces. For the brightness-texture feature space, IIRM first partitions an image patch into a number of nonoverlapping blocks to speed up the segmentation. Here, a block means a rectangular area with a fixed size, which is 4 × 4 in the IIRM algorithm. Then, each block is represented by the four-dimensional feature vector $[HH, HL, LH, GV]^T$, where HH, HL, and LH indicate the energy of the high-frequency bands of one-level Daubechies-4 or Haar wavelet transforms on the block, and GV denotes the mean gray value within the block. All feature vectors are then divided into several groups by the adaptive k-means algorithm [12], where each group corresponds to a brightness-texture region. Here, a region means the area within the image patch that belongs to one segmented class. The region signature is the mean value of the feature vectors within this region. For the edge feature space, the Prewitt operator [39] is adopted to extract the edge features at the pixel level, and a binary segmentation method [15] is selected to segment the SAR image patch into two edge regions. Each edge region is described by the mean and variance of the edge features within the region. Suppose the SAR image patch I1 has already been segmented into n brightness-texture regions $R = \{R_1, R_2, \ldots, R_n\}$ and two edge regions $E = \{E_1, E_2\}$, while the SAR image patch I2 has been divided into $n'$ brightness-texture regions $R' = \{R'_1, R'_2, \ldots, R'_{n'}\}$ and two edge regions $E' = \{E'_1, E'_2\}$. Then, the significance scores $S = [s_{i,j}]_{n \times n'}$ and $\tilde{S} = [\tilde{s}_{i,j}]_{2 \times 2}$

can be calculated according to the most similar highest priority (MSHP) principle [12]. Finally, the IIRM distance between I1 and I2 can be computed as

$$d_{IIRM} = \omega \cdot dT(R, R') + (1 - \omega) \cdot dE(E, E'),$$
$$dT(R, R') = \sum_{i,j} s_{i,j} d_{i,j}, \qquad dE(E, E') = \sum_{i,j} \tilde{s}_{i,j} \tilde{d}_{i,j}, \quad (14.21)$$

where $d_{i,j}$ denotes the Euclidean distance between $R_i$ and $R'_j$, $\tilde{d}_{i,j}$ is the Euclidean distance between $E_i$ and $E'_j$, and $\omega \in [0, 1]$ controls the proportion of the two kinds of region.
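Given the region signatures and the MSHP significance scores, the IIRM distance of Eq. (14.21) is a weighted sum of pairwise Euclidean distances. A minimal sketch follows; it assumes the signatures and significance matrices have already been computed (the MSHP weighting itself is not shown), and the function name is illustrative:

```python
import numpy as np

def iirm_distance(sig_R, sig_Rp, S, sig_E, sig_Ep, S_e, omega=0.5):
    """Weighted IIRM distance of Eq. (14.21). sig_R (n x d) and sig_Rp (n' x d)
    are brightness-texture region signatures of the two patches, S (n x n') the
    MSHP significance scores; sig_E, sig_Ep, S_e are the edge-region analogues."""
    dT = sum(S[i, j] * np.linalg.norm(sig_R[i] - sig_Rp[j])
             for i in range(S.shape[0]) for j in range(S.shape[1]))
    dE = sum(S_e[i, j] * np.linalg.norm(sig_E[i] - sig_Ep[j])
             for i in range(S_e.shape[0]) for j in range(S_e.shape[1]))
    return omega * dT + (1.0 - omega) * dE
```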

14.1.3.1.2 RFM measure

Although the IIRM algorithm has been successful in measuring the similarities between SAR image patches, its performance can be improved as follows. First, we consider the characteristics of the SAR image when calculating the resemblance between SAR image patches. It is well known that speckle noise in SAR image patches will decrease the performance of most image classification methods. Thus, a key problem is how to reduce the negative influence of speckle in the similarity calculation. Second, since the multiscale property is useful in classifying SAR images, another challenge is how to effectively incorporate SAR image patches' multiscale properties into our similarity measure. Third, since the objects within an SAR image patch are large in number and diverse in type, the block-level segmented result (in the IIRM algorithm) is always suboptimal. In other words, the segmented regions are not continuous, which impacts the behavior of the region-based similarity. How to overcome this limitation is also a key issue. To deal with these problems, we developed an RFM measure. First, superpixels are adopted to overcome the speckle noise during the generation of brightness-texture regions. Second, a multiscale edge detector is used to extract the edge features of SAR image patches, which takes advantage of the multiscale property of the SAR image patches. Third, fuzzy theory is introduced into our RFM to reduce the influence of inaccurate segmentations.

14.1.3.1.2.1 Superpixel-based segmentation for brightness-texture regions

A superpixel is an oversegmented image region, which is obtained with some constraints (e.g., intensity, location, etc.) [40]. The pixels within a superpixel have similar properties. We adopt superpixels rather than basic pixels for our SAR image segmentation since (1) the superpixel can effectively suppress speckle noise, and (2) the computational expense based on superpixels is low [41]. Superpixels can be obtained by any segmentation method, and we select the simple linear iterative clustering (SLIC) algorithm [42]. SLIC produces superpixels according to the color similarities and coordinate positions of pixels. The five-dimensional feature vector $[l, a, b, x, y]^T$ is extracted from each pixel, where $[l, a, b]^T$ denotes the pixel color value in CIELab space, and $(x, y)$ indicates the position of the pixel in the image


plane. To accurately describe the resemblance between two feature vectors, SLIC defines a distance measure

$$D_s = d_{lab} + \frac{r}{s}\, d_{xy},$$
$$d_{lab} = \sqrt{(l_i - l_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2}, \qquad d_{xy} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \quad (14.22)$$

where $s = \sqrt{N_p / N_s}$ indicates the grid interval ($N_p$ is the total number of pixels within an image, $N_s$ is the expected number of superpixels), and $r$ is a positive parameter that adjusts the compactness of the superpixels. The clustering begins with $N_s$ cluster centers, and then they are moved to the positions which have the lowest gradient value in a 3 × 3 neighborhood. The neighborhood here means a square area which is centered on a single cluster center. The pixels within an image are assigned to the nearest centers whose search area overlaps them. When all of the pixels are arranged, the centers are updated to the mean vectors of the pixels within them. The process does not stop until a convergence status is met. The SLIC algorithm is appropriate for not only color images but also grayscale images. For a color image, SLIC converts the pixel value from RGB space into CIELab space to get $[l, a, b]^T$. For a grayscale image, such as an SAR image patch within our data set, SLIC obtains a pixel's $[l, a, b]^T$ directly from its gray value. Two examples are displayed here to explain how to get $[l, a, b]^T$ for a pixel. Assume that there is a color image, and the value of a pixel within this color image is $[2, 6, 77]^T$ (RGB space). After the transformation from RGB to CIELab, we can get the corresponding $[l, a, b]^T$, that is, $[6.1332, 28.0275, -42.2502]^T$. Suppose the gray value of one pixel within an SAR image is 243; its $[l, a, b]^T$ in the SLIC algorithm will be $[243, 243, 243]^T$. Examples of superpixels obtained from four real SAR image patches using the SLIC algorithm are shown in Fig. 14.12. Once we have the superpixels of an SAR image patch I, the next step is segmenting them by a proper clustering method. In our study, the signature for each superpixel is the arithmetic mean of the pixels within it. For the pixels within I, we adopt a group of Gabor filters [43] to extract the texture features, while the normalized gray value is chosen to be the brightness feature. For the Gabor filters, there are five scale factors ($\sqrt{2}\cdot 2^0$, $\sqrt{2}\cdot 2^1$, $\sqrt{2}\cdot 2^2$, $\sqrt{2}\cdot 2^3$, and $\sqrt{2}\cdot 2^4$) and six directions (0°, 30°, 60°, 90°, 120°, and 150°), so that the dimension of the texture feature vector for a pixel is 30. To get the one-dimensional brightness feature, the pixels' gray values can be normalized by

$$x_i^{NB} = \frac{x_i^B - \min(X^B)}{\max(X^B) - \min(X^B)},$$

where $x_i^B$ is the gray value of a pixel before normalization, $x_i^{NB}$ is the normalized gray value of a pixel, and $X^B$ denotes the set of the gray values of all of the pixels.


Figure 14.12 Examples of superpixels and superpixel-based segmentation results. The images in the first row (top) are the original SAR image patches. Superpixels obtained by SLIC algorithm for each patch are shown in the second row, and the number of superpixels is 1000 in these examples. The third row displays superpixel-based segmentation results obtained by the adaptive k-means method, and the “# of categories” indicates the number of cluster categories determined by the adaptive k-means method.

Thus, the dimension of a brightness-texture feature vector $\vec{f}$ for a pixel is 31. To avoid having to provide the number of cluster centers, the adaptive k-means method is chosen, which automatically determines the number of clusters by setting an empirical threshold for the distortion. After the labels of the superpixels are determined, the pixels' labels are also defined to be consistent with the superpixels they belong to. Due to the high efficiency of SLIC and the adaptive k-means method, the average computational time for segmenting an SAR image patch (256 × 256) is as little as 1 s (Core i7 processor, 2.90 GHz CPU PC under Windows). Examples of superpixel-based segmentation results are exhibited in Fig. 14.12.
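The superpixel step can be prototyped with off-the-shelf tools. Below is a hedged Python sketch using scikit-image's SLIC and scikit-learn's k-means; it uses only the normalized gray value as the superpixel signature and a fixed number of clusters, whereas the chapter uses 30 Gabor responses plus the gray value and an adaptive k-means, and the exact SLIC keyword for grayscale input may depend on the library version:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.segmentation import slic
from sklearn.cluster import KMeans

def superpixel_regions(patch, n_superpixels=5000, n_regions=4):
    """Superpixel-based brightness-texture segmentation sketch: smooth the patch,
    run SLIC, average a per-pixel feature over each superpixel, cluster the
    superpixel signatures, and expand the cluster labels back to pixels."""
    patch = gaussian_filter(patch.astype(float), sigma=1.0)        # suppress speckle
    norm = (patch - patch.min()) / (patch.max() - patch.min() + 1e-12)
    labels = slic(norm, n_segments=n_superpixels, compactness=0.1,
                  channel_axis=None)                                # grayscale SLIC
    ids = np.unique(labels)
    sigs = np.array([[norm[labels == i].mean()] for i in ids])      # superpixel signatures
    groups = KMeans(n_clusters=n_regions, n_init=10).fit_predict(sigs)
    region_map = np.zeros_like(labels)
    for i, g in zip(ids, groups):
        region_map[labels == i] = g                                 # expand labels to pixels
    return region_map

# region_map = superpixel_regions(sar_patch)   # sar_patch: 256 x 256 array
```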


So far, we have obtained the brightness-texture regions R of an SAR image patch. One point we need to remark on is that the SAR image patches are filtered with a Gaussian-smoothing filter before the operations mentioned above to further suppress speckle noise.

14.1.3.1.2.2 Multiscale edge detector-based segmentation for edge regions

To effectively utilize the multiscale information of SAR image patches, a multiscale edge detector is adopted to extract edge features. The detector is constructed using a group of Prewitt operators, and each Prewitt operator is responsible for the edge information at a particular scale. For a fixed scale $\lambda_i$, the edge feature of an SAR image patch I can be extracted as follows:

$$F_e(\lambda_i) = \sqrt{(I \otimes g^y)^2 + (I \otimes g^x)^2}, \quad (14.23)$$

where $g^y$ and $g^x$ indicate the vertical and horizontal orientation detectors, respectively, and their definitions are

$$g^y = \begin{pmatrix} 1 & \cdots & 1 & 0 & -1 & \cdots & -1 \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 1 & \cdots & 1 & 0 & -1 & \cdots & -1 \end{pmatrix}_{\lambda_i \times (2\lambda_i + 1)}, \qquad g^x = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \\ 0 & \cdots & 0 \\ -1 & \cdots & -1 \\ \vdots & & \vdots \\ -1 & \cdots & -1 \end{pmatrix}_{(2\lambda_i + 1) \times \lambda_i}. \quad (14.24)$$

From the observation of Eq. (14.24), we can easily find that the Prewitt operators' size is controlled by the scale parameter $\lambda_i$. With an increase in $\lambda_i$, the size of the Prewitt operators increases, so that the scale of the extracted edges gets larger, which helps to depress speckle noise. Suppose there is a set of scales $\lambda = [\lambda_1, \ldots, \lambda_{sc}]^T$; we can then obtain a set of edge features $\{F_e(\lambda_1), \ldots, F_e(\lambda_{sc})\}$ for an SAR image patch I according to Eq. (14.23). Thus, for a pixel within I, the dimension of the edge feature vector $\vec{f}^e$ is sc, and each element within the feature vector is obtained by a Prewitt operator with a fixed scale $\lambda_i$. Thus, we name this feature vector the multiscale edge feature vector. Note that the gray value of an SAR image patch is normalized before we extract its multiscale edge feature, and the normalization method is the same approach mentioned in the brightness feature extraction. In addition, the selection of the Prewitt operator is not a limitation of our RFM measure. Other basic edge detection operators, such as the Roberts and Sobel operators, could also be used.
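A small sketch of the multiscale edge extraction is given below: for each scale it builds enlarged Prewitt-style kernels in the spirit of Eq. (14.24) and takes the gradient magnitude of Eq. (14.23). The integer scale values are illustrative assumptions; the chapter ties the scales to the Gabor filter scales:

```python
import numpy as np
from scipy.ndimage import convolve

def multiscale_edge_features(patch, scales=(1, 2, 3, 4, 5)):
    """Per-pixel multiscale edge feature vectors (one channel per scale)."""
    patch = patch.astype(float)
    patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-12)
    feats = []
    for lam in scales:
        row = np.concatenate([np.ones(lam), [0.0], -np.ones(lam)])
        gy = np.tile(row, (lam, 1))          # lam x (2*lam+1): rows (1..1, 0, -1..-1)
        gx = gy.T                            # (2*lam+1) x lam: +1 rows, 0 row, -1 rows
        fy = convolve(patch, gy, mode='reflect')
        fx = convolve(patch, gx, mode='reflect')
        feats.append(np.sqrt(fx ** 2 + fy ** 2))   # gradient magnitude, Eq. (14.23)
    return np.stack(feats, axis=-1)          # H x W x len(scales)
```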


Figure 14.13 Examples of multiscale edge detector-based segmentation. The top row displays the original SAR image patches. Segmentation results are shown in the second row, the “# of categories” indicates the number of clusters determined by the adaptive k-means method.

Similar to the superpixel-based segmentation method, we also select the adaptive k-means method to divide the multiscale edge features into several groups, so that the edge regions E of an SAR image patch can be determined. For consistency with the brightness-texture regions, there are still five scales in the multiscale edge feature extraction. The scales $[\lambda_1, \ldots, \lambda_5]^T$ are equal to the Gabor filter scales mentioned in the previous section. Examples of multiscale edge detector-based segmentation results are shown in Fig. 14.13. The computational time of this segmentation method for an SAR image patch (256 × 256) is around 1 s.

14.1.3.1.2.3 Fuzzy region representation

To measure the similarity between two SAR image patches, we have to find proper signatures to describe the regions R and E. In the IIRM measure, the brightness-texture region signature is the center feature vector, i.e., the arithmetic mean of all the pixels within the region. In addition, the edge region is represented by the mean and variance of the features of the pixels within the region. Although these signatures perform positively, they also have distinct drawbacks. For a particular brightness-texture region, the center feature can avoid the influence of inaccurate segmentation, but considerable useful information is lost as well. For an edge region, the mean and variance reflect the distribution of the features within the region, but other factors (e.g., blurry segmented boundaries, etc.) are not considered. To handle these limitations, we introduce fuzzy theory [25] into the region description. On the one hand, fuzzy signatures define the


character of each feature within the region to take the segmentation uncertainties into account. On the other hand, the fuzzy signatures represent the gradual transition between regions, so that the issue of blurry boundaries can be handled. Suppose the brightness-texture regions R and the edge regions E of an SAR image patch I have been determined. Then, the brightness-texture features of I are partitioned into $F = \{F_1, F_2, \ldots, F_n\}$, while the edge features are divided into $F^e = \{F_1^e, F_2^e, \ldots, F_m^e\}$. The values of m and n are automatically selected by the adaptive k-means algorithm. Now, we need to find a proper membership function to map those features into fuzzy space. This membership function assigns a value (between 0 and 1) to each pixel (corresponding to each feature vector) within the region. This value indicates the degree of membership of a single pixel to a region. The value 0 indicates that the pixel is not a member of the region, while the value 1 means the pixel is a member of the region. The transition from 0 to 1 models the segmentation uncertainties between different regions. In accordance with the literature, we select the Cauchy function [44] to be the prototype membership function, and then the brightness-texture region $R_i$ and edge region $E_j$ can be described by the fuzzy features $\tilde{F}_i$ and $\tilde{F}_j^e$, whose membership functions are

$$\mu_{\tilde{F}_i}(\vec{f}) = \frac{1}{1 + \left( \dfrac{\|\vec{f} - \vec{f}_i^{\,c}\|}{d_f} \right)^{\alpha}}, \qquad \mu_{\tilde{F}_j^e}(\vec{f}^{\,e}) = \frac{1}{1 + \left( \dfrac{\|\vec{f}^{\,e} - \vec{f}_j^{\,e,c}\|}{d_{f^e}} \right)^{\alpha}}, \quad (14.25)$$

where $\vec{f}$ indicates one of the feature vectors within $R_i$, $\vec{f}_i^{\,c}$ means the center feature of $R_i$, $\vec{f}^{\,e}$ denotes one of the feature vectors within $E_j$, $\vec{f}_j^{\,e,c}$ represents the center feature of $E_j$, $\alpha$ is a positive parameter that controls the grade of fuzziness of the obtained fuzzy feature, and $d_f$ and $d_{f^e}$ indicate the average distances between region centers, with the definitions

$$d_f = \frac{2}{n(n-1)} \sum_{p=1}^{n-1} \sum_{q=p+1}^{n} \|\vec{f}_p^{\,c} - \vec{f}_q^{\,c}\|, \qquad d_{f^e} = \frac{2}{m(m-1)} \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \|\vec{f}_p^{\,e,c} - \vec{f}_q^{\,e,c}\|. \quad (14.26)$$

So far, the regions R and E of an SAR image I are represented by the fuzzy features $\tilde{F}$ and $\tilde{F}^e$. The range of the values of the fuzzy features is [0, 1], where a larger value indicates a better fit of the feature vector to its region.
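For concreteness, a minimal NumPy sketch of the Cauchy membership of Eqs. (14.25)-(14.26) is given below; the function name and the toy 31-dimensional example are illustrative assumptions:

```python
import numpy as np

def cauchy_membership(f, centers, alpha=1.0):
    """Cauchy membership (Eq. 14.25) of one feature vector f against every region
    center; d_f is the average pairwise distance between centers (Eq. 14.26)."""
    n = len(centers)
    pair = [np.linalg.norm(centers[p] - centers[q])
            for p in range(n - 1) for q in range(p + 1, n)]
    d_f = 2.0 * sum(pair) / (n * (n - 1))
    dists = np.linalg.norm(centers - f, axis=1)
    return 1.0 / (1.0 + (dists / max(d_f, 1e-12)) ** alpha)

# Example: 3 region centers in a 31-D feature space, one pixel feature vector.
centers = np.random.rand(3, 31)
mu = cauchy_membership(np.random.rand(31), centers, alpha=1.0)
print(mu)   # values in (0, 1]; larger means a better fit to that region
```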

14.1.3.1.2.4 RFM similarity calculation

Now, we discuss how to measure the similarity between two SAR image patches within the fuzzy feature space. To take advantage of the contributions of all the regions, the UFM measure is adopted here as the prototype of our RFM measure. Let us take the brightness-texture regions as our first example. Assume again that there are two SAR image patches I1 and I2, and their brightness-texture regions are represented by the fuzzy features $\tilde{F} = \{\tilde{F}_i : 1 \le i \le n\}$ and $\tilde{F}' = \{\tilde{F}'_j : 1 \le j \le n'\}$. First of all, we have to formulate the similarity measure for $\tilde{F}_i$ and $\tilde{F}'_j$. This measure can be defined as

$$S(\tilde{F}_i, \tilde{F}'_j) = \sup_{\vec{f} \in \mathbb{R}^{31}} \mu_{\tilde{F}_i \cap \tilde{F}'_j}(\vec{f}) = \frac{(d_{f_i} + d_{f_j})^{\alpha}}{(d_{f_i} + d_{f_j})^{\alpha} + \|\vec{f}_i^{\,c} - \vec{f}_j^{\,c}\|^{\alpha}}. \quad (14.27)$$

Here, the 31-dimensional feature vector $\vec{f}$ consists of the 30-dimensional Gabor coefficients and the one-dimensional normalized gray value. Then, considering all of the regions' effects, the UFM measure formulates the similarity between two SAR image patches in the brightness-texture space as follows,

$$\mathrm{ufm}(\tilde{F}, \tilde{F}') = \left[ (1 - \lambda)\vec{\omega}_a + \lambda \vec{\omega}_b \right]^T \vec{L}^{(\tilde{F}, \tilde{F}')}, \quad (14.28)$$

where $\vec{\omega}_a$ and $\vec{\omega}_b$ are the weights obtained by the area percentage and the border-favored schemes, $\lambda \in [0, 1]$ adjusts the importance between $\vec{\omega}_a$ and $\vec{\omega}_b$, and $\vec{L}^{(\tilde{F}, \tilde{F}')}$ is the similarity vector consisting of the fuzzy similarities between all of the brightness-texture regions corresponding to the two SAR image patches, with the definition

$$\vec{L}^{(\tilde{F}, \tilde{F}')} = \left[ \vec{l}^{\,\tilde{F}}, \vec{l}^{\,\tilde{F}'} \right]^T, \qquad l_i^{\tilde{F}} = S\!\left(\tilde{F}_i, \bigcup_{j=1}^{n'} \tilde{F}'_j\right), \qquad l_j^{\tilde{F}'} = S\!\left(\tilde{F}'_j, \bigcup_{i=1}^{n} \tilde{F}_i\right). \quad (14.29)$$

The area percentage and border-favored schemes are both proposed in the literature [25]. The area percentage scheme emphasizes that the important objects within an image always occupy larger areas, so that the weight for one region equals its area percentage within an image. The border-favored scheme requires that similar semantics have similar backgrounds, so that the weights of the regions near the image boundary will be higher. Similar to the brightness-texture UFM similarity, the UFM similarity between two SAR image patches using edge features can also be determined by

$$\mathrm{ufm}(\tilde{F}^e, \tilde{F}^{e\prime}) = \left[ (1 - \lambda)\vec{\omega}_a^{\,e} + \lambda \vec{\omega}_b^{\,e} \right]^T \vec{L}^{(\tilde{F}^e, \tilde{F}^{e\prime})}, \quad (14.30)$$

where $\tilde{F}^e = \{\tilde{F}_i^e : 1 \le i \le m\}$ and $\tilde{F}^{e\prime} = \{\tilde{F}_i^{e\prime} : 1 \le i \le m'\}$ are the fuzzy signatures of the edge regions of SAR image patches I1 and I2, $\vec{\omega}_a^{\,e}$ and $\vec{\omega}_b^{\,e}$ are the weight sets in line with the area percentage scheme and the border-favored scheme, and $\vec{L}^{(\tilde{F}^e, \tilde{F}^{e\prime})}$ is the similarity vector consisting of the fuzzy similarities between all of the edge regions, defined by

$$\vec{L}^{(\tilde{F}^e, \tilde{F}^{e\prime})} = \left[ \vec{l}^{\,\tilde{F}^e}, \vec{l}^{\,\tilde{F}^{e\prime}} \right]^T, \qquad l_i^{\tilde{F}^e} = S\!\left(\tilde{F}_i^e, \bigcup_{j=1}^{m'} \tilde{F}_j^{e\prime}\right), \qquad l_j^{\tilde{F}^{e\prime}} = S\!\left(\tilde{F}_j^{e\prime}, \bigcup_{i=1}^{m} \tilde{F}_i^e\right). \quad (14.31)$$

To integrate the contributions of the two types of regions for measuring the similarities between SAR image patches, RFM combines $\mathrm{ufm}(\tilde{F}, \tilde{F}')$ and $\mathrm{ufm}(\tilde{F}^e, \tilde{F}^{e\prime})$ in a linear manner, with the definition

$$\mathrm{rfm}(I_1, I_2) = \eta \cdot \mathrm{ufm}(\tilde{F}, \tilde{F}') + (1 - \eta) \cdot \mathrm{ufm}(\tilde{F}^e, \tilde{F}^{e\prime}), \quad (14.32)$$

where $\eta \in [0, 1]$ adjusts the importance of the two different kinds of regions. Since the values of $\mathrm{ufm}(\tilde{F}, \tilde{F}')$ and $\mathrm{ufm}(\tilde{F}^e, \tilde{F}^{e\prime})$ each lie between 0 and 1, the RFM similarity between two SAR image patches also ranges from 0 to 1.

14.1.3.1.2.5 RFM summarization and computational complexity

On the basis of the previous discussion, the calculation method of our proposed RFM measure is summarized in Algorithm 1. It is clear that the computation of the RFM contains four parts: (1) superpixel-based segmentation, (2) multiscale edge detector-based segmentation, (3) region description with fuzzy signatures, and (4) the RFM similarity calculation. Since the pixel-level feature and region fuzzy signature extraction can be completed offline, the computational components of the RFM can be broken down into three parts: the first obtaining superpixels using the SLIC algorithm, the second clustering superpixels with the adaptive k-means algorithm, and the third carrying out the RFM similarity calculation. Assume that the SAR data set has N image patches $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$ (each patch has $N_p$ pixels), the expected number of superpixels obtained with the SLIC algorithm is $N_s$, and the average numbers of brightness-texture and edge regions are $C_t$ and $C_e$, respectively. For the superpixel-based segmentation, the computational expense for an SAR image patch scales as $O(N_p + N_s C_t t)$. For the multiscale edge detector-based segmentation, the computation for an SAR image patch scales as $O(N_p C_e t)$. For the RFM similarity calculation, the computation for two SAR image patches scales as $O(C_t^2 + C_e^2)$. Note that t indicates the number of iterations of k-means here. Consequently, the computation of the RFM measure scales as $O(2(N_p + (N_s C_t + N_p C_e)t) + C_t^2 + C_e^2)$.

Algorithm 1 Calculation Method of the RFM Measure
Input: Two SAR image patches I1 and I2.
Step 1. Superpixel-based segmentation for brightness-texture regions.
  1.1. Extract the pixel-level brightness and texture features for the two SAR image patches, i.e., the normalized gray values and the responses of a bank of Gabor filters.
  1.2. Obtain the superpixels of the two patches using the SLIC algorithm, and represent each by the arithmetic mean vector of the pixels within it.
  1.3. Divide the superpixels into different groups using the adaptive k-means method.
  1.4. Expand the superpixels' segmented labels to all of the pixels within them. The superpixel-based brightness-texture regions R and R' can then be computed.
Step 2. Multiscale edge detector-based segmentation for edge regions.
  2.1. Extract the multiscale edge features for I1 and I2.
  2.2. Segment I1 and I2 with the adaptive k-means method using the edge features; the edge regions E and E' can then be determined.
Step 3. Describe the two types of regions with the fuzzy features $\tilde{F}$, $\tilde{F}'$, $\tilde{F}^e$, and $\tilde{F}^{e\prime}$, where $\tilde{F}$ and $\tilde{F}^e$ are the fuzzy signatures of R and E, and $\tilde{F}'$ and $\tilde{F}^{e\prime}$ are the fuzzy signatures of R' and E'.
Step 4. RFM similarity calculation.
  4.1. Compute the brightness-texture region-based UFM similarity between I1 and I2.
  4.2. Compute the edge region-based UFM similarity between I1 and I2.
  4.3. Calculate the RFM similarity between the two SAR image patches.
Output: The RFM similarity between I1 and I2.
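The similarity computation of Steps 3-4 can be sketched compactly. The code below assumes the fuzzy signatures are summarized by region centers and the average center distances ($d_f$), represents the similarity to a fuzzy union by the maximum of the pairwise similarities of Eq. (14.27), and takes already-combined region weights as inputs; these simplifications are assumptions of the sketch, not part of the original formulation:

```python
import numpy as np

def fuzzy_region_similarity(c_i, c_j, d_i, d_j, alpha=1.0):
    """Pairwise fuzzy similarity in the form of Eq. (14.27)."""
    a = (d_i + d_j) ** alpha
    return a / (a + np.linalg.norm(c_i - c_j) ** alpha)

def ufm(centers1, widths1, w1, centers2, widths2, w2, alpha=1.0):
    """UFM-style similarity (Eqs. 14.28-14.29): similarity of each region to the
    other patch's regions (max over pairs), weighted by the combined weights."""
    l1 = [max(fuzzy_region_similarity(c, c2, d, d2, alpha)
              for c2, d2 in zip(centers2, widths2))
          for c, d in zip(centers1, widths1)]
    l2 = [max(fuzzy_region_similarity(c, c1, d, d1, alpha)
              for c1, d1 in zip(centers1, widths1))
          for c, d in zip(centers2, widths2)]
    return float(np.dot(np.concatenate([w1, w2]), np.array(l1 + l2)))

def rfm(ufm_bt, ufm_edge, eta=0.5):
    """Eq. (14.32): linear combination of the two UFM similarities."""
    return eta * ufm_bt + (1.0 - eta) * ufm_edge
```

With region descriptions precomputed for the brightness-texture and edge spaces, calling rfm(ufm(...), ufm(...)) mirrors the linear combination of Eq. (14.32).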

14.1.3.2 Multiple relevance feedback (MRF)

From the RFM similarities between the query patch q and the target image patches within the data set, we can obtain the initial retrieval results Y. These results are acquired in a blind manner, and they are often not satisfactory. To refine the initial results, we develop an RF method we call multiple relevance feedback (MRF). Before we introduce our MRF, we briefly review the RF calculation. The RF method is a popular interactive reranking method. It adapts to users' feedback (i.e., current retrieval results are either relevant or irrelevant) to improve performance. The common RF in CBIR can be summarized as follows: (1) obtain the initial retrieval result using any search engine, (2) select a small number of positive and negative samples by a specific algorithm or some other user-defined criterion, and (3) rerank the samples within


the initial result using an appropriate machine learning method. The last two steps are iterative, and the iteration process stops when some convergence conditions are met. Here, the proposed MRF is based on the general RF method. First, to simplify the RF process, we introduce AL into the MRF to select SAR images from the data set automatically. Second, since the performance of RF driven by a single AL algorithm is restricted to certain kinds of images, we expand it to use multiple AL algorithms. In other words, different AL algorithms are utilized in MRF to refine the original ranked list, and then these reranked results are integrated by fusion to get the final retrieval results. We first discuss MRF in the single AL algorithm scenario. The corresponding flowchart is displayed in Fig. 14.14. When the initial retrieval results Y are acquired, a small number of image patches (relevant and irrelevant) are selected by users or by specific algorithms to construct an image patch set $\mathcal{T}$. The image patches within $\mathcal{T}$ are then used to train a binary SVM classifier to rerank the image patches within the data set. Their ranks can be obtained according to the distances between themselves and the hyperplane. In accordance with margin sampling theory [36,45], the smaller this distance, the more uncertain the samples are. If the reranked results converge (i.e., they satisfy the users), the final results are output. Otherwise, the AL algorithm is used to select a certain number of image patches from the data set, and their binary labels (relevant and irrelevant) are given by the users. The main target of AL is to achieve greater precision with fewer labeled data and fewer samples selected from the unlabeled data [46]. Thus, the image patches selected by the AL algorithm can be regarded as the optimal samples to train the classifier. The selected image patches and their binary labels are then added to $\mathcal{T}$, and the binary SVM classifier is retrained to rerank all of the SAR images. These operations are carried out iteratively, and they will not stop until the reranked results have converged.

Figure 14.14 Flowchart of RF driven by a single AL algorithm.

There are several points we wish to explain further. First, when something is referred to as relevant it means that the retrieved image patch is similar to the query patch. Second, we have to provide a small number of relevant and irrelevant image patches before the first iteration. One is the query patch q, which is regarded as the relevant image. The irrelevant patch is selected from the bottom of the initial retrieval results. Third, we use LibSVM [38] to train the binary SVM classifier. In our study, we design a kernel function based on our proposed RFM similarity for SVM training, named the RFM Gaussian kernel. Its definition is

$$K(I_i, I_j) = \exp\left( -\frac{d_{rfm}(I_i, I_j)^2}{\zeta^2} \right), \quad (14.33)$$

where $d_{rfm}(I_i, I_j)$ denotes the RFM distance between two SAR image patches $I_i$ and $I_j$, and $\zeta$ is the standard deviation of all the pairwise RFM distances. As discussed in Section 14.1.3.1.2.4, the RFM represents the similarity between two SAR image patches. Its value ranges from 0 to 1, and a larger value means greater similarity between patches. To get the RFM distance between $I_i$ and $I_j$, we convert the RFM similarity $\mathrm{rfm}(I_i, I_j)$ to the RFM distance $d_{rfm}(I_i, I_j)$ using the simple transformation $d_{rfm}(I_i, I_j) = 1 - \mathrm{rfm}(I_i, I_j)$. Now, we extend the single AL algorithm approach to the multiple AL algorithm scenario. The flowchart is displayed in Fig. 14.15. Suppose there are M AL algorithms available, and the RF results have been determined, respectively, under each different AL algorithm. The next step is to fuse these results to get the final retrieval result. Due to differences between the AL algorithms, the positions $s = [s_1, \ldots, s_M]^T$ of an SAR image patch in the different RF results can be very diverse. It is difficult to fuse their positions directly. Consequently, we transform the positions corresponding to an SAR image patch in the different RF results into the relevance scores [28] $y = [y_1, \ldots, y_M]^T$, and then fuse these scores to arrive at the final result.
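As a rough illustration of the single-AL feedback step, the sketch below builds the precomputed RFM Gaussian kernel of Eq. (14.33) and ranks all patches by their signed distance to the SVM hyperplane. It uses scikit-learn's SVC (which wraps libsvm) rather than the LibSVM package itself, and the function names, the regularization value, and the ranking convention are assumptions of the example:

```python
import numpy as np
from sklearn.svm import SVC

def rfm_gaussian_kernel(rfm_sim):
    """RFM Gaussian kernel (Eq. 14.33) from a pairwise RFM similarity matrix:
    d = 1 - rfm, zeta = std of the pairwise distances, K = exp(-d^2 / zeta^2)."""
    d = 1.0 - rfm_sim
    zeta = d[np.triu_indices_from(d, k=1)].std() + 1e-12
    return np.exp(-(d ** 2) / zeta ** 2)

def rf_rank(rfm_sim, labeled_idx, labels, C=1.0):
    """One feedback iteration: train a binary SVM on the labeled patches with the
    precomputed kernel, then rank every patch by its decision value."""
    K = rfm_gaussian_kernel(rfm_sim)
    svm = SVC(C=C, kernel='precomputed')
    svm.fit(K[np.ix_(labeled_idx, labeled_idx)], labels)
    scores = svm.decision_function(K[:, labeled_idx])
    return np.argsort(-scores)   # larger decision value = more relevant
```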

Figure 14.15 Flowchart of proposed MRF.


Since the relationship between a position and its relevance score can be estimated in an offline process, we assume that the scores of an SAR image patch under the different single RF schemes have been acquired first. Then, these scores are fused using the CombSUM algorithm [47] to get the final score $\bar{y}_i$ for an SAR image patch. The CombSUM algorithm first normalizes the elements within y into [0, 1], and then combines them in a weighted summation. When the final scores $\bar{y} = [\bar{y}_1, \ldots, \bar{y}_N]^T$ of all SAR image patches within the initial retrieval result are obtained, the positions of these patches are adjusted according to the descending order of the scores. To complete the transition from position to relevance score, an estimation scheme based on a large number of queries is adopted. First, a set of SAR image patches is selected randomly from the data set. Then, we use those patches as the queries and obtain the retrieval results according to the RFM measure. For our application, the retrieval results are relevant or irrelevant based on the patch content. For our study, if a retrieval result belongs to the same class as the query, the patch is regarded as relevant and its score is set to 1. Otherwise, the retrieval result is irrelevant and the score is set to 0. Finally, the estimated relevance score list can be determined by averaging all queries' score lists. Generally speaking, the estimated relevance score list after the averaging operation is nonsmooth. To get a smooth estimated relevance score list, the mean squared loss criterion can be used to fit the nonsmooth list. More details of this estimation scheme can be found in Ref. [32].

14.1.3.3 Experiments and discussion
14.1.3.3.1 Setting parameters

For the RFM measure, there are four parameters that we have to set in advance: the expected number of superpixels $N_s$, the positive parameter $\alpha$, which controls the extracted fuzzy features' fuzziness, the variable $\lambda \in [0, 1]$ that adjusts the weight sets' proportion in the UFM similarity calculation, and the parameter $\eta \in [0, 1]$ that controls the significance of the different regions in the RFM similarity computation. In our experiments, we set $N_s = 5000$, $\alpha = 1$, $\lambda = 0.1$, and $\eta = 0.5$ for all SAR image patches unless stated otherwise. The influence of the different parameters on RFM is examined in Section 14.1.3.1.2.5. Here, we have to admit that it might not be the best idea to set the fuzziness grade parameter $\alpha$ to be the same for the two kinds of regions (brightness-texture and edge regions). However, it is a difficult and time-consuming task to separately assign values to this empirical parameter in accordance with the different nonlinearities of the brightness-texture/edge regions. In the future we plan to explore how we can rapidly and accurately determine the values of $\alpha$ for different SAR image patches, and for different kinds of regions. For MRF, we have to select the AL algorithms in advance. Based on SVM theory, three AL algorithms have been proposed: the Simple Margin, the MaxMin Margin, and the Ratio Margin. We adopt them for our MRF process. In addition, the number of iterations T is set to 10 for each single RF procedure, and the number of selected image patches in

each iteration is set to five. The impact of different numbers of iterations on our retrieval method is discussed in Section 14.1.3.4.5. The regularization parameter of the SVM for MRF is tuned over all images using an interactive k-fold cross-validation method.

14.1.3.3.2 Evaluation criteria

There are two types of experiments, classification and retrieval. For classification, precision is adopted. Assume that the number of test SAR image patches is ct, and the number of correctly classified samples within the test patches is cr. The classification precision is defined as cr/ct. For image retrieval, the average retrieval precision (ARP) and average retrieval recall (ARR) are used. Suppose the number of retrieved SAR image patches for a query q is nr(q), the number of SAR image patches within the data set whose labels are the same as that of the query q is nt(q), and the number of overlapping SAR image patches between the two sets is nc(q) (i.e., the correctly retrieved patches). The retrieval precision and recall are defined as nc(q)/nr(q) and nc(q)/nt(q), respectively. Note that the desired output for a CBIR system is that the most relevant results appear in the top-ranked positions. Thus, only the top 100 ranked retrieval patches are considered in determining the ARP and ARR in the following retrieval experiments.

14.1.3.3.3 Retrieval examples

The retrieval examples are displayed in this section. There are four groups of retrieval results, which are exhibited in Fig. 14.16. These groups include "Mixed Forest," "Farmland," "Pond," and "Water Bodies." Queries are selected from each category randomly. Due to space limitations, only the top 20 retrieved SAR image patches are provided visually, and the number of correct patches among the top 35 retrieved results is also provided. The incorrectly retrieved patches are tagged in red. Two sets of experimental results are satisfactory (Fig. 14.16A and B), in that the numbers of incorrect images within the top 35 retrieval results are only zero and one, respectively. In contrast, the other two sets of examples are unsatisfactory (they contain some mislabeled SAR image patches marked in red; Fig. 14.16C and D), and the numbers of incorrect samples within the top 35 results are 10 and 12, respectively. The main reasons for the unsatisfactory retrievals are as follows. For the query belonging to "Pond," we can find that most of the incorrectly retrieved patches should belong to "Farmland." These patches contain farmland surrounding ponds. In addition, the objects within these incorrectly labeled patches are always arranged into a shape similar to that in the query, i.e., a regular rectangle. This leads to their obtained regions (brightness-texture or edge) always being similar to those of the query. Thus, our RFM always dictates that the similarities between these incorrectly retrieved patches and the query are high. Moreover, the MRF scheme is carried out using the RFM Gaussian kernel, so that some incorrectly retrieved patches (belonging to "Farmland") are not excluded from the top-ranked results. Consequently, the retrieval results of this query are not good enough. For the query belonging to "Water Bodies", the


Figure 14.16 Retrieval examples. The far left larger image patches are queries. The right patches within each block are the retrieval results. The incorrect retrievals are tagged in red.

incorrectly retrieved patches are concentrated under "Plantation." This is mainly because of the MRF scheme. As mentioned in Section 14.1.3.2, the MRF is conducted using three single relevance feedbacks (RFs). Thus, the fusion results are influenced by all of the individual ones. In this example, one of the single RF methods performs poorly, so that the final retrieval results are unsatisfactory.

14.1.3.4 Numerical evaluation
14.1.3.4.1 Performance of the RFM

The performance of the proposed RFM is assessed in this section. The following matching-based similarity measures are selected to compare with the RFM measure:

• Normalized compression distance (NCD) [48]. A noise-resilient similarity measure, which uses compressor distances to calculate the degree of resemblance between two files.


• Integrated region matching (IRM). The IRM measure was used for image retrievals. All of the segmented regions are regarded as the elements for calculating the similarities between images, and the weights of those elements are obtained using the MSHP principle.
• Unified feature matching (UFM). The UFM measure was introduced for general-purpose image retrievals. Images are first segmented into different regions, and then fuzzy features are chosen to describe the regions and to reduce inaccurate segmentation. Similarities between images are converted into the resemblance between fuzzy feature vectors.
• Improved integrated region matching (IIRM) [1]. The IIRM measure was proposed for SAR images. The SAR image patches are described by two types of region-based signatures, and then the similarities between patches are measured by the weighted sum of two kinds of IRM distances.
• Discrete distribution distance (D2) [24]. The D2 distance was proposed for image annotation. This distance is actually the sum of squared Mallows distances [27] between different distributions.

Note that all of the comparison measures used in this study have been implemented by the authors following the methods outlined in the original literature. In addition, the Lempel-Ziv-Welch (LZW) algorithm [49] is used to compute the NCD between two SAR image patches. First, we use classification to validate our RFM's behavior, and the k-nearest neighbor (KNN) algorithm is selected as the classifier. There are two reasons for selecting KNN. One is that KNN is the most common way to classify unlabeled image data. The other is that KNN is based on the distances between the data. These characteristics of KNN mean that the behavior of the similarity measure is reflected directly. Here, the number of nearest neighbors knn is set to 1, 3, 5, 7, 9, and 11, respectively, to observe the different measures' performances. Moreover, the proportion of training samples to test samples is 1:1. The classification precision obtained by the different measures mentioned above is summarized in Table 14.10, and all of the results are obtained by k-fold cross-validation.

Table 14.10: KNN classification precision obtained by different similarity measures using the constructed SAR image patch data set.

knn   NCD     IRM     UFM     IIRM    D2      RFM
1     0.6087  0.6593  0.6690  0.6783  0.7111  0.8121
3     0.6139  0.6816  0.6861  0.6992  0.7331  0.8282
5     0.6242  0.6918  0.6944  0.7070  0.7387  0.8304
7     0.6365  0.6927  0.6948  0.7086  0.7400  0.8295
9     0.6410  0.6921  0.6950  0.7098  0.7377  0.8262
11    0.6485  0.6910  0.6920  0.7099  0.7354  0.8235


From the results, it is apparent that RFM outperforms all of the others, and that the best classification result of RFM occurs for knn = 5. Overall, in accordance with their classification performance, the order of the six measures is RFM, D2, IIRM, UFM, IRM, and NCD. Although D2 achieves good classification results, its computational cost is high since there is a linear programming step in its calculation. UFM and IRM perform less well on SAR image patches than on optical images since most SAR image patches are much more complicated than optical images. Although the performance of NCD is weaker than that of the other methods, its classification results are acceptable. Unlike the other comparisons, IIRM is proposed for SAR images, but its classification performance is mediocre. The reason for this is that the number of selectable samples is small; generally speaking, IIRM performs better when knn is large [1]. Second, we used all of the measures to retrieve SAR image patches from the data set. Given a query SAR image patch q, the retrieval results are acquired directly according to the order of the different similarities. The results are displayed in Fig. 14.17. It is clear that our proposed measure achieves the best performance. For ARP, the largest improvements of RFM compared with the other measures are 25.1% (NCD), 14.9% (IRM), 14.2% (UFM), 13.79% (IIRM), and 8.71% (D2). For ARR, the greatest enhancements of RFM compared with the other measures are 1.62% (NCD), 1.27% (IRM), 1.08% (UFM), 0.89% (IIRM), and 0.51% (D2). Note that the value of the ARR is small since only the top 100 ranked retrieval image patches are counted, and the number of image patches within each category is generally far more than 100 (Table 14.9). Fig. 14.18 details the performance of the various measures across different categories using ARP. The "ARP@100" in the figure means that the top 100 ranked retrieval image patches are considered in calculating the retrieval precision. A notable observation is that the performance of RFM is

Figure 14.17 Comparing retrieval results of our RFM and comparison measures using the constructed SAR image patch data set. (A) Average retrieval precision; (B) average retrieval recall.


Figure 14.18 Average retrieval precision of different measures across 14 categories within the constructed SAR image patch data set. For Category IDs, see Table 14.9.

not as good as expected in some categories, such as "Mountain" and "Ocean." The reason is that the contents of the SAR image patches within these categories are uniform, which conflicts with the superpixel and fuzzy signature extraction. However, RFM generally outperforms the other comparisons in most categories. The encouraging experimental results validate the effectiveness of RFM for measuring the similarities between SAR image patches.

14.1.3.4.2 Performance of the proposed retrieval method

To demonstrate that the proposed method is effective, we compared it with various existing RS or SAR image retrieval methods, including:

• Baseline. The retrieval results obtained by the RFM measure are regarded as the baseline.
• The SAR image retrieval method based on semisupervised learning (SSL) and IIRM (SSL + IIRM). The query patch is classified into a certain semantic category, and then a recovery scheme is applied to reduce the influence of the inevitable classification error. The SAR image patches within the same and relevant categories are ranked by the IIRM measure.
• SSL + RFM. Following the same retrieval framework, we change IIRM into our proposed similarity measure (RFM) to retrieve the SAR image patches.
• Double-criteria relevance feedback (DCRF) [36]. This interactive image retrieval method was proposed for RS images. After the initial retrieval results are obtained, the samples are selected by the developed double-criteria AL algorithm in each iteration, and then the SVM is introduced to rerank the original results.
• Triple-criteria relevance feedback (TCRF). Based on DCRF, a triple-criteria AL algorithm in RF for content-based RS image retrieval was presented in Ref. [37]. In an

Baseline. The retrieval results obtained by the RFM measure are regarded as the baseline. The SAR image retrieval method based on semisupervised learning (SSL) and IIRM (SSL þ IIRM). The query patch is classified into a certain semantic category, and then a recovery scheme is applied to reduce the influence of the inevitable classification error. The SAR image patches within the same and relevant categories are ranked by the IIRM measure. SSL þ RFM. According to the retrieval method, we change IIRM into our proposed similarity measure (RFM) to retrieve the SAR image patches. Double-criteria relevance feedback (DCRF) [36]. This interactive image retrieval method was proposed for RS images. After the initial retrieval results are obtained, the samples are selected by the developed double-criteria AL algorithm in each iteration, and then the SVM is introduced to rerank the original results. Triple-criteria relevance feedback (TCRF). Based on DCRF, a triple-criteria AL algorithm in RF for content-based RS image retrieval was presented in Ref. [37]. In an

SAR image processing based on similarity measures and discriminant feature learning

521

iterative process, uncertainty, diversity, and density are taken into account simultaneously to select optimal RS images within the initial retrieval results. Then, similar to DCRF, SVM is adopted to rerank the images for final retrieval results. Note that, among the comparisons, DCRF and TCRF are the RF-based retrieval methods, which are similar to our method. In addition, AL algorithms are also used by themselves to select images in RF iterations. To be fair, we make them select five SAR image patches in each iteration, and use our RFM Gaussian kernel to train the binary SVM classifier. The regularization parameters of SVM for DCRF and TCRF are tuned by all image patches using k-fold cross-validation in iterations. In addition, the “recovery scheme” mentioned above uses the confusion matrix [11] to reduce the losses resulting from the inevitable classification error to retrieval. We exhibit the ARP and ARR obtained by the different retrieval methods in Fig. 14.19. As a whole, the order of different retrieval methods in accordance with their performance is our method, TCRF, DCRF, SSL þ RFM, and SSL þ IIRM. For SSL þ IIRM, since the IIRM measure is weak when the number of retrieved images is small, its performance is not even as good as baseline. However, when the IIRM measure is changed into RFM (i.e., SSL þ RFM), its behavior improved significantly. This demonstrates that (1) the retrieval framework proposed for SAR images is effective, and (2) the similarity measure proposed in this chapter (RFM) outperforms IIRM for the top few ranked results. Since SSL þ IIRM and SSL þ RFM retrieve SAR image patches in a transductive manner [50] and no refining operation is added after the retrieval results are obtained, there is a clear performance gap between them and the other three approaches (DCRF, TCRF, and our method). Combining users’ opinions and proper patch selection schemes (i.e., AL algorithms), DCRF and TCRF succeed in improving the retrieval results to a large degree

Figure 14.19 Comparing retrieval results of our method and the comparison methods using the constructed SAR image patch data set. (A) Average retrieval precision; (B) Average retrieval recall.


Figure 14.20 Average retrieval precision of different retrieval methods across 14 categories within the constructed SAR image patch data set. For Category IDs, see Table 14.9.

Although their performance is good, our method still outperforms them. The main reason is that our method uses multiple AL algorithms to select different sets of images and integrates their contributions to enhance the retrieval results; both precision and diversity are considered in our method. For ARP, the largest improvements resulting from our method over the other methods are 16.45% (baseline), 24.56% (SSL + IIRM), 14.09% (SSL + RFM), 3.85% (DCRF), and 2.28% (TCRF). For ARR, the highest enhancements produced by our method are 1.62% (baseline), 2.26% (SSL + IIRM), 1.30% (SSL + RFM), 0.47% (DCRF), and 0.36% (TCRF). Fig. 14.20 details the performance of the different retrieval methods across the various categories. From these observations, it is clear that our method performs the best in most of the categories. Moreover, our method is stable across categories: even when its performance is not the best in a category, it is always within the top two positions. The largest enhancements achieved by our method over the other methods are found in "Port" (baseline), "Water Bodies" (SSL + IIRM), "Mountain" (SSL + RFM), and "Road" (DCRF and TCRF). These encouraging results validate that our retrieval method is effective for SAR images. Apart from these numerical assessments, we also provide the top retrieved image patches obtained by the different methods for a query belonging to "Mountain" in Fig. 14.21. Due to space limitations, only the top 10 ranked results are displayed here, and the incorrect results are tagged in red. Meanwhile, the number of correctly retrieved image patches within the top 35 results is also presented.

14.1.3.4.3 Importance of the multiple RF schemes' integration

The proposed MRF scheme fuses the results of multiple RF methods to improve the initial retrieval results. Therefore, we compare MRF with approaches that utilize only a single RF scheme (i.e., only one AL algorithm is used to accomplish the RF process) to demonstrate the importance of the integration.


Figure 14.21 Examples of retrieval results of “Mountain” by different methods on the constructed SAR image patch data set. The first image patches in each block are the queries, and the rest of the image patches in each block are the retrieval results. The incorrect results are tagged in red. The number of correct patches within the top 35 results is provided below each block.

As mentioned in Section 14.1.3.2, three AL algorithms are utilized to accomplish MRF, i.e., Simple Margin, MaxMin Margin, and Ratio Margin. Thus, we name the approaches with only one AL algorithm RFSM, RFMM, and RFRM, respectively. The results on our SAR image patch data set are exhibited in Fig. 14.22. All of the results are obtained after 10 RF iterations. From the histograms, it is clear that MRF outperforms all single-AL-algorithm-based methods, which substantiates the significance of our integration scheme. Among the three comparisons, RFMM performs worse than the other two methods. When the number of retrieved results is less than 30, the performance of RFRM is better than that of RFSM; however, RFSM's performance improves as the number of retrieved images increases. As a whole, the difference in performance among the three single-AL-algorithm-based RF schemes is minimal. After the integration, MRF leads to a larger improvement over the baseline than the three other comparisons.
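To make the single-criterion schemes concrete, the sketch below illustrates the Simple Margin criterion (choosing the unlabeled patches whose decision values lie closest to the current SVM hyperplane) and a naive fusion of the picks from several criteria. It is only a minimal illustration under assumed inputs (a fitted scikit-learn SVC on a precomputed kernel), not the chapter's implementation; MaxMin Margin and Ratio Margin, which retrain the SVM with hypothesized labels, are omitted, and the helper names `simple_margin_select` and `fuse_selections` are hypothetical.

```python
import numpy as np


def simple_margin_select(K_unlabeled, clf, n_select=5):
    """Simple Margin criterion: pick the unlabeled patches whose decision
    values are closest to the current SVM hyperplane.

    K_unlabeled: precomputed kernel rows between the unlabeled patches and
    the training patches used to fit `clf` (an SVC with kernel='precomputed').
    """
    margins = np.abs(clf.decision_function(K_unlabeled))
    return np.argsort(margins)[:n_select]


def fuse_selections(*index_lists):
    """Naive fusion of the picks from several AL criteria: union in order of
    first appearance, a stand-in for how MRF pools samples from different runs."""
    seen, fused = set(), []
    for picks in index_lists:
        for i in picks:
            i = int(i)
            if i not in seen:
                seen.add(i)
                fused.append(i)
    return fused
```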


Figure 14.22 Comparing relevance feedback results of MRF, RFSM, RFMM, and RFRM using the constructed SAR image patch data set. (A) Average retrieval precision; (B) Average retrieval recall.

For ARP, the greatest enhancements generated by our method are 16.45% (baseline), 2.98% (RFSM), 4.39% (RFMM), and 3.48% (RFRM). For ARR, the largest improvements of our method are 1.62% (baseline), 0.42% (RFSM), 0.49% (RFMM), and 0.42% (RFRM). To provide a more objective and believable assessment of our MRF scheme, we apply it to a public ground-truth high-resolution aerial image data set. In this data set, there are 2100 equal-sized (256 × 256) RS images with a pixel resolution of 30 cm, classified into 21 land-cover categories, with 100 images within each category. More details of this high-resolution aerial image data set can be found in Refs. [35,51]. Similar to the previous experiment, the RFM measure is used to obtain the initial retrieval results (which are regarded as the baseline), and then the RF schemes MRF, RFSM, RFMM, and RFRM are applied to refine the initial results into the final retrieval results. All of the results are obtained within 10 RF iterations, and five RS images are selected by the different AL algorithms in each iteration. The results on this RS image data set are shown in Fig. 14.23. A similar case can be found, that is, MRF outperforms all single-AL-algorithm-based methods. This again proves the effectiveness of our MRF scheme.

14.1.3.4.4 Significance of the RFM Gaussian kernel

We discuss the superiority of our RFM Gaussian kernel in this section. As mentioned in Section 14.1.3.2, the new kernel function is built on the RFM distance between SAR image patches. To verify its usefulness, we apply it and two other radial basis function (RBF) kernels to our MRF scheme and study their respective performance. The two RBF kernels are formed from two original features that are common in the RS image community. One is the 60-dimensional homogeneous texture feature [52] extracted by 30 Gabor filters (five scales and six orientations); the feature vector is constructed from the mean and standard deviation of the filter responses.


Figure 14.23 Comparing relevance feedback results of MRF, RFSM, RFMM, and RFRM using the ground-truth high-resolution aerial image data set. (A) Average retrieval precision; (B) Average retrieval recall.

The other one is the color histogram feature [51]. Since the SAR image patches are grayscale images, we rename it the brightness histogram feature. The length of a brightness histogram feature is 256 (the gray value within each image patch ranges from 0 to 255), and it is normalized to sum to one. The MRF schemes based on the two RBF kernels and our RFM Gaussian kernel are denoted MRF-HT, MRF-BH, and MRF, respectively. The results are shown in Fig. 14.24. From the bars, we can easily see that the MRF scheme based on our RFM Gaussian kernel outperforms all of the comparisons. This demonstrates that the Gaussian kernel formed from our RFM measure is better suited to our SAR image patches.
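As a rough illustration of how a distance-based measure such as RFM can be fed into an SVM through a precomputed kernel, the sketch below builds a Gaussian kernel from a pairwise distance matrix and trains scikit-learn's SVC on it. The exact kernel form (squared distance over 2σ²), the parameter values, and the names `rfm_gaussian_kernel`, `D_rfm`, and `train_idx` are assumptions for illustration, not the chapter's code.

```python
import numpy as np
from sklearn.svm import SVC


def rfm_gaussian_kernel(D_rfm, sigma=1.0):
    """Turn a pairwise distance matrix (e.g., RFM distances) into a
    Gaussian kernel: K_ij = exp(-d(i, j)^2 / (2 * sigma^2))."""
    return np.exp(-(D_rfm ** 2) / (2.0 * sigma ** 2))


# Hypothetical usage with a precomputed n x n distance matrix `D_rfm`,
# binary relevance labels `y`, and feedback sample indices `train_idx`:
#   K = rfm_gaussian_kernel(D_rfm)
#   svm = SVC(kernel="precomputed", C=10.0)
#   svm.fit(K[np.ix_(train_idx, train_idx)], y)
#   scores = svm.decision_function(K[:, train_idx])   # rerank all patches
```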

Figure 14.24 Comparing multiple relevance feedback results of our SAR image patch data set using different kernel functions. (A) Average retrieval precision; (B) Average retrieval recall.

14.1.3.4.5 Influences of different parameters

The influences of different parameters on our proposed similarity measure and retrieval method are discussed in this section. The following parameters impact the RFM measure directly: the expected number of superpixels Ns, the fuzzy features' fuzziness grade parameter α, the weight sets' proportion variable λ, and the parameter η balancing the two types of region significance. Thus, we change their values to study their influences. First, we vary Ns from 500 to 5000. Then, α is varied from 0.5 to 1.5. Third, λ is tuned from 0.1 to 0.9. Finally, we change η from 0.1 to 0.9. When one parameter is changed, the others are fixed as described in Section 14.1.3.2. The results are shown in Fig. 14.25. For Ns, the performance of the RFM measure is enhanced when Ns increases; when Ns > 2500, RFM's behavior changes only slightly. For α, the peak value of RFM's performance appears around α = 1. When α < 1, the measure's performance decreases dramatically.

Figure 14.25 Influences of different parameters on the RFM measure. The results are computed on the constructed SAR image patch data set: (A) Ns; (B) α; (C) λ; (D) η.


When α > 1, the performance of RFM decreases slowly. For λ, the trend of RFM's performance is decreasing, which demonstrates that the weights obtained using the area percentage scheme are more important for RFM. For η, the best performance of RFM appears around η = 0.5, which means that both types of regions (i.e., brightness-texture and edge regions) are significant to RFM. For our retrieval method, the number of iterations T influences its performance. As mentioned in Section 14.1.3.2, we set T = 10 in this study. Here, we display the retrieval performance in each RF iteration to study the influence of T. Since two comparisons (i.e., DCRF and TCRF) are RF-based retrieval methods, we also provide their behaviors as a reference. The results are shown in Fig. 14.26. From these observations, we can see that the performance of the three retrieval methods decreases in the first few iterations and then improves in subsequent iterations. The reason is as follows. The total number of SAR image patches within the data set is 15,728, but the number of SAR image patches selected in each iteration is five. At the beginning of the RF iterations, the number of patches for SVM training is small, so the trained binary SVM classifier cannot perform well. As the RF iterations proceed, the SVM classifier becomes stronger because more patches are selected by the AL algorithms. Although all three methods perform less well than expected at the beginning, our method still achieves the best results. After 10 iterations, all of the methods' performances have improved, and our retrieval method reaches the best performance.

Figure 14.26 Influence of the number of iterations T. The results are computed on the constructed SAR image patch data set.


14.2 SAR image change detection based on spatial coding and similarity

14.2.1 Saliency-guided change detection for SAR imagery using a semisupervised Laplacian SVM

Consider two coregistered intensity SAR images acquired in the same geographical area at two different times t1 and t2. They are defined as X1 = {X1(d, e), 1 ≤ d ≤ D, 1 ≤ e ≤ E} and X2 = {X2(d, e), 1 ≤ d ≤ D, 1 ≤ e ≤ E}, where D and E are the width and height of the images (measured in pixels), respectively. We apply the log ratio operator [53], which is robust to calibration errors and speckle noise, to generate the difference image:

\mathrm{DI} = \left| \log_{10} \frac{X_1}{X_2} \right| = \left| \log_{10} X_1 - \log_{10} X_2 \right| \qquad (14.34)

14.2.1.1 Learning a pseudotraining set via saliency detection

We apply the context-aware saliency algorithm to obtain a pseudotraining set. According to saliency detection theory, the salient regions should contain not only the prominent objects but also the background parts that convey the context [54]. Therefore, the pseudotraining set can contain both changed and unchanged areas. A pixel in the difference image is salient if its appearance is distinct; however, whether a pixel looks salient should be judged relative to its surrounding neighbors. We consider an h × h block centered on each pixel to represent that pixel. Let pi denote the image block consisting of pixel i and its surrounding pixels. The block pi is considered salient if its gray level differs from all other image blocks. Let dgray(pi, pj) be the Euclidean distance between the vectorized blocks pi and pj in gray-level space, normalized to the range [0,1]. When dgray(pi, pj) is high, pixel i is considered salient. Another important factor is the positional distance between blocks. Background blocks are likely to have many similar blocks both near and far from them in the image, whereas salient blocks tend to be grouped together. This means that a block pi is common when the blocks similar to it are distant, and salient when the similar blocks are nearby. Let dposition(pi, pj) be the Euclidean distance between the positions of blocks pi and pj. Based on the analysis above, a dissimilarity measure d(pi, pj) between a pair of blocks is defined as:

d(p_i, p_j) = \frac{d_{gray}(p_i, p_j)}{1 + c \cdot d_{position}(p_i, p_j)} \qquad (14.35)


where c is a constant, set to 3 to emphasize the impact of the positional distance between blocks. The dissimilarity measure is proportional to the difference in appearance and inversely proportional to the positional distance. When d(pi, pj) is high, pixel i is regarded as salient because it is highly dissimilar to all other image blocks. In order to evaluate a block's saliency, it is unnecessary to consider its dissimilarity to all other image blocks; it is enough to consider its surrounding neighbors (if most neighbor blocks are highly different from pi, then clearly all image blocks are highly different from pi). Thus, we use an L × L (L > h) comparison window centered on pi and compute the dissimilarity between the center block pi and its neighbors {qk}, k = 1, ..., K (K is the number of neighbors), according to Eq. (14.35). A pixel i is salient when d(pi, qk) is high for all k ∈ [1, ..., K]. The saliency value of pixel i is defined as:

S_i = 1 - \exp\left\{ -\frac{1}{K} \sum_{k=1}^{K} d(p_i, q_k) \right\} \qquad (14.36)

The larger Si is, the more salient pixel i is. Therefore, we select the pixels with the largest Si as the labeled changed samples and the pixels with the smallest Si as the labeled unchanged samples. Fig. 14.27 shows the selected training sets in the reference images of the Ottawa and Bern data sets, whose original SAR images are 290 × 350 and 301 × 301 pixels, respectively. The chosen changed pixels are shown in the red squares and the unchanged pixels in the green squares. From this figure, we can see that the saliency similarity obtains correct changed and unchanged areas.
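The following is a minimal, unoptimized sketch of the block-based saliency score of Eqs. (14.35) and (14.36). The gray-level normalization by the block size, the boundary padding, and keeping the positional distance in pixels are assumptions made for self-containment; `saliency_map` is a hypothetical helper, not the original implementation.

```python
import numpy as np


def saliency_map(di, h=3, L=9, c=3.0):
    """Context-aware saliency of Eq. (14.36) on a difference image `di`,
    comparing each h x h block with the blocks in an L x L window (L > h).
    Brute-force and slow; intended only as an illustration."""
    di = di.astype(float)
    di = (di - di.min()) / (np.ptp(di) + 1e-12)      # gray levels in [0, 1]
    r, R = h // 2, L // 2
    H, W = di.shape
    S = np.zeros((H, W))
    pad = np.pad(di, R + r, mode="reflect")          # assumption: reflected border
    for y in range(H):
        for x in range(W):
            cy, cx = y + R + r, x + R + r
            center = pad[cy - r:cy + r + 1, cx - r:cx + r + 1].ravel()
            dists = []
            for dy in range(-R, R + 1):
                for dx in range(-R, R + 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = cy + dy, cx + dx
                    block = pad[ny - r:ny + r + 1, nx - r:nx + r + 1].ravel()
                    # d_gray normalized to [0, 1] by the block size (assumption).
                    d_gray = np.linalg.norm(center - block) / np.sqrt(center.size)
                    d_pos = np.hypot(dy, dx)         # positional distance in pixels
                    dists.append(d_gray / (1.0 + c * d_pos))   # Eq. (14.35)
            S[y, x] = 1.0 - np.exp(-np.mean(dists))  # Eq. (14.36)
    return S

# Pseudotraining set: pixels with the largest S are taken as changed samples,
# pixels with the smallest S as unchanged samples.
```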

Figure 14.27 The selected samples in the reference images according to the saliency similarity algorithm (the selected changed areas are marked in the red squares, and the selected unchanged areas are marked in the green squares): (A) Ottawa data set; (B) Bern data set.

14.2.1.2 Obtaining the change result via the Laplacian support vector machine

In order to guarantee the validity of the selected samples, we only pick out the most salient pixels, which can represent the changed and unchanged parts precisely. The information in the labeled samples alone is too limited to separate the changed areas from the difference image. Therefore, we take advantage of the semisupervised method and combine the unlabeled samples to obtain a more exhaustive description of the changes. Let {(g_i, y_i), i = 1, ..., l} denote the l labeled vectors obtained by the saliency detection algorithm, and {g_i, i = l + 1, ..., l + u} denote the u randomly selected unlabeled vectors, where g_i is a selected sample vector and y_i ∈ {−1, +1} is its label. The regularized minimization problem in LapSVM [55] is defined as follows:

\min_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(g_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_M \|f\|_M^2 \qquad (14.37)

where V is a generic cost function, ‖f‖²_K is the norm in the associated reproducing kernel Hilbert space (RKHS) H_K, and ‖f‖²_M reflects the intrinsic structure of the data distribution; γ_A and γ_M are the corresponding regularization parameters. In this chapter, we focus on the LapSVM formulation, which uses the hinge-loss function of the SVM and the graph Laplacian for manifold regularization. In the following, we review the formulation. LapSVM uses the same hinge-loss function as the SVM:

V(g_i, y_i, f) = \max\{0, \; 1 - y_i f(g_i)\} \qquad (14.38)

where f is the decision function implemented by the selected classifier, and the predicted label y* (the asterisk distinguishes it from the known label) is obtained by the sign function: y* = sgn(f(g_i)). The second regularization term can be written as:

\|f\|_K^2 = \boldsymbol{\alpha}^{T} K \boldsymbol{\alpha} \qquad (14.39)

where α is the expansion coefficient vector, T denotes the transpose operator, and K is the kernel matrix formed by the kernel function K(g_i, g_j) = ⟨φ(g_i), φ(g_j)⟩ [56], where φ(·) is a nonlinear mapping to a higher-dimensional Hilbert space H_K. The geometry of the data is modeled with a graph whose nodes comprise both the labeled and unlabeled samples, connected by weights W_ij [57]. Regularizing the graph follows from the smoothness assumption and is defined as:

\|f\|_M^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} W_{ij} \bigl( f(g_i) - f(g_j) \bigr)^2 = \frac{\mathbf{f}^{T} L \mathbf{f}}{(l+u)^2} \qquad (14.40)

where L = D − W is the graph Laplacian matrix, W is the edge weight matrix of the adjacency graph, and D is the diagonal degree matrix of W whose elements are given by D_ii = Σ_{j=1}^{l+u} W_ij, with D_ij = 0 for i ≠ j; the normalizing coefficient 1/(l + u)² is the natural scale factor for the empirical estimate of the Laplace operator; and f = [f(g_1), ..., f(g_{l+u})]^T = Kα + b, where b is the bias term. The LapSVM is formulated for binary classification problems and is therefore suitable for the change detection problem, which divides the difference image into a changed part and an unchanged part. In this chapter, we apply the preconditioned conjugate gradient (PCG) algorithm [58] to solve the LapSVM.
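The chapter solves the hinge-loss LapSVM with PCG. As a self-contained illustration of the same two regularizers (RBF kernel plus graph Laplacian), the sketch below instead uses the squared-loss Laplacian regularized least squares variant, which admits a closed-form solution and omits the bias term; the kNN graph construction, parameter values, and the name `laprls_fit` are assumptions, not the chapter's method.

```python
import numpy as np


def laprls_fit(G, y_labeled, n_labeled, gamma_A=1e-2, gamma_M=1e-2,
               sigma=1.0, n_neighbors=7):
    """Laplacian-regularized least squares on feature vectors G (labeled first).

    A squared-loss, bias-free stand-in for the hinge-loss LapSVM of
    Eqs. (14.37)-(14.40); it shares the kernel and graph Laplacian terms."""
    n = G.shape[0]
    l, u = n_labeled, n - n_labeled

    # RBF kernel matrix K(g_i, g_j) = exp(-||g_i - g_j||^2 / (2 sigma^2)).
    d2 = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))

    # kNN adjacency weights W and graph Laplacian L = D - W.
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:n_neighbors + 1]
        W[i, nn] = np.exp(-d2[i, nn] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                       # symmetrize the graph
    L = np.diag(W.sum(axis=1)) - W

    # Closed-form coefficients:
    # alpha = (J K + gamma_A l I + gamma_M l/(l+u)^2 L K)^(-1) J y
    J = np.diag(np.r_[np.ones(l), np.zeros(u)])
    y = np.r_[np.asarray(y_labeled, dtype=float), np.zeros(u)]   # labels in {-1, +1}
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_M * l / (l + u) ** 2) * (L @ K)
    alpha = np.linalg.solve(A, J @ y)

    f = K @ alpha                                # decision values for all samples
    return np.sign(f)                            # +1 changed / -1 unchanged
```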


14.2.1.3 Experimental results

14.2.1.3.1 Description of data sets

We select three real SAR data sets to test the proposed change detection approach. The first data set consists of two SAR images (301 × 301 pixels) over the region of the city of Bern in Switzerland. They were acquired by the European Remote Sensing 2 (ERS-2) satellite SAR sensor in April and May 1999, as shown in Figs. 14.28A and B, respectively. The reference image is shown in Fig. 14.28C. The second data set was obtained by a Radarsat SAR sensor over the city of Ottawa in May and August 1997, as shown in Figs. 14.29A and B, respectively; the images are a section (290 × 350 pixels) of two SAR images with 10 m resolution. The reference image shown in Fig. 14.29C was manually defined. The last data set is a part of the Yellow River data sets, acquired by Radarsat-2 over the region of the Yellow River estuary in China in June 2008 and June 2009, respectively. As shown in Fig. 14.30, the size is 257 × 289 pixels. The effect of speckle noise on the image acquired in 2009 is much greater than on the one acquired in 2008, because the two original images are a single-look image and a four-look image, respectively. Fig. 14.30C shows the reference image, in which the black areas indicate the changed parts and the white areas indicate the unchanged parts.

Figure 14.28 Bern data set with 30 m resolution, C-band, and VV polarization acquired by the ERS-2 SAR sensor. The image size is 301 × 301 pixels. (A) Image acquired in April 1999; (B) image acquired in May 1999; (C) the reference image (the black areas indicate the changed parts and the white areas indicate the unchanged parts).


Figure 14.29 Ottawa data set with 10 m resolution, C-band, and HH polarization acquired by the Radarsat SAR sensor. The image size is 290 × 350 pixels. (A) Image acquired in May 1997; (B) image acquired in August 1997; (C) the reference image (the black areas indicate the changed parts and the white areas indicate the unchanged parts).

Figure 14.30 Yellow River data set with 3 m resolution, C-band, and HH polarization acquired by the Radarsat-2 sensor. The image size is 257 × 289 pixels. (A) Image acquired in June 2008; (B) image acquired in June 2009; (C) the reference image (the black areas indicate the changed parts and the white areas indicate the unchanged parts).

14.2.1.3.2 Quantitative analysis

To evaluate the performance of the different methods, the quantitative analysis of the change detection results is set up as follows. Suppose that N is the number of pixels in the difference image, and Nc and Nu represent the total numbers of changed and unchanged pixels in the reference image, respectively. The criteria are: the false alarms (FA), i.e., the number of unchanged pixels wrongly detected as changed, with the FA rate in percentage PFA = FA/Nu × 100; the missed alarms (MA), i.e., the number of changed pixels detected as unchanged, with the MA rate in percentage PMA = MA/Nc × 100; the overall error (OE), the sum of FA and MA, with the OE rate in percentage POE = (FA + MA)/(Nc + Nu) × 100; and the kappa index, a statistical measurement of accuracy or agreement [59].
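A small sketch of these criteria, assuming binary maps with 1 for changed and 0 for unchanged pixels; the kappa index is computed from the 2 × 2 confusion matrix in the usual way, and `change_detection_metrics` is a hypothetical helper name.

```python
import numpy as np


def change_detection_metrics(cm, ref):
    """FA, MA, OE, their percentages, and the kappa index for a binary change
    map `cm` against a reference map `ref` (1 = changed, 0 = unchanged)."""
    cm, ref = np.asarray(cm).astype(bool), np.asarray(ref).astype(bool)
    Nc, Nu = int(ref.sum()), int((~ref).sum())
    FA = int((cm & ~ref).sum())            # unchanged pixels detected as changed
    MA = int((~cm & ref).sum())            # changed pixels that were missed
    OE = FA + MA
    N = Nc + Nu
    P_FA, P_MA, P_OE = 100.0 * FA / Nu, 100.0 * MA / Nc, 100.0 * OE / N
    # Kappa from the 2x2 confusion matrix: (observed - expected) / (1 - expected).
    TP, TN = int((cm & ref).sum()), int((~cm & ~ref).sum())
    po = (TP + TN) / N
    pe = ((TP + FA) * Nc + (TN + MA) * Nu) / (N * N)
    kappa = (po - pe) / (1.0 - pe)
    return dict(FA=FA, MA=MA, OE=OE, PFA=P_FA, PMA=P_MA, POE=P_OE, kappa=kappa)
```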

14.2.1.3.3 Parameter selection

We choose the labeled pseudotraining set by applying the saliency detection method and randomly select the unlabeled set in the difference image. The number of labeled samples is set to 20, and the number of unlabeled samples is twice that of the labeled ones. The block for each pixel in the difference image is generated from its h × h neighborhood. In this section, different sizes of the extracted block are analyzed; the size h × h is set to 3 × 3, 5 × 5, 7 × 7, and 9 × 9 pixels. Fig. 14.31 shows FA, MA, and the kappa index for the proposed method on the different SAR image data sets. Fig. 14.31 indicates that the kappa index gradually decreases as the block size increases for the two data sets. The reason could be that a larger image block contains much more noise and dilutes the contribution of the center pixel. The size of the block is therefore set to 3 × 3 in the experiments. In addition, the regularization parameters γ_A and γ_M are tuned in the range [10^−4, 10^4].

14.2.1.3.4 Experiment results and analysis on three data sets

We select five existing algorithms to compare with our approach: the generalized minimum-error thresholding (GKI-LN) method [60], the principal component analysis and k-means clustering (PCA-K) method [53], the compressed sampling sparse representation (CS-KSVD) method [61], the neighborhood-based ratio (NR) approach [62], and the deep learning (DL) method [63]. Fig. 14.32 shows the experimental results on the Bern data set. There are many falsely detected white points in Fig. 14.32A; this is the negative effect of noise on the assumed model. From Figs. 14.32B and C, we can see that PCA-K and CS-KSVD generate almost the same amount of MA. NR gives the best result in FA; however, it leads to the highest value in MA. It is important for the thresholding algorithm to select the correct model for the changed and unchanged pixels in the difference image. DL and the proposed method yield the same OE, as shown in Table 14.11; however, the proposed method has the maximum kappa coefficient of 0.8677. The visual results of the different methods on the Ottawa data set are shown in Fig. 14.33. The proposed method produces an OE of 1990 pixels, which is much smaller than that of the other five methods. In Fig. 14.33D, there are so many isolated points that the NR method cannot distinguish the changed areas from the difference image.


Figure 14.31 Influences of the block size (from 3 × 3 pixels to 9 × 9 pixels) on the SAR data sets. (A) FA, MA, and kappa coefficient for the Bern data set; (B) FA, MA, and kappa coefficient for the Ottawa data set.

Compared with the other four methods, the proposed method performs better at discriminating the changed information in the upper left corner. The kappa index of the proposed method is a maximum of 0.9264, as shown in Table 14.11, which indicates that our proposed method clearly outperforms the other methods in its ability to suppress noise. The experimental results of the different methods on the Yellow River data set are shown in Fig. 14.34. From Fig. 14.34, we can see that the visual results of GKI-LN and DL seem good, but their MAs are too large to detect the changed area. Compared with the other three methods, the proposed method produces much less FA, even though it cannot detect the precise location of the changed area. As can be seen from Table 14.11, the proposed method achieves the minimum OE of 4194 and the maximum kappa coefficient of 0.8015. From the above analysis, the method that we propose can effectively utilize the pseudotraining set to improve change detection performance.


Figure 14.32 Change detection results obtained using different methods for the Bern data set: (A) GKI-LN; (B) PCA-K; (C) CS-KSVD; (D) NR; (E) DL; (F) proposed.


14.2.2 SAR images change detection based on spatial coding and nonlocal similarity pooling

Let us consider two coregistered intensity SAR images, X1 = {X1(i, j), 1 ≤ i ≤ I, 1 ≤ j ≤ J} and X2 = {X2(i, j), 1 ≤ i ≤ I, 1 ≤ j ≤ J}, acquired in the same geographical area at two different times, t1 and t2. Under our scheme, a change detection map CM = {cm(i, j), 1 ≤ i ≤ I, 1 ≤ j ≤ J} that represents the changes between the two images X1 and X2 is produced. The entire algorithm mainly comprises four parts: (1) producing the difference image via the log ratio operator; (2) learning a dictionary via hierarchical AP clustering and PCA; (3) generating feature vectors via sparse coding and nonlocal similarity pooling; and (4) obtaining the change map via the k-means clustering algorithm, which partitions the feature vectors into two clusters.

Table 14.11: Change detection results on SAR data sets obtained by different methods.

Data set      Method     FA     PFA    MA     PMA     OE     POE    Kappa coefficient
Bern          GKI-LN      291   0.33     86    7.44    377   0.42   0.8480
Bern          PCA-K       158   0.18    146   12.64    304   0.34   0.8674
Bern          CS-KSVD     161   0.18    147   12.73    308   0.34   0.8657
Bern          NR          110   0.12    199   17.23    309   0.34   0.8596
Bern          DL          154   0.17    145   12.55    299   0.33   0.8662
Bern          Proposed    139   0.16    160   13.85    299   0.33   0.8677
Ottawa        GKI-LN       68   0.08   4183   26.06   4251   4.19   0.8244
Ottawa        PCA-K       955   1.12   1515    9.44   2470   2.43   0.9073
Ottawa        CS-KSVD     558   0.65   1929   12.02   2487   2.45   0.9047
Ottawa        NR         1366   1.60    760    4.73   2126   2.09   0.9224
Ottawa        DL          601   0.70   1894   11.80   2495   2.46   0.9035
Ottawa        Proposed   1003   1.17    987    6.15   1990   1.96   0.9264
Yellow River  GKI-LN      172   0.28   6902   51.38   7074   9.52   0.6006
Yellow River  PCA-K      2137   3.51   2663   19.83   4800   6.46   0.7785
Yellow River  CS-KSVD    2215   3.64   2697   20.08   4912   6.61   0.7736
Yellow River  NR         2344   3.85   2802   20.86   5146   6.93   0.7630
Yellow River  DL          815   1.34   4471   33.29   5286   7.12   0.7313
Yellow River  Proposed   1406   2.31   2788   20.76   4194   5.65   0.8015

Figure 14.33 Change detection results obtained using different methods for the Ottawa data set: (A) GKI-LN; (B) PCA-K; (C) CS-KSVD; (D) NR; (E) DL; (F) proposed.


Figure 14.34 Change detection results obtained using different methods for the Yellow River data set: (A) GKI-LN; (B) PCA-K; (C) CS-KSVD; (D) NR; (E) DL; (F) proposed.

14.2.2.1 Producing the difference image

First, we need to generate the difference image of the two SAR images. The subtraction operator and the ratio operator [64] are well-known techniques for producing the difference image. Furthermore, the logarithm transfers the multiplicative noise into additive noise. For this reason, we apply the log ratio operator [65] to generate the difference image DI, which is robust to calibration errors and speckle noise:

\mathrm{DI} = \left| \log \frac{X_1 + \varepsilon}{X_2 + \varepsilon} \right| = \left| \log(X_1 + \varepsilon) - \log(X_2 + \varepsilon) \right| \qquad (14.41)

where ε is a tiny constant. Using Xi + ε (i = 1, 2) instead of Xi (i = 1, 2) avoids the case in which pixel values in Xi (i = 1, 2) are zero, which would make the operator meaningless [66].
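A minimal sketch of Eq. (14.41); the value of ε and the use of the natural logarithm (a fixed base only rescales DI) are assumptions.

```python
import numpy as np


def log_ratio_difference(x1, x2, eps=1e-6):
    """Log-ratio difference image of Eq. (14.41); `eps` guards against
    zero-valued pixels, as discussed in the text."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.abs(np.log(x1 + eps) - np.log(x2 + eps))
```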

14.2.2.2 Learning a dictionary via affinity propagation

In order to learn a compact and discriminative dictionary for extracting sparse feature vectors for each pixel, we first need to construct a sample data set for training. For this purpose, we divide the difference image into nonoverlapping image blocks of size h × h. These blocks can be of three types: (1) unchanged blocks; (2) changed blocks; and (3) mixtures of changed and unchanged blocks. Because the gray levels of changed or unchanged blocks are relatively homogeneous, the intensity variances of the mixed blocks are mostly larger than those of the changed or unchanged blocks [67]. Most of the mixed blocks are excluded when their intensity variances are larger than a threshold TH; we empirically choose the median of the intensity variances of the nonoverlapping image blocks as the threshold TH. Suppose that there are G selected image blocks Q = [q1, q2, ..., qG] obtained after thresholding with TH. To make the dictionary compact, in this chapter we apply affinity propagation (AP) [68] to cluster the obtained image block set Q = [q1, q2, ..., qG], where the similarity is simply set to the negative squared error. The reason for using affinity propagation is that, unlike other unsupervised clustering algorithms, AP does not require the number of clusters to be set in advance. We first review the standard AP model. It takes the similarity matrix S = (s(i,j)) between data points as input, where s(i,j) is the similarity (based on the Euclidean distance) [69] of point j to point i. Two kinds of messages are exchanged between data points in the AP algorithm: responsibility and availability. The "responsibility" r(i,j), sent from data point i to candidate exemplar point j, reflects the accumulated evidence for how well suited point j is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The "availability" a(i,j), sent from candidate exemplar point j to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point j as its exemplar, taking into account the support from other points that point j should be an exemplar. For data point i, the value of j that maximizes the sum of the availability a(i,j) and the responsibility r(i,j) is the exemplar for point i, i.e., the exemplar of point i is k = argmax_j (a(i,j) + r(i,j)). The core of the AP algorithm is the information communicated between data points:

r(i, j) \leftarrow s(i, j) - \max_{k \neq j} \{ a(i, k) + s(i, k) \} \qquad (14.42)

To begin with, the availabilities are initialized to zero: a(i,j) = 0. In the first iteration, because the availabilities are zero, r(i,j) is set to the input similarity between point i and point j as its exemplar, minus the largest of the similarities between point i and the other candidate exemplars. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero, as prescribed by the update rule below.


a(i, j) = \begin{cases} \min\Bigl( 0, \; r(j, j) + \sum_{k \notin \{i, j\}} \max(0, r(k, j)) \Bigr), & \text{if } i \neq j \\[4pt] \sum_{k \neq j} \max(0, r(k, j)), & \text{if } i = j \end{cases} \qquad (14.43)

The availability a(i,j) is set to the "self-responsibility" r(j,j) plus the sum of the positive responsibilities that candidate exemplar j receives from other points. Only the positive portions of the incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). The "self-availability" a(j,j) is updated differently: this message reflects the accumulated evidence that point j is an exemplar, based on the positive responsibilities sent to candidate exemplar j from the other points. When updating the messages, it is important that they be damped to avoid the numerical oscillations that arise in some circumstances. Each message is set to γ times its value from the previous iteration plus 1 − γ times its prescribed updated value, where the damping factor γ is between 0 and 1. However, this algorithm is time-consuming and occupies a large amount of storage for a large-scale data set. According to Ref. [70], when dealing with a data set of a certain size the AP algorithm executes efficiently and accurately, but when the data scale is beyond a certain range the efficiency of the AP algorithm is reduced. In this chapter, we propose an improved AP clustering method, called hierarchical affinity propagation (HAP), to obtain the exemplars for the training image block data set. First, the data set is divided into several subsets of no more than 1000 samples to reduce the time complexity of the AP algorithm. After AP clustering within the subsets, we obtain the cluster centers of every subset. Second, we run the AP algorithm again on the set composed of all the cluster centers obtained from the subsets. Finally, the final cluster centers {m1, m2, ..., mU} are viewed as the initial classes, and we partition the elements of the whole data set into the class whose cluster center is most similar. In addition, the similarity between pixel j and pixel i is computed by s(i,j) = exp(−‖qi − qj‖²/d²), where d is the width parameter.
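A rough sketch of the hierarchical AP idea, assuming scikit-learn's AffinityPropagation as the per-level clusterer rather than the authors' own implementation; the subset size, damping value, brute-force assignment, and the helper name `hierarchical_ap` are assumptions.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def hierarchical_ap(blocks, subset_size=1000, damping=0.9):
    """Hierarchical affinity propagation sketch: cluster subsets first, then
    cluster the per-subset exemplars, and finally assign every block to its
    most similar final exemplar (negative squared error similarity)."""
    rng = np.random.default_rng(0)
    order = rng.permutation(len(blocks))
    exemplars = []
    # First level: AP on subsets of at most `subset_size` samples.
    for start in range(0, len(blocks), subset_size):
        sub = blocks[order[start:start + subset_size]]
        ap = AffinityPropagation(damping=damping, random_state=0).fit(sub)
        exemplars.append(ap.cluster_centers_)
    exemplars = np.vstack(exemplars)
    # Second level: AP on the collected exemplars.
    ap2 = AffinityPropagation(damping=damping, random_state=0).fit(exemplars)
    centers = ap2.cluster_centers_
    # Assign every block to its nearest final center (illustrative, O(n*m) memory).
    d2 = ((blocks[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return centers, d2.argmin(axis=1)
```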

For learning a discriminative dictionary, we use PCA [71] to compute the principal components of each cluster, because PCA focuses on the edges and structures of the image blocks. In addition, PCA is a classical signal decorrelation and dimensionality reduction technique that is widely used in pattern recognition and statistical signal processing [72], and it has been successfully applied to spatially adaptive image denoising in Ref. [73]. Therefore, we apply PCA to each cluster to generate its eigenvectors as a subdictionary Bu (u = 1, 2, ..., U). The subdictionaries and the centroids of the clusters together form the dictionary B = [m1, B1, m2, B2, ..., mU, BU]. Fig. 14.35 shows a part of the constructed dictionary learned on the basis of hierarchical AP clustering and PCA: the left column shows the centroids obtained from hierarchical AP clustering, and the right five columns are the first five most important eigenvectors corresponding to each cluster.
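A compact sketch of assembling the dictionary from the HAP centroids and per-cluster PCA eigenvectors; the number of retained eigenvectors and the helper name `build_dictionary` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA


def build_dictionary(blocks, labels, centers, n_eig=5):
    """Assemble B = [m_1, B_1, ..., m_U, B_U]: each cluster contributes its
    centroid m_u plus the leading PCA eigenvectors B_u of its members."""
    atoms = []
    for u, m_u in enumerate(centers):
        members = blocks[labels == u]
        atoms.append(m_u)
        if len(members) > 1:
            k = min(n_eig, len(members) - 1, members.shape[1])
            pca = PCA(n_components=k).fit(members)
            atoms.extend(pca.components_)        # eigenvectors of this cluster
    return np.vstack(atoms).T                    # dictionary atoms as columns
```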


Figure 14.35 Part of the constructed dictionary.

14.2.2.3 Creating feature vectors via sparse coding and nonlocal similarity pooling

Sparse representation (SR) has become a hot topic and is used in a wide variety of application areas. The sparse representation technique can restrain the negative effect of noise on images to some extent, which is very important because multiplicative speckle noise is the main obstacle in applying SAR images. Therefore, in this section, we use sparse representation for the initial feature vector encoding [74]. The initial feature vector for every pixel is constructed from the overlapping h × h data block centered on that pixel in the whole difference image. In order to avoid exceptions at the image margin, the image is enlarged by repeating the boundary pixels accordingly. We therefore obtain N (N = I × J) initial feature vectors D = {d1, d2, ..., dN} instead of single pixels, and we then apply sparse coding to encode each pixel:

\arg\min_{c_i} \; \| d_i - B c_i \|^2 + \lambda \| c_i \|_1 \qquad (14.44)

where B is the constructed dictionary B = [m1, B1, m2, B2, ..., mU, BU], di (i = 1, 2, ..., N) is the initial feature vector of a pixel, and ci is its sparse coding feature vector. We obtain the sparse coding vectors by using the feature-sign search algorithm [75] to solve Eq. (14.44).
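The chapter uses the feature-sign search algorithm; as a hedged stand-in that solves the same ℓ1-regularized problem of Eq. (14.44), the sketch below calls scikit-learn's Lasso with the objective rescaled to match. `sparse_codes` is a hypothetical helper, not the original implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso


def sparse_codes(D_feats, B, lam=0.1):
    """Solve argmin_c ||d - B c||^2 + lam * ||c||_1 for every initial feature
    vector d (rows of D_feats), with B's atoms as columns."""
    n_atoms = B.shape[1]
    codes = np.zeros((len(D_feats), n_atoms))
    # sklearn's Lasso minimizes (1/(2*n_samples))||d - Bc||^2 + alpha*||c||_1,
    # so alpha is rescaled to mirror lam in Eq. (14.44).
    alpha = lam / (2.0 * B.shape[0])
    for i, d in enumerate(D_feats):
        reg = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        reg.fit(B, d)
        codes[i] = reg.coef_
    return codes
```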


Because the spatial context helps us to understand the semantic meaning of images [1], one of the most successful ways to exploit it is a pooling method that captures discriminative information. Therefore, we use a pooling method that considers the spatial information of similar pixels. In Ref. [76], the authors concluded that pooling can capture the feature context and extract discriminative information, which contributes to obtaining the prominent change information. To begin with, we search for similar primitive feature vectors, each composed of a pixel and its neighbors in the difference image, using the nonlocal method over the whole image plane. In this chapter, the similarity is measured by the Euclidean distance alone. Defining a positive integer L (L << N), we search for the L most similar vectors for an initial feature vector, called the primary, and construct a group of initial feature vectors for each primary di, i = 1, 2, ..., N:

F_i = \{ d_{i,0}, d_{i,1}, d_{i,2}, \ldots, d_{i,L} \} \qquad (14.45)

where Fi is the group of the initial feature vector di. The central element of Fi is defined as the 0-th initial feature vector d_{i,0}, which is simply di itself. The other L initial feature vectors are called contributories, and L is the number of most similar vectors. However, a larger number of similar features could also contain more irrelevant elements, which is adverse to obtaining significant change information; in this chapter, we use L = 5 for all experiments empirically. The pooling is implemented on the coding feature vectors corresponding to the generated initial feature vector group Fi as follows:

v_i = \max_{1 \le l \le L} \bigl( c_{i,0} + c_{i,l} \bigr) = c_{i,0} + \max_{1 \le l \le L} c_{i,l} \qquad (14.46)

where the notation max_l denotes element-wise maximization over the L vectors. From Eq. (14.46), we can see that the pooled feature for an initial feature vector is composed of the coding features of its initial feature vector group. Therefore, for a group consisting of one primary and L contributory vectors, the relevance between them will be high when the group is located in a homogeneous area, whether changed or unchanged. The pooling thus enhances the similarity of homogeneous areas and reduces the correlation of heterogeneous areas in a latent way.
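A minimal sketch of Eqs. (14.45)–(14.46), assuming the initial feature vectors and their sparse codes are stacked as rows of two arrays; the brute-force pairwise distance computation is only illustrative and would have to be subsampled or windowed for full-size images.

```python
import numpy as np


def nonlocal_max_pooling(D_feats, C_codes, L=5):
    """Nonlocal similarity pooling: for every initial feature vector, find its
    L most similar vectors (Euclidean distance over the whole image plane) and
    max-pool their sparse codes with its own, as in Eq. (14.46)."""
    # Pairwise squared Euclidean distances between initial feature vectors.
    d2 = ((D_feats[:, None, :] - D_feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude the primary itself
    pooled = np.empty_like(C_codes)
    for i in range(len(D_feats)):
        nn = np.argsort(d2[i])[:L]               # the L contributory vectors
        pooled[i] = C_codes[i] + C_codes[nn].max(axis=0)
    return pooled
```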

14.2.2.3.1 Obtaining a change map by k-means clustering

In this chapter, the k-means clustering algorithm is used to cluster the pooled feature vectors into two categories. As discussed in Ref. [53], let MEc and MEu be the cluster mean feature vectors of the changed class and the unchanged class, respectively.
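A small sketch of this clustering-and-assignment step (the assignment rule is formalized in Eq. (14.47) below); deciding which of the two k-means clusters is the changed one by the larger center magnitude is an assumption for illustration, not part of the chapter's method.

```python
import numpy as np
from sklearn.cluster import KMeans


def change_map_from_pooled(pooled, shape):
    """Cluster pooled feature vectors into two classes and build the change
    map by nearest cluster mean, as in Eq. (14.47)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pooled)
    me = km.cluster_centers_                         # ME_c and ME_u (order unknown)
    d = np.linalg.norm(pooled[:, None, :] - me[None, :, :], axis=-1)
    labels = d.argmin(axis=1)
    # Assumption: the cluster with the larger center magnitude is the changed class.
    changed = int(np.argmax(np.linalg.norm(me, axis=1)))
    return (labels == changed).astype(np.uint8).reshape(shape)
```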

The change map CM = {cm(i, j), 1 ≤ i ≤ I, 1 ≤ j ≤ J} is obtained according to the Euclidean distance as follows:

cm(i, j) = \begin{cases} 1, & \| v(i, j) - ME_c \|^2 \le \| v(i, j) - ME_u \|^2 \\ 0, & \text{otherwise} \end{cases} \qquad (14.47)

where "1" represents a changed pixel and "0" an unchanged one.

14.2.2.4 Experimental results

14.2.2.4.1 Quantitative analysis

To evaluate the performance of the different methods, the quantitative analysis of the change detection results is set up as follows. The criteria are the false alarms (FA, the number of unchanged pixels wrongly detected as changed), the missed alarms (MA, the number of changed pixels that are undetected), the overall error (OE, the sum of the false alarms and the missed alarms), and the kappa index, which is a statistical measurement of accuracy or agreement [59]. To evaluate the results further, we define their percentages. Suppose that N is the number of pixels in DI, and Nc and Nu are the total numbers of changed and unchanged pixels in the ground truth map, respectively. The indicators are described as follows [77]: (1) the false alarm rate percentage PFA = FA/Nu × 100; (2) the missed alarm rate percentage PMA = MA/Nc × 100; and (3) the total error rate percentage POE = (FA + MA)/(Nc + Nu) × 100. In addition, the time T taken by each algorithm is also an important criterion; the time-efficiency comparison is listed for the different approaches, with time given in seconds.

14.2.2.4.2 Parameter selection

In this experiment, we show the influence of the parameter λ on the performance of the proposed method. The parameter λ in the sparse coding problem of Eq. (14.44) enforces the sparsity of the solution: the solution will be much sparser if the parameter is larger [74]. In order to choose a proper parameter λ, we quantitatively analyze values from 0.1 to 1 for the three different data sets, as shown in Fig. 14.36. Figs. 14.36A–C show the false alarms (FA), the missed alarms (MA), and the overall errors (OE) for the three data sets, respectively. With an increase in λ, the overall errors increase, while the kappa values decrease. We can see that the parameter λ should be set within [0.1, 0.2]; in order to preserve robustness and sparsity simultaneously, a value of λ in the interval from 0.1 to 0.2 gives relatively accurate results on the SAR image data sets. Next, we show the experiments with different sizes of the extracted blocks for each pixel. In this chapter, a feature vector for each pixel of the difference image is generated from its h × h neighborhood.


Figure 14.36 Performances of the proposed method on different SAR images against the parameter λ: (A) the result of the Bern data set; (B) the result of the Ottawa data set; (C) the result of the Yellow River data set; (D) kappa value.

The size h × h of the extracted block is set to 3 × 3, 5 × 5, 7 × 7, and 9 × 9. Fig. 14.37 shows FA, MA, and the kappa index for the proposed method on the different SAR image data sets. Fig. 14.37 indicates that the changes in FA and MA are opposite, with the increase in one much larger than the decrease in the other. In addition, the kappa index gradually decreases as the block size increases for the first and second data sets, while the kappa index of the third data set is highest when the block size is 5 × 5. The reason could be that its difference image contains much more noise, because the two original images are a single-look image and a four-look image; a comparatively large block will reduce the effects of noise.

14.2.2.4.3 Experiment results and analysis of the first three data sets

In order to demonstrate the effectiveness of the proposed change detection algorithm, we select five related approaches for comparison: the generalized minimum-error thresholding (GKI-LN) method [60], principal component analysis and k-means clustering (PCA-K) [53], the compressed sampling sparse representation (CS-KSVD) method [78], the cumulant-based Kullback-Leibler divergence (CKLD) method [64], and the neighborhood-based ratio (NR) approach [62].


Figure 14.37 Influences of the block size on different SAR images: (A) FA, MA, and kappa for the Bern data set; (B) FA, MA, and kappa for the Ottawa data set; (C) FA, MA, and kappa for the Yellow River data set.

In the experiments, the training data set for constructing the dictionary is obtained from the nonoverlapping blocks of the difference image.


The selection step of the training samples is the same as the block size. The parameter λ is set to 0.1 in all of the experiments. Fig. 14.38 shows the qualitative results of the different methods on the Bern data set. As shown in Fig. 14.38A, many noise points are falsely detected; this is the negative effect of noise, which yields a very large overlap between the changed and unchanged hypotheses in the feature space. From Figs. 14.38B and C, we can see that PCA-K and CS-KSVD obtain almost the same number of missed alarms. However, fewer noise points are generated by the proposed method compared with the other methods, and the proposed method gives the minimum false alarms, which indicates that it can make good use of the spatial dependency to remove single points. CKLD gives the poorest results, and the noise it generates is slightly greater, as shown in Fig. 14.38D. CKLD could be appropriate for detecting large-scale changed areas; however, the changed parts in this data set are scattered and small, so the result of this method is not very good. Compared with the NR method, the proposed method decreases the missed detection rate from 17.23% to 15.93%, as listed in Table 14.12. The overall errors yielded by GKI-LN, PCA-K, CS-KSVD, CKLD, and NR are 377, 304, 308, 513, and 309 pixels, respectively; the proposed method clearly outperforms them.

Figure 14.38 Change detection results on the Bern data set using (A) GKI-LN, (B) PCA-K, (C) CS-KSVD, (D) CKLD, (E) NR, (F) proposed.

Table 14.12: Change detection results on SAR data sets obtained by different methods.

Data set      Method     FA     PFA    MA     PMA     OE     POE    Kappa    T(s)
Bern          GKI-LN      291   0.33     86    7.44    377   0.42   0.8480    16.70
Bern          PCA-K       158   0.18    146   12.64    304   0.34   0.8674     3.59
Bern          CS-KSVD     161   0.18    147   12.73    308   0.34   0.8657   153.91
Bern          CKLD        111   0.12    402   34.81    513   0.57   0.7431    64.33
Bern          NR          110   0.12    199   17.23    309   0.34   0.8596    20.25
Bern          Proposed    100   0.11    184   15.93    284   0.31   0.8708    94.67
Ottawa        GKI-LN       68   0.08   4183   26.06   4251   4.19   0.8244    18.22
Ottawa        PCA-K       955   1.12   1515    9.44   2470   2.43   0.9073     4.51
Ottawa        CS-KSVD     558   0.65   1929   12.02   2487   2.45   0.9047   185.63
Ottawa        CKLD       2148   2.51   2191   13.65   4339   4.27   0.8393    67.10
Ottawa        NR         1366   1.60    760    4.73   2126   2.09   0.9224    20.35
Ottawa        Proposed    489   0.57   1302    8.11   1791   1.76   0.9323   166.49
Yellow River  GKI-LN      172   0.28   6902   51.38   7074   9.52   0.6006    14.18
Yellow River  PCA-K      2137   3.51   2663   19.83   4800   6.46   0.7785     2.76
Yellow River  CS-KSVD    2215   3.64   2697   20.08   4912   6.61   0.7736   168.53
Yellow River  CKLD       2407   3.96   3780   28.14   6187   8.33   0.7072    48.99
Yellow River  NR         2344   3.85   2802   20.86   5146   6.93   0.7630    23.42
Yellow River  Proposed   1300   2.14   2523   18.78   3823   5.15   0.8199    91.69

The qualitative results of the different methods on the Ottawa data set are shown in Fig. 14.39. As can be seen from Fig. 14.39F, the proposed method preserves the edge information in the upper right corner better than PCA-K. The proposed method obtains an overall error of 1791 pixels, which is much smaller than the other five methods, which obtain 4251, 2470, 2487, 4339, and 2126, respectively. We can see that GKI-LN gives the best result in FA; however, it leads to the highest value in MA. It is important for the thresholding algorithm to select the correct model for the changed and unchanged classes in the difference image. Compared with the other methods, the kappa index of the proposed method is a maximum of 0.9323, as shown in Table 14.12; our proposed method clearly outperforms the other methods in its ability to suppress noise. Fig. 14.40 shows the experimental results of the different methods on the Yellow River data set. The changed regions in this data set are small and scattered, and it is difficult to obtain an accurate change map because there is a great deal of speckle noise compared with the data sets above. The visual result of GKI-LN seems good, but its MA is too large to detect the changed area. There are many isolated points in the results shown in Figs. 14.40D and E; hence, the CKLD and NR methods lead to higher PFA values, as indicated in Table 14.12. Although the result of the proposed approach cannot detect the intact location of the changed area, it better preserves the structure of the terrain. From Table 14.12, we can see that the proposed method obtains the smallest overall error rate of 5.15% and the maximum kappa index of 0.8199. From the above analysis, the method that we propose can effectively reduce the errors and improve the change detection performance.


Figure 14.39 Change detection results by using different methods on the Ottawa data set: (A) GKI-LN, (B) PCA-K, (C) CS-KSVD, (D) CKLD, (E) NR, (F) proposed.

Figure 14.40 Change detection results by using different methods on the Yellow River data set: (A) GKI-LN, (B) PCA-K, (C) CS-KSVD, (D) CKLD, (E) NR, (F) proposed.

548 Chapter 14 analysis, the method that we proposed can effectively reduce the errors and improve the change detection performance. As for the time-efficiency, PCA-K works best in different data sets because it is only based on PCA and K-means clustering. Second is the GKI-LN method. Its operation is simply based on the histogram and a statistical model. However, the performance of the statistical method is weak in handling the strong speckle noise, which affects the accuracy when we create the model for the changed or unchanged classes. Our method and CSKSVD spent a relatively long time in constructing the dictionary for obtaining features. In spite of their higher time consumption, the results indicate that our proposed method is robust to speckle noise and can obtain discriminative features for changed regions. 14.2.2.4.4 Experiment results and analysis on the last two image pairs

For the Wuhan area data set, unlike the first three data sets, there is no ground truth image for quantitative measurement. Hence, we only provide a visual analysis to compare the change detection results obtained by the six different methods. Figs. 14.41 and 14.42 show the log-ratio difference images and the final results on the two selected data sets, respectively. In this experiment, the regularization parameter λ is set to 0.1. Because the images are large in size and have strong noise, the size of the extracted block is chosen as h × h = 5 × 5 to reduce the interference of noise. From Figs. 14.41A and B, we can see that the changed regions are the filled lakes and some building constructions. As shown in Figs. 14.41D and H, GKI-LN and NR generate many more isolated white points, which indicates that they are not sufficiently robust to the strong speckle noise. As shown in Fig. 14.41I, the proposed method has fewer isolated points and looks more reasonable compared with the results of the other approaches. The main change between Figs. 14.42A and B is the Erqi Changjiang River Bridge, which is being built across the Changjiang River. In addition, there are some small changes, such as ships and buildings that have appeared. Referring to the difference image, the result of the proposed method, as shown in Fig. 14.42I, accurately detects the changed places and contains far fewer isolated points. This proves that the proposed approach can obtain satisfactory results and effectively reduce the effect of speckle noise.

14.2.2.4.5 Results and analysis on simulated images

In the experiments on simulated image pairs, we quantitatively analyze the robustness of the proposed method. The first simulated image pair is shown in Fig. 14.43; its size is 251 × 282 pixels. The size of the second data set is 350 × 250 pixels, as shown in Fig. 14.44. The effect of speckle noise on the images is large because both simulated images are one-look images. The ground truth images are given in Figs. 14.43C and 14.44C, respectively. The proposed change detection algorithm is compared with the five related approaches: the generalized minimum-error thresholding (GKI-LN) method [60], principal component analysis and k-means clustering (PCA-K) [53], the compressed sampling sparse representation (CS-KSVD) method [78], the cumulant-based Kullback-Leibler divergence (CKLD) method, and the neighborhood-based ratio (NR) approach.


Figure 14.41 Experimental results on the East Lake data set: (A) image acquired in 2006, (B) image acquired in 2009, (C) the difference image, (D) GKI-LN, (E) PCA-K, (F) CS-KSVD, (G) CKLD, (H) NR, (I) proposed.

The experimental setups are as follows. The regularization parameter λ is set to 0.1. The size of the extracted block is chosen as h × h = 5 × 5 because a relatively large block will reduce the effects of noise. Fig. 14.45 shows the change detection results on the first simulated image pair, and Table 14.13 reports the quantitative values. From the results, we can see that the proposed method reduces the speckle noise effectively. GKI-LN yields a very high FA because it is sensitive to noise. PCA-K and CS-KSVD give lower FA and can decrease the influence of speckle noise to some extent.


Figure 14.42 Experimental results on the Erqi Changjiang River Bridge data set: (A) image acquired in 2006, (B) image acquired in 2009, (C) the difference image, (D) GKI-LN, (E) PCA-K, (F) CS-KSVD, (G) CKLD, (H) NR, (I) proposed.

However, they lead to higher values in MA, and their overall errors are high. The changed area in the upper part is accurately detected in Fig. 14.45D. Compared with CKLD, our proposed method improves the kappa index by approximately 4%. The kappa accuracy produced by NR is much higher than that of the other comparison algorithms; however, the noise is clearly visible, as shown in Fig. 14.45E. As can be seen from Fig. 14.45F, the result obtained by our method has a good visual effect compared with the other methods. The changed areas of the second image pair are dispersive and irregular; for this reason, it is much more difficult to detect the exact location of the changed areas. Fig. 14.46 shows the change detection results on the second simulated image pair using the different methods.


Figure 14.43 The first simulated images used in the experiment: (A) image changed before, (B) image changed after, (C) ground truth.

Figure 14.44 The second simulated images used in the experiment: (A) image changed before, (B) image changed after, (C) ground truth.

Figure 14.45 Change detection results using different methods on the first simulated data set: (A) GKI-LN, (B) PCA-K, (C) CS-KSVD, (D) CKLD, (E) NR, (F) proposed.

Table 14.13: Change detection results on simulated data sets obtained by different methods.

Data set                     Method     FA     PFA    MA     PMA     OE     POE    Kappa    T(s)
The first simulated images   GKI-LN     2088   3.91    163    0.94   2251   3.18   0.9171    16.41
The first simulated images   PCA-K       335   0.63   3588   20.71   3923   5.54   0.8400     4.85
The first simulated images   CS-KSVD     338   0.63   3597   20.76   3935   5.56   0.8395   149.71
The first simulated images   CKLD       1419   2.65   1448    8.36   2867   4.05   0.8904    46.79
The first simulated images   NR         1680   3.14    359    2.07   2040   2.88   0.9240    12.57
The first simulated images   Proposed    585   1.09   1175    6.78   1760   2.49   0.9320    62.75
The second simulated images  GKI-LN      102   0.12    600   28.45    702   0.80   0.8073    22.79
The second simulated images  PCA-K       374   0.44    349   16.55    723   0.83   0.8254     2.56
The second simulated images  CS-KSVD     600   0.70    335   15.88    935   1.07   0.7860   181.83
The second simulated images  CKLD        775   0.91    144    6.83    919   1.05   0.8052    58.56
The second simulated images  NR          236   0.28    469   22.24    705   0.81   0.8190    25.43
The second simulated images  Proposed    256   0.31    360   17.07    625   0.71   0.8448    68.72

Figure 14.46 Change detection results using different methods on the second simulated data set: (A) GKI-LN, (B) PCA-K, (C) CS-KSVD, (D) CKLD, (E) NR, (F) proposed.


The numerical results are listed in Table 14.13. GKI-LN and NR give better OE results than the other classical methods; however, their MAs are relatively high. Compared with PCA-K, the improvement of our proposed method is approximately 100 pixels in OE. As can be seen from Fig. 14.46C, the CS-KSVD method generates relatively more noise points. CKLD gets the best result in MA, but its highest FA value prevents a satisfactory result. According to the visual and quantitative results obtained on this simulated data set, the proposed method is verified to be more effective than the other methods.

14.2.2.4.6 Experiment for sparse representation

In this experiment, sparse representation is adopted to reduce the effect of speckle noise in SAR images. In order to demonstrate its effectiveness, we show the quantitative results with and without sparse representation on the three real data sets and the simulated data sets. The experimental setups are the same as above. Fig. 14.47 shows the change detection results on these data sets: the first row shows the results without sparse representation in the coding step, and the second row shows the results with sparse representation. From a visual point of view, the effect of using the sparse representation technique is much better than without it. It is evident that the noise is effectively suppressed for the Yellow River data set and the second simulated data set.

Figure 14.47 Change detection results with and without sparse representation on different data sets. (A and F) Without and with sparse representation on Bern data set, respectively, (B and G) without and with sparse representation on Ottawa data set, respectively, (C and H) without and with sparse representation on the Yellow River data set, respectively, (D and I) without and with sparse representation on the first simulated data set, respectively, (E and J) without and with sparse representation on the second simulated data set, respectively.

Table 14.14: Change detection results on different data sets with and without sparse representation (SP).

Data set                      Methods     FA     PFA    MA     PMA     OE     POE    Kappa
Bern                          Without SP  105    0.12   198    17.14   303    0.33   0.8616
                              With SP     100    0.11   184    15.93   284    0.31   0.8708
Ottawa                        Without SP  692    0.81   1241   7.73    1933   1.90   0.9275
                              With SP     489    0.57   1302   8.11    1791   1.76   0.9323
Yellow River                  Without SP  2751   4.52   1811   13.48   4562   6.14   0.7982
                              With SP     1300   2.14   2523   18.78   3823   5.15   0.8199
The first simulated images    Without SP  563    1.05   1509   8.71    2072   2.93   0.9193
                              With SP     585    1.09   1175   6.78    1760   2.49   0.9320
The second simulated images   Without SP  599    0.70   336    15.93   935    1.07   0.7859
                              With SP     256    0.31   360    17.07   625    0.71   0.8448

It is evident that the noise is effectively suppressed for the Yellow River data set and the second simulated data set. As for the Bern and Ottawa data sets, the visual difference is less obvious, because their noise level is relatively lower than that of the Yellow River data set. As shown in Figs. 14.47D and I, the result with sparse representation for the first simulated data set reduces the isolated points on the margin. Table 14.14 gives the FA, MA, and OE values and the corresponding percentages for the data sets. From Table 14.14, we can see that more pixels are falsely detected by the method without the sparse representation process than by the method with it, and that the sparse representation technique reduces the FA values to different degrees for the data sets. The quantitative results thus confirm that the use of sparse representation can reduce the interference of speckle noise.


CHAPTER 15

Hyperspectral image processing based on sparse learning and sparse graph

Chapter Outline
15.1 Hyperspectral image denoising based on hierarchical sparse learning
  15.1.1 Spatial-spectral data extraction
  15.1.2 Hierarchical sparse learning for denoising each band-subset
  15.1.3 Experimental results and discussion
    15.1.3.1 Experiment on simulated data
    15.1.3.2 Experiment on real data
15.2 Hyperspectral image restoration based on hierarchical sparse Bayesian learning
  15.2.1 Beta process
    15.2.1.1 Full hierarchical sparse Bayesian model
  15.2.2 Experimental results
    15.2.2.1 Denoising
    15.2.2.2 Predicting the missing data
    15.2.2.3 Discussion
15.3 Hyperspectral image dimensionality reduction using a sparse graph
  15.3.1 Sparse representation
  15.3.2 Sparse graph-based dimensionality reduction
  15.3.3 Sparse graph learning
  15.3.4 Spatial-spectral clustering
  15.3.5 Experimental results
    15.3.5.1 Introduction of hyperspectral datasets
    15.3.5.2 Classification results
    15.3.5.3 Influence of spatial-spectral clustering
    15.3.5.4 Convergence analysis
References

A hyperspectral image (HSI) is a three-dimensional data cube integrating spectral and spatial information, which can provide more reliable and more accurate information about the observed object [1–3]. It has been widely applied in fields such as remote sensing, medical diagnosis, and mineralogy, with good results.


15.1 Hyperspectral image denoising based on hierarchical sparse learning

In order to restore HSI better, we extend the denoising framework of a two-dimensional image to a hierarchical dictionary learning framework using the Bayesian method. Let $Y$ be an HSI of size $l_m \times l_n \times l_l$, where $l_m \times l_n$ is the number of spatial pixels and $l_l$ is the size of the spectral domain. In general, the spectral correlation of HSI is more important than its spatial characteristics, and in recent years HSI has been reported to give high performance in tracking, target detection, and classification. The method is divided into two main stages: spatial-spectral data extraction and noise-free estimation based on hierarchical sparse learning. First, based on the structural similarity index (SSIM), a band-subset partition is introduced and the HSI is divided into one or more band subsets; each band subset is composed of highly correlated continuous bands, and a local uniform region is then extracted for each spectral pixel. Second, a hierarchical sparse learning model composed of a clean image term, a Gaussian noise term, and a sparse noise term is established to suppress all kinds of noise. In order to effectively capture the potential spatial information of HSI, a Gaussian process with gamma constraints is applied to the dictionary of the clean image term. The second and third terms are used to infer the statistical characteristics of the existing noise, such as Poisson noise, Gaussian noise, dead pixel lines, stripes, or mixtures of them. The restoration result for each band subset is obtained by solving the framework with the Bayesian method. The proposed methodological framework is shown in Fig. 15.1.

Figure 15.1 Framework of the proposed approach: the noisy hyperspectral image undergoes spatial-spectral data extraction (spectral correlation calculation between adjacent bands, local segmentation point detection, non-overlapped band-subset generation, and overlapped cubic patch partition for each band-subset), followed by hierarchical sparse learning for denoising each band-subset (hierarchical model construction and Bayesian inference), yielding the denoised hyperspectral image.


15.1.1 Spatial-spectral data extraction

In HSI, adjacent bands are obtained under relatively similar sensor conditions, and they have a strong correlation in the spectral domain. On this basis, the spectral correlation between adjacent bands is measured by SSIM [4]. Suppose $B_j$ and $B_{j+1}$ represent the 2D images lying in the $j$-th and $(j+1)$-st bands, respectively. The structural similarity between the $j$-th and $(j+1)$-st bands can be defined by Eq. (15.1).

$$\mathrm{SSIM}(B_j, B_{j+1}) = \frac{\left(2\mu_{B_j}\mu_{B_{j+1}} + c_1\right)\left(2\sigma_{B_j}\sigma_{B_{j+1}} + c_2\right)}{\left(\mu_{B_j}^2 + \mu_{B_{j+1}}^2 + c_1\right)\left(\sigma_{B_j}^2 + \sigma_{B_{j+1}}^2 + c_2\right)} \tag{15.1}$$

In Eq. (15.1), $\mu_{B_j}$ and $\sigma_{B_j}^2$ are the mean and variance of band $B_j$, respectively; $\mu_{B_{j+1}}$ and $\sigma_{B_{j+1}}^2$ are the mean and variance of band $B_{j+1}$; the predefined constants $c_1$ and $c_2$ are applied to stabilize the division with a weak denominator. Normally, the closer $\mathrm{SSIM}(B_j, B_{j+1})$ is to one, the stronger the structural correlation is between the $j$-th and $(j+1)$-st spectral bands. Letting $S_c(j) = \mathrm{SSIM}(B_j, B_{j+1})$, the structural correlation curve $S_c$ can be generated. Fig. 15.2 shows the structural correlation curves of different hyperspectral images. Based on these curves, it can be found that the correlation coefficients between adjacent bands vary greatly across different hyperspectral images. The curve in Fig. 15.2A shows a relatively stable trend. There are obvious dips in Figs. 15.2B and C, which means that the correlations between some adjacent bands in the Urban and Indian Pines data are much lower, while the continuous spectral bands between two adjacent dips show a relatively stable trend. However, most previous studies have ignored this property of the spectral bands: they either restore all bands directly, or denoise band subsets constructed by sequentially dividing the spectral bands into groups of a fixed size. In order to make full use of the correlation between adjacent bands, the optimal segmentation in the spectral domain is determined by estimating the dips in the curve $S_c$.

Figure 15.2 Structural correlation curves of different hyperspectral images: (A) Pavia data; (B) Urban data; (C) Indian Pines data.

The detailed procedure of spatial-spectral data extraction is as follows. First, form the structural correlation curve $S_c$. Second, detect the local segmentation points $S_c(j)$ in the curve $S_c$, where $S_c(j)$ meets the condition denoted by Eq. (15.2).

$$\begin{cases} S_c(j-1) - S_c(j) > h \\ S_c(j+1) - S_c(j) > h \end{cases} \tag{15.2}$$

In Eq. (15.2), $h$ is a predefined threshold to avoid local interference caused by noise in the curve $S_c$. The starting and ending points of $S_c$ are also regarded as boundary points. In order to make better use of the useful spectral features and suppress noise, a fusion image is generated by averaging the spectral bands between adjacent segmentation points, and Eqs. (15.1) and (15.2) are then used to determine whether adjacent fusion images should be merged. The dividing points in the curve $S_c$ can be identified by this method. As shown in Fig. 15.2A, when the correlations between all adjacent bands are relatively stable, there is no need to segment the spectral bands; in this case, the HSI itself can be regarded as a special band subset. As shown in Figs. 15.2B and C, the HSI can otherwise be divided into nonoverlapping band subsets according to the partition points. Let $C$ represent the number of subsets. The noisy data can then be reorganized as $X = \{X^1, \ldots, X^c, \ldots, X^C\}$, $c = 1, \ldots, C$, where $X^c$ is the $c$-th band subset. Finally, in order to effectively preserve the local details of the HSI in the spatial domain [5], we use cubic patches instead of two-dimensional patches in the denoising process, and each band subset is divided into multiple overlapping cubic patches. The size of each cubic patch is $l_x \times l_y \times l_c$, where $l_x \times l_y$ is the spatial dimension and $l_c$ is the number of spectral bands in the $c$-th band subset. Note that different material categories may appear in the same cubic patch: as the spatial size of the cubic patch increases, the spectral characteristic pollution (mixing/blurring) increases, so larger $l_x$ and $l_y$ may destabilize the classification accuracy, while a neighborhood that is too small cannot mine the spatial information well. In this chapter, we strike a balance between the stability and the exploitation of spatial information and set $l_x = l_y = 4$ in the experiments. After each cubic patch is vectorized, the $c$-th band subset $X^c = \{x_1^c, \ldots, x_i^c, \ldots, x_M^c\}$ is obtained, where $M = (l_m - l_x + 1)(l_n - l_y + 1)$ is the number of cubic patches, and $x_i^c \in \mathbb{R}^P$, $P = l_x \times l_y \times l_c$, is the vector generated from the $i$-th cubic patch of the $c$-th band subset.


15.1.2 Hierarchical sparse learning for denoising each band-subset

In this section, a hierarchical sparse learning framework is constructed by using prior and hyperprior distributions to restore the noisy HSI [6–8]. A Gaussian process and a gamma distribution are integrated into the dictionary atoms, and the spatial consistency of HSI is exploited. A prior composed of a prior distribution and a hyperprior distribution is often called a hierarchical prior; it is worth noting that sparse learning frameworks with multiple layers of priors can be regarded as special cases of deep learning. Consider the data set $X = \{X^1, \ldots, X^c, \ldots, X^C\}$, $c = 1, \ldots, C$. The method restores each band subset of the HSI independently. For the $c$-th band subset $X^c = \{x_1^c, \ldots, x_i^c, \ldots, x_M^c\}$, the hierarchical denoising model can be denoted as Eq. (15.3).

$$X^c = D^c A^c + N^c + Q^c \circ S^c \tag{15.3}$$

where the symbol $\circ$ represents element-wise multiplication. The first term on the right side of Eq. (15.3) represents the ideal noise-free estimate of $X^c$, expressed as a linear combination of the dictionary atoms. $D^c = \left[d_1^c, \ldots, d_k^c, \ldots, d_K^c\right] \in \mathbb{R}^{P \times K}$ has the dictionary atoms to be learned as its columns, where $K$ is the number of dictionary atoms. In the context of HSI, the characteristics of $x_i^c$ are highly consistent with the samples of adjacent regions. In order to better model HSI [9], this prior knowledge is explicitly applied to the dictionary atoms $d_k^c$ by using a Gaussian process (GP) constrained by a gamma distribution. In addition, spectral pixels with similar characteristics in a local region can share the same dictionary atoms with a very high probability, which is consistent with visual and structural cognition. $A^c = \left[a_1^c, \ldots, a_i^c, \ldots, a_M^c\right] \in \mathbb{R}^{K \times M}$ places a sparsity restriction on the coefficient vectors to remove noise. It can be written as $A^c = W^c \circ Z^c$. The matrix $W^c = \left[w_1^c, \ldots, w_i^c, \ldots, w_M^c\right]$, with size $K \times M$, contains the weights of the matrix $A^c$, where $w_i^c$ is drawn from a Gaussian distribution, and the sparseness of $A^c$ is adjusted by $Z^c = \left[z_1^c, \ldots, z_i^c, \ldots, z_M^c\right] \in \{0, 1\}^{K \times M}$, which is drawn from the beta-Bernoulli process. Let the symbol $\sim$ denote an i.i.d. draw from a distribution, $N$ represent the normal distribution, $I_K$ be the $K \times K$ identity matrix, and Bern and Beta refer to the Bernoulli and beta distributions, respectively. Then the noise-free data $D^c A^c$ may be represented in the following manner:

$$\begin{aligned}
&\Sigma_{jj'} = \xi_1 \exp\!\left(-\left\| x_j^c - x_{j'}^c \right\|_2^2 \big/ \xi_2\right), \qquad \xi_1 \sim G(b, \rho), \qquad d_k^c \sim N(0, \Sigma),\\
&w_i^c \sim N\!\left(0,\; I_K / \gamma_w^c\right), \qquad z_{ki}^c \sim \mathrm{Bern}\!\left(\pi_k^c\right), \qquad \pi_k^c \sim \mathrm{Beta}\!\left(a_0 / K,\; b_0 (K-1) / K\right)
\end{aligned} \tag{15.4}$$

where $\xi_1$ is drawn from a gamma distribution and $\xi_2$ is a predefined constant; together they represent the smoothness between $x_j^c$ and $x_{j'}^c$. If $\left\| x_j^c - x_{j'}^c \right\|_2^2$ is small, the corresponding components of $x_j^c$ and $x_{j'}^c$ are considered to have great spatial consistency. In this way, both the spatial consistency and the spectral correlation of the $c$-th band subset $X^c$ are used. $z_{ki}^c$ indicates whether $d_k^c$ is used to represent $x_i^c$, with probability $\pi_k^c$. As $K \to \infty$, the expectation $E(\pi_k) = \frac{K^{-1} a_0}{K^{-1} a_0 + K^{-1}(K-1) b_0}$ is approximately 0.

According to this, most of the elements in the set $\left\{z_{ki}^c\right\}_{k=1,\ldots,K}$ are equal to zero, and sparsity is reasonably imposed on the vector $a_i^c$. Obviously, $a_i^c$ is composed of a small number of nonzero values and a large number of zero values, and the constraint constructed by the beta-Bernoulli process can be regarded as a special $\ell_0$ norm. Note that the noiseless estimation of each $x_i^c$ uses the specific subset of the dictionary atoms $\left\{d_k^c\right\}_{k=1,\ldots,K}$ specified by the sparse vector $\left\{z_{ki}^c\right\}_{k=1,\ldots,K}$: when $z_{ki}^c = 1$, the atom $d_k^c$ is employed with the representation coefficient $a_{ki}^c$. By analyzing the locations and the number of nonzero elements in the set $\left\{z_{ki}^c\right\}_{k=1,\ldots,K}$, the dictionary size can be learned adaptively, including atom selection and prediction.

The matrix $N^c = \left[n_1^c, \ldots, n_i^c, \ldots, n_M^c\right]$, with size $P \times M$, represents the zero-mean Gaussian noise. $n_i^c$ is drawn from a zero-mean Gaussian noise component with precision $\gamma_n^c$, as expressed in Eq. (15.5). In addition, Poisson noise is signal-dependent noise, which means that high-brightness pixels are strongly interfered with. In order to eliminate the dependence of the noise variance on the signal, a variance-stabilizing transformation (VST) is introduced before implementing the denoising method [10,11] to convert Poisson noise into Gaussian noise; after recovery, the final restoration result of the HSI is obtained by applying the corresponding inverse transformation.

$$n_i^c \sim N\!\left(0,\; I_P / \gamma_n^c\right) \tag{15.5}$$

The third term represents sparse noise, such as dead pixel lines, and uses the beta-Bernoulli process to model its sparsity. The matrix $Q^c = \left[q_1^c, \ldots, q_i^c, \ldots, q_M^c\right]$, with size $P \times M$, represents the intensity of the sparse noise, and $S^c = \left[s_1^c, \ldots, s_i^c, \ldots, s_M^c\right] \in \mathbb{R}^{P \times M}$ depicts the location information where the sparse noise may exist.

$$q_i^c \sim N\!\left(0,\; I_P / \gamma_v^c\right), \qquad s_{pi}^c \sim \mathrm{Bernoulli}\!\left(\theta_{pi}^c\right), \qquad \theta_{pi}^c \sim \mathrm{Beta}\!\left(a_q, b_q\right) \tag{15.6}$$

When $s_{pi}^c = 1$, the $p$-th element of $x_i^c$ is polluted by sparse noise with amplitude $q_{pi}^c$. By adjusting the shape parameters of the beta distribution ($a_q$ and $b_q$), the expectation $E\!\left(\theta_{pi}^c\right) = a_q / (a_q + b_q)$ can be made close to zero. Each element of $s_i^c$ is i.i.d., and the arbitrariness of the sparse noise positions is well described by drawing from the beta distribution. $\gamma_w^c$, $\gamma_n^c$, and $\gamma_v^c$ are noninformative hyperparameters, which regulate the precision of $w_i^c$, $n_i^c$, and $q_i^c$. To solve the model flexibly with the posterior PDF, gamma distributions are placed on these hyperparameters.

$$\gamma_w^c \sim G(c, d), \qquad \gamma_n^c \sim G(e, f), \qquad \gamma_v^c \sim G(g, h) \tag{15.7}$$

The negative logarithmic posterior density function of the above model, conditioned on all the data $X^c = \left\{x_i^c\right\}_{i=1,\ldots,M}$, can be represented as

$$\begin{aligned}
-\log\, & p\!\left(D^c, \left\{w_i^c, z_i^c, q_i^c, s_i^c\right\}, \left\{\pi_k^c\right\}, \left\{\theta_p^c\right\}, \xi_1^c, \gamma_w^c, \gamma_n^c, \gamma_v^c \;\middle|\; \left\{x_i^c\right\}\right) = \\
& \frac{\gamma_n^c}{2} \sum_i \left\| x_i^c - D^c\!\left(w_i^c \circ z_i^c\right) - q_i^c \circ s_i^c \right\|_2^2
+ \frac{1}{2} \sum_k \left(d_k^c\right)^{T} \Sigma^{-1} d_k^c \\
& + \sum_{i,k} \log \mathrm{Bernoulli}\!\left(z_{ik}^c \,\middle|\, \pi_k^c\right)
+ \sum_k \log \mathrm{Beta}\!\left(\pi_k^c \,\middle|\, a_0/K,\; b_0(K-1)/K\right)
+ \frac{\gamma_w^c}{2} \sum_{i,k} \left(w_{ik}^c\right)^2 \\
& + \sum_{p,i} \log \mathrm{Bernoulli}\!\left(s_{pi}^c \,\middle|\, \theta_{pi}^c\right)
+ \sum_{p} \log \mathrm{Beta}\!\left(\theta_{p}^c \,\middle|\, a_q, b_q\right)
+ \frac{\gamma_v^c}{2} \sum_{p,i} \left(q_{pi}^c\right)^2 \\
& + \log \mathrm{Gamma}\!\left(\xi_1^c \,\middle|\, b, \rho\right)
+ \log \mathrm{Gamma}\!\left(\gamma_w^c \,\middle|\, c, d\right)
+ \log \mathrm{Gamma}\!\left(\gamma_n^c \,\middle|\, e, f\right)
+ \log \mathrm{Gamma}\!\left(\gamma_v^c \,\middle|\, g, h\right)
+ \mathrm{const}
\end{aligned} \tag{15.8}$$

The complete posterior of all parameters can be obtained, rather than a point approximation such as the maximum a posteriori (MAP) estimate. All the parameters needed to solve the model can be inferred through the prior and hyperprior distributions; therefore, the proposed hierarchical sparse learning framework has stronger robustness and accuracy. The graphical representation of the complete model is shown in Fig. 15.3. Gibbs sampling is then implemented to solve the hierarchical model by exploiting the conjugacy of the model posterior distributions [12,13].

Figure 15.3 Graphical representation of the hierarchical sparse learning model.

The update equations draw each random variable from its conditional distribution given the most recent values of all the other variables in the model; the details of the update equations can be found in the Appendix. After performing Gibbs sampling, $D^c$ and $A^c$ can be obtained from $d_k^c$, $w_i^c$, and $z_i^c$. Then the restored images for the $c$-th band-subset $\widehat{X}^c = \left\{\widehat{x}_1^c, \ldots, \widehat{x}_i^c, \ldots, \widehat{x}_M^c\right\}$ can be inferred by calculating $D^c A^c$. Additionally, as there are several solutions for the same spectral pixel due to the use of overlapping cubic patches, the final restored result is constructed by averaging all overlapping cubic patches. For all band-subsets $X = \{X^1, \ldots, X^c, \ldots, X^C\}$, $c = 1, \ldots, C$, the whole denoised HSI is obtained by performing this operation sequentially.
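To make the generative structure of Eqs. (15.3)–(15.7) more concrete, the following sketch draws one synthetic band-subset from a simplified version of the model; an ordinary i.i.d. Gaussian prior stands in for the GP prior on the dictionary atoms of Eq. (15.4), and all sizes and hyperparameter values are illustrative assumptions rather than the settings used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
P, M, K = 48, 500, 128              # patch dimension, number of patches, dictionary size
a0, b0 = 1.0, 1.0                   # beta-Bernoulli hyperparameters for the codes (assumed)
a_q, b_q = 1.0, 50.0                # beta-Bernoulli hyperparameters for sparse noise (assumed)

D = rng.standard_normal((P, K))                  # dictionary atoms (Gaussian stand-in for the GP prior)
pi = rng.beta(a0 / K, b0 * (K - 1) / K, size=K)  # per-atom usage probabilities
Z = rng.random((K, M)) < pi[:, None]             # binary atom-selection matrix
W = rng.standard_normal((K, M))                  # Gaussian weights
A = W * Z                                        # sparse coefficients, A = W o Z

N = rng.normal(0.0, 0.05, size=(P, M))           # dense zero-mean Gaussian noise
theta = rng.beta(a_q, b_q, size=(P, M))          # sparse-noise location probabilities
S = rng.random((P, M)) < theta                   # sparse-noise support
Q = rng.standard_normal((P, M))                  # sparse-noise amplitudes

X = D @ A + N + Q * S                            # observed band-subset, Eq. (15.3)
print(X.shape, Z.mean(), S.mean())               # sparsity of codes and of sparse noise
```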

15.1.3 Experimental results and discussion

To evaluate the performance of the proposed approach, six state-of-the-art denoising methods are selected for comparison on both simulated and real data: K-SVD [14], BM3D [15], ANLM3D [16], BM4D [17], LRMR [18], and DDL3+FT [19]. The necessary parameters of the K-SVD, BM3D, ANLM3D, and DDL3+FT methods are fine-tuned or automatically selected, and the best experimental results are reported. The noise variance is selected from the collection {0.01, 0.03, 0.04, 0.05, 0.07, 0.09, 1.1} in BM4D. The rank of the noise-free matrix is selected from {4, 5, 6, 7} in LRMR, and the cardinality of the sparse term is selected from the set {0, 500, 1000, 1500, 2000, 3000, 4000, 5000}. $K$ is set to a relatively small value to reduce the computation time of the proposed method; here we select $K = 128$. The hyperprior parameters of the Gaussian process are set to $b = \rho = 10^{-6}$ and $\xi_2 = 200$, and the number of Gibbs sampling iterations is set to 100. When setting the other parameters, two cases should be considered. (1) The HSI is polluted by Gaussian noise, mixed noise, or dead pixel lines: the parameters are set as $a_0 = b_0 = c = d = 10^{-6}$, $e = f = 10^{-6}$, $a_q = b_q = g = h = 10^{-6}$. (2) The HSI is polluted by a mixture of Gaussian and Poisson noise, or of Gaussian noise, dead pixel lines, and Poisson noise: the parameters are set as $a_0 = b_0 = 10^{-4}$, $c = d = 10^{-5}$, $e = f = 10^{-6}$, $a_q = b_q = 10^{-4}$, $g = h = 10^{-5}$. Once the type of noise is determined, the higher-level parameters are set as described above and need no further tuning. With these hyperparameters, the potential information of the input HSI is effectively learned by sampling an infinite prior space, and all kinds of noise in the input HSI are suppressed. For one Gibbs sampling iteration, the computational complexity of the method is close to $O(K(P + M) + PM)$. It should be noted that the proposed framework takes much longer than the six compared algorithms.


15.1.3.1 Experiment on simulated data

The simulated-data experiments use the Pavia University image acquired by ROSIS, a reflective optics system imaging spectrometer, over Pavia in northern Italy. It consists of 103 spectral bands, each of size 610 × 340 pixels, covering the reflectance characteristics from 430 to 860 nm at a spectral resolution of 10 nm steps. Before the simulated experiments, the gray values of the Pavia University data are normalized to [0, 1]. To facilitate the comparison, two kinds of evaluation are adopted. (1) The experimental results before and after denoising are evaluated qualitatively, including the spatial images of selected bands and the spectral signatures of several pixels. (2) Two common indexes, the peak signal-to-noise ratio (PSNR) and the feature similarity (FSIM), are used to evaluate the denoising effect quantitatively. The PSNR measures the gray-level similarity between the restored image and the reference image according to the MSE, while FSIM takes the point of view of human perception, combining the gradient magnitude feature and the phase congruency feature to assess the restoration results [20,21]. Higher PSNR and FSIM values indicate better denoising performance. To estimate the effectiveness of the proposed approach, three types of noise are considered for the Pavia data: (1) zero-mean Gaussian noise; (2) Poisson noise, parameterized by the expression $X_{\mathrm{Poisson}} = X * \mathrm{peak}$, where $X_{\mathrm{Poisson}}$ refers to the noisy HSI corrupted by Poisson noise, $X$ is the reference image, and peak denotes the intensity of the Poisson noise; and (3) dead pixel lines, which are added at the same positions of the selected bands in the HSI, with widths varying from one to three lines. In the simulated experiments, the noise is added to the Pavia data in the following three cases.

Case 1: The noise standard deviation $\sigma$ for each band of the HSI is randomly selected from 0.02 to 0.15.

Fig. 15.4 shows the PSNR and FSIM values of each band before and after recovery as a quantitative evaluation on the Pavia data. In Fig. 15.4, the curve of the noisy HSI fluctuates greatly across the spectrum, which is caused by the variation of $\sigma$ with the spectral band.
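For reference, PSNR is computed band by band from the mean squared error between the restored and reference bands. A minimal sketch of this computation is given below, assuming data normalized to [0, 1] as described above; it is a generic implementation, not the exact evaluation code used in this chapter.

```python
import numpy as np

def psnr_per_band(reference, restored, data_range=1.0):
    """Band-wise PSNR in dB for two HSI cubes of shape (rows, cols, bands)."""
    mse = np.mean((reference - restored) ** 2, axis=(0, 1))
    mse = np.maximum(mse, 1e-12)                 # guard against division by zero
    return 10.0 * np.log10(data_range ** 2 / mse)

# Illustrative use with random cubes standing in for the clean and restored HSI.
clean = np.random.rand(64, 64, 103)
restored = clean + np.random.normal(0, 0.05, clean.shape)
print(psnr_per_band(clean, restored).mean())
```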


Figure 15.4 Quantitative evaluation results for Pavia data: (A) PSNR; (B) FSIM.

Therefore, noise information is very important for greatly improving the denoising performance. By adaptively predicting the noise and fully mining the spectral-spatial information, the proposed method obtains higher PSNR and FSIM values than its competitors in most bands. KSVD and BM3D restore the HSI with a predefined fixed noise variance and do not learn $\sigma$ during the simulation; moreover, they are applied band by band to the noisy HSI, ignoring the strong spectral correlation. Consequently, KSVD and BM3D show lower values in Figs. 15.4A and B. ANLM3D and BM4D use spatial and spectral information to suppress noise and obtain better results than KSVD and BM3D, but ANLM3D shows very unstable performance, as shown in Fig. 15.4. LRMR takes advantage of the low-rank property of HSI and achieves a better FSIM value by preserving the potential characteristics of HSI well. By exploring hierarchical deep learning and fine tuning, DDL3+FT shows a similar PSNR value, as shown in Fig. 15.4A. In the recovery process, LRMR and DDL3+FT convert the HSI into a two-dimensional matrix, so neither of them can effectively exploit the spatial consistency of HSI. Obviously, the proposed method shows a more stable trend than the six compared methods, which proves its effectiveness and robustness for suppressing zero-mean Gaussian noise when $\sigma$ varies across bands. In terms of visual comparison, Fig. 15.5 shows the denoising results of band 101 obtained by the different methods.


Figure 15.5 Restored images of band 101 corrupted with Gaussian noise: (A) clean HSI; (B) noisy HSI; (C) KSVD; (D) BM3D; (E) ANLM3D; (F) BM4D; (G) LRMR; (H) DDL3+FT; (I) ours.


The results show that this method achieves good results in suppressing noise and maintaining the local detail structure, which is further illustrated by the magnified areas in the restored images of all the compared methods. KSVD has poor performance and loses useful structural information. BM3D smooths some important objects and also does a worse recovery job. ANLM3D can effectively take advantage of the high nonlocal self-similarity and strike a balance between smoothing and structure maintenance; however, it is still unable to restore the outlines of local targets. The denoising results of BM4D and DDL3+FT lose some fine targets. LRMR can obtain results similar to the proposed method, but the Gaussian noise is not reduced well, as shown in Fig. 15.5G. Obviously, these visual evaluation results are completely consistent with the numerical evaluation results above.

Case 2: The mixed noise consists of zero-mean Gaussian noise with standard deviation $\sigma = 0.2$ and dead pixel lines.

For case 2, the recovery results of band 95 obtained by the different methods, including KSVD, ANLM3D, BM3D, BM4D, DDL3+FT, LRMR, and the proposed method, are shown in Fig. 15.6. In all the restored images, an area of interest is magnified.


Figure 15.6 Restored results of band 95 corrupted with a mixture of Gaussian noise and dead pixel lines: (A) clean HSI; (B) noisy HSI; (C) KSVD; (D) BM3D; (E) ANLM3D; (F) BM4D; (G) LRMR; (H) DDL3+FT; (I) ours.

The results in Figs. 15.6A–H show that the visual performance of this method is closest to that of the clean HSI. The dead pixel lines in Figs. 15.6C–H are still obvious, which means that the six compared methods cannot suppress the dead pixel lines. In addition, the magnified regions show that the method can effectively restore uniform regions while retaining edges. KSVD can partly improve the quality of the noisy images; however, it destroys the structure of the sparse coefficients during dictionary learning, resulting in the loss of edges and other structural details, as in Fig. 15.6C. BM3D reduces noise by using the statistics of similar adjacent patches and gives a better visual impression than KSVD; however, it smooths the structural information and the result is blurred. ANLM3D can effectively make use of the high nonlocal self-similarity to better balance smoothing and structure retention, but it cannot retain edges and details. By using the three-dimensional nonlocal self-similarity of the data cube, BM4D achieves a visual improvement; however, it removes some fine objects and oversmooths the HSI. As shown in Fig. 15.6G, the blurred black line means that LRMR only removes some of the dead pixel lines, and the blurred white dots show that LRMR cannot effectively remove heavy Gaussian noise. Obviously, the DDL3+FT method cannot restore the mixed-noise image, which can be seen from the blurred edges and the obvious dead pixel lines in Fig. 15.6H. Figs. 15.7 and 15.8 show the vertical and horizontal profiles of band 95 at pixel (559, 150) in the case 2 simulation experiment, respectively. Obviously, the results differ visually in shape and amplitude, in which the rapid fluctuations are caused by the existence of dead pixel lines.


Figure 15.7 Vertical profiles of band 95 at pixel (559, 150) before and after denoising: (A) clean HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.8 Horizontal profiles of band 95 at pixel (559, 150) before and after denoising: (A) clean HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

It can be seen from Figs. 15.7 and 15.8 that the profiles produced by this method are the closest to those of the original HSI, regarding the removal of Gaussian noise and dead pixel lines. The curves in Figs. 15.7B–G and 15.8B–G are not ideally consistent with those in Figs. 15.7A and 15.8A, which reflects the limited denoising ability of the compared methods and strongly supports the above analysis. The spectral signatures at pixel (559, 150) of the clean image and the restored images are shown in Fig. 15.9; such signatures are very important for the classification and recognition of HSI. KSVD and BM3D denoise the HSI band by band and destroy the spectral-spatial correlation: as shown in Figs. 15.9B and C, there are identifiable artifacts in the signatures restored by KSVD and BM3D. With ANLM3D and BM4D, the recovered signatures are closer to the initial spectral reflectance curve because spectral-spatial information is used; however, compared with the clean spectral signature, they still show strong fluctuations. The signatures restored by LRMR and DDL3+FT have similar trends and shapes, but the details are not preserved well. Obviously, the spectral signature obtained by the proposed method is the best, which shows its advantage in suppressing dead pixel lines and Gaussian noise.

Case 3: The mixed noise consists of Poisson noise, zero-mean Gaussian noise, and dead pixel lines, with $\sigma = 0.15$ and peak = 30.

For case 3, band 90 of the initial HSI and of the denoising results is shown in Fig. 15.10. One area of all the listed images is magnified for clear comparison.



Figure 15.9 Spectral reflectance curves at pixel (559, 150) before and after denoising: (A) clean HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.10 Restored results of band 90 corrupted with a mixture of Gaussian noise, Poisson noise, and dead pixel lines: (A) clean HSI; (B) noisy HSI; (C) KSVD; (D) BM3D; (E) ANLM3D; (F) BM4D; (G) LRMR; (H) DDL3+FT; (I) ours.


KSVD shows poor performance, as shown in Fig. 15.10C, and BM3D is oversmoothed and loses some useful objects. In addition, there are obvious dead pixel lines in Figs. 15.10C and D, which show that the KSVD and BM3D methods cannot remove dead pixel lines. The ANLM3D and BM4D algorithms can only reduce some of the dead pixel lines, and, as shown in Figs. 15.10E and F, neither of them can preserve the fine objects very well. There are still dead pixel lines and some Gaussian noise in the image restored by LRMR. From Fig. 15.10H, it can be seen that DDL3+FT fails to suppress the dead pixel lines. As shown in Fig. 15.10I, our method can effectively remove Gaussian noise, Poisson noise, and dead pixel lines while retaining local details such as edges and texture; obviously, its performance is better than that of the six compared methods. Figs. 15.11 and 15.12 display the vertical and horizontal profiles of band 90 at pixel (399, 290) in the case 3 simulation experiment, respectively, and Fig. 15.13 shows the spectral reflectance curves of the competing methods at the same position (399, 290) before and after denoising. The rapid fluctuations in the profiles are caused by the dead pixel lines. The visual comparison is carried out according to the differences in shape and amplitude in Figs. 15.11–15.13. KSVD is at a disadvantage in suppressing the mixed noise and retaining the spectral information, as shown in Figs. 15.11–15.13B. According to Figs. 15.11–15.13C, BM3D oversmooths the structure and introduces some recovery artifacts. It can be found from Figs. 15.11–15.13D that ANLM3D can partly reduce the mixed noise, which is easily seen from the reduced rapid fluctuation in Fig. 15.12D. As shown in Figs. 15.11–15.13E and F, compared with KSVD, BM3D, ANLM3D, and DDL3+FT, the curves obtained by BM4D and LRMR are very close to the initial curves, but the details are not satisfactory.


Figure 15.11 Vertical profiles of band 90 at pixel (399, 290) before and after denoising: (A) clean HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.12 Horizontal profiles of band 90 at pixel (399, 290) before and after denoising: (A) clean HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.13 Spectral reflectance curves before and after denoising at pixel (399, 290): (A) clean HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

By comparing the areas marked with red rectangles, it is easy to see that our method obtains the result closest to the internal pattern of the clean HSI, which is completely consistent with the above analysis.


15.1.3.2 Experiment on real data

This experiment uses two well-known real data sets, the Urban data and the Indian Pines data. The experimental results before and after restoration are evaluated qualitatively by visual impression. Since there are no reference images for the real HSI, numerical measures such as PSNR and FSIM cannot be calculated; therefore, the classification accuracy on the Indian Pines data is used to estimate the denoising performance quantitatively.

15.1.3.2.1 Denoising for Urban data

The Urban data are acquired by a HYDICE sensor, and their original size is 307 × 307 × 210. Because of differences between the detectors, the bands have different intensities and are corrupted by band-dependent mixed noise. A subset of size 150 × 150 × 188 was used in the following experiments after removing the bands of the Urban data polluted by the atmosphere and water absorption (bands 104–108, 139, and 207–210). Fig. 15.14 shows the restored images of band 186 obtained by the different methods; for visual purposes, the yellow arrows in Fig. 15.14 mark the obvious stripes. At the same time, Fig. 15.15 gives the magnified details of the regions marked with red rectangles in Fig. 15.14. KSVD is not good at reducing stripes and preserving structure, as shown in Figs. 15.14B and 15.15B. Obviously, BM3D and BM4D smooth important texture and fine targets, and neither of them removes the stripes. According to Figs. 15.14D and G and Figs. 15.15D and G, it can be found that ANLM3D and DDL3+FT can partly suppress the stripes.


Figure 15.14 Restored results in Urban image: (A) original band 186; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.15 Magnified results of the various approaches in the red rectangle of Fig. 15.14: (A) original band 186; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

DDL3+FT has a better ability to retain texture and edge information than ANLM3D. In terms of noise reduction and structure maintenance, LRMR has better recovery performance than the other five methods; however, it can be clearly observed from Figs. 15.14F and H that LRMR is not as good as the proposed method in removing the stripes. In general, the proposed restoration method can effectively recover the Urban data and performs better than the six compared denoising methods. In addition, Fig. 15.16 shows the results for band 104 before and after denoising; as shown in Fig. 15.16A, there are many stripes in the initial band 104. Fig. 15.17 shows the false-color images of the Urban data before and after restoration, composed of bands 1, 104, and 135. KSVD updates the dictionary atoms one by one and destroys the structure of the sparse coefficients; as shown in Figs. 15.16B and 15.17B, it blurs the image structure and shows poor performance. Through block matching and three-dimensional collaborative filtering, BM3D greatly improves the quality of the original image, but there are still obvious stripes in Figs. 15.16C and 15.17C. According to the red elliptical regions, ANLM3D, BM4D, LRMR, and DDL3+FT can partially remove the stripes, while LRMR recovers the severely noise-polluted regions poorly. The results show that our method has a better edge-preserving effect and a good denoising effect in flat areas. In particular, by comparing the blue rectangular areas in Fig. 15.16, it is found that the method can recover the targets and keep the edges very well while effectively reducing the stripes and mixed noise. Compared with KSVD, BM3D, ANLM3D, BM4D, LRMR, and DDL3+FT, the restored images show a good visual effect and better detail preservation.

Hyperspectral image processing based on sparse learning and sparse graph (A)

(B)

(C)

(D)

(E)

(F)

(G)

(H)

577

Figure 15.16 Restored results in Urban image: (A) original band 104; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

Figure 15.17 Restored results of Urban image: (A) original false-color image (R: 1, G: 104, and B: 135); (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

15.1.3.2.2 Experimental results on Indian Pines data

The second data set, named Indian Pines, was recorded by the NASA AVIRIS sensor over the Indian Pines region in 1992, and it contains much random noise in some bands introduced during the acquisition process. It comprises 220 spectral bands, and the spatial dimension of each spectral band is 145 × 145 pixels. For the Indian Pines data, the ground truth has 16 land-cover classes and a total of 10,366 labeled pixels. Figs. 15.18–15.20 display the visual comparisons of different bands polluted by different noises. Obviously, the proposed method achieves better visual quality than the compared ones. From Figs. 15.18–15.20B, KSVD shows poorer denoising capability than the other methods. BM3D significantly blurs the images and loses texture information and edges. As shown in Figs. 15.18D and 15.19D, the improvements in image quality obtained by ANLM3D are very small and can be neglected. BM4D oversmooths and loses texture information. By observing the regions marked by the blue rectangles in Figs. 15.18–15.20, our denoising method has a better ability for edge and structure preservation than LRMR and DDL3+FT; meanwhile, LRMR and DDL3+FT do a worse job of efficiently removing the random noise, as shown in the regions marked by the red ellipses in Figs. 15.18–15.20. Our algorithm is greatly superior to the six compared methods on seriously corrupted images, which is consistent with the above analysis. Therefore, our algorithm performs best in the removal of random noise for the Indian Pines data, while effectively improving the quality of the noisy HSI and restoring the texture and structure details.

(B)

(C)

(D)

(E)

(F)

(G)

(H)

Figure 15.18 Restored results in Indian Pines image: (A) original band 164; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.19 Restored results in Indian Pines image: (A) original band 220; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.


Figure 15.20 Restored results in Indian Pines image: (A) original band 1; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

Table 15.1: SVM classification accuracies on the heavily corrupted bands (104–108, 150–163, 220) of the Indian Pines data.

       Initial HSI   KSVD     BM3D     ANLM3D   BM4D     LRMR     DDL3+FT   Ours
OA     15.74%        57.96%   81.2%    25.17%   69.27%   48.38%   49.12%    83.76%
k      0.0912        0.5368   0.7803   0.2135   0.679    0.3664   0.4284    0.8109

For the classification-based assessment, two cases were considered on the basis of the test data: (1) the 20 severely corrupted bands of the Indian Pines data (bands 104–108, 150–163, and 220) were classified; (2) the Indian Pines data were classified after removing these 20 seriously corrupted bands. Similar to the situation in Ref. [22], the training samples of the small classes "alfalfa," "grass/pasture mowed," and "oats" were set at 15 samples per class, and the number of training samples in the other classes was set at 50. The support vector machine (SVM), a classical classification method, is used as the classifier. As usual, the commonly used overall accuracy (OA) and kappa coefficient are selected as the evaluation metrics, and the classification maps are used for visual assessment. Table 15.1 lists the overall accuracy (OA) and kappa coefficient (k) of the results for the heavily corrupted bands, and Table 15.2 shows the OA and kappa coefficient of the results for the remaining bands. After restoring the testing data, the values of OA and kappa coefficient are obviously enhanced, as shown in Tables 15.1 and 15.2, which demonstrates the necessity of HSI denoising before performing classification. Compared with the other algorithms, the proposed method obtains the best OA and kappa coefficient in both Tables 15.1 and 15.2, which means that our denoising method can restore the structural information (which is essential for classification) in the seriously polluted bands as well as in the remaining 200 bands. Note that the ANLM3D method obtains an OA of 25.17% and a kappa of 0.2135 in Table 15.1, which are just 9.43% and 0.1223 higher than those of the initial HSI. This lower classification accuracy and kappa coefficient are in line with its poor denoising performance displayed in Fig. 15.19D. The classification maps of the different algorithms are displayed in Fig. 15.21, where the first row shows the results for the 20 heavily corrupted bands and the second row presents the results for the remaining 200 bands before and after restoration.

Table 15.2: SVM classification accuracies on the Indian Pines data (bands 104–108, 150–163, 220 removed).

        Initial HSI   KSVD     BM3D      ANLM3D   BM4D     LRMR     DDL3+FT   Ours
OA      74.39%        87.96%   0.9215%   81.06%   87.59%   87.18%   85.92%    90.35%
k       0.7183        0.8531   0.8673    0.775    0.8568   0.8548   0.8442    0.8726


Figure 15.21 Classification results for Indian Pines data before and after denoising: the first row is the results of the 20 heavily corrupted bands; the second row presents the results of the remaining 200 bands: (A) original HSI; (B) KSVD; (C) BM3D; (D) ANLM3D; (E) BM4D; (F) LRMR; (G) DDL3+FT; (H) ours.

According to Fig. 15.21, it can be easily observed that the result of the suggested method presents a better visual effect than the six compared algorithms.
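For readers who want to reproduce the two reported measures, the following minimal sketch (not part of the original experiments; variable names are illustrative) computes OA and the kappa coefficient from a pair of label vectors.

```python
import numpy as np

def overall_accuracy_and_kappa(y_true, y_pred):
    """Compute overall accuracy (OA) and the kappa coefficient from two label vectors."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    # Confusion matrix: rows = reference classes, columns = predicted classes.
    cm = np.zeros((classes.size, classes.size), dtype=np.int64)
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            cm[i, j] = np.sum((y_true == ci) & (y_pred == cj))
    n = cm.sum()
    oa = np.trace(cm) / n                        # observed agreement
    pe = np.sum(cm.sum(0) * cm.sum(1)) / n ** 2  # agreement expected by chance
    kappa = (oa - pe) / (1.0 - pe)
    return oa, kappa

# Toy example with a six-pixel label map.
oa, kappa = overall_accuracy_and_kappa([0, 1, 1, 2, 2, 2], [0, 1, 2, 2, 2, 1])
print(f"OA = {oa:.2%}, kappa = {kappa:.4f}")
```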

15.2 Hyperspectral image restoration based on hierarchical sparse Bayesian learning

Sparse Bayesian learning (SBL) has been widely and successfully applied to HSI analysis [23,24]. However, solving sparse Bayesian problems requires computing multidimensional integrals, which are usually analytically intractable. Fortunately, the full conjugacy between the beta and Bernoulli distributions has been demonstrated [25]; therefore, in an SBL model combining the beta and Bernoulli distributions, the posterior computation can be performed analytically. We start with a brief review of the beta process in Section 15.2.1 and then provide a detailed description of the proposed model in Section 15.2.2.

15.2.1 Beta process

The beta process is an infinite jump process, which is suitable for dictionary learning due to its high flexibility [26]. The two-parameter beta process, proposed in Ref. [27], is denoted by the draw H ~ BP(a, b, H0) with parameters a, b > 0. Let Ω be a measurable space and B its σ-algebra; the disjoint and infinitesimal partitions of Ω are denoted as B ∈ {B1, ..., BK}. The base measure H0 is a fixed probability measure over (Ω, B) with H0(Bk) = 1/K for k = 1, ..., K. The set-function form of the two-parameter beta process is as follows:

H(B) = Σ_{k=1}^{K} π_k δ_{B_k}(B),    π_k ~ Beta(a/K, b(K−1)/K)    (15.9)

where H is composed of the Bk sampled i.i.d. from H0, with the K probabilities π_k. π_k represents the jump, which is commonly utilized to parameterize a finite Bernoulli process, and Beta denotes the beta distribution. Supposing z_i ∈ R^K is drawn from a Bernoulli process with parameters π_k, z_i is a binary vector with z_ik ~ Bernoulli(π_k). In dictionary learning, {B_k}_{k=1,...,K} refers to the dictionary atoms and K is the number of atoms. By reasonably choosing K, a, and b, π_k will be near zero and z_ik will equal zero with high probability, which implies the sparse constraints on the dictionary learning model.

15.2.1.1 Full hierarchical sparse Bayesian model

Consider the HSI Y ∈ R^{l_x × l_y × l}, where l_x and l_y define the size of the two spatial dimensions and l refers to the number of bands. To fully exploit the highly correlated spectral information and strongly similar spatial information, Y is divided into overlapping 3D blocks instead of 2D blocks when performing the restoration. The size of each 3D block is P = n_x × n_y × l, where n_x × n_y defines the spatial size of the 3D blocks. In vector form, each block is transformed into x_i ∈ R^P for i = 1, ..., M, and the total number of 3D blocks is M = (l_x − n_x + 1)(l_y − n_y + 1).

For HSI recovery, some existing noises, including impulse noise, stripes, and dead pixel lines, only appear in a small part of the pixels within a band or a few bands, and the intensity and positions of these noises are often subtle and varied. Therefore, these noises can be considered sparse in the hyperspectral images. To fully depict the noise characteristics of the HSI, we decompose the noise term into a Gaussian noise term and a sparse noise term:

x_i = D α_i + n_i + s_i ∘ v_i
d_k ~ N(0, P^{−1} I_P)
α_i = z_i ∘ w_i
z_ik ~ Bernoulli(π_k),    π_k ~ Beta(a_p/K, b_p(K−1)/K)
w_i ~ N(0, γ_w^{−1} I_K),    γ_w ~ G(c, d)
n_i ~ N(0, γ_n^{−1} I_P),    γ_n ~ G(e, f)
s_ip ~ Bernoulli(θ_ip),    θ_ip ~ Beta(a_q, b_q)
v_ip ~ N(0, γ_v^{−1}),    γ_v ~ G(g, h)    (15.10)


With these, the proposed model consists of three terms. The first term represents the "clean and entire" HSI, which can be well learned by the elements of the dictionary. This works because the valid data in corrupted images are intrinsically sparse under the dictionary framework, whereas the noise is uniformly spread and cannot be represented by the dictionary. The second term is the Gaussian noise, and the third term is the sparse noise. A beta process coupled with a Bernoulli process is utilized to depict the sparseness of the valid data and the arbitrariness of the intensity and positions of the sparse noise, and a Gaussian process is exploited to learn the Gaussian noise. In this way, the "clean and entire" image can be effectively restored from the degraded HSI, while the noise can be greatly reduced by learning its statistical characteristics. The symbol ∘ represents element-wise multiplication, I_P (I_K) is the P × P (K × K) identity matrix, and K is the number of dictionary atoms. In this model, D = [d_1, ..., d_k, ..., d_K] ∈ R^{P×K} represents the dictionary learned from the test data, with the dictionary atoms drawn from a Gaussian distribution. The vector α_i is the sparse coefficient vector and A = [α_1, ..., α_i, ..., α_M] is the sparse coefficient matrix for {x_i}_{i=1,...,M}. The binary vector z_i = [z_i1, z_i2, ..., z_iK]^T, drawn from a Bernoulli process coupled with the π_k drawn from the beta process, indicates which columns of D are exploited to represent α_i, each with probability π_k. The vector w_i = [w_i1, w_i2, ..., w_iK]^T is the weight of the coefficient α_i, which is learned by a Gaussian process. When K → ∞, the number of nonzero elements of z_i is drawn from Poisson(a_p/b_p). Therefore, explicit sparseness can be enforced on the coefficients {α_i}_{i=1,...,M} by adjusting the noninformative hyperparameters a_p and b_p of the beta process. When z_ik = 0, the coefficient α_ik is exactly zero rather than merely near zero as in many sparse approaches, which means that the k-th atom of D is not used for the coefficient α_i. By counting the number of unused atoms, the size of the dictionary can be inferred adaptively. N(·,·) and G(·,·) represent the normal distribution and gamma distribution, respectively, and these two distributions give much more flexibility for solving the model with the posterior PDF. γ_w, γ_n, and γ_v are the precisions of the parameters or of the noise, each with a noninformative gamma prior. In the sparse noise term, a beta-Bernoulli process and a Gaussian distribution are used to depict the arbitrariness of the position and amplitude of the sparse noise, respectively. The intensity of the sparse noise is defined by the vector v_i = [v_i1, ..., v_ip, ..., v_iP]^T, with each element constrained by noninformative hyperparameter priors. The vector s_i = [s_i1, ..., s_ip, ..., s_iP]^T represents the location information of the sparse noise in x_i, which is indicated by the nonzero elements of s_i.
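As an illustration of how the generative model in Eq. (15.10) produces sparse coefficients and sparse noise, the following sketch draws samples from a finite beta-Bernoulli approximation. All sizes, hyperparameter values, and the sparse-noise rate are illustrative assumptions, not the settings used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: P-dimensional blocks, K dictionary atoms, M blocks.
P, K, M = 48, 64, 200
a_p = b_p = 1.0                       # beta-process hyperparameters (illustrative)
gamma_w = gamma_n = gamma_v = 100.0   # precisions (illustrative)

# Dictionary atoms d_k ~ N(0, P^{-1} I_P).
D = rng.normal(scale=1.0 / np.sqrt(P), size=(P, K))
# pi_k ~ Beta(a_p/K, b_p(K-1)/K): most pi_k are close to zero, enforcing sparsity.
pi = rng.beta(a_p / K, b_p * (K - 1) / K, size=K)
# z_ik ~ Bernoulli(pi_k), w_i ~ N(0, gamma_w^{-1} I_K), alpha_i = z_i o w_i.
Z = rng.random((M, K)) < pi
W = rng.normal(scale=1.0 / np.sqrt(gamma_w), size=(M, K))
A = Z * W                                       # sparse coefficients, one row per block
# Gaussian noise n_i and sparse noise s_i o v_i.
N = rng.normal(scale=1.0 / np.sqrt(gamma_n), size=(M, P))
S = rng.random((M, P)) < 0.01                   # sparse-noise support (illustrative rate)
V = rng.normal(scale=1.0 / np.sqrt(gamma_v), size=(M, P))
X = A @ D.T + N + S * V                         # observed vectorized blocks

print("average number of active atoms per block:", Z.sum(1).mean())
```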

The negative logarithm of the posterior density function of the proposed method is given in Eq. (15.11). According to Eq. (15.11), all observed and unknown variables can be considered as stochastic variables with the joint probability distribution specified. Therefore, the proposed method has greater robustness and accuracy. In the proposed method, the distributions of all random variables are in the conjugate exponential family, and a Gibbs sampler can be utilized to infer each variable by repeatedly sampling the conditional distributions. The detailed inferences of the Gibbs sampler are displayed in Algorithm 1.

−log p({D, W, Z, Q, S, {π_k}, {θ_ip}, γ_w, γ_n, γ_v} | X)
  = 0.5 γ_n Σ_i ||x_i − D(w_i ∘ z_i) − q_i ∘ s_i||_2^2 + 0.5 P Σ_k d_k^T d_k + 0.5 γ_w Σ_i ||w_i||_2^2 + 0.5 γ_v Σ_i ||q_i||_2^2
    − Σ_k log Beta(π_k | a_p/K, b_p(K−1)/K) − Σ_{ik} log Bernoulli(z_ik | π_k)
    − Σ_{ip} log Beta(θ_ip | a_q, b_q) − Σ_{ip} log Bernoulli(s_ip | θ_ip)
    − log G(γ_w | c, d) − log G(γ_n | e, f) − log G(γ_v | g, h) + const    (15.11)

Moreover, we observe y_i = S_i ∘ x_i instead of x_i to make a prediction for the missing data using the remaining data, where S_i ∈ {0, 1}^P is the sampling vector with S_i S_i^T = ||S_i||_0. For f = 1, ..., P, S_i^f = 0 indicates that the f-th pixel of the vector x_i is missing. Additionally, there are several solutions for the same spectral pixel due to the use of overlapping 3D blocks, so the restored HSI is constructed by averaging all overlapping 3D blocks. Performing this operation for X_restore, we obtain the final HSI after restoration.
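The overlapping-block extraction and the averaging reconstruction described above can be sketched as follows; the block sizes and the cube layout (l_x, l_y, l) are assumptions made for illustration, and the spectral dimension is kept whole as in the chapter.

```python
import numpy as np

def extract_blocks(Y, nx, ny):
    """Vectorize all overlapping nx x ny x l blocks of an HSI cube Y of shape (lx, ly, l)."""
    lx, ly, l = Y.shape
    blocks = [Y[i:i + nx, j:j + ny, :].reshape(-1)
              for i in range(lx - nx + 1) for j in range(ly - ny + 1)]
    return np.stack(blocks)                      # shape (M, nx*ny*l)

def average_blocks(blocks, lx, ly, l, nx, ny):
    """Rebuild the cube by averaging the (restored) overlapping blocks."""
    acc = np.zeros((lx, ly, l))
    cnt = np.zeros((lx, ly, 1))
    idx = 0
    for i in range(lx - nx + 1):
        for j in range(ly - ny + 1):
            acc[i:i + nx, j:j + ny, :] += blocks[idx].reshape(nx, ny, l)
            cnt[i:i + nx, j:j + ny, :] += 1
            idx += 1
    return acc / cnt
```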

15.2.2 Experimental results

The performance of the proposed restoration model is demonstrated on two hyperspectral data sets, visually and quantitatively. One is the Indian Pines data, with a spatial size of 145 × 145, which was acquired by the Airborne Visible/Infrared Imaging Spectrometer in June 1992. It has 200 spectral bands of 10 nm width from 0.4 μm to 2.45 μm and a spatial resolution of 20 m. The second data set, Botswana, consists of 145 spectral wavelengths with 1476 × 256 pixels. It was acquired by the NASA EO-1 satellite with the Hyperion sensor on May 31, 2001, in 10 nm windows with 30 m spatial resolution over a 7.7 km strip; a subset of size 150 × 200 × 145 is used here.

Algorithm 1
Input: Noisy data X, hyperparameters
Output: Restored data X_restore
Initialization: Num = 100, K = 256
for iter = 1 : Num
    for k = 1 : K
        Sampling d_k: p(d_k | −) ~ N(μ_dk, U_dk)
            U_dk = (P I_P + γ_n Σ_i w_ik^2 z_ik^2)^(−1),  μ_dk = γ_n U_dk Σ_i w_ik z_ik x_(i,k)
            x_(i,k) = x_i − D(w_i ∘ z_i) − q_i ∘ s_i + (w_ik z_ik) d_k
        Sampling z_ik: p(z_ik | −) ~ Bernoulli(p1 / (p1 + p0))
            p1 = π_k exp(−0.5 γ_n (w_ik^2 d_k^T d_k − 2 w_ik d_k^T x_(i,k))),  p0 = 1 − π_k
        Sampling w_ik: p(w_ik | −) ~ N(μ_wik, U_wik)
            U_wik = (γ_w + γ_n z_ik d_k^T d_k)^(−1),  μ_wik = γ_n U_wik d_k^T x_(i,k)
        Sampling π_k: p(π_k | −) ~ Beta(a_p/K + Σ_i z_ik, b_p(K−1)/K + M − Σ_i z_ik)
    end
    Sampling γ_w: p(γ_w | −) ~ G(c + 0.5MK, d + 0.5 Σ_i w_i^T w_i)
    Sampling γ_n: p(γ_n | −) ~ G(e + 0.5PM, f + 0.5 Σ_i ||x_i − D(w_i ∘ z_i) − q_i ∘ s_i||_2^2)
    Sampling s_ip: p(s_ip | −) ~ Bernoulli(v1 / (v1 + v0))
        v1 = θ_ip exp(−0.5 γ_v (q_ip^2 − 2 q_ip x_(i,s)^p)),  v0 = 1 − θ_ip,  x_(i,s) = x_i − D(w_i ∘ z_i)
    Sampling q_ip: p(q_ip | −) ~ N(μ_qip, U_qip)
        U_qip = (γ_v + γ_n s_ip^2)^(−1),  μ_qip = γ_n U_qip s_ip x_(i,s)^p
    Sampling θ_p: p(θ_p | −) ~ Beta(a_q + Σ_i s_ip, b_q + M − Σ_i s_ip)
    Sampling γ_v: p(γ_v | −) ~ G(g + 0.5PM, h + 0.5 Σ_i q_i^T q_i)
end
Calculating X_restore = DA

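A minimal sketch of two of the conjugate updates in Algorithm 1 (the beta update for π_k and the gamma update for γ_n), assuming the variables are stored as NumPy arrays with one block per row; this is not the full sampler and the array layout is an assumption.

```python
import numpy as np

def sample_pi_k(z_k, K, M, a_p, b_p, rng):
    """Beta posterior update for pi_k given the k-th column z_k of Z (see Algorithm 1)."""
    n_used = z_k.sum()
    return rng.beta(a_p / K + n_used, b_p * (K - 1) / K + M - n_used)

def sample_gamma_n(X, D, W, Z, Q, S, e, f, rng):
    """Gamma posterior update for the Gaussian-noise precision gamma_n."""
    residual = X - (W * Z) @ D.T - Q * S        # x_i - D(w_i o z_i) - q_i o s_i, per row
    M, P = X.shape
    shape = e + 0.5 * P * M
    rate = f + 0.5 * np.sum(residual ** 2)
    return rng.gamma(shape, 1.0 / rate)         # NumPy parameterizes gamma by scale = 1/rate
```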

Note that noisy and water absorption bands were removed from both data sets in the experiments. To investigate the performance of the proposed method, we choose four different methods for comparison: K-SVD [28], BM3D [29], ANLM3D [16], and BM4D [17]. The necessary parameters of the four compared methods are finely tuned or automatically selected to generate the best simulated results. For the proposed method, the spatial size of the blocks is 4 × 4, so that spatial information can be exploited to a certain degree. The size of the dictionary and the number of Gibbs sampling iterations are set to K = 128 and 100, respectively. The hyperparameters of the Gaussian noise are set as e = f = 10^−6. For the remaining hyperparameters, two cases are considered. (1) The HSI is contaminated by Gaussian noise, dead pixel lines, or a mixture of both; the remaining hyperparameters are set as a_p = b_p = a_q = b_q = 10^−6 and c = d = g = h = 10^−6. (2) The HSI is contaminated by a mixture of Gaussian and impulse noise, or a mixture of Gaussian noise, impulse noise, and dead pixel lines; the remaining hyperparameters are set as a_p = b_p = a_q = b_q = 10^−4 and c = d = g = h = 10^−5. Once the types of noise are identified, the hyperparameters are determined and do not need to be tuned.

The experimental results are evaluated in two ways. First, visual comparisons are shown for the restored images and spectral signatures; because of the huge number of pixels and spectral bands, only a few of them are presented in this chapter. Second, the peak signal-to-noise ratio (PSNR) is used to quantitatively measure the similarity between the restored and reference images based on the mean square error, while the structural similarity (SSIM) and feature similarity (FSIM) are utilized to measure the structural and perceptual consistency between each initial band and restored band, respectively [4,21,30]. Normally, the higher the measure value is, the better the quality of the image. The mean spectral angle (MSA) between spectral pixels is employed to numerically evaluate the spectral fidelity of the restored results [31,32]. The MSA is calculated by Eq. (15.12):

MSA = (1/(uv)) Σ_{i=1}^{u} Σ_{j=1}^{v} cos^{−1}( (x_ij^(c)T · x_ij^(o)) / (||x_ij^(c)|| · ||x_ij^(o)||) )    (15.12)

where x_ij^(c) and x_ij^(o) represent the restored and original spectral pixels located at (i, j), respectively, u and v are the numbers of pixels in the two spatial dimensions, and T denotes the transpose. Generally, the smaller the MSA values are, the better the spectral fidelity is.
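A small sketch of Eq. (15.12), assuming the restored and original cubes are stored as (rows, columns, bands) arrays; the function name and layout are illustrative.

```python
import numpy as np

def mean_spectral_angle(restored, original, eps=1e-12):
    """Mean spectral angle (Eq. 15.12) between two HSI cubes of shape (u, v, bands)."""
    u, v, _ = restored.shape
    x_c = restored.reshape(u * v, -1)
    x_o = original.reshape(u * v, -1)
    dots = np.sum(x_c * x_o, axis=1)
    norms = np.linalg.norm(x_c, axis=1) * np.linalg.norm(x_o, axis=1) + eps
    angles = np.arccos(np.clip(dots / norms, -1.0, 1.0))
    return angles.mean()
```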


The following experiments consist of three subsections. Section 15.2.2.1 presents the restored results for the HSI polluted by various noises; Section 15.2.2.2 reports the inferred results for the HSI with some data missing, together with results for the HSI degraded by noise contamination and missing data simultaneously; Section 15.2.2.3 explains the necessity of the sparse noise term.

15.2.2.1 Denoising

In the first simulated experiment, four kinds of noise are added to the data sets.

1) Zero-mean Gaussian noise is added to all bands, with the noise variance σ fixed or varying randomly across bands. Tables 15.3 and 15.4 show the PSNR values of the restored results for five different approaches, and Tables 15.5 and 15.6 display the MSA values before and after denoising, in which the two simulated data sets are corrupted by Gaussian noise with noise standard deviation σ = [5, 15, 25, 35, 50]. The best measure values are in bold to facilitate the comparisons. Clearly, the proposed method has higher PSNR values and lower MSA values than the other four compared methods. This means that our method can better improve the image quality and preserve the spectral information of each pixel while removing Gaussian noise. According to Tables 15.3–15.6, KSVD and BM3D generate worse values than the other methods, which is attributed to the fact that KSVD and BM3D denoise the HSI band by band and destroy the spectral correlations. This inferiority is very obvious in the spectral fidelity, as shown in Tables 15.5 and 15.6. ANLM3D has higher PSNR values than KSVD and BM3D, but it does worse than BM3D in some MSA values. BM4D has better values than KSVD, BM3D, and ANLM3D, but it is obviously inferior to the proposed method.

Table 15.3: PSNR comparison with different methods for Indian Pines data.

                σ = 5     σ = 15    σ = 25    σ = 35    σ = 50
Corrupted HSI   34.1463   24.6095   20.174    17.7596   14.1553
KSVD            35.8408   31.24     27.8295   25.7287   22.3953
BM3D            38.6757   33.1185   30.8745   29.6837   28.498
ANLM3D          38.8343   33.9009   31.5687   29.962    28.7341
BM4D            41.18     35.5003   33.8228   32.1876   30.5854
Ours            42.8863   39.3473   37.3712   34.5684   32.4548

Table 15.4: PSNR comparison with different methods for Botswana data.

                σ = 5     σ = 15    σ = 25    σ = 35    σ = 50
Corrupted HSI   34.1473   24.6107   20.1724   17.252    14.1511
KSVD            35.9896   30.8125   27.6338   25.1636   25.6029
BM3D            37.6772   32.3446   30.1449   29.0141   27.0309
ANLM3D          39.6791   33.9057   31.6826   30.0964   28.902
BM4D            42.1063   36.2804   33.6897   32.0322   30.7741
Ours            43.3222   39.2766   37.0629   35.5206   33.6706

Table 15.5: MSA comparison with different methods for Indian Pines data.

                σ = 5     σ = 15    σ = 25    σ = 35    σ = 50
Corrupted HSI   0.0277    0.0829    0.1378    0.1917    0.2704
KSVD            0.0279    0.0341    0.0434    0.0541    0.0718
BM3D            0.0243    0.0303    0.0362    0.0428    0.0579
ANLM3D          0.0193    0.0297    0.038     0.0454    0.0558
BM4D            0.0138    0.0259    0.0299    0.0352    0.0416
Ours            0.0140    0.0211    0.0249    0.0293    0.0357

Table 15.6: MSA comparison with different methods for Botswana data.

                σ = 5     σ = 15    σ = 25    σ = 35    σ = 50
Corrupted HSI   0.0755    0.2208    0.3526    0.46907   0.6161
KSVD            0.0445    0.0914    0.1435    0.1957    0.1336
BM3D            0.0447    0.0813    0.0886    0.1073    0.147
ANLM3D          0.0365    0.0605    0.0797    0.0981    0.1267
BM4D            0.0299    0.0549    0.0626    0.0705    0.0836
Ours            0.0227    0.0368    0.0441    0.0504    0.0599

Figs. 15.22 and 15.23 display the curves of PSNR, SSIM, and FSIM values for the Indian Pines data and Botswana data, respectively, in which the noise standard deviation changes across bands within the interval [15, 30]. In Figs. 15.22 and 15.23, the curves show obvious fluctuations due to the varying σ across bands. It is easily found that the performances of the different algorithms are almost the same at some bands, which arises from the fact that much smaller Gaussian noise has been added to these bands than to their adjacent bands. Obviously, the PSNR, SSIM, and FSIM values of the proposed approach are higher than those of the competitors at most bands, and they show a more stable trend at the same time. This is because the proposed method can learn the noise characteristics well and adaptively infer the noise standard deviation. KSVD and BM3D show lower values in both Figs. 15.22 and 15.23; they denoise all bands with fixed noise levels, which fails to account for the structural and feature information of the degraded images. Exploring both spatial and spectral information, ANLM3D and BM4D yield higher measures than KSVD and BM3D, but both present unstable performance, as shown in Figs. 15.22 and 15.23. By sampling the infinite parameter space, the proposed method can obtain the optimal solution to the HSI recovery whether the noise standard deviation within each band is equal or not.


Figure 15.22 PSNR, SSIM, and FSIM values of each band with different methods for Indian Pines data.


Figure 15.23 PSNR, SSIM, and FSIM values of each band with different methods for Botswana data.

For one Gibbs sampling iteration, the computational complexity of the proposed method is approximately O(K(P + M) + PM). It should be pointed out that the suggested method consumes more time than the four compared ones.
2) Impulse noise with a deviation from 0.01 to 0.02 is added to 10 randomly selected bands.
3) Dead pixel lines are simulated in randomly selected bands, with widths from one to three lines. In the following experiment, we add dead pixel lines at the same positions to eight bands, from band 43 to band 46 and from band 129 to band 132.


Figure 15.24 Restoration of band 45 with a mixture of impulse noise and dead pixel lines: (A) clean image; (B) corrupted image; (C) KSVD; (D) BM3D; (E) ANLM3D; (F) BM4D; (G) ours.

4) Stripes are added to randomly selected bands. The width of the stripes is from one to three lines. Due to the similarity between dead pixel lines and stripes, we omit the results for stripes in this work.

Fig. 15.24 presents the images of band 45 of the Indian Pines data after removal of the mixed impulse noise and dead pixel lines. We also consider hyperspectral images polluted by mixed Gaussian noise, impulse noise, and dead pixel lines; the restoration results of band 130 of the Indian Pines data are shown in Fig. 15.25. It can be easily observed that the proposed method achieves outstanding performance in the visual results. KSVD employs an iterative method to learn the dictionary adaptively and improves the image quality greatly compared with the corrupted images. However, KSVD learns the dictionary atoms one by one and destroys the structure of the sparse coefficients, which results in the loss of edges and other structural details, as shown in Figs. 15.24C and 15.25C; meanwhile, it needs careful parameter tuning, e.g., of the noise level and the sizes of the dictionary and blocks. Due to the high spatial consistency (see Figs. 15.24A and 15.25A), BM3D has a large number of similar blocks with which to achieve the restoration. By computing the average of the similar noisy blocks, it can effectively smooth the noise and makes a much better visual impression than KSVD. However, BM3D smooths out some image structures and details while performing the recovery, and it is highly sensitive to the noise level.


Figure 15.25 Restoration of band 130 with the mixed Gaussian noise, impulse noise, and dead pixel lines: (A) clean image; (B) corrupted image; (C) KSVD; (D) BM3D; (E) ANLM3D; (F) BM4D; (G) ours.

Both K-SVD and BM3D are bandwise approaches, which neglect the spectral continuity and correlations in the HSI. ANLM3D can effectively utilize the strong nonlocal self-similarity to better balance smoothing and detail preservation, but it fails to preserve the edges and some of the fine details when recovering seriously degraded images in which the local image structure is heavily corrupted. BM4D achieves visual improvements by adopting a three-dimensional nonlocal self-similarity data cube. However, with BM4D the high spectral correlations between continuous bands are not fully exploited; only local correlations between some neighboring bands are explored, and its results smooth out some fine details. There are obvious dead pixel lines in Figs. 15.24C–F and 15.25C–F, which means the four compared approaches fail in restoring the seriously degraded HSI. According to Figs. 15.24G and 15.25G, the obvious superiority of the proposed method can be easily seen in detail preservation and mixed noise reduction, which implies that the proposed approach may also be able to predict missing pixels. Generally, the proposed approach achieves more promising denoising performance, which is in line with the quantitative results in Tables 15.3–15.6 and Figs. 15.22 and 15.23.

15.2.2.2 Predicting the missing data

In this subsection, 2% of the test data are randomly observed to estimate the performance of the proposed method; in other words, 98% of the test data are missing. The full HSI is then recovered by employing the 2% of observed data. Fig. 15.26 shows the true spectral signatures, the corrupted spectral signatures, and the inferred spectral signatures at different pixels, based on randomly observing 2% of the Indian Pines data. In Fig. 15.27, the same is done for the Botswana data. Note that, in the curves of the corrupted values, the values of the missing data are equal to zero, while the values of the observed data are larger than zero. It is visually clear that the inferred spectra are very close to the true values. This means that the proposed method can efficiently restore the spectrum of the HSI with very little observed data, and it supports identification and terrain classification in HSI analysis.


Figure 15.26 Spectrum of different pixels for Indian Pines data: (A) pixel (77,36); (B) pixel (30,116).


Figure 15.27 Spectrum of different pixels for Botswana data: (A) pixel (16,77); (B) pixel (60,155).


Figure 15.28 Recovery image of Botswana data with Gaussian noise standard deviation of 25 and 98% data missing: (A) initial HSI; (B) corrupted HSI; (C) restored HSI.

Furthermore, we consider the more realistic case in which the HSI is degraded by noise contamination and missing data simultaneously. The Botswana data are used for the simulation in this case, degraded by a mixture of Gaussian noise with a standard deviation of 25 and 98% missing data. Fig. 15.28 displays the restored results, where the false-color images are composed of bands 85 (red), 36 (green), and 70 (blue). According to Fig. 15.28, the suggested method shows convincing results while greatly preserving the structure and detail information. Above all, the proposed method has great superiority in both spectral signature preservation and visual appearance when predicting missing pixels.

15.2.2.3 Discussion

To explain the necessity of the sparse noise term, we consider the Indian Pines data in this subsection, polluted by impulse noise with a deviation of 0.02 and dead pixel lines. The variant obtained by disabling the sparse noise term of the proposed method is named Algorithm-dis. As presented below, Fig. 15.29 shows the visual impression of band 45 obtained by the proposed method and Algorithm-dis, Fig. 15.30 shows the horizontal profiles of band 45 at location (115, 30), and Fig. 15.31 displays the spectral signatures at location (115, 30). According to Fig. 15.29, both algorithms improve the image quality greatly compared with the corrupted image in Fig. 15.29B. Algorithm-dis can only reduce part of the impulse noise, as presented in Fig. 15.29C, and it fails to preserve fine objects. As presented in Fig. 15.29D, the proposed method can effectively remove the impulse noise and dead pixel lines, while preserving local details such as edges and textures. Obviously, the proposed method achieves better performance than Algorithm-dis.



Figure 15.29 Restoration of band 45 with a mixture of impulse noise and dead pixel lines: (A) initial HSI; (B) corrupted HSI; (C) Algorithm-dis; (D) ours.


Figure 15.30 The horizontal profiles of band 30 at location (115, 30): (A) initial HSI; (B) corrupted HSI; (C) Algorithm-dis; (D) ours.


Figure 15.31 The spectral signatures at location (115, 30): (A) initial HSI; (B) corrupted HSI; (C) Algorithm-dis; (D) ours.

After the denoising processing, the fluctuations are reduced to some degree. As shown in Figs. 15.30 and 15.31, the results obtained by the proposed method are closest to the curves of true value, which means the proposed method can greatly remove the mixed noise while well preserving the edge and texture information. From Figs. 15.30B and 15.31B, it can be observed that the curves have some obvious fluctuations, due to the


existence of impulse noise and dead pixel lines. The results calculated by Algorithm-dis fail to effectively restore the shape and amplitude of the clean HSI and lose some details, which can be seen from Figs. 15.30C and 15.31C.

15.3 Hyperspectral image dimensionality reduction using a sparse graph

Classification is one of the basic research topics for hyperspectral images and has attracted much attention in recent years. Compared with multispectral imagery, hyperspectral imagery provides much more abundant information about land covers through an increasing number of contiguous, narrow bands. However, the enlargement of the spectral range and spectral resolution also brings redundant and noisy information, which increases the computational cost of processing. Meanwhile, an effective classifier requires more labeled training samples as the dimension of the spectral features grows. Therefore, dimensionality reduction (DR) becomes an indispensable part of hyperspectral data processing.

15.3.1 Sparse representation

Here, some notation used in this chapter is introduced first. Let X denote the data set or a subset of the data set (X ∈ R^{m×n}), where m is the number of features (or bands) and n is the number of pixels. x_i ∈ R^{m×1} is the i-th column of X, which represents one pixel, and index(x_i) denotes the spatial coordinate of pixel x_i. The sparse graph is denoted as G(V, E, Q), where V is the set of vertexes in the graph, E is the set of connected edges between vertexes, and Q ∈ R^{n×n} denotes the weight matrix of the graph. For simplicity, vertex is replaced by pixel in the following description. The element of Q is represented as Q_ij; Q_ij = 0 means pixel x_i is not connected with pixel x_j in the sparse graph. In the sparse representation, D is defined as the dictionary and Q_i is the sparse representation coefficient of x_i. In the projection learning, the projection matrix is denoted as P ∈ R^{m×p}.

Sparse representation is commonly used as an encoder that transforms data points into a new space spanned by the atoms of an overcomplete dictionary. Suppose there exists an overcomplete dictionary D ∈ R^{m×n} with n >> m, x ∈ R^{m×1} is a data point, and q ∈ R^{n×1} is the unknown representation coefficient of x, satisfying x = Dq. Because this equation is underdetermined, we cannot obtain a unique solution for q. By adding a sparsity constraint on q, a sparse solution can be obtained, which contains a small number of significant elements while the rest are close or equal to zero. The optimization problem can be written as follows:

min_q ||q||_0,   s.t.  x = Dq    (15.13)

where ||·||_0 is the L0-norm, which counts the number of nonzero entries in a vector. Because of the nonconvexity of the L0-norm, the optimization problem in Eq. (15.13) is NP-hard, and it can be solved by greedy algorithms such as matching pursuit and orthogonal matching pursuit. In most cases, the L0-norm can be replaced with the L1-norm:

min_q ||q||_1,   s.t.  x = Dq    (15.14)

This is a convex optimization problem, which can be solved by convex optimization methods such as basis pursuit. The formulation can be translated into an unconstrained form:

min_q ||x − Dq||_2^2 + λ ||q||_1    (15.15)

where λ is a regularization parameter that adjusts the sparsity of q. This problem can be solved by the least absolute shrinkage and selection operator (LASSO).
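As an illustration of Eq. (15.15), the following sketch solves the L1-regularized problem with scikit-learn's Lasso on a random overcomplete dictionary. This is only one possible solver (the chapter later uses the SPAMS toolbox), and scikit-learn internally scales the data term by 1/(2m), so its alpha is not identical to λ above; dictionary, signal, and parameter values are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n_atoms = 50, 200                  # overcomplete dictionary: n_atoms >> m
D = rng.normal(size=(m, n_atoms))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms
q_true = np.zeros(n_atoms)
q_true[rng.choice(n_atoms, size=5, replace=False)] = rng.normal(size=5)
x = D @ q_true                        # synthetic signal with a 5-sparse code

# min_q ||x - D q||_2^2 + lambda * ||q||_1  (Eq. 15.15, up to sklearn's 1/(2m) scaling)
lasso = Lasso(alpha=1e-3, max_iter=10000)
lasso.fit(D, x)
q_hat = lasso.coef_
print("nonzero coefficients recovered:", np.count_nonzero(np.abs(q_hat) > 1e-6))
```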

15.3.2 Sparse graph-based dimensionality reduction

The sparse graph consists of the sparse representation coefficients of all data points. The weight matrix of the sparse graph is denoted as Q ∈ R^{n×n}. The whole data set is denoted by the matrix X = [x_1, x_2, ..., x_n] ∈ R^{m×n}. For one data point x_i ∈ R^{m×1}, the other data points constitute its dictionary, denoted as D_i = [x_1, ..., x_{i−1}, x_{i+1}, ..., x_n] ∈ R^{m×(n−1)}. The sparse representation coefficient of x_i can then be computed and is denoted as q_i. After all data points obtain their coefficients, the weight matrix Q is achieved by integrating those coefficients together. Similar to the KNN graph, the weights of the sparse graph can be viewed as a similarity metric between data points. The graph similarity matrix is usually symmetric, so the weight matrix of the sparse graph is symmetrized by Q = (Q + Q^T)/2. The aim of sparse graph-based dimensionality reduction is to find a low-dimensional space in which the similarity information contained in the sparse graph is preserved. Suppose there is a linear projection matrix mapping from the high-dimensional space to the low-dimensional space, denoted as P ∈ R^{m×p} (p << m), where p is the dimension of the low-dimensional space. The low-dimensional representation of X is denoted as Y ∈ R^{p×n}, satisfying Y = P^T X. The objective function can be formulated as follows:

min_P Σ_{i=1}^{n} Σ_{j=1}^{n} ||P^T x_i − P^T x_j||_2^2 Q_ij    (15.16)


If x_i and x_j are close to each other in the original space, i.e., with large Q_ij, then their projected vectors P^T x_i and P^T x_j should also be close in the low-dimensional space. The objective function can be rewritten as

Σ_{i=1}^{n} Σ_{j=1}^{n} ||P^T x_i − P^T x_j||_2^2 Q_ij = 2 Tr(P^T X (D − Q) X^T P) = 2 Tr(P^T X L X^T P)    (15.17)

where D is a diagonal matrix whose entries are the row (or column) sums of Q, i.e., d_ii = Σ_j Q_ij, and L = D − Q is the Laplacian matrix. By adding the constraint P^T X D X^T P = I, the final objective function is formulated as Eq. (15.18):

min_P Tr[P^T X L X^T P] / Tr[P^T X D X^T P]    (15.18)

The above problem can be solved as a generalized eigenvalue problem. After the generalized eigenvalue decomposition of (X L X^T, X D X^T), the p eigenvectors corresponding to the p smallest eigenvalues are selected to construct the projection matrix P = [v_1, v_2, ..., v_p], where v_i is the eigenvector with eigenvalue s_i and s_1 < s_2 < ... < s_p.
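A minimal sketch of the projection step in Eq. (15.18), assuming the weight matrix Q has already been computed and symmetrized; the small ridge added to X D X^T is an assumption for numerical stability, not part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

def sparse_graph_projection(X, Q, p, reg=1e-6):
    """Projection matrix P from Eq. (15.18); X is m x n, Q is a symmetric n x n weight matrix."""
    d = Q.sum(axis=1)
    D = np.diag(d)
    L = D - Q                                    # graph Laplacian
    A = X @ L @ X.T                              # m x m
    B = X @ D @ X.T + reg * np.eye(X.shape[0])   # regularized to keep B positive definite
    vals, vecs = eigh(A, B)                      # generalized eigenvalues in ascending order
    return vecs[:, :p]                           # eigenvectors of the p smallest eigenvalues
```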

15.3.3 Sparse graph learning

The proposed method mainly includes two parts. In the first part, spatial-spectral clustering is employed to divide all pixels into K clusters. In the second part, sparse graph learning is performed, in which the sparse graph and the projection matrix are learned alternately. Because of the clustering, sparse graph construction is divided into the construction of K subgraphs, by which the computational cost is kept acceptable without degrading performance. A simple schematic diagram of the proposed method is given in Fig. 15.32.

1) The basic idea and formulation: In sparse graph construction, one pixel is represented by just a few pixels by minimizing the sum of the construction error and the sparsity regularization. Although the sparse graph has some advantages brought by sparse representation, it might be degraded by the spectral similarity between different land covers. In our work, some additional and useful information is inserted into the sparse representation process to obtain an effective sparse graph. Hence, a new framework of sparse graph learning (SGL) is proposed, whose objective function is given as follows.


Figure 15.32 A schematic diagram of the sparse graph learning-based dimensionality reduction method.

min_{P,Q} Σ_{i=1}^{n} Σ_{j=1}^{n} [(1 − β) w_ij^{spa+spe} + β w_ij^{pro}] Q_ij
s.t.  ||X − XQ||_F^2 ≤ ε
      w_ij^{pro} = ||P^T x_i − P^T x_j||_2
      w_ij^{spa+spe} = α ||index(x_i) − index(x_j)||_2 + (1 − α) ||x_i − x_j||_2
      Q_ii = 0 and Q_ij ≥ 0
      0 ≤ α ≤ 1 and 0 ≤ β ≤ 1    (15.19)

where w_ij^{pro} is the projection weight, which represents the distance between x_i and x_j in the projection space, and w_ij^{spa+spe} is the spatial-spectral weight, defined by combining the distances of the pixel pair in the spectral space and the spatial space. α and β are two regularization parameters that control the influence of these three kinds of information on sparse graph learning. In common sparse graph-based dimensionality reduction methods, the graph is constructed by exploring the local structure of the data in the original space without considering the information in the projection space, and the projection matrix is computed from the obtained sparse graph directly. In addition, the spatial information, which is very important for hyperspectral images, is not considered during construction of the sparse graph. However, in SGL, all of the information existing in the spectral space, the spatial space,


and the projection space is considered in order to explore the imprecise discriminant information of the data. The imprecise discriminant information is computed by using the Euclidean distance in the different spaces. The weight w_ij = (1 − β) w_ij^{spa+spe} + β w_ij^{pro} is defined according to the imprecise discriminant information, by which the learned sparse graph can have high discriminant power. This weight is inversely proportional to the possibility of the pixels belonging to the same class: a small weight w_ij means pixels x_i and x_j are probably in the same class, and the corresponding coefficient Q_ij should have a high possibility of being nonzero. Besides the connections in the graph, the weight w_ij also affects the weights of the connected edges, which are induced to be proportional to the similarity of the pixels. Because of the projection weight involved in SGL, the projection information of the data learned in the last iteration can be utilized in sparse graph learning in the next iteration. This feedback process makes the sparse graph much more precise than the common sparse graph. Although the extracted discriminant information is not very precise, the sparse representation process can reduce the influence of this imprecision on the final results due to its robustness to noisy data. As a result, the sparse graph obtained by SGL contains more discriminant information than the common sparse graph, and the final projection matrix is more beneficial for classification. SGL can also be transformed into other sparse graph construction methods by using different definitions of the weight w_ij. In the unsupervised case, if the weight w_ij equals 1 for each pixel pair, SGL is equivalent to common sparse graph construction, in which each pixel is represented on the whole data set without any guidance information. In the supervised case, if the weight w_ij equals 1 only when the labels of x_i and x_j are the same and equals 0 otherwise, SGL-DR can be transformed to BSGDA, which uses only pixels in the same class to represent the center pixel.

2) Optimization processing: The optimization can be done by an alternating iteration method. The two variables Q and P can be optimized alternately when the other variable is given. In the following, the details of the optimization processing are given. In the first step, Q can be obtained by solving the following problem with P fixed:

min_Q Σ_{i=1}^{n} Σ_{j=1}^{n} [(1 − β) w_ij^{spa+spe} + β w_ij^{pro}] Q_ij
s.t.  ||X − XQ||_F^2 ≤ ε,  Q_ii = 0 and Q_ij ≥ 0    (15.20)

where w_ij^{pro} = ||P^T x_i − P^T x_j||_2 and w_ij^{spa+spe} = α ||index(x_i) − index(x_j)||_2 + (1 − α) ||x_i − x_j||_2, as in Eq. (15.19). For convenience, the distances in the different spaces are all normalized into the range [0, 1]. Let w_ij = (1 − β) w_ij^{spa+spe} + β w_ij^{pro}. The problem is transformed into another formulation, which is similar to the LASSO problem:

min_Q ||X − XQ||_F^2 + λ Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij Q_ij
s.t.  Q_ii = 0 and Q_ij ≥ 0    (15.21)

where λ is a sparsity regularization parameter. In the second step, with Q fixed, P can be obtained by solving

min_P Σ_{i=1}^{n} Σ_{j=1}^{n} ||P^T x_i − P^T x_j||_2^2 Q_ij    (15.22)

It can be transformed into a simpler formulation by the following transformations:

Σ_{i=1}^{n} Σ_{j=1}^{n} ||P^T x_i − P^T x_j||_2^2 Q_ij
  = Tr[ P^T ( Σ_{i=1}^{n} Σ_{j=1}^{n} (x_i − x_j)(x_i − x_j)^T Q_ij ) P ]
  = Tr[ P^T X ( Σ_{i=1}^{n} Σ_{j=1}^{n} (e_i − e_j)(e_i − e_j)^T Q_ij ) X^T P ]    (15.23)

where e_i is a column vector with the i-th element equal to 1 and 0 otherwise.

Σ_{i=1}^{n} Σ_{j=1}^{n} (e_i − e_j)(e_i − e_j)^T Q_ij
  = Σ_{i=1}^{n} Σ_{j=1}^{n} ( e_i e_i^T Q_ij + e_j e_j^T Q_ij − e_i e_j^T Q_ij − e_j e_i^T Q_ij )
  = D_row + D_column − Q − Q^T    (15.24)

where D_row = diag(Σ_{j=1}^{n} Q_ij) and D_column = diag(Σ_{i=1}^{n} Q_ij). This definition is also mentioned in Ref. [33]. Let Lap = D_row + D_column − Q − Q^T and D = D_row + D_column; then Eq. (15.22) can be written as Eq. (15.25) by adding the constraint P^T X D X^T P = I:

min_P Tr[P^T X Lap X^T P] / Tr[P^T X D X^T P]    (15.25)


By using the generalized eigenvalue decomposition of (X Lap X^T, X D X^T), the projection matrix P consists of the p eigenvectors corresponding to the p smallest eigenvalues, P = [v_1, v_2, ..., v_p], where v_i is the eigenvector with eigenvalue s_i and s_1 < s_2 < ... < s_p. After multiple iterations, the weight matrix Q of the sparse graph tends to become stable. In our method, the maximum iteration number is used as the stopping condition of the algorithm: when the maximum iteration number is reached, the algorithm stops and the obtained projection matrix P is output. Because of the imprecise discriminant information introduced into the sparse representation, the sparse graph can contain much more discriminant information than usual. To show the advantage of the obtained sparse graph over the common sparse graph, a simple testing result is given in Fig. 15.33. The weight matrix obtained by the proposed method is distinctly block-diagonal, which means many pixels are connected with a few pixels in the same class.
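The alternating scheme of Eqs. (15.20)–(15.25) can be sketched as follows. The weighted nonnegative LASSO step uses a simple projected proximal gradient solver as a stand-in for the solver used in the chapter (SPAMS), the clustering step is omitted, and all parameter values and the 2 × n layout of the spatial coordinates are assumptions, so this only illustrates the order of the updates.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def weighted_nonneg_lasso(x, A, w, lam, n_iter=300):
    """min_q ||x - A q||^2 + lam * sum_j w_j q_j, q >= 0 (one column of Eq. 15.21),
    solved with projected proximal gradient as a stand-in solver."""
    step = 0.5 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
    q = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ q - x)
        q = np.maximum(q - step * (grad + lam * w), 0.0)   # gradient step + prox/projection
    return q

def sgl(X, index, p, alpha=0.8, beta=0.1, lam=1.0, n_outer=5):
    """Alternating sketch of SGL: update Q with P fixed, then P with Q fixed.
    X is m x n (pixels in columns); index is 2 x n spatial coordinates."""
    m, n = X.shape
    w_spe = cdist(X.T, X.T)
    w_spa = cdist(index.T, index.T)
    w_ss = alpha * w_spa / (w_spa.max() + 1e-12) + (1 - alpha) * w_spe / (w_spe.max() + 1e-12)
    P = np.eye(m)[:, :p]                         # trivial initialization of the projection
    for _ in range(n_outer):
        Y = P.T @ X
        w_pro = cdist(Y.T, Y.T)
        W = (1 - beta) * w_ss + beta * w_pro / (w_pro.max() + 1e-12)
        Q = np.zeros((n, n))
        for i in range(n):
            others = np.delete(np.arange(n), i)  # dictionary = all other pixels (Q_ii = 0)
            Q[others, i] = weighted_nonneg_lasso(X[:, i], X[:, others], W[i, others], lam)
        Qs = (Q + Q.T) / 2                       # symmetrize, as in the chapter
        D = np.diag(Qs.sum(axis=1))
        A_mat = X @ (D - Qs) @ X.T               # X Lap X^T (up to a constant factor)
        B_mat = X @ D @ X.T + 1e-6 * np.eye(m)   # small ridge keeps B positive definite
        _, vecs = eigh(A_mat, B_mat)
        P = vecs[:, :p]
    return P, Qs
```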

Figure 15.33 An example showing the difference between the common sparse graph and sparse graph learning. For two data sets, 50 pixels per class are sampled from four classes to learn the graph. (A)–(C) Results on data set (1); (D)–(F) results on data set (2). (B) and (E) The weight matrix of the common sparse graph, where light points represent nonzero weights. (C) and (F) The weight matrix obtained by sparse graph learning.


15.3.4 Spatial-spectral clustering

In common sparse graph construction, the sparse representation coefficient of each pixel is obtained on a dictionary consisting of the whole data set except the pixel itself. The computational time of sparse graph construction is therefore very high, especially when there are a large number of pixels. One way to solve this problem is to shrink the solution space of the sparse representation by adding some constraints. In the proposed method, a simple clustering step is used to reduce the computational cost of sparse graph construction: by clustering, sparse graph construction is divided into the construction of K subgraphs. For simplicity and effectiveness, K-means clustering is utilized as the basic clustering algorithm, with random initialization used to determine the initial cluster centers. The parameter K can be determined according to the scale of the data set; its value should not be too large, which guarantees that the clustering will not destroy the local structure of the sparse graph. To reduce the influence of clustering on performance, both spatial information and spectral information are used in the clustering, which is therefore termed spatial-spectral clustering. The spatial coordinates of each pixel are utilized as spatial features and combined with the spectral features to generate the spatial-spectral features. Additionally, to reduce the computational cost of clustering, the number of spectral features can be reduced by PCA. After clustering on the spatial-spectral features, all pixels are divided into K clusters. During sparse graph construction, pixels are represented only by the pixels in the same cluster. In this way, K sparse subgraphs can be obtained in a short time and integrated into an unbroken graph consisting of all pixels. A sketch of this clustering step is given below.
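A minimal sketch of the spatial-spectral clustering step, assuming scikit-learn's KMeans and PCA; the relative weighting of the spatial coordinates against the PCA-reduced spectra is an assumption, since the text does not specify it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def spatial_spectral_clusters(cube, K=100, n_pcs=10, spatial_weight=1.0):
    """Cluster pixels on combined spatial coordinates and PCA-reduced spectra.
    cube has shape (lx, ly, l); returns an (lx, ly) array of cluster labels."""
    lx, ly, l = cube.shape
    spectra = cube.reshape(lx * ly, l)
    pcs = PCA(n_components=n_pcs).fit_transform(spectra)       # reduce spectral features
    rows, cols = np.meshgrid(np.arange(lx), np.arange(ly), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    feats = np.hstack([pcs, spatial_weight * coords])           # spatial-spectral features
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(feats)
    return labels.reshape(lx, ly)
```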

15.3.5 Experimental results

In this section, two hyperspectral data sets are used to verify the proposed method's performance. The proposed method is compared with several state-of-the-art approaches, including unsupervised methods (PCA [34], LPP [35], NPE [36], IsoP [37], L1-graph [38], and SPP [39]) and supervised methods (SGDA and BSGDA [40]). Additionally, to show the advantage of SGL, three different graph construction methods that also utilize spatial information are compared with the proposed method. The projected data obtained by these DR methods are applied to classification, and the final classification result is used to evaluate the performance of the DR methods. Two common classifiers are used in this section: the nearest neighbor classifier (NN classifier) and the support vector machine classifier (SVM classifier). The assessment indicators used to evaluate each method's performance are the overall accuracy (OA), the average accuracy (AA), the kappa coefficient, and the computational time. Because of the computational intensiveness of graph-based methods [41], we use a sample subset to replace all data in the experiments,


which is randomly selected from the original data set. The SPAMS toolbox is employed to solve the sparse representation problems involved in the experiments. The maximum iteration number is set to 20 by experience.

15.3.5.1 Introduction of hyperspectral datasets

Data set (1): This data set was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in northwestern Indiana. A total of 220 spectral bands were generated in the wavelength range of 0.4–2.5 μm by the AVIRIS sensor. Because of water absorption and noise, 20 bands were removed and only 200 bands are used in the experiments. The size of this image is 145 × 145. This data set contains 16 categories, and the number of pixels in each category is shown in Table 15.7. The spatial distribution of each category is given in Fig. 15.34A, where different colors represent different categories. In Fig. 15.34B, the spectral signature/reflectance of each category is shown.

Data set (2): Data set (2) is the Pavia University image, which was produced by the Reflective Optics System Imaging Spectrometer (ROSIS) optical sensor depicting Pavia University in Italy. About 115 bands in the spectral range of 0.43–0.86 μm were gathered by the ROSIS sensor. A total of 103 bands are retained in the experiments after removing 12 noisy bands. The image consists of 610 × 340 pixels with a spatial resolution of 1.3 m per pixel. Nine categories are included in this data set and the pixel number of each

Table 15.7: Categories of data set (1).

No.   Class                             Number
1     Alfalfa                           54
2     Corn-notill                       1434
3     Corn-min                          834
4     Corn                              234
5     Grass/Pasture                     497
6     Grass/Trees                       747
7     Grass-pasture-mowed               26
8     Hay-windrowed                     489
9     Oats                              20
10    Soybeans-notill                   968
11    Soybeans-min                      2468
12    Soybeans-clean                    614
13    Wheat                             212
14    Woods                             1294
15    Building-grass-trees-drives       380
16    Stone-steel Towers                95
Total number: 10,366


Figure 15.34 Distribution of data set (1), different colors represent different categories. (A) Spatial distribution of each category, (B) spectral response of each category. The index of each band is given on the X-axis and the spectral response of different categories on each band is given on the Y-axis.

Table 15.8: Categories of data set (2).

No.   Class                      Number
1     Asphalt                    6631
2     Meadows                    18,649
3     Gravel                     2099
4     Trees                      3064
5     Painted Metal Sheets       1345
6     Bare Soil                  5029
7     Bitumen                    1330
8     Self-blocking Bricks       3682
9     Shadows                    947
Total number: 42,776

category is shown in Table 15.8. In Fig. 15.35, the spatial distribution and the spectral signature/reflectance of each category are given.

15.3.5.2 Classification results

1) Experiment 1: The projection number p is an important parameter of dimensionality reduction methods. In this experimental part, the influence of this parameter on performance is investigated directly. Here, K is equal to 100 with λ = 1, and α is set to 0.8. The NN classifier is used, with 10% of the samples of data set (1) and 50 samples per class of data set (2) as training data. For comparison, the methods mentioned above are also tested under the same conditions. The projection number varies from 1 to 100 for the two data sets. The experimental results are given in Fig. 15.36. All methods obtain stable results when the projection number reaches a certain value. Although the results of the other methods are


Figure 15.35 Distribution of data set (2), different colors represent different categories. (A) Spatial distribution of each category, (B) spectral response of each category. The index of each band is given on the X-axis and the spectral response of different categories on each band is given on the Y-axis.

Figure 15.36 Results of experiment 1. (A–C) The results of data set (1), (D–F) the results of data set (2). NN classifier is used for classification with 10% training samples for data set (1) and 50 training samples per class for data set (2).

tending to become stable when the projection number exceeds 10, the proposed method reaches a steady state once the projection number equals 10 for data set (1) and 5 for data set (2), which is comparable to or surpasses the other methods. In addition, the results of the proposed method are much better than those of the other methods with the same projection number.

2) Experiment 2: In this part, all methods are tested with different numbers of training samples for classification. To prove the universality, both the NN classifier and the SVM classifier are used. For the SVM classifier, a Gaussian kernel is chosen and the parameters c and g are selected from 10^−5 to 10^5 by multiple cross-validation tests. The projection number is set to 30 for data set (1) and 20 for data set (2). The experimental results for data set (1) are shown in Fig. 15.37. As the number of training samples increases, the OAs of all methods improve. Because SGDA and BSGDA are supervised methods, their results are not very good when the number of training samples is not large enough. Because there is just one labeled sample in Class 9, BSGDA did not produce a result when the training size equals 5%. Nevertheless, even with more training samples, BSGDA still cannot surpass the proposed method. For both classifiers, the proposed method outperforms the other methods with an apparent advantage. Classification maps of all methods are shown in Fig. 15.38. The region consistency of the classification map obtained by the proposed method is much better than that of the other methods. The experimental results on data set (2) are shown in Fig. 15.39. The experimental results of all methods rise with the number of training samples, and sparse graph-based methods,

Figure 15.37 Classification performance as the number of training samples varies for data set (1), the number of projections is 30. (A) The results of NN classifier. (B) The results of SVM classifier. An average of 20 runs is given. K = 100, λ = 1, α = 0.8.


Figure 15.38 Classification maps of all methods on data set (1). 10% training samples are used for training NN classifier.

such as L1-graph, SPP, and BSGDA, obtain better results than the other methods. Because the training samples are not sufficient, SGDA cannot reach a good result. BSGDA constructs a sparse graph based on the labels, so it still has some advantages. The proposed method again obtains better results than the other methods on data set (2). To show the classification results more directly, classification maps of all methods are given in


Figure 15.39 Classification performance as the number of training samples varies for data set (2), the number of projections is 20. (A) The result of NN classifier. (B) The result of SVM classifier. An average of 20 runs is given. K = 100, λ = 1, α = 0.8.

Fig. 15.40. The regional consistency of the classification map of the proposed method is much better than that of the other methods. In summary, features obtained by the proposed method achieve better classification results than the compared methods. Because of the imprecise discriminant information inserted into the sparse representation and the robustness of sparse representation, each pixel in the graph is connected with some similar pixels that are probably in the same class, and the weight of the connected edge is close to the similarity between the pixels. The discriminant information contained in the sparse graph leads to a projection matrix that is beneficial for classification by preserving the intrinsic structure of the data in the projection space.

3) Experiment 3: In the literature on hyperspectral images, many new distances or similarity measurements have been proposed that consider the spatial-spectral character of hyperspectral images. Here, three different similarity measurements are used to construct local neighborhood graphs. Under the framework of LPP, these three different graphs are compared with the sparse graph learned by our method. First, the two-dimensional spatial coordinates index(x_i) are utilized as spatial features [42]. The similarity between two points is increased if they are spatially adjacent. This similarity measurement is defined as follows:

s(x_i, x_j) = exp(−||x_i − x_j||_2^2 / σ_1) + exp(−||index(x_i) − index(x_j)||_2^2 / σ_2)    (15.26)

The graph based on the above similarity measurement is denoted as the SC-graph.


Figure 15.40 Classification maps of all methods on data set (2). Fifty training samples per class are used for training NN classifier.

Second, as in Ref. [43], the similarity between two points is measured by their spatially coherent distance within a spatial window. Suppose an r × r spatial window is utilized. This distance is computed by the following equation:

d_E(x_i, x_j) = sqrt( Σ_{s=1}^{m} d_e^2(X_i(s), X_j(s)) )    (15.27)

where X_i represents an m × r^2 matrix consisting of the pixels in the r × r window centered at x_i, and X_i(s) is the s-th row of X_i. d_e(·) is a distance measurement, which can be the Euclidean distance or the spectral vector angle; in the experiments, the Euclidean distance is selected. The weight of the connected edge in the graph is computed by exp(−d_E^2(x_i, x_j)/σ). The graph constructed by using this measurement is denoted as the SCD-graph.

Third, in Ref. [44], a squared sliding window is utilized to compute the average spectrum of the pixels in the window as the center pixel's spatial feature x_i^{spa}. The similarity measurement is formulated as

S(x_i, x_j) = μ exp(−||x_i^{spe} − x_j^{spe}||_2^2 / σ_1) + (1 − μ) exp(−||x_i^{spa} − x_j^{spa}||_2^2 / σ_2)    (15.28)

The graph computed by this similarity measurement is denoted as the AWSK-graph. An adaptive spatial window is designed to improve the performance, and windows with different sizes are also employed for comparison. In the experiments, two window sizes of 3 × 3 and 5 × 5 are adopted for the SCD-graph and AWSK-graph, σ is selected for the best results, and μ in the AWSK-graph is equal to 0.5. In addition, the range of window sizes in the adaptive spatial window is the same as that in Refs. [45,46]. The number of neighbors is equal to six for these three kinds of graphs. Both the Indian Pines image and the Pavia University image are tested here. For classification, 10% of the samples are selected as training samples for the first data set and 50 samples per class are selected as training samples for the second data set. The classification results are shown in Figs. 15.41 and 15.42. For the SCD-graph and AWSK-graph, the window size may affect the performance. The AWSK-graph with an adaptive spatial window performs better than with a fixed window, but the adaptive selection of an appropriate window size brings additional computational cost. Although the SC-graph utilizes spatial information without involving a spatial window, its performance is not satisfactory. On both data sets, the proposed method achieves the best results, and it does not need to consider the effect of the window size.

15.3.5.3 Influence of spatial-spectral clustering

In the proposed method, spatial-spectral clustering is used as a preprocessing step to reduce the computational time of sparse graph learning. If the partition of the data set is not consistent with the following process, the classification will be affected badly. In this


Figure 15.41 The classification obtained by different graphs on the Indian Pines image.

Figure 15.42 The classification obtained by different graphs on the Pavia University image.

experiment, different cluster numbers K are chosen, with the other parameters fixed, for the two data sets. The classification results given in Fig. 15.43A demonstrate that the classification is only slightly influenced by the spatial-spectral clustering on both data sets. The high computational cost caused by the iterations and the sparse representation is reduced markedly and becomes stable when K is large enough, as shown in Fig. 15.43B. In conclusion, spatial-spectral clustering meets our expectations: the computational cost is reduced without a decrease in performance.

15.3.5.4 Convergence analysis
In the proposed method, two variables are optimized alternately over multiple iterations. The residual error of the obtained Q between two successive iterations and the classification result in each iteration were examined in experiments to verify the convergence of the proposed


Figure 15.43 Experimental results with different cluster numbers K: (A) classification results with varying K; (B) the change of computational time with K. α = 0.8, β = 0.1, and λ = 1.

method. Fig. 15.44 gives the experimental results on data set (2), with different results reported for different settings of β. In Fig. 15.44, the residual error of Q decreases with the iterations, and the classification result becomes stable after a certain number of iterations. Because β controls the effect of the projection weight, which changes in each iteration, on sparse graph learning, the convergence is much quicker and the final state is much more stable when β is small than when β is large. Therefore, the value of β is commonly chosen from a small range, such as [0.1, 0.5], in experiments.
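As a small illustration of the stopping test behind this analysis, the following NumPy sketch checks the residual error between two successive projection matrices in the sense used in Fig. 15.44; the tolerance value is illustrative.

import numpy as np

def has_converged(Q_new, Q_old, tol=1e-4):
    # Err(iter) = ||Q_iter - Q_{iter-1}||_F, the quantity plotted in Fig. 15.44A.
    return np.linalg.norm(Q_new - Q_old, ord='fro') < tol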

Figure 15.44 Experimental analysis of the proposed method's convergence: (A) the residual error of Q between two iterations, Err(iter) = ||Q_iter − Q_{iter−1}||_F; (B) the change of classification results with the iterations. α = 0.8 and λ = 1.


References
[1] Liu S, Jiao L, Yang S. Hierarchical sparse learning with spectral-spatial information for hyperspectral imagery denoising. Sensors 2016;16(10):1718.
[2] Liu S, Jiao L, Yang S, et al. Hierarchical sparse Bayesian learning with beta process priors for hyperspectral imagery restoration. IEICE Transactions on Information and Systems 2017;100(2):350-8.
[3] Chen P, Jiao L, Liu F, et al. Dimensionality reduction of hyperspectral imagery using sparse graph learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2017;10(3):1165-81.
[4] Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 2004;13(4):600-12.
[5] Papyan V, Elad M. Multi-scale patch-based image restoration. IEEE Transactions on Image Processing 2016;25:249-61.
[6] Ye MC, Qian YT, Zhou J. Multitask sparse nonnegative matrix factorization for joint spectral-spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing 2015;53:2621-39.
[7] Shah A, David K, Ghahramani ZB. An empirical study of stochastic variational inference algorithms for the beta Bernoulli process. In: Proceedings of the 32nd International Conference on Machine Learning (ICML'15), Lille, France; 2015.
[8] Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. 3rd ed. Boca Raton, FL, USA: Chapman & Hall/CRC; 2014.
[9] Rasmussen C, Williams C. Gaussian processes for machine learning. MIT Press; 2006.
[10] Qian YT, Ye MC. Hyperspectral imagery restoration using nonlocal spectral-spatial structured sparse representation with noise estimation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2013;6:499-515.
[11] Makitalo M, Foi A. Noise parameter mismatch in variance stabilization, with an application to Poisson-Gaussian noise estimation. IEEE Transactions on Image Processing 2014;23:5349-59.
[12] Casella G, George EI. Explaining the Gibbs sampler. The American Statistician 1992;46:167-74.
[13] Rodriguez YG, Davis R, Scharf L. Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. New York: Columbia Univ.; 2004.
[14] Aharon M, Elad M, Bruckstein A. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 2006;54:4311-22.
[15] Dabov K, Foi A, Karkovnik V. Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Transactions on Image Processing 2007;16:2080-94.
[16] Manjón JV, Pierrick C, Luis MB, Collins DL, Robles M. Adaptive non-local means denoising of MR images with spatially varying noise levels. Journal of Magnetic Resonance Imaging 2010;31(1):192-203.
[17] Maggioni M, Katkovnik V, Egiazarian K, Foi A. Nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Transactions on Image Processing 2013;22(1):119-33.
[18] Zhang HY, He W, Zhang LP, Shen HF, Yuan QQ. Hyperspectral image restoration using low-rank matrix recovery. IEEE Transactions on Geoscience and Remote Sensing 2014;52:4729-43.
[19] Huo L, Feng X, Huo C, Pan C. Learning deep dictionary for hyperspectral image denoising. IEICE Transactions on Information and Systems 2015;7:1401-4.
[20] Li J, Yuan QQ, Shen HF, Zhang LP. Hyperspectral image recovery employing a multidimensional nonlocal total variation model. Signal Processing 2015;111:230-446.
[21] Zhang L, Zhang L, Mou XQ, Zhang D. FSIM: a feature similarity index for image quality assessment. IEEE Transactions on Image Processing 2011;20(8):2378-86.
[22] Xu Y, Wu Z, Wei Z. Spectral-spatial classification of hyperspectral image based on low-rank decomposition. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2015;8:2370-80.
[23] Sun L, Wu Z, Liu J, Xiao L, Wei Z. Supervised spectral-spatial hyperspectral image classification with weighted Markov random fields. IEEE Transactions on Geoscience and Remote Sensing 2015;53(3):1490-503.
[24] Xu L, Li F, Wong A, et al. Hyperspectral image denoising using a spatial-spectral Monte Carlo sampling approach. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2015;8(6):3025-38.
[25] Thibaux R, Jordan MI. Hierarchical beta processes and the Indian buffet process. In: Proc. International Conf. on AISTATS, San Juan, Puerto Rico, vol. 2; 2007. p. 564-71.
[26] He L, Qi H, Zaretzki R. Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution. In: Proc. International Conf. on CVPR, Portland, United States of America; 2013. p. 345-52.
[27] Paisley J, Carin L. Nonparametric factor analysis with beta process priors. In: Proc. 26th International Conf. on ICML, Montreal, Canada; 2009.
[28] Rasti B, Sveinsson JR, Ulfarsson MO, Benediktsson J. A hyperspectral image denoising using first order spectral roughness penalty in wavelet domain. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2014;7(6):2458-67.
[29] Liu S, Bourennane, Fossati C. Reduction of signal-dependent noise from hyperspectral images for target detection. IEEE Transactions on Geoscience and Remote Sensing 2014;52(9):5396-411.
[30] Deger F, Mansouri A, Pedersen M, et al. A sensor-data-based denoising framework for hyperspectral images. Optics Letters 2015;23(3):1938-50.
[31] Yuan Q, Zhang L, Shen H. Hyperspectral image denoising with a spatial-spectral view fusion strategy. IEEE Transactions on Geoscience and Remote Sensing 2014;52(5):2314-25.
[32] Chen J, Jia X, Yang W, Matsushita B. Generalization of subpixel analysis for hyperspectral data with flexibility in spectral similarity measures. IEEE Transactions on Geoscience and Remote Sensing 2009;47(7):2165-71.
[33] Zhang L, Qiao L, Chen S. Graph-optimized locality preserving projections. Pattern Recognition 2010;43(6):1993-2002.
[34] Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 1987;2(1):37-52.
[35] He, Niyogi P. Locality preserving projections. Neural Information Processing Systems 2004;45(1):186-97.
[36] He D, Cai S, Yan, Zhang H-J. Neighborhood preserving embedding. In: Tenth IEEE International Conference on Computer Vision (ICCV 2005); 2005.
[37] Cai D, He X, Han J. Isometric projection. In: Twenty-Second Conference on Artificial Intelligence (AAAI-07); 2007.
[38] Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 2003;15(6):1373-96.
[39] Qiao L, Chen S, Tan X. Sparsity preserving projections with applications to face recognition. Pattern Recognition 2010;43:331-41.
[40] Ly H, Du Q, Fowler JE. Sparse graph-based discriminant analysis for hyperspectral imagery. IEEE Transactions on Image Processing 2014;52(7):3872-84.
[41] Xue, Du P, Li J, Su H. Simultaneous sparse graph embedding for hyperspectral image classification. IEEE Transactions on Image Processing 2015;53(11):6114-33.
[42] Hou B, Zhang X, Ye Q, Zheng Y. A novel method for hyperspectral image classification based on Laplacian eigenmap pixels distribution-flow. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2013;6(3):1602-18.
[43] Mohan, Sapiro G, Bosch E. Spatially coherent nonlinear dimensionality reduction and segmentation of hyperspectral images. IEEE Geoscience and Remote Sensing Letters 2007;4(2):206-10.


[44] Sun W, Halevy A, Benedetto JJ, et al. Nonlinear dimensionality reduction via the ENH-LTSA method for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2014;7(2):375-88.
[45] Zhang L, Zhang L, Tao D, Huang X, Du B. Hyperspectral remote sensing image subpixel target detection based on supervised metric learning. IEEE Transactions on Image Processing 2014;52(8):4955-65.
[46] Bandos TV, Bruzzone L, Camps-Valls G. Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Image Processing 2009;47(5):862-73.

C H A P T E R 16

Nonconvex compressed sensing framework based on block strategy and overcomplete dictionary

Chapter Outline
16.1 Introduction 617
16.2 The block compressed sensing framework based on the overcomplete dictionary 618
16.2.1 Block compressed sensing 618
16.2.2 Overcomplete dictionary 619
16.2.3 Structured compressed sensing model 620
16.3 Image sparse representation based on the ridgelet overcomplete dictionary 620
16.4 Structured reconstruction model 624
16.4.1 Structural sparse prior based on image self-similarity 624
16.4.2 Reconstruction model based on an estimation of the direction structure of image blocks 625
16.5 Nonconvex reconstruction strategy 626
References 626

16.1 Introduction
A natural image is a common and important signal through which humans perceive the natural environment. This book takes natural images as its research objects and studies the application theory and methods of compressed sensing from the two aspects of sparse representation and reconstruction estimation, in order to promote the application of compressed sensing to natural signals and related fields [1,2]. To obtain sparse representations of natural images, this book builds upon the ridgelet overcomplete dictionary, which has been shown by theory and experiments to provide flexible and adaptive sparse representations for images under the block strategy. However, because of the redundancy of this dictionary and the multipeak characteristics of dictionary-based sparse representation of image blocks, compressed sensing based on this dictionary suffers from uncertainty and instability in the reconstruction problem. To improve the reconstruction accuracy, a prior structure of the image beyond sparsity must be exploited, i.e., a structured sparse prior, and the


structured sparse reconstruction model and the corresponding solving strategies and methods are also needed. Among the many reconstruction strategies and methods, we choose to study a nonconvex reconstruction strategy and method based on the zero-norm sparsity constraint. On the one hand, the reconstruction model with the zero-norm constraint is the original problem of compressed sensing; on the other hand, the existing l1-norm relaxation methods may cause a loss of precision when solving the structured reconstruction model. Two kinds of nonconvex reconstruction strategies are considered in this book. The first is a reconstruction strategy based on greedy search; this kind of method uses a suboptimal search strategy that is simple and fast and is suitable for verifying a structured reconstruction model, although it incurs a loss of precision. The second is our proposed reconstruction strategy based on evolutionary search, in which compressed sensing is combined with evolutionary methods. Our method not only makes use of the superior performance of evolutionary methods in solving nonconvex, nonlinear, and complex problems, but also realizes a global search of the solution space when solving nonconvex reconstruction problems. Based on the research motivations above, this book presents an image nonconvex compressed sensing framework based on the overcomplete dictionary and block strategy. The image is observed with block compressed observations, which divide an image into equal-sized, nonoverlapping image blocks and then observe all image blocks with the same random Gaussian matrix. A ridgelet overcomplete dictionary is used to obtain the sparse representation of each image block; on this basis, the framework mines and exploits the structured sparse prior of the image blocks in the dictionary, which mainly includes the similarity relations among image blocks and the matching relation between image blocks and the direction structure of the dictionary, and establishes the corresponding structured reconstruction models and nonconvex reconstruction methods.

16.2 The block compressed sensing framework based on the overcomplete dictionary

16.2.1 Block compressed sensing
This section describes the image block compressed sensing framework [3,4]. The observation method is a random observation method based on a block strategy: the image is divided into nonoverlapping image blocks of the same size, and then all image blocks are observed randomly in the same way, for example with a Gaussian observation matrix, to obtain the observation values. Under the block strategy, the structure of an image block is much simpler than that of the whole image, and therefore it is easier to construct and obtain a smaller and fully

redundant sparse dictionary, relative to the image blocks, in terms of structure. The advantage of using block processing in image compressed sensing is that the image has self-similar characteristics, which means there are only a limited number of different types of block structures in one image. The observation vectors of similar blocks must be similar, so the similarity between observation vectors can be used to find image blocks with similar structures. The image blocks in each class can then be represented by the same group of atoms, and together with the sparsity prior this reduces the uncertainty and instability of the image block reconstruction problem. Block processing is a common method in image processing and allows convenient signal sampling and processing. Block methods can generally be divided into two types: nonoverlapping blocks and sliding blocks. In sliding-block processing, the similarity among image blocks is strong, but the number of image blocks is large. In nonoverlapping-block processing, the similarity among the image blocks is relatively weak, but their number is small. This section uses nonoverlapping blocks, which are combined with the structural characteristics of the overcomplete dictionary to obtain the structured sparse prior of image blocks; a block-wise observation of this kind is sketched below.
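The following minimal Python/NumPy sketch illustrates such a block compressed observation; the block size, data ratio, and function names are illustrative, and the image dimensions are assumed to be divisible by the block size.

import numpy as np

def block_cs_measure(image, block_size=16, ratio=0.3, seed=0):
    # Divide the image into nonoverlapping block_size x block_size blocks and observe
    # every block with the same random Gaussian measurement matrix Phi.
    rng = np.random.default_rng(seed)
    B = block_size * block_size
    m = int(round(ratio * B))                        # number of measurements per block
    Phi = rng.standard_normal((m, B)) / np.sqrt(m)   # shared Gaussian observation matrix
    h, w = image.shape
    measurements = []
    for r in range(0, h, block_size):
        for c in range(0, w, block_size):
            x = image[r:r + block_size, c:c + block_size].reshape(-1)  # vectorized block
            measurements.append(Phi @ x)
    return Phi, np.array(measurements).T             # one observation vector per column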

16.2.2 Overcomplete dictionary
The core problems in compressed sensing reconstruction include the following:
1. How to construct an adaptive sparse dictionary for the signal to be reconstructed. For image signals, edges and textures are directional, and the direction can be arbitrary. The constructed dictionary should therefore have a sufficiently rich direction structure, so that it can represent edge and texture blocks in arbitrary directions adaptively and sparsely.
2. Assuming an overcomplete dictionary meeting the above condition exists, how to design an effective algorithm for searching and optimizing over this dictionary with so many directions. The algorithm has to obtain the sparse reconstruction of the signal over the overcomplete dictionary within a reasonable period of time.
For the first problem, we optimized the existing dictionary construction methods [5] and obtained a ridgelet overcomplete dictionary for image sparse representation; there are enough direction structures in this dictionary to provide an effective sparse representation for arbitrary image blocks. For the second problem, we designed a nonconvex reconstruction strategy based on natural computation optimization and collaborative optimization, in order to search the ridgelet dictionary effectively within a reasonable time and obtain an accurate reconstruction estimation of the direction and scale geometry information of the image.


16.2.3 Structured compressed sensing model
In order to overcome the ill-conditioning and multimodal problems in reconstruction caused by the redundancy of the dictionary, this section mines and exploits the structured sparse prior of the image in the dictionary, namely the structural characteristics of the sparse representations of image blocks in the dictionary, so as to reduce the uncertainty of the single-block reconstruction problem and improve the accuracy and stability of image reconstruction. First, the self-similar characteristics of the image are used to obtain a joint sparse representation of image blocks with similar structures in the dictionary, and a nonconvex reconstruction model based on the block strategy and the overcomplete dictionary is put forward. In this model, image blocks with similar structures are assumed to be represented by a group of atoms that are the same or have similar parameters. Similar image blocks can therefore be combined and solved simultaneously, the amount of information available to a single reconstruction problem is increased, and reconstruction information can be exchanged and transferred among image blocks, so the overall quality of image reconstruction is promoted. In addition, in order to improve the accuracy of the estimation of the local structure of the image, we established a reconstruction model based on the estimation of the image block structure. In this model, the structure of the ridgelet dictionary, particularly its direction structure, is considered redundant relative to the direction structure of a single image block, so a subdictionary composed of a small number of atoms with the same structure as the image block is already a sparse dictionary for that block. The model estimates the structure type of an image block from its compressed observation and uses the obtained structure estimation to guide the selection of a sparse subdictionary for the block. Furthermore, we combined the reconstruction model based on structure estimation with the existing reconstruction strategies based on collaborative optimization and evolutionary search optimization and established a direction-guided reconstruction method. These methods improve the accuracy of the estimation of the local structure of images while also reducing the reconstruction time.

16.3 Image sparse representation based on the ridgelet overcomplete dictionary
Research into image multiscale geometric analysis shows that a good image representation method should have the qualities of multiresolution, locality, and direction [6-8]. Multiresolution, i.e., continuity, refers to the continuous approximation of an image from low to high resolution; locality means that the basic atoms used to represent the image

have limited support in the spatial domain and frequency domain; and direction refers to basic atoms that can have arbitrary directions. One method of constructing an overcomplete redundant dictionary with the qualities mentioned above is to apply translation, direction transforms, scale transforms, or other linear operations to a multiscale geometric prototype function to obtain the atoms of the dictionary. There are two basic problems in this method: the selection of the prototype atom and the discretization of the parameter space. Candès pointed out that ordinary functions, even the sine function, Gabor function, Gaussian function, or wavelet function, cannot sparsely represent the linear singularities of piecewise smooth signals (including image signals) in two-dimensional or high-dimensional spaces [9]. The edges in an image often show linear singularity. Ridgelet-based multiscale geometric analysis theory and applications show that the ridgelet can effectively detect and match the linear singularities in an image. In order to effectively represent the image content to which the human eye is most sensitive, namely edges, this section chooses the ridgelet function as the prototype function of the atoms in the dictionary.

The overcomplete dictionary D ∈ R^{B×N} used to represent the image blocks is D = (d_1, d_2, ..., d_N), where d_i, i = 1, 2, ..., N, is the i-th atom in the dictionary, formed according to the following formula:

d_i(z) = W [ e^{-(a_i u_i^T z - b_i)^2 / 2} - (1/2) e^{-(a_i u_i^T z - b_i)^2 / 8} ]    (16.1)

where d_i(z) ∈ R^{√B×√B} is the atom with the same size as the image block and d_i ∈ R^B is its quantized (vectorized) version; z = (z_1, z_2) ∈ {0, 1, 2, ..., √B − 1}^2 is the position variable of the atom; the atom corresponds to the parameter group γ_i = (θ_i, a_i, b_i), where a_i is the scale parameter, b_i is the displacement parameter, and θ_i is the direction parameter; u_i = (cos θ_i, sin θ_i)^T; and W is used for the normalization of the atom. A 3D model of the ridgelet prototype function is shown in Fig. 16.1A, and some atoms of the ridgelet dictionary are shown in Fig. 16.1B.

After the selection of the prototype atom, the size of the dictionary and its ability to sparsely represent images also depend on the ranges of the three parameters and their respective discrete intervals. According to the existing method for constructing a ridgelet dictionary [5], the parameter space is set as

E = D^{1/2}_{1:m^2, 1:m^2} ∪ D^{1/2}_{1:m^2, 1:m^2}    (16.2)

where the range of the displacement parameter is related to the direction parameter:

Γ_b = [0, √B (sin θ + cos θ)),  if θ ∈ [0, π/2);  Γ_b = [−√B cos θ, √B sin θ),  otherwise    (16.3)
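As an illustration of Eq. (16.1), the following minimal NumPy sketch samples one ridgelet atom on a √B × √B grid and normalizes it; it assumes the difference-of-Gaussians prototype as reconstructed above, and the parameter values passed in are purely illustrative.

import numpy as np

def ridgelet_atom(theta, a, b, side=16):
    # Sample the prototype of Eq. (16.1) on a side x side grid (side = sqrt(B)).
    z1, z2 = np.meshgrid(np.arange(side), np.arange(side), indexing='ij')
    t = a * (np.cos(theta) * z1 + np.sin(theta) * z2) - b       # ridge variable a * u^T z - b
    atom = np.exp(-t ** 2 / 2.0) - 0.5 * np.exp(-t ** 2 / 8.0)  # difference of two Gaussians
    atom = atom.reshape(-1)                                      # quantized (vectorized) atom d_i
    return atom / np.linalg.norm(atom)                           # W plays the role of a normalizer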


Figure 16.1 Schematic diagrams of the Ridgelet prototype function and some atoms in the dictionary: (A) schematic diagram of an atomic function prototype; (B) schematic diagram of the ridgelet dictionary.

When the dictionary is used to compute a sparse representation of an image, the atoms in the dictionary respond to image content with a consistent shape, size, position, and direction. Therefore, for an overcomplete dictionary designed for natural images, its structure must be sufficiently redundant with respect to the image content. Considering the important influence of directional structure on human perception and understanding of an image, the dictionary must have a sufficient direction structure and be able to represent adaptively any direction of an image block. That is to say, in the discretization scheme of the dictionary, the discrete interval of the direction parameter must be small enough. Fig. 16.2 shows the influence of the discrete interval of the direction parameter on the representation performance of the sparse dictionary. In the experiments, two dictionaries are constructed and used to sparsely represent an image. The discrete intervals of the scale and displacement parameters are the same in both dictionaries and are set to 0.2 and 1, respectively; the discrete intervals of the direction parameter are π/4 and π/45, respectively, i.e., the two dictionaries have four and 45 directions, respectively; and the numbers of atoms in the dictionaries are 1201 and 14,116. The dictionaries perform sparse representation of a local region of Lena with the OMP [10] method (the results are shown in Fig. 16.2). Comparing the two results, it can be seen that there is an obvious block effect in the image obtained using the


Figure 16.2 The sparse representation results of natural images by dictionaries with different numbers of directions: (A) original image, (B) four directions, 32.85 dB (0.9824), and (C) 45 directions, 41.42dB(0.9975).

dictionary with four directions, but almost no block effect in the image obtained using the dictionary with 45 directions, whose visual quality has an apparent advantage. In the applications described in this section, the discrete interval of the direction parameter is set to π/36, and the dictionary is organized according to direction. When assigning atom numbers, numbers are first assigned consecutively to all atoms whose direction parameter takes the first discrete value, then to the atoms with the second discrete value, and so on. The atoms with the same direction parameter form a direction subdictionary. In the dictionary described in this section, the direction parameter of the atoms has 36 discrete values, which produces a total of 36 direction subdictionaries. The discrete intervals of parameters a and b are set to 0.2 and 1, respectively. Therefore, there are 11,281 atoms in the ridgelet dictionary constructed for 16 × 16 image blocks. In order to verify the sparse representation performance of the dictionary on an image, we use the ridgelet dictionary to compute the sparse representation of natural images of size 512 × 512. Concretely, the image is divided into nonoverlapping blocks, the dictionary is used to sparsely decompose the image block by block, and the representation results of the blocks are stitched together in order, finally giving an approximate representation of the original image. The sparse decomposition method is OMP [10], and the sparsity is set to 32, which means that each image block is approximated by a linear combination of 32 atoms in the dictionary. Fig. 16.3 shows the sparse representation results for two natural images. It can be seen that, even with a greedy matching pursuit decomposition method, the ridgelet dictionary is still able to obtain good sparse representations of natural images. The obtained images are consistent


Figure 16.3 Sparse representation results for two natural images by a ridgelet dictionary: (A) sparse representation result for Lena PSNR:37.35 dB, SSIM:0.9862; (B) sparse representation result for Barbara PSNR:34.10 dB, SSIM:0.9843.

with the original images and maintain coherent, clear linear edges and textures; in particular, there is almost no block effect in the resulting images. Therefore, the ridgelet dictionary in this section can provide an effective sparse representation of a natural image under the block strategy and can be used in a variety of image applications.
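A minimal sketch of this block-by-block decomposition is given below in Python; it uses scikit-learn's OrthogonalMatchingPursuit as a stand-in for the OMP solver and assumes D holds one vectorized atom per column and image dimensions divisible by the block size.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def represent_image_blockwise(image, D, block_size=16, sparsity=32):
    # Approximate each nonoverlapping block by a sparse combination of dictionary atoms,
    # then stitch the reconstructed blocks back together in order.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity)
    recon = np.zeros(image.shape, dtype=float)
    h, w = image.shape
    for r in range(0, h, block_size):
        for c in range(0, w, block_size):
            x = image[r:r + block_size, c:c + block_size].reshape(-1)
            omp.fit(D, x)                                  # greedy sparse decomposition of one block
            block = (D @ omp.coef_).reshape(block_size, block_size)
            recon[r:r + block_size, c:c + block_size] = block
    return recon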

16.4 Structured reconstruction model

16.4.1 Structural sparse prior based on image self-similarity
Compared with the whole image, the structure of an image block is simple and consistent, and one image block often has many other image blocks with the same or similar structures. Taking the Barbara image as an example, the image is divided into blocks of size 16 × 16, and the result is shown in Fig. 16.4. It can be seen that a single image block usually contains only a single structure. In addition, Fig. 16.4 also shows two groups of image blocks with similar structures: one is a set of image blocks with a smooth structure, and the other is a set of image blocks with a striped texture structure. As shown in Fig. 16.4, similar image blocks may or may not be adjacent in position, but there are often only a few types of structure in the whole image. In sparse reconstruction, estimating and reconstructing these types of structures yields a reconstruction estimate of the whole image.


Figure 16.4 The blocks of Barbara and the similarity representation of image blocks.

Combining the sparse representation of image blocks with the overcomplete dictionary, it is noted that image blocks with similar structures can be represented by the same group of atoms in the dictionary. Therefore, the self-similarity of the image, together with the fact that one set of dictionary atoms can represent a whole group of similar image blocks, can be used as a structural prior to build a structurally constrained sparse reconstruction model, which increases the information available for reconstructing a single image block and reduces the uncertainty of the sparse reconstruction problem based on an overcomplete dictionary under the blocking strategy. Two methods are used to exploit this structural prior. The first is to classify the image blocks and jointly reconstruct each class of image blocks; the other is to perform a matching search for each image block and use the information of a group of similar image blocks to reconstruct that single block. The characteristic of the former method is that the number of problems to be solved is significantly smaller than the number of image blocks, and the reconstruction problem for each class uses information from all image blocks in that class; the characteristic of the latter method is that the number of problems to be solved equals the number of image blocks, but when estimating and reconstructing each image block, the observation vectors and information of a group of similar image blocks are used, as sketched below.
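A minimal NumPy sketch of the matching-search step is shown here: because similar blocks are assumed to produce similar compressed observations, the similar group for a block is found by comparing observation vectors directly. The number of similar blocks and the distance used are illustrative choices.

import numpy as np

def nearest_similar_blocks(Y, i, num_similar=4):
    # Y: matrix with one compressed observation vector per column (one column per block).
    # Return the indices of the blocks whose observations are closest to that of block i.
    dists = np.linalg.norm(Y - Y[:, [i]], axis=0)
    order = np.argsort(dists)
    return [j for j in order if j != i][:num_similar]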

16.4.2 Reconstruction model based on an estimation of the direction structure of image blocks
The excellent sparse representation performance of the ridgelet overcomplete dictionary on natural images benefits from the fine structure of the dictionary, including its direction

and scale structure, which are redundant relative to a single image block. In addition, organizing the atoms according to their parameters yields subdictionaries that effectively represent particular structures. For example, organizing atoms by the direction parameter yields multiple direction subdictionaries, each of which can effectively represent linear singular content with the corresponding direction; organizing atoms by the scale parameter yields multiple scale subdictionaries, each of which can effectively represent image blocks with the corresponding scale structure. Therefore, we propose to select dictionary atoms according to the image structure and to construct a sparse subdictionary, in order to obtain an accurate estimation of the image block structure; a sketch of the direction-based grouping follows.
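The grouping itself is simple; the following NumPy sketch (with illustrative names) collects the columns of the dictionary into one subdictionary per discrete direction value, assuming the direction parameter of every atom is available.

import numpy as np

def direction_subdictionaries(D, thetas):
    # D: dictionary with one vectorized atom per column; thetas[i]: direction parameter of atom i.
    # Returns a dict mapping each discrete direction value to its direction subdictionary.
    return {theta: D[:, thetas == theta] for theta in np.unique(thetas)}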

16.5 Nonconvex reconstruction strategy
When selecting and designing the search strategy for a structured sparse reconstruction model, one should consider whether the strategy is suitable for solving the designed structured model and whether it can ensure the accuracy and speed of the reconstruction estimation. According to how the l0 norm is handled and whether the sparsity measure used by the algorithm is convex, existing reconstruction methods can be divided into convex relaxation reconstruction methods and nonconvex reconstruction methods. Convex relaxation methods obtain a convex reconstruction model by relaxing the sparsity term, which makes solving complex structured models simpler and more effective; however, the convex relaxation inevitably leads to a loss of reconstruction accuracy. Currently, the main nonconvex reconstruction method for directly solving the l0-norm-constrained problem is the greedy algorithm. This kind of method is suitable for quick search over the overcomplete dictionary and for complete reconstruction, and is applicable to many structured reconstruction models; its disadvantage is mainly the local search strategy, which leads to relatively low reconstruction accuracy.

References
[1] Lin L, Liu F, Jiao L. Compressed sensing by collaborative reconstruction on overcomplete dictionary. Signal Processing 2014;103:92-102.
[2] Liu F, Lin L, Jiao L, et al. Nonconvex compressed sensing by nature-inspired optimization algorithms. IEEE Transactions on Cybernetics 2014;45(5):1042-53.
[3] Gan L. Block compressed sensing of natural images. In: Proceedings of the international conference on digital signal processing; 2007. p. 403-6.
[4] Mun S, Fowler JE. Block compressed sensing of images using directional transforms. In: 2009 16th IEEE international conference on image processing (ICIP). IEEE; 2009. p. 3021-4.
[5] Xu JH. Design and reconstruction algorithm of sparse redundant dictionary in ridge frame. XiDian University Press; 2011.

[6] Jiao LC, Tan S. Review and prospect of multi-scale geometrical analysis of image. Acta Electronica Sinica 2003;31(12A):1975-81.
[7] Donoho DL, Levi O, Starck JL, et al. Multiscale geometric analysis for 3D catalogues. SPIE Conference on Astronomical Data Analysis 2002:101-11.
[8] Do MN, Vetterli M. The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on Image Processing 2005;14(12):2091-106.
[9] Candes EJ. Ridgelets: theory and applications. Stanford University; 1998.
[10] Pati YC, Rezaiifar R, Krishnaprasad PS. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In: Signals, systems and computers, 1993 conference record of the twenty-seventh Asilomar conference on. IEEE; 1993. p. 40-4.

C H A P T E R 17

Sparse representation combined with fuzzy C-means (FCM) in compressed sensing

Chapter Outline
17.1 Basic introduction to fuzzy C-means (FCM) and sparse representation (SR) 629
17.2 Two versions combining FCM with SR 635
17.2.1 FDCM_SSR 635
17.2.2 SL_FCM 639
17.3 Experimental results 644
17.3.1 FDCM_SSR 644
17.3.1.1 UCI data set 644
17.3.1.2 Artificial images 645
17.3.1.3 Natural images 650
17.4 SAR images 654
17.4.1 SL_FCM 655
17.4.1.1 Artificial and natural images 655
17.4.1.2 Synthetic aperture radar images 663
References 665

17.1 Basic introduction to fuzzy C-means (FCM) and sparse representation (SR)
Clustering is an important data-processing method [1,2], which has been widely applied in the areas of pattern recognition [3], image processing [4], data mining [5], etc. The aim of clustering is to partition data sets into meaningful groups of similar samples. In the clustering area, the fuzzy c-means clustering algorithm (FCM) proposed by Bezdek et al. [6] has become the most well-known fuzzy clustering algorithm and has been widely studied and applied in a variety of substantive domains [7-11]. The FCM algorithm introduces fuzziness into the belongingness of each sample: each sample is assigned a membership grade from 0 to 1. The objective function of the classical FCM algorithm is defined by the fuzzy memberships and the Euclidean distances of samples to cluster centers. Due to its flexibility and robustness to ambiguity, fuzzy clustering is still an active topic [12-16] and has been widely applied in the areas of pattern recognition [17,18], function approximation [19], image processing [20-22], machine learning [23], etc.


Given a data set X = (x_1, x_2, ..., x_n) with n samples, the FCM method is an iterative clustering algorithm that partitions the n samples into c clusters by minimizing the cost function

J_m = Σ_{i=1}^{n} Σ_{j=1}^{c} u_{ji}^m d^2(x_i, v_j),   s.t.  Σ_{j=1}^{c} u_{ji} = 1    (17.1)

where u_{ji} is the degree of membership of the i-th sample x_i in the j-th cluster, m is a weighting exponent which controls the fuzziness of the resulting partition, v_j is the center of the j-th cluster, and d(x_i, v_j) is a distance between the sample x_i and the cluster center v_j. Generally, the minimization of Eq. (17.1) can be solved by setting the partial derivatives to zero, and each variable is iteratively updated as follows:

u_{ji} = 1 / Σ_{k=1}^{c} ( d_{ji} / d_{ki} )^{2/(m-1)}    (17.2)

v_j = Σ_{i=1}^{n} u_{ji}^m x_i / Σ_{i=1}^{n} u_{ji}^m    (17.3)

where d_{ji} is an abbreviation for the distance d(x_i, v_j). In the classical FCM algorithm, d_{ji} = ||x_i − v_j||_2. In fact, several new distances have been defined from different perspectives in improved versions of the FCM algorithm to improve its performance. After Bezdek's masterwork, many researchers focused on improving the performance of the classical FCM algorithm by different methods and from different perspectives. First, to overcome the problem that a point may be equidistant from two prototypes in FCM, Krishnapuram and Keller [24,25] proposed a new clustering model named possibilistic c-means (PCM). On the basis of PCM, two further models, fuzzy-possibilistic C-means (FPCM) [26] and possibilistic fuzzy C-means (PFCM) [27], were presented consecutively. Second, spatial contextual information of the image was introduced by Ahmed et al. [28] into the classical FCM algorithm, giving a new FCM method with spatial constraints (FCM_S). To accelerate the operating speed of FCM_S, its two variants FCM_S1 and FCM_S2 [29] were presented, which respectively employ mean- and median-filtered images obtained in advance to influence the labeling of image pixels, rather than updating the labeling of all image pixels in each iteration. Meanwhile, an enhanced FCM (EnFCM) [30] algorithm based on gray levels rather than image pixels was proposed to improve the running speed of FCM_S. Inspired by EnFCM, Cai et al. [31] used gray and spatial information and proposed a fast generalized FCM (FGFCM) algorithm. Moreover, the idea of a weighted mean was embedded into FCM and a fuzzy weighted C-means (FWCM) algorithm [32] was proposed. A few years later, Hung et al.


[33] proposed an improved version of FWCM, named new weighted fuzzy C-means (NW_FCM). Additionally, in order to guarantee noise insensitivity and image detail preservation, the local spatial and gray-level information is incorporated in a novel fuzzy way in the fuzzy local information C-means (FLICM) [34] method. Furthermore, after generalized fuzzy C-means clustering (GFCM) [35] was presented, a new generalized fuzzy C-means clustering algorithm with improved fuzzy partition (GIFP_FCM) [36] was proposed to quicken the convergence of the FCM algorithm. Afterward, with the rise of kernel technology [37,38], Jiao et al. [39] presented kernel versions of GIFP_FCM with spatial constraints (KGFCM_S1 and KGFCM_S2) to improve the clustering performance. The theoretical formulations of the above-mentioned fuzzy clustering algorithms are listed in Table 17.1. Note that some of them can only be used in image segmentation, owing to their use of the spatial relations among image pixels. To distinguish them in Table 17.1, the variables of the equations in the fuzzy clustering algorithms applied to images and to general data are denoted in general variable and vector forms, respectively. The meanings of the notations used in Table 17.1 are described in Tables 17.2 and 17.3. In addition, to simplify the content of Table 17.1, we omit the constraint

The basic theory of SR is that a sample can be represented as the linear combination of a small amount of atoms in a dictionary. Recently, sparse representation (SR) [40,41] has attracted a great deal of attention and has been successfully used for image classification [42e49] and data clustering [50,51]. Simultaneously, a new SR-based clustering algorithm, called sparse subspace clustering (SSC) [52], used the data set itself as dictionary to obtain the sparse self-representation coefficients and successfully separate different moving objects in video. In addition, Liu et al. proposed a low-rank representation (LRR) [53] method to explore the sparsity of the data set in another way, and then presented an LRRbased algorithm, named multitask low-rank affinity pursuit (MLAP) [50], to segment a single natural image in the framework of SSC. SSC, LRR, and MLAP show that the sparse self-representation coefficients have good category distinguishing performance, and at the same time the sparse self-representation method has favorable noise robustness and data-adaptiveness, which motivates us to introduce the SR method into fuzzy clustering. The SR theory is that a sample data x can be represented as a linear combination of atoms in a dictionary D ˛ Rgh, where weights of all atoms compose the SR coefficients z of this sample. The basic SR model [40] is minkzk0 z

s:t: x ¼ Dz;

(17.4)

632 Chapter 17

Table 17.1: The theoretical formulations of the classical FCM algorithm and some of its improved versions.

Sparse representation combined with fuzzy C-means (FCM)

633

634 Chapter 17

Sparse representation combined with fuzzy C-means (FCM)

635

where k,k0 denotes the [0 norm. Since the objective function in Eq. (17.4) is an NP-hard problem, the [0 norm is always replaced by the [1 norm to convert the original model into a convex optimization problem. Then it can b solved by the optimization method [54]. Furthermore, the noise in the data is also considered in the sparse representation model, and the dictionary is constructed by the data set X itself [55], which can be expressed as minkZk1 þ lkEk2;1 s:t: X ¼ XZ þ E; diagðZÞ ¼ 0; Z;E

(17.5)

where Z is the corresponding coefficient matrix, l > 0 is the parameter to balance the effect of different parts, and E denotes noise of data set. The [1 norm and [2;1 norm of a P Prffiffiffiffiffiffiffiffiffi P 2ffi matrix are defined as kZk1 ¼ jzij j and kEk2;1 ¼ eij , respectively. i;j

j

i

Among the SR methods in image classification and data clustering, the most representative methods are sparse representation-based classification [54] and sparse subspace clustering [53] algorithms. The two methods directly utilize the SR coefficients to get the final classification and clustering results. This shows that the SR coefficients have favorable category distinguishing ability. In other words, it contains discriminate information, which benefits image classification tasks. In addition, the existing method based on SR used the spectral clustering approach to get the clustering result. However, it is well known that the spectral clustering algorithm has high computational complexity. Moreover, it is necessary that the SR coefficients matrix is transformed into a symmetric matrix before using the spectral clustering. This leads to the loss of effective information in the SR coefficients, which is unfavorable for the clustering results, because their roles in their respective reconstructions are different for any two samples. Therefore we try to employ the classical FCM algorithm to cluster the SR coefficients.

17.2 Two versions combining FCM with SR 17.2.1 FDCM_SSR A new fuzzy double C-means clustering algorithm based on sparse self-representation (FDCM_SSR) can simultaneously cluster two types of features of sample set. The first is the basic feature describing the physical properties of samples themselves by numerical methods. The second kind of feature is obtained by solving a sparse self-representation model based on the basic feature set, and is called a discriminant feature. The discriminant feature comprises the similar degree between each sample with all of the other samples, which reveals the global structure of the whole sample set. The two types of features can have different dimensions and distance measures, because each category in FDCM_SSR has two kinds of clustering centers which respectively correspond to the two types of features. Since combining the basic feature with the discriminant feature reveals the global

Table 17.2: Notations used in Table 17.1.

X = (x_1, x_2, ..., x_n): Sample set
n: Number of samples
c: Number of clusters
u_ij, t_ij: Degree of membership of the j-th sample in the i-th cluster
v_i: Center of the i-th cluster
m, q: Weighting exponents
a, b, λ_s, λ_g: Scale parameters
N_j: Set of neighbors falling into a window around the j-th pixel
N_R: Cardinality of N_j
x̄_j: Average of the neighboring pixels lying within a window around pixel x_j
x̂_j: Median of the neighboring pixels lying within a window around pixel x_j
l: Number of gray levels
g_j: Number of pixels having gray value equal to j
(p_j, q_j): Spatial coordinate of the j-th pixel
d_jr: Euclidean distance between pixels j and r
||·||: Norm
K(x, y): Kernel function

structure information of the sample set, the FDCM_SSR algorithm has favorable performance in data clustering and image segmentation experiments. Many improved FCM algorithms were described in the previous section, but they can only handle a single data set, i.e., one kind of feature of the sample set. FDCM_SSR can deal with two types of features with different dimensions. One is the commonly used feature, known as the basic feature and denoted by X; the basic feature of each sample only represents its basic physical properties. The other is learned from the basic features by the sparse self-representation method. The sparse self-representation coefficients obtained from the sparse self-representation model embody the similarity of samples from the same category and the variance among samples of different classes, which contributes to the clustering. As the sparse self-representation coefficients have good category-distinguishing performance, they are taken as the second kind of feature, referred to as the discriminant feature. Let Z be the discriminant feature set of all samples. Using the two data sets X and Z, the objective function of the proposed algorithm is defined by

J_m = Σ_{i=1}^{c} Σ_{j=1}^{n} u_{ij}^m ( ||x_j − v_i||^2 + α ||z_j − ṽ_i||^2 ),   s.t.  Σ_{i=1}^{c} u_{ij} = 1    (17.6)

where x_j and z_j are the j-th columns of X and Z, respectively, v_i and ṽ_i are the centers of X and Z in the i-th cluster, respectively, and α is a parameter controlling the effect of the discriminant feature. In other words, x_j and z_j are the basic feature and discriminant feature of the j-th sample, respectively; similarly, v_i and ṽ_i are the i-th cluster centers of the basic feature set and the discriminant feature set, respectively.

Table 17.3: Segmentation accuracies and runtime (s) of different algorithms on two artificial images.

To solve the minimization of Eq. (17.6), the cost function in Eq. (17.6) is first rewritten as

f(u_ij) = Σ_{i=1}^{c} Σ_{j=1}^{n} u_{ij}^m ( ||x_j − v_i||^2 + α ||z_j − ṽ_i||^2 ) + β ( Σ_{i=1}^{c} u_{ij} − 1 ).

By differentiating f(u_ij) with respect to u_ij, v_i, ṽ_i, and the Lagrange multiplier β, we obtain the following formulations:

u_ij = ( −β / ( m ( ||x_j − v_i||^2 + α ||z_j − ṽ_i||^2 ) ) )^{1/(m−1)}    (17.7)

v_i = Σ_{j=1}^{n} u_{ij}^m x_j / Σ_{j=1}^{n} u_{ij}^m    (17.8)

ṽ_i = Σ_{j=1}^{n} u_{ij}^m z_j / Σ_{j=1}^{n} u_{ij}^m    (17.9)

Σ_{i=1}^{c} u_{ij} = 1    (17.10)

Then, by plugging Eq. (17.7) into Eq. (17.10), we get

(−β)^{1/(m−1)} = 1 / Σ_{i=1}^{c} ( 1 / ( m ( ||x_j − v_i||^2 + α ||z_j − ṽ_i||^2 ) ) )^{1/(m−1)}

so that

u_ij = ( Σ_{k=1}^{c} ( ( ||x_j − v_i||^2 + α ||z_j − ṽ_i||^2 ) / ( ||x_j − v_k||^2 + α ||z_j − ṽ_k||^2 ) )^{1/(m−1)} )^{−1}    (17.11)

The procedure of the FDCM_SSR clustering algorithm is described in Algorithm 17.1. An obvious difference between the proposed FDCM_SSR algorithm and existing fuzzy clustering methods is that FDCM_SSR can simultaneously process two kinds of features of the same sample set. Different types of features cannot simply be concatenated, because different feature descriptors may potentially have different data distribution probabilities


Algorithm 17.1 Fuzzy double C-means based on sparse self-representation (FDCM_SSR)
Step 1: Input the basic feature set X = (x_1, x_2, ..., x_n) and the number of clusters c.
Step 2: Initialize the weighting exponent m, the convergence threshold η, the degree-of-membership matrix U^(0) = {u_ij^(0); 1 ≤ i ≤ c, 1 ≤ j ≤ n}, and the loop counter b = 0.
Step 3: Obtain the discriminant feature set Z by solving the sparse self-representation model in Eq. (17.5).
Step 4: Update the loop counter b = b + 1.
Step 5: Update the cluster centers v_i (1 ≤ i ≤ c) of the basic feature set X by using Eq. (17.8).
Step 6: Update the cluster centers ṽ_i (1 ≤ i ≤ c) of the discriminant feature set Z by using Eq. (17.9).
Step 7: Update the degree-of-membership matrix U^(b) = {u_ij^(b)} by using Eq. (17.11).
Step 8: If max |U^(b) − U^(b−1)| < η, stop; otherwise, return to Step 4.
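A compact NumPy sketch of this loop is given below; it assumes X and Z store one sample per column, uses random initialization, and folds Steps 4-8 into a single iteration loop. Parameter values are illustrative.

import numpy as np

def fdcm_ssr(X, Z, c, m=2.0, alpha=1.0, max_iter=100, eta=1e-5, seed=0):
    # X: basic features (d1 x n); Z: discriminant features (d2 x n); alpha weighs the discriminant term.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        V = (X @ Um.T) / Um.sum(axis=1)                            # centers of X, Eq. (17.8)
        Vt = (Z @ Um.T) / Um.sum(axis=1)                           # centers of Z, Eq. (17.9)
        Dx = ((X[:, None, :] - V[:, :, None]) ** 2).sum(axis=0)    # ||x_j - v_i||^2
        Dz = ((Z[:, None, :] - Vt[:, :, None]) ** 2).sum(axis=0)   # ||z_j - v~_i||^2
        Dtot = Dx + alpha * Dz + 1e-12
        U_new = 1.0 / np.sum((Dtot[:, None, :] / Dtot[None, :, :]) ** (1.0 / (m - 1.0)), axis=1)
        # membership update, Eq. (17.11)
        if np.abs(U_new - U).max() < eta:                          # stopping test of Step 8
            U = U_new
            break
        U = U_new
    return U, V, Vt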

s:t: kSi krow;0  K; qi ¼ smooth=oriented; i ¼ 1; 2; :::; L

(18.13)

680 Chapter 18 where kSi krow;0 calculates the number of the nonzero rows of Si. Following Si being ! si is equal to the first estimated, the image blocks are estimated by xi ¼ Di Si , where ! ! column of Si . For a stochastic block xi, i ¼ 1, 2, ., L, qi ¼ stochastic, it is represented as the sum of þ xhigh ¼ Dsmoothsi ð1Þ þ DCsmoothsi ð2Þ , where si ð1Þ ˛ RND1 and two parts: xi ¼ xlow i i si ð2Þ ˛ RND2. The estimation is separated into two successive parts. The collaborative reconstruction is described below. 1. Solve the following problem: Si

ð1Þ

¼ arg minkY i fDsmooth Sk2 ; s.t. kSkrow;0  K = 2

(18.14)

S

Then xlow ¼ Dsmooth si ð1Þ , where si ð1Þ is equal to the first column of Si i ð1Þ DsmoothSi .

ð1Þ

. Let YiC ¼ Yi-f

2. Solve Si

ð2Þ

2  ¼ arg minYiC  fDCsmooth S ; s.t. kSkrow;0  K = 2

(18.15)

S

¼ DCsmooth si ð2Þ , where si ð2Þ is equal to the first column of Si Then xhigh i

ð2Þ

.

3. The image block is estimated by xi ¼ Dsmooth si ð1Þ þ DCsmooth si ð2Þ The problems of Eqs. (18.13), (18.14), and (18.15) can all be solved by the Simultaneous Orthogonal Matching Pursuit (SOMP) [21,23] method with appropriate parameter settings. In this process, the image is estimated by X¼(x1 , x2 , .xL ). In the second process of GS\_CR, an image block is estimated by the collaboration of two groups of blocks. The first group is composed of the blocks collaborated in the first process; and the second group is composed of its N2 (N2 ¼ 8) neighbors. Denote the collection of the N1þN2 blocks for xi by Xi. Each xj ˛Xi provides a collaborative solution xj i for xi. The best one of {xj i } is the final estimation of GS_CR for  2   i xi: xi* ¼ arg min yi  fxj  xj i

2

In CR_CS, the method of deriving xj i from xj is time-consuming. In GS_CR, the reconstruction is hastened by establishing different collaborative patterns adapted to the geometric types. The new patterns are the simpler and special cases of that in CR_CS. For convenience, the general collaborative pattern, i.e., the one presented in CR_CS, is first elaborated before the introduction of the new patterns.

Compressed sensing by collaborative reconstruction

681

Rewrite the sparse representation formula by further removing the zero components from the coefficient vectors, no matter the type of block. The new formula is given by: xj ¼ Dj sj , where Dj comprises the nonzero-weighted atoms choosing for xi and Dj 3D; the nonzero vector sj comprises the combinatorial coefficients of Dj . The refinement is committed in Dj and sj . Suppose Dj ¼(dj1, dj1, ., djK)˛RBK, where the subscripts j1, j2, .,jK ˛{1, 2, ., N are the indices of the atoms in D. Denote the relationship between two different atoms, dk1, dk2 ˛ D, satisfied ak1 ¼ ak2, qk1 ¼ qk1 , and bk1 ¼ bk2,þbk0 by: dk1 ¼ TS(dk2, b0) and dk2 ¼ TS(dk1, -b0), where the two atoms have the same shapes but different ridge positions. In the general collaborative pattern, xj i is acquired from xj by solving: 

  i Dj ; sj i ¼ arg minyi  fDs; ðD;sÞ (18.16) s:t: D ˛ fðTS ðdj1 ; b1 Þ; TS ðdj2 ; b2 Þ; :::; TS ðdjK ; bK ÞÞg;  bk  bmax bmin k k ; k ¼ 1; 2; :::; K; D3D where bmin and bmax are the minimum and maximum values, respectively, to keep k k TS ðdjK ; bK Þ, k ¼ 1, 2, ., K, to be the members of D. The collaborative solution is i obtained by xj i ¼ Dj sj i. The problem of Eq. (18.16) is a combinatorial problem of D, where the value of s depends on D and is calculated by the least squared equation s¼(fD)þyi. It is solved by iteratively evaluating and updating the atoms in D as presented in Algorithm 3 of Ref. [21], which is time consuming. The collaboration model is based on the autoregressive model [21] that an image block could be represented by the atoms and/or their shifted versions used to represent its local and nonlocal neighbors. In GS_CR, since the geometric structure of the blocks is estimated, the similarity and autoregressive models between the blocks of different geometric types could be further investigated. The way to derive xj i from xj will be determined by the geometric type of xj, i.e., qj. In the very special case that i ¼ j, it is obvious that xj i ¼ xj ; and in the case that qj ¼ stochastic, the collaborative pattern is the general one. When the smooth blocks are considered, the shifting of the atoms used for representation would not significantly affect the representation accuracy, because the atoms are low frequency and with very wide ridges. Therefore, in the case that qj ¼ smooth, the shifting of the atoms could be ignored and the collaboration pattern would be very simple. The i collaborative solution xj i is computed by xj i ¼ Dj sj i, where s¼(fD)þyi is the least squared solution. In this case, two adjacent smooth blocks are assumed to be represented by a common group of atoms, but they are combined by different weighting coefficients. In an oriented image block, there is a sharp line-like edge. When all the atoms representing the block are shifted by imposing the same variety on their shift parameters, there will be a shifted edge in the resulted block. Accordingly, the assumption is cast that

682 Chapter 18 two adjacent directional blocks could be presented by two groups of atoms with the shift i variety in the atomic pairs. In the case that qj ¼ oriented, xj i ¼ Dj sj i is acquired by solving: 

$$\left(D_j^i,\ s_j^i\right) = \arg\min_{(D,\,s)} \left\| y_i - \Phi D s \right\|,
\quad \text{s.t. } D \in \left\{\left(T_S(d_{j_1}, b_0),\ T_S(d_{j_2}, b_0),\ \ldots,\ T_S(d_{j_K}, b_0)\right)\right\},\ \ b_0^{\min} \le b_0 \le b_0^{\max} \tag{18.17}$$

where $b_0^{\min}$ and $b_0^{\max}$ are the minimum and maximum shift values, respectively, that keep all the members of $D$ atoms of the dictionary. Comparing the problems in Eqs. (18.17) and (18.16), the number of candidate solutions in Eq. (18.17) is much smaller than that in Eq. (18.16). The problem can be solved by enumerating all the possible solutions of $D$ and evaluating them individually. In sum, there are three collaborative patterns designed for the task of information exchange between blocks. The new patterns for the smooth and oriented blocks are better adapted to the similarity relationship between the blocks and the atoms used for representation. The new patterns are also much simpler and easier to solve than the general pattern. Hence GS_CR can outperform CR_CS in both reconstruction accuracy and speed.
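To make the enumeration concrete, the following sketch (Python/NumPy) evaluates every candidate common shift $b_0$ for an oriented block and keeps the shifted sub-dictionary with the smallest measurement-domain residual, as in Eq. (18.17). It is an illustration rather than the authors' code: the shift operator `shift_atom` (standing in for $T_S$), the argument names, and the dense pseudoinverse are assumptions.

```python
import numpy as np

def oriented_collaboration(y_i, Phi, D_j, shift_atom, b_candidates):
    """Enumerative solution of the oriented-block pattern (Eq. 18.17):
    all K atoms share one candidate shift b0; keep the b0 whose
    least-squares fit gives the smallest residual ||y_i - Phi D s||_2."""
    best_err, best_estimate = np.inf, None
    for b0 in b_candidates:
        # shift every atom of D_j by the same amount (T_S applied column-wise)
        D = np.column_stack([shift_atom(D_j[:, k], b0) for k in range(D_j.shape[1])])
        s = np.linalg.pinv(Phi @ D) @ y_i          # least-squares coefficients s = (Phi D)^+ y_i
        err = np.linalg.norm(y_i - Phi @ D @ s)    # residual in the measurement domain
        if err < best_err:
            best_err, best_estimate = err, D @ s   # collaborative estimate x_j^i = D s
    return best_estimate
```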

18.3 Experiment

18.3.1 Collaborative reconstruction method based on an overcomplete dictionary

The performance of the algorithms is tested in this section. The experiments are performed on the five natural images of size 512 × 512 depicted in Fig. 18.3. Each image is divided into 1024 nonoverlapping blocks of size 16 × 16. All the reconstruction results are obtained as the average of 50 trials at each data ratio. For each trial, a random Gaussian measurement matrix is used for sampling. The data ratio is the ratio of the number of measurements to the number of pixels of the image to be reconstructed.

Figure 18.3 Natural images used in the experiments.
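The sampling step described above can be sketched as follows (an illustrative NumPy snippet, not the authors' code; the function name, the normalization of the Gaussian matrix, and the fixed random seed are assumptions).

```python
import numpy as np

def sample_blocks(image, block=16, data_ratio=0.3, seed=0):
    """Block-wise compressed sampling as described above: split the image
    into nonoverlapping block x block tiles and measure each vectorized
    tile with one random Gaussian matrix Phi."""
    rng = np.random.default_rng(seed)
    B = block * block                               # block dimension (256 for 16x16)
    M = int(round(data_ratio * B))                  # measurements per block
    Phi = rng.standard_normal((M, B)) / np.sqrt(M)  # Gaussian measurement matrix
    H, W = image.shape
    blocks = (image.reshape(H // block, block, W // block, block)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, B))                 # one row per block
    Y = blocks @ Phi.T                              # measurements y_i = Phi x_i, one row per block
    return Phi, Y
```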


The sparsity parameter is fixed at K = 32, i.e., each block is assumed to be sparsely represented by 32 atoms of a dictionary containing 11,281 atoms. In the first collaborative process, each image block is estimated by the collaboration of its eight nonlocal neighbors; in the second process, each block is estimated by the collaboration of its four nonlocal and eight local neighbors. The reconstruction performance is evaluated in terms of PSNR and SSIM [24]. The first experiment tests each of the collaboration models. Fig. 18.4 gives the reconstruction results on Boats with a data ratio of 0.3. Four results are presented: the result of applying OMP to each block without collaboration; the result of applying the IHT algorithm [25,26] in the second collaborative process on top of the first result (denoted by OMP + IHT); the result of the first process of CR_CS (denoted by J_CS); and the overall result of CR_CS. The OMP + IHT method tests the second collaborative process separately and shows that OMP could be replaced by other methods.
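For reference, PSNR, one of the two quality measures used here, can be computed as in the following sketch (a generic helper for 8-bit images; SSIM requires a dedicated implementation such as that of Ref. [24]).

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio (dB) between an original and a reconstructed image."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(reconstruction, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```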

Figure 18.4 The reconstruction results of Boats by different methods, data ratio = 0.3: (A) original; (B) OMP, 25.87 dB (0.8712); (C) OMP + IHT, 27.71 dB (0.8847); (D) J_CS, 28.34 dB (0.8703); (E) CR_CS, 28.88 dB (0.8864).

Table 18.1: The PSNR (dB) and SSIM results of the reconstructed images by different methods; columns correspond to the data ratio.

Image      Method    0.2              0.3              0.4              0.5
Barbara    OMP       21.30 (0.7536)   23.20 (0.8363)   25.45 (0.8995)   27.15 (0.9102)
           J_CS      25.64 (0.7793)   26.89 (0.8587)   27.98 (0.9049)   28.79 (0.9289)
           CR_CS     25.71 (0.7862)   26.92 (0.8718)   28.52 (0.9204)   29.30 (0.9423)
Lena       OMP       25.30 (0.8395)   28.31 (0.9112)   30.15 (0.9362)   32.15 (0.9531)
           J_CS      28.50 (0.8728)   29.97 (0.9217)   31.28 (0.9463)   32.40 (0.9577)
           CR_CS     28.22 (0.8712)   30.33 (0.9273)   31.73 (0.9529)   32.99 (0.9644)
Einstein   OMP       28.16 (0.8421)   29.17 (0.8447)   32.66 (0.9283)   34.30 (0.9375)
           J_CS      30.83 (0.8521)   32.37 (0.9035)   33.46 (0.9314)   34.16 (0.9468)
           CR_CS     31.01 (0.8501)   32.81 (0.9127)   33.94 (0.9431)   34.99 (0.9578)
Boats      OMP       22.25 (0.7632)   24.64 (0.8301)   27.39 (0.9096)   29.07 (0.9266)
           J_CS      26.45 (0.7971)   27.83 (0.8701)   28.89 (0.9100)   29.85 (0.9301)
           CR_CS     26.34 (0.7934)   28.19 (0.8800)   29.50 (0.9233)   30.57 (0.9427)
Peppers    OMP       23.70 (0.7672)   28.17 (0.8953)   30.18 (0.9296)   31.44 (0.9455)
           J_CS      28.24 (0.8584)   29.50 (0.9085)   30.78 (0.9345)   31.68 (0.9467)
           CR_CS     28.15 (0.8559)   29.95 (0.9143)   31.17 (0.9421)   32.14 (0.9543)
Mean       OMP       24.14 (0.7931)   26.70 (0.8635)   29.17 (0.9206)   30.82 (0.9346)
           J_CS      27.93 (0.8319)   29.31 (0.8925)   30.48 (0.9254)   31.38 (0.9420)
           CR_CS     27.89 (0.8314)   29.64 (0.9012)   30.97 (0.9364)   32.00 (0.9523)

By comparing the images in Fig. 18.4, it can be seen that the image produced by J_CS is better than that produced by OMP without collaboration. It can also be seen that the second collaborative model improves the image resulting from the first process, whether or not collaboration is used in that first process. Table 18.1 compares the average results of the different reconstruction methods. The results of the first process exceed those of OMP in most cases, especially when the data ratio is low. The second collaborative process further improves on the first. Though the gains in PSNR and SSIM are modest, the reconstructed images show visibly fewer blocking and false artifacts, as can be seen in Figs. 18.5–18.7, which show the best reconstruction results among the 50 reconstructed images with a data ratio of 0.3. It can also be seen from the enlarged local parts of the images that the images produced by CR_CS have clearer and more consistent edges and textures.

18.3.2 Geometric structure-guided collaborative reconstruction method

The methods are tested on the same five 512 × 512 natural images shown in Fig. 18.3. Each reconstruction result is the average of 30 trials, owing to the randomness of the sampling operator. For each trial, the blocks are sampled by a Gaussian matrix.


Figure 18.5 The reconstruction results of Barbara by different methods, data ratio = 0.3: (A) original; (B) part of (A); (C) part of (A); (D) OMP, 23.86 dB (0.8570); (E) part of (D); (F) part of (D); (G) J_CS, 27.31 dB (0.8561); (H) part of (G); (I) part of (G); (J) CR_CS, 27.75 dB (0.8784); (K) part of (J); (L) part of (J).


Figure 18.6 The reconstruction results of Lena by different methods, data ratio = 0.3: (A) original; (B) part of (A); (C) OMP, 28.87 dB (0.9165); (D) part of (C); (E) J_CS, 29.92 dB (0.9100); (F) part of (E); (G) CR_CS, 30.64 dB (0.9302); (H) part of (G).


Figure 18.7 The reconstruction results of Peppers by different methods, data ratio = 0.3: (A) original; (B) OMP, 28.38 dB (0.8999); (C) J_CS, 29.79 dB (0.9096); (D) CR_CS, 30.26 dB (0.9151).

Each block is assumed to be represented by K = 32 out of N = 12,195 atoms. The reconstruction performance is evaluated by the PSNR and SSIM values [24] and by the visual quality of the reconstructed images. Five reconstruction methods are compared. The first is the TV (total variation) constrained reconstruction model, solved by the TVAL method [27]. The second is OMP [16] applied block-wise. The third, named Geometric Structured CS (GCS), tests the geometric structured sparsity models; it is realized by executing the first process of GS_CR without collaboration, i.e., with N1 = 1.

Table 18.2: The average PSNR (dB) and SSIM of the methods; columns correspond to the data ratio.

Image      Method    0.15             0.2              0.25             0.3
Lena       TV        27.24 (0.8469)   28.58 (0.8965)   29.30 (0.9213)   30.48 (0.9393)
           OMP       24.73 (0.8200)   25.73 (0.8581)   26.88 (0.8882)   27.89 (0.9122)
           GCS       29.87 (0.7786)   29.77 (0.8697)   29.64 (0.9045)   30.32 (0.9152)
           CR_CS     27.37 (0.7890)   28.65 (0.8687)   28.74 (0.8986)   30.20 (0.9198)
           GS_CR     29.89 (0.8112)   30.65 (0.8731)   30.85 (0.9046)   30.82 (0.9239)
Einstein   TV        24.85 (0.7360)   26.12 (0.7897)   26.38 (0.8173)   26.15 (0.8423)
           OMP       18.37 (0.6755)   20.00 (0.7242)   20.79 (0.7504)   22.94 (0.8379)
           GCS       26.46 (0.6875)   26.54 (0.8127)   26.51 (0.8355)   27.34 (0.8828)
           CR_CS     24.22 (0.6958)   25.60 (0.7901)   25.79 (0.8265)   26.90 (0.8700)
           GS_CR     27.25 (0.7249)   27.90 (0.8275)   27.64 (0.8566)   27.65 (0.8995)
Peppers    TV        29.75 (0.8538)   31.43 (0.8957)   32.13 (0.9166)   33.45 (0.9351)
           OMP       25.44 (0.7556)   26.62 (0.8228)   28.64 (0.8614)   28.67 (0.8815)
           GCS       29.22 (0.7214)   31.27 (0.8444)   31.76 (0.8877)   32.68 (0.9115)
           CR_CS     29.76 (0.7793)   30.87 (0.8465)   31.54 (0.8778)   32.76 (0.9126)
           GS_CR     30.13 (0.7689)   31.45 (0.8566)   32.08 (0.8881)   32.90 (0.9207)
Boats      TV        27.25 (0.8550)   28.61 (0.9010)   30.02 (0.9247)   30.93 (0.9407)
           OMP       22.09 (0.7778)   23.48 (0.8262)   26.23 (0.8708)   27.44 (0.8840)
           GCS       27.65 (0.7485)   28.67 (0.8503)   29.52 (0.8728)   30.65 (0.9059)
           CR_CS     26.86 (0.7919)   28.10 (0.8531)   28.57 (0.8775)   29.99 (0.9102)
           GS_CR     28.41 (0.7862)   29.33 (0.8671)   30.02 (0.8781)   30.87 (0.9153)
Barbara    TV        26.35 (0.8011)   27.03 (0.8517)   27.89 (0.8868)   28.95 (0.9131)
           OMP       19.62 (0.7045)   22.61 (0.7666)   22.36 (0.8101)   24.82 (0.8476)
           GCS       26.70 (0.6293)   26.32 (0.7862)   27.13 (0.8366)   27.82 (0.8686)
           CR_CS     24.92 (0.7018)   26.45 (0.7865)   26.22 (0.8335)   27.89 (0.8729)
           GS_CR     26.96 (0.6746)   27.01 (0.7889)   27.18 (0.8421)   28.26 (0.8746)

Figure 18.8 Average time needed by different methods for Lena's reconstruction (reconstruction time in seconds versus data ratio, for TV, OMP, GCS, CR_CS, and GS_CR).


Figure 18.9 The estimation of Barbara by different methods, data ratio = 0.25: (A) original; (B) TV, 27.39 dB (0.8214); (C) OMP, 22.30 dB (0.8046); (D) GCS, 28.78 dB (0.8484); (E) CR_CS, 26.78 dB (0.8273); (F) GS_CR, 28.75 dB (0.8637); (G) part of (A); (H) part of (B); (I) part of (C); (J) part of (D); (K) part of (E); (L) part of (F).


Figure 18.10 The estimation of Lena by different methods, data ratio = 0.25: (A) original; (B) TV, 29.58 dB (0.9229); (C) OMP, 26.98 dB (0.8882); (D) GCS, 30.60 dB (0.9046); (E) CR_CS, 28.91 dB (0.9003); (F) GS_CR, 31.25 dB (0.9162); (G) part of (A); (H) part of (B); (I) part of (C); (J) part of (D); (K) part of (E); (L) part of (F).


The last two methods are CR_CS and GS_CR. The numerical results of the methods are displayed in Table 18.2. It is obvious that GCS and CR_CS outperform OMP by a large margin, which shows that both the proposed geometric structured sparsity models and the collaborative reconstruction models are effective. It can also be seen that GS_CR, which makes hybrid use of geometric structures and collaborative models, outperforms the other methods, except the TV method, on almost all items. The TV method is a classic convex reconstruction method that can be solved by fast numerical optimization techniques; the TV constraint characterizes the piecewise smooth property of natural images. Though the numerical results of TV in Table 18.2 are good, the method fails to capture local geometric structures, as shown later. Fig. 18.8 gives a rough comparison of the time needed to reconstruct the Lena image. Both TV and OMP are quite simple and take little time. GCS spends additional time estimating the geometric structure before applying OMP, so its running time is slightly longer than that of OMP. CR_CS takes the longest time, since it applies the single general collaborative pattern to every block pair. The running time of GS_CR is much less than that of CR_CS. Figs. 18.9 and 18.10 display some of the reconstructed images with a data ratio of 0.25. By comparing the resulting images and the enlarged local parts, it can be seen that GCS is good at reconstructing local geometric structures, but its performance depends heavily on the correctness of the estimated geometric structures. CR_CS helps maintain local and nonlocal structural consistency, but fails to capture sharp edges and textures. GS_CR partly overcomes these shortcomings and achieves the best results. The TV method performs well in the smooth regions of the images, but fails to obtain sharp edges and recover line-like textures. One way to improve the TV method is to incorporate other priors; several methods are available for problems regularized by a combination of the l1 norm, TV, and other constraints [28–30].

References
[1] Donoho D. Compressed sensing. IEEE Transactions on Information Theory 2006;52:1289–306.
[2] Candès E. Compressive sampling. In: Proceedings of the international congress of mathematicians; 2006. p. 1433–52.
[3] Candès E, Romberg J, Tao T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 2006;52:489–509.
[4] Donoho D, Elad M, Temlyakov V. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory 2006;52:6–18.
[5] Rauhut H. Compressed sensing and redundant dictionaries. IEEE Transactions on Information Theory 2008;54:2210–9.
[6] Wu J, Liu F, Jiao L, Wang X, Hou B. Multivariate compressive sensing for image reconstruction in the wavelet domain: using scale mixture models. IEEE Transactions on Image Processing 2011;20:489–509.
[7] Mallat S, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 1993;12:3397–415.
[8] Olshausen B, Field D. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 1997;23:3311–25.
[9] Olshausen B, Field D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 1996;381:607–9.
[10] Gan L. Block compressed sensing of natural images. In: Proceedings of international conference on digital signal processing; 2007. p. 403–6.
[11] Mun S, Fowler J. Block compressed sensing of images using directional transforms. In: Proceedings of international conference on image processing; 2009. p. 3021–4.
[12] Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. Non-local sparse models for image restoration. In: Proceedings of IEEE 12th international conference on computer vision; 2009. p. 2272–9.
[13] Dasgupta S, Gupta A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 2003;22:60–5.
[14] Eldar Y, Kutyniok G, editors. Compressed sensing: theory and applications. Cambridge University Press; 2012.
[15] Pati Y, Rezaiifar R, Krishnaprasad P. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In: Conference record of the twenty-seventh Asilomar conference on signals, systems and computers; 1993. p. 40–4.
[16] Tropp J, Gilbert A. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory 2007;53:4655–66.
[17] Cotter S, Rao B, Engan K, Kreutz-Delgado K. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing 2005;53:2477–88.
[18] Tropp J, Gilbert A, Strauss M. Simultaneous sparse approximation via greedy pursuit. In: Proceedings of the international conference on acoustics, speech, and signal processing; 2005.
[19] Li X, Orchard MT. New edge-directed interpolation. IEEE Transactions on Image Processing 2001;10:1521–7.
[20] Yang S, Wang M, Chen Y, Sun Y. Single-image super-resolution reconstruction via learned geometric dictionaries and clustered sparse coding. IEEE Transactions on Image Processing 2012;21:4016–28.
[21] Baraniuk RG, Cevher V, Duarte MF, Hegde C. Model-based compressive sensing. IEEE Transactions on Information Theory 2010;56:1982–2001.
[22] Lin L, Liu F, Jiao L. Geometric structure guided collaborative compressed sensing. Signal Processing: Image Communication 2016;40:16–25.
[23] Lin L, Liu F, Jiao L. Compressed sensing by collaborative reconstruction on overcomplete dictionary. Signal Processing 2014;103:92–102.
[24] Wang Z, Bovik A, Sheikh H, Simoncelli E. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 2004;13:600–12.
[25] Blumensath T, Davies M. Iterative hard thresholding for compressive sensing. Applied and Computational Harmonic Analysis 2009;27:265–74.
[26] Blumensath T. Accelerated iterative hard thresholding. Signal Processing 2012;92:752–6.
[27] Li C. Compressive sensing for 3D data processing tasks: applications, models and algorithms (Ph.D. thesis). Rice University; 2011.
[28] Lustig M, Donoho D, Pauly JM. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine 2007;58:1182–95.
[29] Goldstein T, Osher S. The split Bregman method for l1-regularized problems. SIAM Journal on Imaging Sciences 2009;2:323–43.
[30] Hosseini MS, Plataniotis KN. High-accuracy total variation with application to compressed video sensing. IEEE Transactions on Image Processing 2014;23:3869–84.

Chapter 19

Hyperspectral image classification based on spectral information divergence and sparse representation

Chapter Outline
19.1 The research status and challenges of hyperspectral image classification
    19.1.1 The research status of hyperspectral image classification
    19.1.2 The challenges of hyperspectral image classification
19.2 Motivation
19.3 Spectral information divergence (SID)
19.4 Sparse representation classification method based on SID
19.5 Joint sparse representation classification method based on SID
19.6 Experimental results and analysis
    19.6.1 Comparison of the measurements
    19.6.2 Comparison of the performance of sparse representation classification methods
    19.6.3 Analysis of parameters
    19.6.4 The proof of convergence
References

19.1 The research status and challenges of hyperspectral image classification

The classification problem of hyperspectral images is a typical pattern recognition problem. However, hyperspectral images differ from traditional natural images and multispectral images, and the unique characteristics of the data bring huge challenges to image analysis and processing [1].

19.1.1 The research status of hyperspectral image classification

With the development of computer graphics, spectroscopy, machine learning, and pattern recognition theory, and after years of research and exploration, very promising hyperspectral image classification methods have been proposed. These methods can be



classified from different angles, such as the type of features, whether prior information is used, and the models adopted. By the type of features, these methods can be divided into classification methods based on data statistical characteristics and classification methods based on spectral features. The features used in classification generally include the original spectral characteristics, extracted texture or statistical features, or some integrated features. In addition, based on the number of features used in the classification process, classification methods can also be divided into single-feature and multifeature classifications. By considering spectral mixing, classification models can be divided into pure-pixel models and mixed models: pure-pixel models do not consider spectral mixing and output hard classification results, whereas mixed models consider the mixing phenomenon of the spectrum and generally output soft classification results. Based on the combination of image and spectrum in hyperspectral data, classification methods can be divided into pixelwise classifications (PWCs) and spectral-spatial classifications (SSCs). Pixelwise classifiers treat each pixel as an independent individual; SSC methods not only consider the spectral information of each pixel but also exploit the spatial structure of the image. Based on the number of classifiers, these methods can be divided into single-classifier and multiclassifier methods, and based on whether training samples are used, they can be divided into supervised and unsupervised classification. The above categorization has been widely recognized in the hyperspectral remote sensing community. Two research directions of hyperspectral image classification methods, distinguished by feature type, are briefly introduced in the following text.

The first is the classification method based on spectral characteristics, which includes the spectral matching method and the mixed pixel decomposition method. The spectral matching method identifies the object by comparing the match between object spectra and reference or standard spectra. The main spectral matching methods include minimum distance matching, spectral angle matching, cross-correlation spectral matching, spectral encoding matching, and spectral characteristic parameter matching. The mixed pixel decomposition technique assumes that each pixel is mixed, in certain proportions, from the spectra of several ground objects; it uses unmixing analysis to estimate the mixing mode and the corresponding proportions, achieving subpixel classification. Spectral unmixing is generally divided into two steps: endmember extraction and mixing-model estimation. Classification methods based on spectral features are simple to operate and do not require significant prior knowledge of the ground, so they have been applied in many fields. However, this kind of classification method is very dependent on the spectral


information of hyperspectral images. When the images are affected by climatic changes, the spectral curves change, the phenomena of "different objects with the same spectrum" or "the same object with different spectra" occur, and the classification performance of this kind of method suffers.

The second is the classification method based on data statistical characteristics. After many years of research, many mature and classic classification methods have been developed, including decision tree classification, maximum likelihood classification, the k-nearest neighbor method [2], logistic regression, the Bayes classifier [3], the expectation-maximization algorithm, the Gaussian process classifier [4], ensemble learning, artificial neural network classifiers [5], deep learning, and the support vector machine (SVM) [6]. These methods make decisions mainly through the analysis of statistical regularities, so they are relatively insensitive to noise and perform well in practical applications. However, they rely on the law of large numbers, which requires a certain number of training samples and prior knowledge. For hyperspectral images, it is often impossible to obtain enough training samples to estimate such prior knowledge accurately. Therefore, designing a classification method that performs well on hyperspectral data remains an open problem.

19.1.2 The challenges of hyperspectral image classification

Hyperspectral image data usually have high dimensionality, a large data volume, susceptibility to noise, big intraclass differences, and nonlinear class structure, which make hyperspectral image classification very difficult.

(1) The curse of dimensionality. With the development of remote sensing technology, spectral resolution becomes higher and the feature dimension of the data also increases. Theoretically, the increase in spectral information can describe the object more comprehensively, enhance the discriminative ability, contribute to the classification task, and increase the classification accuracy. However, as the number of spectral bands increases, the feature dimension increases accordingly, and parameter estimation for traditional classification models requires more training samples. In hyperspectral applications, obtaining training samples often incurs a high cost, so the number of training samples is very limited; as a result, the accuracy of the classifier's parameter estimation is greatly reduced and the classification accuracy drops. This is the so-called "curse of dimensionality" problem, also known as the Hughes phenomenon. In summary, ensuring the classification accuracy of the whole image with limited training samples is a very challenging problem.
(2) A large amount of data. Since the start of the 21st century, remote sensing technology has made breakthroughs, imaging systems that provide images of high spatial and temporal resolution have become a research focus, and the amount of image data has increased markedly. At the same time, the demand for real-time data analysis and processing is increasing. Although hardware technology has developed rapidly, it cannot meet these requirements. Therefore, how to improve classification efficiency while ensuring precision is an urgent problem to be solved.
(3) Noise. The noise produced by the atmosphere, sensing instruments, quantization, and data transmission is often superimposed on the signal collected by the sensor. Therefore, hyperspectral image classifiers need to account for the effects of noise with a degree of uncertainty, which increases the difficulty of classification.
(4) Big differences within classes. Hyperspectral images capture object information over a large area, so the number of object classes is large. In addition, the distribution and spectral response mechanisms are very complex, and the phenomena of the same material exhibiting different spectra and different materials exhibiting similar spectra often appear. The large number of classes and the large differences within classes bring great challenges to traditional classification methods.
(5) Multifeature fusion. Experience shows that a single feature can only describe pixels in one aspect, and no single feature has high discrimination for all categories. Therefore, multifeature representation and fusion methods are hotspots of current research.
(6) Nonlinear problem. Factors such as differences in reflectance and illumination conditions, and similar objects with different spectral curves, make the classification of hyperspectral data a nonlinear problem, which presents a great challenge to traditional linear classification algorithms.

Due to the above problems, traditional classification methods often exhibit poor classification performance. In order to make full use of the abundant information contained in hyperspectral data, a new classification method needs to combine remote sensing information science, pattern recognition, computational intelligence, and theories and techniques from other disciplines.

19.2 Motivation

Hyperspectral images are obtained by imaging a target area with airborne or satellite sensors; they contain information on objects in tens to hundreds of contiguous, finely divided bands from the visible to the infrared spectral region. Hyperspectral images have applications in many fields, such as the military, agriculture, and mineral exploration. In each of these applications, hyperspectral image classification is particularly important, the task being to determine the category of each pixel. In recent years, with the development of the theory of compressed sensing [7,8] and sparse modeling methods [9], sparse representation has been widely used in the fields of computer


vision and pattern recognition [10]. Among these applications, classification based on sparse representation has aroused the interest of many researchers. Although hyperspectral images have high dimensionality, similar pixels are commonly distributed in the same low-dimensional subspace. For this reason, researchers have built acquisition equipment based on compressed sensing to obtain hyperspectral images [11]. Reference [12] presents an unsupervised learning tool based on a sparse dictionary for geological exploration. Researchers have also shown that each class of data in a hyperspectral image is approximately distributed in a low-dimensional linear subspace [13], and have provided a fast homotopy-based sparse representation classification method. Chen et al. [14,15] proposed a classification method for hyperspectral images based on sparse representation, showing that a hyperspectral pixel can be represented linearly and sparsely by atoms in a structured dictionary; this method takes the spatial texture information into account in the sparse reconstruction. Although the sparse representation classifier (SRC) has many applications in hyperspectral image classification, these methods were not designed specifically for hyperspectral image classification problems, so they do not comprehensively consider the spectral characteristics of hyperspectral images or their structured spectral and spatial characteristics. The traditional SRC uses the ℓ2 norm, namely the Euclidean distance (ED), to measure the reconstruction error. With the development of remote sensing technology, spectral resolution and coverage have been greatly improved, but the limited spatial resolution causes some pixels to be mixtures of a variety of materials. On the other hand, due to the influence of the atmosphere, the absorption or reflection of different materials in different bands can change with uncertainty and randomness. These factors cause variations in hyperspectral image data. Therefore, a traditional SRC that uses the Euclidean distance to compute the similarity between pixels is not the best choice. In order to better measure the similarity between pixels, Chang [16] proposed spectral information divergence (SID) and proved that it measures spectral variation better than the commonly used Euclidean distance or spectral angle mapper (SAM) (under certain conditions, the Euclidean distance is equivalent to SAM). The SID measurement has been widely used in hyperspectral image analysis, such as unmixing [17,18], spectrum matching [16,19], target recognition [20], classification [21], and band analysis [22]. Like the Euclidean distance (ED) and SAM, SID can be used as a measure of similarity between two pixels. In order to describe the spectral changes, similarities, and discriminability of hyperspectral pixels more effectively in sparse representation classification, this section combines the advantages of spectral information divergence and sparse representation and introduces a SID-based sparse representation classification method (SID-based SRC). In addition, many works have demonstrated that spatial information is of great help to classification performance; this section therefore also introduces the joint sparse representation classification method based on SID (SID-based JSRC). This method considers the spectrum, spatial neighborhood information, and the sparsity of the data under the framework of sparse

representation, and realizes the joint classification of spectral and spatial information, achieving satisfactory results.

19.3 Spectral information divergence (SID)

Given a pixel $x = (x_1, x_2, \ldots, x_b)^T$, each element $x_i$ in vector $x$ corresponds to the reflectance of the band with wavelength $\lambda_i$, and all the elements are nonnegative. According to the nature of radiance or reflectance, we define the probability measurement $P$ of $x$ as:

$$p_i = p(x_i) = \frac{x_i}{\sum_{i=1}^{b} x_i} \tag{19.1}$$

The vector $P(x) = (p_1, p_2, \ldots, p_b)^T$ is the ideal probability representation of $x = (x_1, x_2, \ldots, x_b)^T$. As is well known, each pixel in a hyperspectral image can be regarded as a single signal source composed of $b$ bands, so its spectral variation can be described by the statistic $P(x)$. Similarly, if there is another pixel $y$, its probability vector is $Q(y) = (q_1, q_2, \ldots, q_b)^T$, where $q_i = q(y_i) = y_i / \sum_{i=1}^{b} y_i$. According to information theory, the self-information of $x$ and $y$ is defined band by band and can be expressed, respectively, as:

$$I(x) = (I_1(x), I_2(x), \ldots, I_b(x))^T \tag{19.2}$$

$$I(y) = (I_1(y), I_2(y), \ldots, I_b(y))^T \tag{19.3}$$

where the self-information in the $i$-th band is $I_i(x) = -\log(p_i)$ and $I_i(y) = -\log(q_i)$. The relative entropy of vector $y$ with respect to $x$ can be defined as:

$$CE(x, y) = \sum_{i=1}^{b} p_i\, CE(x_i, y_i) = \sum_{i=1}^{b} p_i \left(I_i(y) - I_i(x)\right) = \sum_{i=1}^{b} p_i \log\!\left(\frac{p_i}{q_i}\right) = P^T(x)\left(I(y) - I(x)\right) \tag{19.4}$$

In Eq. (19.4), $CE(x, y)$ is called the Kullback–Leibler information divergence or cross-entropy. Based on Eq. (19.4), a symmetric similarity metric, spectral information divergence (SID), is defined. SID takes in the spectral information of the pixel by calculating the relative entropy, and measures spectral similarity from the perspective of information theory. SID measures the spectral similarity of pixels $x$ and $y$ as shown in formula (19.5):

$$SID(x, y) = CE(x, y) + CE(y, x) = P^T(x)\left(I(y) - I(x)\right) + Q^T(y)\left(I(x) - I(y)\right) \tag{19.5}$$
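SID follows directly from Eqs. (19.1)–(19.5), as in the NumPy sketch below. This is an illustration rather than the authors' implementation; the `eps` smoothing term is an added assumption to guard against zero-valued bands when taking logarithms.

```python
import numpy as np

def sid(x, y, eps=1e-12):
    """Spectral information divergence between two nonnegative pixel spectra (Eq. 19.5)."""
    p = x / (np.sum(x) + eps)                            # probability vector P(x), Eq. (19.1)
    q = y / (np.sum(y) + eps)                            # probability vector Q(y)
    ce_xy = np.sum(p * np.log((p + eps) / (q + eps)))    # CE(x, y), Eq. (19.4)
    ce_yx = np.sum(q * np.log((q + eps) / (p + eps)))    # CE(y, x)
    return ce_xy + ce_yx                                 # SID(x, y) = CE(x, y) + CE(y, x)
```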


19.4 Sparse representation classification method based on SID

The traditional sparse representation model assumes that similar pixels are approximately distributed in the same low-dimensional subspace, which means one pixel can be approximately represented by a linear combination of a few atoms of a known dictionary. In the sparse representation model, we need to solve for the sparse representation coefficient $\alpha$ of the test sample $x$ in the reconstruction. Assume that a given dictionary $D$ is made up of the training samples; the representation coefficient $\alpha$, which satisfies $D\alpha \approx x$, can be obtained through the following optimization problem:

$$\hat{\alpha} = \arg\min \|\alpha\|_0 \quad \text{s.t.} \quad \|x - D\alpha\|_2 \le \varepsilon \tag{19.6}$$

where $\varepsilon$ is the error tolerance. The optimization problem in this formula can also be understood as minimizing the approximation error under a certain degree of sparsity, as follows:

$$\hat{\alpha} = \arg\min \|x - D\alpha\|_2 \quad \text{s.t.} \quad \|\alpha\|_0 \le K \tag{19.7}$$

where $K$ is the upper limit of the sparsity. The above problem is NP-hard, and an approximation algorithm is commonly used to solve it, such as orthogonal matching pursuit (OMP). OMP is a greedy algorithm, which adds the most similar atom in each iteration until $K$ atoms are selected or the approximation error reaches a preset threshold. In the sparse representation models in Eqs. (19.6) and (19.7), $\|x - D\alpha\|_2$ uses the Euclidean distance to measure the similarity between the reconstructed pixel and the real pixel. In each iteration of the OMP algorithm, the correlation parameter $CP \stackrel{\text{def}}{=} |\langle x, \hat{x} \rangle|$ is used to measure the similarity between the selected atoms and the residual vector. The authors of Ref. [16] have demonstrated that SID is more effective than the Euclidean distance in describing spectral changing characteristics. Therefore, this section introduces an SID-based sparse representation classifier (SID-based SRC) for hyperspectral image classification. Assume the known dictionary $D$ is made up of training samples; the sparse representation coefficient vector $\alpha$ can be obtained by solving the sparse reconstruction problem:

$$\hat{\alpha} = \arg\min \|\alpha\|_0 \quad \text{s.t.} \quad SID(x, D\alpha) \le \varepsilon \tag{19.8}$$

and it can also be expressed as:

$$\hat{\alpha} = \arg\min SID(x, D\alpha) \quad \text{s.t.} \quad \|\alpha\|_0 \le K \tag{19.9}$$

where K represents the sparsity. Similar to traditional sparse reconstruction problems, this problem is NP-hard, but it can be solved by greedy algorithms [23,24] or by relaxation to a convex programming problem [25].
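For orientation, the plain OMP routine named above can be sketched as follows. This is a generic illustration, not the authors' solver; it assumes approximately unit-norm dictionary columns, and the SID-based variant described in the text would replace the Euclidean selection/stopping criterion with $SID(x, D\alpha)$.

```python
import numpy as np

def omp(x, D, K):
    """Greedy solution of Eq. (19.7): pick the atom most correlated with
    the current residual, refit the selected atoms by least squares, repeat."""
    residual = x.copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(K):
        j = int(np.argmax(np.abs(D.T @ residual)))          # most correlated atom
        if j not in support:
            support.append(j)
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs               # update residual
    alpha[support] = coeffs                                  # scatter coefficients back
    return alpha
```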

Once the sparse coefficient vector $\hat{\alpha}$ in formula (19.9) is obtained, we can determine the category of test pixel $x$. Suppose there are $C$ categories; $RES_c(x)$ is defined as the residual error of the $c$-th class, i.e., the error between the real test sample and the pixel reconstructed by the training samples of the $c$-th class:

$$RES_c(x) = \left\| x - A_c \hat{\alpha}_c \right\|_2, \quad c = 1, 2, \ldots, C \tag{19.10}$$

or:

$$RES_c(x) = SID\left(x, A_c \hat{\alpha}_c\right), \quad c = 1, 2, \ldots, C \tag{19.11}$$

where $\hat{\alpha}_c$ is the part of the reconstruction coefficient $\hat{\alpha}$ corresponding to the training samples of the $c$-th class, and $A_c$ collects those training samples. The label of test pixel $x$ is the class with the smallest residual error:

$$\text{label}(x) = \arg\min_{c = 1, 2, \ldots, C} RES_c(x) \tag{19.12}$$
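The decision rule of Eqs. (19.10)–(19.12) can be sketched as below. The container names `sub_dicts` and `alpha_parts` (holding $A_c$ and $\hat{\alpha}_c$) are illustrative assumptions; by default the Euclidean residual of Eq. (19.10) is used, and passing the `sid` helper sketched in Section 19.3 gives the SID residual of Eq. (19.11).

```python
import numpy as np

def classify_by_residual(x, sub_dicts, alpha_parts, dissimilarity=None):
    """Assign x to the class whose reconstruction A_c @ alpha_c has the
    smallest residual, per Eqs. (19.10)-(19.12)."""
    if dissimilarity is None:
        dissimilarity = lambda a, b: np.linalg.norm(a - b)   # Eq. (19.10), Euclidean residual
    residuals = [dissimilarity(x, A_c @ a_c)                 # class-wise reconstruction A_c alpha_c
                 for A_c, a_c in zip(sub_dicts, alpha_parts)]
    return int(np.argmin(residuals))                         # Eq. (19.12): label of minimum residual
```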

19.5 Joint sparse representation classification method based on SID

In a hyperspectral image, adjacent pixels are often composed of similar materials, so their spectral characteristics are highly correlated. In order to jointly exploit the spatial and spectral information of adjacent pixels, this section introduces the joint sparse representation classification method based on SID (SID-based JSRC). Assume that there are $L$ pixels in a neighborhood of the hyperspectral image and that each pixel is expressed as a vector; these pixels then form one $b \times L$ matrix $X = [x_1, x_2, \ldots, x_L]$. In the joint sparse representation model, $X$ can be represented as:

(19.13)

Each pixel can thus be jointly and linearly represented by a subset of the atoms in the dictionary. Because the neighboring pixels $\{x_i\}_{i=1,2,\ldots,L}$ are highly correlated, it can be assumed that they select the same atoms, i.e., $\{\alpha_i\}_{i=1,2,\ldots,L}$ share the same sparsity pattern, and in $A = [\alpha_1, \alpha_2, \ldots, \alpha_L]$ there are $K$ rows of elements of matrix $A \in$