
GLOSSARY OF SYMBOLS

A          User's original matrix.
A^T        Transpose of A.
A^(k)      Reduced matrix (of order n × n) before step k of Gaussian elimination.
A^[k]      Matrix (order n × n) associated with kth finite element.
A_i:       Row i of the matrix A.
A_:j       Column j of the matrix A.
B, C, ...  Other matrices.
b          Right-hand side vector.
c          Intermediate solution vector.
c_k        Depending on the context, either component k of c or column count (number of entries in column k of matrix).
D          Diagonal matrix.
e_i        ith column of I.
I          Identity matrix.
L          Lower triangular matrix, usually formed by triangular factorization of A.
L\U        LU factorization of a matrix packed into a single array.
n          Order of the matrix A.
O()        If f(n)/g(n) → k as n → ∞, where k is a constant, then f(n) = O(g(n)).
P          Permutation matrix (usually applied to matrix rows).
Q          Orthogonal matrix (usually a permutation matrix). Note that QQ^T = I.
R          Upper triangular matrix in a QR factorization.
r          Residual vector r = b − Ax.
r_k        Row count (number of entries in row k of matrix).
U          Upper triangular matrix, usually formed by triangular factorization of A.
u          Threshold for numerical pivoting.
v_i        ith eigenvector of A.
x          Solution vector.
ε          Relative precision.
κ(A)       Condition number of A.
λ_i        ith eigenvalue of A.
ρ          Largest element in any reduced matrix, that is max_{i,j,k} |a_ij^(k)|.
τ          Number of entries in the matrix A.
∈          Belongs to.

NUMERICAL MATHEMATICS AND SCIENTIFIC COMPUTATION
Series Editors: A.M. STUART, E. SÜLI

NUMERICAL MATHEMATICS AND SCIENTIFIC COMPUTATION
Books in the series

Monographs marked with an asterisk (*) appeared in the series 'Monographs in Numerical Analysis', which is continued by the current series. For a full list of titles please visit https://global.oup.com/academic/content/series/n/numerical-mathematics-and-scientificcomputation-nmsc/?lang=en&cc=gb

*J. H. Wilkinson: The Algebraic Eigenvalue Problem
*I. Duff, A. Erisman, and J. Reid: Direct Methods for Sparse Matrices
*M. J. Baines: Moving Finite Elements
*J. D. Pryce: Numerical Solution of Sturm–Liouville Problems
C. Schwab: p- and hp- Finite Element Methods: Theory and Applications in Solid and Fluid Mechanics
J. W. Jerome: Modelling and Computation for Applications in Mathematics, Science, and Engineering
A. Quarteroni and A. Valli: Domain Decomposition Methods for Partial Differential Equations
G. Em Karniadakis and S. J. Sherwin: Spectral/hp Element Methods for Computational Fluid Dynamics
I. Babuška and T. Strouboulis: The Finite Element Method and its Reliability
B. Mohammadi and O. Pironneau: Applied Shape Optimization for Fluids
S. Succi: The Lattice Boltzmann Equation for Fluid Dynamics and Beyond
P. Monk: Finite Element Methods for Maxwell's Equations
A. Bellen and M. Zennaro: Numerical Methods for Delay Differential Equations
J. Modersitzki: Numerical Methods for Image Registration
M. Feistauer, J. Felcman, and I. Straškraba: Mathematical and Computational Methods for Compressible Flow
W. Gautschi: Orthogonal Polynomials: Computation and Approximation
M. K. Ng: Iterative Methods for Toeplitz Systems
M. Metcalf, J. Reid, and M. Cohen: Fortran 95/2003 Explained
G. Em Karniadakis and S. Sherwin: Spectral/hp Element Methods for Computational Fluid Dynamics, Second Edition
D. A. Bini, G. Latouche, and B. Meini: Numerical Methods for Structured Markov Chains
H. Elman, D. Silvester, and A. Wathen: Finite Elements and Fast Iterative Solvers: With Applications in Incompressible Fluid Dynamics
M. Chu and G. Golub: Inverse Eigenvalue Problems: Theory, Algorithms, and Applications
J.-F. Gerbeau, C. Le Bris, and T. Lelièvre: Mathematical Methods for the Magnetohydrodynamics of Liquid Metals
G. Allaire and A. Craig: Numerical Analysis and Optimization: An Introduction to Mathematical Modelling and Numerical Simulation
K. Urban: Wavelet Methods for Elliptic Partial Differential Equations
B. Mohammadi and O. Pironneau: Applied Shape Optimization for Fluids, Second Edition
K. Boehmer: Numerical Methods for Nonlinear Elliptic Differential Equations: A Synopsis
M. Metcalf, J. Reid, and M. Cohen: Modern Fortran Explained
J. Liesen and Z. Strakoš: Krylov Subspace Methods: Principles and Analysis
R. Verfürth: A Posteriori Error Estimation Techniques for Finite Element Methods
H. Elman, D. Silvester, and A. Wathen: Finite Elements and Fast Iterative Solvers: With Applications in Incompressible Fluid Dynamics, Second Edition
I. Duff, A. Erisman, and J. Reid: Direct Methods for Sparse Matrices, Second Edition

Direct Methods for Sparse Matrices SECOND EDITION

I. S. DUFF Rutherford Appleton Laboratory, CERFACS, Toulouse, France, and Strathclyde University

A. M. ERISMAN The Boeing Company, Seattle (retired) and Seattle Pacific University

J. K. REID Rutherford Appleton Laboratory and Cranfield University


Great Clarendon Street, Oxford, OX2 6DP, United Kingdom Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries © I. S. Duff, A. M. Erisman, and J. K. Reid 2017 The moral rights of the authors have been asserted First Edition published in 1986 Second Edition published in 2017 Impression: 1 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this work in any other form and you must impose this same condition on any acquirer Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016, United States of America British Library Cataloguing in Publication Data Data available Library of Congress Control Number: 2016946839 ISBN 978–0–19–850838–0 Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

PREFACE The subject of sparse matrices has its roots in such diverse fields as management science, power systems analysis, surveying, circuit theory, and structural analysis. Mathematical models in all of these areas give rise to very large systems of linear equations that could not be solved were it not for the fact that the matrices contain relatively few nonzeros. It has become apparent that the equations can be solved even when the pattern is irregular, and it is primarily the solution of such problems that we consider. A great deal has changed since the first edition of this book was published 30 years ago. There has been considerable research progress. Simply updating the book to account for more recent results is important. In addition, our world has become much more dependent on mathematical models for the complex designs we produce. To take just one example, The Boeing Company used wind tunnels as the primary design tool and mathematical models as a tool for independent insight in the design of large aircraft at the time of the first edition. Since that time, the mathematical model has become the primary design tool, and the wind tunnel is used for validation. In 1994, Boeing had the first test flight of the 777, its first all-digitally-designed aeroplane. This major transition raised the stakes on the importance of large-scale simulations. There has been a similar shift in many other fields. Furthermore, models have been extended to other areas since 1986. At that time, it was common in aerodynamics, structural analysis, chemical processing design, electronics design, power systems design and analysis, weather forecasting, and linear programming to include a sparse matrix model at the core. Today, these same models continue to be developed, but are solved for significantly larger problems. In addition, sparse matrices appear in models for speech recognition, language processing, computer gaming, signal processing, big data, bioscience, social networks, and process simulations of all kinds. In many cases, these models started in off-line design applications, but have moved to real-time decision support applications. Thus, the rapid and reliable solution of large sparse systems is more important than ever. On the other hand, computing has also advanced at a rapid rate. In over 25 years, it is safe to say that computing price performance has improved approximately 100,000 times based on Moore’s Law. Chris Anderson, editor of Wired Magazine and author of Free, argues that computing has advanced to the point where waste is good. All of the work we did to reduce memory and to get the most out of computing cycles is (sometimes) no longer needed. Computing is so cheap that it is better not to take expensive people time to save what does not cost very much.


Since the essence of this book is to describe efficient algorithms for solving sparse matrix problems arising at the heart of most large-scale simulations, Anderson’s argument may suggest this work is no longer needed. However, in this case he is wrong, as any scientist working in the field would already know. Here is why. As the problem size and hence the matrix size grows, ignoring sparsity would cause the computing times to grow as the cube of the size of the problem. Thus, a problem that is 10 times as large will require 1000 times the computation. Computation for sparse problems is much more problem dependent, but we might expect a problem 10 times as large could require 30–50 times the computation. At the time of the first edition of the book in 1986, a problem with 10 000 variables was regarded as very large. Today, problems with 10 000 000 variables are solved and efficient algorithms for making this possible are discussed in this book. The subject is intensely practical and we have written this book with practicalities ever in mind. Whenever two methods are applicable we have considered their relative merits, basing our conclusions on practical experience in cases where a theoretical comparison is not possible. We hope that the reader with a specific problem may get some real help in solving it. Non-numeric computing techniques have been included, as well as frequent illustrations, in an attempt to bridge the usually wide gap between the printed page and the working computer code. Despite this practical bias, we believe that many aspects of the subject are of interest in their own right and we have aimed for the book to be suitable also as a basis for a student course, probably at MSc level. Exercises have been included to illustrate and strengthen understanding of the material, as well as to extend it. In this second edition, we have introduced a few research exercises that will extend the reader’s understanding even further and could be used as the basis for research. We have aimed to make modest demands on the mathematical expertise of the reader, and familiarity with elementary linear algebra is almost certainly needed. Similarly, only modest computing background is expected, although familiarity with modern Fortran is helpful. After the introductory Chapter 1, the computational tools required to handle sparsity are explained in Chapter 2. We considered making the assumption of familiarity with the numerical analysis of full matrices, but felt that this might provide a barrier for an unreasonably large number of potential readers and, in any case, have found that we could summarize the results required in this area without lengthening the book unreasonably. This summary constitutes Chapters 3 and 4. The reader with a background in computer science will find Chapter 2 straightforward, while the reader with a background in numerical analysis will find the material in Chapters 3 and 4 familiar. Note that the rest of the book is built around the basic material in these chapters. Sparsity is considered in earnest in the remainder of the book. In Chapter 5, Gaussian elimination for sparse matrices is introduced, and the utility of standard software packages to realize the potential saving is emphasized. Chapters 6–9 are


focused on ordering methods to preserve sparsity. These include transforming a matrix to block triangular form when possible (Chapter 6), variations of local pivoting strategies to preserve sparsity (Chapter 7), orderings to achieve band and variable band form (Chapter 8), and dissection strategies (Chapter 9). The implementation of the solution of sparse systems using these orderings is the subject of Chapters 10–14. In these chapters, we consider the cases where the factorizations and solutions can be developed without concern for numerical values, as well as the cases where account of numerical values needs to be taken during factorization. We examine such implementations for modest to very large problems across a variety of computer architectures. Finally, in Chapter 15, we consider other exploitations of sparsity. Since the previous edition, much work in sparse matrix research has involved the development of algorithms that exploit parallel architectures. We decided that, rather than have a single chapter on this, we would discuss parallel aspects relevant to the work in each chapter. We use bold face in the body of the text when we are defining terms. For the purposes of illustration, we include fragments of Fortran code conforming to the ISO Fortran 95 standard, and some exercises ask the reader to write code fragments in Fortran or another language of his or her choice. Where a more informal approach is needed, we use syntax loosely based on Algol 60, with ‘:=’ meaning assignment. We use equal-width font for Fortran and normal font for informal code, sometimes using both in the same figure. Throughout the book we present timing data drawn from a variety of computers to illustrate various points. This diversity of computers is due to the different papers, environments, and time periods from which our results are drawn. It does not cause a difficulty of interpretation because it is the relative performance within a particular area that is compared; absolute numbers are less meaningful. We refer to many sparse matrix packages in the book and aim to provide somewhere (usually at its first occurrence) a short description and a reference. This will be indexed in bold so that whenever we refer to the package the reader will be able to find the description and reference easily. Three appendices cover matrix norms, some pictures of sparse matrices from various applications, and solutions to selected exercises. These appendices precede the collected set of references and indices to authors and subjects. We find it convenient to estimate operation counts and storage requirements by the term that dominates for large problems. We will use the symbol O for this. For instance, if a certain computation needs 31 n3 − 61 n(n+1) multiplications we might write this as 13 n3 + O(n2 ) or O(n3 ). Efficient use of sparsity is a key to solving large problems in many fields. We hope that this book will help people doing research in sparse matrices and supply both insight and answers for those attempting to solve these problems.


Acknowledgements for the first edition The authors wish to acknowledge the support of many individuals and institutions in the development of this book. The international co-authorship has been logistically challenging. Both AERE Harwell and Boeing Computer Services (our employers) have supported the work through several exchange visits. Other institutions have been the sites of extended visits by the authors, including the Australian National University and Argonne National Laboratory (ISD), Carnegie-Mellon University (AME), and the Technical University of Denmark (JKR). The book was typeset at Harwell on a Linotron 202 typesetter using the TSSD (Typesetting System for Scientific Documents) package written by Mike Hopper. We wish to thank Harwell for supporting us in this way, the staff of the Harwell Reprographic Section for their rapid service, Mike Hopper for answering so many of our queries, and Rosemary Rosier for copying successive drafts for us to check. Oxford University Press has been very supportive (and patient) over the years. We would like to thank the staff involved for the encouragement and help that they have given us, and for their rapid response to queries. Many friends and colleagues have read and commented on chapters. We are particularly grateful to the editors, Leslie Fox and Joan Walsh, for going far beyond their expected duties in reading and commenting on the book. Others who have commented include Pat Gaffney, Ian Gladwell, Nick Gould, Nick Higham, John Lewis, and Jorge Mor´e. Finally, we wish to thank our supportive families who accepted our time away from them, even when at home, during this lengthy project. Thanks to Diana, Catriona, and Hamish Duff; Nancy, Mike, Andy, and Amy Erisman; and Alison, Martin, Tom, and Pippa Reid. Harwell, Oxon, England and Seattle, USA, May, 1986.

I. S. D. A. M. E. J. K. R.

Acknowledgements for the second edition The challenge of writing a second edition in a very active research area has been greatly assisted by the help of others. We wish particularly to thank the following people for reading parts of the draft and making comments that led to major improvements: Mario Arioli, Cleve Ashcraft, Tim Davis, Pat Gaffney, Abdou Guermouche, Anshul Gupta, Jonathan Hogg, Sherry Li, Bora Uçar, and Clément Weisbecker. We would also like to thank our colleagues Mario Arioli, Nick Gould, Jonathan Hogg, and Jennifer Scott for their encouragement and for the many discussions over the years about the algorithms that we describe. Oxford University Press has been very supportive (and patient). We would like to thank the staff involved for the encouragement and help that they have given us, and for their rapid response to queries.


Modern technology has made this book far easier to revise than it was to write the first edition. Now we can exchange chapters by email instead of airmail taking 3 or 4 days at best. We typeset the first edition, but each page involved an expensive photographic master. Now the whole book can be typeset in seconds, viewed on the screen, and printed inexpensively. Iain (ISD) wishes to acknowledge the support of the Technical University of Denmark and the Australian National University through exchange visits that facilitated the writing of this book. He would like to acknowledge the close collaboration with his colleagues and friends in France, particularly in Toulouse, at CERFACS and IRIT. Stimulating interactions and invitations to the juries of theses have helped him to keep abreast of developments in the field. Al (AME) would like to thank his supportive wife Nancy, both in enabling the work and hosting visits during the project. He also thanks his two co-authors; he has worked with Iain and John since the early 1970s, values their different perspectives, and appreciates them as friends. In the course of writing this second edition, John has routed his trips to the US through Seattle for blocks of working time, and has hosted Al at his home in Benson. John (JKR) would like to thank Al (AME) and his wife Nancy for the many occasions when he has stayed in their house in order to make progress with writing the book. He would like to thank his grand-daughter Poppy Reid for the careful work she did during 2 weeks of work experience at RAL. She selected test problems from the Florida collection, ran the HSL code MA48 on them and collected statistics. Her results are the basis of eight tables, starting with Table 5.5.1. All of us would like to thank our wives, Di Duff, Nancy Erisman, and Alison Reid for their support over so many years. RAL-STFC, Oxon, England and Seattle, USA, March, 2016.

I. S. D. A. M. E. J. K. R.

CONTENTS

1 Introduction
  1.1 Introduction
  1.2 Graph theory
  1.3 Example of a sparse matrix
  1.4 Modern computer architectures
  1.5 Computational performance
  1.6 Problem formulation
  1.7 Sparse matrix test collections
2 Sparse matrices: storage schemes and simple operations
  2.1 Introduction
  2.2 Sparse vector storage
  2.3 Inner product of two packed vectors
  2.4 Adding packed vectors
  2.5 Use of full-sized arrays
  2.6 Coordinate scheme for storing sparse matrices
  2.7 Sparse matrix as a collection of sparse vectors
  2.8 Sherman's compressed index scheme
  2.9 Linked lists
  2.10 Sparse matrix in column-linked list
  2.11 Sorting algorithms
    2.11.1 The counting sort
    2.11.2 Heap sort
  2.12 Transforming the coordinate scheme to other forms
  2.13 Access by rows and columns
  2.14 Supervariables
  2.15 Matrix by vector products
  2.16 Matrix by matrix products
  2.17 Permutation matrices
  2.18 Clique (or finite-element) storage
  2.19 Comparisons between sparse matrix structures
3 Gaussian elimination for dense matrices: the algebraic problem
  3.1 Introduction
  3.2 Solution of triangular systems
  3.3 Gaussian elimination
  3.4 Required row interchanges
  3.5 Relationship with LU factorization
  3.6 Dealing with interchanges
  3.7 LU factorization of a rectangular matrix
  3.8 Computational sequences, including blocking
  3.9 Symmetric matrices
  3.10 Multiple right-hand sides and inverses
  3.11 Computational cost
  3.12 Partitioned factorization
  3.13 Solution of block triangular systems
4 Gaussian elimination for dense matrices: numerical considerations
  4.1 Introduction
  4.2 Computer arithmetic error
  4.3 Algorithm instability
  4.4 Controlling algorithm stability through pivoting
    4.4.1 Partial pivoting
    4.4.2 Threshold pivoting
    4.4.3 Rook pivoting
    4.4.4 Full pivoting
    4.4.5 The choice of pivoting strategy
  4.5 Orthogonal factorization
  4.6 Partitioned factorization
  4.7 Monitoring the stability
  4.8 Special stability considerations
  4.9 Solving indefinite symmetric systems
  4.10 Ill-conditioning: introduction
  4.11 Ill-conditioning: theoretical discussion
  4.12 Ill-conditioning: automatic detection
    4.12.1 The LINPACK condition estimator
    4.12.2 Hager's method
  4.13 Iterative refinement
  4.14 Scaling
  4.15 Automatic scaling
    4.15.1 Scaling so that all entries are close to one
    4.15.2 Scaling norms
    4.15.3 I-matrix scaling
5 Gaussian elimination for sparse matrices: an introduction
  5.1 Introduction
  5.2 Numerical stability in sparse Gaussian elimination
    5.2.1 Trade-offs between numerical stability and sparsity
    5.2.2 Incorporating rook pivoting
    5.2.3 2 × 2 pivoting
    5.2.4 Other stability considerations
    5.2.5 Estimating condition numbers in sparse computation
  5.3 Orderings
    5.3.1 Block triangular matrix
    5.3.2 Local pivot strategies
    5.3.3 Band and variable band ordering
    5.3.4 Dissection
  5.4 Features of a code for the solution of sparse equations
    5.4.1 Input of data
    5.4.2 The ANALYSE phase
    5.4.3 The FACTORIZE phase
    5.4.4 The SOLVE phase
    5.4.5 Output of data and analysis of results
  5.5 Relative work required by each phase
  5.6 Multiple right-hand sides
  5.7 Computation of entries of the inverse
  5.8 Matrices with complex entries
  5.9 Writing compared with using sparse matrix software
6 Reduction to block triangular form
  6.1 Introduction
  6.2 Finding the block triangular form in three stages
  6.3 Looking for row and column singletons
  6.4 Finding a transversal
    6.4.1 Background
    6.4.2 Transversal extension by depth-first search
    6.4.3 Analysis of the depth-first search transversal algorithm
    6.4.4 Implementation of the transversal algorithm
  6.5 Symmetric permutations to block triangular form
    6.5.1 Background
    6.5.2 The algorithm of Sargent and Westerberg
    6.5.3 Tarjan's algorithm
    6.5.4 Implementation of Tarjan's algorithm
  6.6 Essential uniqueness of the block triangular form
  6.7 Experience with block triangular forms
  6.8 Maximum transversals
  6.9 Weighted matchings
  6.10 The Dulmage–Mendelsohn decomposition
7 Local pivotal strategies for sparse matrices
  7.1 Introduction
  7.2 The Markowitz criterion
  7.3 Minimum degree (Tinney scheme 2)
  7.4 A priori column ordering
  7.5 Simpler strategies
  7.6 A more ambitious strategy: minimum fill-in
  7.7 Effect of tie-breaking on the minimum degree algorithm
  7.8 Numerical pivoting
  7.9 Sparsity in the right-hand side and partial solution
  7.10 Variability-type ordering
  7.11 The symmetric indefinite case
8 Ordering sparse matrices for band solution
  8.1 Introduction
  8.2 Band and variable-band matrices
  8.3 Small bandwidth and profile: Cuthill–McKee algorithm
  8.4 Small bandwidth and profile: the starting node
  8.5 Small bandwidth and profile: Sloan algorithm
  8.6 Spectral ordering for small profile
  8.7 Calculating the Fiedler vector
  8.8 Hybrid orderings for small bandwidth and profile
  8.9 Hager's exchange methods for profile reduction
  8.10 Blocking the entries of a symmetric variable-band matrix
  8.11 Refined quotient trees
  8.12 Incorporating numerical pivoting
    8.12.1 The fixed bandwidth case
    8.12.2 The variable bandwidth case
  8.13 Conclusion
9 Orderings based on dissection
  9.1 Introduction
  9.2 One-way dissection
    9.2.1 Finding the dissection cuts for one-way dissection
  9.3 Nested dissection
  9.4 Introduction to finding dissection cuts
  9.5 Multisection
  9.6 Comparing nested dissection with minimum degree
  9.7 Edge and vertex separators
  9.8 Methods for obtaining dissection sets
    9.8.1 Obtaining an initial separator set
    9.8.2 Refining the separator set
  9.9 Graph partitioning algorithms and software
  9.10 Dissection techniques for unsymmetric systems
    9.10.1 Background
    9.10.2 Graphs for unsymmetric matrices
    9.10.3 Ordering to singly bordered block diagonal form
    9.10.4 The performance of the ordering
  9.11 Some concluding remarks
10 Implementing Gaussian elimination without symbolic FACTORIZE
  10.1 Introduction
  10.2 Markowitz ANALYSE
  10.3 FACTORIZE without pivoting
  10.4 FACTORIZE with pivoting
  10.5 SOLVE
  10.6 Hyper-sparsity and linear programming
  10.7 Switching to full form
  10.8 Loop-free code
  10.9 Interpretative code
  10.10 The use of drop tolerances to preserve sparsity
  10.11 Exploitation of parallelism
    10.11.1 Various parallelization opportunities
    10.11.2 Parallelizing the local ordering and sparse factorization steps
11 Implementing Gaussian elimination with symbolic FACTORIZE
  11.1 Introduction
  11.2 Minimum degree ordering
  11.3 Approximate minimum degree ordering
  11.4 Dissection orderings
  11.5 Numerical FACTORIZE using static data structures
  11.6 Numerical pivoting within static data structures
  11.7 Band methods
  11.8 Variable-band (profile) methods
  11.9 Frontal methods: introduction
  11.10 Frontal methods: SPD finite-element problems
  11.11 Frontal methods: general finite-element problems
  11.12 Frontal methods for non-element problems
  11.13 Exploitation of parallelism
12 Gaussian elimination using trees
  12.1 Introduction
  12.2 Multifrontal methods for finite-element problems
  12.3 Elimination and assembly trees
    12.3.1 The elimination tree
    12.3.2 Using the assembly tree for factorization
  12.4 The efficient generation of elimination trees
  12.5 Constructing the sparsity pattern of U
  12.6 The patterns of data movement
  12.7 Manipulations on assembly trees
    12.7.1 Ordering of children
    12.7.2 Tree rotations
    12.7.3 Node amalgamation
  12.8 Multifrontal methods: symmetric indefinite problems
13 Graphs for symmetric and unsymmetric matrices
  13.1 Introduction
  13.2 Symbolic analysis on unsymmetric systems
  13.3 Numerical pivoting using dynamic data structures
  13.4 Static pivoting
  13.5 Scaling and reordering
    13.5.1 The aims of scaling
    13.5.2 Scaling and reordering a symmetric matrix
    13.5.3 The effect of scaling
    13.5.4 Discussion of scaling strategies
  13.6 Supernodal techniques using assembly trees
  13.7 Directed acyclic graphs
  13.8 Parallel issues
  13.9 Parallel factorization
    13.9.1 Parallelization levels
    13.9.2 The balance between tree and node parallelism
    13.9.3 Use of memory
    13.9.4 Static and dynamic mapping
    13.9.5 Static mapping and scheduling
    13.9.6 Dynamic scheduling
    13.9.7 Codes for shared and distributed memory computers
  13.10 The use of low-rank matrices in the factorization
  13.11 Using rectangular frontal matrices with local pivoting
  13.12 Rectangular frontal matrices with structural pivoting
  13.13 Trees for unsymmetric matrices
14 The SOLVE phase
  14.1 Introduction
  14.2 SOLVE at the node level
  14.3 Use of the tree by the SOLVE phase
  14.4 Sparse right-hand sides
  14.5 Multiple right-hand sides
  14.6 Computation of null-space basis
  14.7 Parallelization of SOLVE
    14.7.1 Parallelization of dense solve
    14.7.2 Order of access to the tree nodes
    14.7.3 Experimental results
15 Other sparsity-oriented issues
  15.1 Introduction
  15.2 The matrix modification formula
    15.2.1 The basic formula
    15.2.2 The stability of the matrix modification formula
  15.3 Applications of the matrix modification formula
    15.3.1 Application to stability corrections
    15.3.2 Building a large problem from subproblems
    15.3.3 Comparison with partitioning
    15.3.4 Application to sensitivity analysis
  15.4 The model and the matrix
    15.4.1 Model reduction
    15.4.2 Model reduction with a regular submodel
  15.5 Sparsity constrained backward error analysis
  15.6 Why the inverse of a sparse irreducible matrix is dense
  15.7 Computing entries of the inverse of a sparse matrix
  15.8 Sparsity in nonlinear computations
  15.9 Estimating a sparse Jacobian matrix
  15.10 Updating a sparse Hessian matrix
  15.11 Approximating a sparse matrix by a positive-definite one
  15.12 Solution methods based on orthogonalization
  15.13 Hybrid methods
    15.13.1 Domain decomposition
    15.13.2 Block iterative methods
A Matrix and vector norms
B Pictures of sparse matrices
C Solutions to selected exercises
References
AUTHOR INDEX
SUBJECT INDEX

1 INTRODUCTION

The use of graph theory to 'visualize' the relationship between sparsity patterns and Gaussian elimination is introduced. The potential of significant savings from the exploitation of sparsity is illustrated by one example. The effect of computer hardware on efficient computation is discussed. Realization of sparsity means more than faster solutions; it affects the formulation of mathematical models and the feasibility of solving them.

1.1 Introduction

A matrix is sparse if many of its coefficients are zero. The interest in sparsity arises because its exploitation can lead to enormous computational savings and because many large matrix problems that occur in practice are sparse. How much of the matrix must be zero for it to be considered sparse depends on the computation to be performed, the pattern of the nonzeros, and even the architecture of the computer. Generally, we say that a matrix is sparse if there is an advantage in exploiting its zeros. This advantage is gained from both not having to store the zeros and not having to perform computation with them. We say that a matrix is dense if none of its zeros is treated specially. This book is primarily concerned with direct methods for solving sparse systems of linear equations, although other operations with sparse matrices are also discussed. The significant benefits from sparsity come from solution time reductions, but more importantly from the fact that previously intractable problems can now be solved. These matrices may have hundreds of thousands or even millions of rows and columns. Often, it is possible to gain insight into sparse matrix techniques by working with the graph associated with the matrix, and this is considered in Section 1.2. There is a well-defined relationship between the pattern of the nonzero coefficients of a square sparse matrix and its associated graph. Furthermore, results from graph theory sometimes provide answers to questions associated with algorithms for sparse matrices. We introduce this topic here in order to be able to use it later in the book. To illustrate the potential saving from exploiting sparsity, we consider a small example in Section 1.3. Without going into detail, which is the subject of the rest of the book, we use this example to motivate the study. When serial computers with a single level of memory were our only concern, there was a near-linear relationship between the number of floating-point operations (additions, subtractions, multiplications, and divisions) on real
(or complex) values and the run time of the program. Thus, a program requiring 600 000 operations was about 20% more expensive than one requiring 500 000 operations unless there were unusual overheads. In such an environment, exploitation of sparsity meant reducing floating-point computation while keeping the overheads in proportion. Cache memories have introduced additional complications. It is necessary to consider the effect of data movement, since waiting for data to be loaded into cache takes much more time than a floatingpoint operation. Vector and parallel architectures, possibly using multicore chips, present further levels of difficulty for comparisons between algorithms. A brief discussion of modern architectures is contained in Section 1.4 and of computational performance in Section 1.5. In Section 1.6, we discuss the formulation of mathematical models from the viewpoint of the exploitation of sparsity. If sparsity tools are available, it is useful to apply these tools in a straightforward manner within an existing formulation. However, going back to the model may make it possible to reformulate the problem so that more is achieved. This section is intended only to stimulate thinking along these lines, since a full discussion is outside the scope of this book. Finally, in Section 1.7, we explain the need for a collection of sparse matrix test problems. 1.2

Graph theory

Matrix sparsity and graph theory are subjects that can be closely linked. The pattern of a square sparse matrix can be represented by a graph, for example, and then results from graph theory can be used to obtain sparse matrix results. George and Liu (1981), among others, do this in their book. Graph theory is also a subject in its own right, and early treatments are given by K¨onig (1950) and Harary (1969). More recent references include Bondy and Murty (2008). In this book, we use graph theory mainly as a tool to visualize what is happening in sparse matrix computation. As a result, we use only limited results from graph theory and make no assumption of knowledge of the subject by the reader. In this section, we introduce the basic concepts. Other concepts are introduced as they are used, for example, in Chapter 9. For practical reasons, we do not necessarily exploit all the zeros. Particularly in the intermediate calculations, it may be better to store some zeros explicitly. We therefore use the term entry to refer to those coefficients that are handled explicitly. All nonzeros are entries and some zero coefficients may also be entries. In this discussion connecting a sparse matrix and its associated graph, we relate the graph and the entries. A directed graph or digraph consists of a set of nodes (also called vertices) and directed edges between nodes. Any square sparse matrix pattern has an associated digraph, and any digraph has an associated square sparse matrix pattern. For a given square sparse matrix A, a node is associated with each row/column. If aij is an entry, there is an edge from node i to node j in the directed graph. This is usually written diagrammatically as a line with an
Fig. 1.2.1. An unsymmetric matrix and its digraph. 
Fig. 1.2.2. A symmetric matrix and its graph.

arrow, as illustrated in Figure 1.2.1. For example, the line from node 1 to node 2 in the digraph corresponds to the entry a12 of the matrix. The more general representation of the directed graph includes self-loops on nodes corresponding to diagonal entries (see, for example, Varga (1962), p. 19). To make the visualization simpler, we do not include self loops for the common case where the diagonal coefficients of the matrix are all nonzero or the zeros are to be treated as entries. Where the distinction between zeros and nonzeros on the diagonal is required, self loops are needed to represent the nonzeros on the diagonal. Formally, G(A), the digraph associated with the matrix A is not a picture, but a set V of nodes and a set E of edges. An edge is an ordered pair of nodes (vi, vj) and is associated with the matrix entry aij. The associated picture of these nodes and edges is the usual way to visualize the graph. For a symmetric matrix, a connection from node i to node j implies that there must also be a connection from node j to node i; therefore, a single line may be used and the arrows may be dropped. We obtain an undirected graph or graph, as illustrated in Figure 1.2.2. We say that a graph is connected if there is a path from any node to any other node. An important special case occurs when a connected graph contains no closed path (cycle). Such a graph is called a tree. In a tree, we can pick any node as the root, to give a rooted tree, as illustrated in Figure 1.2.3, where the node labelled 5 is the root. From any node other than the root there is a unique path to the root. If the node is i and the next node on the path to the root is j, then i is called the child of j and j is called the parent of i. It is conventional to draw trees with the root at the top and with all the children of a node at the same height. A node without a child is called a leaf. A digraph with no closed paths (cycles) is also important in sparse matrix work (see Section 13.7) and is known as a directed acyclic graph or DAG. A fundamental operation used in solving equations with matrices involves adding multiples of one row, say the first, to other rows to make all entries in the first column below the diagonal equal to zero. This process is illustrated in

Fig. 1.2.3. A matrix whose graph is a rooted tree.

the next section. Detailed discussion of the algorithm, which is called Gaussian elimination, is found in Chapters 3 and 4. Notice that when this is applied to the matrix of Figure 1.2.1, adding a multiple of the first row to the fourth creates a new entry in position (4,2). This new entry is called a fill-in. Graph theory helps in visualizing the changing pattern of entries as elimination takes place. Corresponding to the graph G, the elimination digraph Gy for node y is obtained by removing node y and adding a new edge (x, z) whenever (x, y) and (y, z) are edges of G, but (x, z) is not. For example, G1 for the digraph of Figure 1.2.1 would have the representation shown in Figure 1.2.4, with the new edge (4, 2) added. Observe that this is precisely the digraph corresponding to the 3×3 submatrix that results from the elimination of the (4,1) entry by the row transformation discussed in the last paragraph. In the case of a symmetric matrix whose graph is a tree, no extra edges are introduced when a leaf node is eliminated. The corresponding elimination operations introduce no new entries into the matrix.

Fig. 1.2.4. The digraph G1 for the digraph of Figure 1.2.1.

This relationship between graph reduction and Gaussian elimination was first discussed by Parter (1961). It is most often used in connection with symmetric matrices, since a symmetric permutation of a symmetric matrix leaves its graph unchanged except for the numbering of its nodes. For an unsymmetric matrix, it is often necessary to interchange rows without interchanging the corresponding columns. This leads to a different digraph, and the correlation between the digraph and Gaussian elimination is not so apparent (Rose and Tarjan 1978), see Exercise 1.6. If symmetric permutations are made, the digraph remains unchanged apart from the node numbering. A graph for unsymmetric matrices that remains invariant even when the permutation is not symmetric is the bipartite graph. In this graph, the nodes

are divided into two sets, one representing the rows and the other the columns, so that a row node, i say, is joined to a column node, j say, if and only if aij is an entry in the matrix. In Figure 1.2.5, we show the bipartite graph for the matrix in Figure 1.2.1.

Fig. 1.2.5. The bipartite graph for the matrix in Figure 1.2.1.
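The elimination-digraph operation described above is easy to experiment with. The following short Fortran program is an illustrative sketch, not code from the book: the 4 × 4 pattern it uses is an assumption, chosen only to be consistent with the discussion of Figure 1.2.1 (off-diagonal entries a12, a23, a34, and a41, plus a full diagonal). It stores the digraph as a logical adjacency matrix, eliminates node 1, and reports the fill-in edge (4, 2) that this creates.

```fortran
! Illustrative sketch: build the digraph of a sparse pattern as a logical
! adjacency matrix and form the elimination digraph obtained by removing
! node 1, printing any fill-in edges that are created.
program elimination_digraph
  implicit none
  integer, parameter :: n = 4
  logical :: g(n, n)          ! g(i,j) is .true. if a_ij is an entry
  integer :: i, j

  ! Assumed pattern consistent with the discussion of Figure 1.2.1:
  ! off-diagonal entries a_12, a_23, a_34, a_41 and a full diagonal.
  g = .false.
  do i = 1, n
     g(i, i) = .true.
  end do
  g(1, 2) = .true.;  g(2, 3) = .true.;  g(3, 4) = .true.;  g(4, 1) = .true.

  ! Eliminate node 1: for every pair of edges (i,1) and (1,j) with i,j /= 1,
  ! the edge (i,j) must be present in the elimination digraph.
  do i = 2, n
     if (.not. g(i, 1)) cycle
     do j = 2, n
        if (j == i .or. .not. g(1, j)) cycle
        if (.not. g(i, j)) then
           write (*, '(a,i0,a,i0,a)') ' fill-in edge (', i, ',', j, ')'
           g(i, j) = .true.
        end if
     end do
  end do
end program elimination_digraph
```

For this assumed pattern the program prints the single fill-in edge (4,2), the edge added in Figure 1.2.4.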

1.3 Example of a sparse matrix

In the design of safety features in motor cars and aeroplanes, the dynamics of the human body in a crash environment have been studied with the aim of reducing injuries. In early studies, the simple stick figure model of a person illustrated in Figure 1.3.1 was used. Much more sophisticated models with many more variables are used today. However, this small model is ideal for tracing through the ideas behind the manipulation of a sparse matrix, and we use it for that purpose.

Fig. 1.3.1. Stick figure modelling a person.

The dynamics are modelled by a set of time-dependent differential equations with which we are not concerned here. The body segments are connected at joints (the nodes of the graph), as illustrated in Figure 1.3.1. These segments may not move independently since the position of the end of one segment must match the position of the end of the connecting segment. This leads to 42 equations of constraint (three for each numbered joint), which must be added to the mathematical system to yield a system of algebraic and differential equations to be solved numerically. At each time step of the numerical solution (typically there will be thousands of these), a set of 42 linear algebraic equations must be solved for the reactions at the joints.

Fig. 1.3.2. The pattern of the matrix associated with the stick person of Figure 1.3.1.

Since there are 14 numbered joints, with three constraints for each joint, the matrix representing the algebraic equations may be considered as a 14 × 14 block matrix with entries that are 3×3 submatrices. This is the pattern shown in Figure 1.3.2, where each × represents a dense 3×3 submatrix. The pattern of this block matrix may be developed from Figure 1.3.1 by associating the given joint numbers with block numbers. For example, since joint 6 is connected by body segments to joints 5 and 7, there is a relationship between the corresponding reactions, and block row 6 has entries in block columns 5, 6, and 7. Referring to the discussion of the previous section on the relationship between sparse matrices and graph theory, note that the stick person shown in Figure 1.3.1 corresponds to the graph of the matrix in Figure 1.3.2. In the remainder of this section, we demonstrate that a careful utilization of the structure of a matrix as in Figure 1.3.2 leads to a significant saving in the computational cost of the solution of the algebraic equations. Since these equations must be solved at each time step, with the numerical values changing from step to step while the sparsity pattern remains fixed, cost savings can become significant over the whole problem. In Section 3.12, the relationship between block and single variable equations is established. Here it is sufficient to say that an efficient solution of linear equations whose coefficient matrix has the sparsity pattern of Figure 1.3.2 can readily be adapted to generate an efficient solution of the block equations. We therefore examine the solution of the 14 simultaneous equations whose sparsity pattern is shown in Figure 1.3.2. Methods for solving systems of n linear equations

\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, 2, \ldots, n,                    (1.3.1)

or, in matrix notation,

Ax = b,                    (1.3.2)

are discussed and compared in Chapters 3 and 4. For the purposes of this example, where n = 14, we simply solve the equations by a sequence of row operations. Multiples of equation 1 are first added to the other equations to eliminate their dependence on x1 . This leaves a revised system of equations of which only the first involves x1 . If no account is taken of the zeros, these calculations will require 29 floating-point operations for the elimination of x1 from each of 13 equations, making 29×13 operations in all. Then multiples of the second equation are added to the later equations to eliminate their dependence on x2 . This requires 27 operations for each of 12 equations, making 27 × 12 operations in total. We continue until we reach the fourteenth equation, when we find that it depends only on x14 and may be solved directly. The total number of operations to perform this transformation (including operations on zeros) is 29×13 + 27×12 + 25×11 + · · · + 5×1 = 1911.

(1.3.3)

The solution of the resulting trivial fourteenth equation requires only one division. Then the thirteenth equation may be solved for x13 , using the computed x14 (requiring three operations). Similarly x12 , x11 , · · · , x1 are computed in turn using the previously found components of x. Thus, the calculation of the solution from the transformed equations requires 1 + 3 + · · · + 27 = 196

(1.3.4)

operations. The process of eliminating variables from each equation in turn and then solving for the components in reverse order is called Gaussian elimination. Its variants are discussed further in Chapter 3. Following the same approach, but operating only on the entries, requires far less work. For instance, in the first step, multiples of equation 1 need only be added to equations 2, 5, and 8, since none of the others contain x1 . Furthermore, after a multiple has been calculated, only 4 multiplications (including one for the right-hand side) are required to multiply the first equation by it. Therefore 3 divisions, 12 multiplications, and 12 additions are needed to eliminate x1 . Note, however, that when a multiple of equation 1 is added to equation 2, new entries are created in positions 5 and 8. Figure 1.3.3 shows by bullets all the fill-ins that are created. The elimination of x2 from each of the five other equations in which it now appears requires 1 division and 7 multiplications and causes fill-ins to equations 3, 5, 8, 11, and 13. The number of floating-point operations needed for the complete elimination is 2 × (13 × 5) + 3 × (11 × 4) + 4 × (9 × 3) + 2 × (7 × 2) + 2 × (5 × 1) = 408. (1.3.5) The corresponding number of operations required to solve the transformed equations is 7 + 2×11 + 3×9 + 3×7 + 2×5 + 2×3 + 1 = 94.

(1.3.6)
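The operation counts in (1.3.3) and (1.3.4) are easy to reproduce. The program below is a hedged illustration rather than code from the book: it performs dense Gaussian elimination without pivoting on a simple diagonally dominant 14 × 14 test matrix (an assumed example whose solution is x = 1), counting every division, multiplication, and addition in the same way as the text, and reports 1911 operations for the elimination and 196 for the solution of the transformed equations.

```fortran
! Illustrative sketch: dense Gaussian elimination without pivoting on a
! 14 x 14 system, counting every floating-point operation in the way the
! text counts them (right-hand side included, sparsity ignored).
program dense_elimination_count
  implicit none
  integer, parameter :: n = 14
  real :: a(n, n), b(n), x(n), pivot_multiple
  integer :: i, j, k, ops_elim, ops_solve

  ! A simple diagonally dominant test matrix whose solution is x = 1.
  a = 1.0
  do i = 1, n
     a(i, i) = real(n)
     b(i) = real(2*n - 1)
  end do

  ! Forward elimination, operating on every coefficient.
  ops_elim = 0
  do k = 1, n - 1
     do i = k + 1, n
        pivot_multiple = a(i, k) / a(k, k)                ! 1 division
        ops_elim = ops_elim + 1
        do j = k + 1, n
           a(i, j) = a(i, j) - pivot_multiple * a(k, j)   ! 1 multiply + 1 add
           ops_elim = ops_elim + 2
        end do
        b(i) = b(i) - pivot_multiple * b(k)               ! right-hand side, too
        ops_elim = ops_elim + 2
     end do
  end do

  ! Back-substitution.
  ops_solve = 0
  do i = n, 1, -1
     x(i) = b(i)
     do j = i + 1, n
        x(i) = x(i) - a(i, j) * x(j)
        ops_solve = ops_solve + 2
     end do
     x(i) = x(i) / a(i, i)
     ops_solve = ops_solve + 1
  end do

  write (*, '(a,i0)') ' elimination operations: ', ops_elim
  write (*, '(a,i0)') ' solve operations:       ', ops_solve
  write (*, '(a,es10.2)') ' max error in x:         ', maxval(abs(x - 1.0))
end program dense_elimination_count
```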

Fig. 1.3.3. The pattern of the matrix and its fill-ins. Fill-ins are marked •.

Thus, utilizing the sparsity in the solution process reduces the required number of operations from 2 107 to 502, a factor of more than four. For a large matrix, the factor will be far greater. This becomes particularly significant when the saving is repeated thousands of times. To realize these savings in practice requires a specially adapted algorithm that operates only on the entries. For a large matrix, it is also desirable to store only the entries. The development of data structures and special algorithms for achieving these savings is the major topic of this book. Before leaving this example we ask another natural question. Can the equations be reordered to permit their solution with even fewer operations? One of the principal ordering strategies that we discuss in Chapter 7, when applied to the symmetric matrix of Figure 1.3.2, results in the reordered pattern of Figure 1.3.4. Here, the original numbering is indicated by the numbering on the rows and columns. The symmetry of the matrix is preserved because rows and columns are reordered in the same way.

Fig. 1.3.4. The pattern of the reordered matrix.


If we carry through the transformation on the matrix of Figure 1.3.4, we observe that no new entries are created. Using cost formulae as before, the number of operations required is 105. Solving the transformed equations adds another 48 operations. The results of the three approaches are summarized in Table 1.3.1. The gain is more dramatic on large practical problems and, in Table 1.3.2, we show the same data for a problem from the Florida Sparse Matrix Collection (see Section 1.7).

Table 1.3.1 Numbers of operations for Gaussian elimination on matrix of Figure 1.3.2.

                        Treating matrix as
                        dense     sparse unordered   sparse ordered
  Transformation cost   1 911     408                105
  Solution cost         196       94                 48

Table 1.3.2 Operations for Gaussian elimination on onetone1, which has order 36 057 and 341 088 entries.

                               Treating matrix as
                               dense     sparse unordered   sparse ordered
  Transformation cost (×10^9)  31 251    17.6               3.6
  Solution cost (×10^6)        1 300     33.4               5.3

Several comments on this reordering are appropriate. First, it produced an ordering where no new entries were generated; this is not typical. Secondly, the ordering is optimal for this problem (in the sense that another ordering could not produce fewer new entries) which also is not typical, because finding an optimal ordering is very expensive for a genuine problem. Thirdly, that this ordering is based on a practical approach and is able to produce a significant saving is typical. Fourthly, we assumed one ordering was better than another because it required fewer arithmetic operations. This is not necessarily true. The time required to solve a problem on a modern computer depends on much more than the amount of arithmetic to be done. This is discussed in Section 1.4. In this discussion, the symmetry of the matrix has been ignored (apart from the use of a symmetric reordering). In Chapter 3, we introduce the Cholesky method, which exploits the symmetry and approximately halves the work for the elimination operations on the matrix. Also, we did not consider the effects of computer rounding in this illustration. These effects are discussed in Chapter 4. This example gives an indication that exploiting sparsity can produce dramatic computational savings and that reordering can also be significant in producing savings. Methods of reordering sparse equations to preserve sparsity are the major topic of Chapters 7–9. The order of magnitude cost reduction
reflected in Table 1.3.1 for this example is not unusual, and indeed much greater gains can be obtained on large problems, as illustrated by Table 1.3.2.
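The effect of ordering on fill-in can be explored symbolically, without any arithmetic on numerical values. The sketch below is illustrative only and uses an assumed 7 × 7 'arrowhead' pattern rather than the matrix of Figure 1.3.2: eliminating the node of highest degree first fills in the whole of the remaining matrix, while eliminating it last produces no fill-in at all, the same phenomenon that Table 1.3.1 quantifies for the stick-person problem.

```fortran
! Illustrative sketch: symbolic elimination on a symmetric sparsity pattern,
! counting the fill-ins produced by a given elimination order.
program fill_in_count
  implicit none
  integer, parameter :: n = 7
  logical :: pattern(n, n)
  integer :: i, natural(n), reversed(n)

  ! Assumed arrowhead pattern: diagonal plus a full first row and column.
  pattern = .false.
  do i = 1, n
     pattern(i, i) = .true.
     pattern(1, i) = .true.
     pattern(i, 1) = .true.
     natural(i)  = i
     reversed(i) = n + 1 - i
  end do

  write (*, '(a,i0)') ' fill-ins, natural order:  ', fill_count(pattern, natural)
  write (*, '(a,i0)') ' fill-ins, reversed order: ', fill_count(pattern, reversed)

contains

  integer function fill_count(s, order) result(nfill)
    ! Simulate Gaussian elimination on the pattern only: when pivot p is
    ! eliminated, every pair of uneliminated nodes adjacent to p becomes
    ! adjacent, and each new adjacency is a fill-in.
    logical, intent(in) :: s(n, n)
    integer, intent(in) :: order(n)
    logical :: g(n, n), done(n)
    integer :: k, p, i, j

    g = s
    done = .false.
    nfill = 0
    do k = 1, n
       p = order(k)
       do i = 1, n
          if (done(i) .or. i == p .or. .not. g(i, p)) cycle
          do j = 1, n
             if (done(j) .or. j == p .or. j == i .or. .not. g(p, j)) cycle
             if (.not. g(i, j)) then
                g(i, j) = .true.
                nfill = nfill + 1
             end if
          end do
       end do
       done(p) = .true.
    end do
  end function fill_count

end program fill_in_count
```

Chapter 7 describes practical strategies, such as minimum degree, that look for such orderings automatically.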

1.4 Modern computer architectures

Until the early 1970s, most computers were strictly serial in nature. That is, one arithmetic operation at a time ran to completion before the next commenced. Although some machines allowed operands to be prefetched (fetched before they are needed) and allowed some overlapping of instructions, the execution time of a program implementing a numerical algorithm could be well approximated by the formula

time = (number of operations)/K,                    (1.4.1)

where K is the theoretical peak performance in floating-point operations per second. The work in moving data between memory and functional units was either low or proportional to the computation time for the arithmetic. Over the years, raw speed has dramatically increased, and has been measured in Mflops (millions of floating-point operations per second), Gflops (10^9 floating-point operations per second), Tflops (10^12 floating-point operations per second) and even occasionally Pflops (10^15 floating-point operations per second). (These are pronounced Megaflops, Gigaflops, Teraflops, and Petaflops, respectively. The production of an Exaflop computer (10^18 floating-point operations per second) is expected in the early 2020s.) Furthermore, computer architectures have evolved to multiple hierarchies of memory and perform parallel operations, sometimes at a massive scale, making such simple relationships no longer valid. Arithmetic operations may be segmented into several (usually four or five) distinct phases and functional units designed so that different operands can be in different segments of the same operation at the same time. This technique, called pipelining, is an integral component of vector processing, since it is particularly useful when performing calculations with vectors. Nearly all current chips are able in some way to catch the essence of vector processing: they are able to execute operations on vectors of length n much faster than they can execute those operations on sets of n scalars. Vector processing was widely in use on high-performance computers when we wrote the first edition of this book and we gave it prominence. Current computer architectures employ a memory hierarchy with several layers. Fastest is the level-1 cache, commonly denoted by `1, next the `2 cache, then the `3 cache and then the main memory. Data is moved between these levels in cache lines, typically 64 bytes, and special hardware is used to map main memory addresses to addresses in the cache and to move wanted data into the caches to replace data not needed at the time. In Table 1.4.1, we show figures for the Intel Sandy Bridge chip that is used in the desktop machine that we use for most of the runs in this book. Latency is the delay in the movement of data and is usually measured in terms of the number of clock periods needed. In the
table, we give rounded numbers. In practice there is often a range, depending on the status of the data concerned. For example, if it has been referenced in the recent past, access might be faster. We skip these complications and just give an indication of the relative speeds and latencies in the table to show the importance of good cache management in algorithm design.

Table 1.4.1 Cache sizes and speeds for the 8-core Intel Xeon E5-2687W. Clock speed is 3.1 GHz. Speed and latency are for transfer to `1 cache and the figures for `1 cache are for transfer to the CPU.

  Cache level       Size         Speed (GBytes/s)   Latency (clocks)
  1                 8 × 32 KB    367                5
  2                 8 × 256 KB   86                 12
  3                 20 MB        41                 28
  M (main memory)   256 GB       10                 189
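The consequences of these latencies can be seen even in a trivial computation. The following program is a small, hedged experiment (not taken from the book): it sums the entries of a large matrix twice, once down the columns, which is the order in which Fortran stores them, and once across the rows, which strides through memory and forces far more cache-line traffic; on most machines the second loop is several times slower although the arithmetic is identical.

```fortran
! Illustrative sketch: the same arithmetic performed with unit-stride and
! with strided memory access, to show the effect of cache-line movement.
program loop_order
  implicit none
  integer, parameter :: n = 4000
  real, allocatable :: a(:, :)
  real :: s1, s2, t0, t1, t2
  integer :: i, j

  allocate (a(n, n))
  call random_number(a)

  call cpu_time(t0)
  s1 = 0.0
  do j = 1, n          ! column by column: unit stride through memory
     do i = 1, n
        s1 = s1 + a(i, j)
     end do
  end do
  call cpu_time(t1)

  s2 = 0.0
  do i = 1, n          ! row by row: stride of n reals between accesses
     do j = 1, n
        s2 = s2 + a(i, j)
     end do
  end do
  call cpu_time(t2)

  write (*, '(a,f8.3,a)') ' column order: ', t1 - t0, ' seconds'
  write (*, '(a,f8.3,a)') ' row order:    ', t2 - t1, ' seconds'
  write (*, '(a,es10.2)') ' |s1 - s2| =   ', abs(s1 - s2)
end program loop_order
```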

If a computation involves data that is scattered in memory and few operations are performed on each datum, it will run far slower than the theoretical peak because most of the time will be spent in moving cache lines between different levels of memory. To obtain high performance, it is important to amortize the cost of moving data to the `1 cache by ensuring that once data are in the `1 cache, many operations are performed on them. As complex as this may appear, it is simply illustrated with everyday examples. To illustrate hierarchical memory, think about the way sugar is used in a home. There is a bowl of sugar on the table which makes small amounts of sugar readily available. But if larger amounts of sugar are needed, then it takes a trip to the pantry to get a larger store of sugar. More time is required, but more sugar is available. To get sugar in even larger amounts, you could drive to the local market and buy some. The time for retrieval is much higher, but the available quantity is much greater. The computer model assumes that data must pass through the various stages, so the sugar from the pantry would be first used to refill the sugar bowl on the table, and the sugar would still be taken from the bowl, for example. Pipelining is readily illustrated by a manufacturing assembly line for automobiles. Creating a single automobile in this environment would take a great deal of time for it to pass through all of the stages of production. However, by filling the assembly line (pipeline) with automobiles, after the first one comes out, the second one follows in just the time it takes to perform each stage of production. For parallel computers, the situation with data access becomes even more complicated. In shared memory processing, the unit of execution is the thread and any item in memory is directly addressable from every thread. Conflicts occur

12

INTRODUCTION

when two threads wish to access the same item. When there are many threads, the hardware to maintain shared access becomes very complicated and makes an accurate determination of the cost of fetching data essentially impossible. Note that a single silicon chip now normally has many cores and each core supports one or more threads. In distributed memory machines, the unit of execution is the process where each process has its own memory, and access to its own memory is much faster than to the memory of another process. To complicate things further, many architectures have a hybrid structure where data is exchanged between processes by message passing, but each process has several threads accessing shared memory. We note that the word ‘process’ is also used within this book in its normal English sense. Which meaning is intended is clear from the context and should not cause any confusion. Distributed memory parallelism can be illustrated by the challenge of writing this book. Each author has his own brain corresponding to a process, and work is parcelled out so that each can work independently on his own piece. However, to create the whole, from time to time the work must be brought together through message passing (email) or brought into a common memory (a personal visit or a Skype session). Results are achieved most effectively when the independent work is large, the message passing is small, and the meetings represent a small percentage of the overall project. 1.5

Computational performance

A key operation in linear algebra is the multiplication of two dense matrices, or more generally, the operation C := C + αAB,

(1.5.1)

for a scalar α and matrices A, B, and C of orders (say) m×k, k×n, and m×n. This can be performed very effectively if B and C are stored by columns (the Fortran way) and there is room in the `1 cache for A and a few columns of B and C. C is computed column by column. For each column, 2m reals are moved into the cache, (2k + 1)m floating-point operations are performed, and m real results are stored to memory and are no longer required in the cache. If k is reasonably large, say 32, the data movement time will be small compared with the arithmetic time. There is no need for the cache to hold B or C—indeed there is no limit on the size of n. The system will bring the columns into cache as they are needed and may indeed be able to overlap the computation and data movement. This is known as streaming. For the highest computational performance, it is necessary to exploit any parallelism available in the computer. We will make use of it at various levels. For the first, the computation is arranged to involve fully independent subproblems for which code can execute independently. We might permute a sparse set of equations to the form

PROBLEM FORMULATION

     

A11 A22 A33 A51 A52 A53

A44 A54

    b1 x1 A15  x2   b2  A25          A35    x3  =  b3  , A45   x4   b4  b5 x5 A55

13

(1.5.2)

which allows us to begin by working independently on the four subproblems      bi yi Aii Ai5 , i = 1, . . . , 4. (1.5.3) = bi5 zi A5i Ai55 P5 P5 where i=1 Ai55 = A55 and i=1 bi5 = b5 . Such a block structure can occur though the use of a nested dissection ordering, as discussed in Chapter 9, or when domain decomposition is used in the numerical solution of partial differential equations. Parallelism can also be exploited at the level of the still sparse submatrices, Aij . Finally, we show in Chapters 11–13 how dense kernels can be used at the heart of a sparse code. For example, consider the matrix multiplication problem of the previous paragraph. If the problem is large enough, the data may be redistributed across the processes so that they share the work and each works on submatrices for which streaming occurs. When discussing the merits of various algorithms throughout the remainder of this book, we will use the formula (1.4.1) where appropriate. K represents the theoretical peak performance of the computer. Attaining the peak performance would require maximally efficient use of the memory hierarchy, which is rarely possible for sparse problems. The data structures in a sparse problem are unlikely to use the pipelines, buffers, and memory hierarchies without inefficiencies. For this reason, we compare performance on different computer architectures and different problems using cases from the Florida Sparse Matrix Collection (see Section 1.7) and elsewhere. In the case of dense vectors and matrices, the outlook is more predictable. Here the basic manipulations have been tuned to the various architectures using the BLAS (Basic Linear Algebra Subprograms). The Level 1 BLAS (Lawson, Hanson, Kincaid, and Krogh 1979) are for vector operations, the Level 2 BLAS (Dongarra, Du Croz, Hammarling, and Hanson 1988) are for matrix-vector operations, and the Level 3 BLAS (Dongarra, Du Croz, Duff, and Hammarling 1990) are for matrix-matrix operations. Optimized versions of these subroutines are available from vendors. Of particular importance are the Level 3 BLAS because they involve many arithmetic operations for each value that is moved through caches to the registers. Although the BLAS are crucial in the efficient implementation of dense matrix computation, they also play an important role in sparse matrix computation because it is often possible to group much of the computation into dense blocks. 1.6

Problem formulation

If the potential gains of sparse matrix computation indicated in this chapter are to be realized, it is necessary to consider both efficient implementation and basic

14

INTRODUCTION

problem formulation. For the dense matrix problem, the order n of the matrix controls the requirements for both the storage and solution time to solve a linear system. In fact, quite precise predictions of solution time (O(n3 )) and storage (O(n2 )) may be made for a properly formulated algorithm (see Section 3.11). This type of dependence on n becomes totally invalid for sparse problems. This point was demonstrated for the example in Section 1.3, but it should be emphasized that it is true in general. The number of entries in the matrix, τ , is a more reliable indicator of work and storage requirements, but even using τ , precise predictions similar to those of the dense case are not possible. For much of the book, we will be concerned with the implementation of the solution to the set of sparse linear equations Ax = b,

(1.6.1)

where A is an n×n, nonsingular matrix. However, equation (1.6.1) arises from the formulation of the solution to a mathematical model. How that model is formulated can result in varying amounts of sparsity. Equations like (1.6.1) come from linear least squares problems, circuit simulation problems, control systems problems, and many other sources. In each case, seemingly innocent ‘simplifications’, which may be very helpful in the dense case, can destroy sparsity and make the solution of (1.6.1) either much more costly or infeasible. It is the authors’ belief that most very complicated physical systems have a mathematical model with a sparse representation. Thus, anyone concerned with the solution of such models must pay careful attention to the way the model is formulated. We will illustrate, though certainly not exhaustively, some examples from diverse applications in Chapter 15. The real benefits of sparsity have depended upon the formulation of the model as well as the choice of solution algorithm. 1.7 Sparse matrix test collections Comparisons between sparse matrix strategies and computer programs are difficult because of the enormous dependence on implementation details and because the various ordering methods (introduced in Section 5.3 and discussed in detail in Chapters 7–9 are heuristic. This means that comparisons between them will be problem dependent. These concerns led to the development of a set of test matrices by Duff and Reid (1979) extended by Duff, Grimes, and Lewis (1989). A nice interface for accessing both the problems and associated statistics has been designed at NIST (MatrixMarket 2000) and the original Harwell-Boeing format was extended to the Rutherford-Boeing format by Duff, Grimes, and Lewis (1997). The GRID-TLSE project in Toulouse offers a web-friendly interface to direct sparse solvers and sparse matrix test problems (Amestoy, Duff, Giraud, L’Excellent, and Puglisi 2004). Currently, the most extensive collection of sparse matrices is the University of Florida Sparse Matrix Collection (Davis and Hu 2011). Not only are the matrices

SPARSE MATRIX TEST COLLECTIONS

15

available in several formats, but there is a substantial discussion of each with several pictures, not just of the matrix. Additionally, there is software to enable the extraction of matrices with various characteristics including the application domain. A major objective of the test collections has been to represent important features of practical problems. Sparse matrix characteristics (such as matrix size, number of entries, and matrix pattern including closeness to symmetry) can differ among matrices arising from, for example, structural analysis, circuit design, or linear programming. The test problems from these different application areas vary widely in their characteristics and often have very distinctive patterns. Pictures of some of these patterns are included in Appendix B. While their patterns are often not regular, they are certainly not random. Throughout this book, we will use matrices to illustrate our discussion, to demonstrate the performance of our algorithms, and to show the effect of varying parameters. The default is that these will be matrices from the Florida Collection. Those not from this collection will be flagged by a footnote and the accompanying text will indicate their origin. Where we show the numbers of entries in a table, we count explicit zeros because the analysis performed may be required later for another matrix having the same pattern where some of these entries are now nonzero. We count the entries in both the upper and lower triangular parts of a symmetric matrix, although some algorithms require only one of the triangles. These test matrices will be symmetric, unsymmetric, or even rectangular determined by the algorithm or code being considered. As some approaches exploit patterns that are nearly symmetric, we will sometimes display the symmetry index, as defined by Erisman, Grimes, Lewis, Poole Jr., and Simon (1987), which is the proportion of off-diagonal entries for which there is a corresponding entry in the transpose, so that a symmetric matrix has a symmetry index of 1.0. Another group of matrices that we use arises from solving partial differential equations on a square q × q grid, pictured for q = 5 in Figure 1.7.1. The simplest finite-element problem has square bilinear elements. Here, there is a one-to-one correspondence between nodes in the grid and variables in the system of equations. The picture of the grid is also a picture of the graph. The resulting symmetric matrix A has order n = (q + 1)2 , and each row has entries in nine columns corresponding to a node and its eight nearest neighbours. Such a pattern can also arise from a 9-point finite-difference discretization on the same grid. Another test case arises from the 5-point finite-difference discretization, in which case each row has off-diagonal entries in columns corresponding to nodes connected directly to the corresponding node by a grid line. Further regular problems arise from the discretization of three-dimensional problems using 7-point, 11-point, or 27-point approximations. These matrices are important because they typify matrices that occur when solving partial differential equations and because very large test problems can be generated easily.

16

INTRODUCTION

Fig. 1.7.1. A 5×5 discretization. Another way of generating large sets of matrices is to use random number generators to create both the pattern and the nonzero values. Early testing of sparse matrix algorithms was done in this way. While small random test cases are useful for testing sparse matrix codes, we do not recommend their use for performance testing because of their lack of correlation to real problems. When using an existing sparse matrix code for particular applications, the user will be confronted with a variety of choices as will be discussed in the rest of the book. Since many of the algorithms discussed here are based on heuristics, we recommend that users experiment with these choices for their own applications. The test problems provide an invaluable source for this purpose. Other readers may be more interested in extending research: which algorithms work best depending on size of matrix, application area, computer architecture, and many other factors? The test problems are useful for this purpose as well. The Florida Collection will also be used in several of the Research Exercises that we have included in many of the chapters. Exercises 1.1 For the machine described in Section 1.4, calculate the maximum performance of a code that moves data in 10 KB blocks and reuses it three times when in the `1 cache. 1

3

8

6

2

4

5

7

Fig. E1.1. A graph with 8 nodes. 1.2 For the graph shown in Figure E1.1, write down the corresponding sparse matrix pattern. 1.3 For the graph in Figure E1.1, re-label the nodes starting on the left working from top to bottom (1,2,4,3,6,5,8,7) and write down the corresponding matrix pattern. Compare this pattern with the one from Exercise 1.2, and comment on the differences and similarities.

SPARSE MATRIX TEST COLLECTIONS

17

1.4 Using the matrix of Exercise 1.2, find the pattern of the matrix that results from the elimination of the subdiagonal entries in columns one and two by adding multiples of rows one and two to appropriate later rows. 1

3

2

4

Fig. E1.2. A graph with 4 nodes. 1.5 Can you find a reordering for the sparse matrix corresponding to the graph shown in Figure E1.2 for which no new entries are created when solving equations with this as coefficient matrix? 1.6 For the matrix in Figure 1.2.1, reorder it by swapping rows 1 and 4. Draw its associated digraph, and compare it with the digraph in Figure 1.2.1. Comment on the difference and the similarities. 1.7 From Exercises 1.3 and 1.6, how does the relationship between the digraph of an unsymmetric matrix differ from the relationship between the symmetric matrix and its graph?

2 SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS We consider how best to store sparse vectors, to add a multiple of one to another, and to form inner products. We also discuss the effective use of a full array even when the vector is sparse. We compare the different ways that sparse matrices can be held and discuss the use of linked lists. Sorting may be needed before or during a sparse matrix computation, so we discuss two sorting methods that we have found particularly useful: the counting sort and the heap sort. We consider grouping variables into supervariables, and we discuss matrix-vector and matrix-matrix products.

2.1 Introduction The aim of the chapter is to examine data structures suitable for holding sparse matrices and vectors. We can only do this in conjunction with a consideration of the operations we wish to perform on these matrices and vectors, so we include a discussion of the simple kinds of operations that we need. An important distinction is between static structures that remain fixed and dynamic structures that are adjusted to accommodate fill-ins as they occur. Naturally, the overheads of adjusting a dynamic structure can be significant. Furthermore, the amount of space needed for a static structure will be known in advance, but this is not the case for a dynamic structure. Both types are widely used in the implementation of sparse Gaussian elimination. We discuss the use of them in detail in Chapters 10–14. Usually what is required is a very compact representation that permits easy manipulation. There is no one best data structure; most practical computer codes use different storage patterns at different stages. 2.2 Sparse vector storage For simplicity, we begin by considering the storage of sparse vectors of order n. Some of our remarks generalize at once to sparse matrices whose columns (or rows) may be regarded as sets of sparse vectors. Furthermore, vector operations play an important role in Gaussian elimination (see Chapters 3 and 4). A sparse vector may be held in a full-length vector of storage (length n). When the vector is very sparse, this is rather wasteful of storage, but is often used because of the simplicity and speed with which it may then be manipulated. For instance, the i-th component of a vector may be found directly. To economize in storage we may pack the vector by holding the entries as real, integer pairs (xi , i), one for each entry. In Fortran, we normally use a real array Direct Methods for Sparse Matrices, second edition. I. S. Duff, A. M. Erisman, and J. K. Reid. c Oxford University Press 2017. Published 2017 by Oxford University Press.

SPARSE VECTOR STORAGE

19

(for example, value in Figure 2.2.1) and a separate integer array (for example, index in Figure 2.2.1), each of length at least the number of entries. This is called the packed form. We call the operation of transforming from full-length to packed form a gather and the reverse a scatter. product = 0.0 do k = 1,tau product = product + value(k)*w(index(k)) end do

Fig. 2.2.1. Code for the inner product between packed vector (value, index) with tau entries and full-length vector w. To compare the storage requirements for the two methods, we first look at two examples. For a vector of length 10 000 with value 3.7 in location 90, and –4.2 in location 5 008, the vector would be represented by: value = (3.7, -4.2) index = (90, 5008) Only four storage locations are required for this vector compared with 10 000 representing it as a full vector, a dramatic saving. Now consider the case of the vector (3.2, 4.1, –5.3, 0). This requires four storage locations, but the corresponding vector in packed form would be represented by value = (3.2, 4.1, -5.3) index = (1, 2, 3) In this case, the packed form requires more storage (six units) than the full-length vector (four units). In general, it is easy to see that the packed form requires less storage when the vector is at least 50% sparse. When the numerical values in the vector are held in double-precision arrays, or if the values are single-precision complex numbers, then the break-even point will drop to 25%, and will drop even further in the extended-precision real or double-precision complex case. Thus, the packed form generally requires far less storage in practical computations, where the vectors, at least at the beginning of the computation, are far less dense than 25%. Applying an elementary operation, say adding a multiple of component i to component j, is not convenient on this packed form because we need to search for xi and xj . A sequence of such operations may be performed efficiently if just one full-length vector of storage is used. For each packed vector, we perform the following steps: (i) Place the nonzero entries in a full-length vector known to be set to zero. (ii) For each operation: (a) Revise the full-length vector by applying the operations to it, and (b) if this changes a zero to a nonzero, alter the integer part of the packed form to correspond.

20

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

(iii) By scanning the integers of the packed form, place the modified entries back in the packed form while resetting the full-length vector to zero. Notice that the work performed depends only on the number of entries and not on n. It is very important in manipulating sparse data structures to avoid complete scans of full-length vectors.1 However, the above scheme does require the presence of a full-length real array and does not distinguish between default zeros and entries that have the value zero. One remedy is to use a full-length integer vector that is zero except for those components that correspond to entries. These hold the position of the entry in the sparse representation. This makes it almost as cheap to access entries and is equally cheap at identifying whether an entry is present. We illustrate the use of a full-length integer vector in Section 2.4. 2.3

Inner product of two packed vectors

Taking an inner product of a packed vector with a full-length vector is very simply and economically achieved by scanning the packed vector, as shown in the Fortran code of Figure 2.2.1. A fast inner product between two packed vectors may be obtained by first expanding one of them into a full-length vector. An alternative, if they both have their components in order, is to scan the two in phase as shown in the Fortran code of Figure 2.3.1. product = 0.0; kx = 1; ky = 1 do while (kx0) then value_x(k) = value_x(k) + alpha*value_y(p(i)) p(i) = 0 end if end do ! Add the fill-ins at the end. do k = 1,tauy i = index_y(k) if(p(i)>0) then taux = taux + 1 value_x(taux) = alpha*value_y(k) index_x(taux) = i p(i) = 0 end if end do

Fig. 2.4.1. Code for the operation x := x + αy. are in order, an in-phase scan of the two packed vectors, analogous to that shown in Figure 2.3.1, may be made. Note that fill-ins need not occur at the end of the packed form, so we can be sure that the order is preserved only if a fresh packed form is constructed (so that the operation is of the form z := x + αy). We set the writing of such code as Exercise 2.1. Perhaps surprisingly, it is more complicated than the unordered code. In fact, there appears to be little advantage in using ordered packed vectors except to avoid needing the full-length vector p, but even this may be avoided at the expense of slightly more complicated code (see Exercise 2.2).

22

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

An alternative approach is as follows: (i) For each entry xk , place its position in the packed vector in pk . (ii) Scan y. For each entry yi , check pi . If it is nonzero, use it to find the position of xi and modify it, xi := xi + αyi ; otherwise, add i to the packed form of x and set xi := αyi . (iii) Scan the revised packed form of x. For each i, set pi := 0. While it is slightly more expensive than the first approach, since x is likely to have more entries than y, it offers one basic advantage. It permits the sequence of operations X x := x + αj y(j) (2.4.2) j

to be performed with only one execution of steps (i) and (iii). Step (ii) is simply repeated with y(1) , y(2) , ..., while the indices for x remain in p. Step (ii) may be simplified if it is known in advance that the sparsity pattern of x contains that of y. We will see later (Chapter 10) that there are times during the solution of sparse equations when each of these approaches is preferable. 2.5

Use of full-sized arrays

In the last three sections, we have considered the temporary use of a fulllength vector of integer storage to aid computations with packed sparse vectors. Provided sufficient storage is available, a simpler possibility is to use full-length real vectors all the time. Whether this involves more computer time will depend on the degree of sparsity, the operations to be performed and the details of the hardware. Recent tests, see Table 10.7.1, suggest that indirect addressing is only about 30% slower. A further possibility is to use a full-length vector of storage spanning between the first and final entry. This allows execution that is almost as fast and may result in a very worthwhile saving in storage (and computation) compared with holding the whole vector in full-length storage. A sparse matrix, too, may be held in full-length storage, either as a two-dimensional array or as an array of one-dimensional arrays. Similar considerations apply as for vectors. For a typical large sparse problem, where n is in the tens of thousands or even hundreds of thousands, most of the computation is done with vectors that are far less than 25% dense and these sparse storage schemes offer very significant advantages. Nevertheless, at some stages of the computation, we may encounter relatively dense subproblems for which these trade-offs are important. 2.6

Coordinate scheme for storing sparse matrices

Perhaps the most convenient way to specify a sparse matrix is as its set of entries in the form of an unordered set of triples (aij , i, j). In Fortran, these may be held

SPARSE MATRIX AS A COLLECTION OF SPARSE VECTORS

23

type entry integer :: row_index, col_index real :: value end type Fig. 2.6.1. Derived type for the coordinate scheme. as an array of derived type such as that in Figure 2.6.1 or as one real array and two integer arrays, all of length the number of entries. The matrix   1. 2. 0. 0. 5.  0. 0. −3. 4. 0.     A= (2.6.1)  0. −2. 0. 0. −5.  ,  −1 0. 0. −4. 0.  0. 3. 0. 0. 6. for example, might have the representation in Table 2.6.1. This matrix is 44% dense (11 entries in a 5 × 5 matrix). It is interesting that if reals and integers require the same amount of storage, less total storage is required by the full-sized matrix (25 words) than by the packed form (33 words). The position is reversed for double-length reals to the ratio 25:22 and an even bigger change (to 25:16.5) takes place with extended reals. The situation for complex-valued matrices is analogous to that of double-length reals. Table 2.6.1 The matrix (2.6.1) stored in the coordinate scheme. Subscripts 1 row index 4 col index 1 value −1.

2 5 2 3.

3 1 2 2.

4 1 1 1.

5 6 7 8 9 10 11 5 2 4 3 3 2 1 5 3 4 5 2 4 5 6. −3. −4. −5. −2. 4. 5.

The major difficulty with this data structure lies in the inconvenience of accessing it by rows or columns. It is perfectly suitable if we wish to multiply by a vector in full-length storage mode to give a result also in full-length storage mode. However, the direct solution of a set of linear equations, for example, may involve a sequence of operations on the columns of the coefficient matrix. There are two principal storage schemes that provide ready access to this information: the collection of sparse vectors and the linked list. These we now describe. 2.7 Sparse matrix as a collection of sparse vectors Our first alternative to the coordinate scheme is to hold a collection of packed sparse vectors of the kind described in Section 2.2, one for each column (or row). The components of each vector may be ordered or not. Since our conclusion in Section 2.4 was that there is little advantage in ordering them, we take them to be unordered. The collection may be held as an array of a derived type with allocatable components, for example, type sparse vector in Figure 2.7.1 or Figure 2.7.2.

24

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

type vector_entry integer :: index real :: value end type type sparse_vector type(vector_entry), allocatable :: entry(:) end type Fig. 2.7.1. Derived type for sparse vectors, using a derived type for the entries. type sparse_vector integer, allocatable :: index(:) real, allocatable :: value(:) end type Fig. 2.7.2. Derived type for sparse vectors, using integers and reals. The disadvantage is that the sizes of the columns vary during the execution. Each change of size will involve allocating a new array, copying the data, and deallocating the old array. It is therefore usual to pack the data into an integer and a real array or into an array of derived type. For each member of the collection, we store the position of its start (in either array) and the number of entries. Thus, for example, the matrix (2.6.1) may be stored as shown in Table 2.7.1. Table 2.7.1 The matrix (2.6.1) stored as a collection of sparse column vectors. Subscripts 1 len col 2 col start 1 row index 4 value −1.

2 3 3 1 1.

3 1 6 5 3.

4 5 6 7 2 3 7 9 12 1 3 2 4 2. −2. −3. −4.

8

9

2 3 4. −5.

10 11

1 5.

5 6.

In this representation, the columns are stored in order (column 1, followed by column 2, followed by column 3, etc.). If we hold the start of an imagined extra column (column 6 in Table 2.7.1), it is not necessary to store both col start and len col, since col start can be calculated from len col or len col can be calculated from col start. If it is only necessary to access the matrix forwards or backwards, as is the case when a triangular factor is being held (see Section 10.5), then holding just len col is satisfactory and has the advantage that the integers are smaller and so may fit into a shorter computer word. Where the columns may need to be accessed in arbitrary sequence, we may dispense with len col and hold just col start; holding just len col would lead to a costly extra computation for finding entries in a particular column. A basic difficulty with this structure is associated with inserting new entries. This arises when a multiple of one column is added to another, since the

SHERMAN’S COMPRESSED INDEX SCHEME

25

new column may be longer, as discussed in Section 2.4. It is usual to waste (temporarily) the space presently occupied by the column and add a fresh copy at the end of the structure. Once the columns have become disordered because of this, both len col and col start are needed (in this case, the location of an extra imagined column is not needed). After a sequence of such operations, we may have insufficient room at the end of the structure for the new column, although there is plenty of room inside the structure. Here, we should ‘compress’ it by moving all the columns forward to become adjacent once more. It is clear that this data structure demands some ‘elbow room’ if an unreasonable number of compresses are not to be made. 2.8 Sherman’s compressed index scheme Some economy of integer storage is possible with the compressed index scheme of Sherman (1975), which we now describe. If we look at the matrix of Figure 2.8.1, our usual storage scheme would hold the off-diagonal entries as a collection of ×× 0 ×× × 0 ×× ××× ×× × Fig. 2.8.1. A matrix pattern. packed vectors (Section 2.7) as shown in Table 2.8.1. The compressed index scheme makes use of the fact that the tail of one row often has the same pattern as the head of the next row (particularly so for the factors after fillin has occurred) and so the same indices can be reused. In order to do this, the length of each row (that is the number of off-diagonal entries) must be kept and we show the resulting data structure in Table 2.8.2. A similar gain is obtained automatically when using multifrontal schemes. Referring to Figure 14.2.1, we see that NCOL*NPIV - NPIV*(NPIV-1)/2 entries will require only NPIV+NCOL indices. Table 2.8.1 Matrix pattern of Figure 2.8.1 stored as a collection of sparse row vectors. Subscript lenrow irowst jcn

1 3 1 2

2 2 4 4

3 2 6 5

4 5 6 7 8 1 8 9 4 5 4 5 5

We indicate some of the gains obtained on practical problems in Table 2.8.3 where, if compressed storage were not used, the number of indices would be equal to the number of entries in the factors. Indeed, on the matrix from 5-point discretizaton on a square, see Section 1.7, the storage for the reals increases as

26

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

Table 2.8.2 Matrix pattern of Figure 2.8.1 stored as a collection of sparse row vectors, using compressed storage. Subscript lenrow irowst jcn

1 3 1 2

2 2 2 4

3 4 2 1 2 3 5

Table 2.8.3 Illustration of savings obtained through use of compressed storage scheme. Order 900 130 Nonzeros 7 744 1 296 Entries in factors 34 696 1 304 Number of indices 11 160 142 O(n3 /2) for banded storage (or O(n log n) at best), while the integer storage for indices can be shown to be O(n) (see Exercise 2.5). 2.9

Linked lists

Another data structure that is used widely for sparse matrices is the linked list. We introduce linked lists in this section and show how they can be used to store sparse matrices in Section 2.10. Although Fortran has explicit pointers, we choose to use array subscripts for the sake of efficiency and so that comparisons with other data structures are easy to make. The essence of a linked list is that there is a pointer to the first entry (header pointer) and with each entry is associated a pointer (or link) that points to the next entry or is null for the final entry. The list can be scanned by accessing the first entry through the header pointer and the other entries through the links until the null link is found. For example, we show in Table 2.9.1 a linked list for the set of indices (10, 3, 5, 2), where we have used zero for the null pointer. Scanning this linked list, we find the header points to entry 1 with the value 10, its link points to entry 2 with value 3, and its link in turn points to entry 3 with value 5. The link from entry 3 points to entry 4 whose value is 2 and whose link value is 0, indicating the end of the list. When using linked lists, it is important to realize that the ordering is determined purely by the links and not by the the physical location of the entries. Table 2.9.1 Linked list holding (10, 3, 5, 2). Subscripts 1 2 3 4 Values 10 3 5 2 Links 2 3 4 0 Header 1

LINKED LISTS

27

Table 2.9.2 Linked list holding (10, 3, 5, 2) in a different physical order. Subscripts Values Links Header

1 2 3 4 3 10 2 5 4 1 0 3 2

Table 2.9.3 Linked list holding (10, 3, 5, 2) in increasing order of values. Subscripts Values Links Header

1 2 3 4 3 10 2 5 4 0 1 2 3

For example, the same ordered list (10, 3, 5, 2) can be stored as in Table 2.9.2. Furthermore, the links can be adjusted so that the values are scanned in order without moving the physical locations (see Table 2.9.3). If we are storing a vector of integers in isolation, it would be nonsensical to prefer a linked list, since we would have unnecessarily increased both the storage required and the complexity of accessing the entries. There are two reasons, however, why linked lists are used in sparse matrix work. The first is that entries can be added without requiring adjacent free space. For example, an extra entry, say 4 between 3 and 5, could be accommodated as shown in Table 2.9.4 even if locations 5–8 hold data that we do not wish to disturb. Secondly, if we wish to remove an entry from the list, no data movement is necessary since only the links need be adjusted. The result of removing entry 3 from the list of Table 2.9.2 is shown in Table 2.9.5. A problem with adding and deleting entries is that usually the previous entry must be identified so that its link can be reset, although additions may be made to the head of the list if the ordering is unimportant. If the list is being scanned at the time a deletion or insertion is needed, the previous entry should be available. However, if an essentially random entry is to be deleted or if insertion in order is wanted, we need to scan the list from its beginning. At the expense of more storage, this search may be avoided by using a doubly-linked list that has a second link associated with each entry and that points to the previous entry. An Table 2.9.4 Linked list holding (10, 3, 4, 5, 2). Subscripts Values Links Header

1 2 3 4 5 6 7 8 9 3 10 2 5 * * * * 4 9 1 0 3 * * * * 4 2

28

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

Table 2.9.5 Removal of entry 3 from the Table 2.9.2 list. The only changed link is shown in bold. Subscripts Values Links Header

1 2 3 4 * 10 2 5 * 4 0 3 2

Table 2.9.6 (10, 3, 5, 2) in a doubly-linked list. Subscripts Values Forward links Backward links Forward header Backward header

1 2 3 4 3 10 2 5 4 1 0 3 2 0 4 1 2 3

example is shown in Table 2.9.6. It is now straightforward to add or delete any entry (see Exercise 2.7). A significant simplification can occur when the values are distinct integers in the range 1 to n and a vector of length n is available to hold the links. The location i may be associated with the value i so that there is no need to store the value. Our example (10, 3, 5, 2) is shown in Table 2.9.7 held in this form. Several such lists may be held in the same array of links in this way, provided no two lists contain a value in common. For instance, we may wish to group all the columns with the same number of entries. If columns (10, 3, 5, 2) all have one entry, columns (1, 9, 4) have two entries and columns (6, 7, 8) have three entries, the three groups may be recorded together as shown in Table 2.9.8. We have used the value zero to indicate a null pointer, but any value that cannot be a genuine link may be used instead, so it may be used to hold other information. We illustrate this in Section 2.13. For a more complete discussion of linked lists we refer the reader to Knuth (1969, pp. 251–257). Table 2.9.7 (10, 3, 5, 2) in an implicit linked list. Subscripts 1 2 3 4 5 6 7 8 9 10 Links 0 5 2 3 Header 10

SPARSE MATRIX IN COLUMN-LINKED LIST

29

Table 2.9.8 (10, 3, 5, 2), (1, 9, 4), and (6, 7, 8) in one array as implicit linked lists. Subscripts 1 2 3 4 5 6 7 8 9 10 Links 9 0 5 0 2 7 8 0 4 3 Headers 10 1 6

2.10

Sparse matrix in column-linked list

A major benefit of using linked lists for the storage of sparse matrices is that the elbow room and compressing operations associated with the data structure in Section 2.7 can be avoided entirely. To store the matrix as a collection of columns, each in a linked list, we need an array of header pointers with the j th entry pointing to the location of the first entry for column j. We illustrate this in Table 2.10.1 for the 5×5 matrix (2.6.1). The values are physically located in the rather arbitrary order shown in Table 2.6.1, but the links have been constructed so that the columns are scanned in column order. We employ this ordering because it facilitates operations on the matrix and enables us to illustrate the flexibility of the structure better. However, it is not a requirement of the linked list scheme that the columns be accessed in order, since a variation of the technique of Section 2.4 may be used for the critical operation of adding a multiple of one column to another. Table 2.10.1 The matrix (2.6.1) held as a linked list. Subscripts 1 col start 4 row index 4 value −1. link 0

2 3 5 3. 0

3 4 5 6 7 8 9 10 11 6 10 11 1 1 5 2 4 3 3 2 1 2. 1. 6. −3. −4. −5. −2. 4. 5. 9 1 0 0 0 5 2 7 8

To illustrate the use of this structure, we work through column 4. col start(4) = 10, row index(10) = 2, and value(10) = 4.0, so the first entry in column 4 is in the (2,4) position and has value 4.0. link(10) = 7, row index(7) = 4, and value(7) = −4.0, so the next entry in column 4 is in the (4,4) position and has value −4.0. Since link(7) = 0, there are no further entries in column 4. We can illustrate how the structure is used by supposing that a multiple of column 1 of the matrix (2.6.1) is added to column 2. A fill-in occurs in position (4,2) and this may be added to the linked list of Table 2.10.1, while preserving the order within row 2, by placing the value of the new entry in value(12), setting row index(12) = 4, giving the new entry the link value that entry (3, 2) used to have, and setting the link of entry (3, 2) to 12. The resulting new link array is shown in Table 2.10.2, the changed entries being shown in bold. The col start array is not affected in this case.

30

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

Table 2.10.2 The link array after inserting entry (4,2). Subscripts 1 2 3 4 5 6 7 8 9 10 11 12 link 0 0 9 1 0 0 0 5 12 7 8 2 We can delete entries as described in Section 2.9 and normally find it convenient to link deleted entries together so that their storage locations are readily available for subsequent insertions. 2.11

Sorting algorithms

Often we are confronted with the need to take a list of τ values and associated keys, and sort the list so that the keys are in ascending or descending order. This can be costly unless a good sorting algorithm is employed, so we introduce two sorting methods, counting and heap, which we will use throughout the book. The counting sort is for the special case where the keys are integers in a limited range, say (1:n). An example is the sorting of the rows of a sparse matrix by their numbers of entries. The restricted key values allow us to conduct the sort in O(τ ) + O(n) operations. The more general case can be performed by the heap sort in O(τ log τ ) operations. 2.11.1

The counting sort

Suppose we are given arrays value and key, both of of length τ , holding values and integer keys in the range (1:n). We show that, using a counting sort (Feurzeig 1960) we can sort the arrays so that the keys are in ascending order in O(τ )+O(n) operations. When τ is substantially less than n, this sort is not recommended (see Exercise 2.9). The sorting algorithm consists of three steps: (i) We go through key counting the number of occurrences of each entry i in range (1:n), storing the results of the count in an array of length n, which we will call work: work(1:n) = 0 do k = 1, tau j = key(k) work(j) = work(j) + 1 end do (ii) For each key value i, we calculate in work(i) the location in the sorted vector of keys of the first entry with value greater than i (for example, if there are m instances of 1, work(1) will end having the value m+1): work(1) = work(1) + 1 do j = 2, n work(j) = work(j) + work(j-1) end do

SORTING ALGORITHMS

31

(iii) This allows us to run through the entries, moving each to its sorted position: do k = 1, tau j = key(k) k_new = work(j) - 1 work(j) = k_new value_new(k_new) = value(k) map(k) = k_new end do For the first entry with key value j, a suitable position is work(j)-1. Decrementing work(j) by 1 enables the same calculation to be made for the next entry with key value j. It means that the entries with the same key value are placed backwards and work(j) ends holding the position of the first entry with key value j or greater. The array map of length tau is used hold the positions of the original entries when sorted, useful if another array is to be sorted with the same keys. Note the need for a second array, value new, for storing values in the sorted positions. We show how to avoid the extra array in a specific application at the end of Section 2.12. It may be helpful to work through this counting sort with an example, and we have provided one in Exercise 2.10. 2.11.2

Heap sort

In the preceding section, we were able to sort a list with a computational effort that was proportional to the length of the list. It is, of course, very desirable to do this, but there are cases where it is not possible. A very useful algorithm in such a case is the heap sort. It can be performed in place, that is, without temporary storage. For a list of length n, the computational effort is proportional to n log n. The algorithm is based on regarding the items as being held in a special binary tree. Node 1 is the root and nodes 2 and 3 are its children. In general, node i has nodes 2i and 2i + 1 as its children unless 2i > n or 2i + 1 > n, in which case node i is a leaf or has one child. We illustrate this for n = 12 in Figure 2.11.1. We will refer to the node with the greatest index as the last node (so that node 12 is the last node in Figure 2.11.1). Suppose that it desired to order the items in ascending order and they have already been associated with such a binary tree so that the value at each node is less than or equal to that at any child. It follows that the root value is less than or equal to the value at any other node. It can therefore be extracted as the leading entry of the sorted list. Now remove the last node and place it in the vacated root position. If the new root value is greater than that of one of its children, swap it with the child having the lesser value. The root now has value less than or equal to the values at its children. If the swap was done, do the same for the swapped node and its children. Continue until no swap is needed or the leaf level is reached. At this point, we have a new a binary tree with the value at

32

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

Fig. 2.11.1. A binary tree with 12 nodes. each node less than or equal to that at a child. Restoring this property will have taken at most log2 n swaps. The second sorted value can be taken from the root. Continuing, the sorted list can be constructed with less than n log2 n swaps. We may build the tree similarly. Suppose we already have a tree with less than n nodes. Add a new node as node n + 1. If its value is less than that of its parent, swop them. Since the old parent had a value less than or equal to that at its old child, so will the new parent. If the swap has been done, do the same for the swapped node and its parent. Continue until no swap is needed or the root node is reached. At this point, we have a new a binary tree with the required property. Restoring this property will have taken at most log2 n swaps. Continuing, the full tree can be constructed from an initial tree with one node with less than n log2 n swaps. One way to hold the heap is to use use an array of dimension one, with the value at node i stored in position i. This allows the whole sort to be performed in place. 2.12 Transforming the coordinate scheme to other forms Neither of the structures described in Sections 2.7 and 2.10 is particularly convenient for initial specification of a sparse matrix to a library subroutine. Fortunately, we can sort from the coordinate scheme (Section 2.6) to a columnlinked list or to a collection of sparse vectors (rows or columns) in O(n) + O(τ ) operations, if n is the matrix order and τ is the number of entries. A column-linked list may very easily be constructed from an unstructured list by scanning it once, as shown in Figure 2.12.1. Essentially, we begin with a null list and add the entries in one by one without ordering them within columns. Note that the row indices and values are not accessed at all; the column indices are accessed but are not moved. Note also that we could overwrite column index with links, thereby saving the storage occupied by the array link. If ordering within the columns is required, this can be achieved by first linking by rows and then using these links to scan the matrix in reverse row order when setting column links (Exercise 2.11). A temporary array for holding row header pointers will be needed. It is fascinating that in this way we order the τ entries

TRANSFORMING THE COORDINATE SCHEME TO OTHER FORMS

33

col_start(1:n) = 0 do k = 1,tau j = col_index(k) link(k) = col_start(j) col_start(j) = k end do Fig. 2.12.1. Generating a column-linked structure. in 2τ operations (similar operations to those of the loop in Figure 2.12.1) in view of the fact that a general-purpose sort involves O(τ log2 τ ) operations. This is possible because of the special nature of the data to be ordered; it consists of a set of integers in the range (1:n) and we have a work vector col start of length n. To transform a matrix held in the coordinate scheme, with entries (aij , i, j), to a collection of sparse column vectors in O(n) + O(τ ) operations, we may make use of the counting sort (see Section 2.11.1). Since the goal is to sort the entries (aij , i, j) by columns, we apply the counting sort with indices j as keys. In this case, the most straightforward way to carry this out is to use temporary arrays to store the entries aij and row indices i. If another matrix with exactly the same pattern is to be processed, we can save the map array and sort it directly. Note that the row indices are not sorted within each column. Fortunately, often this is not needed, as was illustrated in Sections 2.3 and 2.4. If sorting within the columns is needed, a similar procedure may be applied by rows and the rows accessed in reverse order when constructing the collection of columns. If it is not acceptable to require the additional temporary arrays, we may replace step (iii) by a sequence of cyclic permutations, performed by a subroutine that we will name cyclic move and which has a single argument k. Starting from the entry in position k, say aij , it decreases work(j) by one, moves the entry in the location work(j) into temporary storage, and moves aij into this position. It then applies the same procedure to the displaced entry, which is now col_end(0) = 1 col_end(1:n) = work(1:n) jj = 1 do while jjcol_end(jj-1)) then call cyclic_move(k) else jj = jj + 1 end if end do Fig. 2.12.2. Using a sequence of cyclic permutations.

34

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

in temporary storage and continues until an entry is placed in position k. Suitable code for finding starting positions k is shown in Figure 2.12.2. Note that each entry is moved directly to its final position (via temporary storage) and again work automatically ends storing the positions of the starts of the columns. The entries are now scanned twice: once in (i) and once when being moved. The entries are again not sorted into order within each column.

2.13

Access by rows and columns

We will see in Chapter 10 that, although the sparse matrix storage structures we have so far considered are sometimes perfectly suitable, there are times when access by rows and columns is needed. In both the column-linked list and the collection of sparse column vectors, the entries in a particular row cannot be discovered without a search of all or nearly all the entries. This is required, for instance, to reveal which columns are active during a single stage of Gaussian elimination. If the matrix is held as a collection of sparse column vectors, a satisfactory solution (Gustavson 1972) is also to hold the structure of the matrix as a separate collection of sparse row vectors. In this second collection there is no need to hold the numerical values themselves, since all real arithmetic is performed on the sparse column vectors and the row collection is required only to reveal which columns are involved in the elimination steps. We show our 5 × 5 example stored in this format in Table 2.13.1. It differs from Table 2.7.1 only because of the addition of the integer arrays len row, row start, and col index. Table 2.13.1 The matrix (2.6.1) stored in Gustavson’s format. Subscripts 1 2 3 4 5 6 7 8 9 10 11 len col 2 3 1 2 3 col start 1 3 6 7 9 4 1 5 1 3 2 4 2 3 1 5 row index value −1. 1. 3. 2. −2. −3. −4. 4. −5. 5. 6. len row 3 2 2 2 2 1 4 6 8 10 row start col index 2 1 5 3 4 5 2 1 4 2 5 Similarly, for the linked list, we may add row indices and column links. This, however, means that four integers are associated with each entry. Curtis and Reid (1971) felt that this gave unacceptable storage overheads, and therefore dropped the row and column indices, but made the last link of each row (or column) contain the negation of the row (or column) index instead of zero. Thus, the row (or column) index of an entry could always be found by searching to the end of the row (or column). For really sparse matrices this search is not expensive. We show our 5 × 5 example in this format in Table 2.13.2.

SUPERVARIABLES

35

Table 2.13.2 The matrix (2.6.1) stored in Curtis–Reid format. Subscripts 1 2 3 4 5 6 7 8 9 10 11 col start 4 3 6 10 11 4 6 9 1 2 row start value −1. 3. 2. 1. 6. −3. −4. −5. −2. 4. 5. 5 2 7 8 col link −1 −2 9 1 −5 −3 −4 row link 7 5 11 3 −5 10 −4 −3 8 −2 −1 Given either the collection of sparse row vectors or the row-linked list, it is extremely easy to construct these larger structures (see Exercises 2.12 and 2.13). 2.14

Supervariables

If a set of columns of a sparse matrix have identical structure, they can be treated together with a single index list. The set of variables associated with such a set of columns is known as a supervariable. Working with supervariables not only saves memory, but can lead to significantly faster execution of code. Duff and Reid (1996a, Section 2.5) showed that supervariables can be identified with an amount of work that is proportional to the total number of entries. This is done by ordering the entries by rows and constructing the sets of supervariables for the submatrices with successively more rows. They start with all the variables in a single supervariable. This is split into two supervariables by moving each variable j corresponding to an entry a1j into a new supervariable. Continuing, for i = 2, 3, . . . , n, they split each supervariable containing a variable j corresponding to an entry aij into two by moving any such variable into a new supervariable. If this leaves the old supervariable empty, it is discarded. The whole algorithm is illustrated in Figure 2.14.1. Reid and Scott (2009) implemented this with four arrays of length n. Put all the variables in one supervariable do i = 1, n do for each j in the list for row i sv = the supervariable to which j belongs if this is the first occurrence of sv for row i then establish a new supervariable new sv record that new sv is associated with sv else new sv = the new supervariable associated with sv end if move j to new sv discard sv if it is now empty end do end do

Fig. 2.14.1. Identifying supervariables.

36

2.15

SPARSE MATRICES: STORAGE SCHEMES AND SIMPLE OPERATIONS

Matrix by vector products

Any of the storage schemes we have considered is suitable for multiplying a sparse matrix by a full-length vector to give a full-length vector result. For the operation y = Ax, (2.15.1) where x is sparse, access to A by columns is desirable. In fact, all we need are the columns corresponding to the nonzero components xi , and any data structure that provides access by columns is suitable. The vector y is just a linear combination of these columns and the technique for adding sparse vectors that we described at the end of Section 2.4 is very suitable. If column access is not available, a scan of the whole packed matrix A is inevitable. It is important, however, that we do not scan the packed vector x for each entry in A, and this can be avoided if x is first expanded into a full-length vector. If access by rows is available, the result y can be placed at once in packed storage since yi depends only on the i-th row of A. The BLAS (Section 1.5) provide basic operations for dense matrices but, after work by the BLAS Technical Forum (Blackford, Demmel, Dongarra, Duff, Hammarling, Henry, Heroux, Kaufman, Lumsdaine, Petitet, Pozo, Remington, and Whaley 2002) at the end of the twentieth century, a standard for operations on sparse matrices was defined (Duff, Heroux, and Pozo 2002). This included a sparse matrix times a dense vector kernel which, although crucial in the efficiency of iterative methods of solution, is seldom used in the implementation of sparse direct codes. The sparse matrix by sparse matrix product discussed in the next section is not included in the sparse BLAS. 2.16

Matrix by matrix products

We now consider the matrix by matrix product C = BA

(2.16.1)

where A and B are conformable (although not necessarily square) sparse matrices, and C is perhaps sparse. For dense matrices, the usual way of computing C is to consider each entry cij as an inner product of the i-th row Bi: of B and the j-th column A:j of A, that is X cij = bik akj = Bi: A:j . (2.16.2) k

The trouble with this formula for a general sparse matrix is that it is very difficult to avoid performing multiplications bik akj with one or other of the factors having the value zero (and an explicit test for a zero is likely to be equally expensive). For instance, if A is stored by columns and B is stored by rows, column j of A may be loaded into a full-length vector and then cij may be calculated by scanning the entries of row i of B. This means that all entries of B are scanned

PERMUTATION MATRICES

37

for each column of A. If the sparsity pattern of C is already known, cij need not be calculated unless it is an entry, but there are still likely to be many occasions when we multiply an entry of B by a zero in the full-length vector. None of these unnecessary operations is performed with the outer-product formulation as a sum of rank-one matrices, X C= B:k Ak: , (2.16.3) k

which is natural if B is stored by columns and A is stored by rows (note that B:k is a column vector and Ak: is a row vector, so B:k Ak: has the shape of C). If both matrices are stored by columns, column j of C may be accumulated as a linear combination of the columns of B by expressing (2.16.3) in the form X C:j = akj B:k . (2.16.4) k

This is analogous to performing a sequence of matrix by vector products in the first way described in Section 2.15. If both matrices are stored by rows, row i of C may be accumulated as a linear combination of the rows of A by the formula X Ci: = bik Ak: , (2.16.5) k

and the required work is identical to that when A and B are both stored by columns. An important special case is when B = AT , which can arise in the leastsquares problem, in which case C is the normal matrix. If A is stored by rows, we have the case of (2.16.3) with B:k = ATk: which yields X C= ATk: Ak: . (2.16.6) k

If A is stored by columns, we have the case of (2.16.2) with B_{i:} = A_{:i}^T which, in general, will not be as efficient. Indeed, it may be preferable to make a copy of A that is stored by rows. Another advantage of the outer-product approach is that A need not be stored as a whole. We may generate the rows successively (or read them from auxiliary storage) and accumulate the contribution from A_{k:} at once, after which this data is no longer needed. We refer the reader to Gustavson (1978) for a more detailed discussion of the topic of this section.
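As an illustration of (2.16.4), the following Fortran sketch accumulates one column C_{:j} of the product when both matrices are held as collections of sparse column vectors. The names are ours and chosen only for illustration: b_start, b_ind, b_val hold B; a_ind, a_val hold the packed column j of A; w is a full-length work array and touched a marker array, both assumed clean on entry.

    subroutine product_column(n, b_start, b_ind, b_val, na, a_ind, a_val, &
                              w, touched, nc, c_ind)
    ! Accumulate C(:,j) = sum over k of a(k,j)*B(:,k) in the work array w.
    integer, intent(in) :: n, b_start(n+1), b_ind(*), na, a_ind(na)
    real, intent(in) :: b_val(*), a_val(na)
    real, intent(inout) :: w(n)           ! all zero on entry
    logical, intent(inout) :: touched(n)  ! all .false. on entry
    integer, intent(out) :: nc, c_ind(n)
    integer :: k, kk, p, i
    nc = 0
    do kk = 1, na                         ! entries a(k,j) of column j of A
       k = a_ind(kk)
       do p = b_start(k), b_start(k+1)-1
          i = b_ind(p)
          if (.not. touched(i)) then      ! first contribution to row i
             touched(i) = .true.
             nc = nc + 1
             c_ind(nc) = i
          end if
          w(i) = w(i) + a_val(kk)*b_val(p)
       end do
    end do
    ! The caller would now copy (c_ind, w) into packed storage for C(:,j)
    ! and reset w and touched at the nc positions listed in c_ind.
    end subroutine product_column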

2.17 Permutation matrices

Permutation matrices are very special sparse matrices (identity matrices with reordered rows or columns). They are frequently used in matrix computations to represent the ordering of components in a vector or the rearrangement of rows and columns in a matrix. For this reason, we want a special representation of permutation matrices that is both compact in storage and readily allows permutation operations to be performed. If we think of the permutation as a sequence of interchanges

    (i, p_i), i = 1, 2, ..., n-1,                             (2.17.1)

where p_i >= i, this sequence of interchanges operating on a vector of length n can be expressed in matrix notation as

    Px,                                                       (2.17.2)

where P is a permutation matrix. Note that P may be obtained by performing the sequence of interchanges (2.17.1) on the identity matrix. Note also that for any permutation matrix, the relation

    PP^T = P^T P = I                                          (2.17.3)

is true. A convenient way to store such a permutation matrix is simply as the set of integers

    p_i, i = 1, 2, ..., n-1.                                  (2.17.4)

In this case, x may be permuted in place, that is without any additional array storage, by code such as that shown in Figure 2.17.1.

    do i = 1,n-1
       j = p(i)
       temp = x(i)
       x(i) = x(j)
       x(j) = temp
    end do

Fig. 2.17.1. Code to permute x in place.

In fact, each interchange (i, p_i) is itself a representation of an elementary permutation matrix P_i (elementary because only two rows or columns of the identity matrix have been interchanged), and P is the product

    P = P_{n-1} P_{n-2} ... P_1,                              (2.17.5)

and its inverse is

    P^T = P_1^T P_2^T ... P_{n-1}^T.                          (2.17.6)

The inverse permutation may therefore be represented by the same set of interchanges, but now in reverse order. An alternative representation of P is as the set of n integers πi , i = 1, 2,..., n, which represent the components of x in successive positions of y = Px.


    do i = 1,n
       y(i) = x(perm(i))
    end do

Fig. 2.17.2. Code to form y = Px from x. The integers π_i are in the array perm.

In this case, in-place permutation is not straightforward (see Exercise 2.15), but y can be formed from x as shown in Figure 2.17.2. It involves fewer array element accesses than the code of Figure 2.17.1, so is likely to run faster. This representation is equally convenient for the inverse permutation, again provided in-place permutation is not required.

Which of these representations is preferable depends on the application. Often speed is not of great importance because this is an O(n) calculation and other transformations involving more operations will be needed too. In this case, the convenience of being able to sort in place may be important. However, sometimes the average number of entries per row may be very modest (for example, two or three), in which case the speed of permutations may be quite important.

To construct π_i, i = 1, 2, ..., n from p_i, i = 1, 2, ..., n-1 is trivial. We merely have to apply the successive interchanges to the vector (1, 2, ..., n). The reverse construction is not quite so easy, but it can be done in O(n) operations if care is taken (see Exercise 2.17).
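For example, a minimal fragment in the style of Figure 2.17.1 (our own illustration; it assumes that n, the interchange array p, the array perm, and an integer itemp are declared elsewhere) builds perm by applying the interchanges to the identity permutation:

    do i = 1,n
       perm(i) = i
    end do
    do i = 1,n-1
       j = p(i)              ! apply the interchange (i, p(i))
       itemp = perm(i)
       perm(i) = perm(j)
       perm(j) = itemp
    end do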

2.18 Clique (or finite-element) storage

Matrices that arise in finite-element calculations have the form

    A = \sum_k A^{[k]},                                       (2.18.1)

where each matrix A^{[k]} is associated with a finite element and has nonzeros in only a few rows and columns. Therefore, it is possible to hold each A^{[k]} as a small dense matrix together with a set S^{[k]} of indices to indicate where it belongs in the overall matrix. For example, the symmetric matrix whose pattern is shown in Figure 1.3.2 might consist of the sum of matrices A^{[k]} whose index sets S^{[k]} are (1, 2), (1, 5, 8), (2, 3, 11, 13), (3, 4), (5, 6), (6, 7), (8, 9), (9, 10), (11, 12), and (13, 14).

Another way to store a sparse matrix A is therefore as the sum (2.18.1). Any sparse matrix may be written in this form by putting each entry into its own A^{[k]}. We will see in Section 11.2 that there can be advantages during Gaussian elimination in using this representation because elimination operations may be represented symbolically by merging index lists. Because such a merged list cannot be longer than the sum of the lengths of the lists merged, this symbolic operation can be done without any need for additional storage. In this way, we can be sure not to run out of storage while finding out how much storage will be needed for the subsequent operations. Furthermore, the merge of two index lists will be much faster than the corresponding elimination step since all the entries are involved in the elimination.

There can be advantages in retaining the form (2.18.1) for the matrix by vector multiplication

    y = Ax,                                                   (2.18.2)

where x is a full-length vector. In this case, each product

    A^{[k]} x                                                 (2.18.3)

may be formed by:
(i) Gathering the required components of x into a vector of dimension that of A^{[k]} (a gather operation).
(ii) Performing a full-length matrix by vector multiplication.
(iii) Adding the result back into the relevant positions in the full-length vector y (a scatter operation).
Steps (i) and (iii) involve indirect addressing so are likely to be executed relatively slowly, but full advantage can be taken of the hardware in step (ii) and there will be a gain if there are many more operations to be performed in step (ii) than in steps (i) and (iii). Thus, this technique will be successful if the number of indices in all the index sets (this is the number of operations in steps (i) and (iii)) is substantially less than the number of nonzeros in A.
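A minimal Fortran sketch of steps (i)-(iii) for one element matrix is given below. The names (nk for the order of A^{[k]}, list for its index set S^{[k]}, ak for the small dense matrix) are ours, chosen for illustration.

    subroutine add_element_product(n, nk, list, ak, x, y)
    ! Add A[k]*x into y, where A[k] is the nk-by-nk dense element matrix ak
    ! with global indices list(1:nk).
    integer, intent(in) :: n, nk, list(nk)
    real, intent(in) :: ak(nk,nk), x(n)
    real, intent(inout) :: y(n)
    real :: xk(nk), yk(nk)
    integer :: i, j
    xk = x(list)                    ! (i) gather
    yk = 0.0
    do j = 1, nk                    ! (ii) dense matrix by vector product
       do i = 1, nk
          yk(i) = yk(i) + ak(i,j)*xk(j)
       end do
    end do
    y(list) = y(list) + yk          ! (iii) scatter-add
    end subroutine add_element_product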

2.19 Comparisons between sparse matrix structures

We will now make some comparisons between the sparse matrix structures that we have been describing. It has already been remarked that the coordinate scheme is the most convenient for general-purpose data input, but is inadequate for later use. It therefore remains necessary to consider the linked list, the collection of sparse vectors, and clique storage. We first remark that coding for the linked list is usually simpler and shorter. Figure 2.12.1 provides an example of just how compact it can be. A further advantage is that no movement of data is needed, no elbow room is needed, nor does the collection need occasional compression. Its disadvantages are in greater storage demands (because of having to hold the links themselves) or greater demands for computer time if the row and/or column indices are overwritten by the links (because then occasional searches are needed). The linked list has some run-time disadvantage because of the access time for the links themselves. Furthermore, when we need a matrix entry we are likely to need the rest of its row/column too. These may be scattered around memory, which can cause many cache misses on a cache-based machine and ‘page thrashing’ (an unreasonable number of page faults) on a machine with virtual storage. We discussed some of the benefits of clique storage in Section 2.18. If the problem is from finite elements, the clique representation (2.18.1) is the natural way to hold the original matrix. However, as we remarked in Section 2.18,


Table 2.19.1 Some statistics on artificial generation of cliques. Since each matrix is symmetric, only the nonzeros in one of its triangular halves need be stored.

    Matrix order                              147    1176     292      85
    Nonzeros in matrix                      2 449  18 552   2 208     523
    Nonzeros stored                         1 298   9 864   1 250     304
    Number of generated cliques               157     907     373     102
    Number of entries in generated cliques  2 970  11 286   2 029     481

it is possible to ‘disassemble’ any symmetric matrix into a sum of the form (2.18.1). Duff and Reid (1983) investigated this further and we show some of their results in Table 2.19.1. This indicates that there is unlikely to be any gain from artificially generating the clique storage for a matrix if it is naturally in assembled form. However, when Gaussian elimination is applied to a symmetric (or symmetrically structured) matrix, cliques are automatically produced and there are advantages in storing them in this form (see Chapter 11). We are faced with a problem-dependent decision and no firm conclusion may be made for the best sparse matrix structure. Some computer codes make use of several structures; the reader with a particular application should consider the advantages and disadvantages described here as they relate to the application in hand. Exercises 2.1 Write code, analogous to that of Figure 2.3.1, to execute the operation z := x + αy where x, y, and z are sparse vectors held in packed form with components in order. 2.2 Write code, analogous to that of Figure 2.4.1, for the operation x := x + αy where x and y are unordered packed vectors. Do not use a work vector, but assume that an integer workspace of length the number of entries in y is available and that a full-length integer vector containing non-negative integers may be used temporarily as workspace, provided it is restored afterwards. [Hint: Set component i of the full-length integer vector to the position of yi in the packed vector, preserving the old value in the integer workspace.] 2.3 Write code analogous to that of Figure 2.4.1 for the operation x := x + αy when it is known that the sparsity pattern of y is a subset of that of x. 2.4 Design an algorithm to compress a collection of sparse column vectors. 2.5 Show that with Sherman’s compressed index storage scheme, the integer storage needed for the triangular factors of the n × n matrix arising from the 5-point discretization of the Laplacian operator on a square grid is O(n) if pivots are chosen in the natural order. 2.6 Write code to delete entry i from a linked list held in an array of links with a header. 2.7 Write code to delete entry i from a doubly-linked list held with forward links in an array and a header and backward links in another array.


2.8 Suppose an array of length n has all its values lying between 1 and n. Write code to generate a doubly-linked list containing all entries with value k. 2.9 Suppose you want to sort the array value = (3, 1000, 1). Since this satisfies the condition that the entries of value are in the range (1:1000), a counting sort could be used (see Section 2.11.1) with τ = 3, n = 1000. Would this be a reasonable approach? 2.10 In Section 2.11.1, we described the counting sort. Sort the array key = (5,3,1,2,5,4,3,4) using the counting sort algorithm. What are the values of n and τ ? What are the values of the array work at the end of steps (i) and (ii)? What is the value of the array map at the end of step (iii)? Using value = (v1 , . . . , v8 ), what is the value of the array value new at the end of step (iii)? What is the value of the array work at the end of step (iii)? 2.11 Write code to transform a matrix held in the coordinate scheme (aij , i, j) to a column-linked sparse matrix with links such that any column scan through them is in order of row indices. [Hint: See the third paragraph of Section 2.12; replace row indices temporarily by row links, restoring them when scanning by rows creating column links.] 2.12 Write code that constructs the arrays of Gustavson’s collection of sparse column vectors (Table 2.13.1) from a collection of sparse row vectors. Do not assume that the rows are stored contiguously and ensure that the generated row indices are in order within each column. 2.13 Write code that, given a matrix held as a collection of rows in linked lists, sets up the Curtis–Reid format (Table 2.13.2). Ensure that column links are in natural order. 2.14 Write code to perform the in-place permutation PT x when P is held as the sequence of interchanges (i, pi ), i = 1, 2,..., n − 1. 2.15 By exploiting the fact that the permutation (i → πi , i = 1, 2, ..., n) can be expressed as a sequence of cycles i1 → i2 = πi1 , i2 → i3 = πi2 , · · · , ik → i1 = πik , write code for the in-place sort corresponding to that of Figure 2.17.1. 2.16 Write code to form y = PT x when P is held as the set of integers πi , i = 1, 2,..., n, which represent the positions of the components of x in Px. 2.17 Given the permutation (i → πi , i = 1, 2, ..., ), write code that involves O(n) operations to construct the same permutation as a sequence of interchanges (i, pi ), i = 1, 2, ..., n − 1. 2.18 Identify the permutation matrix P such that PAPT is the matrix pattern in Figure 1.3.4 and A is the pattern of Figure 1.3.2. Write P in the two different compact storage schemes of Section 2.17. 2.19 Show that if P is a permutation matrix, PT P = I.

3 GAUSSIAN ELIMINATION FOR DENSE MATRICES: THE ALGEBRAIC PROBLEM

We review the fundamental operations in the direct solution of linear equations without concern for rounding error caused by computer arithmetic. We consider the relationship between Gaussian elimination and LU factorization. We compare the use of different computational sequences including left-looking and right-looking. We show that advantage can be taken of symmetry and we consider the use of blocking. Our choice of material is based on what will be useful in the sparse case.

3.1 Introduction

In this chapter, we review the algebraic properties associated with the use of Gaussian elimination for the direct solution of the equation

    Ax = b,                                                   (3.1.1)

where A is an n×n nonsingular dense matrix, x is a vector of n unknowns, and b is a given vector of length n. The effect of numerical inaccuracies is deferred to the next chapter. Sparsity is not the subject of this chapter, although considerations of sparsity motivate the selection of material and the manner of presentation.

Almost all direct methods for the solution of (3.1.1) are algebraically equivalent to Gaussian elimination, which we introduced in Section 1.3. Variations deal with different computational sequences. Books by Wilkinson (1965), Forsythe and Moler (1967), Stewart (1973), Strang (1980), Higham (2002), and Golub and van Loan (2012) give a more complete treatment of the numerical solution of dense equations. We present some of the basic material in this chapter and the next so that we can later develop the concepts needed to handle sparsity.

3.2 Solution of triangular systems

Basic to all general-purpose direct methods for solving equation (3.1.1) is the concept that triangular systems of equations are 'easy' to solve. They take the form

    Ux = c,                                                   (3.2.1)

where U is upper triangular (all entries below the main diagonal are zero), or the form

    Lc = b,                                                   (3.2.2)

where L is lower triangular (all entries above the main diagonal are zero).


Systems of the form (3.2.1) may be solved by the steps

    x_n = c_n / u_{nn},                                                        (3.2.3a)
    x_k = ( c_k - \sum_{j=k+1}^{n} u_{kj} x_j ) / u_{kk},  k = n-1, n-2, ..., 1,  (3.2.3b)

provided u_{kk} ≠ 0, k = 1, 2, ..., n. This is known as back-substitution. Similarly, equation (3.2.2) may be solved forwards by the steps

    c_1 = b_1 / l_{11},                                                        (3.2.4a)
    c_k = ( b_k - \sum_{j=1}^{k-1} l_{kj} c_j ) / l_{kk},  k = 2, 3, ..., n,      (3.2.4b)

provided l_{kk} ≠ 0, k = 1, 2, ..., n. This is known as forward substitution. The goal of our methods for solving equation (3.1.1) is to transform this equation into one of (3.2.1) or (3.2.2), which may then be solved readily. An upper (lower) triangular matrix is said to be unit upper (lower) triangular if it has ones on the main diagonal.
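As a concrete illustration of (3.2.3) and (3.2.4), here is a minimal Fortran sketch of back-substitution and forward substitution for a dense triangular matrix held in a two-dimensional array; the routine and argument names are ours.

    subroutine back_substitute(n, u, c, x)
    ! Solve U*x = c by the steps (3.2.3), assuming u(k,k) is nonzero.
    integer, intent(in) :: n
    real, intent(in) :: u(n,n), c(n)
    real, intent(out) :: x(n)
    integer :: j, k
    do k = n, 1, -1
       x(k) = c(k)
       do j = k+1, n
          x(k) = x(k) - u(k,j)*x(j)
       end do
       x(k) = x(k)/u(k,k)
    end do
    end subroutine back_substitute

    subroutine forward_substitute(n, l, b, c)
    ! Solve L*c = b by the steps (3.2.4), assuming l(k,k) is nonzero.
    integer, intent(in) :: n
    real, intent(in) :: l(n,n), b(n)
    real, intent(out) :: c(n)
    integer :: j, k
    do k = 1, n
       c(k) = b(k)
       do j = 1, k-1
          c(k) = c(k) - l(k,j)*c(j)
       end do
       c(k) = c(k)/l(k,k)
    end do
    end subroutine forward_substitute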

3.3 Gaussian elimination

One transformation to triangular form is Gaussian elimination, which we introduced in Section 1.3 and illustrate with the system

    [a_{11} a_{12} a_{13}; a_{21} a_{22} a_{23}; a_{31} a_{32} a_{33}] [x_1; x_2; x_3] = [b_1; b_2; b_3].    (3.3.1)

Multiplying the first equation by l_{21} = a_{21}/a_{11} (assuming a_{11} ≠ 0) and subtracting from the second produces the equivalent system

    [a_{11} a_{12} a_{13}; 0 a_{22}^{(2)} a_{23}^{(2)}; a_{31} a_{32} a_{33}] [x_1; x_2; x_3] = [b_1; b_2^{(2)}; b_3],    (3.3.2)

where

    a_{22}^{(2)} = a_{22} - l_{21} a_{12},                    (3.3.3a)
    a_{23}^{(2)} = a_{23} - l_{21} a_{13}, and                (3.3.3b)
    b_2^{(2)}  = b_2 - l_{21} b_1.                            (3.3.3c)

Similarly, multiplying the first equation by l_{31} = a_{31}/a_{11} and subtracting from the third produces the equivalent system

    [a_{11} a_{12} a_{13}; 0 a_{22}^{(2)} a_{23}^{(2)}; 0 a_{32}^{(2)} a_{33}^{(2)}] [x_1; x_2; x_3] = [b_1; b_2^{(2)}; b_3^{(2)}].    (3.3.4)

Finally, multiplying the new second row by l_{32} = a_{32}^{(2)}/a_{22}^{(2)} (assuming a_{22}^{(2)} ≠ 0) and subtracting from the new third row produces the system

    [a_{11} a_{12} a_{13}; 0 a_{22}^{(2)} a_{23}^{(2)}; 0 0 a_{33}^{(3)}] [x_1; x_2; x_3] = [b_1; b_2^{(2)}; b_3^{(3)}].    (3.3.5)

Notice that equation (3.3.5) has the upper triangular form (3.2.1) with the correspondences

    U = [a_{11} a_{12} a_{13}; 0 a_{22}^{(2)} a_{23}^{(2)}; 0 0 a_{33}^{(3)}]  and  c = [b_1; b_2^{(2)}; b_3^{(3)}].    (3.3.6)

This may be performed in general by creating zeros in the first column, then the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

    l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)},  i > k,                      (3.3.7a)
    a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik} a_{kj}^{(k)},  i, j > k,    (3.3.7b)
    b_i^{(k+1)} = b_i^{(k)} - l_{ik} b_k^{(k)},  i > k,                (3.3.7c)

where a_{ij}^{(1)} = a_{ij}, i, j = 1, 2, ..., n. Because of its role here, each l_{ik} is called a multiplier. The only assumption required is that a_{kk}^{(k)} ≠ 0, k = 1, 2, ..., n. These entries are called pivots in Gaussian elimination. It is convenient to use the notation

    A^{(k)} x = b^{(k)}                                       (3.3.8)

for the system obtained after (k-1) stages, k = 1, 2, ..., n, with A^{(1)} = A and b^{(1)} = b. The final matrix A^{(n)} is upper triangular (see equation (3.3.5), for example). We use the notation U for A^{(n)} and c for b^{(n)} so that the final system is (3.2.1), that is,

    Ux = c.                                                   (3.3.9)

Note that the upper triangular elements of U are

    u_{kj} = a_{kj}^{(k)},  k < j                             (3.3.10)

and (3.3.7b) may be written

    a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik} u_{kj},  i, j > k.    (3.3.11)
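The formulae (3.3.7a) and (3.3.7b) translate directly into a short dense factorization kernel. The following Fortran sketch overwrites A with the L\U array described in Section 3.5; it assumes that no pivot is zero and is our own illustration, not a robust library code.

    subroutine lu_factorize(n, a)
    ! Overwrite the n-by-n matrix a with its LU factors: multipliers l(i,k)
    ! are stored below the diagonal and U on and above it (the L\U array).
    ! No pivoting: we assume a(k,k) is nonzero at every stage.
    integer, intent(in) :: n
    real, intent(inout) :: a(n,n)
    integer :: i, j, k
    do k = 1, n-1
       do i = k+1, n
          a(i,k) = a(i,k)/a(k,k)              ! multiplier l(i,k), equation (3.3.7a)
       end do
       do j = k+1, n
          do i = k+1, n
             a(i,j) = a(i,j) - a(i,k)*a(k,j)  ! update (3.3.7b), equivalently (3.3.11)
          end do
       end do
    end do
    end subroutine lu_factorize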


3.4 Required row interchanges

The Gaussian elimination described above breaks down when a_{kk}^{(k)} = 0, illustrated by the case

    [0 1; 2 3] [x_1; x_2] = [4; 5].                           (3.4.1)

Exchanging the equations to give

    [2 3; 0 1] [x_1; x_2] = [5; 4]                            (3.4.2)

completely avoids the difficulty. This simple observation is the basis for the solution to the problem for a matrix of any order. We illustrate by working with n = 5. The extension to the general case is obvious. Suppose we have proceeded through two stages on a system of order 5 and the system has the form

    | a_{11}^{(1)}  a_{12}^{(1)}  a_{13}^{(1)}  a_{14}^{(1)}  a_{15}^{(1)} | | x_1 |   | b_1^{(1)} |
    |               a_{22}^{(2)}  a_{23}^{(2)}  a_{24}^{(2)}  a_{25}^{(2)} | | x_2 |   | b_2^{(2)} |
    |                             0             a_{34}^{(3)}  a_{35}^{(3)} | | x_3 | = | b_3^{(3)} |    (3.4.3)
    |                             a_{43}^{(3)}  a_{44}^{(3)}  a_{45}^{(3)} | | x_4 |   | b_4^{(3)} |
    |                             a_{53}^{(3)}  a_{54}^{(3)}  a_{55}^{(3)} | | x_5 |   | b_5^{(3)} |

If a_{43}^{(3)} ≠ 0 or a_{53}^{(3)} ≠ 0, we interchange the third row with either the fourth or fifth row in order to obtain a nonzero pivot and proceed. This interchanging is called pivoting. The only way this can break down is if

    a_{33}^{(3)} = a_{43}^{(3)} = a_{53}^{(3)} = 0.           (3.4.4)

In this case, equations 3, 4, and 5 of system (3.4.3) do not involve x_3, which can be given an arbitrary value. This means that A is singular and equation (3.1.1) does not have a unique solution, violating our assumption that A is nonsingular. Extending to the general case, we see that as long as A is nonsingular, the equations may always be reordered through interchanging rows so that a_{kk}^{(k)} is nonzero.

3.5 Relationship with LU factorization

We now examine the computation from another point of view. Let L^{(k)} be the unit lower triangular matrix

    L^{(k)} =
        [ 1                                     ]
        [    .                                  ]
        [       1                               ]
        [        -l_{k+1,k}   1                 ]             (3.5.1)
        [            .            .             ]
        [        -l_{n,k}              1        ]

which differs from the identity matrix I only in the k-th column below the main diagonal, where the negatives of the multipliers l_{ik} appear. Matrices of the form (3.5.1) are sometimes called elementary lower triangular matrices. This permits the relations (3.3.7b) to be expressed in matrix notation as the relation

    A^{(k+1)} = L^{(k)} A^{(k)}.                              (3.5.2)

Using this relation for all values of k gives the equation

    U = A^{(n)} = L^{(n-1)} L^{(n-2)} ... L^{(1)} A.          (3.5.3)

Now the inverse of L^{(k)} is given by the equation

    (L^{(k)})^{-1} =
        [ 1                                     ]
        [    .                                  ]
        [       1                               ]
        [         l_{k+1,k}   1                 ]             (3.5.4)
        [            .            .             ]
        [         l_{n,k}              1        ]

as may readily be verified by directly multiplying the matrix (3.5.1) by the matrix (3.5.4). Multiplying equation (3.5.3) successively by (L^{(n-1)})^{-1}, ..., (L^{(1)})^{-1} gives the equation

    A = (L^{(1)})^{-1} (L^{(2)})^{-1} ... (L^{(n-1)})^{-1} U.    (3.5.5)

By direct multiplication using (3.5.4), we find the equation

    (L^{(1)})^{-1} (L^{(2)})^{-1} ... (L^{(n-1)})^{-1} =
        [ 1                                     ]
        [ l_{21}   1                            ]
        [ l_{31}     .    1                     ]
        [   .             l_{k+1,k}   .         ]  =  L.      (3.5.6)
        [   .                 .          .      ]
        [ l_{n1}   ...    l_{n,k}   ...     1   ]

We therefore see that equation (3.5.5) can be written as the triangular factorization

    A = LU,                                                   (3.5.7)

where U = A^{(n)} and L is the unit lower triangular matrix of multipliers shown in equation (3.5.6). Thus, Gaussian elimination performs the same computation as LU factorization. We illustrate with an example in Exercise 3.1.

Observe that, in doing this computation on a computer, we may use a single two-dimensional array if we overwrite A^{(1)} by A^{(2)}, A^{(3)}, etc. Furthermore, each multiplier l_{ij} may overwrite the zero it creates. Thus, the array finally contains both L and U in packed form, excluding the unit diagonal of L. This is often called the L\U array.

For the right-hand side, if we substitute LU for A in equation (3.1.1) we obtain the equation

    LUx = b.                                                  (3.5.8)

Noting that Ux = c, we see that c may be found by solving the lower triangular system

    Lc = b                                                    (3.5.9)

by forward substitution, see equations (3.2.4).

The only distinction between Gaussian elimination and LU factorization is that in LU factorization we consciously save the multipliers at each step. The benefit of saving these multipliers comes when we want to solve another problem with the same matrix, but a different right-hand side. In this case, there is no need to factorize the matrix again, but we simply use the factors on the new right-hand side following equations (3.5.8), (3.5.9), and (3.2.1). In some cases, we do not know all of the right-hand sides at the time of the factorization, and sometimes one right-hand side is dependent on the solution to the problem with a different right-hand side.

There is another way to derive the LU factorization of A by directly calculating the entries in L and U from (3.5.7). We leave that to Exercise 3.2.

We have described the factorization LU where L is unit lower triangular. If D is the diagonal matrix formed from the diagonal entries of A^{(n)}, we can write A^{(n)} as DU to give the alternative factorization

    A = LDU,                                                  (3.5.10)

where both L and U are unit triangular matrices. We find it convenient to use this form in the symmetric case, see Section 3.9.


3.6 Dealing with interchanges

If any row interchanges are needed during elimination, they are applied to the whole L\U array and the result is exactly as if they had been applied to A and the elimination had proceeded without interchanges (Exercise 3.3). Such row interchanges may be represented in matrix notation as the premultiplication of A by the permutation matrix P that is obtained from I by applying the same sequence of row interchanges (see Section 2.17). Hence, we will have computed the triangular factorization PA = LU. For numerical considerations (Chapter 4) and sparsity considerations (Chapter 5 and beyond), we may perform column permutations on A as well as row permutations. These may be expressed as another permutation matrix Q so that we are computing the triangular factorization PAQ = LU.

Notice that the effect of row and column permutations can be incorporated easily into this solution scheme. If P and Q are permutation matrices such that the factorization can be written

    PAQ = LU,                                                 (3.6.1)

then the solution of equation (3.1.1) is given by the solution x to the system

    PAQQ^T x = Pb                                             (3.6.2)

so that the steps (3.2.1) and (3.2.2) are replaced by the steps

    y = Pb,                                                   (3.6.3a)
    Lz = y,                                                   (3.6.3b)
    Uw = z, and                                               (3.6.3c)
    x = Qw.                                                   (3.6.3d)

Thus, the row permutation matrix P keeps track of interchanges of the entries of b, see equation (3.6.3a), and the column permutation matrix Q keeps track of interchanges that sort the solution back into original order, see equation (3.6.3d). To simplify notation, we ignore the effects of interchanges in the rest of this chapter.

3.7 LU factorization of a rectangular matrix

It is straightforward to extend LU factorization to the case of a full-rank rectangular matrix A with m rows and n columns. The matrix L^{(k)} is now of order m and still produces zeros below the diagonal in column k. If m < n, the factorization stops when column m is reached, since this and all later columns have no entries below the diagonal. If m > n, the factorization stops after column n is reached, since there are no further columns. In both cases, the result is the factorization

    A = LU,                                                   (3.7.1)

where L is m×m and U is m×n. U is upper trapezoidal in the sense that u_{ij} = 0 if i > j.


Our main reason for mentioning this is that it is very useful for breaking the factorization of a large square matrix into parts that can be handled efficiently. We use it for this purpose in Section 3.8.

3.8 Computational sequences, including blocking

In Section 3.3, the elimination steps began with the first column and proceeded through subsequent columns in order. At each step, the entire matrix below and to the right of the pivot was modified (see (3.3.11)). For obvious reasons, this is called right-looking. There are many alternatives. One is to delay the updates (3.3.11) for a_{ij} until column j is about to be pivotal, that is,

    a_{ij}^{(j)} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}.    (3.8.1)

The entries l_{ik} are to the left of a_{ij} in the matrix pattern, so this form is called left-looking. On modern hardware, it is usually faster because the intermediate values a_{ij}^{(k)}, k = 2, ..., j-1 do not need to be stored temporarily in memory. The difference between the left- and right-looking variants of Gaussian elimination may be illustrated by considering which entries are active at each stage. Figures 3.8.1 and 3.8.2 show what is stored at the beginning of the third major processing step on a 5 × 5 matrix. Quantities to be altered during the step are marked * and quantities to be used without alteration are marked †.

    u_11   u_12   u_13        u_14        u_15
    l_21   u_22   u_23        u_24        u_25
    l_31   l_32   u_33†       u_34†       u_35†
    l_41   l_42   a_43^(3)*   a_44^(3)*   a_45^(3)*
    l_51   l_52   a_53^(3)*   a_54^(3)*   a_55^(3)*

    † Entries used. * Entries changed.

Fig. 3.8.1. Computational sequence for right-looking Gaussian elimination.

    u_11†  u_12†  a_13*  a_14  a_15
    l_21†  u_22†  a_23*  a_24  a_25
    l_31†  l_32†  a_33*  a_34  a_35
    l_41†  l_42†  a_43*  a_44  a_45
    l_51†  l_52†  a_53*  a_54  a_55

    † Entries used. * Entries changed.

Fig. 3.8.2. Computational sequence for left-looking Gaussian elimination.

While the left-looking (3.8.1) is more efficient on modern hardware than the right-looking (3.3.11), it requires all the columns to the left of the active column to be accessed while it is being updated. Unless the matrix is small, these columns will not fit into the cache, so that access is slow. The solution to this problem is to group the updates by blocks so that each datum that is moved into cache is used in many operations. The form that we find the most useful is to apply the left-looking algorithm until a given number nb of columns has been factorized, and then perform a right-looking block update. If nb columns fit into cache, we can factorize the first nb columns with one data movement into cache. The Level 2 BLAS subroutine ger can be used for the update (3.3.11). Let us partition the matrix thus:

    A = [A_{11} A_{12}; A_{21} A_{22}],                       (3.8.2)

where A_{11} is of order nb × nb. Once the processing of the first nb columns is complete, we will have the factorization

    [A_{11}; A_{21}] = [L_{11} 0; L_{21} I] [U_{11}; 0].      (3.8.3)

If we had updated the remaining columns, the factorization would have been

    [A_{11} A_{12}; A_{21} A_{22}] = [L_{11} 0; L_{21} I] [U_{11} U_{12}; 0 Ä_{22}].    (3.8.4)

By equating corresponding blocks we find the relations

    L_{11} U_{12} = A_{12}                                    (3.8.5)

and

    Ä_{22} = A_{22} - L_{21} U_{12}.                          (3.8.6)

Ä_{22} is known as the Schur complement of A_{11}. The columns of U_{12} may be found by forward substitution through L_{11}. If the block size nb is small enough for L_{11} to fit into cache, it can remain there while each column of A_{12} is accessed, processed, and the resulting column of U_{12} is stored in memory. The Level 3 BLAS (Section 1.5) subroutine gemm (available in four versions: sgemm for single-precision real, dgemm for double-precision real, cgemm for single-precision complex, and zgemm for double-precision complex) performs a matrix multiply and matrix add, so is very suitable for the operation (3.8.6). Most computer vendors supply optimized versions of gemm and it can run at a high percentage of the peak speed of the computer (over 95% of peak for some machines). When called from a sequential code on a shared-memory machine, it will often use shared-memory parallel programming. When vendor-supplied BLAS are not available, an efficient version can be obtained or generated using ATLAS (Automatically Tuned Linear Algebra Software, Whaley, Petitet, and Dongarra, 2000).


The remaining operations for the given matrix are exactly those that would be made for factorizing the matrix Ä_{22}. This can be done by the left-looking algorithm on its first nb columns followed by the updating of its remaining columns. We continue in this way until the whole matrix is factorized. The algorithm is summarized in Figure 3.8.3.

    recursive subroutine factor(A, L, U)
    ! A is input and L, U are output.
    if A has more than nb columns then
       Partition A to [A11 A12; A21 A22] where A11 is of order nb × nb
       Factorize [A11; A21] = [L11 0; L21 I] [U11; 0]
       Solve L11 U12 = A12
       call factor (A22 - L21 U12, L22, U22)
       L = [L11 0; L21 L22];  U = [U11 U12; 0 U22]
    else
       Factorize A = LU
    end if
    end subroutine factor

Fig. 3.8.3. Recursive factorization.

Blocking by columns is also useful for distributed parallel computing. Each block of columns is held on a single process, usually in 'round robin' fashion with process i holding blocks i, i+p, i+2p, ..., i = 1, 2, ..., p. In this context, a right-looking algorithm is termed fan out and a left-looking algorithm is termed fan in. These terms originate from a one-sided view of communication. In the fan-out algorithm, once a process has finished work on a block column, it sends all the resulting updates to the processes that will use them. In the fan-in algorithm, when a process is ready to work on a block column, it fetches accumulated updates from each process that contributes to the block column. The fan-in algorithm performs significantly less communication, but extra storage is needed for all the partially accumulated updates.
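For illustration, the block steps (3.8.5) and (3.8.6) can be expressed with two calls to the double-precision Level 3 BLAS, dtrsm for the triangular solve and dgemm for the Schur complement update. This is only a sketch under the assumption that the first nb columns have already been factorized in place in the array a (unit lower triangular L11 with L21 below it, and U11 on and above the diagonal); it is not a complete factorization code and the routine name is ours.

    subroutine block_update(n, nb, a, lda)
    ! Form U12 = L11^{-1} A12 and A22 <- A22 - L21*U12 using Level 3 BLAS.
    integer, intent(in) :: n, nb, lda
    double precision, intent(inout) :: a(lda,n)
    ! U12 overwrites A12: solve L11 * U12 = A12 with L11 unit lower triangular.
    call dtrsm('Left', 'Lower', 'No transpose', 'Unit', nb, n-nb, 1.0d0, &
               a, lda, a(1,nb+1), lda)
    ! A22 <- A22 - L21 * U12 (the Schur complement update (3.8.6)).
    call dgemm('No transpose', 'No transpose', n-nb, n-nb, nb, -1.0d0, &
               a(nb+1,1), lda, a(1,nb+1), lda, 1.0d0, a(nb+1,nb+1), lda)
    end subroutine block_update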

3.9 Symmetric matrices

An important special case of these algorithms occurs when A is symmetric (that is, a_{ij} = a_{ji} for all i, j). Matrices with this property arise in structural analysis, in some areas of power systems analysis, frequently in optimization, and in many other application areas. In this case, assuming that pivots are chosen from the diagonal, the active parts of A^{(k)}, k = 1, 2, ..., n are symmetric; that is, the relations

    a_{ij}^{(k)} = a_{ji}^{(k)},  i ≥ k, j ≥ k                (3.9.1)

hold. This is easily deduced from the fundamental Gaussian elimination equation (3.3.7b). Equation (3.9.1) is true for k = 1 and equation (3.3.7b) tells us that it is true for k+1 if it is true for k. It follows that only about half of the arithmetic operations need be performed in the symmetric case. The relation (3.9.1) also allows the immediate deduction that in the LDU factorization of A, see (3.5.10), the relation

    l_{ij} = u_{ji},  i > j                                   (3.9.2)

holds, that is L = U^T. The factorization is usually written LDL^T in this case and, of course, U is not stored. If A is also positive definite (that is, if x^T A x > 0 for all nonzero vectors x), then it can be shown that the diagonal entries of D must be positive (see Exercise 3.8). In this case, the factorization can be written

    A = (L D^{1/2})(D^{1/2} L^T) = L̄ L̄^T,                    (3.9.3)

where D = diag(d_{ii}). This is called the Cholesky factorization. Note that, in practice, many symmetric matrices are positive definite because an energy minimizing principle is associated with the underlying mathematical model.

As in the unsymmetric case, the factorization breaks down if any a_{kk}^{(k)} is zero, although this cannot happen in the positive-definite case (see Exercise 3.9). Unfortunately, we destroy symmetry if we use row interchanges to avoid this difficulty. If another diagonal entry a_{jj}^{(k)}, j > k, is nonzero, the symmetric permutation that interchanges rows j and k and columns j and k will bring the nonzero a_{jj}^{(k)} to the required position. The result is a factorization

    PAP^T = LDL^T,                                            (3.9.4)

where P is the permutation matrix that represents all the interchanges. Unfortunately, even this can break down without A being singular, as the example

    A = [0 1; 1 0]                                            (3.9.5)

shows. We will return to this in the next chapter (Section 4.9).

As in the unsymmetric case, there is some freedom in the choice of computational sequence. Those corresponding to storage of the lower triangular part of the matrix are shown in Figures 3.9.1 and 3.9.2 and correspond to those in Figures 3.8.1 and 3.8.2, respectively.


    d_11
    l_21   d_22
    l_31   l_32†  d_33†
    l_41   l_42†  a_43^(3)*  a_44^(3)*
    l_51   l_52†  a_53^(3)*  a_54^(3)*  a_55^(3)*

    † Entries used. * Entries changed.

Fig. 3.9.1. Computational sequence for right-looking Gaussian elimination on the lower triangle.

    d_11†
    l_21†  d_22†
    l_31†  l_32†  a_33*
    l_41†  l_42†  a_43*  a_44
    l_51†  l_52†  a_53*  a_54  a_55

    † Entries used. * Entries changed.

Fig. 3.9.2. Computational sequence for left-looking Gaussian elimination on the lower triangle. LDL^T decomposition.
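To connect these pictures with code, here is a minimal right-looking LDL^T factorization sketch in Fortran for a dense symmetric matrix held in its lower triangle. It assumes that diagonal pivots are acceptable without interchanges and is our own illustration rather than a production routine.

    subroutine ldlt_factorize(n, a)
    ! Overwrite the lower triangle of the symmetric matrix a with L and D:
    ! d(k) is stored in a(k,k) and l(i,k), i > k, below the diagonal.
    ! No pivoting: we assume each pivot a(k,k) is nonzero when it is reached.
    integer, intent(in) :: n
    real, intent(inout) :: a(n,n)
    real :: w(n)
    integer :: i, j, k
    do k = 1, n-1
       do i = k+1, n
          w(i) = a(i,k)                      ! w(i) = d(k)*l(i,k)
          a(i,k) = a(i,k)/a(k,k)             ! multiplier l(i,k)
       end do
       do j = k+1, n
          do i = j, n
             a(i,j) = a(i,j) - a(i,k)*w(j)   ! a(i,j) - l(i,k)*d(k)*l(j,k)
          end do
       end do
    end do
    end subroutine ldlt_factorize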

3.10 Multiple right-hand sides and inverses

So far we have considered the case with a single right-hand side b, but it is not difficult to extend the algorithms to the case with k right-hand sides. We can now carry out the forward substitution phase (3.5.9) as

    LC = B,                                                   (3.10.1)

where C is the n × k matrix representing the results of forward substitution. This can be followed by computing the k solutions in the n × k matrix X from

    UX = C.                                                   (3.10.2)

If we always knew all the right-hand sides in advance, we could carry out the forward substitution step (3.10.1) while doing the factorization. As for the case with a single right-hand side, there are many problems for which we do not know all the right-hand sides in advance, or even where one right-hand side depends on the solution of a previous problem. In these cases, saving the multipliers gives a significant computational saving. The computation may be ordered in various ways, as for Gaussian elimination (see Section 3.8). C may be computed column by column, corresponding to carrying out forward substitution one right-hand side at a time, or C may be computed row by row, corresponding to calculating the first entry of each righthand side, then the second, etc. We have similar freedom in computing X either a column at a time or a row at a time.
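A minimal sketch of the two phases (3.10.1) and (3.10.2), processing the right-hand sides one column at a time and reusing the illustrative routines forward_substitute and back_substitute given after Section 3.2 (so L is passed with its unit diagonal stored explicitly), might look as follows; the array and routine names are ours.

    subroutine solve_many(n, k, l, u, b, x)
    ! Solve A*X = B for k right-hand sides, given the factorization A = L*U.
    integer, intent(in) :: n, k
    real, intent(in) :: l(n,n), u(n,n), b(n,k)
    real, intent(out) :: x(n,k)
    real :: c(n)
    integer :: j
    do j = 1, k
       call forward_substitute(n, l, b(:,j), c)   ! L*c = b(:,j), equation (3.10.1)
       call back_substitute(n, u, c, x(:,j))      ! U*x(:,j) = c, equation (3.10.2)
    end do
    end subroutine solve_many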


An important special case is where the inverse A^{-1} is required, since this may be obtained by solving

    AX = I                                                    (3.10.3)

by taking columns of I as successive right-hand side vectors. Note that, in this case, a worthwhile computational saving is available in equation (3.10.1) because the forward substitution operations always leave a right-hand side that is a lower triangular matrix (that is, c_{ij} = 0 for i < j).

If a sequence of problems with the same matrix, but different right-hand sides b, is to be solved, it is tempting to calculate the inverse and use it to form the product

    x = A^{-1} b,                                             (3.10.4)

but this is not necessary. This is because the number of floating-point operations needed to form the product A^{-1} b is approximately that of solving

    LUx = b                                                   (3.10.5)

given the factorization A = LU (see Section 3.11). The form (3.10.4) may be advantageous on parallel architectures (again, see Section 3.11), but usually the work of calculating the inverse is an unnecessary expense. In the sparse case, the inverse A^{-1} is usually dense (see Section 15.6), whereas the factors L and U are usually sparse, so using relation (3.10.4) may be very much more expensive (perhaps by factors of hundreds or thousands) than using relation (3.10.5).

3.11 Computational cost

One commonly used measure of computational cost is the number of floating-point operations and we use it here because it represents a lower bound on the cost. Notice that we adhere to our definition of floating-point operation in Section 1.1, although some authors have in the past counted the number of multiply-add pairs and most architectures have hardware for this combination. Since all of the factorization methods presented in this chapter perform the same operations, we examine the costs of Gaussian elimination. We continue to examine the algebraic problem and assume that the matrix has been reordered, if necessary, so that each inequality a_{kk}^{(k)} ≠ 0 holds. Let r_k be the number of entries to the right of the main diagonal in row k of A^{(k)}. Let c_k be the number of entries below the main diagonal in column k of A^{(k)}. For a dense matrix, these have values r_1 = n-1, c_1 = n-1, r_2 = n-2, ..., r_n = 0. Computing A^{(k+1)} from A^{(k)} involves computing c_k multipliers, and then performing c_k r_k multiply-add pairs. The total cost of Gaussian elimination, excluding the work on the right-hand side, is therefore given by the formula

    2 \sum_{k=1}^{n-1} c_k r_k + \sum_{k=1}^{n-1} c_k = (2/3)n^3 - (1/2)n^2 - (1/6)n.    (3.11.1)

As n becomes larger the leading term dominates the cost, and formula (3.11.1) is usually written in the form

    (2/3)n^3 + O(n^2).                                        (3.11.2)

When A is symmetric and can be factorized symmetrically, only the upper (or lower) triangle of the matrices A^{(k)} need be calculated, which yields the formula

    (1/3)n^3 + O(n^2).                                        (3.11.3)

Referring to equations (3.2.3b), we see that the back-substitution operation involves one multiply-add pair for each off-diagonal entry of U. Hence, the back-substitution cost, including the divide for each u_{kk}, is given by the simple formula

    n + 2 \sum_{i=1}^{n-1} r_i = n^2.                         (3.11.4)

Similarly, the forward substitution cost is given by the formula

    2 \sum_{i=1}^{n-1} c_i = n^2 - n.                         (3.11.5)
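For completeness, the closed form in (3.11.1) follows from the dense values c_k = r_k = n-k; a short verification, written in LaTeX form with the substitution m = n-k:

\[
2\sum_{k=1}^{n-1} c_k r_k + \sum_{k=1}^{n-1} c_k
  = 2\sum_{m=1}^{n-1} m^2 + \sum_{m=1}^{n-1} m
  = \frac{(n-1)n(2n-1)}{3} + \frac{(n-1)n}{2}
  = \tfrac{2}{3}n^3 - \tfrac{1}{2}n^2 - \tfrac{1}{6}n .
\]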

To compute A^{-1} efficiently requires solving the systems of equations

    AX = I                                                    (3.11.6)

(see Section 3.10). Taking care to exploit the fact that the right-hand side is lower triangular throughout forward elimination, the overall cost is (4/3)n^3 + O(n^2), approximately twice the cost of the factorization itself (see Exercise 3.12). Note that using A^{-1} to compute the solution of Ax = b as A^{-1}b and computing x from the factorization both require 2n^2 operations. In the sparse case, using the inverse may be much more expensive. Note that a quite different way of computing certain entries of the inverse of a sparse matrix is discussed in Section 15.7.

Using these cost figures we note that, on a serial machine, it is better not to use A^{-1} explicitly even when computing the matrix transformation

    Q = A^{-1} B A.                                           (3.11.7)

The cost of multiplying two n×n dense matrices in the straightforward way is 2n^3 (see Exercise 3.13). Thus, actually using A^{-1} to compute Q requires (ignoring O(n^2) terms)

    1. Compute A^{-1}      : 2n^3
    2. Form A^{-1} B       : 2n^3      Total 6n^3.            (3.11.8)
    3. Form (A^{-1} B) A   : 2n^3

Alternatively, we may use the following steps

    1. Factor A as LU      : (2/3)n^3
    2. Solve AX = B        : 2n^3      Total 4 2/3 n^3.       (3.11.9)
    3. Form XA             : 2n^3

Again, we emphasize that in the sparse case the advantage of avoiding the computation of A^{-1} is much more marked.

Unfortunately, simply counting floating-point operations is not adequate on modern computers. For example, if the data is not already in the cache, it may take longer to copy it there than to do the actual floating-point operations. We see this in the performance of the Level 3 BLAS, where data reuse yields performance close to the peak for the hardware. This means that two algorithms with identical operation counts can have quite different performance. In this case, we really need to consider the movement of data and cache speeds, sizes, and latency, as we discussed in Section 1.4.

There is scope for exploiting parallelism in the use of A^{-1}, since the computation of each component of A^{-1}x can be regarded as an independent calculation that involves a row of A^{-1}. It may therefore be worthwhile to calculate A^{-1} explicitly, although parallelism can also be used in forward substitution and back-substitution by blocking the computation (see Section 3.13). Although, as we remarked in Section 3.10, it is even less practical to think of forming an explicit inverse for a sparse matrix, small dense matrices occur in many sparse solvers and the use of their explicit inverses can aid parallelism.

There are algorithms that reduce the number of operations for n × n matrix by matrix multiplications below O(n^3). Strassen (1969) reduces the number of operations to O(n^{log_2 7}) ≈ O(n^{2.8074}). Others, based on Strassen's work, involve fewer operations if n is very large. For example, Pan (1984) requires O(n^{2.496}) operations. There are terms in the operation counts that are subdominant for very large values of n but are significant for modest values, so Strassen's algorithm is used only for dense matrices whose order is large, with a threshold of around 1000. The other algorithms have subdominant terms that are so large that we doubt if they are useful in practice. We also note that the stability of these techniques is slightly worse than straightforward matrix-matrix multiplication (Higham 2002). For these reasons, we assume in this book that multiplying two n × n matrices involves 2n^3 floating-point operations.

3.12 Partitioned factorization

The ideas discussed so far in this chapter may readily be extended to partitioned matrices, the blocks being treated similarly to the single entries of the previous sections. In Section 3.8, we suggested this concept with the focus on keeping portions of the computation in cache, but the idea of partitioned matrices is much more broadly applicable. In Chapter 1, the 14 × 14 matrix we used to motivate the study of sparsity was a block matrix; each entry was a 3 × 3 matrix representing the three degrees of freedom associated with each joint. Often a very large problem will have a natural partitioned structure associated with pieces of the problem. For example, a model of the interconnected electrical grid for the Western Region of the United States is illustrated in Figure B.10 in Appendix B. The blocks along the main diagonal represent the tighter interconnection within major cities of the region.

In general, then, we are interested in factorizing matrices of the form

    B = [B_{11} B_{12} B_{13} ... B_{1k};
         B_{21} B_{22} B_{23} ... B_{2k};
         B_{31} B_{32} B_{33} ... B_{3k};
           ...    ...    ...  ...  ... ;
         B_{k1} B_{k2} B_{k3} ... B_{kk}]
      = [A_{11} A_{12}; A_{21} A_{22}] = A,                   (3.12.1)

where A_{11} = B_{11}, A_{12} = [B_{12} ... B_{1k}], etc. To factorize the matrix B, we can therefore factorize

    A = [A_{11} A_{12}; A_{21} A_{22}]                        (3.12.2)

and then apply the relationship (3.12.1) recursively to deal with each partitioned block of B. We assume that A_{11} and A_{22} are square submatrices. We will refer to the submatrices on the diagonal as the diagonal blocks; note that, in general, they are not diagonal matrices.

The LU factorization of A may be written in the form

    A = [A_{11} A_{12}; A_{21} A_{22}] = [L_{11} 0; L_{21} L_{22}] [U_{11} U_{12}; 0 U_{22}],    (3.12.3)

where L_{11} and L_{22} are lower triangular submatrices and U_{11} and U_{22} are upper triangular submatrices. This is just a partitioning of the standard LU factorization. By equating corresponding blocks we find the relations

    A_{11} = L_{11} U_{11},                                   (3.12.4a)
    L_{11} U_{12} = A_{12},                                   (3.12.4b)
    L_{21} U_{11} = A_{21}, and                               (3.12.4c)
    Ä_{22} = A_{22} - L_{21} U_{12} = L_{22} U_{22}.          (3.12.4d)

Therefore, form (3.12.3) may be constructed by triangular factorization of A_{11}, formation of the columns of U_{12} and the rows of L_{21} by forward substitution, and finally the triangular factorization of the Schur complement of A_{22}, A_{22} - L_{21} U_{12}. An interesting variation, known as implicit block factorization, results if the factorizations A_{11} = L_{11} U_{11} and Ä_{22} = L_{22} U_{22} are stored, but U_{12} is not. When U_{12} is needed as a multiplier, L_{11}^{-1} A_{12} is used instead. This has little merit in the dense case, but in the sparse case it is extremely likely that U_{12} has many more entries than A_{12} so less storage will be needed and sometimes less computation also. There is a corresponding implicit version of the factorization (3.12.3) in which L_{21} is not stored.

Note that just as we need to avoid zero pivots during ordinary factorization, we must here avoid singular block pivots. The example

    A = [1 1 1; 1 1 -1; 0 1 1]                                (3.12.5)

illustrates the case of a nonsingular matrix with a singular first block. We discuss such issues in Section 4.4.

3.13 Solution of block triangular systems

If the matrix L has the form

    L = [L_{11};
         L_{21} L_{22};
         L_{31} L_{32} L_{33};
           :      :      :    . ;
         L_{m1} L_{m2} L_{m3} ... L_{mm}],                    (3.13.1)

the system Lx = c may be solved by block forward substitution, that is, successively solving

    L_{kk} x_k = c_k - \sum_{j=1}^{k-1} L_{kj} x_j,  k = 1, 2, ..., m,    (3.13.2)

which allows Level 2 BLAS to be used. Alternatively, we may set c^{(1)} = c and for k = 1, 2, ..., m perform the calculations

    L_{kk} x_k = c_k^{(k)},
    c_j^{(k+1)} = c_j^{(k)} - L_{jk} x_k,  j = k+1, k+2, ..., m,          (3.13.3)

again allowing Level 2 BLAS to be used. This form is inherently more suitable for parallel working because the updates to the different parts of c are independent.

Similarly, if the matrix U has the form

    U = [U_{11} U_{12} U_{13} ... U_{1m};
                U_{22} U_{23} ... U_{2m};
                       U_{33} ... U_{3m};
                                .    :  ;
                                  U_{mm}],                    (3.13.4)

the system Ux = c may be solved by block back-substitution, that is, successively solving

    U_{kk} x_k = c_k - \sum_{j=k+1}^{m} U_{kj} x_j,  k = m, m-1, ..., 1,    (3.13.5)

which allows Level 2 BLAS to be used. Alternatively, we may set c^{(1)} = c and for k = 1, 2, ..., m perform the calculations:

    U_{kk} x_k = c_k^{(k)},
    c_j^{(k+1)} = c_j^{(k)} - U_{jk} x_k,  j = 1, 2, ..., k-1.              (3.13.6)

Again, both forms allow Level 2 BLAS to be used and the second form is inherently more suitable for parallel working.

Exercises

3.1 For the matrix

    A = A^{(1)} = [1 2 1; 1 1 1; -1 0 1],

verify the following Gaussian elimination steps:

    L^{(1)} = [1 0 0; -1 1 0; 1 0 1],    L^{(1)} A^{(1)} = A^{(2)} = [1 2 1; 0 -1 0; 0 2 2],

    L^{(2)} = [1 0 0; 0 1 0; 0 2 1],    L^{(2)} A^{(2)} = A^{(3)} = U = [1 2 1; 0 -1 0; 0 0 2],

    L = (L^{(1)})^{-1} (L^{(2)})^{-1} = [1 0 0; 1 1 0; -1 0 1] [1 0 0; 0 1 0; 0 -2 1] = [1 0 0; 1 1 0; -1 -2 1].

Verify also that these matrices satisfy the relation LU = A.

3.2 With a 3 × 3 matrix, calculate the entries in L and U by directly multiplying L and U. Show that you get the matrix U of (3.3.6).

3.3 With row permutations at each step of the elimination, equation (3.5.3) takes the form

    U = L^{(n-1)} P^{(n-1)} L^{(n-2)} P^{(n-2)} ... L^{(1)} P^{(1)} A,    (3.13.7)

where P^{(k)} represents an interchange between row k and a later row and L^{(k)} has the form shown in equation (3.5.1). Show that this can also be expressed in the form

    U = L̄^{(n-1)} L̄^{(n-2)} ... L̄^{(1)} P^{(n-1)} P^{(n-2)} ... P^{(1)} A,    (3.13.8)

where L̄^{(k)} is the matrix

    L̄^{(k)} = P^{(n-1)} P^{(n-2)} ... P^{(k+1)} L^{(k)} P^{(k+1)} ... P^{(n-2)} P^{(n-1)}.    (3.13.9)

Interpret equation (3.13.8) as Gaussian elimination applied to the permuted matrix P(n−1) ... P(1) A.

3.4 Compute L^{(1)} and (L^{(1)})^{-1} for the matrix A of Exercise 3.1.

3.5 Find the product [1 0 0 0; 2 1 0 0; 3 0 1 0; 4 0 0 1] [1 0 0 0; 0 1 0 0; 0 5 1 0; 0 6 0 1].

3.6 Determine the formulae for computing L, D, and U, where A = LDU, L is unit lower triangular, U is unit upper triangular, and D is diagonal.

3.7 Given A = LDU as in Exercise 3.6, what are the steps needed to solve Ax = b?

3.8 Prove that if A is symmetric and positive definite, then for A = LDL^T defined in Section 3.9, each d_{ii} must be positive.

3.9 Prove that if A is symmetric and positive definite then no interchanges are required in computing A = LDL^T and the algorithm cannot fail, that is, the inequalities a_{kk}^{(k)} ≠ 0, k = 1, 2, ..., n, hold. (Hint: Use Exercise 3.8.)

3.10 Compute the inverse of the matrix [1 0 0 0; 2 1 0 0; 3 0 1 0; 4 0 0 1].

3.11 What is the computational cost of forward substitution given in equations (3.2.4) for the special case where b = (0 0 ... 0 1)^T?

3.12 Show that, given the factorization A = LU, the number of operations needed to solve AX = I is (4/3)n^3 + O(n^2). (Hint: Use Exercise 3.11.)

3.13 Determine the cost of multiplying two general n × n matrices in the standard way.

3.14 Show how one can use the block factorization (3.12.3) to solve a partitioned system of equations.

3.15 If

    A = [A_{11} A_{12}; A_{21} A_{22}],

where A11 is a lower triangular k×k matrix, A22 is 1×1, and A12 and A21 are both dense, compare the computational costs of using factorization (3.12.3) and its implicit variant.

4 GAUSSIAN ELIMINATION FOR DENSE MATRICES: NUMERICAL CONSIDERATIONS We review the impact of inexact computation (caused by data uncertainty and computer rounding error) on Gaussian elimination. We need to ensure that the algorithm produces answers that are consistent with the given data; that is, that the algorithm is stable. We want to know whether small changes to the original data lead to large changes in the solution; that is, whether the problem is ill-conditioned. We show how to control stability and measure conditioning. The concepts of pivoting, scaling, iterative refinement, and estimating condition numbers are introduced, which will be tools used while addressing sparsity issues in the rest of the book.

4.1 Introduction

Neither problem data nor computer arithmetic is exact. Thus, when the algorithms of the previous chapter are implemented on a computer, the solutions will not be exact. As we prepare to solve very large systems of equations, perhaps with millions of unknowns, it is natural to be concerned about the effect of these inaccuracies on the final result.

There are two ways to study the effect of arithmetic errors on the solution of sets of equations. The most obvious way, accumulating their effect through every stage of the algorithm, is extremely tedious and usually produces very pessimistic results unless full account is taken of the correlations between errors. There is a less direct way to assess this error, which has become standard practice for numerical linear algebra computations. It was pioneered by Wilkinson (1961) and involves answering two questions:
(i) Is the computed solution x̃ the exact solution of a 'nearby' problem?
(ii) If small changes are made to the given problem, are changes in the exact solution also small?

A positive answer to the first question means that we have been able to control the computing error. That is, the error in the computed solution is no greater than would result from making small perturbations to the original problem and then solving the perturbed problem exactly. In this case, we say that the algorithm is stable. We seek always to achieve this.

If the answer to the second question is positive, that is, if small changes in the data produce small changes in the solution, we say that the problem is well-conditioned. Conversely, when small changes in the data produce large changes in the solution, we say that the problem is ill-conditioned. Note that this is a property of the problem and has nothing to do with the method used to solve it.

Care is always needed in the ill-conditioned case. Data errors and errors introduced by the algorithm may cause large changes to the solution. Sometimes (see, for example, the simple case discussed in Section 4.10) the ill-conditioning may be so severe that only a few figures, or even none, are computed with certainty even though the algorithm is stable. When both the algorithm is stable and the problem is well conditioned, the computed solution will be accurate (see Exercise 4.1).

Because ill-conditioning is a property of the problem, it may be altered by changing the problem formulation. An example of this can be seen in the solution of the linear least-squares problem

    Ax = b,                                                   (4.1.1)

where A has m rows and n columns, with m > n, which we solve in the sense of minimizing ||b - Ax||_2^2. We can formulate the least-squares system as the normal equations

    A^T A x = A^T b,                                          (4.1.2)

where now A^T A is square, symmetric, and positive definite. While the exact solutions of the two sets of equations are the same, the sensitivity of the solution to small changes in b is not the same as its sensitivity to small changes in A^T b. As an illustration, consider the problem

    [10 11; 11 12; 12 13] x = [21; 23; 25],                   (4.1.3)

for which the exact solution is x = [1; 1]. If the first component of b is changed by 0.01%, the solution changes to [0.987; 1.012]. If the first component of A^T b is changed by 0.01%, the solution changes to [6.51; -4.05]. It is clear that the original problem is far less sensitive to small changes than the normal equations, that is, reformulation of the problem as a set of normal equations is undesirable from the point of view of conditioning. This is true more generally for normal equations, see Higham (2002, chapter 20), but a full exploration of the sparse least-squares problem is beyond the scope of this book.

This discussion has been deliberately vague, without defining the precise meanings of large and small or giving rigorous bounds. These issues are addressed and illustrated in the rest of this chapter.

4.2 Computer arithmetic error

Nowadays, most computers use IEEE binary floating-point arithmetic (IEEE 1985) to represent real numbers. They are held in the form
    $\pm d_1.d_2\ldots d_p \times 2^i$
where i is an integer, $d_1, d_2, \ldots, d_p$ are binary digits (0 or 1) and $d_1 \ne 0$ unless $d_1 = d_2 = \ldots = d_p = 0$. The number of digits p is 24 in single precision and 53 in double precision. The exponent i is limited to the range $-126 \le i \le 127$ in single precision and $-1022 \le i \le 1023$ in double precision. This allows numbers in the ranges $2^{-126} \approx 1.2{\times}10^{-38}$ to $(1-2^{-24}){\times}2^{128} \approx 3.4{\times}10^{38}$ and $2^{-1022} \approx 2.2{\times}10^{-308}$ to $(1-2^{-53}){\times}2^{1024} \approx 1.8{\times}10^{308}$, respectively.
Representation errors occur as soon as numbers are entered into the computer unless they are exactly representable in this format. For instance, 1/3 cannot be held exactly. For virtually all problems, as soon as A and b are entered into the computer, they are replaced by $A + \Delta A$ and $b + \Delta b$, where $\Delta A$ and $\Delta b$ are errors. This is likely to be true even for problems with no inexact data. Often the data for the matrix A and the vector b are derived from measurement or other approximation. In this case, these approximations combine with the computer representation of the numbers, adding an independent source of perturbation.
Once the arithmetic begins with these numbers, further error is introduced. A fundamental problem is that the associativity property of addition does not hold; for example, the relation
    $a \tilde{+} (b \tilde{+} c) = (a \tilde{+} b) \tilde{+} c$,    (4.2.1)
where $\tilde{+}$ represents computer addition, is usually untrue.
For the purposes of illustration in this chapter, we use a hypothetical computer with a 3-decimal floating-point representation of the form $\pm d_1.d_2d_3 \times 10^i$ where i is an integer, $d_1, d_2, d_3$ are decimal digits and $d_1 \ne 0$ unless $d_1 = d_2 = d_3 = 0$. For convenience, we will write such numbers in fixed-point form (for example, 1.75, 0.00171, 14400). In this case, note that the relations
    $2420 \tilde{+} 1.58 = 2420$,    (4.2.2)
    $-2420 \tilde{+} (2420 \tilde{+} 1.58) = 0$,    (4.2.3)
and
    $(-2420 \tilde{+} 2420) \tilde{+} 1.58 = 1.58$    (4.2.4)
are all true. This kind of error, without proper control, can have a devastating effect on the solution. On the other hand, with the care discussed in the rest of this chapter ensuring algorithm stability, very large systems of equations can be solved with little loss of accuracy from this source.
We find it convenient to refer to the relative precision $\epsilon$ of the computer. This is the smallest number such that if $\dagger$ is any of the arithmetic operators +, -, ×, /, and $\tilde{\dagger}$ is the corresponding computer operation, then $a \tilde{\dagger} b$ has the value $a(1 + \epsilon_a) \dagger b(1 + \epsilon_b)$, where $\epsilon_a$ and $\epsilon_b$ are perturbations satisfying the inequalities $|\epsilon_a| \le \epsilon$, $|\epsilon_b| \le \epsilon$. For IEEE arithmetic, $\epsilon$ is $2^{-24} \approx 0.6{\times}10^{-7}$ in single precision and $2^{-53} \approx 1.1{\times}10^{-16}$ in double precision; furthermore, the accuracy in IEEE arithmetic is guaranteed relative to the exact result so that $a \tilde{\dagger} b$ has the value $(a \dagger b)(1 + \epsilon_c)$ with $|\epsilon_c| \le \epsilon$, which can be interpreted as the same as the earlier bound with either $\epsilon_a$ or $\epsilon_b$ equal to zero. In our hypothetical 3-digit computer, we assume that the arithmetic operations are performed as accurately as possible and, hence, $\epsilon$ has the value 0.0005.
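The loss of associativity is just as easy to observe in IEEE double precision; the fragment below is a minimal illustration (Python floats are IEEE double precision) of the pattern of (4.2.2)-(4.2.4), with 1.0e16 playing the role of 2420 and 1.0 the role of 1.58.

    big, small = 1.0e16, 1.0
    print(big + small == big)                 # True: the small addend is lost, cf. (4.2.2)
    print(-big + (big + small))               # 0.0, cf. (4.2.3)
    print((-big + big) + small)               # 1.0, cf. (4.2.4)
    print(-big + (big + small) == (-big + big) + small)   # False: addition is not associative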

4.3 Algorithm instability

Using our hypothetical 3-digit computer, we illustrate the effects of computer errors on the algorithms of Chapter 3 with the matrix
    $A = \begin{pmatrix} 0.001 & 2.42 \\ 1.00 & 1.58 \end{pmatrix}.$    (4.3.1)
Since the relation $a_{11} \ne 0$ holds, the factorization proceeds without interchanges and we compute
    $a_{22}^{(2)} = -2420 \tilde{+} 1.58 = -2420.$    (4.3.2)
The effect of this error is seen, for example, in solving the equation
    $Ax = \begin{pmatrix} 0.001 & 2.42 \\ 1.00 & 1.58 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5.20 \\ 4.57 \end{pmatrix}.$    (4.3.3)
The triangular system that results from applying Gaussian elimination is
    $\begin{pmatrix} 0.001 & 2.42 \\ 0 & -2420 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5.20 \\ -5200 \end{pmatrix},$    (4.3.4)
which would be exact if we had started from
    $\begin{pmatrix} 0.001 & 2.42 \\ 1.00 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5.20 \\ 0 \end{pmatrix}.$    (4.3.5)

The computed solution is
    $\tilde{x} = \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix} = \begin{pmatrix} 0.00 \\ 2.15 \end{pmatrix},$    (4.3.6)
while the 3-place approximation to the true solution is
    $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \simeq \begin{pmatrix} 1.18 \\ 2.15 \end{pmatrix}.$    (4.3.7)

Instead of the exact LU factorization of Chapter 3, we have computed an approximate factorization $A \simeq \tilde{L}\tilde{U}$, given by
    $\begin{pmatrix} 0.001 & 2.42 \\ 1.00 & 1.58 \end{pmatrix} \simeq \begin{pmatrix} 1 & \\ 1000 & 1 \end{pmatrix} \begin{pmatrix} 0.001 & 2.42 \\ & -2420 \end{pmatrix}.$    (4.3.8)


The factorization-error matrix
    $H = \tilde{L}\tilde{U} - A$    (4.3.9)
must be small for a stable algorithm (the term 'stable' was defined in Section 4.1), and for the current example we find
    $H = \begin{pmatrix} 0.0 & 0.0 \\ 0.0 & -1.58 \end{pmatrix},$    (4.3.10)
which is not small relative to A, so we conclude that our algorithm is unstable. We illustrate this phenomenon again in Exercise 4.3.
Another way to show that the algorithm is unstable is to examine the residual
    $r = b - A\tilde{x},$    (4.3.11)
and find that $\|r\|$ is not small compared with $\|b\|$ or compared with $\|A\|\,\|\tilde{x}\|$ (see Appendix A for definitions of norms). In our example, the residual (computed exactly) is
    $r = \begin{pmatrix} -0.003 \\ 1.173 \end{pmatrix},$    (4.3.12)
whose norm is not small compared with $\|b\|$.
Note that the damage that led to the inaccurate factorization $\tilde{L}\tilde{U}$ (see equations (4.3.9) and (4.3.10)) was done with the computation (4.3.2), which treated 1.58 as zero. In fact, it would have treated any number in the interval (−5, 5) as zero. The reason that it did so was the large growth in size that took place in forming $a_{22}^{(2)}$ from $a_{22}$. The backward error analysis of Wilkinson (1965), as extended by Reid (1971), shows that Gaussian elimination (or LU factorization) is stable provided such growth does not take place. Reid's results, when modified for our wider bounds on rounding errors (see the end of Section 4.2), yield the inequality
    $|h_{ij}| \le 5.01\,\epsilon\,n\,\max_k |a_{ij}^{(k)}|$    (4.3.13)
for coefficients of the factorization-error matrix $H = \tilde{L}\tilde{U} - A$, where $\epsilon$ is the relative precision and n is the matrix order.
Unfortunately, it is not practical to control the sizes of all the coefficients $a_{ij}^{(k)}$ and, in any case, some growth in the size of a coefficient that begins by being very small may be perfectly acceptable. It is normal practice, therefore, to control the largest coefficient and weaken inequality (4.3.13) to the bound
    $|h_{ij}| \le 5.01\,\epsilon\,n\,\rho,$    (4.3.14)
where $\rho$ is given by the equation
    $\rho = \max_{i,j,k} |a_{ij}^{(k)}|.$    (4.3.15)
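The factorization error H of (4.3.9) is easy to reproduce in double precision by shrinking the (1,1) entry of a 2 by 2 matrix; the sketch below (our own illustration in Python/NumPy, not the 3-digit example above) forms the factors without interchanges and measures H relative to A.

    import numpy as np

    A = np.array([[1.0e-17, 2.42],
                  [1.00,    1.58]])
    # LU factors computed without interchanges: the multiplier l21 = a21/a11 is huge.
    l21 = A[1, 0] / A[0, 0]
    L = np.array([[1.0, 0.0],
                  [l21, 1.0]])
    U = np.array([[A[0, 0], A[0, 1]],
                  [0.0,     A[1, 1] - l21 * A[0, 1]]])
    H = L @ U - A
    print(np.abs(H).max() / np.abs(A).max())   # about 0.65: H is not small relative to A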


Note, however, the crucial influence of scaling. For example, if the entries in one row are much bigger than those in the others, we will tolerate large growth in these other rows, which may be disastrous to the accuracy of the solution. We return to the consideration of scaling in Sections 4.14 and 4.15, assuming for the present that our problem is well scaled. Fortunately, most problems occur naturally this way. Poor scaling is often associated with such features as different units (for example, distances measured in millimetres and kilometres).

4.4 Controlling algorithm stability through pivoting

When performing Gaussian elimination, we showed in Chapter 3 that it was necessary to interchange rows to avoid a zero pivot. In Section 4.3, we illustrated that the use of a small pivot can lead to growth in the size of the computed entries. On our 3-digit computer, the small pivot led to a large multiplier resulting in the addition −2420 + 1.58, which destroys the value 1.58. Interchanges are again necessary for stability, this time to control growth in the resulting numbers. We study a variety of pivoting strategies aimed at controlling growth in the course of the factorization.

4.4.1 Partial pivoting

One reason why we might get growth that destroys smaller values in the course of the computation comes from having a very large multiplier. A solution to this is to require that the inequality $|l_{ij}| \le 1$ should hold for the coefficients of the matrix L. This is readily achieved by reordering the rows of the matrix so that the new pivot $a_{kk}^{(k)}$ satisfies the inequalities
    $|a_{kk}^{(k)}| \ge |a_{ik}^{(k)}|, \quad i > k.$    (4.4.1)
The k-th column must be scanned below the main diagonal to determine its entry of largest magnitude. The row containing this entry is exchanged with the k-th row and so becomes the new k-th row. This strategy is called partial pivoting.
Applying this strategy to example (4.3.1), we would interchange rows one and two of A before beginning the reduction and would compute the factors
    $\begin{pmatrix} 1.00 & 1.58 \\ 0.001 & 2.42 \end{pmatrix} = \begin{pmatrix} 1 & \\ 0.001 & 1 \end{pmatrix} \begin{pmatrix} 1.00 & 1.58 \\ & 2.42 \end{pmatrix}.$    (4.4.2)
Using these factors, we compute
    $\tilde{x} = \begin{pmatrix} 1.17 \\ 2.15 \end{pmatrix},$    (4.4.3)
which is almost correct to three decimal places ($x_1 = 1.176$).
Note that this strategy for computation subject to rounding errors is similar to the strategy used for treating a zero pivot in exact arithmetic. In the exact case, we were concerned only to avoid a zero pivot; in the presence of rounding, small pivots must also be avoided. Note also that a zero pivot gives the ultimate disaster of infinite growth and total loss of accuracy.
In practice, Gaussian elimination with partial pivoting is considered to be a stable algorithm. This is based on experience rather than rigorous analysis, since the best a priori bound that can be given for a dense matrix is
    $\rho \le 2^{n-1} \max_{i,j} |a_{ij}|.$    (4.4.4)
This bound is easy to establish (see Exercise 4.5), but generally very pessimistic, particularly in the case where n is very large.
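As an illustration of the strategy itself (and not of any production code), the following Python/NumPy sketch performs dense Gaussian elimination with partial pivoting, returning a row permutation and the triangular factors; it assumes the matrix is nonsingular.

    import numpy as np

    def lu_partial_pivoting(A):
        """Dense LU factorization with row interchanges, PA = LU (illustrative sketch)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        p = np.arange(n)                          # row permutation
        for k in range(n - 1):
            # Scan column k on and below the diagonal for the entry of largest magnitude.
            m = k + int(np.argmax(np.abs(A[k:, k])))
            if m != k:
                A[[k, m]] = A[[m, k]]
                p[[k, m]] = p[[m, k]]
            # Store the multipliers in the strictly lower triangle and update the rest.
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        L = np.tril(A, -1) + np.eye(n)
        U = np.triu(A)
        return p, L, U

    p, L, U = lu_partial_pivoting(np.array([[0.001, 2.42], [1.00, 1.58]]))
    # L has entries of modulus at most one and A[p] equals L @ U up to rounding.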

4.4.2 Threshold pivoting

While we want to control the size of the multiplier, we may find adherence to (4.4.1) overly restrictive when other factors need to be taken into account, for example, sparsity or moving data in and out of cache. The compromise strategy is termed threshold pivoting and requires the satisfaction of the inequalities
    $|a_{kk}^{(k)}| \ge u\,|a_{ik}^{(k)}|, \quad i > k,$    (4.4.5)
where u, the threshold parameter, is a value in the range $0 < u \le 1$. If u has the value 1, we have the partial pivoting condition while if u is very small, we have essentially no pivoting except to avoid zeros. Any value between these allows some pivoting accompanied by some growth for the multipliers. The bound (4.4.4) is replaced by
    $\rho \le (1 + u^{-1})^{n-1} \max_{i,j} |a_{ij}|.$    (4.4.6)
For the case where A is dense and can be held in memory, there is no advantage to introducing this flexibility, since the same number of compares and the same amount of computation will be done independently of the value of u. We will return to this strategy in Section 5.2.1 when we consider the sparse case.

4.4.3 Rook pivoting

Growth in the course of the computation may occur for reasons other than having a multiplier larger than 1. Growth can take place steadily with multipliers bounded by 1, as is shown by a carefully constructed case, which achieves the bound of equation (4.4.4). Relative growth also takes place in the example
    $A = \begin{pmatrix} 1.00 & 2420 \\ 1.00 & 1.58 \end{pmatrix}.$    (4.4.7)
The larger number may be the result of accumulated growth in the (1,2) position where all entries in the original, much larger, matrix were in the range (0.5, 5). Note that when working with three significant decimals, the value 1.58 is destroyed by the calculation (−2420 + 1.58), but not as a result of a large multiplier. This kind of problem could also be the result of a poorly scaled problem, which we discuss in Section 4.14.
A way to address this kind of growth is to use a larger number not in the first column as the pivot. One way to do this is to use rook pivoting (Neal and Poole 1992), where the pivot is chosen if it is larger than or equal to any entry in both its row and its column. That is to say, the new pivot $a_{kk}^{(k)}$ satisfies the inequalities
    $|a_{kk}^{(k)}| \ge \max\{|a_{ik}^{(k)}|, |a_{ki}^{(k)}|\}, \quad i > k.$    (4.4.8)
It is intuitively clear that this strategy should produce less growth in the factorization than partial pivoting. The partial pivoting strategy only sought to limit the size of the multiplier. However, a multiplier that is less than one can produce a relatively large addition to an entry in the lower submatrix if an entry in the row is much larger than the pivot. Foster (1997) has shown that, for rook pivoting, the growth is bounded by
    $\rho \le f(n) \max_{i,j}|a_{ij}|, \quad f(n) = 1.5\,n^{\frac{3}{4}\log n}.$    (4.4.9)
Experience has shown that often the number of compares to find the pivot satisfying the rook criteria is only modestly more than partial pivoting. However, it is possible to create a case where every entry in the matrix must be examined (see Exercise 4.6). While it would appear that rook pivoting does a better job of controlling growth than partial pivoting, it has not been used much in computer codes for dense matrices. We do not know if this is because partial pivoting is usually stable or because of a lack of stability monitoring in existing codes. We will, however, return to rook pivoting in Section 5.2 because it is much more attractive for sparse matrices.
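A rook-pivot search can be expressed as an alternating scan of rows and columns until an entry is found that is largest in both; the minimal dense sketch below (one possible organization, not the one used in any particular code) illustrates the idea.

    import numpy as np

    def rook_pivot(B):
        """Return (i, j) of an entry of B whose magnitude is maximal in both its row
        and its column (rook pivoting search, illustrative sketch)."""
        absB = np.abs(B)
        i, j = 0, int(np.argmax(absB[0, :]))        # largest entry of row 0
        while True:
            i_new = int(np.argmax(absB[:, j]))      # largest entry in column j
            if absB[i_new, j] == absB[i, j]:
                return i, j                         # maximal in its column as well as its row
            i = i_new
            j_new = int(np.argmax(absB[i, :]))      # largest entry in row i
            if absB[i, j_new] == absB[i, j]:
                return i, j
            j = j_new

    print(rook_pivot(np.array([[1.00, 2420.0], [1.00, 1.58]])))   # (0, 1): 2420 dominates row 0 and column 1

Each step either stops or strictly increases the magnitude of the current candidate, so the scan terminates; in the worst case it can visit a large part of the matrix, as Exercise 4.6 illustrates.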

4.4.4 Full pivoting

We can push the search for a pivot one step further, by choosing as pivot the largest entry in the remaining submatrix at any step. That is, row and column interchanges are performed at each stage to ensure that the inequalities
    $|a_{kk}^{(k)}| \ge |a_{ij}^{(k)}|, \quad i \ge k,\ j \ge k$    (4.4.10)
all hold. In this case, the stronger bound
    $\rho \le f(n) \max_{i,j}|a_{ij}|, \quad f(n) = \sqrt{n\,(2\cdot 3^{1/2}\, 4^{1/3} \cdots n^{1/(n-1)})}$    (4.4.11)
has been obtained by Wilkinson (1961). This strategy is known as full or complete pivoting. Note that f(n) is much smaller than $2^{n-1}$ for large n (for example $f(100) \simeq 3570$). It was speculated that f(n) may be replaced by n, but this was disproved by Gould (1991), who found a 13×13 matrix for which the growth is 13.0205.

4.4.5 The choice of pivoting strategy

The bound (4.4.4) for partial pivoting is achieved for a carefully contrived matrix, as shown by Wilkinson (1965, p. 212), but years of experience in solving such problems in the dense case have established partial pivoting as the algorithm of choice. To this date, we have seen no documented advantage of working harder to control stability through rook pivoting or full pivoting in the dense case. Furthermore, there is no advantage to threshold pivoting in the dense case, since each pivot choice involves the same amount of computation and each has to be examined whether using threshold or partial pivoting. When we examine local pivoting strategies for the sparse case in Chapter 5, and in more detail in Chapter 7, we will bring both threshold and rook pivoting together.

4.5 Orthogonal factorization

An alternative approach to Gaussian elimination, which avoids much of the numerical difficulty discussed in this and the previous section, is that of orthogonal factorization. That is, the matrix A is factorized as
    $A = QU,$    (4.5.1)
where Q is orthogonal and U is upper triangular, so that the solution to the equation
    $Ax = b$    (4.5.2)
is easily effected through premultiplication of b by $Q^T$ and back-substitution through U. Methods using the factorization (4.5.1) are sometimes used because of their good numerical properties (Wilkinson 1965), but are about twice as costly in arithmetic as Gaussian elimination for the dense case and usually more for the sparse case. We discuss their use on sparse systems in Section 15.12.

4.6 Partitioned factorization

For the partitioned factorization of Section 3.12, there is no really satisfactory method of guaranteeing the numerical stability. Two potential numerical difficulties must be considered. First, with the partitioning
    $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$    (4.6.1)
we must verify the assumption that $A_{11}$ is nonsingular, since this need not be true even when A is nonsingular. The example (3.12.5) illustrates this point. Secondly, the stability of the matrix factorization cannot be assured by simply controlling the stability of the factorizations of $A_{11}$ and $\bar{A}_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}$.
This stability problem cannot occur for the symmetric positive-definite case as long as $A_{11}$ is a principal submatrix of A because of the equivalence with ordinary factorization and its proven stability (see Section 4.8).

4.7 Monitoring the stability

The a priori bounds in equations (4.4.4) and (4.4.6) are generally not at all tight and are of limited usefulness in monitoring the stability. We therefore now look at a posteriori methods of monitoring stability. If instability is detected, further work will be necessary to get a solution with as much accuracy as the data warrants. This might involve iterative refinement (Section 4.13), choosing the pivotal sequence afresh, or even working with greater precision.
Two forms of stability monitoring are suggested by the discussion of Section 4.3. After the factorization has been found we may compute the factorization-error matrix $H = \tilde{L}\tilde{U} - A$, or after the solution has been found we may compute the residual vector $r = b - A\tilde{x}$.
Computing the residual r is easier, and in practice this measure for the stability of the solution is usually employed. Indeed, if $\|r\|$ is small compared with $\|b\|$, we have obviously solved a nearby problem since $A\tilde{x} = b - r$; if $\|r\|$ is small compared with $\|A\|\,\|\tilde{x}\|$, then $\tilde{x}$ is the exact solution of the equation $(A + H)\tilde{x} = b$ where $\|H\|$ is small compared with $\|A\|$ (see Exercise 4.4). Thus, if $\|r\|$ is small compared with $\|b\|$, $\|A\tilde{x}\|$, or $\|A\|\,\|\tilde{x}\|$, we have done a good job in solving the equation.
For the converse, it is important to compare $\|r\|$ with $\|A\|\,\|\tilde{x}\|$, rather than with $\|b\|$ since $\|r\|$ may be large compared with $\|b\|$ in spite of an accurately computed approximate solution. The set of equations
    $\begin{pmatrix} 0.287 & 0.512 \\ 0.181 & 0.322 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} -0.232 \\ 0.358 \end{pmatrix}$    (4.7.1)
has as exact solution
    $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1000 \\ -561 \end{pmatrix},$    (4.7.2)
yet the approximate solution
    $\begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix} = \begin{pmatrix} 1000 \\ -560 \end{pmatrix},$    (4.7.3)
which would be considered very accurate on a 3-decimal computer, has residual
    $r = \begin{pmatrix} -0.512 \\ -0.322 \end{pmatrix}$    (4.7.4)
and certainly $\|r\|$ is large relative to $\|b\|$ (although not with respect to $\|A\|\,\|\tilde{x}\|$). This is an example of ill-conditioning, a subject to be explored later in this chapter.
Moreover, r tells us nothing about the behaviour for other vectors b. For example, the linear system with matrix (4.3.1) and right-hand side vector
    $b = \begin{pmatrix} 0.001 \\ 1.00 \end{pmatrix}$    (4.7.5)


has the exact solution
    $x = \begin{pmatrix} 1.00 \\ 0.00 \end{pmatrix},$    (4.7.6)
and this would be produced exactly by our 3-digit computer with the algorithm of Section 4.3. This does not imply that the factorization is stable, but rather that the instability does not affect the solution for this particular b. We cannot assume that the same will be true for other vectors b, and we know that the algorithm of Section 4.3 is unstable.
The alternative of stability monitoring by checking the size of
    $\|H\| = \|\tilde{L}\tilde{U} - A\|$    (4.7.7)
is expensive. Even computing $\rho = \max_{i,j,k}|a_{ij}^{(k)}|$ is expensive, since it involves testing $|a_{ij}^{(k)}|$ over all i, j, k, which is an $O(n^3)$ operation for a dense matrix. Attempts have been made to find bounds for $\rho$ (for example, Erisman and Reid (1974)) but, for general matrices, the bounds are too loose to be useful. We describe a practical application where this has been used effectively in Section 4.8.
The best way of monitoring stability is to compute a posteriori the scaled residual
    $\frac{\|b - Ax\|}{\|b\| + \|A\|\,\|x\|}.$    (4.7.8)
Note that this will not tell you how accurate the solution is, but only how close the problem that was actually solved is to the original problem.
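Once an approximate solution is available, the quantity (4.7.8) costs little to evaluate. A possible sketch in Python/NumPy (using the infinity norm, which is one reasonable choice) is the following.

    import numpy as np

    def scaled_residual(A, x, b):
        """Scaled residual ||b - A x|| / (||b|| + ||A|| ||x||) in the infinity norm (sketch)."""
        r = b - A @ x
        normA = np.abs(A).sum(axis=1).max()       # infinity norm of A (maximum row sum)
        return np.abs(r).max() / (np.abs(b).max() + normA * np.abs(x).max())

A value of the order of the relative precision indicates that the problem actually solved is very close to the one posed, whatever the conditioning.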

4.8 Special stability considerations

Sometimes particular properties of the matrix (either mathematical or related to the physical origin of the model) will either assure the stability of the factorization or give reason to simplify the pivoting strategies. Some of these are discussed in this section.
A real symmetric matrix is said to be positive definite if for any nonzero vector x the inequality
    $x^T A x > 0$    (4.8.1)
holds. Equivalently, A is positive definite if all its eigenvalues are positive (see Exercise 4.7). Such matrices arise frequently in practice because in many applications $x^T A x$ corresponds to the energy, which is a fundamentally positive quantity. For this case, Wilkinson (1961) has shown that with diagonal pivots no growth takes place in Gaussian elimination in the sense that
    $\rho = \max_{i,j,k} |a_{ij}^{(k)}| \le \max_{i,j} |a_{ij}|.$    (4.8.2)
As was briefly stated in Section 3.9, symmetry can be preserved in this factorization, thereby saving both storage and time. An algorithm often used is the Cholesky factorization
    $A = LL^T,$    (4.8.3)
see equation (3.9.3), or the variant
    $A = LDL^T,$    (4.8.4)
which avoids the calculation of square roots.
In another class of matrices that arise in applications, A is diagonally dominant, that is its coefficients satisfy the inequalities
    $|a_{kk}| \ge \sum_{i \ne k} |a_{ik}|, \quad k = 1, 2, ..., n$    (4.8.5)
or the inequalities
    $|a_{kk}| \ge \sum_{j \ne k} |a_{kj}|, \quad k = 1, 2, ..., n.$    (4.8.6)
Here, it may be shown (see Exercise 4.8) that
    $\rho = \max_{i,j,k} |a_{ij}^{(k)}| \le 2 \max_{i,j} |a_{ij}|,$    (4.8.7)
so that the factorization is always stable with diagonal pivots.
Another class to consider is the set of indefinite symmetric matrices. Here, choosing pivots from the diagonal may be unstable or impossible, as is illustrated by the case
    $A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.$    (4.8.8)
Standard partial pivoting destroys the symmetry and, hence, we are faced with the storage and computing demands of an unsymmetric matrix. A stable way of preserving symmetry is discussed in the next section.
For any symmetric matrix, the work of Erisman and Reid (1974) results in the bound
    $\rho \le \max_i \sum_{k=1}^{i} |d_k l_{ik}^2|$    (4.8.9)
for growth, which is much tighter than the corresponding bound for general matrices. This bound was used for a case where A was a complex symmetric (non-Hermitian) matrix arising from a frequency domain analysis problem in electrical power systems. Diagonal pivots are known to be stable in the positive-definite Hermitian case, but there is no mathematical foundation for diagonal pivots in the complex symmetric case. Erisman and Spies (1972) wanted to take advantage of the symmetry by using diagonal pivots, drawing on an argument from the physical origin of the problem, in order to improve performance. They used the bound (4.8.9) in a code developed for this purpose to verify the stability of the factorization a posteriori. This code remained in use at Boeing through the 1990s and may continue to be used now.
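A dense sketch of how the bound (4.8.9) might be monitored is given below; it forms an $LDL^T$ factorization with diagonal pivots in the natural order (no interchanges) and then evaluates the bound, assuming all leading principal minors are nonzero. It is an illustration only, written in Python/NumPy, and not the code referred to above.

    import numpy as np

    def growth_bound_ldlt(A):
        """Factorize symmetric A as L D L^T with diagonal pivots in the given order
        and return the bound (4.8.9), max_i sum_{k<=i} |d_k l_ik^2| (illustrative sketch)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        d = np.zeros(n)
        for k in range(n):
            d[k] = A[k, k]                            # assumes d[k] is nonzero
            L[k+1:, k] = A[k+1:, k] / d[k]
            A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])
        return max(np.sum(np.abs(d[:i+1] * L[i, :i+1]**2)) for i in range(n))

If the returned value is of the same order as $\max_{i,j}|a_{ij}|$, the factorization with diagonal pivots can be accepted a posteriori; a much larger value signals possible growth.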

4.9 Solving indefinite symmetric systems

When the matrix A is symmetric but indefinite, choosing pivots from the diagonal may be unstable or impossible, as illustrated by example (4.8.8). Yet ignoring symmetry leads to twice the cost in storage and computation. Fortunately, an algorithm has been developed (Bunch and Parlett 1971; Bunch 1974), which preserves symmetry and maintains stability for this case. The idea is simply to extend the notion of a pivot to 2 × 2 blocks, as well as single entries. Thus, if A has the form
    $A = \begin{pmatrix} K & C^T \\ C & B \end{pmatrix}$    (4.9.1)
where K and B are symmetric and K is nonsingular, the first step of the $LDL^T$ factorization may be expressed in the form
    $A = \begin{pmatrix} I & 0 \\ CK^{-1} & I \end{pmatrix} \begin{pmatrix} K & 0 \\ 0 & B - CK^{-1}C^T \end{pmatrix} \begin{pmatrix} I & K^{-1}C^T \\ 0 & I \end{pmatrix}.$    (4.9.2)
K has order 1 in the usual case, but in the extended case K may have order 1 or 2. Note that $B - CK^{-1}C^T$ is symmetric. Bunch, Kaufman, and Parlett (1976) search at most two columns of the lower triangular part of $A^{(k)}$ and perform at most one symmetric interchange to ensure that either $a_{kk}^{(k)}$ is suitable as a 1 × 1 pivot or $\begin{pmatrix} a_{kk}^{(k)} & a_{k,k+1}^{(k)} \\ a_{k+1,k}^{(k)} & a_{k+1,k+1}^{(k)} \end{pmatrix}$ is suitable as a 2 × 2 pivot. They report timings comparable with those of the Cholesky algorithm, and their stability bound is
    $\rho \le 2.57^{n-1} \max_{i,j} |a_{ij}|.$    (4.9.3)
Barwell and George (1976) also report that using 2 × 2 pivots need involve little extra expense. It is interesting to note that there is no need to extend beyond 2 × 2 pivots. In the sparse case, it is important to use a form of threshold pivoting that involves 2 × 2 pivots and we discuss this in Section 5.2.
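For the matrix (4.8.8), taking the leading 2 × 2 block as K in (4.9.2) gives a symmetric block factorization even though no nonzero 1 × 1 diagonal pivot exists. The short Python/NumPy sketch below (an illustration of the block formula only, not of the Bunch-Kaufman-Parlett search) forms the factors explicitly and verifies them.

    import numpy as np

    A = np.array([[0., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.]])
    K = A[:2, :2]                       # 2 x 2 pivot block; nonsingular although its diagonal is zero
    C = A[2:, :2]
    B = A[2:, 2:]
    Kinv = np.linalg.inv(K)
    S = B - C @ Kinv @ C.T              # Schur complement, cf. (4.9.2)
    L = np.block([[np.eye(2), np.zeros((2, 1))], [C @ Kinv, np.eye(1)]])
    D = np.block([[K, np.zeros((2, 1))], [np.zeros((1, 2)), S]])
    print(np.allclose(L @ D @ L.T, A))  # True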

4.10 Ill-conditioning: introduction

We now assume that we have been successful in computing a vector $\tilde{x}$ satisfying the equation
    $(A + H)\tilde{x} = b,$    (4.10.1)
where $\|H\|$ is small, that is $\tilde{x}$ has been computed stably. As yet, we have no assurance that $\tilde{x}$ is accurate. The simple example
    $A = \begin{pmatrix} 0.287 & 0.512 \\ 0.181 & 0.322 \end{pmatrix}, \quad b = \begin{pmatrix} 0.799147 \\ 0.502841 \end{pmatrix}$    (4.10.2)
bears this out, if we continue to use 3-decimal arithmetic. The exact solution is
    $x = \begin{pmatrix} 0.501 \\ 1.28 \end{pmatrix}.$    (4.10.3)
Representation error requires rounding b on input to
    $\tilde{b} = \begin{pmatrix} 0.799 \\ 0.503 \end{pmatrix}.$    (4.10.4)
The computed 3-digit LU factors of A are
    $\begin{pmatrix} 1 & \\ 0.631 & 1 \end{pmatrix} \begin{pmatrix} 0.287 & 0.512 \\ & -0.001 \end{pmatrix}.$    (4.10.5)
The perturbed system $A\tilde{x} = \tilde{b}$ has
    $\tilde{x} = \begin{pmatrix} 1.00 \\ 1.00 \end{pmatrix}$    (4.10.6)
as its exact solution and this is the solution that is obtained using the factors (4.10.5). The quantities $\|r\| = \|b - A\tilde{x}\|$ and $\|\tilde{r}\| = \|\tilde{b} - A\tilde{x}\|$ are both relatively small; in fact, when r and $\tilde{r}$ are computed in 3-decimal arithmetic, both are zero. The cause of the relatively large value of $\|x - \tilde{x}\|$ is thus in the data, rather than in the algorithm. In this case, we are faced with an ill-conditioned problem: small changes in the data can cause large changes in the solution.
The next few sections are devoted to understanding and working with ill-conditioned problems.

4.11 Ill-conditioning: theoretical discussion

A problem is ill-conditioned if small changes in the data can produce large changes in the solution. In the next paragraph, we show that this would not be possible if the matrix A scaled the norms of all vectors about equally, that is if $\|Ax\|$ had about the same size for all vectors x with $\|x\| = 1$. Thus, we can characterize the ill-conditioning of A in terms of the variation in $\|Ax\|$.
Suppose that the matrix A is such that there are two vectors v and w satisfying the relations
    $\|v\| = \|w\|$    (4.11.1)
and
    $\|Av\| \gg \|Aw\|.$    (4.11.2)
If b has the value Av, the equation
    $Ax = b$    (4.11.3)
has solution x = v. If b is changed by the vector Aw, which is relatively small in view of inequality (4.11.2), the solution is changed by w, which is not small in view of equality (4.11.1). Therefore, the problem is ill-conditioned.


Clearly, the amount of ill-conditioning is dependent on how large the ratio of the two sides of relation (4.11.2) can be. The ratio can be written in the form
    $\frac{\|Av\|}{\|Aw\|} = \frac{\|Av\|}{\|v\|}\,\frac{\|A^{-1}y\|}{\|y\|},$    (4.11.4)
if y is the vector Aw. The first term has maximum value $\|A\|$ (see Appendix A for a discussion of norms) and the second term has maximum value $\|A^{-1}\|$, so the ratio $\|Av\|/\|Aw\|$ can be as large as $\|A\|\,\|A^{-1}\|$. Because of its relationship to the condition of A, the quantity
    $\kappa(A) = \|A\|\,\|A^{-1}\|$    (4.11.5)
is known as the condition number for A. In the extreme case, v is the vector that maximizes $\|Av\|/\|v\|$ and w is the vector $A^{-1}y$, where y maximizes $\|A^{-1}y\|/\|y\|$. If the norm is the two-norm (see Appendix A), $\|Av\|_2/\|v\|_2$ and $\|Aw\|_2/\|w\|_2$ are the largest and smallest singular values of A, and v and w are the corresponding singular vectors.
For the matrix A defined in (4.10.2), it is interesting to observe the exact relationships
    $A \begin{pmatrix} 1.0 \\ 1.0 \end{pmatrix} = \begin{pmatrix} 0.799 \\ 0.503 \end{pmatrix} \quad \mbox{and} \quad A \begin{pmatrix} 1.0 \\ -0.561 \end{pmatrix} = \begin{pmatrix} -0.000232 \\ 0.000358 \end{pmatrix},$    (4.11.6)
which shows the wide variation of $\|Ax\|_\infty$ for vectors such that $\|x\|_\infty = 1.0$. This wide variation explains the reason for the large difference $x - \tilde{x}$.
To see the role of the condition number (4.11.5) in establishing the relationship between the backward error and the accuracy of solution we assume that, for the general problem
    $Ax = b,$    (4.11.7)
b is perturbed by $\delta b$ and the corresponding perturbation of x is $\delta x$ so that the equation
    $A(x + \delta x) = b + \delta b$    (4.11.8)
is satisfied, then by subtraction followed by multiplication by $A^{-1}$ we find the relation
    $\delta x = A^{-1}\delta b.$    (4.11.9)
By taking norms in both equations (4.11.7) and (4.11.9), we find the inequalities
    $\|A\|\,\|x\| \ge \|b\|$    (4.11.10)
and
    $\|\delta x\| \le \|A^{-1}\|\,\|\delta b\|,$    (4.11.11)


from which follows the inequality
    $\frac{\|\delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\delta b\|}{\|b\|},$    (4.11.12)
which again illustrates the role of the condition number (4.11.5).
For simplicity of exposition, we have so far in this section confined attention to perturbations of the vector b. In this case, the bound may be far from sharp. For example, if x is the vector w of (4.11.1) and (4.11.2), $\|A\|\,\|x\|$ will grossly overestimate $\|b\|$ in (4.11.10), which in turn leads to the inequality (4.11.12) grossly overestimating $\|\delta x\|/\|x\|$. To see that the condition number realistically estimates how well or poorly conditioned the system is, we must consider the effect of perturbations to A. Let the perturbed system be
    $(A + \delta A)(x + \delta x) = b.$    (4.11.13)
Subtracting equation (4.11.7) and rearranging gives the equation
    $A\,\delta x = -\delta A(x + \delta x).$    (4.11.14)
If we now multiply by $A^{-1}$ and take norms, we find the inequality
    $\|\delta x\| \le \|A^{-1}\|\,\|\delta A\|\,\|x + \delta x\|,$    (4.11.15)
which can be rearranged to
    $\frac{\|\delta x\|}{\|x + \delta x\|} \le \|A^{-1}\|\,\|A\|\,\frac{\|\delta A\|}{\|A\|}.$    (4.11.16)
Comparing this with inequality (4.11.12), we see that the condition number $\kappa(A) = \|A\|\,\|A^{-1}\|$ plays a similar role in relating the relative error in the solution with the relative error in the data. Now, however, we are considering a whole class of perturbations $\delta A$ and would expect almost all to be such that the inequality (4.11.15) is reasonably sharp, which implies that inequality (4.11.16) can be expected to be sharp. This should be contrasted with the possible lack of sharpness of inequality (4.11.10) for particular vectors b.
We have now considered the effects of perturbations to A and b separately, but, of course, in practice we will have solved a problem of the form
    $(A + \delta A)(x + \delta x) = b + \delta b,$    (4.11.17)
with perturbations to both A and b. The perturbations can be taken to include both data uncertainty and algorithmic error. If A is reasonably well scaled and the algorithm is stable, we can expect the algorithmic contribution to be small compared with $\|A\|$ and $\|b\|$. By subtracting the given equation from (4.11.17) we find
    $A\,\delta x = \delta b - \delta A\,x - \delta A\,\delta x,$    (4.11.18)
and multiplying by $A^{-1}$ and taking norms yields the inequality
    $\|\delta x\| \le \|A^{-1}\|\,(\|\delta b\| + \|\delta A\|\,\|x\| + \|\delta A\|\,\|\delta x\|),$    (4.11.19)

which may be rearranged to the form
    $\|\delta x\|(1 - \|A^{-1}\|\,\|\delta A\|) \le \|A^{-1}\|(\|\delta b\| + \|\delta A\|\,\|x\|).$    (4.11.20)
Provided the inequality
    $\|A^{-1}\|\,\|\delta A\| < 1$    (4.11.21)
holds, we deduce the relation
    $\|\delta x\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\delta A\|}\,(\|\delta b\| + \|\delta A\|\,\|x\|)$    (4.11.22)
for the absolute error. For the relative error, we divide this by $\|x\|$ and use the inequality $\|b\| \le \|A\|\,\|x\|$ to give the inequality
    $\frac{\|\delta x\|}{\|x\|} \le \frac{\kappa(A)}{1 - \kappa(A)\frac{\|\delta A\|}{\|A\|}} \left( \frac{\|\delta b\|}{\|b\|} + \frac{\|\delta A\|}{\|A\|} \right),$    (4.11.23)
where $\kappa(A) = \|A\|\,\|A^{-1}\|$ is the condition number, and the inequality (4.11.21) ensures that the denominator in this bound is positive. This final inequality shows the relationship between the relative solution error and the relative error in A and b, together with the key role played by the condition number.
Observe that inequality (4.11.23) substantiates the remarks made in Section 4.1. If we have a stable algorithm, this ensures that a neighbouring problem has been solved, that is, that
    $\frac{\|\delta b\|}{\|b\|} + \frac{\|\delta A\|}{\|A\|}$
is small. This only ensures an accurate solution if neighbouring problems have neighbouring solutions, that is if $\kappa(A)$ is small. Finally, with algorithm instability and ill-conditioning, there is a cumulative effect.
There are two main problems with using the condition number (4.11.5). The first is that it is very dependent on scaling, which is apparent even if A is a diagonal matrix. For example, consider the identity matrix with the last diagonal entry replaced by a small number, $\delta$ say. The norm of the matrix will be around 1, while the norm of the inverse matrix will be around $1/\delta$ (the exact values depending on the norm being used) so that the condition number will be around $1/\delta$. However, if the matrix is scaled to make its last entry 1, the condition number of the scaled matrix is just 1. The second is that the condition number takes no account of the right-hand side or the fact that small entries (specially zeros) may be known within much smaller tolerances than large entries. If the errors in the components $a_{ij}$ and $b_i$ are bounded by $\epsilon|a_{ij}|$ and $\epsilon|b_i|$ for a scalar $\epsilon$, we get the condition number of Skeel (1979)
    $\kappa_{\rm skeel}(A, x, b) = \frac{\left\|\, |A^{-1}|(|A|\,|\tilde{x}| + |b|) \,\right\|}{\|\tilde{x}\|},$    (4.11.24)
where $\tilde{x}$ is the computed solution. We will return to this condition number in Section 15.5 but, for the remainder of this chapter, we will continue to use the condition number given by equation (4.11.5) which is the one normally used and has the merit of being easier to compute or estimate.

4.12 Ill-conditioning: automatic detection

Clearly, conditioning has a major influence on the accuracy with which we can solve the equations and so it would be useful to have an algorithm to return an estimate of the condition number without the cost of computing A−1 . If the condition number is low, a small residual would indicate that we have a good solution of our system. Many methods have been suggested to effect this estimation, but there are really two main methods used today and both can readily be implemented when the matrix is sparse. The objective is to estimate the amount of ill-conditioning in the problem inexpensively (relative to the cost of solution of the problem itself). In general, we are interested only in the order of magnitude of the condition number. We note in passing that many other methods have been suggested to estimate the conditioning including explicit computation of singular values, explicit computation of A−1 , use of the determinant, and monitoring of pivot size. These are all either too impractical or unreliable to be of use to us. 4.12.1

The LINPACK condition estimator

Cline, Moler, Stewart, and Wilkinson (1979) suggested estimating $\|A^{-1}\|_1$ by calculating the vector x from the equation
    $A^T x = b$    (4.12.1)
for a specially constructed vector b, then solving the equation
    $Ay = x$    (4.12.2)
and using $\|y\|_1/\|x\|_1$ as an estimate for $\|A^{-1}\|_1$. The vector b has components ±1 and signs chosen during the course of solution of (4.12.1) with the aim of making $\|x\|_1$ large. The choice of signs at most doubles the work in the forward substitution phase of solving (4.12.1) so the overall cost is equivalent to solving 2 to 2½ sets of equations using the already computed factors of A. Details are given in Section 15.5 of Higham (2002), which also contains a justification of the procedure in terms of the singular value decomposition of A.
Cline et al. (1979) report that good estimates were obtained from their runs on test matrices, although Cline and Rew (1983) have shown that it is possible to construct examples for which the estimate is poor. O'Leary (1980) showed experimentally that a worthwhile improvement, particularly for small n, is available by using
    $\max\left(\frac{\|y\|_1}{\|x\|_1},\, \|x\|_\infty\right).$    (4.12.3)
She recommended applying the procedure to $A^T$ if an estimate of $\|A^{-1}\|_\infty$ is wanted, rather than changing the norm. Cline, Conn, and Van Loan (1982) discuss 'look behind', as well as 'look ahead' algorithms for obtaining an estimate of $\kappa_2(A)$ and show by experiment that they give good estimates. None of these papers make a theoretical analysis of the quality of their estimates.

4.12.2 Hager's method

Hager (1984) proposed an algorithm for computing the maximum of the function $f(x) = \|Bx\|_1$ over the set $S = \{x : \|x\|_1 \le 1\}$. His method is an iterative one that keeps x in S and, at each stage, increases the value of f(x). The global maximum of f(x) is $\|B\|_1$. Hager's algorithm uses the subgradient of f(x) given by $\delta f = \xi^T B$, where $\xi_i$ is equal to 1 if the ith component of Bx is positive and is −1 otherwise. For computing the condition number, B is set to $A^{-1}$ so that the computation of the subgradient is performed by solving the two systems
    $Ay = x \quad \mbox{and} \quad A^T z = \xi,$
where $\xi_i$ is equal to 1 if $y_i \ge 0$ and −1 otherwise, and $\delta f(x) = z^T$. So, as for the LINPACK estimator, we need to solve two sets of equations at each iteration. However, if we assume that we have already computed the factorization of the matrix A, this is not too costly. Although Hager's algorithm can only guarantee convergence to a local maximum (and usually within only a few iterations), Higham (1987) has shown that it usually gives a good estimate of the global maximum value. A modified version of Hager's algorithm programmed by Higham (1988) is the basis for the condition number estimation in LAPACK (Anderson, Bai, Bischof, Blackford, Demmel, Dongarra, Du Croz, Greenbaum, Hammarling, McKenney, and Sorensen 1999).
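The iteration is short enough to sketch directly. The Python/NumPy fragment below is a simple rendering of the steps described above (dense solves stand in for the triangular solves with precomputed factors, and the starting vector and iteration limit are our own choices); multiplying the result by $\|A\|_1$ gives an estimate of $\kappa_1(A)$.

    import numpy as np

    def hager_condest_1(A, maxit=5):
        """Estimate ||A^{-1}||_1 by Hager's iteration (illustrative sketch)."""
        n = A.shape[0]
        x = np.full(n, 1.0 / n)                    # start inside S = {x : ||x||_1 <= 1}
        est = 0.0
        for _ in range(maxit):
            y = np.linalg.solve(A, x)              # y = A^{-1} x
            if np.abs(y).sum() <= est:
                break                              # no further increase in f(x)
            est = np.abs(y).sum()
            xi = np.where(y >= 0, 1.0, -1.0)
            z = np.linalg.solve(A.T, xi)           # subgradient: z = A^{-T} xi
            j = int(np.argmax(np.abs(z)))
            if np.abs(z[j]) <= z @ x:
                break                              # local maximum reached
            x = np.zeros(n); x[j] = 1.0            # move to the most promising vertex e_j
        return est

    A = np.array([[0.287, 0.512], [0.181, 0.322]])
    print(np.abs(A).sum(axis=0).max() * hager_condest_1(A))   # large: the matrix (4.10.2) is ill-conditioned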

4.13 Iterative refinement

We now discuss a strategy designed to improve the accuracy of a computed solution $\tilde{x}$.


Because we are defining an iteration, we change notation slightly and let $x^{(1)}$ be the computed solution to $Ax = b$. Let the residual vector be
    $r^{(1)} = b - Ax^{(1)}.$    (4.13.1)
Formally, we observe the relations
    $A^{-1}r^{(1)} = A^{-1}b - x^{(1)} = x - x^{(1)}.$    (4.13.2)
Hence, we may find a correction $\Delta x^{(1)}$ to $x^{(1)}$ by solving the equation
    $A\,\Delta x^{(1)} = r^{(1)}.$    (4.13.3)
In practice, of course, we cannot solve this exactly, but we may construct a new approximate solution
    $x^{(2)} = x^{(1)} + \Delta x^{(1)},$    (4.13.4)
and, in general, define an iteration, with steps
    $r^{(k)} = b - Ax^{(k)},$    (4.13.5a)
    $A\,\Delta x^{(k)} = r^{(k)},$    (4.13.5b)
and
    $x^{(k+1)} = x^{(k)} + \Delta x^{(k)},$    (4.13.5c)
for k = 1, 2, ... . Note that when computing the residual $r^{(k)}$ in (4.13.5a), the original matrix is used, so that if the iteration yields a small residual, we will have solved a nearby problem. This is particularly useful when there is any uncertainty over the accuracy of $\tilde{L}\tilde{U}$ as an approximation to A, for example, when using threshold pivoting (Section 4.4.2). If the residuals $r^{(k)}$ are computed using additional precision, the iteration usually converges to the solution in full precision.
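The loop (4.13.5) reuses a single factorization, as the following Python/SciPy sketch shows (it is an illustration only; the number of steps and the use of working precision for the residuals are our own simplifying choices).

    import numpy as np
    import scipy.linalg as la

    def refine(A, b, steps=3):
        """Iterative refinement (4.13.5) reusing one LU factorization (sketch).
        Residuals are computed here in working precision; for best results they
        would be accumulated in higher precision."""
        lu, piv = la.lu_factor(A)
        x = la.lu_solve((lu, piv), b)
        for _ in range(steps):
            r = b - A @ x                       # (4.13.5a), using the original matrix
            dx = la.lu_solve((lu, piv), r)      # (4.13.5b), reusing the factors
            x = x + dx                          # (4.13.5c)
        return x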

4.14 Scaling

We have assumed so far that the given matrix A and the solution x are well scaled. However, norms and condition numbers are affected by scaling. For instance, the bound (4.11.12) may be very poor because $\kappa(A)$ is unnecessarily large and it will give very poor relative bounds for any components of x that are much smaller than the rest (for instance, if
    $x = \begin{pmatrix} 1.27{\times}10^6 \\ 3.26 \end{pmatrix},$
the bound $\|\delta x\|_\infty \le 6.4$ is only useful for $x_1$).


To focus our discussion of matrix scaling, we consider the matrix
    $A = \begin{pmatrix} 1.00 & 2420 \\ 1.00 & 1.58 \end{pmatrix}.$    (4.14.1)
The partial pivoting strategy discussed at the beginning of Section 4.4 apparently justifies the use of the (1, 1) pivot, and the resulting factorization
    $\tilde{L}\tilde{U} = \begin{pmatrix} 1 & \\ 1.00 & 1 \end{pmatrix} \begin{pmatrix} 1.00 & 2420 \\ & -2420 \end{pmatrix}$    (4.14.2)
does not have any growth if this is measured relative to $\max|a_{ij}|$. Since this matrix factorization is closely related to the unstable factorization in Section 4.3, we see that the presence of the large entry in (4.14.1) may cause failure in both the pivot selection and growth assessment strategies.
The matrix A in (4.14.1) is poorly scaled because of the wide range of entries that are, by default, presumed to all have the same relative accuracy. It is surprisingly difficult to deal in any automatic way with poorly-scaled problems because, judging only from the given data, it is generally not possible to determine what is significant. Forsythe and Moler (1967, pp. 37-46) discuss this in detail.
One possibility is equilibration, where diagonal matrices $D_1$ and $D_2$ are selected so that
    $D_1 A D_2$    (4.14.3)
has the largest entry in each row and column of the same magnitude (see Bauer (1963), for example). Unfortunately, this choice allows $D_1$ and $D_2$ to vary widely. In our example, choosing
    $D_1 = \begin{pmatrix} 10^{-3} & \\ & 1 \end{pmatrix} \quad \mbox{and} \quad D_2 = I$    (4.14.4)
produces the scaled matrix
    $D_1 A D_2 = \begin{pmatrix} 0.00100 & 2.42 \\ 1.00 & 1.58 \end{pmatrix},$    (4.14.5)
while choosing
    $D_3 = I \quad \mbox{and} \quad D_4 = \begin{pmatrix} 1 & \\ & 10^{-3} \end{pmatrix}$    (4.14.6)
produces the scaled matrix
    $D_3 A D_4 = \begin{pmatrix} 1.00 & 2.42 \\ 1.00 & 0.00158 \end{pmatrix}.$    (4.14.7)
Both are well-conditioned equilibrated matrices, but they lead to significance being attached to different entries and to different pivotal choices. While careful computation with 2×2 problems will still allow acceptable results, in larger cases the different scalings can result in very different solutions. If the unscaled matrix (4.14.1) had been caused by a badly-scaled first equation, then scaling (4.14.5) would be the proper choice. If it had been caused by choosing units for variable $x_2$ that are $10^3$ times too large then scaling (4.14.7) would be the proper choice.
Note that when computing a stable solution to a system of equations with scaled matrix $D_1AD_2$, we solve the equation
    $(D_1 A D_2)(D_2^{-1} x) = D_1 b.$    (4.14.8)
The computed solution $(D_2^{-1}x)$ is the solution to a problem 'near'
    $(D_1 A D_2)\,y = D_1 b.$    (4.14.9)
This x need not be the solution to a problem 'near'
    $(D_3 A D_4)(D_4^{-1} x) = D_3 b$    (4.14.10)
for some other scaling (see Exercises 4.16-4.19).
For these reasons, using consistent units in data and variables and making consistent modelling assumptions are usually the best way to achieve a well-scaled problem. Automatic scaling methods (next section) are useful when these modelling considerations are not feasible.

4.15 Automatic scaling

Automatic scaling algorithms do not consider the origin of the matrix being scaled, but rather try to optimize some objective function related to what might intuitively be considered a well scaled matrix. However, this intuition can take fairly significantly different forms that we illustrate by considering three different objective functions. Right-hand sides can be taken into account by appending them as additional columns. A small example that illustrates how important this can be is
    $\begin{pmatrix} 2.42 & 3.05{\times}10^{-6} \\ 2.81{\times}10^{-6} & 1.25{\times}10^{-6} \end{pmatrix} x = \begin{pmatrix} 1.02 \\ 1.74 \end{pmatrix}.$    (4.15.1)
For the matrix itself, scaling the second row by $10^6$ or scaling the second column by $10^6$ produces a well-scaled matrix. If we add the extra column, we see that only scaling the second column by $10^6$ gives a good scaling. On the other hand, if the right-hand side were
    $\begin{pmatrix} 2.42 \\ 1.74{\times}10^{-6} \end{pmatrix},$
only scaling the second row by $10^6$ gives a good scaling.

4.15.1 Scaling so that all entries are close to one

A practical automatic scaling algorithm has been developed by Curtis and Reid (1972) following a suggestion of Hamming (1971), which is to choose scaling matrices $D_1 = {\rm diag}(e^{-\rho_i})$ and $D_2 = {\rm diag}(e^{-\gamma_j})$, where $\rho_i$ and $\gamma_i$, i = 1, 2, ..., n are chosen to minimize the expression
    $\sum_{a_{ij} \ne 0} (\log|a_{ij}| - \rho_i - \gamma_j)^2.$    (4.15.2)
The scaled matrix $D_1AD_2$ then has the logarithms of its nonzeros as small as possible in the least-squares sense, which corresponds to the coefficients of the scaled matrix having modulus near to one. The minimization of expression (4.15.2) is a linear least-squares problem, which Curtis and Reid were able to solve very effectively by the method of conjugate gradients. The major cost, particularly in the dense case, lies in the computation of the logarithms. For the matrix (4.14.1), this algorithm yields the scaling matrices
    $D_1 = \begin{pmatrix} 0.1598 & \\ & 6.256 \end{pmatrix} \quad \mbox{and} \quad D_2 = \begin{pmatrix} 1 & \\ & 0.01617 \end{pmatrix}$    (4.15.3)
and produces the scaled matrix
    $D_1 A D_2 = \begin{pmatrix} 0.1598 & 6.2559 \\ 6.2559 & 0.1598 \end{pmatrix},$    (4.15.4)
which is more satisfactory than either of the equilibration scalings (4.14.5) or (4.14.7).
For the symmetric case, symmetric scalings are desirable. The scheme of the previous paragraph may be restricted to symmetric scalings, which does not lead to any worsening of the objective function. To see this, suppose that expression (4.15.2) is at its minimum value and symmetrize the scaling by replacing each $\rho_i$ and $\gamma_i$ by their average. For every term $\log|a_{ij}| - \rho_i - \gamma_j$ in expression (4.15.2) for which $i \ne j$, there is a term $\log|a_{ji}| - \rho_j - \gamma_i$. The symmetrization will cause both of these terms to be replaced by their average, which cannot increase their sum of squares. The symmetrization does not affect any diagonal term $\log|a_{ii}| - \rho_i - \gamma_i$, so we have shown that the whole sum of squares is not increased.
These scalings are implemented in HSL (2016), the collection of codes of the Rutherford Appleton Laboratory, as MC29 (unsymmetric case) and MC30 (symmetric case).
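For a small dense matrix, the minimization of (4.15.2) can be written down directly as a least-squares problem in the unknowns $\rho_i$ and $\gamma_j$, one equation per nonzero; the Python/NumPy sketch below does this with a dense solver rather than the conjugate gradient method used by Curtis and Reid, and is only an illustration. Although $D_1$ and $D_2$ are determined only up to a common scalar, the scaled matrix itself is unique and reproduces (4.15.4).

    import numpy as np

    def log_lsq_scaling(A):
        """Choose rho_i, gamma_j to minimize (4.15.2) by a dense least-squares solve (sketch)."""
        m, n = A.shape
        rows, cols = np.nonzero(A)
        rhs = np.log(np.abs(A[rows, cols]))
        # One equation log|a_ij| = rho_i + gamma_j for each nonzero.
        M = np.zeros((len(rows), m + n))
        M[np.arange(len(rows)), rows] = 1.0
        M[np.arange(len(rows)), m + cols] = 1.0
        sol = np.linalg.lstsq(M, rhs, rcond=None)[0]
        return np.diag(np.exp(-sol[:m])), np.diag(np.exp(-sol[m:]))

    A = np.array([[1.00, 2420.0], [1.00, 1.58]])
    D1, D2 = log_lsq_scaling(A)
    print(D1 @ A @ D2)   # the scaled matrix (4.15.4)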

4.15.2 Scaling norms

Another approach to scaling is to use an iterative technique to make the infinity norms of all the rows and columns close to one. Knight, Ruiz, and Uçar (2014) achieve this by repeatedly scaling the rows and columns by the square roots of their norms. Starting with $A^{(0)} = A$, iterate k + 1 has entries
    $a_{ij}^{(k+1)} = \|a_{i:}^{(k)}\|_\infty^{-1/2}\; a_{ij}^{(k)}\; \|a_{:j}^{(k)}\|_\infty^{-1/2}.$    (4.15.5)
Since any element of $A^{(k)}$ is less than the norm of its row or column, all entries of $A^{(k)}$, $k \ge 1$, are less than one in absolute value. If $a_{il}^{(k)}$ is the largest entry in row i, $a_{il}^{(k+1)}$ has absolute value
    $|a_{il}^{(k+1)}| = \|a_{i:}^{(k)}\|_\infty^{1/2}\,\|a_{:l}^{(k)}\|_\infty^{-1/2} \ge \|a_{i:}^{(k)}\|_\infty^{1/2}.$    (4.15.6)
It follows that the convergence of the norm of row i to 1 is bounded thus
    $\|a_{i:}^{(k+1)}\|_\infty \ge \|a_{i:}^{(k)}\|_\infty^{1/2}.$    (4.15.7)

A similar result applies for the columns. Ruiz stops the iteration when
    $\max_{1 \le i \le m} \left| 1 - \|a_{i:}^{(k)}\|_\infty \right| \le \epsilon \quad \mbox{and} \quad \max_{1 \le j \le n} \left| 1 - \|a_{:j}^{(k)}\|_\infty \right| \le \epsilon$
for a given value of $\epsilon$. For the matrix (4.14.1), this algorithm yields the scaling matrices
    $D_1 = \begin{pmatrix} 0.0203 & \\ & 0.8919 \end{pmatrix} \quad \mbox{and} \quad D_2 = \begin{pmatrix} 1.1212 & \\ & 0.0203 \end{pmatrix}$    (4.15.8)
after two iterations and produces the scaled matrix
    $D_1 A D_2 = \begin{pmatrix} 0.0228 & 1 \\ 1 & 0.0286 \end{pmatrix}.$    (4.15.9)
This is not dissimilar from the result (4.15.4) as may be seen by dividing that matrix by 6.2559. This scaling is implemented in HSL as MC77.
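The iteration (4.15.5) is easy to express for a dense matrix; the Python/NumPy sketch below accumulates the row and column factors and assumes the matrix has no zero rows or columns. The tolerance and iteration limit are our own illustrative choices.

    import numpy as np

    def ruiz_scale(A, tol=1e-4, maxit=100):
        """Iterative infinity-norm scaling (4.15.5): repeatedly divide rows and columns
        by the square roots of their infinity norms (illustrative sketch)."""
        A = A.astype(float).copy()
        m, n = A.shape
        d1, d2 = np.ones(m), np.ones(n)
        for _ in range(maxit):
            r = np.abs(A).max(axis=1)          # row infinity norms
            c = np.abs(A).max(axis=0)          # column infinity norms
            if max(np.abs(1 - r).max(), np.abs(1 - c).max()) <= tol:
                break
            dr, dc = 1.0 / np.sqrt(r), 1.0 / np.sqrt(c)
            A = dr[:, None] * A * dc[None, :]
            d1 *= dr
            d2 *= dc
        return np.diag(d1), A, np.diag(d2)     # D1, D1*A*D2, D2

    D1, S, D2 = ruiz_scale(np.array([[1.00, 2420.0], [1.00, 1.58]]))
    print(S)   # close to the matrix (4.15.9)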

4.15.3 I-matrix scaling

An I-matrix is a matrix whose diagonal entries are all equal to one in absolute value with all off-diagonal entries less than or equal to one in absolute value. Olschowka and Neumaier (1996) proposed an algorithm for finding a permutation and scaling to convert a given matrix to this form, assuming that the matrix can be permuted to have entries on its diagonal. Their algorithm maximizes the product of the moduli of entries on the diagonal of the permuted matrix; that is, it finds a permutation $\sigma$ so that
    $\prod_{i=1}^{n} |a_{i\sigma_i}|$    (4.15.10)
is maximized and obtains suitable scaling factors to make the permuted matrix an I-matrix. Duff and Koster (2001) use a carefully constructed search to find the I-matrix scaling for a sparse matrix and we discuss their algorithm in Section 6.9.
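Maximizing the product (4.15.10) is equivalent to maximizing $\sum_i \log|a_{i\sigma_i}|$, which for a small dense matrix can be treated as an assignment problem. The Python/SciPy sketch below is one way to experiment with this on toy examples (the sparse algorithm of Duff and Koster is, of course, organized quite differently, and the handling of zero entries here is our own simplification).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def max_product_permutation(A):
        """Column permutation sigma maximizing prod_i |a_{i,sigma(i)}|,
        found as an assignment problem on log|a_ij| (illustrative sketch)."""
        absA = np.abs(A).astype(float)
        # Zeros cannot lie on the diagonal; give them a very unfavourable score.
        logs = np.where(absA > 0, np.log(np.where(absA > 0, absA, 1.0)), -1e30)
        rows, cols = linear_sum_assignment(logs, maximize=True)
        return cols                                    # cols[i] = sigma(i)

    print(max_product_permutation(np.array([[1.00, 2420.0], [1.00, 1.58]])))
    # [1 0]: 2420 and the second 1.00 are placed on the diagonal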


Exercises
4.1 In solving Ax = b, make the case for why a stable algorithm, combined with a well-conditioned matrix A, will lead to a computed solution that is accurate.
4.2 Show that the 2-norm condition numbers of A and $A^TA$ are related by the equation $\kappa_2(A^TA) = (\kappa_2(A))^2$.
4.3 The 3-digit approximation to the solution of the sets of equations
    −0.0101 x₁ + 1.40 x₂ − 0.946 x₃ = 1.50
    4.96 x₁ + 0.689 x₂ + 0.330 x₃ = 0.496
    −1.50 x₁ − 2.68 x₂ + 0.100 x₃ = −0.290
is x₁ = 0.225, x₂ = −0.0815, x₃ = −1.71. Without pivoting and using 3-digit arithmetic (assume that each operation produces a 3-digit result and is performed with sufficient guard digits for the only error to arise from the final rounding to 3 significant figures), compute the solution to this system of equations. Note that no component of the solution is correct. What is the indication that something has gone wrong?
4.4 (i) For $r = b - A\tilde{x}$, show that $\tilde{x}$ satisfies the equation $(A + H)\tilde{x} = b$, where $H = \dfrac{r\tilde{x}^T}{\|\tilde{x}\|_2^2}$.
(ii) If $\|r\|_2 = \alpha\|A\|_2\|\tilde{x}\|_2$, show that $\|H\|_2 \le \alpha\|A\|_2$.
(iii) Hence show that $\tilde{x}$ is the solution of a nearby set of equations if and only if $\|r\|_2$ is small compared with $\|A\|_2\|\tilde{x}\|_2$.
(iv) Show also that the fact that $\|r\|_2$ is small compared with $\|A\tilde{x}\|_2$ or $\|b\|_2$ is a sufficient condition.
4.5 Show that the inequality (4.4.4) holds with partial pivoting in use.
4.6 The matrix
    $A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 5 & 4 \\ 2 & 2 & 3 \end{pmatrix}$
illustrates that a rook pivoting strategy could require examining each entry in the matrix before determining the first pivot. Show that for any order n, there is such a matrix.
4.7 Show that a symmetric matrix is positive definite if and only if all its eigenvalues are positive.
4.8 Show that, if the matrix A is diagonally dominant by rows (inequality (4.8.6)), then so is $A^{(2)}$ and the inequalities $\sum_{j=2}^{n} |a_{ij}^{(2)}| \le \sum_{j=1}^{n} |a_{ij}|$ hold. Deduce that the numbers encountered in Gaussian elimination satisfy the inequalities $|a_{ij}^{(k)}| \le 2\max_{i,j}|a_{ij}|$. Show that the same result is true if A is diagonally dominant by columns (inequality (4.8.5)).


4.9 Show that if A is symmetric, with eigenvalues $\lambda_i$, $\|A\|_2 = \max_i |\lambda_i|$.
4.10 Show that the inequality $\|A^{-1}\|_p \ge \|a_k^{(k)}\|_p^{-1}$ is true if $a_k^{(k)}$ is the vector $(0, 0, ..., 0, a_{kk}^{(k)}, ..., a_{nk}^{(k)})^T$ of subdiagonal coefficients encountered after applying k − 1 steps of Gaussian elimination to A. Hint: consider the singular matrix $(A - a_k^{(k)}e_k^T)$, where $e_k$ is column k of I.
4.11 The matrix (4.3.1) has inverse
    $A^{-1} \simeq \begin{pmatrix} -0.6533 & 1.0007 \\ 0.4135 & -0.0004 \end{pmatrix}.$
Show that $\kappa_\infty(A) \simeq 4.26$ and that if $D_1 = \begin{pmatrix} 10^3 & \\ & 1 \end{pmatrix}$, $\kappa_\infty(D_1A) \simeq 2424$.

4.12 Compute the solution of the equation
    $\begin{pmatrix} 0.287 & 0.512 \\ 0.181 & 0.322 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} -2.32 \\ 3.58 \end{pmatrix}.$
Hint: use (4.11.6).
4.13 Perturb the factor U in equation (4.10.5) by changing −0.001 to −0.0006 and use the revised factorization to solve the problem in Exercise 4.12. Hence, demonstrate that the problem in Exercise 4.12 is ill-conditioned.
4.14 From the solution obtained in Exercise 4.13, compute the associated residual for the problem of Exercise 4.12. Is it small relative to b? Explain.
4.15 Compute the LINPACK estimate (see Section 4.12) for the condition number of the matrix (4.10.2).
4.16 For the badly-scaled matrix $\begin{pmatrix} 1.00 & 2420 \\ 1.00 & 1.58 \end{pmatrix}$, used for illustration in Section 4.14, consider the solution of equations with two different right-hand sides $\begin{pmatrix} 3570 \\ 1.47 \end{pmatrix}$, $\begin{pmatrix} 2.15 \\ 3.59 \end{pmatrix}$. Using the row scaling of equation (4.14.4) produces the scaled problems
    $\begin{pmatrix} 0.001 & 2.42 \\ 1.00 & 1.58 \end{pmatrix} x, y = \begin{pmatrix} 3.57 \\ 1.47 \end{pmatrix}, \begin{pmatrix} 0.00215 \\ 3.59 \end{pmatrix}.$
Evaluate $\|r\|_\infty$ for the candidate solutions
    $\tilde{x} = \begin{pmatrix} -0.863 \\ 1.48 \end{pmatrix} \quad \mbox{and} \quad \tilde{y} = \begin{pmatrix} 3.59 \\ -0.001 \end{pmatrix}.$
Could these candidate solutions be considered solutions to problems 'near' the scaled one?
4.17 Using column scaling on the problems of Exercise 4.16, equation (4.14.6) produces the scaled problems
    $\begin{pmatrix} 1.00 & 2.42 \\ 1.00 & 0.00158 \end{pmatrix} x, y = \begin{pmatrix} 3570 \\ 1.47 \end{pmatrix}, \begin{pmatrix} 2.15 \\ 3.59 \end{pmatrix}.$


Evaluate $\|r\|_\infty$ for the candidate solutions
    $\tilde{x} = \begin{pmatrix} -10.0 \\ 1480 \end{pmatrix} \quad \mbox{and} \quad \tilde{y} = \begin{pmatrix} 3.59 \\ -0.595 \end{pmatrix}.$
Could these candidate solutions be considered solutions to problems 'near' the scaled one?
4.18 Rescale the solutions of Exercise 4.16 to give candidate solutions to the scaled problems of Exercise 4.17 and evaluate $\|r\|_\infty$ for them. Are they solutions to nearby problems?
4.19 Rescale the solutions of Exercise 4.17 to give candidate solutions to the original problems of Exercise 4.16 and evaluate $\|r\|_\infty$ for them. Are they solutions to nearby problems?

Research exercises
R4.1 To explore the relationship between scaling and solution accuracy, consider the poorly scaled matrix
    $A = \begin{pmatrix} 1.245 & 0.0002732 \\ -0.0001543 & 0.0003745 \end{pmatrix}.$
If we apply row scaling to A, we get the well-scaled matrix
    $\begin{pmatrix} 1.245 & 0.0002732 \\ -1.543 & 3.745 \end{pmatrix},$
whereas applying column scaling to A yields the well-scaled matrix
    $\begin{pmatrix} 1.245 & 2.732 \\ -0.0001543 & 3.745 \end{pmatrix}.$
Both scaled matrices yield stable factorizations in floating-point arithmetic, but they treat the small numbers very differently. The row scaling algorithm treated the (1,2) entry as having little significance, while the (2,1) entry is treated with great significance. The column scaling approach did the opposite. Depending on the right-hand side, these two scaled matrices can lead to very different solutions. What is the 'right' one? Unfortunately, the intended matrix is only known if we know by some other means that the matrix was poorly scaled due to the second variable being measured in the wrong units (column scaling), or the second equation being of little significance or entered in the wrong units (row scaling). Using single-precision arithmetic, experiment with different right-hand sides and then observe what cases yield similar or different answers for the two scalings. For right-hand sides try
    $\begin{pmatrix} 1.808 \\ 1.261 \end{pmatrix} \quad \mbox{and} \quad \begin{pmatrix} 1.519 \\ 0.0005639 \end{pmatrix}.$

Might the right-hand side aid in determining the best scaling for solution accuracy?

5
GAUSSIAN ELIMINATION FOR SPARSE MATRICES: AN INTRODUCTION
This chapter provides an introduction to the use of Gaussian elimination for solving sets of linear equations that are sparse. We examine the three principal phases of most computer programs for this task: ANALYSE, FACTORIZE, and SOLVE. We stress the importance of acceptable overheads and of numerical stability, but are not concerned with algorithmic or implementation details that are the subjects of later chapters.

5.1 Introduction

With the background of the previous chapters, we can now begin the discussion of the central topic of this book: the effective solution of large sparse linear systems (with perhaps millions of equations) by direct methods. This subject is divided into two major topics: adapting the numerical methods from the dense case to the sparse case while maintaining sparsity; and efficiently computing the solution. The major objective of this chapter is to identify the issues associated with these topics and put them in the perspective of the whole solution algorithm. Detailed discussion of ordering methods and data structures for implementing the solution algorithm are deferred to later chapters.
We illustrated in Chapter 1 that the preservation of sparsity in the course of factorization significantly depends on how the equations are ordered. We showed in Chapters 3 and 4 that preserving the numerical stability of the factorization requires care in controlling growth in the size of the matrix entries. Unfortunately, these two objectives may be in conflict: the pivot providing the most stability may destroy sparsity and the one best preserving sparsity may undermine stability. We consider here, in a preliminary way, how to create the kind of compromise needed to do both. We saw in Chapter 2 that we must take great care in performing operations with sparse matrices and sparse vectors. Without this care, savings in computation that come with sparsity may be lost.
In the rest of this chapter, we discuss the structure of a typical code and the key elements that go into each phase. We approach this overview by looking at features of general-purpose computer programs for solving sparse sets of linear equations. This provides a good way to consider the key issues without the risk of confusion through too much attention to detail. Furthermore, the use of an existing code is generally the best first step in solving an actual problem. It provides a benchmark for


performance even for very specialized problems for which a new code might be written. For purposes of illustration, we use the HSL sparse matrix codes MA48 (Duff and Reid 1996a) and MA57 (Duff 2004). MA48 solves unsymmetric problems and contains most of the features described in this chapter and the next two chapters. MA57 is a multifrontal code (see Chapter 12) for symmetric problems that need not be positive definite. We have selected MA48 and MA57 because of our familiarity with them, their widespread use, their availability, and the diversity of characteristics that they display. Both codes are written in Fortran 77, but may be accessed through Fortran 95 modules called HSL MA48 and HSL MA57, which additionally provide greater functionality. We note that we are not necessarily recommending that these codes be used in all cases, and we do discuss other codes later in the book. For example, these codes only exploit parallelism through calls to the BLAS. 5.2

Numerical stability in sparse Gaussian elimination

The numerical stability of the factorization is not a concern in the symmetric positive-definite case, see Section 4.8. We discussed the case for more general dense matrices in Chapter 4, where we showed that the stability of the factorization depends on the growth in the size of the matrix entries during Gaussian elimination. This growth is dependent both on the pivot choice, which determines the size of the multiplier, and on cumulative effects during the computation. For the dense case, partial pivoting is the algorithm of choice as discussed in Section 4.4.1. However, when the matrix is sparse, the use of partial pivoting does not give enough scope for maintaining sparsity. Note that user data may be known to only a few decimal places so that adequate answers may be obtained in the presence of some growth. 5.2.1

Trade-offs between numerical stability and sparsity

Consider the matrix that has the pattern illustrated on the right of Figure 5.2.1 with diagonal entries 1.0 and off-diagonal entries 1.1. The partial pivoting strategy would destroy all sparsity because none of the diagonal entries would be available as a pivot. However, if we adapt the threshold pivoting strategy of Section 4.4.2 to the sparse case by looking for pivots that preserve sparsity well and satisfy the threshold test with u = 0.1, all the diagonal entries except the last are available as pivots.
In general, it would seem to be desirable to allow some growth for the sake of a reduction in the overall amount of computation, but the general question is how much growth should be allowed? That is, how small should u be chosen? We now do a simple analysis of the effect of using threshold pivoting on the numerical stability of a sparse factorization. We assume that each pivot is selected on sparsity grounds, but is required to satisfy the inequality
    $|a_{kk}^{(k)}| \ge u\,|a_{ik}^{(k)}|, \quad i > k,$    (5.2.1)

[Original order: a matrix pattern with a full first row and first column and entries elsewhere only on the diagonal. Reordered matrix: the same pattern with the full row and column moved to the last position.]
Fig. 5.2.1. Reordering can preserve sparsity in factorization.

which bounds the modulus of the multipliers or equivalently the modulus of the entries of the matrix L to be less than 1/u. It also limits the growth in a single column during a single step of the reduction, according to the inequality
    $\max_i |a_{ij}^{(k+1)}| \le (1 + u^{-1}) \max_i |a_{ij}^{(k)}|,$    (5.2.2)
and the growth overall by
    $\max_i |a_{ij}^{(k)}| \le (1 + u^{-1})^{c_j} \max_i |a_{ij}|,$    (5.2.3)

where $c_j$ is the number of off-diagonal entries in the jth column of $U = \{a_{ij}^{(n)},\ i \le j\}$. The inequality (5.2.3) is an immediate consequence of inequality (5.2.2) and the fact that column j changes only $c_j$ times during the reduction. This was discussed by Gear (1975). For an overall bound on
    $\rho = \max_{i,j,k} |a_{ij}^{(k)}|,$    (5.2.4)
we find
    $\rho \le (1 + u^{-1})^{c} \max_{i,j} |a_{ij}|,$    (5.2.5)
where
    $c = \max_j c_j.$    (5.2.6)

This result is a generalization of the result (4.4.6) that we obtained for a full matrix in Chapter 4 because, in this case, cj = j − 1 and c = n − 1. While the bound in the full case is attainable, it is generally considered over-pessimistic in practice. Inequality (5.2.3) merely indicates that because cj is usually small in the sparse case, we may take u < 1 to allow greater freedom of pivot choice, while still maintaining some control of stability. Scaling (see Section 4.14) has an obvious effect on this result. It can have a profound effect on the choice of pivots and numerical stability. We indicate


where scaling can be used in sparse codes in Section 5.4 and consider the effects of scaling on sparse pivoting in detail in Section 13.5. For many codes, the parameter u must be selected by the user. We have mostly used the values 0.1 and 0.01 for u. Other authors have recommended smaller values. In Section 7.8, we use tests on actual problems to explore an appropriate value for u. The choice between 0.1 and 0.01 is not definitive and may well depend on the class of problems being solved.

Another concern in assessing stability for the large, sparse case is the presence of n in the stability bound

\[
|h_{ij}| \leq 5.01\,\varepsilon\,\rho\, n
\tag{5.2.7}
\]

(see (4.3.14)). For large problems (say n = 1 000 000), n could play a substantial role in the bound. Gear (1975) has generalized the bound to the sparse case and obtained

\[
|h_{ij}| \leq 1.01\,\varepsilon\,\rho\,\bigl(u c^{3} + (1 + u) c^{2}\bigr) + O(\varepsilon^{2}).
\tag{5.2.8}
\]

While this is now independent of n, c^3 can be large, so the result is not entirely satisfactory.

5.2.2 Incorporating rook pivoting

Rook pivoting was introduced in Section 4.4.3 and can be adapted to incorporate the threshold u, although it adds some complexity to the code. While requiring that the pivot preserves sparsity, we also require it to be at least as great as u times any other entry in its row and in its column. Rook pivoting is not greatly used for the dense case because partial pivoting has generally proved satisfactory. The attraction of threshold rook pivoting in the sparse case is that it may preserve sparsity nearly as well as threshold pivoting, while limiting growth better. Threshold rook pivoting is now used in some sparse codes for this reason, but we know of no extensive testing on the choice of u in this case. The complication is the requirement to test across a row when the data structure holds the matrix in sparse form by columns. We note that for symmetric matrices with pivots chosen from the diagonal, partial threshold pivoting is the same as threshold rook pivoting.

We conclude by noting the generally pessimistic nature of the bounds for stability, the likelihood that the stability will vary between classes of problems, and the lack of wide experimentation on threshold rook pivoting for the sparse case. Because of this, we recommend the use of the residual in estimating the stability of the factorization, and the use of iterative refinement (Section 4.13). For users solving many problems in a particular domain (circuit design, structural analysis, or signal processing, for example), gathering statistics on the stability estimates will be helpful in refining u for the best results for that domain, and in deciding whether it would be helpful to move from threshold pivoting to threshold rook pivoting.
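To make the threshold test (5.2.1) and its rook variant concrete, the following sketch checks a candidate pivot of the reduced matrix held, purely for illustration, as a dense array; the function name and argument list are our own and are not taken from any library. A sparse code would, of course, perform the same tests within its own data structure, which is where the complication of scanning a row of a column-oriented structure arises.

      ! Sketch: is a(ip,jp) acceptable as a pivot of the reduced matrix
      ! a(k:n,k:n) under threshold rook pivoting with threshold u?
      ! (Testing only the column gives ordinary threshold pivoting.)
      logical function rook_ok(a, n, k, ip, jp, u)
        implicit none
        integer, intent(in) :: n, k, ip, jp
        double precision, intent(in) :: a(n,n), u
        double precision :: colmax, rowmax
        integer :: i, j
        colmax = 0.0d0
        do i = k, n                    ! largest other entry in the pivot column
           if (i /= ip) colmax = max(colmax, abs(a(i,jp)))
        end do
        rowmax = 0.0d0
        do j = k, n                    ! largest other entry in the pivot row
           if (j /= jp) rowmax = max(rowmax, abs(a(ip,j)))
        end do
        rook_ok = abs(a(ip,jp)) >= u*colmax .and. abs(a(ip,jp)) >= u*rowmax
      end function rook_ok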

5.2.3 2 × 2 pivoting

For the symmetric indefinite case where 2 × 2 pivots may be needed to preserve stability and symmetry, as discussed in Section 4.9, an extension to the sparse case is needed. Duff and Reid (1996b) recommend the test

\[
\left| \begin{bmatrix} a_{kk}^{(k)} & a_{k,k+1}^{(k)} \\ a_{k+1,k}^{(k)} & a_{k+1,k+1}^{(k)} \end{bmatrix}^{-1} \right|
\begin{bmatrix} \max_{j \neq k,k+1} |a_{kj}^{(k)}| \\ \max_{j \neq k,k+1} |a_{k+1,j}^{(k)}| \end{bmatrix}
\leq
\begin{bmatrix} u^{-1} \\ u^{-1} \end{bmatrix}
\tag{5.2.9}
\]

for each 2 × 2 pivot. This corresponds to the version

\[
|a_{kk}^{(k)}|^{-1} \max_i |a_{ik}^{(k)}| \leq u^{-1}
\tag{5.2.10}
\]

of equation (4.4.5) and maintains the validity of inequality (5.2.3). Again, the values 0.1 and 0.01 are mostly used for u.
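As an illustration only, the test (5.2.9) can be coded directly once the row maxima outside the pivot block are known. The sketch below again uses a dense copy of the reduced matrix and hypothetical names; it forms the absolute value of the inverse of the 2 × 2 pivot and applies the componentwise comparison.

      ! Sketch: check the 2x2 pivot test (5.2.9) for the candidate pivot in
      ! rows/columns k and k+1 of the reduced matrix a(k:n,k:n), held in
      ! dense form purely for illustration.
      logical function pivot2x2_ok(a, n, k, u)
        implicit none
        integer, intent(in) :: n, k
        double precision, intent(in) :: a(n,n), u
        double precision :: det, pinv(2,2), m(2), t(2)
        integer :: j
        pivot2x2_ok = .false.
        det = a(k,k)*a(k+1,k+1) - a(k,k+1)*a(k+1,k)
        if (det == 0.0d0) return              ! singular 2x2 block: reject
        pinv(1,1) =  a(k+1,k+1)/det           ! inverse of the 2x2 pivot
        pinv(1,2) = -a(k,k+1)/det
        pinv(2,1) = -a(k+1,k)/det
        pinv(2,2) =  a(k,k)/det
        m = 0.0d0
        do j = k+2, n                         ! row maxima outside the pivot block
           m(1) = max(m(1), abs(a(k,j)))
           m(2) = max(m(2), abs(a(k+1,j)))
        end do
        t(1) = abs(pinv(1,1))*m(1) + abs(pinv(1,2))*m(2)
        t(2) = abs(pinv(2,1))*m(1) + abs(pinv(2,2))*m(2)
        pivot2x2_ok = t(1) <= 1.0d0/u .and. t(2) <= 1.0d0/u
      end function pivot2x2_ok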

5.2.4 Other stability considerations

Another issue associated with stability estimation in sparse Gaussian elimination is the proper interpretation of inequality (5.2.7). We have bounded the size of the matrix H that satisfies the equation

\[
A + H = LU
\tag{5.2.11}
\]

and thus established the successful factorization of a neighbouring matrix. However, if H has a sparsity pattern different from that of A, A + H is qualitatively different from A. We corrected for this by iterative refinement in Section 4.13 for the dense case. This is particularly important for the sparse case for two reasons. First, threshold pivoting may lead to large growth (large values of a_{ij}^{(k)}). When computing a solution, we will have replaced A by \tilde{L}\tilde{U} and the zeros in fill-in positions will have been replaced by numbers of size about \varepsilon \max_k |a_{ij}^{(k)}|, which is a qualitative change. This qualitative change is discussed further in Section 15.5. Secondly, n can be very large in the sparse case.

As well as improving the solution, iterative refinement can be used to indicate the likely error in \tilde{x}. Erisman and Reid (1974) suggested applying random artificial perturbations to the entries of A and b when calculating the residuals, in order that the consequent changes \Delta x^{(k)} indicate the uncertainties in the solution. Exact zeros (usually caused by the lack of certain physical connections) are not perturbed. In this way, data uncertainties in the values of the entries can be taken into account. Often iteration to full working precision is not justified.

Partial pivoting and the various threshold schemes all work to control the size of the multipliers, but there is a greater likelihood of growth due to accumulation of the size of an entry as n grows. There is no concern about stability for any diagonal pivot selection in the symmetric positive-definite case, but there is in other cases.


Consider the matrix that has the pattern illustrated on the right of Figure 5.2.1, but now with diagonal entries 1.0 and off-diagonal entries 2.0. So long as u < 0.5, diagonal pivots will be selected, but after k steps the value in the (n, n) position will grow to 1 − 4(k − 1). This growth is not an issue for modest values of n, but if n = 1 000 000, the growth may cause significant instability. This suggests monitoring the values of those entries that change at each step, so that any excessive growth can be detected, but this is rarely done because of the expense.
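Monitoring growth need not be elaborate. The sketch below, for a dense matrix and a fixed pivot order (interchanges and sparsity are omitted to keep the illustration short, and every pivot is assumed nonzero), accumulates ρ = max_{i,j,k} |a_{ij}^{(k)}| as each entry is modified; a sparse code could update the same scalar as it processes each modified or filled-in entry, at the cost noted above.

      ! Sketch: Gaussian elimination on a dense matrix without interchanges,
      ! tracking the growth measure rho = max over all reduced matrices.
      subroutine eliminate_with_growth(a, n, rho)
        implicit none
        integer, intent(in) :: n
        double precision, intent(inout) :: a(n,n)
        double precision, intent(out) :: rho
        integer :: i, j, k
        rho = maxval(abs(a))                    ! include the original matrix
        do k = 1, n-1
           do i = k+1, n
              a(i,k) = a(i,k)/a(k,k)            ! multiplier, stored in place
              do j = k+1, n
                 a(i,j) = a(i,j) - a(i,k)*a(k,j)
                 rho = max(rho, abs(a(i,j)))    ! growth check on each update
              end do
           end do
        end do
      end subroutine eliminate_with_growth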

5.2.5 Estimating condition numbers in sparse computation

In Chapter 4, we showed that the stability of the factorization is only part of the story in assessing the accuracy of the solution. The other part is the conditioning of the matrix. The LINPACK condition estimator (see Section 4.12.1) and Hager's method (see Section 4.12.2) may be implemented without having to refactorize the matrix, simply by using the existing factors with additional right-hand sides. This provides a reasonable estimate of the condition of the problem. When adapting these methods to the sparse case, one must be cautious. In the dense case, factorization dominates the overall cost. As we will show in Section 5.5, this need not be the case for sparse matrices. Arioli, Demmel, and Duff (1989) use Hager's method, as modified by Higham (1988), to estimate condition numbers of a sparse matrix, and a code for this is available as HSL subroutine MC71.

What we have described in the trade-off between sparsity and numerical considerations is readily incorporated in the Markowitz ordering, described in Section 5.3.2. Incorporation with the banded and dissection methods of Sections 5.3.3 and 5.3.4 will require a bit more care, and that discussion is deferred to Chapters 8 and 9, respectively.
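The stability checks recommended earlier in this section ultimately rest on the size of the residual r = b − Ax, which costs only one pass over the entries of A. A sketch with the matrix held as a collection of sparse columns (the array names colptr, rowind, and val are ours, chosen for illustration) is:

      ! Sketch: residual r = b - A*x for A held by sparse columns.
      ! colptr(j) points to the start of column j in rowind/val,
      ! with colptr(n+1) = number of entries + 1.
      subroutine residual(n, colptr, rowind, val, x, b, r)
        implicit none
        integer, intent(in) :: n, colptr(n+1), rowind(*)
        double precision, intent(in) :: val(*), x(n), b(n)
        double precision, intent(out) :: r(n)
        integer :: j, p
        r = b
        do j = 1, n
           do p = colptr(j), colptr(j+1) - 1
              r(rowind(p)) = r(rowind(p)) - val(p)*x(j)   ! subtract A(:,j)*x(j)
           end do
        end do
      end subroutine residual

Comparing the size of r with the sizes of A, x, and b, or feeding r into iterative refinement, then gives a cheap a posteriori indication of the stability of the factorization.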

5.3 Orderings

Ordering the rows and columns to preserve sparsity in Gaussian elimination was demonstrated to be effective in the example of Section 1.3. Another illustration is shown in Figure 5.2.1, where the reordered matrix preserves all zeros in the factorization, but the original order preserves none. There are four quite different strategies for ordering a sparse matrix which are introduced here. First, we may be able to reorder the matrix to block triangular form. We discuss this in Section 5.3.1, and develop the algorithms to do this in Chapter 6. We start here because if such an ordering exists, we can confine the factorization steps to smaller blocks on the diagonal, rather than to the whole matrix. Secondly, we consider local strategies where, at each stage of the factorization, we choose as pivot the entry that preserves sparsity ‘as well as possible’ according to some criteria. We introduce this in Section 5.3.2 and develop and analyse such algorithms in detail in Chapter 7.


Thirdly, there are methods of preserving sparsity by confining fill-ins to particular areas of the matrix. We introduce these in Section 5.3.3 and develop them in detail in Chapter 8. Fourthly, a strategy under the heading of dissection methods looks at the matrix in its entirety, seeking to split the overall problem into independent subproblems. These methods are particularly applicable for parallel computing environments. For certain classes of sparse matrix problems, such methods yield the best overall sparsity ordering. These methods are introduced in Section 5.3.4 and are developed in detail in Chapter 9.

The main conclusion is that there is no single 'best' method for preserving sparsity. The criterion for what is best is more difficult to pin down than it might seem at first. For example, the criterion of 'least fill-in' might seem to be best, but it may require a great deal more data structure manipulation to implement than one with more fill-in but simpler structures. Furthermore, actually determining an ordering that produces the least amount of fill-in is not practical for large problems, see Section 7.1. There are similar difficulties with 'fewest operations' or least overall computing time. These objectives are influenced by how much of the computation can be carried out in parallel, as well as by the data structure issues. In Chapters 8 and 9 we provide, where possible, comparative results between these different classes of methods, making use of experiments on standard test problems.

5.3.1 Block triangular matrix

\[
A = \begin{bmatrix}
A_{11} & & \\
A_{21} & A_{22} & \\
A_{31} & A_{32} & A_{33}
\end{bmatrix}
\]

Fig. 5.3.1. Block lower triangular matrix.

A pattern that is worthwhile to seek is the block lower triangular form illustrated in Figure 5.3.1. If we partition x and b similarly, we may solve the equation Ax = b by solving the equations

\[
A_{ii} x_i = b_i - \sum_{j=1}^{i-1} A_{ij} x_j, \quad i = 1, 2, 3,
\tag{5.3.1}
\]


where the sum is zero for i = 1. We have to factorize only the diagonal blocks A_ii. The off-diagonal blocks A_ij, i > j, are used only in the multiplications A_ij x_j. In particular, all fill-in is confined to the blocks on the diagonal. Any row and column interchanges needed for the sake of stability and sparsity may be performed within the blocks on the diagonal and do not affect the block triangular structure. Algorithms for achieving this form are discussed in Chapter 6.

5.3.2 Local pivot strategies

A very simple but effective strategy for maintaining sparsity is due to Markowitz (1957). At each stage of Gaussian elimination, he selects as pivot the nonzero entry of the remaining reduced submatrix with lowest product of the number of other entries in its row and the number of other entries in its column. This is the method that was used in Section 1.3, and it is easy to verify that it will accomplish the reordering shown in Figure 5.2.1. The Markowitz strategy and some variations are discussed in Chapter 7. Implementation details for the unsymmetric case are considered in Chapter 10 and for the symmetric case in Chapter 12. Surprisingly, the algorithm can be implemented in the symmetric case without explicitly updating the sparsity pattern at each stage, which greatly improves its speed. Of course, we must consider not only sparsity preservation, but also numerical stability. These strategies are called local methods because decisions are made locally at each stage of the factorization without regard to how they may affect later stages.

5.3.3 Band and variable band ordering

A very different approach is to permute the matrix to a form within which all fill-ins during the elimination are confined. One example is the banded form, with a_{ij} = 0 if |i − j| > s for some value s, which is illustrated on the left of Figure 5.3.2. Another is its generalization to the variable-band form (also called skyline, profile, and envelope) with a_{ij} = 0 if j − i > s_i and a_{ji} = 0 if j − i > t_i for some values s_i and t_i, i = 1, 2, ..., n, which is shown on the right of Figure 5.3.2.


Fig. 5.3.2. Band and variable-band matrices.


Fig. 5.3.3. A block tridiagonal matrix.

It is easy to verify that if no further interchanges are performed, there is no fill-in ahead of the first entry of a row or ahead of the first entry of a column. It follows that both the band and variable-band forms are preserved. If the matrix can be permuted to have a narrow bandwidth, this is an effective way to exploit sparsity. In this book, we treat this as one of several alternatives. Algorithms for obtaining these forms are discussed in Section 8.3. They are, in fact, based on finding permutations to block tridiagonal form, illustrated in Figure 5.3.3. If the blocks are small and numerous, such a matrix is banded with a small bandwidth. George (1977) pointed out that the factorization of a block tridiagonal matrix A can be written in the form

\[
\begin{bmatrix}
A_{11} & A_{12} & & \\
A_{21} & A_{22} & A_{23} & \\
 & A_{32} & A_{33} & \ddots \\
 & & \ddots & \ddots
\end{bmatrix}
=
\begin{bmatrix}
I & & & \\
A_{21}D_1^{-1} & I & & \\
 & A_{32}D_2^{-1} & I & \\
 & & \ddots & \ddots
\end{bmatrix}
\begin{bmatrix}
D_1 & A_{12} & & \\
 & D_2 & A_{23} & \\
 & & D_3 & \ddots \\
 & & & \ddots
\end{bmatrix},
\tag{5.3.2}
\]

where

\[
D_1 = A_{11},
\tag{5.3.3a}
\]
\[
D_i = A_{ii} - A_{i,i-1} D_{i-1}^{-1} A_{i-1,i}, \quad i = 2, 3, ..., N.
\tag{5.3.3b}
\]

To use this form to solve a set of equations Ax = b requires the forward substitution steps

\[
c_1 = b_1,
\tag{5.3.4a}
\]
\[
c_i = b_i - A_{i,i-1} D_{i-1}^{-1} c_{i-1}, \quad i = 2, 3, ..., N,
\tag{5.3.4b}
\]

followed by the back-substitution steps

\[
x_N = D_N^{-1} c_N,
\tag{5.3.5a}
\]
\[
x_i = D_i^{-1} \bigl( c_i - A_{i,i+1} x_{i+1} \bigr), \quad i = N-1, ..., 1.
\tag{5.3.5b}
\]

It therefore suffices to keep the off-diagonal blocks of A unmodified and the diagonal blocks D_i in factorized form

\[
D_i = L_i U_i, \quad i = 1, 2, ....
\tag{5.3.6}
\]

The matrices L_i and U_i are just the diagonal blocks of the ordinary LU factorization of A, as may be seen (Exercise 5.1) by substituting L_i U_i for D_i in equation (5.3.2). The fact that all fill-in is avoided in the off-diagonal blocks therefore means that storage is saved overall. The operation count for forming the factorization is the same if the triple product in equation (5.3.3b) is formed as (A_{i,i-1} U_{i-1}^{-1})(L_{i-1}^{-1} A_{i-1,i}), since the matrices A_{i,i-1} U_{i-1}^{-1} and L_{i-1}^{-1} A_{i-1,i} are part of the ordinary LU factorization. However, if A_{i,i-1} U_{i-1}^{-1} L_{i-1}^{-1} A_{i-1,i} is calculated with a different ordering of products, the operation count is likely to differ. Similarly, there may or may not be a saving in operation count when using the implicit factorization to solve a set of equations.
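To make equations (5.3.3)–(5.3.5) concrete, the sketch below carries them out in the special case where every block has order 1, so that each D_i is a scalar and the inner factorization (5.3.6) is trivial. It assumes no interchanges are needed (for example, the matrix is diagonally dominant), and the array names are illustrative rather than taken from any code.

      ! Sketch: equations (5.3.3)-(5.3.5) when every block has order 1.
      ! The tridiagonal matrix is held by its subdiagonal sub(2:n),
      ! diagonal diag(1:n), and superdiagonal sup(1:n-1).
      subroutine tridiag_solve(n, sub, diag, sup, b, x)
        implicit none
        integer, intent(in) :: n
        double precision, intent(in) :: sub(n), diag(n), sup(n), b(n)
        double precision, intent(out) :: x(n)
        double precision :: d(n), c(n)
        integer :: i
        d(1) = diag(1)                               ! (5.3.3a)
        do i = 2, n
           d(i) = diag(i) - sub(i)*sup(i-1)/d(i-1)   ! (5.3.3b)
        end do
        c(1) = b(1)                                  ! (5.3.4a)
        do i = 2, n
           c(i) = b(i) - sub(i)*c(i-1)/d(i-1)        ! (5.3.4b)
        end do
        x(n) = c(n)/d(n)                             ! (5.3.5a)
        do i = n-1, 1, -1
           x(i) = (c(i) - sup(i)*x(i+1))/d(i)        ! (5.3.5b)
        end do
      end subroutine tridiag_solve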

5.3.4 Dissection

\[
\begin{bmatrix}
A_{11} & 0 & A_{13} \\
0 & A_{22} & A_{23} \\
A_{31} & A_{32} & A_{33}
\end{bmatrix}
\]

[Figure 5.3.4 also shows the region, with parts 1 and 2 separated by the cut, part 3.]

Fig. 5.3.4. A dissection of the region and its corresponding matrix.

Another technique for preserving zeros is dissection. If the matrix pattern corresponds to a finite-element or finite-difference approximation over a region, the removal of part of the region (for example, region 3 in Figure 5.3.4) may cut the problem into independent pieces. Ordering the variables of the cut (and the corresponding equations) last yields a bordered block diagonal form, as illustrated by the matrix in Figure 5.3.4. The zero off-diagonal blocks remain zero during factorization if no further interchanges are performed except within blocks and the eliminations in block A_33 are done last. Methods such as nested dissection and one-way dissection are examples of this general approach. These and other ordering methods which preserve zeros are the subject of Chapter 9.

Fig. 5.3.5. Doubly-bordered block diagonal form.

Fig. 5.3.6. Bordered block triangular form.

The doubly-bordered block diagonal form (Figure 5.3.5) is a generalization of the matrix in Figure 5.3.4, and is a desirable form. Another desirable form is the bordered block triangular form (Figure 5.3.6), of which a simple case is the bordered triangular form. An important example of a doubly-bordered block diagonal matrix, that obtained by a one-way dissection ordering, is discussed in Section 9.2, and its generalization to the case where the diagonal blocks are themselves in this form is discussed in Section 9.3.

5.4 Features of a code for the solution of sparse equations

Having looked at numerical computation adapted to the sparse case and at ordering, we now look at the phases of the computer code, again from a user viewpoint. We begin by describing some of the features required of a general computer code for the direct solution of sparse linear equations.

The first step is the representation of the sparse matrix problem. The user of the code should not have to write complicated code to force the data into a form that is convenient for the implementation. An inexpert user may code a faulty or poor sorting algorithm that loses most of the potential gains in efficiency. For this reason, a common choice is the requirement to specify the entries as an unordered sequence of triples (i, j, a_ij) indicating row index, column index, and numerical value. This is called the coordinate scheme; it was discussed and illustrated in Section 2.6. Sometimes it may be equally convenient to specify the matrix as a collection of column (or row) vectors, as discussed and illustrated in Section 2.7. For example, in a nonlinear problem the Jacobian matrix may be generated column by column using automatic differentiation or finite differences. These storage schemes are usually not by themselves adequate for efficient solution, and so the original (user's) data structure will have to be transformed to one that is efficient internally.

Often it is the case that a sequence of systems of equations must be solved where the sparsity pattern is fixed, but the numerical values change. One example is the solution of time-dependent nonlinear differential equations. The Jacobian matrix can be accommodated in a single sparsity pattern for every time point, while the numerical values change (although some entries may be zero at the initial point). Another example is the design problem where values of parameters must be chosen to maximize some measure of performance. In these cases, it is


worthwhile to invest effort in the choice of a good ordering for the particular sparsity pattern and in the development of a good associated data structure, since the costs can be spread over many solutions. Except in the symmetric and positive-definite case, this raises some numerical stability questions when the values of the entries change and these were discussed in Section 5.2.

We may want to use one matrix factorization to solve a system of equations with many right-hand sides. This is the case when performing iterative refinement (Section 4.13) and may also happen because of different analyses of a mathematical model based on various inputs, or it may come from modifying the method of the last paragraph to use one Jacobian matrix over a number of steps.

Finally, at the conclusion of the calculations, the user will need an output file collecting the solutions and supporting parameters for estimating stability and conditioning of the problem. It may be that the data will be used by an external graphical package, or for a history file used to modify selections such as an appropriate size of u across a problem class. It may be that the file is used to create a new set of problems to solve based on a parametric study.

Because of these common uses of a general sparse matrix computer code, it is usual to consider three distinct phases: ANALYSE, FACTORIZE, and SOLVE (although not all codes fit this template exactly). These phases and their features are summarized in Table 5.4.1.

Table 5.4.1 Key features of ANALYSE, FACTORIZE, and SOLVE.

ANALYSE
  1. Transfer user data to internal data structures.
  2. Determine a good pivotal sequence.
  3. In case of numerical stability concerns, carry out factorization, possibly scaling the matrix beforehand.
  4. Prepare data structures for the efficient execution of the other phases.

FACTORIZE
  1. Transfer numerical values to internal data structure format.
  2. Perform matrix factorization based on the chosen pivotal sequence, potentially adjusting the sequence if stability is an issue. In this case, it is usually advantageous to have scaled the matrix.

SOLVE
  1. Perform forward substitution and back-substitution using stored L\U factors.
  2. Perform iterative refinement to improve the solution and to estimate the error.

For matrices whose pattern is symmetric or nearly so, it is possible to organize the computation so that the ANALYSE phase works on the sparsity pattern alone and involves no actual computation on

real numbers. On the other hand, with unsymmetric problems we often work on the actual numbers too, so a factorization may be a by-product of the analysis. For this reason, this phase for unsymmetric matrices is sometimes termed ANALYSE-FACTORIZE.

5.4.1 Input of data

Depending on the capability of the sparse matrix package, the user will need to do more or less work in preparing the data for use by the code. Most codes will accept data in the coordinate scheme (i, j, a_ij) in any order, which is easy to construct from any other sparse-matrix format. In this case, the code will need to transform this set of unordered triples to the structure needed for the sparse computation. There may be options for detecting data errors, such as a row or column with no entries or an invalid row or column index. The structure used for the sparse computation is often a sequence of sparse columns and the code may use the transformation discussed in Section 2.12. When the data is already in this format, the code can bypass this work.

In addition to inputting the data, the user may need to specify the amount of workspace. Modern codes will allocate additional workspace if it proves to be too small, but this can slow the computation. The user will need to specify the type of ordering algorithm to use (the broad types we outlined in Section 5.3 are not all supported in each code). The user also has the option of identifying whether this is a new matrix pattern or a different set of data using a pattern that has already been used. The user may need to specify other parameters, such as the pivot threshold u.

5.4.2 The ANALYSE phase

It is in the ANALYSE phase that the fundamental differences between codes are most evident. In the unsymmetric case, we often deal with both numerical and structural data in this phase. The matrix may need to be scaled. In the course of selecting the pivot order, the updated eliminated matrix with the numerical and structural information, including fill-in, is held. This means that the amount of additional storage that is required is unknown beforehand. This is a case where the user needs to estimate the amount of workspace required. A modern code is able to reallocate arrays whenever necessary. This phase contains a careful implementation of the manipulation of sparse vectors and matrices, drawing on the ideas developed in Chapter 2.

A typical code for symmetric systems assumes that pivots may be selected from the diagonal without risk of instability. The pivot selection is accomplished without numerical values and fill-in is tracked implicitly. Thus, no additional storage is needed during ANALYSE, and the precise requirements for FACTORIZE in the positive-definite case can be reported to the user.

5.4.3 The FACTORIZE phase

On completion of the ANALYSE phase, we know the pivot sequence, the fill-ins, and have determined data structures that enable us to compute the LU factors of


the input matrix. Usually, the entries are supplied to FACTORIZE in precisely the order in which they were supplied to ANALYSE. In the case where the factorization is computed in the course of ANALYSE, this phase can be skipped and we can go directly to SOLVE.

When we enter the FACTORIZE phase for a matrix that is not positive definite, we may need to scale the matrix and we need to be concerned with the numerical stability of the factorization. One possibility (used in the MA48 code of Table 5.5.1) is to keep the column order but allow fresh row interchanges (see Section 10.4), and this is often only slightly slower. If it results in much more fill-in, the ANALYSE phase may have to be repeated. It is our experience, however, that problems in a sequence with the same sparsity pattern are often related closely enough for the subsequent factorizations to use the same sequence. Because of this, one possibility is to enter the new data and use the structures to create the numerical factors without change and without checking for numerically unstable pivots. In this case, it is important to validate the stability of the factorization after the fact, and perhaps re-enter the ANALYSE or FACTORIZE phase. In the case where numerical stability concerns are not present, the user may enter the FACTORIZE phase with numerical values, and with the precise storage required for the fill-ins based on the output from ANALYSE.

      do k = k1,k2
         j = col_index(k)
         w(j) = w(j) - alpha*val(k)
      end do

Fig. 5.4.1. An inner loop of FACTORIZE.

An important operation in the numerical factorization of MA48 involves a loop like that illustrated in Figure 5.4.1. Although this is a very simple loop without any searching or testing, and although the array val is accessed sequentially, we are addressing w indirectly through col_index. This means that memory accesses are more irregular than for similar operations on dense matrices. A major development of the last 30 years or so has been the reworking of sparse direct algorithms so that all indirect addressing is avoided in the innermost loops, allowing Level 2 and Level 3 BLAS kernels for dense matrices to be used in their implementation. This is particularly important because almost all modern machines utilize cache memories (sometimes at multiple levels of hierarchy) and data movement is almost always the bottleneck for high-performance architectures.
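The arrays val and col_index of Figure 5.4.1 are typical of packed sparse vectors. On input, most codes accept the coordinate triples (i, j, a_ij) in any order and build such a column-wise structure internally. A minimal sketch of one way to do this (a counting sort; the array names are ours, and duplicate entries and error checking are ignored) is:

      ! Sketch: convert ne unordered coordinate triples (irn, jcn, va) of an
      ! n x n matrix into a collection of sparse columns (colptr, rowind, val).
      subroutine coo_to_columns(n, ne, irn, jcn, va, colptr, rowind, val)
        implicit none
        integer, intent(in) :: n, ne, irn(ne), jcn(ne)
        double precision, intent(in) :: va(ne)
        integer, intent(out) :: colptr(n+1), rowind(ne)
        double precision, intent(out) :: val(ne)
        integer :: j, k, p
        colptr = 0
        do k = 1, ne                        ! count the entries in each column
           colptr(jcn(k)+1) = colptr(jcn(k)+1) + 1
        end do
        colptr(1) = 1
        do j = 1, n                         ! cumulative sums give column starts
           colptr(j+1) = colptr(j+1) + colptr(j)
        end do
        do k = 1, ne                        ! scatter entries into their columns
           j = jcn(k)
           p = colptr(j)
           rowind(p) = irn(k)
           val(p) = va(k)
           colptr(j) = p + 1
        end do
        do j = n, 1, -1                     ! restore the column start pointers
           colptr(j+1) = colptr(j)
        end do
        colptr(1) = 1
      end subroutine coo_to_columns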

5.4.4 The SOLVE phase

For the SOLVE phase, we need to enter the right-hand side or sides, with a parameter for the number of right-hand sides.


In Chapter 3, we described forward substitution and back-substitution in terms of accessing rows of L and rows of U, but adapting to columns of one or both is quite easy (see Exercise 5.2). The variations are insignificant unless advantage is to be taken of sparsity in b or of a requirement for only a partial solution. Details of implementation for SOLVE are deferred to Chapters 10 and 14. Note that if A = LU, then U^T L^T constitutes an LU factorization of A^T, so SOLVE should have the facility to solve either Ax = b or A^T x = b.

We make another comment about the SOLVE phase. If the right-hand side b has leading zeros, the forward substitution phase can be started at the first nonzero. If all of the solution is not needed, but only the solution related to some of the variables, then the back-substitution can be stopped after the variables required have been computed. More discussion of this option can be found in Section 7.9. Many codes do not offer this option, but when it is available, it can lead to considerable savings.
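A sketch of forward substitution that exploits leading zeros in b, with the strictly lower triangular part of a unit triangular L held as packed sparse columns (illustrative array names; compare Exercise 5.3), is:

      ! Sketch: forward substitution L*y = b with unit lower triangular L held
      ! by sparse columns (strictly lower part in lptr/lrow/lval), starting the
      ! sweep at the first nonzero of b.
      subroutine forward_subst(n, lptr, lrow, lval, b, y)
        implicit none
        integer, intent(in) :: n, lptr(n+1), lrow(*)
        double precision, intent(in) :: lval(*), b(n)
        double precision, intent(out) :: y(n)
        integer :: j, p, jstart
        y = b
        jstart = 1
        do while (jstart <= n)              ! skip the leading zeros of b
           if (y(jstart) /= 0.0d0) exit
           jstart = jstart + 1
        end do
        do j = jstart, n
           if (y(j) == 0.0d0) cycle         ! column j contributes nothing
           do p = lptr(j), lptr(j+1) - 1
              y(lrow(p)) = y(lrow(p)) - lval(p)*y(j)
           end do
        end do
      end subroutine forward_subst

Back-substitution can similarly be stopped as soon as the required components of the solution have been formed.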

5.4.5 Output of data and analysis of results

In the output phase, the user gathers three different types of data for use in his or her own environment. The first, of course, is the solution, now available for graphing or analysing and reusing. The second is the analysis of the results. Unlike working with small problems, it is often difficult to assess whether the results of a very large-scale computation are reasonable. Yet developing a feel for the results is critical. Changing the parameters (increasing resistance in an electrical circuit, adding to the load on a structure, etc.) is one way that errors, either in the model or in the data, might be detected. Using stability and condition estimators coming from the solution is helpful here as well. In this area, once the model and data are trusted, gathering a history of conditioning and stability for the particular problem class is important. This will guide the user in determining an appropriate value for the pivot threshold u over the problem class, determining the need to incorporate threshold rook pivoting to improve stability in the calculations, etc. The third use of the data is in the preparation for solving a related problem based on the outputs from the first. This problem-dependent use of the data will generally not be built into a sparse matrix code, except in very limited circumstances such as the use of iterative refinement, so the user needs to manage this part of the computation.

5.5 Relative work required by each phase

Here, we compare ANALYSE, FACTORIZE, and SOLVE using particular codes and selected test data to give the user a sense of how they perform. Consider the times in Table 5.5.1 (using the unsymmetric code MA48 where numerical values must be provided in the ANALYSE phase). In this case, ANALYSE is much more costly than FACTORIZE. This illustrates why we should avoid re-entering the ANALYSE phase with new problem data unless numerical considerations require it. Of course, we

should bear in mind that if the answers are inaccurate, it is no consolation that the method was fast.

Table 5.5.1 Times for the three phases of MA48 (seconds on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel compiler with O2 optimization).

Matrix ID        goodwin      rim   onetone1   shyy161    lhr71c
Order (×10³)         7.3     22.6       36.1      76.5      70.3
Entries (×10³)     324.8  1 015.0      341.1     329.8   1 528.1
ANALYSE             2.19    20.36       5.94      6.98      1.09
FACTORIZE           0.28     3.44       0.42      1.15      0.33
SOLVE              0.004    0.023      0.006     0.012     0.011

In the symmetric case, because of some critical implementation details, discussed in Chapter 12, ANALYSE can be significantly less costly than FACTORIZE, as we can see from the results for MA57 in Table 5.5.2. Because the problems of Tables 5.5.1 and 5.5.2 are different, no comparison between the codes is implied.

Table 5.5.2 Times for the three phases of MA57 (seconds on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel Fortran compiler with O2 optimization).

Matrix ID          bmw3_2  helm2d03   ncvxqp1  darcy003      cfd1
Order (×10³)        227.4     392.3      12.1     389.9      70.7
Entries (×10³)   11 288.6   2 741.9      74.0   2 097.6   1 825.6
ANALYSE              0.86      1.55      0.07      0.15      0.78
FACTORIZE           10.84      2.13      0.53      0.67      4.62
SOLVE                0.11      0.07      0.01      0.04      0.05

Another comment about SOLVE is important to the user. While in the dense case SOLVE is usually trivial compared with FACTORIZE (involving O(n²) compared with O(n³) operations), this is not the case for sparse problems. The DARCY003 column of Table 5.5.2 illustrates that the SOLVE time can be about a tenth of the corresponding FACTORIZE time. Our experience is that this could be an even higher fraction on smaller problems. The main issue in many applications is that there may be very many calls to SOLVE for each call to FACTORIZE. For example, Duff (2004) reports that, in the run of an optimization code, 34 174 calls to the SOLVE subroutine of MA57 were made, but only 125 matrix factorizations. Thus, the efficient implementation of SOLVE can be as important as the other phases.

5.6 Multiple right-hand sides

There are several applications where many sets of equations need to be solved for the same matrix, for example for different loads on the same structure or different incident angles for a scattering problem. Of course, you could just solve the equations in a loop, but then the cost of solving, say, k right-hand sides would be about k times the cost of one solution. However, if we operate on all the right-hand sides at the same time, then we can reuse the entries of the factorized matrix and save greatly on data movement costs. This can be effected by using Level 3 BLAS in place of the Level 2 BLAS used for single right-hand sides. There is also the possibility of vectorizing or parallelizing across the right-hand sides, effectively working by rows rather than by columns. The MA57 code uses Level 3 BLAS for multiple right-hand sides and we illustrate the savings in Table 5.6.1. The amount of these savings will be system dependent, reflecting the better performance of Level 3 BLAS over Level 2 BLAS on the machine in question. We note that, on modern machines, a crucial difference between the FACTORIZE and SOLVE phases is that, in general, FACTORIZE is compute-bound, while SOLVE is memory-bound. Solving for multiple right-hand sides compensates a little for this.

Table 5.6.1 Times in seconds per right-hand side using MA57 on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel compiler with O2 optimization.

                                              Number of right-hand sides
Matrix ID   Order (×10³)   Entries (×10³)       1      5     10     20
dawson5            51.5         1 010.8       0.010  0.006  0.004  0.004
bmw3_2            227.4        11 288.6       0.093  0.045  0.032  0.026
turon_m           189.9         1 690.9       0.039  0.020  0.015  0.013
nasasrb            54.9         2 677.3       0.025  0.012  0.008  0.007
cfd1               70.7         1 825.6       0.047  0.023  0.017  0.015
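The savings in Table 5.6.1 come from loading each entry of the factors once and applying it to every right-hand side. The sketch below shows the idea for forward substitution with a dense unit lower triangular factor and plain loops; in MA57 the same effect is obtained with Level 3 BLAS applied to dense blocks of the factors.

      ! Sketch: forward substitution with nrhs right-hand sides processed
      ! together, so that each entry of L is loaded once and used nrhs times.
      ! L is unit lower triangular and held as a dense array for illustration.
      subroutine forward_multi(n, nrhs, l, b)
        implicit none
        integer, intent(in) :: n, nrhs
        double precision, intent(in) :: l(n,n)
        double precision, intent(inout) :: b(n,nrhs)   ! overwritten by the solution
        integer :: i, j, k
        do j = 1, n
           do i = j+1, n
              do k = 1, nrhs                 ! reuse l(i,j) across all columns of b
                 b(i,k) = b(i,k) - l(i,j)*b(j,k)
              end do
           end do
        end do
      end subroutine forward_multi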

5.7 Computation of entries of the inverse

Although we explained in Section 3.10 (second half) that one should not compute the inverse of a sparse matrix in order to solve a set of linear equations, there are situations when explicit entries of the inverse of a sparse matrix are required. One common case is that the diagonal entries of the inverse of the normal equations matrix A^T A provide variances in statistical problems where the coefficient matrix is A. The off-diagonal entries provide covariances or sensitivity information. We note that, from the equation AX = I, with A factorized as A = LU, entry (i, j) of the inverse is given by x_i, where Ux = L^{-1} e_j. Thus, we have a classic case where we have both a (very) sparse right-hand side and the requirement for only a few (or sometimes one) component of the


solution. In Section 15.7, we show another way to compute entries of the inverse of a sparse matrix.
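A sketch of the computation just described, with the factors held as dense arrays purely for illustration (L unit lower triangular, U upper triangular): entry (i, j) of the inverse is component i of the solution of LUx = e_j.

      ! Sketch: entry (i,j) of the inverse of A = LU, obtained by solving
      ! L*y = e_j (forward substitution) and then U*x = y (back-substitution).
      double precision function inverse_entry(n, l, u, i, j)
        implicit none
        integer, intent(in) :: n, i, j
        double precision, intent(in) :: l(n,n), u(n,n)
        double precision :: y(n), x(n)
        integer :: p, q
        y = 0.0d0
        y(j) = 1.0d0                       ! right-hand side e_j
        do p = j+1, n                      ! forward substitution; y(1:j-1) stays zero
           do q = j, p-1
              y(p) = y(p) - l(p,q)*y(q)
           end do
        end do
        do p = n, 1, -1                    ! back-substitution
           x(p) = y(p)
           do q = p+1, n
              x(p) = x(p) - u(p,q)*x(q)
           end do
           x(p) = x(p)/u(p,p)
        end do
        inverse_entry = x(i)
      end function inverse_entry

In a sparse code, the forward sweep would exploit the sparsity of e_j and of the columns of L, and the back-substitution could stop once component i has been formed, as discussed in Section 5.4.4.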

5.8 Matrices with complex entries

There are many applications where the coefficient matrix is complex rather than real. The case of complex unsymmetric matrices can be handled by almost identical code to that for real unsymmetric matrices. In the symmetric case, an added complication is that both Hermitian matrices and complex symmetric matrices occur. In the former case, the diagonal entries are real, and this may be enforced by the code to avoid contamination by rounding errors introducing a nonzero imaginary part. The other problem with Hermitian matrices is that if any entry is permuted to the other triangle, it is necessary to replace its value by its conjugate. Both complex symmetric matrices and non-positive-definite Hermitian matrices will require numerical pivoting, as in the case of indefinite symmetric systems. A notable exception arises in the case of a symmetric complex matrix in frequency-domain analysis of large-scale circuits, as discussed by Erisman (1972) and Erisman and Spies (1972). Here, while we know of no mathematical justification for using diagonal pivots without regard to numerical values, this seems to be a valid approach based on a physical argument and significant testing. In this case, monitoring the stability is required to ensure that the factorization is stable, and the bound (4.8.9) is recommended.

5.9 Writing compared with using sparse matrix software

If a large sparse matrix problem had to be solved in 1970, software needed to be written for the problem, as no standard codes were widely available. Today, the situation is very different. Very useful sparse matrix software is generally available and is actively being improved. The sophistication of sparse direct solvers, even more so if parallelism is being exploited, leads us to strongly recommend that the reader does not attempt to write his or her own sparse direct code, at least initially. Often the work involved in performing operations on a sparse matrix of order n with τ = cn nonzeros is only O(c²n) (see Exercise 5.4). One of the subtleties in writing sparse matrix software lies in avoiding O(n²) or more operations. This can happen in very innocent ways and then dominate the computation. An example lies in the implementation of the pivot selection in ANALYSE. A naïve strategy for the algorithm of Markowitz would require checking all remaining entries at every step, which would require at least O(n²) comparisons in total. A number of other O(n²) traps were identified by Gustavson (1978).

Exercises

5.1 By substituting L_i U_i for D_i in equation (5.3.2), show that L_i and U_i are the diagonal blocks of the ordinary LU factorization of A.


5.2 Rewrite the algebraic description of back-substitution and forward substitution, equations (3.2.3) and (3.2.4), to access both L and U by columns.

5.3 With columns of L stored in packed form, write a program to perform forward substitution.

5.4 Give a plausible argument to illustrate that a sparse LU factorization may be performed in O(c²n) operations, if n is the order and cn is the number of nonzeros.

5.5 Tables 5.5.1 and 5.5.2 present dramatically different relative costs for ANALYSE and FACTORIZE. Explain.

5.6 In the early stages of developing a model, it is important to test the model to see its response under a variety of conditions. How might the insight from Table 5.6.1 help in guiding this testing phase?

Research exercise

R5.1 Using a standard sparse matrix code, such as one of those mentioned in this chapter, gain some experience as a user by trying several different problems, perhaps from the University of Florida test collection (Davis and Hu 2011). Experiment with different features and try to assess the validity of your results.

6 REDUCTION TO BLOCK TRIANGULAR FORM

We consider algorithms for reducing a general sparse matrix to block triangular form. This form allows the corresponding set of linear equations to be solved as a sequence of subproblems. We discuss the assignment problem of placing entries on the diagonal as a part of the process of finding the block triangular form, though this problem is of interest in its own right. We also consider the extension of the block triangular form to rectangular and singular matrices.

6.1 Introduction

If we are solving a set of linear equations

\[
Ax = b
\tag{6.1.1}
\]

whose matrix has a block triangular form, savings in both computational work and storage may be made by exploiting this form. Our purpose in this chapter is to explain how a given matrix may be permuted to this form, and to demonstrate that the form is (essentially) unique. Remarkably economical algorithms exist for this task, typically requiring O(n) + O(τ) operations for a matrix of order n with τ entries. We find it convenient to consider permutations to the block lower triangular form

\[
PAQ = \begin{bmatrix}
B_{11} & & & & \\
B_{21} & B_{22} & & & \\
B_{31} & B_{32} & B_{33} & & \\
\vdots & \vdots & \vdots & \ddots & \\
B_{N1} & B_{N2} & B_{N3} & \cdots & B_{NN}
\end{bmatrix},
\tag{6.1.2}
\]

though the block upper triangular form could be obtained with only minor modification (see Exercise 6.1). A matrix that can be permuted to the form (6.1.2), with N > 1, is said to be reducible. If no block triangular form other than the trivial one (N = 1) can be


found, the matrix is called irreducible¹. We expect each B_ii to be irreducible, for otherwise a finer decomposition is possible. The advantage of using block triangular forms such as (6.1.2) is that the set of equations (6.1.1) may then be solved by the simple forward substitution

\[
B_{ii} y_i = (Pb)_i - \sum_{j=1}^{i-1} B_{ij} y_j, \quad i = 1, 2, ..., N,
\tag{6.1.3}
\]

(where the sum is zero for i = 1) and the permutation

\[
x = Qy.
\tag{6.1.4}
\]

We have to factorize only the diagonal blocks B_ii, i = 1, 2, ..., N. The off-diagonal blocks B_ij, i > j, are used only in the multiplications B_ij y_j. Notice in particular that all fill-in is confined to the diagonal blocks. If row and column interchanges are performed within each diagonal block for the sake of stability and sparsity, this will not affect the block triangular structure.

There are classes of problems for which the algorithms of this chapter are not useful. For some applications, it may be known a priori that the matrix is irreducible. Matrices arising from the discretization of partial differential equations are often of this nature. In other cases, the decomposition may be known in advance from the underlying physical structure. If A is symmetric with nonzero diagonal entries, reducibility means that A may be permuted to block diagonal form. While the methods of this chapter will find this form, it is likely to be more efficient to find it via the elimination tree, see Section 12.4. Even when the form is known, the automatic methods may still be useful to avoid the need to explicitly input the problem in the required form or to verify that it has not been changed by data input errors. Note that many areas produce reducible systems, including chemical process design (Westerberg and Berna 1979), linear programming (Hellerman and Rarick 1971), and economic modelling (Szyld 1981). Because of the speed of the block triangularization algorithms and the savings that result from their use in solving reducible systems of equations, they can be very beneficial.

¹ An n × n irreducible matrix can also be categorized as having the strong Hall property: every n × k submatrix, for k = 1, 2, ..., n − 1, has at least k + 1 nonzero rows. This is a consequence of the fact that if the matrix is reducible, it has the form (6.1.2) with B_NN a square matrix of order k < n; the k columns of B_NN have only k nonzero rows. Some authors reserve the terms reducible and irreducible for the case Q = P^T. For Q ≠ P^T, Harary (1971) uses the terms 'bireducible' and 'bi-irreducible' and Schneider (1977) uses the term 'fully indecomposable'. To the best of our knowledge, the term strong Hall was first adopted by Coleman, Edenbrandt, and Gilbert (1986), who used the property to prove that symbolic Gaussian elimination of A^T A can accurately predict the structure of R for a QR factorization of A.

6.2 Finding the block triangular form in three stages

Techniques exist for permuting directly to the block triangular form (6.1.2), but we do not know of any advantage over the three-stage approach:


(i) Look for row and column singletons.
(ii) Permute entries onto the diagonal (usually called finding a transversal).
(iii) Use symmetric permutations to find the block form itself.

In this chapter, we discuss the three stages separately. The essential uniqueness of the form is discussed in Section 6.6, along with an argument for why only symmetric permutations are required in the third stage to achieve the best decomposition. Transversal algorithms are considered in a wider context in Section 6.8, and we introduce the concept of weighted transversals in Section 6.9. Finally, we discuss the extension of the block triangular form to rectangular and singular matrices in Section 6.10. The benefits of using a block triangular form are illustrated in Section 6.7.

6.3 Looking for row and column singletons

If the matrix has a row singleton and we permute it to the position (1,1), the matrix has the form

\[
PAQ = \begin{bmatrix}
B_{11} & \\
B_{21} & B_{22}
\end{bmatrix},
\tag{6.3.1}
\]

where B_11 has order 1. If the matrix B_22 has a row singleton, we may permute it to the position (2,2); the matrix will still have the form (6.3.1), but now B_11 is a lower triangular matrix of order 2. Continuing in this way until no row singletons are available, we find the permuted form (6.3.1) with B_11 a lower triangular matrix. Similarly, we may look for column singletons successively and permute each in turn to the end of the matrix. Following this we have the form

\[
PAQ = \begin{bmatrix}
B_{11} & & \\
B_{21} & B_{22} & \\
B_{31} & B_{32} & B_{33}
\end{bmatrix},
\tag{6.3.2}
\]

where B11 and B33 are lower triangular matrices. Of course, any of the matrices B11, B22, and B33 may be null. We observe that at the conclusion of finding column singletons to make B33 as large as possible, we do not need to return to B11 to see if we have created new row singletons in B22 (see Exercise 6.2). We show an example in Figure 6.3.1. There is one row singleton and choosing it creates another. There are two column singletons and choosing them creates another. We are left with a middle block of size 2.

[Figure 6.3.1 here: a 7 × 7 sparsity pattern and the reordered pattern obtained by choosing row and column singletons.]

Fig. 6.3.1. Using row and column singletons.

There is a long history of looking for row and column singletons in the process of creating block triangular forms. Sargent and Westerberg (1964) used a graph-theoretic approach to transform a matrix to lower triangular form. Rather than using this to find submatrices as we did here, they used it to motivate their algorithm for finding the block triangular form (6.1.2). We look at their algorithm in Section 6.5.2. Both Karp and Sipser (1981) and Magun (1998) used the identification of singletons as a step in their algorithms to find a transversal.

In practice, many problems have large triangular blocks B11 and B33, so that looking for row and column singletons is very worthwhile. For example, control systems often have this form. In the rest of the chapter, we consider how B22 can be permuted to block triangular form. We first look for permutations of this matrix that place entries on its diagonal (finding a transversal) and then use symmetric permutations to achieve the block triangular form.
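A sketch of the first stage, the search for row singletons, is given below (column singletons would be handled in the same way on the pattern of the transpose). For brevity the counts of active entries are recomputed on every pass, which a real code would avoid by careful bookkeeping; the array names are illustrative.

      module singleton_sketch
        implicit none
      contains
        ! Repeatedly find row singletons of the active submatrix, as in the
        ! first stage of Section 6.3.  The pattern is held by columns
        ! (colptr/rowind).  rowperm/colperm list the nsing singleton rows and
        ! their columns in the order found; they belong at the front.
        subroutine row_singletons(n, colptr, rowind, rowperm, colperm, nsing)
          integer, intent(in) :: n, colptr(n+1), rowind(*)
          integer, intent(out) :: rowperm(n), colperm(n), nsing
          logical :: rowactive(n), colactive(n), progress
          integer :: rcount(n), lastcol(n)
          integer :: i, j, p
          rowactive = .true.
          colactive = .true.
          nsing = 0
          progress = .true.
          do while (progress)
             progress = .false.
             rcount = 0
             do j = 1, n                      ! count active entries in each active row
                if (.not. colactive(j)) cycle
                do p = colptr(j), colptr(j+1) - 1
                   i = rowind(p)
                   if (rowactive(i)) then
                      rcount(i) = rcount(i) + 1
                      lastcol(i) = j           ! remember a column holding the entry
                   end if
                end do
             end do
             do i = 1, n                       ! assign every row singleton found
                if (.not. rowactive(i) .or. rcount(i) /= 1) cycle
                if (.not. colactive(lastcol(i))) cycle   ! its column was just taken
                nsing = nsing + 1
                rowperm(nsing) = i
                colperm(nsing) = lastcol(i)
                rowactive(i) = .false.
                colactive(lastcol(i)) = .false.
                progress = .true.
             end do
          end do
        end subroutine row_singletons
      end module singleton_sketch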

6.4 Finding a transversal

6.4.1 Background

Our purpose in this section is to motivate an algorithm for finding a transversal based on the work of Hall (1956). While it is more efficient to find all row and column singletons first, the algorithm does not depend on this. The algorithm which appeared in Hall's paper was based on a breadth-first search, while the algorithm we discuss here follows the lines of the work of Kuhn (1955) and uses a depth-first search. As we shall see, however, it is the implementation of the search that is of crucial importance to the efficiency of the algorithm. The algorithm we describe follows closely the work of Duff (1972, 1981c) and Gustavson (1976), who have discussed this algorithm extensively, including algorithmic details, heuristics, implementation, and testing. Duff (1981a) provides a Fortran implementation available from the Collected Algorithms from TOMS (Algorithm 575) or as package MC21 from HSL. Duff, Kaya, and Uçar (2011) perform a detailed comparison of algorithms for obtaining a maximum transversal.

The algorithm can be described in either a row or a column orientation; we follow the variant that looks at the columns one by one and permutes the rows. After permutations have been found that place entries in the first k − 1 diagonal positions, we examine column k and seek a row permutation that will:
(i) preserve the presence of entries in the first k − 1 diagonal positions, and
(ii) result in column k having an entry in row k.


Fig. 6.4.1. A cheap assignment for column 6.

Fig. 6.4.2. A single preliminary row interchange is needed.

The algorithm continues in this fashion, extending the transversal by one at each stage. Success in this extension is called an assignment. Sometimes this transversal extension step is trivial. If column k has an entry in row k or beyond, say in row i, a simple interchange between rows i and k will place an entry in the (k, k) position: a cheap assignment. This is illustrated in Figure 6.4.1, where the assignment in column six is made by the interchange of rows six and seven. Figure 6.4.2 shows a simple case where a cheap assignment is not available, but the interchange of rows one and seven is adequate to preserve the (1, 1) entry and make a cheap assignment possible in column six.

Sometimes the transversal cannot be extended because the matrix is singular for all numerical values of the entries. Such a matrix is called symbolically singular or structurally singular. This is illustrated in Figure 6.4.3. Observe that columns 1, 2, and 4 have entries in only two rows, so these columns are linearly dependent for any numerical values of their entries.

Fig. 6.4.3. A symbolically singular case.

Fig. 6.4.4. A more complicated transversal extension.

Finally, the transversal may be extendible, but the permutation that accomplishes this may be rather more complicated. Such a matrix pattern is shown in Figure 6.4.4. It may be verified that the sequence of row interchanges

(1, 2), (2, 4), (4, 5) succeeds in placing an entry in the last diagonal position, while keeping entries in the other diagonal positions. The goal of the algorithm described in the next section is to accomplish the transversal extension step quickly and reliably when it can be done, and to identify symbolically singular cases. We show in Section 6.4.3 that it is sufficient to consider the columns one at a time. In our discussion on obtaining a transversal, we follow the language of the other sections of this book and explain the concepts and algorithms in matrix terms. However, we should note that it is very common to describe these algorithms in graphical terminology and we discuss this in Section 6.8.

6.4.2 Transversal extension by depth-first search

Our algorithm provides a systematic means of permuting the rows so that the presence of entries in the first k − 1 diagonal positions is preserved, while an entry is moved to position (k, k). If column k has an entry in row k or beyond, say in row i, a simple interchange of rows i and k suffices. Otherwise we proceed as follows. Suppose the first entry in column k is in row j. If column j has an entry in row k or beyond, say in row p, this means that an interchange between rows p and j will preserve the (j, j) entry and move the entry in column k to row p ≥ k. This is illustrated in Figure 6.4.2, where k = 6, j = 1, p = 7. If column j does not have an entry in row k or beyond, any off-diagonal entry in column j, say (p, j), will allow the row interchange (p, j) to preserve the (j, j) entry while providing an opportunity to look in column p for an entry in row k or beyond. This sequence is illustrated in Figure 6.4.5, where k = 6, j = 2, p = 4, and the pair of interchanges (2, 4) and (4, 6) suffices.

Fig. 6.4.5. Two row interchanges needed in the transversal algorithm.

In general, we seek a sequence of columns c_1, c_2, ..., c_j with c_1 = k having entries in rows r_1, r_2, ..., r_j, with r_i = c_{i+1}, i = 1, 2, ..., j − 1, and r_j ≥ k. Then the sequence of row interchanges (r_1, r_2), ..., (r_{j−1}, r_j) achieves what we need. To find this we make use of a depth-first search with a look-ahead feature. Starting with the first entry in column k, we take its row number to indicate the next column and continue letting the first off-diagonal entry in each column indicate the subsequent column. In each column, we look for an entry in row k or beyond. Always taking the next column rather than trying other rows in the present column is the depth-first part. Looking to see if there is an entry in row k or beyond is the look-ahead feature. This sequence has one of the following possible outcomes:
(i) We find a column with an entry in row k or beyond (some authors call this a cheap assignment, but we prefer to reserve the term for immediate assignments in column k).
(ii) We reach a row already considered (which is not worth taking because it leads to a column already considered).
(iii) We come to a dead end (that is, a column with no off-diagonal entries or one whose off-diagonal entries have all already been considered).

In case (i) we have the sequence of columns that we need. In case (ii) we take the next entry in the current column. In case (iii), we return (backtrack) to the previous column and start again with the next entry there. An example where we reach a row already considered is shown in Figure 6.4.6. A more complicated example, containing both features, is shown in Figure 6.4.7. Note that in the case of backtracking, the paths involved in the backtrack do not contribute to the final column sequence. The row interchanges for Figure 6.4.6 are (1, 3), (3, 5), (5, 6), (6, 8). The interchanges for Figure 6.4.7 are (1, 3), (3, 6), (6, 8).

Fig. 6.4.6. Reaching a row already considered.

Fig. 6.4.7. Returning to a previous column.

The combined effect of the row interchanges at stage k is that the new rows 1 to k consist of the old rows 1 to k − 1 and one other. Therefore, any entry in rows 1 to k − 1 before the interchanges will be in rows 1 to k afterwards. It follows that for the look-ahead feature each entry need be tested only once during the whole algorithm to see if it lies in or beyond the current row k. We therefore

keep pointers for each column to record how far the look-ahead searches have progressed. In particular, the columns in the interior of the depth-first sequence have all been fully examined in look-ahead searches and never need to be so examined again. This simple observation has a significant effect on the efficiency of the implementation. 6.4.3 Analysis of the depth-first search transversal algorithm In this section, we examine the behaviour of the transversal algorithm when the matrix is symbolically singular, look at its computational complexity, and comment on another transversal finding algorithm. Suppose the algorithm has failed to find a path from column k to row k or a row beyond it. If the total number of rows considered is p, we will have looked at p of the first k − 1 columns and also at column k, and found that none of these p + 1 columns contains an entry outside the p rows. Thus, the p + 1 columns have entries in only p rows and the matrix is symbolically singular.2 A simple example of this is in Figure 6.4.3 and another is in Exercise 6.3. We may continue the algorithm from column k + 1 to get other entries onto the diagonal, but we must leave a zero in the (k, k) position. From the description of the algorithm, we observe that there are n major steps, if n is the matrix order. At each step we consider each entry at most once in the depth-first search and at most once in the look-ahead part, so if the matrix has τ entries the number of elementary operations is at worst proportional to nτ . Duff (1981c) provides a detailed discussion of this and includes an example, see Figure 6.4.8 and Exercise 6.4, which shows that this bound may be attained. In general, however, we would expect far fewer steps and Duff’s test results confirm this. Duff et al. (2011) give an exhaustive analysis and comparison of methods for obtaining a maximum transversal that are based on augmenting paths. They 2 An m × n matrix, with m ≥ n, for which every m × k submatrix has at least k nonzero rows is said to have the Hall property. An m × n matrix that does not have the Hall property has structural rank less than n, by the argument used in this paragraph. A matrix that has structural rank less than n does not have a transversal of length n, so our algorithm breaks down at some stage with a set of p + 1 columns with entries in only p rows, so the matrix does not have the Hall property. Therefore an m × n matrix has the Hall property if and only if it has structural rank n (Hall 1935).


Fig. 6.4.8. An example requiring O(nτ ) accesses.

examine seven different algorithms, some based on breadth-first search, some on depth-first search, and some using both. Their findings agree with Duff (1981c) inasmuch as the algorithms with the lowest worst case complexity do not always outperform other algorithms. For example, the algorithm of Hopcroft and Karp (1973), which requires at most O(n^{1/2}τ) operations, is not necessarily better than simpler depth-first searches whose worst case complexity is O(nτ). The PF algorithm of Pothen and Fan (1990) is a depth-first search algorithm that is very similar to MC21. The main algorithmic difference is that, when a transversal extension is found, the PF algorithm continues immediately to try to find other extensions, disjoint from the first, before resetting the search parameters and moving to the next column. This is the algorithm favoured by Duff et al. (2011), but it is modified by them so that the direction of searching alternates between different passes of the algorithm. We discuss transversal selection further in Sections 6.8 and 6.9.

6.4.4 Implementation of the transversal algorithm

As described in its column orientation, the transversal algorithm may be implemented with the positions of the entries stored as a collection of sparse column vectors (Section 2.7). An array is needed to keep track of the row permutations, since all interchanges should be done implicitly. An array is needed to record the column sequence c1 , c2 , ... and we need arrays to record how far we have progressed in each column in the look-ahead search and in the local depthfirst search. Finally, we require an array that records which rows have already been traversed at this stage. It need not be reinitialized if the value k is used as a flag at stage k. Other implementation heuristics have been considered to increase the speed of the algorithm. Perhaps the most obvious is to try to maximize the number


Fig. 6.4.9. Illustration of benefit of preordering columns.

of cheap assignments. Duff (1972) and Gustavson (1976) suggest a number of heuristics, including ordering the columns initially by increasing numbers of entries. Figure 6.4.9 illustrates why this appears to be a good idea. As shown, only ½n cheap assignments could be made followed by ½n row interchanges. If the columns are reordered by increasing numbers of entries, the entire transversal selection can be made with cheap assignments. The asymptotic bound of the algorithm does, however, remain the same (see Exercise 6.5). Note that each column is searched only once for a cheap assignment, so reordering the rows provides little scope for improving the efficiency of the searches for cheap assignments. Another possibility is to interchange a later column with column k whenever a cheap assignment is not available. In this way, we may assign cheaply at least half the transversal of a nonsingular matrix (see Exercise 6.6). Note that our algorithm does not always get as many as ½n cheap assignments (see Exercise 6.7). There are more complex cheap heuristics than the simple greedy matching (SGM) of the previous paragraph, although their worst case complexity is still linear in matrix order and number of entries. Two that are discussed and compared by Duff et al. (2011) are the Karp-Sipser Greedy Matching (KSM) (Karp and Sipser 1981) and the minimum degree matching (MDM) (Magun 1998). KSM first looks for singletons, as we did in Section 6.3. In the simple version of their algorithm, an entry from the submatrix with no singletons is then chosen randomly. In the full version (which is seldom implemented), a row or column with two entries is used and an entry is chosen randomly when there are no singletons or doubletons. MDM treats singletons in the same way as KSM, but when there are none it always chooses a row or column with least entries. There is a two-sided variant that then chooses an entry of the row (column) that has least entries in its column (row). Duff (1981c) found that the cost of SGM, although low, did not give sufficient savings when running the depth-first search algorithm on the reduced matrix to


make the cheap pre-pass worthwhile. This was in part because it is common for the given matrix to have entries on most of its diagonal. The permutations corresponding to the cheap assignments could destroy this property. In the case of identifying singletons as we did in Section 6.3, however, no initial decrease in the original entries on the diagonal can take place (see Exercise 6.8). Duff et al. (2011) find that KSM and MDM usually find a transversal that is close to maximum and therefore recommend their use, whatever subsequent algorithm is used to complete the matching.
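Before leaving transversal selection, the following is a minimal Python sketch of the column-by-column extension algorithm of Sections 6.4.1 and 6.4.2 (a cheap assignment followed by a depth-first augmenting search). It is an illustration only and not MC21: the look-ahead device, the per-column pointers, and any cheap pre-pass such as SGM or KSM are omitted, recursion is used instead of an explicit stack, and the data layout (one list of row indices per column) is our own choice.

    def max_transversal(cols):
        """Transversal by cheap assignment plus depth-first augmenting search.
        cols[j] lists the rows with an entry in column j.
        Returns row_of_col, with row_of_col[j] = row assigned to column j
        (None if column j could not be assigned)."""
        n = len(cols)
        col_of_row = {}                        # current transversal, indexed by row

        def extend(j, visited):
            # cheap assignment: an entry in a row not yet on the transversal
            for i in cols[j]:
                if i not in col_of_row:
                    col_of_row[i] = j
                    return True
            # otherwise try to reassign the column currently using row i
            for i in cols[j]:
                if i not in visited:
                    visited.add(i)
                    if extend(col_of_row[i], visited):
                        col_of_row[i] = j
                        return True
            return False

        for k in range(n):
            extend(k, set())        # fails only if the matrix is symbolically singular

        row_of_col = [None] * n
        for i, j in col_of_row.items():
            row_of_col[j] = i
        return row_of_col

Each top-level call inspects every entry at most once in the depth-first part, so the whole loop stays within the O(nτ) bound discussed in Section 6.4.3.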

6.5 Symmetric permutations to block triangular form

6.5.1 Background

We assume that a row permutation P1 has been computed so that P1 A has entries on every position on its diagonal (unless A is symbolically singular), and we now wish to find a symmetric permutation that will put the matrix into block lower triangular form. That is, we wish to find a permutation matrix Q such that $Q^T(P_1A)Q$ has the form
\[
Q^T(P_1A)Q =
\begin{pmatrix}
B_{11} & & & & \\
B_{21} & B_{22} & & & \\
B_{31} & B_{32} & B_{33} & & \\
\vdots & \vdots & \vdots & \ddots & \\
B_{N1} & B_{N2} & B_{N3} & \cdots & B_{NN}
\end{pmatrix},
\qquad (6.5.1)
\]

where each Bii cannot itself be symmetrically permuted to block triangular form. Then, with P = QT P1 , we will have achieved the desired permutation (6.1.2). We will show (Section 6.6) that symmetric permutations are all that is required once the transversal has been found. While any row and column singletons will usually have been removed, we do not rely on this. It is convenient to describe algorithms for this process with the help of the digraphs (directed graphs) associated with the matrices (see Section 1.2). With only symmetric permutations in use, the diagonal entries play no role so we have no need for self-loops associated with diagonal entries. Applying a symmetric permutation to the matrix causes no change in the associated digraph except for the relabelling of its nodes. Thus, we need only consider relabelling the nodes of the digraph, which we find easier to explain than permuting the rows and columns of the matrix. If we cannot find a closed path (cycle), as defined in Chapter 1, through all the nodes of the digraph, then we must be able to divide the digraph into two parts such that there is no path from the first part to the second. Renumbering the first group of nodes 1, 2, ..., k and the second group k + 1, ..., n will produce a


Fig. 6.5.1. A 5 × 5 matrix and its digraph.

corresponding (permuted) matrix in block lower triangular form. An example is shown in Figure 6.5.1, where there is no connection from nodes (1, 2) to nodes (3, 4, 5). The same process may now be applied to each resulting block until no further subdivision is possible. The sets of nodes corresponding to the resulting diagonal blocks are called strong components. For each, there is a closed path passing through all its nodes but no closed path includes these and other nodes. The digraph of Figure 6.5.1 contains just two strong components, which correspond to the two irreducible diagonal blocks. A triangular matrix may be regarded as the limiting case of the block triangular form in the case when every diagonal block has size 1 × 1. Conversely, the block triangular form may be regarded as a generalization of the triangular form with strong components in the digraph corresponding to generalized nodes. This observation forms the basis for the algorithms in the next three sections. 6.5.2

The algorithm of Sargent and Westerberg

Algorithms for finding the triangular form may be built upon the observation that if A is a symmetric permutation of a triangular matrix, there must be a node in its digraph from which no path leaves. This node should be ordered first in the relabelled digraph (and the corresponding row and column of the matrix permuted to the first position). Eliminating this node and all edges pointing into it (corresponding to removing the first row and column of the permuted matrix) leaves a remaining subgraph, which again has a node from which no paths leave. Continuing in this way, we eventually permute the matrix to lower triangular form. To implement this strategy for a matrix that may be permuted to triangular form, we may start anywhere in the digraph and trace a path until we encounter a node from which no paths leave. This is easy since the digraph contains no closed paths (such a digraph is called acyclic); any available choice may be made at each node and the path can have length at most n − 1, where n is the matrix order. We number the node at the end of the path first, and remove it and all edges pointing to it from the digraph, then continue from the previous node on the path (or choose any remaining node if the path is now empty) until once again we reach a node with no path leaving it. In this way, the triangular form is identified and no edge is inspected more than once, so the algorithm is economical. We illustrate this with the digraph of Figure 6.5.2. The sequence of paths is illustrated in Figure 6.5.3.
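The path tracing just described can be written down directly. The following Python sketch (our own illustration, with 0-based node numbers and an adjacency-list input of our choosing) numbers the nodes in the order in which they are removed and, like the description above, inspects each edge only once; it assumes the digraph is acyclic.

    def triangular_order(adj):
        """Relabelling for a digraph with no closed paths (Section 6.5.2).
        adj[v] lists the nodes reached by edges leaving v.
        Returns the nodes in the order in which they should be numbered."""
        n = len(adj)
        numbered = [False] * n
        scanned = [0] * n            # how far each adjacency list has been searched
        order = []
        for start in range(n):
            if numbered[start]:
                continue
            path = [start]
            while path:
                v = path[-1]
                # skip edges that point to nodes already numbered (removed)
                while scanned[v] < len(adj[v]) and numbered[adj[v][scanned[v]]]:
                    scanned[v] += 1
                if scanned[v] == len(adj[v]):
                    numbered[v] = True   # no paths leave v: number it next
                    order.append(v)
                    path.pop()           # continue from the previous node on the path
                else:
                    w = adj[v][scanned[v]]
                    scanned[v] += 1
                    path.append(w)
        return order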


Fig. 6.5.2. A digraph corresponding to a triangular matrix.


Fig. 6.5.3. The sequence of paths used for the Figure 6.5.2 case, where nodes selected for ordering are shown in bold.

It may be verified (see Figure 6.5.4) that the relabelled digraph 3 → 1, 5 → 2, 4 → 3, 2 → 4, 1 → 5, 7 → 6, 6 → 7 corresponds to a triangular matrix. Observe that at step 5 there was no path from node 5 because node 3 had already been removed. Note also that at step 9 the path tracing was restarted because the path became empty. For large problems known to have triangular structure, this algorithm is very useful in its own right. It allows users to enter data in any convenient form and automatically finds the triangular structure. We observe that this is a graph-theoretic interpretation of the algorithm that we used in Section 6.3 for the case where the matrix may be reduced to triangular form. Sargent and Westerberg (1964) generalized this idea to the block case. They define as a composite node any group of nodes through which a closed path has been found. Starting from any node, a path is followed through the digraph until: (i) A closed path is found (identified by encountering the same node or composite node twice), or (ii) a node or composite node is encountered with no edges leaving it. In case (i), all the nodes on the closed path must belong to the same strong component and the digraph is modified by collapsing all nodes on the closed path into a single composite node. Edges within a composite node are ignored, and edges entering or leaving any node of the composite node are regarded as entering or leaving the composite node. The path is now continued from the composite node. In case (ii), as for ordinary nodes in the triangular case, the composite node is numbered next in the relabelling. It and all edges connected to it are removed, and the path now ends at the previous node or composite node, or starts from any remaining node if it would otherwise be empty.
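A Python sketch of this node-collapsing process is given below; it is our own illustration rather than any published code. Composite nodes are kept in a union-find structure, and member and edge lists are simply concatenated when a closed path is collapsed; that crude merging is exactly the kind of relabelling overhead discussed next.

    def sargent_westerberg(adj):
        """Strong components by path tracing with node collapsing.
        adj[v] lists the targets of the edges leaving node v.
        Components are returned in the order in which they are numbered,
        which gives a block lower triangular form."""
        n = len(adj)
        parent = list(range(n))                 # union-find over composite nodes

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        edges = [list(a) for a in adj]          # unscanned edges of each composite node
        members = [[v] for v in range(n)]       # constituent original nodes
        ordered = [False] * n
        blocks = []
        for start in range(n):
            if ordered[start]:
                continue
            path, pos = [start], {start: 0}
            while path:
                v = path[-1]
                target = None
                while edges[v]:                 # look for an edge leaving the composite node
                    w = find(edges[v].pop())
                    if w != v and not ordered[members[w][0]]:
                        target = w
                        break
                if target is None:              # case (ii): nothing leaves v, so number it
                    blocks.append(members[v])
                    for u in members[v]:
                        ordered[u] = True
                    del pos[v]
                    path.pop()
                elif target in pos:             # case (i): a closed path has been found
                    k = pos[target]
                    for u in path[k + 1:]:      # collapse it into one composite node
                        parent[u] = target
                        members[target] += members[u]
                        edges[target] += edges[u]
                        del pos[u]
                    del path[k + 1:]
                else:                           # otherwise extend the path
                    pos[target] = len(path)
                    path.append(target)
        return blocks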

Fig. 6.5.4. The matrices before and after renumbering.

Fig. 6.5.5. A digraph illustrating the algorithm of Sargent and Westerberg.

Thus, the blocks of the required form are obtained successively. This generalization of the triangularization algorithm shares the property that each edge of the original digraph is inspected at most once. We illustrate with the example shown in Figure 6.5.5. Starting the path at node 1, it continues 1 → 2 → 3 → 4 → 5 → 6 → 4, then (4, 5, 6) is recognized as a closed path, and nodes 4, 5, and 6 are relabelled as composite node 4′. The path is continued from this composite node to become 1 → 2 → 3 → 4′ → 7 → 3. Again, a closed path has been found and (3, 4′, 7) is labelled as 3′ and the path becomes 1 → 2 → 3′. Since there are no edges leaving 3′, it is numbered first and removed. The path is now 1 → 2 and no edges leave node 2, so this is numbered second. Finally, node 1 is numbered as the last block. The corresponding original and reordered matrices are shown in Figure 6.5.6. The difficulty with this approach is that there may be large overheads associated with the relabelling in the node collapsing step. A simple scheme such as labelling each composite node with the lowest label of its constituent nodes can result in O(n²) relabellings. For instance, in Figure 6.5.7 successive composite nodes are (4, 5), (3, 4, 5, 6), (2, 3, 4, 5, 6, 7), (1, 2, 3, 4, 5, 6, 7, 8); in general, such

Fig. 6.5.6. The matrices before and after renumbering.


Fig. 6.5.7. A case causing many relabellings.

a digraph with n nodes will involve 2 + 4 + 6 + ... + n = n²/4 + n/2 relabellings. Various authors (for example, Munro 1971a, 1971b, Tarjan 1975) have proposed schemes for simplifying this relabelling, but the alternative approach of Tarjan (1972) eliminates the difficulty. This is described in the next section. Even though the Sargent and Westerberg algorithm is not as good as Tarjan's, we introduced it for two reasons. First, it is very simple. Secondly, it was developed very early and demonstrated excellent performance on many problems. It provides an intuitive motivation for the more elegant, optimal algorithm of Tarjan.

Tarjan’s algorithm

The algorithm of Tarjan (1972) follows the same basic idea as that of Sargent and Westerberg, tracing paths and identifying strong components. It eliminates the relabelling through the clever use of a stack very like that in Figure 6.5.3, which was used to find the triangular form. The stack is built using a depth-first search and records both the current path and all the closed paths so far identified. Each strong component eventually appears as a group of nodes at the top of the stack and is then removed from it. We first illustrate this algorithm with two examples and then explain it in general.


Fig. 6.5.8. The stack corresponding to Figure 6.5.5. Figure 6.5.8 shows the stack at all steps of the algorithm for the digraph of Figure 6.5.5 starting from node 1. In the first six steps, the stack simply records the growing path 1 → 2 → 3 → 4 → 5 → 6. At step 7, we find an edge connecting the node at the top of the stack (node 6) to one lower down (node 4). This is recorded by adding a link, shown as a subscript. Since we know that there is a path up the stack, 4 → 5 → 6, this tells us that (4,5,6) lie on a closed path, but


Fig. 6.5.9. An example with two nontrivial strong components. we do not relabel the nodes. Similarly, at step 9, we record the link 7 → 3 and this indicates that (3, 4, 5, 6, 7) lie on a closed path. There are no more edges from node 7 so it is removed from the path. However it is not removed from the stack because it is part of the closed path 3 −7. We indicate this in Figure 6.5.8 by showing 7 in bold. Now node 6, which is immediately below node 7 on the stack, is at the path end so we return to it, but the only relabelling we do is to make its link point to 3 and we discard the link from 7, thereby recording the closed path 3–7. We now look for unsearched edges at node 6 and find that there are none, so we label 6 in bold, set the link from 5 to 3, and discard the link from 6 (see column 11 of Figure 6.5.8). At the next step we treat node 5 similarly and in the following loop we label 4 in bold and discard its link, but no change is made to the link at 3 which may be regarded as pointing to itself. Column 13 of Figure 6.5.8 still records that nodes 3, 4, 5, 6, and 7 lie on a closed path. Next, we find that 3 has no unsearched edges. Since it does not have a link to a node below it, there cannot be any path from it or any of the nodes above it to a node below it. We have a ‘dead-end’, as in the triangular case. The nodes 3–7 constitute a strong component and may be removed from the stack. The trivial strong components (2) and (1) follow. The algorithm also works starting from any other node. Notice that the strong component was built up gradually by relabelling stack nodes in bold once their edges have been searched and each addition demands no relabelling of nodes already in the composite node. This example was carefully chosen to be simple in the sense that the path always consists of adjacent nodes in the stack. This is not always so, and we illustrate the more general situation in the next example.


Fig. 6.5.10. The stack corresponding to Figure 6.5.9.


Consider the digraph shown in Figure 6.5.9. The stack, starting from node 1, is shown in Figure 6.5.10. At step 5, node 3 is removed from the path because it has no unsearched edges, but is not removed from the stack because of its link to node 1. Node 4 is added at step 6 because of the edge (2,4) and the path is 1 → 2 → 4. Similarly node 7 is added because an edge connects to it from node 5. At step 12, node 8 is identified as a strong component because it has no edges leading from it and it does not have a link pointing below it in the stack. At step 13, node 7 has no more edges and is recognized as a member of a strong component because of its link to node 6 and is therefore labelled in bold. The strong components (4,5,6,7) and (1,2,3) are found in steps 15 and 17. A useful exercise is to carry through this computation from another starting node (Exercises 6.9 and 6.10). As with the Sargent and Westerberg algorithm, we start with any node and if ever the stack becomes empty we restart with any node not so far used. It is convenient to give such starting nodes links to themselves. At a typical step we look at the next unsearched edge of the node at the end of the path (we will call this node the current node) and distinguish these cases: (i) The edge points to a node not on the stack, in which case this new node is added to the top of the stack and given a link that points to itself. (ii) The edge points to a node lower on the stack than the node linked from the current node, in which case the link is reset to the link of this lower node. (iii) The edge points to a node higher on the stack than the node linked from the current node, in which case no action is needed. (iv) There are no unsearched edges from the current node and the link from the current node points below it. In this case, the node is left on the stack but removed from the path. The link for the node before it on the path is reset to the lesser of its old value and that of the link for the current node. (v) There are no unsearched edges from the current node and the link from the current node does not point below it. In this case, the current node and all those above it on the stack constitute a strong component and so are removed from the stack. It is interesting (see Exercise 6.11) to see that the excessive relabelling associated with the digraph of Figure 6.5.7 is avoided in the Tarjan algorithm. A formal proof of the validity of the Tarjan algorithm can readily be developed from the following observations: (i) At every step of the algorithm, there is a path from any node on the stack to any node above it on the stack. (ii) For any contiguous group of nodes on the stack, but not on the path (nodes marked in bold), there is a closed path between these nodes and the next node on the stack below this group.


(iii) If nodes α and β are a part of the same strong component, α cannot be removed from the stack by the algorithm unless β is removed at the same time. Together, (i) and (ii) say that nodes removed from the stack at one step must be a part of the same strong component, while (iii) says that all of the strong component must be removed at the same step. 6.5.4

Implementation of Tarjan’s algorithm

To implement Tarjan’s algorithm, it is convenient to store the matrix as a collection of sparse row vectors, since following paths in the digraph corresponds to accessing entries in the rows of the matrix. It was convenient for illustration in the last section to label bold those nodes on the stack whose edges have all been searched, but for efficiency we need a rapid means of backtracking from the current node to the previous node on the path. This is better done by storing a pointer with each path node. Hence the following arrays of length n, the matrix order, are needed: (i) One holding the nodes of the stack. (ii) One holding, for each path node, a link to the lowest stack node to which a path has so far been found. (iii) One holding, for each path node, a pointer to the previous path node. (iv) One holding for each node, its position on the stack if it is there, its position in the final order if it has been ordered, or zero if neither. (v) One holding, for each path node, a record of how far the search of its edges has progressed. In addition, the starts of the blocks must be recorded. Note that each edge is referenced only once and that each of the steps (i) to (v) of Section 6.5.3 that is associated with an edge involves an amount of work that is bounded by a fixed number of operations. There will also be some O(n) costs associated with initialization and accessing the rows, so the overall cost is O(τ ) + O(n) for a matrix of order n with τ entries. Further details of the algorithm and its implementation are provided by Tarjan (1972) and Duff and Reid (1978a). Duff and Reid (1978b) also provide an implementation in Fortran; it is interesting to note that their implementation involves less than 75 executable statements. 6.6

6.6 Essential uniqueness of the block triangular form

In the last three sections, we have described two alternative algorithms for finding a block triangular form given a matrix with entries on its diagonal. There remains, however, some uncertainty as to whether the final result may depend significantly on the algorithm or the set of entries placed on the diagonal in the first stage. We show in this section that for nonsingular matrices, the result is essentially unique in the sense that any one block form can be obtained from any other by applying row permutations that involve the rows of a single block


row, column permutations that involve the columns of a single block column, and symmetric permutations that reorder the blocks (see also Duff (1977)) We begin by considering the variety of forms possible from symmetric permutations of a matrix. These correspond to no more than different relabellings of the same digraph. Since the strongly connected components of the digraph are uniquely defined, it follows that the diagonal blocks of the block triangular form are uniquely defined. The order in which these blocks appear on the block diagonal must be such that block α must precede block β for every pair of blocks α, β, which are such that the component corresponding to β is connected to the component corresponding to α. It follows that often there is scope for permuting the blocks as wholes. Also, of course, we are free to use any permutation within each block. This describes the full freedom available and what is meant by ‘essential uniqueness’. It should be noted that this result is true for symmetric permutations whether or not the diagonal consists entirely of entries. However, it can only be extended to unsymmetric permutations if the diagonal does consist entirely of entries. Otherwise, we may be able to find a block form with smaller diagonal blocks by first permuting entries onto the diagonal (see Exercise 6.12). We now suppose that B = PAQ is in the block triangular form (6.1.2) and show that any set of entries that may be permuted onto the diagonal must consist of entries from the diagonal blocks Bii , i = 1, 2, ..., N . Such a set of entries must include exactly one from each row of the matrix and exactly one from each column. Those from the n1 rows corresponding to B11 must come from the first n1 columns, that is from B11 itself. The remainder must all come from the last n − n1 rows and columns. The submatrix of the last n − n1 rows and columns has the same triangular form, so we can use the same argument to show that n2 of the entries must come from B22 . Continuing, we find that the entries all come from diagonal blocks. It follows that column permutations within diagonal blocks can be used to place the new set of entries on the diagonal. Such permutations make no change to the block structure. We may deduce that, if A has entries on its diagonal and PAQ = B has the block form (6.1.2), then there is a symmetric permutation P1 APT1 having the same form, because we can find a column permutation BQ1 of B that preserves its form, while placing the diagonal entries of A on the diagonal of BQ1 . Therefore, BQ1 is a symmetric permutation of A, say BQ1 = P1 APT1 . We have therefore established that once entries have been placed on the diagonal there is no loss of generality in considering symmetric permutations only. This establishes the essential uniqueness, as defined at the end of the second paragraph of this section. A simple illustration of the essential uniqueness is provided (see Exercise 6.13) by looking first for column singletons in the matrix of Figure 6.3.1. This gives a trailing block B33 of size five and a leading block B11 of size zero. While


this is not the same, it is ‘essentially’ the same since all that has happened is that two blocks of size 1 have moved.

6.7 Experience with block triangular forms

Having described an efficient algorithm for finding the reducible form of a matrix, we now discuss experience with it on practical problems. Since this capability has been incorporated as a user option in the HSL general unsymmetric code MA48, we make reference to results using this code in describing timing and performance. The block triangular form is an unsymmetric form and so it would be expected to be less useful as a preordering on a symmetric matrix. Indeed, one might conjecture that the benefit would be greater for matrices that are highly unsymmetric. In the symmetric case, the analogue is a block diagonal form so that the matrix splits into independent blocks. This decomposition is less common in practice, in part because the underlying problem is likely to already have been reduced to independent subproblems. The block triangular form of a symmetric matrix that has zeros on its diagonal will be unsymmetric. However, Duff and Uçar (2010) have shown how symmetry can be recovered. The rows of each block row and the columns of each block column are kept together and the resulting matrix has some zero blocks on its diagonal. Erisman et al. (1987) collected unsymmetric matrices from chemical engineering, linear programming, simulation, partial differential equation grids, and other applications to evaluate the effectiveness of block triangularization on real problems. Their results are summarized in Table 6.7.1. They quantified the amount of reducibility, and average figures for these results are summarized in the table. These results tend to support the conjecture that very unsymmetric matrices are those that are likely to benefit most and indicate that some applications yield more reducible matrices than others.

Table 6.7.1 Benefits from block triangularization.

Matrix origin                       Chemical      Linear        Simulation   PDE grids   Miscellaneous
                                    engineering   programming
Number of matrices                  16            16            11           7           6
Average symmetry index              0.05          0.01          0.46         0.99        0.54
Reducibility
  None                              0             0             7            5           3
  Some                              2             0             4            0           2
  Much                              14            16            0            2           1
Average number of diagonal blocks   6.9           21.8          1.4          1.7         2.2


Table 6.7.2 Times in seconds of the ANALYSE and FACTORIZE phases of HSL MA48, with and without block triangularization (BTF) on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel compiler with O2 optimization.

Matrix                                shyy161   twotone   cage9    neos2
Order (×10³)           Matrix          76.48    120.75     3.53   134.13
                       Max. block      25.44    105.74     3.53   132.57
Entries (×10³)         Matrix         329.8    1 224.2    41.6    685.1
                       All blocks     228.0      800.8    41.6    685.1
ANALYSE time (secs)    BTF              6.96     11.62     0.77     0.20
                       No BTF          10.31     51.68     0.64     0.18
FACTORIZE time (secs)  BTF              0.98      0.28     0.054    0.030
                       No BTF           1.17      2.21     0.051    0.028

In Table 6.7.2, we give timings for runs of HSL MA48 with and without block triangularization on four of our test matrices, representative of our experience. We have added the time for block triangularization to the ANALYSE times. We indicate the effectiveness of block triangularization by showing the order of the largest block and the total number of nonzeros in all the non-trivial blocks on the diagonal (order greater than one). Block triangularization often decreases both ANALYSE and FACTORIZE times slightly; shyy161 illustrates this behaviour. Occasionally, it decreases the times greatly as is illustrated by twotone. If the matrix is irreducible, there is an overhead in ANALYSE for no benefit, but this overhead is not great. An example of this is cage9; here, there is a slight difference in FACTORIZE times, because a different pivot sequence is taken. Another case where there is no benefit is when the matrix is block diagonal, as illustrated by neos2. In general, block triangularization is worthwhile and its use is the default in HSL MA48, but a user may find it worthwhile to check that this is the case for the matrices involved in his or her application.

6.8 Maximum transversals

Throughout this chapter, we have been concerned with developing efficient algorithms for block triangularization, so that subsequent Gaussian elimination operations can be confined to the blocks on the diagonal and the original problem can be solved as a sequence of subproblems. The algorithms of this chapter are, however, of interest in other contexts and, in particular, the transversal selection problem has a long and varied history.


Transversal selection corresponds to the assignment or matching problem, which has been of interest to management scientists for many years. They are concerned with allocating people to tasks, activities to resources, etc. This area has given rise to many algorithms, most being based on variants of breadthor depth-first search. A selection of these algorithms is discussed by Duff et al. (2011). Other disciplines for which transversal selection is important include graph theory, games theory, and nonlinear systems. A different language is used in each area and a brief summary of the terms involved is given by Duff (1981c). In this chapter, our presentation is primarily given in terms of matrices and the nomenclature is also taken from the matrix literature. However, this area is a rich one in combinatorics and is usually described in terms of graphs. Thus, transversal selection is equivalent to choosing a matching in a bipartite graph, which is a set of edges of the graph no two of which are adjacent to the same vertex. Transversal extension in the matrix is equivalent to finding an augmenting path in a bipartite graph. An augmenting path is an alternating path (that is a path whose edges alternate between edges in the current matching and not in it) whose endpoints are both unmatched. Then, the matching (transversal) is extended by flipping the edges in the augmenting path so that matched edges become unmatched and vice versa. An important result governing this was given by Berge (1957). Let M be a matching in a bipartite graph G. M is of maximum cardinality if and only if there is no augmenting path for M in G. In the previous paragraph, we showed that M is not of maximum cardinality if there is an augmenting path for it in G. For the converse, suppose there is a matching M 0 of greater cardinality. Consider the set of edges T that are in only one of M and M 0 (called the symmetric difference of M and M 0 ). A path in T must alternate between edges in M and edges in M 0 . Furthermore, as the cardinality of M 0 is greater than M , the set T must contain a path with more edges in M 0 than in M . This is an augmenting path and so this part of Berge’s result is proved. Note the importance of this result for deciding whether a maximum transversal has been found. Often it is important to obtain a maximum transversal (or maximum assignment) which, in matrix terms, is equivalent to determining a permutation that places the maximum number of entries on the diagonal. We now show that the algorithm we have described achieves this if we continue searching the columns after encountering one from which no assignment is possible. Suppose that eventually l assignments are made. No further assignment can be made by adding a rejected column because if this were possible, part of the resulting transversal would provide a set of transversal entries for the rejected column and those columns that had been accepted at the time of its rejection. Because the rank of A will be l if the transversal entries have the value one and all other entries have the value zero, we have thus shown that l is equal to the maximum rank that the matrix can have over all choices of values for the entries (the symbolic rank).
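The equivalence between the maximum transversal length and the symbolic rank can be checked directly on very small patterns by brute force. The following purely illustrative Python fragment (the example pattern is our own, not one from the text) enumerates all permutations:

    from itertools import permutations

    def symbolic_rank_bruteforce(pattern, n):
        """Largest number of entries that any permutation can place on the
        diagonal of an n x n pattern, i.e. the symbolic (structural) rank.
        Exponential in n, so for tiny checks only."""
        return max(sum((i, perm[i]) in pattern for i in range(n))
                   for perm in permutations(range(n)))

    # Two of the three columns have entries only in the first row, so at most
    # two entries can be brought onto the diagonal of this 3 x 3 pattern.
    pattern = {(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)}
    assert symbolic_rank_bruteforce(pattern, 3) == 2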

6.9 Weighted matchings

In all of our previous discussion on maximum transversals, we did not consider the numerical value of the entries selected as transversal entries. That is to say that all the graphs we were using were unweighted. It is possible to extend this work by looking at the case of bipartite weighted matchings whereupon the transversal can have additional properties. In Section 4.15.3, we considered I-matrix scaling, that is, scaling and permuting to a matrix whose diagonal entries are all equal to one in absolute value and whose off-diagonal entries are all less than or equal to one in absolute value. In this section, we will show that this can be achieved by finding a permutation P such that the product
\[
\prod_{i=1}^{n} |(PA)_{ii}|
\qquad (6.9.1)
\]
is maximized. We will begin with the assumption that the matrix has full structural rank. Duff and Koster (1999, 2001) give an efficient algorithm for computing such a permutation for a sparse matrix and also give examples of its power as a precursor to both direct and iterative methods of solution. They have implemented their algorithm in the HSL package MC64. Their work is based on the Hungarian algorithm (Kuhn 1955) and the shortest path algorithm of Dijkstra (1959). They explain their algorithm in graphical terms. We will work directly with the matrix and prove the validity of the algorithm.

Duff and Koster work with the sparse matrix C that has the same structure as A and entries
\[
c_{ij} = \log |a_{ij}|.
\qquad (6.9.2)
\]
Maximizing the product (6.9.1) is equivalent to maximizing the sum
\[
\sum_{i=1}^{n} (PC)_{ii}.
\qquad (6.9.3)
\]
It is also equivalent to minimizing $\sum_{i=1}^{n} (PB)_{ii}$ for the matrix B with entries
\[
b_{ij} = u_i - c_{ij} + v_j,
\qquad (6.9.4)
\]
because
\[
\sum_{i} (PB)_{ii} = \sum_{i} u_i - \sum_{i} (PC)_{ii} + \sum_{j} v_j
\qquad (6.9.5)
\]
for all permutations P. It follows that the choice of P is independent of u and v. Duff and Koster always work with $v_j = -\min_i (u_i - c_{ij})$ so that no entries of B are negative and every column has at least one zero. If u and v can be chosen so that B has a transversal of zero entries and all other entries non-negative, an optimal solution will have been found since any


other transversal cannot have entries that sum to less than zero. Furthermore, this B corresponds to the scaled matrix diag(e^{u_i}) A diag(e^{v_j}), which is an I-matrix.

The algorithm begins with the following crash procedure. The vector u is chosen as 0 and v is chosen to have components $v_j = \max_i c_{ij}$ so that all the entries of B are non-negative and every column has a zero. If row i is without a zero entry, $u_i$ is changed to $-\min_j b_{ij}$, which provides this row with a zero without making any other entry negative. Transversals of zeros are now sought using the algorithm of Section 6.4, except that Duff and Koster chose to limit the search to paths of length two. When this is completed, we will have found a set of k columns of B, say BQ, and a permutation P such that the matrix PBQ has diagonal entries zero and no negative entries. We have an optimal solution for the first k columns of B. Sometimes in practice, k has the value n and no further work is needed. If k < n, an algorithm similar to that of Section 6.4 extends the solution one column at a time, maintaining optimality for the columns involved. A complication is that a best transversal has to be found, rather than any transversal.

Given an optimal solution for k columns and corresponding vectors u and v, we append another column of B and seek an optimal solution for this matrix. We do this by performing a structured search of the alternating paths that start in column k + 1 and end in a row beyond k. The length of such a path is defined as the sum of the values of $b_{ij}$ along the path, noting that the diagonal entries passed are all zero. Let the shortest such path be in rows $i_1, i_2, \ldots, i_\ell$. For simplicity of explanation, let us symmetrically permute the matrix so that this path is in rows $1, 2, \ldots, \ell$. The leading $\ell \times \ell$ submatrix contains the alternating path with nonzero entries in positions $(1, \ell)$ and $(i, i-1)$, $i = 2, 3, \ldots, \ell$, say $d_1, d_2, \ldots, d_\ell$. For example, if $\ell = 4$, the entries of the alternating path are
\[
\begin{pmatrix}
0   &     &     & d_1 \\
d_2 & 0   &     &     \\
    & d_3 & 0   &     \\
    &     & d_4 & 0
\end{pmatrix}.
\]
If we permute the columns so that column $\ell$ becomes column 1 and column m becomes column m + 1 for $m = 1, 2, \ldots, \ell-1$, we have diagonal entries $d_i$, $i = 1, 2, \ldots, \ell$ and zeros in positions $(i, i+1)$, $i = 1, 2, \ldots, \ell-1$, for example
\[
\begin{pmatrix}
d_1 & 0   &     &     \\
    & d_2 & 0   &     \\
    &     & d_3 & 0   \\
    &     &     & d_4
\end{pmatrix}.
\]
If we add $\sum_{i=m+1}^{\ell} d_i$ to $u_m$ for $m = 1, 2, \ldots, \ell-1$, the path entries in columns $2, 3, \ldots, \ell$ become the same, for example
\[
\begin{pmatrix}
d_1{+}d_2{+}d_3{+}d_4 & d_2{+}d_3{+}d_4 &           &     \\
                      & d_2{+}d_3{+}d_4 & d_3{+}d_4 &     \\
                      &                 & d_3{+}d_4 & d_4 \\
                      &                 &           & d_4
\end{pmatrix}.
\]
Subtracting $\sum_{i=m}^{\ell} d_i$ from $v_m$ for $m = 1, 2, \ldots, \ell$ makes all the path entries zero, for example
\[
\begin{pmatrix}
0 & 0 &   &   \\
  & 0 & 0 &   \\
  &   & 0 & 0 \\
  &   &   & 0
\end{pmatrix}.
\]

We now show that these changes to u and v cannot make a negative entry in these columns. First, an entry above the superdiagonal, that is, $b_{ij}$ with j > i + 1, has $\sum_{m=i+1}^{j-1} d_m$ added to it. It therefore remains non-negative. Secondly, suppose $b_{ij} < 0$ with j = 1 or i > ℓ. If we exchange rows i and j, we get a matrix whose diagonal entries have a negative sum. They correspond to an alternating path in the matrix at the start of the stage that is shorter than the chosen path, which cannot be the case. Finally, if j > 1 and j < i ≤ ℓ, the permutation such that row i becomes row j and, for m = j, j + 1, ..., i − 1, row m becomes row m + 1 creates a matrix whose diagonal entries have a negative sum. This cannot happen because the previous steps involved optimization over columns i to j.

The practical difficulty with this algorithm lies in identifying the augmenting path that has shortest length. There may be many thousands of augmenting paths. Instead of the depth-first search that was used in Section 6.4.2, a selective search is needed. Once an augmenting path has been found, any path of greater length can be rejected as a possibility for the starting part of a best augmenting path. For each row reached by a path, a record is kept of the shortest such path and its length. The shortest non-augmenting path is extended by scanning the corresponding column. Any new path that is longer than an already known path to the same row or longer than the best augmenting path so far found is rejected. The search can be terminated if the shortest non-augmenting path is no shorter than the best augmenting path found. To avoid a lengthy search for the shortest path, the path lengths are held in a heap.

We have run mc64 on 47 test matrices, all of order greater than 3 000 and 21 of order greater than 100 000. In three cases, the crash algorithm found the solution and in 10 cases, the further work was trivial. At the other extreme, the further work was very substantial in five cases. Occasionally, almost all the columns that had already been selected were searched in the course of extending the solution to one more column. Because of the danger of this happening, we do not recommend the use of this algorithm on a matrix whose entries are all of a similar size or become so after simple row and column scaling. In Table 6.9.1, we summarize the results on 5 matrices, chosen to illustrate the range of performance. We also tried finding a transversal without regard to optimality with the algorithm of Section 6.4.1 by running mc21 on the structure of the matrix. This was mostly faster than mc64, but was occasionally much slower.
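To make the dual-variable bookkeeping concrete, here is a small Python sketch of the crash procedure described above (and only of that first phase). The sparse-matrix representation (a dictionary of entries), the function name, and the assumption that every row and column has at least one entry are our own choices; this is not the MC64 interface.

    import math

    def crash_duals(A):
        """Initial duals for the weighted matching: u = 0 and v_j = max_i c_ij
        with c_ij = log|a_ij|, then adjust u_i for any row without a zero of B.
        A is a dict {(i, j): value} of the entries of a sparse matrix.
        Returns u, v and the non-negative matrix B of (6.9.4)."""
        rows = {i for i, _ in A}
        cols = {j for _, j in A}
        c = {(i, j): math.log(abs(val)) for (i, j), val in A.items()}
        u = {i: 0.0 for i in rows}
        v = {j: max(c[i, j] for i in rows if (i, j) in c) for j in cols}
        b = {(i, j): u[i] - c[i, j] + v[j] for (i, j) in c}   # >= 0, zero in every column
        for i in rows:
            row_min = min(b[i, j] for j in cols if (i, j) in b)
            if row_min > 0.0:                  # row i has no zero entry yet
                u[i] = -row_min
                for j in cols:
                    if (i, j) in b:
                        b[i, j] = u[i] - c[i, j] + v[j]
        return u, v, b

If a full transversal of zeros of the returned B can be found, as in Section 6.4, the scaling diag(e^{u_i}) A diag(e^{v_j}) together with that permutation gives an I-matrix; otherwise the augmenting stage described above takes over.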


Table 6.9.1 The variation in performance of the weighted matching algorithm. All counts are in thousands except in the final column.

Matrix      Order   Cols after   Max. queue   Av. queue   Max. cols   Total scanned
                    crash        size         size        scanned     cols ×10⁶
ncvxqp3      75.0   16.810       30.650       0.182       74.993      27.441
shyy161      76.5    0.804        0.445       0.098       12.770       2.725
x104        108.4    6.995        0.696       0.043        0.664       0.180
rajat21     411.7    0.401        3.492       0.072        2.218       0.012
transient   178.9    0.009        0.018       0.007        0.004       0.000

For example, it was more than 25 times slower on the matrix x104. It seems that numerical values can allow unrewarding paths to be avoided. In the rank deficient case, if no transversal is found after a full search from a column, the column can be set aside because it will have been identified as structurally dependent on the submatrix whose optimal transversal has been found. We will end with an optimal transversal on a submatrix. Other criteria can be expressed as a weighted bipartite matching algorithm, such as maximizing the sum of the sizes of the diagonal entries (Exercise 6.14) or maximizing the smallest diagonal entry (Exercise 6.15). There are entries to MC64 to generate these matchings also. 6.10 The Dulmage–Mendelsohn decomposition The concept of a block triangular form can be extended to the case of singular or rectangular systems using the work of Dulmage and Mendelsohn (1959). It is helpful to employ the bipartite graph of the matrix, which was introduced in Section 1.2. It has a node for each row, a node for each column, and an edge between row node i and column node j for each matrix entry aij . Suppose the matrix has a maximal transversal of length l. This corresponds to a one-one matching in the bipartite graph between l row nodes and l column nodes. This can be used to construct what Pothen and Fan (1990) call the coarse decomposition. An example of a matrix that has been permuted to this form is shown in Figure 6.10.1. To get to this form, we start by placing any unmatched rows as leading. These are rows 1–3 in Figure 6.10.1. Next we look for the columns that can be reached from any of these rows via an alternating path. These n1 columns are placed as leading. The first block of the block triangular form contains these columns and the rows that are unmatched or have a transversal entry in one of these columns. The rows of the block can have no entry in a column outside the block because such an entry would have led to an alternating path from an unmatched row and so the column would by definition be in the first block. Similarly, any unmatched columns are placed as trailing and the m1 rows that can be reached from any of these columns via an alternating path are placed as trailing. The final block of the block triangular form contains these

134

Fig. 6.10.1. Example of the Dulmage and Mendelsohn decomposition. The transversal entries are shown as asterisks. (The column blocks have sizes n1, l − m1 − n1, and m1 + n − l; the row blocks have sizes m − l + n1, l − m1 − n1, and m1.)

Fig. 6.10.2. The reordered leading block of Figure 6.10.1. rows and the columns that are unmatched or have a transversal entry in one of these rows. The middle block contains all the rows and columns not in the first or final block. Since it contains no unmatched rows and no unmatched columns, it is square. We have permuted the matrix to a block triangular form, with the first block having at least as many rows as columns and the final block having having at least as many columns as rows. Dulmage and Mendelsohn (1959) established that this form is unique. To see this, consider another maximum transversal; this can have at most n1 entries in the first set of columns and at most m1 entries in the final set of rows, so must have a full set of entries in the middle set of rows and columns. This implies that the transversal must have exactly m1 entries in the first set of columns and exactly n1 entries in the final set of rows. The block structure is therefore unique. The coarse decomposition may be further subdivided by applying block triangularization to the square central block of order l −m1 −n1 . In our example, this is already in block triangular form. Also the leading block may be subdivided if its bipartite graph consists of a collection of parts that are bipartite graphs with no connections between each other. In this case, grouping the rows and columns of each part gives a block diagonal form. For our example, the groups are [column 1 with rows 1 and 4] and [column 2 with rows 2, 3 and 5]; thus, the columns are already ordered, but the rows need to be put in the order [1,4,2,3,5] to yield the form shown in Figure 6.10.2. Similar considerations apply to the final block. For our example, no subdivision is possible here.
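The coarse decomposition just described can be computed from any maximum matching by two alternating-path sweeps, one from the unmatched rows and one from the unmatched columns. The following Python sketch (our own layout and names, not a library routine) returns the row and column sets of the leading and trailing blocks; the middle block is simply everything that remains.

    from collections import deque

    def coarse_dm(cols, m, row_match, col_match):
        """Coarse Dulmage-Mendelsohn sets for an m-row matrix.
        cols[j] lists the rows with an entry in column j;
        row_match[i] is the column matched to row i (None if unmatched);
        col_match[j] is the row matched to column j (None if unmatched).
        Returns (HR, HC, VR, VC): the rows/columns of the leading block and of
        the trailing block."""
        n = len(cols)
        rows_adj = [[] for _ in range(m)]
        for j in range(n):
            for i in cols[j]:
                rows_adj[i].append(j)

        # leading block: alternating paths that start at unmatched rows
        HR = {i for i in range(m) if row_match[i] is None}
        HC = set()
        queue = deque(HR)
        while queue:
            i = queue.popleft()
            for j in rows_adj[i]:            # free edge row -> column
                if j not in HC:
                    HC.add(j)
                    i2 = col_match[j]        # matched edge column -> row
                    if i2 is not None and i2 not in HR:
                        HR.add(i2)
                        queue.append(i2)

        # trailing block: alternating paths that start at unmatched columns
        VC = {j for j in range(n) if col_match[j] is None}
        VR = set()
        queue = deque(VC)
        while queue:
            j = queue.popleft()
            for i in cols[j]:                # free edge column -> row
                if i not in VR:
                    VR.add(i)
                    j2 = row_match[i]        # matched edge row -> column
                    if j2 is not None and j2 not in VC:
                        VC.add(j2)
                        queue.append(j2)
        return HR, HC, VR, VC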


Duff and U¸car (2010) show that if the matrix is structurally symmetric but symbolically rank deficient, the ‘head’ and ‘tail’ blocks are the transpose of each other. A useful property of this decomposition is that we can always find a set R of r rows and a set C of c columns with r + c = l that cover all the entries (in the sense that each entry lies in a row of R or a column of C or both). We must include the first set of columns of the decomposition in C and the final set of rows in R. We may take the middle set of columns to be in C or the middle set of rows to be in R. It is clear that all the entries are then covered. This is the least number of rows and columns that can do this since there is a transversal of length l. This discussion effectively provides a proof of the theorem of K¨onig (1931). Exercises 6.1 Given a block lower triangular matrix A, find permutations P, Q such that PAQ is block upper triangular. 6.2 Consider creating the block triangular form shown in equation (6.3.2). Show that if B11 is formed by successively looking for row singletons, forming B33 afterwards by looking for column singletons does not lead to additional row singletons. 6.3 Show that the matrix 

×

××



 ××     ××    ×  × ×     ×    × × × × × is symbolically singular by attempting to extend the transversal and identifying columns which must be linearly dependent. 6.4 Show that the transversal algorithm, as described in Section 6.4.2, requires O(nτ ) = O(n3 ) operations when applied without prior column ordering to the matrix 

 F0I I I I  0 I 0 where each block is of size 13 n × 13 n and F is dense. What is the effect of prior column ordering by numbers of entries? 6.5 Construct a matrix whose columns are in order of increasing number of entries and for which the transversal algorithm needs O(n3 ) operations. 6.6 Consider the transversal extension algorithm that uses only cheap assignments and switches a later column with column k whenever a cheap assignment is not possible in column k. Show that this obtains at least 12 n cheap assignments for a nonsingular matrix of order n.


6.7 By considering the example 

 ××× × × × show that when the transversal algorithm of Sections 6.4.1 to 6.4.4 is applied to a nonsingular matrix of order n, it may obtain less than 12 n assignments cheaply. 6.8 Show that permuting the matrix to place singletons on the diagonal, as in Section 6.3, cannot reduce the number of entries on the diagonal. 6.9 Apply the Tarjan algorithm to the digraph in Figure 6.5.5 starting from node 7. 6.10 Apply the Tarjan algorithm to the digraph in Figure 6.5.9 starting from node 7. 6.11 Apply the Tarjan algorithm to the digraph of Figure 6.5.7 starting at node 1 and comment on the work. 6.12 Illustrate that, if the diagonal of A does not consist entirely of entries, a finer block triangular structure may be found by using unsymmetric permutations than is possible by using symmetric permutations. 6.13 What block triangular form is obtained by looking first for column singletons in the matrix of Figure 6.3.1? 6.14 By changing the matrix C of (6.9.2), explain how the ideas in Section 6.9 can be employed to maximize the sum of the sizes of diagonal entries. 6.15 By changing the matrix C of (6.9.2), the objective function (6.9.3), and the matrix B of (6.9.4), explain how the ideas in Section 6.9 can be employed to maximize the smallest diagonal entry.

Research exercise R6.1 We have placed this chapter before those describing general ordering methods because whenever a matrix can be reduced to block triangular form, orderings need to be applied only to the diagonal blocks. Much of the work for transforming a matrix to block triangular form arose from control systems or linear programming applications, where these methods are widely used. In Section 6.7, we conjectured that benefit seems to come primarily from very unsymmetric matrices, common in these applications. Using the Florida collection of test cases, argue for or against the conjecture that very unsymmetric cases are more likely to benefit from finding the block triangular form. Also, identify what application domains most frequently benefit from the block triangular form.

7 LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

We consider local strategies for pivot selection, that is, where decisions are made at each stage of the factorization without regard to how they might affect later stages. They include minimum degree for symmetric matrices, Markowitz’ strategy for unsymmetric matrices, minimum fill-in, and simpler strategies. We discuss the inclusion of numerical pivoting. We consider taking advantage of sparsity in the right-hand side and of computing only part of the solution.

7.1 Introduction

In this chapter, we focus on local ordering methods for sparse matrices. We introduced this idea by an example in Chapter 1 (Section 1.3), and discussed it briefly in Chapter 5 (Section 5.3.2), where we introduced the Markowitz algorithm. Local orderings are heuristic. It is difficult to define what we mean by best and, given a definition, it is difficult to find such an ordering. To put this in perspective, suppose we use the objective of finding the ordering that introduces the fewest new entries in the LU factorization. This will be referred to as the minimum fill-in objective. Even if we start with entries on the diagonal and restrict the pivot choice to the diagonal, this problem has been shown by Rose and Tarjan (1978) in the unsymmetric case and by Yannakakis (1981) in the symmetric case to be ‘NP complete’. This means (see Karp 1986) that the problem is very hard (like the travelling salesman problem), and only heuristic algorithms are computationally feasible for large problems. Going back to defining a best ordering, we see that a solution to the minimum fill-in problem need not be best for a number of reasons. It may be very costly to compute, so that the total cost of allowing more fill-in but computing the ordering more economically may be less. It may be numerically unstable. The ordering may require a very sophisticated data structure and complicated programming, the cost of which could override any computational savings. It may permit less exploitation of modern architectures (see Section 1.4) than some ordering allowing more fill-in. Sometimes factorizing the matrix is not the dominant cost of solving the overall problem. One example of this is when the user wants to use the same matrix with many right-hand sides. These are just a few of the factors that demonstrate that ‘best’ cannot be defined absolutely. Direct Methods for Sparse Matrices, second edition. I. S. Duff, A. M. Erisman, and J. K. Reid. c Oxford University Press 2017. Published 2017 by Oxford University Press.

7.2 The Markowitz criterion

The ordering strategy of Markowitz (1957) has proved extremely successful for general-purpose use when the order is not huge. We consider the Markowitz ordering in more detail here.

Suppose Gaussian elimination applied to an n × n matrix has proceeded through the first k − 1 stages. For each row i in the active (n − k + 1) × (n − k + 1) submatrix, let $r_i^{(k)}$ denote the number of entries. Similarly, let $c_j^{(k)}$ be the number of entries in column j. Then, the Markowitz criterion is to select the entry $a_{ij}^{(k)}$ that minimizes the expression
\[
(r_i^{(k)} - 1)(c_j^{(k)} - 1)
\qquad (7.2.1)
\]
from the entries of the active (n − k + 1) × (n − k + 1) submatrix that are not too small numerically (see Sections 5.2 and 7.8).

An advantage of using (7.2.1) rather than $r_i^{(k)} c_j^{(k)}$ is that this forces the algorithm to select a row singleton ($r_i^{(k)} = 1$) or a column singleton ($c_j^{(k)} = 1$) if either is present. Such a choice produces no fill-in at all. Markowitz interpreted this strategy as finding the pivot that, after the first k − 1 pivots have been chosen, modifies the least coefficients in the remaining submatrix. It may also be regarded as an approximation to the local minimum multiplication count, since using $a_{ij}^{(k)}$ as pivot requires $r_i^{(k)}(c_j^{(k)} - 1)$ multiplications. Finally, we may think of this as an approximation to the choice of pivot that introduces the least fill-in at this stage, since that would be the case if all modifications (7.2.1) caused fill-in.

To implement the Markowitz strategy generally requires knowledge of the updated sparsity pattern of the reduced (n − k + 1) × (n − k + 1) submatrix at each stage of elimination. It requires access to the rows and columns, since the positions of nonzero entries in the pivotal column are needed to determine which rows change. It requires the numerical values to judge the acceptability of the pivot size. If the search is not cleverly limited (see Section 10.2, for example), it could require examining all the entries in the active submatrix at every stage.

It should be observed that the Markowitz count (7.2.1) may be greater than
\[
\min_t (r_t^{(k)} - 1)\; \min_t (c_t^{(k)} - 1),
\qquad (7.2.2)
\]
for there may be no entries, or no acceptable entries, in the intersection between rows with minimum $r_i^{(k)}$ and columns with minimum $c_j^{(k)}$ (see Exercise 7.1).
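As a sketch of what one selection step looks like, the following Python fragment scans the active submatrix for the entry of smallest Markowitz count. The numerical acceptability test is deliberately abstracted into a callback, since its precise form (threshold pivoting, Sections 5.2 and 7.8) is not fixed here, and the data layout is our own choice.

    def markowitz_pivot(active_rows, row_count, col_count, is_acceptable):
        """One step of Markowitz selection.
        active_rows[i] is the set of columns with an entry in row i of the
        active submatrix; row_count[i] and col_count[j] hold r_i^(k), c_j^(k);
        is_acceptable(i, j) encapsulates the numerical threshold test.
        Returns the chosen (i, j), or None if no entry is acceptable."""
        best, best_cost = None, None
        for i, cols in active_rows.items():
            for j in cols:
                if not is_acceptable(i, j):
                    continue
                cost = (row_count[i] - 1) * (col_count[j] - 1)
                if best_cost is None or cost < best_cost:
                    best, best_cost = (i, j), cost
        return best

A practical code would not scan every entry of the active submatrix in this way; Section 10.2 discusses how the search is limited.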

7.3 Minimum degree (Tinney scheme 2)

Before examining some related local algorithms and comparing them with the Markowitz strategy, we look at the case where the pattern of A is symmetric and we can be sure that diagonal pivots produce a stable factorization (the most important example is when A is symmetric and positive definite). Two things


simplify. We do not have to carry numerical values to check for stability, and the search for the pivot is simplified to finding i such that
\[
r_i^{(k)} = \min_t r_t^{(k)},
\qquad (7.3.1)
\]
leading to $a_{ii}^{(k)}$ as pivot. This special case was not considered by Markowitz (1957). It was introduced by Tinney and Walker (1967) as their scheme 2. It is called the minimum degree algorithm because of its graph theoretic interpretation: in the graph associated with a symmetric sparse matrix, this strategy corresponds to choosing the node for the next elimination that has the least edges connected to it. Notice that the diagonal entry of minimum degree will always have Markowitz count equal to the minimum for any diagonal or off-diagonal entry. This property is maintained by choosing pivots from the diagonal, since symmetry is preserved.

1

2

4

3

7

6

5

9

10

11

8

Fig. 7.3.1. A graph that is a tree. The minimum degree strategy has been very successful for problems that are not huge. It is easy to verify that for a graph that is a tree (a graph with no cycles, see the Figure 7.3.1 example), it introduces no fill-in (Exercise 7.2). In this case, it is certainly best in the sense of least overall fill-in, but in other cases it need not produce an ordering that minimizes the amount of fill-in introduced. We demonstrate this by the example shown in Figure 7.3.2. The minimum degree

7

2 4

1

3

5

6

8

9

× × × ×

× × × ×

× × × ×

× × × ×× ××× ×× × × ×

× × × ×

Fig. 7.3.2. Minimum degree not optimal.

× × × ×

× × × ×

140

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

algorithm selects node 5 first, introducing fill-in between nodes 4 and 6. The given order produces no fill-in. Since minimum degree is a Markowitz ordering applied to a symmetric problem, we conclude that the Markowitz ordering need not be optimal in the sense of least fill-in. A rather surprising result about this strategy is that it may be implemented without the explicit updating of the sparsity pattern at each stage, see Section 11.2. The result is that the symmetric version of the Markowitz strategy runs significantly faster than the unsymmetric version. It depends on storing the degrees and updating the few that change with each pivot step. Gilbert, Moler, and Schreiber (1992) suggested computing only an approximate degree and implemented this within their sparse MATLAB code. To avoid the high cost of the degree updates, Amestoy, Davis, and Duff (1996a) have implemented an approximate minimum degree (AMD) algorithm that has a better approximation but still at low cost. Rather surprisingly, this often produces a better ordering than minimum degree. Because these are heuristic algorithms, it does not follow that relaxing a local objective will cause the algorithm to perform more poorly. Code for AMD is available as the HSL code MC47 and Davis has written MATLAB code for this algorithm. We describe it in Section 11.3. Because of their speed, the minimum degree and AMD algorithms are often applied to unsymmetric systems whose pattern is symmetric or nearly so (see Section 13.2).

7.4

A priori column ordering

When numerical pivoting (Section 4.4) is needed, a disadvantage of the Markowitz strategy is that the numerical values of the entries of the reduced matrix are needed for each pivot choice. Many codes fix the order of the columns and then use partial or threshold pivoting to choose the pivot from within the pivot column. This may be unsatisfactory from a sparsity point of view if a row with entries in the pivot column is not sparse. George and Ng (1985) bound the sparsity for Gaussian elimination with row pivoting on a sparse matrix A by considering the patterns when Gaussian elimination without pivoting is applied to the normal matrix N = AT A. First consider a particular pivot order and suppose that rows of the matrix A have been permuted so that A has the pivots on its diagonal. We will show that the pattern of each pivot row during the reduction of A is contained in the pattern of the corresponding pivot row during the reduction of N. Note that N can be Pexpressed as the sum of the outer products of the rows of A, that is, N = k ak: ak: T . Now P consider the first pivot step of the reduction of A. The first row n1: T of N is k ak1 ak: T , so its pattern is the union of the patterns of the active rows (rows of A with an entry in column 1). This pattern must include the pattern of the first pivot row. Furthermore, it includes the pattern that any of the active rows have after the elimination operations have been applied. Therefore,

A PRIORI COLUMN ORDERING

141

the contribution of each active row of the (n−1)×(n−1) reduced matrix1 A(2) to T its normal matrix A(2) A(2) is contained in the contribution from nT1: . It follows T that the pattern of A(2) A(2) is contained in the pattern of the reduced matrix (2) N of N. Therefore, we can apply to the reduced matrices the argument that we applied to the original matrices. By doing this at each successive pivot step, we conclude that Gaussian elimination applied to N = AT A provides bounds for the patterns of each pivot row when Gaussian elimination is applied to A. Applying a row permutation P to A does not affect N since (PA)T PA = AT A. It follows that the result of the previous paragraph applies to any row pivot order used for the reduction of A. Gilbert and Ng (1993) have shown that if the matrix is strong Hall, that is, if every set of k < n columns contains at least k + 1 nonzero rows, the bound is tight in the sense that for any entry of the factorization of N there is a pivot sequence that results in a entry in the corresponding position in the factorization of A. We can also bound the sparsity of the pivot columns. The pattern of the first column of A is contained in the pattern of the first of N and similarly for each reduced matrix. George and Ng (1985) proposed that a good pivot order be constructed for N = AT A and the resulting permutation be applied as a column ordering for A. If sparsity can be well preserved in the reduction of N, it will be well preserved during the reduction of A for any choice of row permutation. The techniques of the previous section may be applied to the pattern of N without even forming it explicitly, so finding this column permutation is not expensive. This observation forms the basis for the COLAMD ordering (Davis, Gilbert, Larimore, and Ng, 2004a, 2004b) in MATLAB. We discuss this further in Section 11.6. While this technique is often very successful in practice, it must be emphasized that it chooses its column sequence from a bound on the pattern, and not the pattern itself. Sometimes the bound may be so pessimistic that the technique is of little benefit. The obvious example is when A has a single full row; in this case, N is full and gives no guidance in the choice of a good column order. The result of the tightness of the bound, mentioned earlier in this section, applies only to a single entry. For any entry of the factorization of N, there is a pivot sequence that results in the corresponding position in the factorization of A holding an entry. It is unlikely that any pivot sequence will lead to this happening for all such entries. Duff and Reid (1996a) have an alternative approach for the FACTORIZE phase; they use the column order of the original analyse. This has proved to be very successful where a sequence of similar matrices are factorized, for example, during successive steps of the solution of a set of ordinary differential equations.

1 Note that this definition of A(2) differs from that in Chapter 3 inasmuch as we look only at the Schur complement after the first elimination step.

142

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

Fixing the column sequence in advance does not avoid the need to follow the sparsity pattern during the reduction since it will be affected by the choice of pivot within each pivot column. Fortunately, Gilbert and Peierls (1988) showed that it is possible to construct the pattern with a computing effort that is proportional to the arithmetic performed. We explain how this is done in Section 10.3. 7.5

Simpler strategies

There are two basic difficulties with the Markowitz strategy: it does not always produce the best ordering and may be costly to implement. We will show in Chapter 10 that both the pivot search and the updating may be performed efficiently by a careful implementation. Nevertheless, we are led to examine alternatives. In this section, we consider simpler orderings in the hope of reducing the work of finding the ordering without greatly increasing the fill-in. In Section 7.6, we try to improve the performance by using more sophisticated strategies. One alternative involves keeping an updated sparsity pattern but limiting the pivot search: select the column with updated minimum column count and choose the entry in that column in the row with minimum updated row count among entries in that column. This is referred to as the min. row in min. col. strategy (and there is an analogous min. col. in min. row strategy). For the symmetric case when only diagonal pivoting is permitted, it is equivalent to the minimum degree algorithm. Now the savings compared with the Markowitz ordering are only in the pivot search. Tosovic (1973) and Duff (1979) give examples that show the performance to be somewhat poorer (typically the number of entries in the factorized form is up to 20% greater than with the strategy of Markowitz). The danger is that, although the pivot is in a column with few entries, it may be in a row with many entries. An example of this behaviour is shown in Table 7.5.1; although the Markowitz and min. row in min. col. algorithms perform similarly on the matrix and Markowitz performs similarly on its transpose, min. row in min. col. performs significantly worse on the transpose. Table 7.5.1 Markowitz and min. row in min. col. ordering on a matrix of order 363 with 3157 nonzeros and on its transpose. Matrix A AT Fill-in Markowitz 729 747 Min. row in min. col. 809 4 889 Operations in factorization Markowitz 6 440 7 865 Min. row in min. col. 7 811 51 375

A MORE AMBITIOUS STRATEGY: MINIMUM FILL-IN

143

A further suggestion for simplifying the Markowitz search was made by Zlatev (1980) and is now incorporated in the MA48 code. This strategy restricts the search to a predetermined number of rows of lowest row count, choosing entries with best Markowitz count and breaking ties on numerical grounds. This can suffer from some of the difficulty of the min. col. in min. row strategy, but has proved very useful when there are a large number of rows and columns with the same number of entries. The performance is quite dependent on the number of rows to which the search is restricted and that in turn depends on the problem. Zlatev (1980) recommends searching two or three rows. We will see in Section 10.2 that the Zlatev strategy is usually as good as the Markowitz strategy at preserving sparsity and is faster. Early in the development of local sparse ordering algorithms, Tinney and Walker (1967) proposed and analysed an algorithm that was even simpler. For a symmetric matrix, they considered ordering the rows and columns a priori based on the row counts without updating due to fill-in from previous choices. They referred to this as scheme 1. While this represented a natural thing to try, they rejected it in practice because of its poor performance. They found savings in computing the ordering were far less than the increased costs due to additional fill-in. 7.6

A more ambitious strategy: minimum fill-in

Many efforts have been made to improve the Markowitz ordering by trying to reduce the overall fill-in further. One such strategy is to replace the Markowitz criterion with a local minimum fill-in criterion. That is, at the k-th stage of Gaussian elimination select as pivot the nonzero entry (which is not too small numerically) that introduces the least amount of fill-in at this stage. Markowitz (1957) actually suggested this algorithm, but rejected using it on the grounds of cost. For the symmetric case, this is scheme 3 of Tinney and Walker (1967). It is also sometimes called the minimum deficiency algorithm. We readily observe that this minimum fill-in criterion is considerably more expensive than that of Markowitz. We not only require the updated sparsity pattern but must compute the fill-in associated with the pivot candidates. It is not necessary, in a careful implementation, to recompute the fill-in produced by all pivot candidates at each stage, since only a small part of a sparse matrix changes with each elimination step. For Markowitz, we need the row and column counts and these only change for the rows and columns involved in the pivot step. The fill-in, however, has to be recomputed for all the entries in any row with an entry in an active column and in any column with an entry in an active row. Even with careful implementation, we need a significant improvement in performance to make the extra effort worthwhile. We have already remarked that no fill-in results from using the minimum fillin criterion on the case shown in Figure 7.3.2, whereas one fill-in was generated with a Markowitz ordering. However, the graph in Figure 7.6.1 (where the given ordering is globally optimal) illustrates that local minimum fill-in does not always

144

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES 1

2

4

5

×

××× ××× ×××× ×××× ××× × ××× ×× ××× ××

3

×

6 7 10

××× ××× ×××× ×××× ××× × ××× ×

9

×

8

11

12

13

Fig. 7.6.1. Minimum fill-in not optimal. 1

7

4

10

5

2

3

6

8

9

11

12

13

14

15

Fig. 7.6.2. Markowitz optimal, minimum fill-in not optimal.

produce a globally optimal result. Selecting node 7 first produces one immediate fill-in while any other selection leads to at least two. However, this fill-in is in addition to those already resulting from the given order. Compounding the problem further is the example in Figure 7.6.2, where the Markowitz algorithm leads to an optimum order (the order given), while the minimum fill-in algorithm introduces extra fill-in. These examples, and others, are discussed by Ogbuobiri, Tinney, and Walker (1970), Duff (1972), Rose (1972), Dembart and Erisman (1973), and Duff and Reid (1974). They serve to demonstrate that a locally best decision is not necessarily best globally and to illustrate the general difficulty with local algorithms. Rothberg and Eisenstat (1998) experimented with the minimum fill-in heuristic on far larger matrices and found that significant savings can be obtained over using minimum degree. For example, they report that on runs on 40 standard symmetric test matrices of order between approximately 3 000 and 250 000, the minimum fill-in algorithm consistently outperformed minimum degree. The mean reduction over the whole set was about 15% in the number of entries in the factor and 30% in the operation count for factorization. Of course, the costs of running the minimum fill-in algorithm were considered prohibitive. Peyton (2001) and Ng and Peyton (2014) have investigated improving the efficiency of such algorithms although they have not implemented their approach in available software. However, they developed several variants of approximate minimum fill-in (AMF) whose run times were only sightly greater than that of

EFFECT OF TIE-BREAKING ON THE MINIMUM DEGREE ALGORITHM 145

AMD (end of Section 7.3), but which exhibited most of the benefits. Median gains were reduced to 7% in the number of entries in the factor and 20% in the operation count for factorization, but gains of up to 25% and 50%, respectively, were obtained on a few matrices in the set. An AMF option has been included in the MUMPS package and sometimes outperforms AMD, but there is little evidence that it is widely used. 7.7

Effect of tie-breaking on the minimum degree algorithm

Because of the good results obtained with the minimum degree algorithm in practice, many people have attempted to analyse its performance formally, sometimes on limited problem classes. For example, it is optimal in the minimum fill-in sense for matrices whose graphs are trees (see Section 7.3), but this is a very limited problem set. Matrices from 5-point and 9-point operators on a square grid (see Section 1.7) provide a natural class to analyse. Their regular pattern and frequent use in applications motivate such an investigation. In Section 9.3, we introduce the nested dissection algorithm, which has been shown to be optimal in the sense that on these matrices it involves O(n3/2 ) operations and no ordering involves O(nα ) operations2 with α < 3/2. Here, we make some comparisons using this problem class simply to show the effect of tie breaking on the minimum degree algorithm. To see why tie breaking is so important for these matrices, observe that for a 9-point operator on a two-dimensional grid with q subdivisions in each direction (where the matrix order is (q +1)2 ), four nodes have degree 4, 4(q −1) nodes have degree 6, and all the rest have degree 9. Similarly, for a 5-point operator on the same grid, the matrix order is (q + 1)2 , four nodes have degree 3, 4(q − 1) nodes have degree 4, and the rest have degree 5. After a few steps of the algorithm, there are a huge number of variables with minimum degree and the choice is unspecified. This makes analysing the minimum degree algorithm impossible — different tie-breaking strategies are likely produce a wide range of results. To demonstrate the effect of tie breaking, we have used a variant of the HSL code ma27 that accepts a given order and breaks each tie by using the variable that lies first in the given order. A heap (Section 2.11.2) is used to make this choice efficiently. We look at different given orderings experimentally, seeking to find those that could lead to good and poor outcomes. We experimented with what might be considered the most natural (pagewise) and a very unnatural (spiral in) ordering, illustrated in Figure 7.7.1. We also tried a random ordering, spiral out (the opposite of spiral in), and the ordering of nested dissection. This extends our earlier work (Duff, Erisman, and Reid 1976) by using more given orderings and matrices of order up to 16 785 409 (q = 4 096). 2 This is based on the assumption that factorization of a full matrix of order r needs O(r 3 ) operations, which ignores the work of Strassen and others, see the last paragraph of Section 3.11.

146

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

1 2 3 4 5

16 17 18 19 6

15 24 25 20 7

14 23 22 21 8

13 12 11 10 9

1 2 3 4 6 7 8 9 11 12 13 14 16 17 18 19 21 22 23 24

5 11 15 20 25

Fig. 7.7.1. Spiral and pagewise ordering on a 5 × 5 grid. Table 7.7.1 Matrices from 9-point stencil on a regular square grid of size q+1. Number of entries in L divided by q 2 log2 q, for the minimum degree algorithm with different tie-breaking strategies and for nested dissection.

q 16 32 64 128 256 512 1 024 2 048 4 096

Minimum degree, resolving ties by Nested Page- Spiral Spiral Rand- Nested diss. wise in out om diss. 3.9 4.2 3.6 3.7 3.4 3.5 4.5 5.4 4.1 4.5 4.0 3.8 5.3 7.5 4.8 5.4 4.2 4.2 6.6 8.4 5.3 7.1 4.6 4.5 8.6 10.5 6.0 8.2 4.9 4.9 10.8 11.7 6.4 9.8 5.2 5.1 13.2 13.0 7.0 11.5 5.5 5.4 14.9 14.4 7.6 13.7 5.7 5.6 17.7 16.0 8.0 16.3 5.8 5.8

Results of these experiments for the 9-point operator are summarized in Table 7.7.1, showing normalized data for the number of entries in L for different given orderings. The normalization factor is proportional to the asymptotic behaviour of nested dissection and is included to aid the comparison for varying sizes of grid. In the last column, we include the corresponding data for nested dissection. The results are displayed graphically in Figure 7.7.2. We offer several observations on this data. First, it appears that with the best given order we could find, minimum degree tracks nested dissection (measured by number of entries in L) closely on this set of problems. Secondly, the pagewise order led to significantly more fill-in, giving an indication that it is asymptotically of a different order. That the spiral-out order performs so well and the spiralin order much more poorly seems surprising. We invite the reader to explore possible explanations for these numbers in Exercise 7.6. We conclude that any analysis of minimum degree must specify a strategy for tie breaking, since it can make a huge difference in the performance. We consider some related cases in Section 9.3.

NUMERICAL PIVOTING

147

20 18 16 14 12 10 8 6 4 2 0

Pagewise Spiral in Spiral out Random ND

0

1000

2000

3000

4000

5000

Fig. 7.7.2. The minimum degree tie-breaking results of Table 7.7.1. 7.8

Numerical pivoting

Associated with the Markowitz ordering strategy is the need in the unsymmetric case to establish a suitable threshold parameter, u, for numerical stability. In particular, we restrict the Markowitz selection to those pivot candidates satisfying the inequality (k)

(k)

|akk | ≥ u|aik |, i > k,

(7.8.1)

where u is a preset parameter in the range 0 < u ≤ 1, as discussed in Section 5.2. Recommendations for suitable values for u have varied quite widely. Curtis and Reid (1971) recommended the value 14 on the basis of experiments that now appear to be of very modest order. Tomlin (1972) recommended the rather low figure of 0.01 for linear programming cases, where the number of entries in column j of U is usually very small (for example, 3, 4, or 5). Some authors (for example, Saunders (Gill, Saunders, and Shinnerl 1996) and Gould (Gould and Toint 2002)) have found that they can use extremely small values such as 10−8 or less in optimization codes. They report that there is seldom a problem with the factorization, but they do check the residuals and use iterative refinement or even a refactorization with increased threshold if a problem is detected. Intuitively, it might seem that the smaller the value of u the greater freedom there would be to choose pivots that are satisfactory from a sparsity point of view, so that the limit u is needed only for stability reasons. Duff (1979) found experimentally that this is not necessarily the case. We have run experiments on large unsymmetric matrices from the University of Florida Sparse Matrix Collection and five representative sets of results are

148

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

Table 7.8.1 Varying the threshold parameter u, running on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel compiler with O2 optimization. Matrix n τ u 1.0 10−1 10−2 10−4 10−8 u 1.0 10−1 10−2 10−4 10−8 u 1.0 10−1 10−2 10−4 10−8 u 1.0 10−1 10−2 10−4 10−8

goodwin crashbasis onetone1 shyy161 7 320 160 000 36 057 76 480 324 784 1 750 416 341 088 329 762 ANALYZE time (seconds) 3.39 19.67 5.87 9.64 8.13 18.29 6.05 9.90 2.19 95.99 5.96 7.00 0.82 331.61 6.77 6.82 0.60 254.79 7.88 6.67 FACTORIZE time (seconds) 0.68 3.28 0.41 2.02 0.57 3.43 0.54 1.50 0.28 17.38 0.44 1.17 0.19 41.46 0.48 0.93 0.15 112.45 0.34 0.67 SOLVE time (seconds) 0.0093 0.0388 0.0068 0.0179 0.0060 0.0398 0.0063 0.0135 0.0045 0.0663 0.0059 0.0120 0.0033 0.0923 0.0064 0.0107 0.0030 0.1006 0.0061 0.0100 Scaled residual 5.04e-16 1.19e-15 3.69e-16 7.27e-21 6.12e-15 1.87e-13 9.82e-16 7.27e-21 1.15e-12 1.03e-09 9.62e-14 7.27e-21 4.12e-07 2.27e-03 8.62e-07 7.27e-21 1.00e-06 2.47e-03 2.28e-04 7.27e-21

trans4 116 835 766 396 313.99 9.48 9.43 9.44 9.43 79.48 0.05 0.06 0.06 0.05 0.2051 0.0051 0.0050 0.0050 0.0041 2.29e-14 9.06e-15 1.09e-14 1.72e-14 1.68e-08

shown in Table 7.8.1. Note that the SOLVE time correlates with the number of entries in the factorization because each entry is used once. The matrix crashbasis illustrates that reducing u can be disadvantageous from the sparsity point of view. We believe that the explanation is that with smaller values of u the spread of numerical values within the reduced submatrix becomes very wide and this causes inequality (7.8.1) to be very restrictive. That the choice u = 1 can lead to very poor performance is illustrated by trans4. Usually, choosing 0.1 or 0.01 is satisfactory, but which is better varies. For crashbasis, 0.1 is better and for goodwin, 0.01 is better. Overall in our tests, 0.01 is significantly better than 0.1 more often than vice-versa. Several packages use this value for the default setting, but the best choice is problem dependent. For those solving particular classes of problems, we encourage exploration of this question with sample problems. We outline some more work for this area in Research Exercise R7.1.

SPARSITY IN THE RIGHT-HAND SIDE AND PARTIAL SOLUTION

149

To judge the accuracy, we show scaled residuals kb−Axk kAkkxk in Table 7.8.1. The results for small u are surprisingly accurate, given the opportunity for growth that is provided. 7.9

Sparsity in the right-hand side and partial solution

Often in solving Ax = b, the vector b is also sparse. It is readily seen that if b1 = b2 = ... = bk = 0, bk+1 6= 0, then in the forward substitution step Lc = b

(7.9.1)

c1 = c2 = ... = ck = 0. This suggests that we may be able to save computation by taking advantage of sparsity in b. If the matrix L is being saved for use with other right-hand sides and they also have k leading zeros, then we may also save storage since the first k columns of L (shaded in Figure 7.9.1) are not needed in the forward substitution. However, if equation k + 1 is placed first in the ordering, we may need to store all of the resulting L, and perform all of the forward substitution. This suggests that sparsity in b should influence the ordering of A. Note that if all the right-hand sides are known at the time of the factorization, the forward substitution can be done along with the factorization. Computation is still reduced when there are leading zeros in the right-hand side(s), and none of L needs to be saved. Discarding L is not possible when multiple right-hand sides depend on each other, because then one or more of these vectors is unknown at the time of the factorization. This dependency is present in iterative refinement and parametric studies, for example. Next, we show a similar property for x. The combined effect of forward substitution and back-substitution usually leads to x being a dense vector. This happens if cn is nonzero and U has an entry in each off-diagonal row except the last, see Section 15.6. This will be the case structurally for irreducible matrices A (see Exercise 7.4). However, we may be interested in only a partial solution. In this case, if the components of interest in x are numbered last, we have to carry the back-substitution computation only far enough to compute all xi of interest.

Fig. 7.9.1. Storage saved in L from sparsity in b.

150

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

Fig. 7.9.2. Savings in U from computing part of x.

=

0 . . . . 0

x b A Fig. 7.9.3. The structurally symmetric case.

This also saves computation and storage in U as shown in Figure 7.9.2, where the shaded portion of U is not required. Again, the ordering of x to achieve this saving influences the ordering of A. Looking at Figures 7.9.1 and 7.9.2, we could make the required triangles (unshaded regions in L and U) as small as possible by constraining the sparse ordering on A as follows. All equations (rows of A) corresponding to bi = 0 must be numbered before equations corresponding to bi 6= 0. All variables (columns of A) corresponding to xi of interest must be numbered after the columns in A corresponding to unwanted variables. Erisman (1972, 1973) and Erisman and Spies (1972) reported a five-fold reduction of storage on circuit analysis problems by reusing those locations that would otherwise hold entries in the shaded part of L\U. To preserve symmetry for the symmetrically-structured case requires partitioning x and b as shown in Figure 7.9.3, where the shaded portion corresponds both to unwanted variables and to zero right-hand side components (other unwanted variables and zero right-hand side components are not exploited).

SPARSITY IN THE RIGHT-HAND SIDE AND PARTIAL SOLUTION

151

This strategy has two shortcomings. First, the objective of making the order of the unshaded triangles as small as possible is not usually best. Recall that the amount of computation associated with forward substitution and back-substitution is equal to the number of entries in L and U, respectively. Constraining the pivotal sequence, and possibly increasing the factorization costs and the overall density of the factors, may more than offset the resulting savings in forward substitution and back-substitution. Secondly, the severe constraint on the pivot sequence may lead to an unstable factorization. The second difficulty may be overcome by continuing the restricted pivot choice only as long as sufficiently large pivots are available. The first difficulty must be addressed further. Tinney, Powell, and Peterson (1973) suggest monitoring the number of entries in the active part of the matrix and applied this in their code for simulation of power grids. Dembart and Erisman (1973) suggested a similar approach for circuit analysis. As the pivot choice becomes more restricted in the course of Gaussian elimination, the number of entries in the remaining submatrix may increase sharply. They recommend a two-pass algorithm, the first of which monitors the number of entries in the submatrix. If this increases in the neighbourhood of the restriction, they return to where the increase began and release the restriction. This effectively increases the order of the stored parts of the factors in exchange for improving their sparsity. This concept is illustrated by the example in Figure 7.9.4. If we keep to the constraints indicated, one step of Gaussian elimination with the (1,1) pivot would be performed, leading to complete fill-in. Releasing the constraints means that d is computed and the leading zero on the right-hand side is ignored, but there is no fill-in and overall computation is much reduced. 

    ××××××××× d 0 × × × ×      × × × ×      × × × ×      × × = × ×      × × × ×      × × × ×      × × × × × × × ×

Fig. 7.9.4. Constrained pivot sequence illustration. We do not want the value of d. Another approach to this problem is to increase the counts in the rows corresponding to nonzeros in b and in the columns corresponding to wanted components of x. This is one interpretation of a suggestion of Hachtel (1976). Duff and Reid (1976), examined the use of the augmented form

152

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES



I A AT 0

    r b = x 0

(7.9.2)

in the solution of the problem min kb − Axk2 , x

(7.9.3)

where the matrix A is of dimension m×n, m ≥ n. The matrix of equation (7.9.2) is symmetric and we would normally expect that a solution scheme that took advantage of this fact would be better than an unsymmetric code. By treating the matrix as a general matrix, however, we allow early pivots to be chosen from rows m + 1 to m + n (thereby exploiting sparsity in the right-hand side) and columns 1 to m (thereby exploiting the fact that equation (7.9.2) need only be solved for x). This approach to solving the augmented problem has the flavour of a null-space method for solving constrained problems but does have a little more flexibility. If r is required, it can be obtained from the computation r = b − Ax.

(7.9.4)

Duff and Reid (1976) biased the pivot choice by adding one to the row counts of the first m rows and to the column counts of the last n columns and found a reduction of about a third in the number of multiplications needed in SOLVE. Working now with larger test cases, we have found very mixed results, as we show in Table 7.9.1. Biasing the pivot choice usually made little difference, but was much better in one case (mod2). Using the unsymmetric factorization was usually better, but was sometimes worse. Table 7.9.1 Millions of multiplications in SOLVE for least-squares problems. Matrix

Rows Columns Entries Symmetric Unsymmetric ×103 ×103 ×103 Unbiased Biased pigs large11 28.3 17.3 75.0 0.529 0.489 0.497 Kemelmacher 28.4 9.7 100.8 31.65 26.81 26.74 mod2 34.8 66.4 199.8 14.18 13.52 8.84 pigs large21 56.5 34.5 225.1 5.18 14.48 15.72 deltaX 68.6 22.0 247.4 9.62 32.73 35.04 2.3 54.8 317.1 1.862 0.549 0.549 lp osa 14 1 Not in the Florida sparse matrix collection. In addition to the early applications of sparse right-hand sides in the simulation of power grids, circuit simulation, and the use of augmented equations for least squares solutions, Hall and McKinnon (2005) address sparse right-hand sides in the context of linear programming. We discuss their work in Section 10.6. Amestoy, Duff, Guermouche, and Slavova (2010) have investigated the exploitation of sparse right-hand sides in the context of factorizations based on an elimination tree. We will discuss this in Section 14.4.

VARIABILITY-TYPE ORDERING

153

7.10 Variability-type ordering Suppose a problem requires the factorization of a sequence of matrices Al , all with the same sparsity structure but with some of the entries changing in numerical value. The program may contain several layers of nested loops so that some entries change less frequently than others. At the outer level, there may be changing design parameters. Inside this, there might be the time steps in the solution of a set of ordinary differential equations and inside this there may be iterations to solve a set of nonlinear equations. In principle, any operation involving quantities unchanged since the last factorization need not be repeated. This led Hachtel, Brayton, and Gustavson (1971) to define what they called the sparse tableau, and to introduce the concept of variability type (see also Hachtel (1972)) to label the way each entry changes in the sequence Al , l = 1, 2,... . They took each matrix entry aij to have a variability type νij that depends on the depth of nesting of the innermost loop in which it changes. It is assumed that the variability type increases with the depth of nesting. They also associate (k) (k) variability type νij with the intermediate quantities aij computed in Gaussian elimination. Each is labelled with the greatest variability type of a quantity from which it is calculated and therefore labels the depth of nesting at which it will need to be recomputed. They replaced the Markowitz count with the aim of seeking pivots that keep the variability type low, in order to avoid computations in inner loops. In spite of the fact that variability-type ordering has been successful for special problems, its implementation is very difficult. It is unlikely that it will ever be incorporated in a general-purpose sparse matrix package, though it may be part of a special-purpose package (for example, in circuit design). 7.11 The symmetric indefinite case For dense symmetric matrices that are not positive definite (or are not known a priori to be so), stable symmetric decompositions are available with the help of symmetric permutations and the use of 2 × 2 pivots as well as 1 × 1 pivots (see Section 4.9). An approach for the sparse case used by Duff and Reid (1982, 1983) is to analyse the structure as if it were a positive-definite matrix with the same off-diagonal structure and incorporate pivoting with 1 × 1 and 2 × 2 pivots during factorization. While this approach is often satisfactory, it can require far more work during factorization than would have been needed in the corresponding positive-definite case. One issue is that a zero entry on the diagonal is regarded as available to be a pivot during the analyse phase, whereas it cannot be chosen as a pivot during actual factorization until it has become nonzero by fill-in. To take account of zeros on the diagonal, Duff, Reid, Munksgaard, and Neilsen (1979) proposed an extension of the Markowitz strategy. For the Markowitz count of a 2 × 2 pivot (to be compared with (7.2.1) for 1 × 1 pivots) they proposed the square of the number of entries in the two pivot rows that are not in the pivot

154

LOCAL PIVOTAL STRATEGIES FOR SPARSE MATRICES

itself. This is a readily evaluated upper bound for the possible fill-in, generalizing one interpretation of the Markowitz cost. Since a 2 × 2 pivot results in two eliminations being performed together, they compared the Markowitz cost of the best 2 × 2 pivot with double that of the best 1 × 1 pivot. Their experimental results showed promise, so Duff and Reid (1983) incorporated these ideas within a multifrontal framework (see Section 12.8), choosing pivots on sparsity grounds alone. Although the method is successful for some classes of problems, experience has unfortunately shown that pivotal operations during actual factorization can lead to far more fill-in than was anticipated during the analysis of the structure. This generalization of the Markowitz strategy to 2 × 2 pivots will often not work well in the common case where the potential pivot is of the form   0x (7.11.1) x0 since, if the union of the pattern of the two rows is used to define the Markowitz count it will usually be a severe overestimate of the fill-in when using pivot (7.11.1), which we call an oxo pivot. In such a case, Duff and Reid (1996b) generalize the Markowitz count by looking at the potential fill-in when using the two off-diagonal entries as successive 1 × 1 pivots, which preserves the symmetry. Exercises 7.1 Construct a sparsity pattern such that the minimum of expression (7.2.1) is greater than expression (7.2.2) for the first step of Gaussian elimination. 7.2 Show that if the graph of a symmetric matrix is a tree, the minimum degree ordering introduces no fill-in. 7.3 Show that leading zeros in b will also be present in the solution of Lc = b, where L is a nonsingular lower triangular matrix. 7.4 If A is irreducible, show that the active submatrix of A(2) is structurally irreducible. Deduce that the active part of every A(k) is structurally irreducible. If A has the triangular factorization LU, show that every row of U, except the last, has an offdiagonal entry. 7.5 Suppose we wish to solve Axi = bi for a sequence of vectors bi each having the same set of numerical values in its leading positions. Show how this property may be exploited as in the case where there are leading zeros in b. 7.6 In Table 7.7.1, we show the normalized number of entries in L based on different tie-breaking orderings for the 9-point operator on a square. Create a plausible argument that explains why: minimum degree with a given ordering based on nested dissection would produce similar fill-in to nested dissection itself; a pagewise given ordering would produce significantly more fill-in; and why a given ordering based on spiral in would produce so much more fill-in than one based on spiral out.

Research exercises R7.1 In Section 7.8, we outlined some considerations for choosing u, the threshold parameter in the Markowitz strategy. The size of u for threshold pivoting is critical both

THE SYMMETRIC INDEFINITE CASE

155

for the accuracy and the speed of the computation. If u is too small (some authors have recommended 10−6 ), it could interfere with the stability of the factorization and, hence, the accuracy. In addition, it can lead to a much wider range of matrix entries because of the large multipliers causing later pivot choices to be even more restricted. If u is too large, it may limit the choice of pivots for sparsity preservation. If rook pivoting is used, how might this change the picture? The best choice for u may be dependent on the size of the matrix, the algorithm, and the types of problems from which the matrix arises. For example, problems coming from linear programming solutions may behave differently from problems associated with chemical process simulations or large scale circuit design. Are there guidelines that can be developed from careful testing over problem classes, matrix sizes, and the size of u that address both the performance and accuracy of computation? Does computer architecture play any role here? R7.2 We have shown a wide variety of variations from the Markowitz/minimum degree algorithms in the chapter. We have also offered preliminary comparisons, when possible, suggesting trade-offs in the efficiency of the algorithm for finding the pivots and the resulting sparsity preservation. Because these are heuristic algorithms, testing results are all we have for making comparisons. Would more definitive testing over a broad class of problems suggest clearer trade-offs? Would problem size, application, or computer architecture be a factor in these trade-offs? R7.3 In Chapter 4, we introduced rook pivoting and in Section 5.2.2 we indicated how a threshold version of this could be adapted for the sparse case. We invite readers to explore the unsymmetric test cases and identify where this algorithm may offer an advantage. R7.4 We introduced 2×2 pivoting for the symmetric indefinite problem in Section 4.9, and extended it to the sparse case in Section 7.11. A domain where this is useful is for the augmented system that arises from least squares problems, see Section 7.9. Explore other domains where 2×2 pivoting can be needed and run test cases to determine the performance.

8 ORDERING SPARSE MATRICES FOR BAND SOLUTION We consider a different approach to reordering to preserve sparsity, with algorithms that take a global view of the problem. We now take a topdown approach to ordering as opposed to the local bottom-up approach considered in Chapter 7. In this chapter, we explore one global approach: ordering matrices for reduced bandwidth or profile (number of entries within the variable band). In this context, we also consider blocking and incorporating numerical pivoting. In Chapter 9, we introduce another approach to global orderings.

8.1 Introduction In this chapter, we discuss an approach to the ordering problem that is different from that of Chapter 7. We replace the objective of minimizing fill-in or minimizing the number of operations at each step by the objective of permuting the matrix to confine the fill-in to a band near the diagonal. Although this can result in more fill-in, it is not necessarily the case since the local strategies do not minimize the fill-in globally. An advantage of this approach is that the ordering and solution algorithms are usually simpler. We introduced this topic in Section 5.3.3. 8.2 Band and variable-band matrices Consider the graph of Figure 8.2.1. Taking the natural order for the graph produces the matrix shown in Figure 8.2.2. If we ignore the zeros within the band and only exploit the zeros outside the band, we make an interesting tradeoff. What we may give up in sparsity we can seek to gain through a simpler data structure for the entries within the band. We can work with dense vectors rather than sparse vectors. We show that we can generalize this idea to include the band and variable-band forms that were illustrated in Figure 5.3.2. 2

1

5

8

11

14

3

6

9

12

15

4

7

10

13

16

17

Fig. 8.2.1. A graph with an obviously good node order. Direct Methods for Sparse Matrices, second edition. I. S. Duff, A. M. Erisman, and J. K. Reid. c Oxford University Press 2017. Published 2017 by Oxford University Press.

BAND AND VARIABLE-BAND MATRICES ×××× ××× × ×××× × × ×× × × ×× × × ××× × × ×× × × ×× × × ××× × × ×× × × ××× × × ××× × × ×× × × ×× × ××× × ×× ×××

157

× × × ×

Fig. 8.2.2. The matrix corresponding to the Figure 8.2.1 graph. We say that a symmetrically structured matrix A has bandwidth 2m + 1 and semibandwidth m if m is the smallest integer such that aij = 0 whenever |i − j| > m. When treating it as a band matrix, all matrix coefficients aij with |i − j| ≤ m (including zeros) are stored explicitly. For the unsymmetric case, we define the lower (upper) semibandwidth as the smallest integer ml (mu ) such that if aij is an entry, i − j ≤ ml (j − i ≤ mu ). The bandwidth is ml + mu + 1. Gaussian elimination without interchanges preserves the band structure. Hence, band matrix methods provide an easy way to exploit zeros in a matrix. When using the variable-band form for a symmetric matrix, we store for each row every coefficient between the first entry in the row and the diagonal. The total number of coefficients stored is called the profile. In the unsymmetric case, we also store for each column every coefficient between the first entry in the column and the diagonal. If Gaussian elimination without interchanges is applied, no fill-in is created before the first entry in any row or before the first entry in any column (Exercise 8.1). This shows that the form of a variable-band matrix is preserved. In applications, it often happens that worthwhile savings are obtained by exploiting bandwidth variability. The graph of a simple example, arising from the triangulation of a square plate, is shown in Figure 8.2.3 and the lower triangular half of the corresponding matrix A is shown in Figure 8.2.4. Here, the maximum semibandwidth of 4 is attained in only 6 of the rows, whereas a pagewise ordering gives a band matrix with semibandwidth 4. Any zeros within the variable-band form are stored explicitly because this part of the matrix usually fills in totally. In fact, George and Liu (1975) have shown that for a symmetric matrix, if all rows after the first have a nonzero ahead of the diagonal, then they each fill in totally between this nonzero and the diagonal. This generalizes to any unsymmetric matrix with a nonzero ahead of the diagonal in every row and every column (except the first). The proof is straightforward and we leave it as an exercise for the reader (Exercise 8.2).

158

ORDERING SPARSE MATRICES FOR BAND SOLUTION 1

3

6

10

2

5

9

13

4

8

12

15

7

11

14

16

Fig. 8.2.3. The triangulation of a square plate. × ×× ××× × × ×××× × ×× × × ×× ×× ×× ×× × ×× ×× × ×× ×× ×× ×× ×× × ×××× ×××

Fig. 8.2.4. The pattern of the lower triangular part of the matrix of Figure 8.2.3.

We assume for the moment that we are working with matrices where the diagonal pivots are stable. We defer to Section 8.12 the case where interchanges for numerical stability may need to be considered. In the next few sections, we describe algorithms for finding orderings for small bandwidth. The data structures for actually carrying out the factorization will be discussed in Section 11.8. 8.3

Small bandwidth and profile: Cuthill–McKee algorithm

It is often natural to use an ordering that gives a small bandwidth. For instance the matrix whose graph is shown in Figure 8.2.1, with the node ordering shown there, has semibandwidth 3. In general, however, an automatic ordering algorithm is clearly desirable. Such algorithms for the symmetric case are the subject of this section. Symmetric permutations of a symmetric matrix correspond to relabellings of the nodes of the associated graph (see Section 1.2) and it is easier to describe algorithms in terms of relabelling graphs. We will follow usual practice and do

SMALL BANDWIDTH AND PROFILE: CUTHILL–MCKEE ALGORITHM ×××× ××× × ×××× × × ×× × × ×× × × ××× × × ×× × × ×× × × ××× × × ×× × × ××× × × ××× × × ×× × × ×× × ××× × ×× ×××

159

× × × ×

Fig. 8.3.1. The matrix corresponding to the Figure 8.2.1 graph. 1

7

3 2

6

4

5

×× ××× ×× × 0 × 0 × 0 × 0

× 0 × 0 0 0

× 0 0 × 0 0

× 0 0 0 × 0

× 0 0 0 0 ×

Fig. 8.3.2. A graph and its associated matrix, ordered by Cuthill–McKee.

this. Many algorithms divide the nodes into level sets, Si , with S1 consisting of a single node. The next, S2 , consists of all the neighbours of this node. The set S3 consists of all neighbours of nodes in S2 that are not in S1 or S2 . The general set Si consists of all the neighbours of the nodes of Si−1 that are not in Si−2 or Si−1 . Once these sets have been constructed, an ordering that takes the nodes of S1 followed by those of S2 , etc., corresponds to a permuted matrix which is block tridiagonal with diagonal blocks corresponding to the sets Si . For example, the ordering of the Figure 8.2.1 graph could have come from level sets (1), (2–4), (5–7), (8–10), (11–13), (14–16), (17) and the corresponding matrix, shown in Figure 8.3.1, is block tridiagonal with diagonal blocks having orders 1, 3, 3, 3, 3, 3, and 1. A widely-used algorithm of this kind was proposed by Cuthill and McKee (1969). They ordered within each block Si by taking first those nodes that are neighbours of the first node in Si−1 , then those that are neighbours of the second node in Si−1 , and so on. Rather surprisingly, George (1971) found that reversing the Cuthill–McKee order often yields a worthwhile improvement, not in the bandwidth, but in the

160

ORDERING SPARSE MATRICES FOR BAND SOLUTION 7

× 1

5 6

2

4

3

× × × × × × ×× ××××××× ×× ×

Fig. 8.3.3. As Figure 8.3.2, but with order reversed. total storage required within the variable-band form (the profile) and in the number of arithmetic operations required for variable-band factorization (see also Liu and Sherman (1976)). The resulting algorithm is called the Reverse Cuthill– McKee (RCM) algorithm. A simple example of this is shown in Figures 8.3.2 and 8.3.3. The Cuthill–McKee order (with level sets (1), (2), (3–7) ) is shown in Figure 8.3.2 and there are 20 zeros within the variable-band form (shown explicitly in the Figure 8.3.2 matrix). On reversing the order (Figure 8.3.3) all these zeros move outside the form and will not need storage. The reason for the improvement on reversing the Cuthill–McKee order was explained by Jennings (1977). Each zero marked explicitly in the lower triangular part of the matrix in Figure 8.3.2 has an entry ahead of it in its row and so fills in when Gaussian elimination is applied. It does not, however, have an entry below it in its column and so, on reversing the order, it has no entries ahead of it in its row and therefore does not fill in. The opposite effect does not occur because the Cuthill–McKee order gives leading entries in the rows which are in columns that form a monotonic increasing sequence. This is straightforward to prove and we leave it as an exercise for the reader (Exercise 8.3). Any column in which such a gain takes place (a row in the reversed order) corresponds to a node in Si that has no neighbours in Si+1 or whose only neighbours in Si+1 are themselves neighbours of earlier nodes in Si . In our simple example, the gains were all in the last level set whose nodes cannot have neighbours in the next level set. Another simple example is shown in Figure 8.3.4. In this case node 3 in S2 has only one neighbour, node 5, in S3 and this is a neighbour of node 2. An important practical case is in finite-element problems involving elements with interior and edge nodes. For example, in Figure 8.3.5 we have a finiteelement problem with four quadratic triangular elements and four biquadratic rectangular elements. The graph of the matrix is as in Figure 8.3.5 except that within each element every node is connected to every other node. The Cuthill– McKee order is shown and has level sets (1), (2–9), (10–19), (20–29), (30–33). On reversing the order there are gains associated with nodes 2, 3, 7, 10, 11, 12, 16, 17,... . Notice that they are all interior or edge nodes. Figure 8.3.3 also illustrates the interesting property that, if a symmetric matrix has a graph that is a tree (has no cycles), no fill-in is produced with

SMALL BANDWIDTH AND PROFILE: CUTHILL–MCKEE ALGORITHM

×× ×× ×× ×× × ×

× × × × ×

× ×× ×× ×× ×× ×× ××

161

× × × × ×

× × × ×

Fig. 8.3.4. Another example showing the advantage of reversing the Cuthill–McKee ordering. 4

2 3 1

7

13

10

23

20

5

11

14

21

24

6

12

15

22

25

8

16

26

28

18

17

9

19

27

30 31

32

33

29

Fig. 8.3.5. A finite-element problem. the Reverse Cuthill–McKee ordering. The proof is straightforward and is left as an exercise for the reader (Exercise 8.4). The minimum degree algorithm also produces no fill-in in this case (see Section 7.3). Sometimes the Reverse Cuthill– McKee ordering applied to a tree graph produces a variable-band matrix with embedded zeros, as illustrated in Figure 8.3.6. There is no actual fill-in when carrying out the elimination, but we must carry the zeros in the variable-band structure, since we are storing each row from its first entry. In this case, we are storing zeros in exchange for a simpler data structure. 7

×

× ×

× 6

×

5

××

× ××

4

3

2

× ×

×

1

× ×× ×××

Fig. 8.3.6. A tree and its associated matrix, ordered by Reverse Cuthill–McKee. There are zeros within the variable-band format.

162

ORDERING SPARSE MATRICES FOR BAND SOLUTION

8.4 Small bandwidth and profile: the starting node So far we have not discussed the choice of starting node. It is clear that for small bandwidth we should aim for the level sets to be small and numerous. We obtained this for the Figure 8.2.1 graph by starting at node 1. If instead we had started at node 10 we would have had level sets (10), (7,9,13), (4,6,8,12,16), (1,3,5,11,15,17), (2,14), that is only 5 sets instead of 7 and the semibandwidth would have been 6 instead of 3. For a problem associated with a rectangular or cigar-shaped region it is obviously desirable to start at an end, rather than in the middle of a side, but a good choice is not always so apparent. Cuthill and McKee (1969) suggested trying several variables with small numbers of neighbours. Collins (1973) tried starting from every variable, but abandoned each attempt immediately he encountered a bandwidth greater than the smallest so far obtained. Gibbs, Poole Jr., and Stockmeyer (1976) suggest that each variable in the final level set Sk should be tried as a starting node. If one of these nodes produces more than k level sets, then this variable replaces the starting node and the new level sets are examined, continuing in this way until no further increase in the number of level sets is obtained. The node in the final starting set and the node in the terminating set from which the restart is made are called pseudoperipheral nodes because each is a node on the ‘edge’ of the graph. A shortest path between these two nodes is called a pseudodiameter. The final ordering used by Gibbs et al. (1976) is based on a combination of the two final sets of level sets. They report that, in all their test examples, the number of level sets was actually maximized. The reader will find it instructive to see how quickly the algorithm recovers from a poor starting node, see Exercise 8.5. A good implementation of the Gibbs–Poole–Stockmeyer algorithm is given by Lewis (1982). One of his performance enhancements is to begin looking in the last level set with a node of lowest degree; this reduces the number of starting points tested. 8.5 Small bandwidth and profile: Sloan algorithm An enhanced method of ordering for small bandwidth and profile was proposed by Sloan (1986). He orders the nodes one by one and refers to any unordered node that is a neighbour of an ordered node as active. He aims to keep the number of active nodes, known as the wavefront, small. This is directly relevant to the frontal method, which is discussed in Sections 11.9–11.12, but is also helpful for keeping the bandwidth and profile small. To see this, suppose the wavefront is l. When the last of the currently active nodes is ordered, it will be connected to the set of nodes that have been ordered, so the semibandwidth must be at least l and the contribution of this node to the profile must be at least l + 1. Sloan considers how the wavefront will be affected by the choice of the next node. He calls the amount the wavefront will increase the current degree of the node. For candidates, he takes the active nodes and all their unordered neighbours. The reason for including the neighbours is that one of them might have smaller current degree than any active node; for example, it might be

SPECTRAL ORDERING FOR SMALL PROFILE

163

connected only to active nodes, so that its current degree is zero, when all the active nodes have nonzero current degree. Sloan uses the ends of a pseudodiameter (see end of Section 8.4) to guide the choice. Without such global information, the algorithm would not take account of the overall structure. He begins with one end of the pseudodiameter and chooses the candidate node i that maximizes the priority function Pi = −W1 ci + W2 d(i, e),

(8.5.1)

where W1 and W2 are positive integer weights, ci is the current degree and d(i, e) is the distance to the other end node. Based on numerical experiments, Sloan recommends (2,1) for the weights. On the set of problems of Everstine (1979), Sloan reports good profiles, almost all better than those of the algorithms of Section 8.3. For large problems, care needs to be taken over finding the candidate node with the best priority function value. Sloan (1986) considered problems that are small by today’s standards and concluded that a simple search of a list was best, though he anticipated that a binary-heap search, see Section 2.11.2, would be better for large problems. Reid and Scott (1999) use a simple search if the number of candidate nodes is less than a threshold and a binary heap thereafter. Their experiments showed that for the largest problems the heap is preferable, but for the smaller ones there is an advantage in using the simple sort. 8.6 Spectral ordering for small profile A completely different method of ordering for small profile has been proposed by Barnard, Pothen, and Simon (1995). Its merit is that it considers the overall structure directly. It is based on computing an approximation to the eigenvector associated with the smallest positive eigenvalue of the matrix L each of whose diagonal entries is the number of off-diagonal entries in the corresponding row of A and each of whose off-diagonal entries is −1 if the corresponding entry of A is nonzero and zero otherwise. Since the numerical values of the entries are ignored, L may be regarded as derived from the graph of A. Each edge between nodes i and j provides off-diagonal entries lij and lji with value −1 and adds 1 to the diagonal entries lii and ljj . L is known as the Laplacian matrix for the graph of A. If A is the matrix that arises from the usual 5-point discretization of the Laplacian on a rectangular grid with Neumann boundary conditions, then L and A coincide within an overall scaling factor. Thus, the Laplacian matrix L can be considered as a generalization to any graph of the discrete Laplacian on a rectangular grid. The eigenvector associated with the smallest positive eigenvalue is known as the Fiedler vector, following the work of Fiedler (1975). For a rectangular grid, this eigenvector has contours that are parallel to the shorter side, see Figure 8.6.1. In general, the contours will be smooth and roughly orthogonal to a pseudodiameter. If we order the nodes according to the corresponding values in the Fiedler vector, we get the left-right order that is obviously desirable in this case.

164

ORDERING SPARSE MATRICES FOR BAND SOLUTION

-22.2 -22.2 -22.2 -22.2 -22.2

-20.4 -20.4 -20.4 -20.4 -20.4

-17.0 -17.0 -17.0 -17.0 -17.0

-12.1 -12.1 -12.1 -12.1 -12.1

-6.3 -6.3 -6.3 -6.3 -6.3

0.0 0.0 0.0 0.0 0.0

6.3 6.3 6.3 6.3 6.3

12.1 12.1 12.1 12.1 12.1

17.0 17.0 17.0 17.0 17.0

20.4 20.4 20.4 20.4 20.4

22.2 22.2 22.2 22.2 22.2

Fig. 8.6.1. Values of the Fiedler vector on a rectangular grid. We assume that the matrix is irreducible, which corresponds to its graph being connected. The Laplacian matrix is positive-semidefinite since it is diagonally dominant. It is easy to see that it has the eigenvalue zero corresponding to the eigenvector e with components all one. On our assumption that the graph of the matrix is connected, this eigenvector is simple. To see this, suppose the vector is x and a component with minimum value is xi . Consider component i of the equation Lx = 0. This tells us that all the components xj for which aij (and lij ) is nonzero must have the same value as xi , that is, the minimum value. Now apply the same reasoning to each of these xj . Continuing, since the graph is connected, we find that all the components have the same value. The smallest positive eigenvalue is the second. By the Rayleigh–Ritz principle, the Fielder vector v is the solution of the problem X xT Lx = min min (xi − xj )2 . (8.6.1) xT x=l,xT e=0

xT x=l,xT e=0

aij 6=0,j k,

(10.9.2)

for each calculation of a multiplier lik = aik /akk that is an entry and mij = 1 + max(mik , mkj )

i, j > k,

(10.9.3)

for each Gaussian elimination operation aij = aij − lik akj for which lik and akj are both entries. The entries of the factorization for which mij = 0 need no computation. The calculations for the entries with mij = 1 are independent and may be performed in parallel. Once these are done, each of the calculations for the entries with mij = 2 may be performed in parallel, and similarly for each successive value. As a simple illustration, Grund considered the matrix with the pattern   ×× × × ×    × × × , (10.9.4)    × × × ××× for which the matrix M is 

0 0 1 2   3 0   1 0 1 1

 0 4 5 6

    

(10.9.5)

Grund reports that over many different matrices from the simulation of chemical plants and circuits, he found that the number of levels of independence was in general small. It is therefore disappointing that he reports speedups by a factor of only about 1.5 when using OpenMP on four alpha EV5.6 processors. 10.10

The use of drop tolerances to preserve sparsity

One of the main problems associated with the use of direct methods for solving sets of sparse linear equations lies in the increase in the number of entries due to fill-in. This increase is particularly marked when solving very regular sparse problems, such as those obtained from discretizing partial differential equations. Indeed, the storage of the matrix factorization often limits the size of problem that can be solved. A simple way to extend this limit is to remove from the sparsity pattern (and from any subsequent calculations) any entry that is less than a chosen absolute or relative tolerance, usually called a drop

226

GAUSSIAN ELIMINATION WITHOUT SYMBOLIC FACTORIZE

tolerance. Thus, if during the process of Gaussian elimination, an intermediate value satisfies the inequality (k) |aij | < tola (10.10.1) or the inequality (k)

(k)

|aij | < tolr × max |alj |,

(10.10.2)

l

(k)

for a specified non-negative value of tola or tolr , then aij is dropped from the sparsity pattern and subsequent consideration. For drop tolerances to be useful, they must be set high enough to reduce significantly the number of entries in the factors, but low enough so that the factorization is sufficiently accurate for the incomplete factorization to be useful as a preconditioner for an iterative method. ¯U ¯ say, will be such When using drop tolerances the factorization obtained, L that ¯ U, ¯ A+E=L (10.10.3) where the size of the entries of E may be significant relative to those of A. Hence, the solution of ¯ U¯ ¯x = b L (10.10.4) may differ substantially from the actual solution to the original system, even if the problem is well-conditioned. ¯U ¯ to solve One way of using the incomplete factorization L Ax = b

(10.10.5)

is to use it as an acceleration or ‘preconditioning’ for some iterative method. One of the simplest iterative schemes is that of iterative refinement (also called defect correction), defined for k = 0, 1, 2,... by the equations r(k) = b − Ax(k) , (k) ¯ U∆x ¯ L = r(k) ,

(10.10.6a) (10.10.6b)

and x(k+1) = x(k) + ∆x(k) ,

(10.10.6c)

where the starting iterate x(0) is usually taken as 0 so that x(1) is the solution of equation (10.10.4). Iterative refinement was discussed in Section 4.13 in connection with allowing for rounding in LU factorization. The Richardson iteration of equations (10.10.6) will converge so long as the spectral radius of the matrix ¯ U) ¯ −1 A I − (L (10.10.7) is less than 1.0 (see Exercise 10.12) so that the error matrix E in equation (10.10.3) can have quite large entries. The rate of convergence is, however,

THE USE OF DROP TOLERANCES TO PRESERVE SPARSITY

227

Table 10.10.1 Iterative refinement with drop tolerances on a five-diagonal system of order 10 000 with semibandwidth 100. Times are in seconds on an Dell Precision 650. Convergence to a scaled residual of less than 10−15 . The case in the last column did not converge. Value of drop tol., tola 0. 10−6 10−4 10−2 0.05 3 Entries in factors (×10 ) 453.2 397.9 283.0 142.2 97.7 Time for factorization 0.97 1.56 1.47 0.53 0.32 Number of iterations 1 4 15 291 > 500 Time for solution 0.067 0.102 0.209 2.109 >3.182

dependent on the value of this spectral radius and can be quite slow. We illustrate the performance of this technique in Table 10.10.1 where we have used the HSL code MA48. One of the first codes to incorporate drop tolerances was SSLEST (Zlatev, Barker, and Thomsen 1978) who later used this with iterative refinement in the Y12M code (Østerby and Zlatev 1983). Drop tolerances were used in a 1983 version of the MA28 code and have been included in the MA48 code that has a built in facility both for iterative refinement and error analysis using the work of Arioli et al. (1989). Although there was some discussion of the use of iterative methods that are more rapidly convergent than iterative refinement, for example, Chebyshev acceleration, it is only relatively recently that direct codes have been used with drop tolerances as preconditioners for more powerful iterative methods, such as GMRES (see Arioli and Duff 2008). We will see, in the next chapter, that the use of iterative refinement or more sophisticated iterative methods are important tools when the accuracy of the numerical factorization might be compromised for implementational efficiencies. For symmetric positive-definite matrices, the use of drop tolerances is much better established, although an added complication is that it is often desired to ensure that the preconditioning matrix is also positive definite. This can be important because the iterative technique normally employed in this case is conjugate gradients. A common way to maintain positive definiteness of the preconditioning matrix is to add something to the corresponding diagonal entries whenever an off-diagonal entry is dropped. For example, Jennings and Malik (1977) preserve positive definiteness by adding appropriate quantities to the (k) (k) (k) diagonal entries aii and ajj whenever aij is dropped (see Exercises 10.13 and 10.14). In this section, we have been concerned with dropping entries less than a given absolute or relative tolerance. It is also possible to drop any fill-in occurring outside a particular sparsity pattern (which can be that of the original matrix). Partial factorizations using this criterion, when used as a preconditioning for the method of conjugate gradients, give rise to the ICCG (Incomplete Cholesky

228

GAUSSIAN ELIMINATION WITHOUT SYMBOLIC FACTORIZE

Conjugate Gradient) methods of Meijerink and van der Vorst (1977). These methods have proved extremely successful in the solution of positive-definite systems from partial differential equation discretizations. In the unsymmetric case, the analogue is called ILU(0) and this preconditioner has had some success with unsymmetric systems such as those coming from discretized fluid flow problems. Further consideration of these techniques lies outside the scope of this book and the reader may consult Dongarra, Duff, Sorensen, and van der Vorst (1998), Saad (1996), or van der Vorst (2003) for a more detailed discussion. 10.11

Exploitation of parallelism

There are several places in this chapter where methods can be adapted for a parallel computing environment. Local algorithms remain more challenging for a parallel environment than other approaches to sparse matrix factorizations, and these other ordering algorithms will be considered in subsequent chapters. 10.11.1

Various parallelization opportunities

We observed in Table 10.7.4 that switching to dense form (once sparsity drops to about 50%) has little impact on memory use. Since there are good ways to exploit parallelism for the dense case, and since this last part of the factorization can have a real impact on the performance, the use of this switch is normally very beneficial. Sometimes the user is solving a sequence of problems where the sparsity pattern is fixed, and the same data structures can be used with each new matrix. In Section 10.3, we showed how to do one step of ANALYSE and multiple steps of FACTORIZE. Unless the new matrix depends on the output from the previous solution, we can do the succeeding FACTORIZE steps in parallel. This arises, for example, in parameter studies. Interestingly, the opportunity has opened the door to new ways to optimize a design by sampling the parameter space, performing the solutions, and then using the responses to understand the impact of design parameter changes. A similar approach can be taken to do multiple SOLVEs in parallel when multiple, independent right-hand sides need to be solved. 10.11.2

Parallelizing the local ordering and sparse factorization steps

Less obvious is the work to do a parallel implementation of a local ordering scheme and the subsequent sparse factorization. We look at the work of Davis and Yew (1990) in two parts. First, we consider the ordering objective and why this could lead to a good parallel sparse factorization. Then we consider the steps to achieve this order, again taking advantage of parallelism. The structure they sought to achieve is a block 2×2 matrix where the (1,1) block is diagonal. In this case, all of the diagonal pivots can proceed independently on different processes, updating the 2×2 block. They then recommend working in the same way on the updated (2,2) block recursively until the (2,2) block is sufficiently dense and the final block is dealt with as a

EXPLOITATION OF PARALLELISM

229

dense matrix. At first glance this may appear to be similar to finding a maximum transversal since the matrix has the same structure (see Figure 6.4.9) and we already discussed finding such a structure in Section 6.8. However, the details of their goals matter and it soon becomes apparent that their goals are quite different because of three constraints they add to their algorithm. First, they wanted to ensure numerical stability so they needed to care about the size of the candidate pivots. Secondly, they want to ensure sparsity in the updated (2,2) block. If a pivot was selected that extended the diagonal block but had a dense row in the (1,2) block and a dense column in the (2,1) block, for example, the resulting (2,2) block would be dense. So they constrain the pivot selection to ‘relatively sparse’ rows and columns. Thirdly, they wanted to work in parallel when determining the pivots in the (1,1) block. So there is nothing from the transversal work already discussed that applies. They start with a sequential step that chooses a numerically acceptable pivot with lowest Markowitz count M in the four shortest rows. Thereafter, they treat any entry with Markowitz count up to f M as acceptable, where f is a user-set parameter (typically 2 ≤ f ≤ 8). They mark the columns with an entry in the row of the chosen pivot and the rows with an entry in the column of the chosen pivot. Rows are divided between the processes which all work independently, seeking an entry in an unmarked row and unmarked column that has an acceptable Markowitz count and a numerically acceptable value. When such an entry is found, a critical section is entered; if the row and column of the entry are still unmarked, the entry is recorded as a pivot, the columns with an entry in its row are marked, and the rows with an entry in its column are marked; the critical section is then left. This continues until all the processes have searched their rows. After synchronization, each process independently performs all the updates to the sparsity patterns of its columns. After another synchronization, each process independently performs all the updates to the sparsity patterns of its rows and to the associated values. While Davis and Yew (1990) report from experiments on an 8-processor shared-memory machine that the loss of sparsity was not severe and that the execution times compared favourably with other codes, no production code has been made available. T. A. Davis (2015, private communication) says that this is mainly because the results are non-deterministic. From run to run on the same matrix, the algorithm may find a different pivot order, depending on which process gets into the critical section first. They are not strictly using a Markowitz algorithm, but one in that spirit with slightly relaxed criteria. Koster and Bisseling (1994) (see also Koster 1997) modified MA48 using a similar approach in their code SPLU and were competitive in sequential mode with MA48. On some matrices, they showed good speedups on a process grid on a MEIKO CS2-HA and on a CRAY T3E, but did not develop this work further and never released any publicly available code.

230

GAUSSIAN ELIMINATION WITHOUT SYMBOLIC FACTORIZE

We will see in the next four chapters that there are other approaches to sparse factorization that offer better possibilities for the exploitation of parallelism. Exercises 10.1 By means of a Fortran program that first links all the rows with the same row counts, demonstrate that the rows of a matrix can be ordered in O(n) time so that they are in order of ascending row counts, given that the row counts are already available. 10.2 Show how the auxiliary storage in Table 10.2.1 can be reduced from 6n to 5n integers by holding only one header array. 10.3 Give Fortran code for removing a row from a chain of rows having tau entries and inserting it into a chain of rows having tau1 entries. 10.4 Give an example where scanning rows and columns in the order given by the algorithm in Section 10.2 does not access entries in order of increasing Markowitz count. 10.5 Give an example where the search strategy of Section 10.2 must search more than half the matrix before termination. 10.6 Construct code for performing the operations of Figure 10.3.2 in the case where the structure of L\U is not known in advance. Assume that the entries in the columns of A are stored in order and that the L\U factorization has proceeded to column k − 1. 10.7 Write Fortran code to solve a unit upper triangular system where the matrix is stored by rows. 10.8 Express the permutation (1,2,3,4,5) → (3,5,1,2,4) as a sequence of interchanges and show that the effect of applying them to a vector is the same. Write Fortran code that uses interchanges to reorder a vector in place. 10.9 Write code to solve Lx = Pb when L is unit lower triangular and PT L is held by columns. 10.10 Write loop-free code for SOLVE for the example in Figure 10.8.1. 10.11 Using the operation codes in Figure 10.9.1, write data for the interpretative code for the SOLVE phase for the example in Figure 10.8.1. 10.12 Show that convergence of the method outlined in (10.10.6) depends on the spectrum of the matrix (10.10.7). (k)

(k)

10.13 Find a small quantity that when added to the diagonals aii and ajj ensures (k)

positive definiteness, regardless of the values of the other entries, when entry aij is dropped during the partial factorization of a symmetric positive-definite matrix. 10.14 It may be verified that the matrix   1 0.71 σ A =  0.71 1 0.71  σ 0.71 1 is positive definite for σ in the range 0.0164 < σ < 1. Thus, the two matrices     1 0.71 0.02 1 0.71 0 A1 =  0.71 1 0.71  and A2 =  0.71 1 0.71  0.02 0.71 1 0 0.71 1

EXPLOITATION OF PARALLELISM

231

are ‘neighbours’, but A2 is indefinite. If we were to use the cautious drop tolerance approach of Exercise 10.13 with threshold 0.025, what resulting sparse matrix would be used instead of A2 ? 10.15 In Section 10.7, we advocated switching to full form in the factorization, essentially replacing zeros by entries to take advantage of dense blocks for vectorization and parallelization. In Section 10.10, we advocated replacing small numbers by zero, to gain more sparsity. How might the two potentially conflicting strategies fit together?

11 IMPLEMENTING GAUSSIAN ELIMINATION WITH SYMBOLIC FACTORIZE We discuss implementation techniques for sparse Gaussian elimination, based on an analysis of the sparsity pattern without regard to numerical values. This includes a discussion of data structures for pivot selection, the use of cliques, and the use of both dynamic and static data structures. We examine the division of the solution into the distinct phases of reordering, symbolic factorization, numerical factorization, and solution, indicating the high efficiency with which these steps can now be performed.

11.1

Introduction

The difference between this chapter and the previous one is that here an initial pivotal sequence is chosen from the sparsity pattern alone and is used unmodified or only slightly modified in the actual factorization. When it is applicable, this approach can exhibit a significant advantage over the methods of Chapter 10. Here, ANALYSE will usually be less expensive than FACTORIZE, rather than very much more expensive. Additionally, ANALYSE can usually be performed in place, which allows it to be implemented in static storage. At the completion of ANALYSE, we can compute the work and storage requirements for FACTORIZE provided no numerical pivoting is performed. We begin (Sections 11.2 and 11.3) with the implementation of pivot selection by the minimum degree and approximate minimum degree algorithms, which were introduced in Section 7.3. We then look at pivot selection by the dissection schemes from Chapter 9 and their combination with minimum degree. Using one of these orderings for the actual factorization of a matrix is considered in Section 11.5. The case where additional pivoting for numerical stability is needed is considered in Section 11.6. We described the data structures used to carry out the ordering of a matrix to band and variable band form in Sections 8.2 and 8.3. In Sections 11.7 and 11.8, we describe the data structures to carry out the numerical factorization efficiently using these orderings. We note that these are very different from the data structures used for minimum degree, dissection, and their variants in Sections 11.5 and 11.6. The frontal method is a variation of the variable-band technique that was developed for finite-element problems though it is not restricted to them. It is most straightforward for symmetric and positive-definite systems. The frontal method is described in Sections 11.9–11.12. Direct Methods for Sparse Matrices, second edition. I. S. Duff, A. M. Erisman, and J. K. Reid. c Oxford University Press 2017. Published 2017 by Oxford University Press.

MINIMUM DEGREE ORDERING

233

Because of the power of these methods, we can apply them in cases where diagonal pivot selection might be numerically unstable, but that requires a twostep process. We select the initial ordering in the ANALYSE phase as discussed, but then allow for modifications of the selected sequence in the FACTORIZE phase. This extension can apply for dissection, banded, and frontal methods, so we have considered this option in the discussion of the various methods. The added complication is that numerical pivoting may create additional fill-in, or widen the band in the case of banded systems, but does not necessarily do so. It does increase the times for FACTORIZE, however, and the work and storage forecasts from ANALYSE need not be sustained in FACTORIZE. We conclude the chapter with some remarks on parallelism in Section 11.13. 11.2

Minimum degree ordering

The minimum degree algorithm (Tinney scheme 2) was introduced in Section 7.3 and we will now consider its implementation. To facilitate the pivot selection, we require access to whole rows (or columns) of the matrix. Taking full advantage of symmetry means that only half of the matrix is stored and access to the whole of a row demands access to the row and column of the stored half of the matrix. The simplest alternative is to store the entire matrix, and we make this assumption here. The storage burden is eased by not having to store the numerical values at this stage. The critical factor in the efficient implementation of the minimum degree ordering is to avoid the explicit storage of the fill-ins. Rather, the fill-in is treated implicitly by using generated elements (cliques) to store updates to the reduced matrix. This clique representation of Gaussian elimination was popularized by Rose (1972), and used in a somewhat different way by George and Liu (1981). We will describe the non-finite-element case first, although some simplifications are possible for finite-element problems. The first pivot is chosen as the diagonal entry in a row with least entries, say (k+1) entries. Following the operations involving the first pivot, the reduced matrix will have a full submatrix in the rows and columns corresponding to the k off-diagonal entries in the pivot row. The graph associated with this full submatrix is called the pivotal clique. The important thing concerning the storage of cliques (see Section 2.18) is that a clique on k nodes requires only k indices, while the corresponding submatrix requires storage for 21 k(k + 1) numerical values. Rather than finding out where fill-ins take place and revising the lists of entries in the rows, as we did when working with actual values (see Section 10.2), we keep the original representations of the rows together with a representation of the full submatrix as a clique. To update the count of number of entries in a row that was involved in the pivotal step, we scan the index list of its entries (excluding the pivotal column) and the index list of the clique, incrementing the count each time an index occurs for the first time. For example, in the case illustrated in Figure 11.2.1, the clique associated with the first elimination has index list (3,5,6) and row 3 has index list (3,5,7)

234

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

1 2 3 4 5 6 7

1 2 3 4 5 6 7 × × ×× × × ×× × × × • × × × ×× × × × • × ×× • × • ×× ××××××

Fig. 11.2.1. The first step of elimination. Fill-ins are shown as •.

so the count is incremented to 3 by scanning the clique and to 4 when 7 is found to occur for the first time when scanning the index list (3,5,7). The index of the pivot can remain as an index for the pivotal clique. Note that the index lists of rows outside the clique do not change. Each list may be held as a sparse vector (Section 2.7) or as a linked list (Section 2.10). At a typical stage of the elimination, the pivot is chosen as the diagonal entry in a row with least entries. The set of entries in the pivot row is formed by merging its index list (excluding eliminated columns) with those of all the cliques that involve the pivot. This set, less the index of the pivot, provides the index set of the pivotal clique. Since all the old cliques that involve the pivot are included in the pivotal clique, there is no need to keep them. The number of entries in each altered row is now calculated. As when working with actual values, it is advisable to store the indices of rows with equal numbers of entries as doubly-linked lists. At each pivotal step, each index of a row whose count changes is removed from its list and is inserted in its new list. The algorithm has a very interesting property. Since each new list for a clique is always formed by merging old lists (and removing the pivot index) and the old lists are then discarded, the total number of list entries does not increase in spite of fill-ins. Hence, provided the original matrix can be stored and provided adequate data structures are in use, there is no possibility of failure through insufficient storage. It is very worthwhile to recognize rows with identical structure, since once identical they will remain so until pivotal. Such a collection of rows is known as a super-row and the corresponding variable is known as a supervariable. Index lists of supervariables will be shorter and can be searched more quickly. Only one is needed for each super-row and only one degree needs to be calculated for all the rows of a super-row. We explained in Section 2.14 how to identify efficiently all the supervariables for a given matrix. For testing whether two rows become identical after eliminations, an efficient approach is to use a hashing function. Amestoy et al. (1996a) place the super-rows modified in a pivot step in n hash buckets by using the hash function 1 + s(i) mod (n − 1), where s(i) is the sum of the indices of

MINIMUM DEGREE ORDERING

235

the supervariables and cliques associated with super-row i. A linked list can be used for those in the same bucket. Whenever a super-row is placed in a bucket that is not empty, it is checked against the others in the bucket. This does not recognize a row that becomes identical to a row that is not modified, but they will be so recognized in the next pivotal step that involves them. The graph whose nodes are the supervariables and whose edges correspond to full submatrices in their rows and columns is known as the quotient graph (George and Liu 1981). Once one variable of a supervariable has been eliminated, the others may be eliminated immediately without any further fill-in. This strategy was called ‘mass node elimination’ when it was first proposed. Following such a mass elimination, the pivotal clique has size equal to the number of variables in the pivot row less the number of variables in the pivot block. This is known as the external degree, since it does not count entries in the pivot block. In some cases, choosing the pivot to minimize the external degree instead of the true degree can have a significant effect on the number of entries in the factorization of the reordered matrix. This was proposed by Liu (1985), who reported that he found a reduction of 3–7% in the size of the factorized matrix. Bigger gains have been found since then. Some examples are shown in Table 11.3.1. One reason for this is that the external degree gives a better approximation to the fill-in than the true degree. We note that most modern codes work with a reduced matrix where blocks are represented by a single supernode so that the gains from using external degrees are readily available. Three devices may be employed to reduce the cost of the calculations (see Duff and Reid 1983, and Exercise 11.1): (a) When recalculating the degree of a super-row that is affected in a pivotal step, start with the index list of the pivotal clique, since this is bound to be involved. (b) When scanning cliques during degree calculation, it is easy to recognize any clique that lies entirely within the pivotal clique. Such a clique may be removed without losing any information about the structure of the reduced system. (c) Once an entry of the original matrix has been ‘overlaid’ by a clique, it may be removed without altering the overall pattern. Of course, there may be many rows with least number of entries. Liu (1985) suggested choosing a set with no interactions (none has an index list that includes any of the others). This is known as multiple minimal degree or MMD. The lack of interaction may be helpful when working in parallel. To illustrate these improvements to the minimum degree ordering that we have considered in this chapter, we show in Table 11.2.1 specimen times for five codes. The problems are three of the graded-L triangulations of George and Liu (1978b). Although these are old results on small problems they illustrate the gains well, and the most recent code in the comparison (MA27A) is essentially

236

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

Table 11.2.1 Times in IBM 370/168 seconds for various implementations of minimum degree ordering. Order 265 1 009 3 466 Entries 1 753 6 865 23 896 MA17A (1970) 1.56 29.9 >250 MA17E (1973) 0.68 6.86 62.4 YSMP (1978) 0.24 1.27 6.05 SPARSPAK (1980) 0.27 1.11 4.04 MA27A (1981) 0.15 0.58 2.05 still state-of-the-art for a minimum degree code. Codes MA17A and MA17E are from the HSL Archive and do not use cliques; MA17A uses the storage scheme described in Table 2.13.2 and so needs to run through links to find the row or column index, whereas MA17E additionally holds row and column indices explicitly. YSMP (Yale Sparse Matrix Package—Eisenstat, Gursky, Schultz, and Sherman 1982), SPARSPAK (University of Waterloo), and MA27A (HSL) all use successive refinements of clique amalgamation, which is why they are faster than the first two. Since there is more than one version of each of these codes, the date of the version used in this comparison is given in parentheses. These dates also serve to indicate the advances in the implementation of the minimum degree algorithm. The finite-element case is simpler in that the problem is given in the form of a set of cliques. The degrees of all the variables should be calculated initially from the clique representation without performing any assemblies. Thereafter, the implementation can be just like the non-element case except that the whole pattern is stored by cliques. We return to the finite-element case in Sections 11.10 and 11.11. 11.3

Approximate minimum degree ordering

Despite all the devices mentioned in the previous section, the recalculation of the degrees (or external degrees) is usually expensive, as may be seen from the pseudocode in Figure 11.3.1. To reduce the cost of this step, Amestoy et al. for each supervariable i of the pivotal clique d(i) := size of pivotal clique for row i and each old clique involving i for each supervariable j of row i or old clique if (j not touched yet) d(i) := d(i) + (size of j) end for end for end for Fig. 11.3.1. Pseudocode for degree calculation.

APPROXIMATE MINIMUM DEGREE ORDERING

237

for each supervariable i of the pivotal clique for each old clique k involving i if (first occurrence of k) w(k) := size of clique k w(k) := w(k) - (size of i) end for end for Fig. 11.3.2. Pseudocode for computing w(k) for AMD. do each supervariable i of the pivotal clique d(i) := size of pivotal clique + size row i outside cliques for each each old clique k involving i d(i) = d(i) + w(k) end for end for Fig. 11.3.3. Pseudocode for AMD algorithm. (1996a) suggested replacing the degree by an approximation that has become known as AMD, for Approximate Minimum Degree. It actually approximates the external degree. It involves the preliminary calculation illustrated in Figure 11.3.2. For each old clique k involved, this computes the number of variables w(k) that are in clique k, but are not in the pivotal clique. The calculation of the approximate degree is illustrated in Figure 11.3.3. It is the sum of the number of variables in the pivotal clique, the number of entries in the row that have not yet been absorbed in a clique, and the values w(k) for all the cliques that row i touches. This may be an overestimate, since variables that occur in more than one old clique will be counted more than once. However, it cannot be an underestimate and is exact unless two or more old cliques are involved. The value is reduced to the number of variables left to eliminate should it exceed this. Also, it is reduced to its previous value plus the number of variables in the pivotal clique should it exceed this, since the number of fill-ins cannot exceed the size of the pivotal clique. By comparing Figure 11.3.1 with Figure 11.3.3, it is apparent that the computation of the approximate minimum degrees is much quicker than that of the minimum degrees. This is confirmed by the results of Amestoy et al. (1996a), who ran experiments on 26 actual problems. They found that the time was always reduced and was reduced by a factor of 1.5 or more in 15 of the 26 cases, of more than 5 in 5 cases, and of more than 30 in three cases. They also compared the number of entries in the factor L for AMD with that for minimum external degree and found them to be very similar (within 1% in 18 cases, worse by more than this in 3 cases, and better by more than this in 5 cases). They also compared the factor size with that obtained by minimum degree and found much bigger gains for using external degree than were reported by Liu (1985). The gain was more than 5% in 23 of their 26 cases, greater than 10% in 11 cases

238

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

Table 11.3.1 Results of Amestoy et al. (1996a), comparing AMD with minimum external degree (MED) and minimum degree (MD) on a SUN SPARCstation 10. Name

Order (×103 ) olafu 16.1 bcsstk33 8.7 finan512 74.8 lhr34 35.2 appu 14.0

Entries Entries in L (×103 ) (×103 ) AMD MED MD 1 015.2 2.86 2.86 3.09 591.9 2.62 2.62 2.79 597.0 4.78 4.04 8.23 747.0 3.62 3.74 4.38 1 853.1 87.65 87.61 87.60

Time AMD MED MD 1.83 2.56 2.33 0.91 1.36 1.62 15.03 34.11 46.49 19.56 109.10 125.41 41.75 2 970.54 3 074.44

and a massive 51% in one case. We show in Table 11.3.1 five cases from the results of behaviour for both speed and fill-in. Our conclusion is that AMD is the preferable variant of minimum degree. It is implemented in HSL by MC47. 11.4 Dissection orderings We discussed the various dissection orderings in detail in Chapter 9. These provide good performance for FACTORIZE for many problems, and are particularly applicable to parallel computing environments because of their property of creating independent subproblems. The methods for carrying out the numerical factorizations described in the next two sections are valid for the various dissection schemes as well as the variants of minimum degree described earlier in this chapter. However, they do not allow for the implicit factorization described in Section 9.2. An effective strategy for large problems is to carry out a few steps of dissection and then apply one of the other ordering methods on the resulting submatrices on the diagonal. We discussed algorithms of this kind in Section 9.5. When a variant of minimum degree is used for the blocks on the diagonal, we face two choices. We can simply create the overall ordering for the matrix (the combined ordering obtained from both algorithms) and use the numerical factorization schemes of the next two sections directly. Alternatively, we can exploit the block form as described algebraically in Section 9.2. In this case, would use the numerical factorization methods of the next two sections only on the submatrices on the diagonal, assemble the implicit factors, and ultimately obtain the solution, through the steps in Section 9.2. There is another possibility when a few steps of dissection are taken. We could use a band or variable band method to factorize the submatrices on the diagonal. As we shall see in Sections 11.7 and 11.8, the data structures for factorizing a banded form are quite different from those used when a variant of minimum degree is used. This means the final ordering could not simply be handed over to a numerical FACTORIZE, and the steps of Section 9.2 would be best. We know of no standard codes that offer these alternatives. However, it would be worthwhile to explore these various alternatives for problems from a particular application domain where one of these may consistently outperform others.

NUMERICAL FACTORIZE USING STATIC DATA STRUCTURES

11.5

239

Numerical FACTORIZE using static data structures

We now discuss the numerical factorization of the matrix, that is the generation of the factors L and U, where the ordering has been chosen. As was done for the symbolic factorization, we will process the matrix by rows. When numerical pivoting is not required, for example, when the matrix is positive definite, the data structures generated in the symbolic phase can be used without any modification. This can lead to a very efficient numerical factorization, particularly on parallel architectures. For the symbolic FACTORIZE, we were able to make use of the fact that the sparsity pattern of row k is determined by the patterns of those previous rows that have their first entry after the diagonal in column k. For the numerical FACTORIZE, the situation is not so simple, since any row with an entry in column k will contribute. Nevertheless, a similar linked-list data structure (connecting rows which are used to update the same later row) can be used. Here, each row of U is labelled by its first ‘active’ entry (Gustavson 1972). An entry of U is called active at stage k if it lies in columns k to n. This concept and its implementation are most easily understood by looking at an example. In Figure 11.5.1, access is needed to rows 1, 3, and 4 of U when calculating row 6. These rows, with their first active entry in column 6, will be linked. The arithmetic is normally performed by loading row k of A (k = 6 in the example) into a full vector (as in Section 2.4), which is then successively modified by multiples of the later parts of the rows with entries in column k (rows 1, 3, and 4 in the example). The full vector is then unloaded into the packed row k of U using its known structure. We leave the coding of this as an exercise for the reader (Exercise 11.4). Before continuing to process the next row (row 7), it is necessary to update the linked lists (see Section 2.9) since the previous column (column 6) is no longer active. This update is easily effected when scanning the rows (1, 3, or 4 in our example), since the next entry in the row will be identified and the row can be added to the appropriate list. In our example, row 1 will be added to 1 2 3 4 5 6 7 8 9 10 1 u u u u u 2 u u u u 3 u u u u u 4 u u u u u 5 u u u u 6 a a 0 0 7 a 0 a 0 8 a 0 0 9 a 0 10 a

Fig. 11.5.1. Symmetric Gaussian elimination just before calculating row 6 of U. Key: u are entries of U, a are entries of A, and 0 are entries that fill-in.

240

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

the list for column 10, row 3 to the list for column 9, and row 4 to the list for column 7. Additionally, the newly-generated row 6 of U must be unloaded from the full vector and added to the list for column 7. Notice that the order in which the rows are held in the linked lists is unimportant (the result is the same apart from the effects of rounding). If only the pivotal sequence is given, it is possible to generate the structure for U while performing the numerical factorization, but it is usually more efficient to construct the pattern of U (that is, to do the symbolic factorization) a priori (see Section 12.5) and use a static data structure. This is specially true when further factorizations of matrices with the same pattern are needed. Significant benefits and simplifications stem from the use of a static data structure during numerical factorization because dynamic structures will involve data movement and complicate parallel execution in addition to the fact that the space needed will often not be predictable. Unfortunately, it is hard to achieve this and at the same time allow for numerical pivoting. This discussion is related to that of Section 10.3. There, we had found an ordering and a static data structure for factorizing an unsymmetric matrix for one set of numerical values and hoped that it would be suitable for changed numerical values. We also considered the suggestion of Stewart (1974) to change very small pivots and correct the solution later with the matrix modification formula (see Section 15.2). It leads to overheads in each subsequent SOLVE and the stability is not guaranteed, so iterative refinement is advisable. We prefer to use a structure that allows row interchanges to be performed for the new matrix, see Section 10.4. It is safer, but is likely to exploit the sparsity less well. 11.6

Numerical pivoting within static data structures

We now look at two possibilities for using a static data structure when performing an LU factorization on an unsymmetric matrix. This can be achieved by setting up a data structure that includes entries in all positions that could conceivably fill-in during a factorization with numerical pivoting. George and Ng (1985) have noted that for any row permutation P and factorization PA = LU, (11.6.1) the patterns of LT and U are contained in the pattern of the Cholesky factor U, where T (11.6.2) AT A = U U. We leave this as an exercise for the reader (see Exercise 11.5). This suggests choosing a column permutation Q for A by an ANALYSE on the structure of AT A. P can then be chosen to maintain stability in the knowledge that the pattern of U will suffice to hold LT and U. Thus, a static data structure can be used for L and U, partial pivoting can be used without any further increase in storage, and the fast analysis available for the structure of a symmetric matrix can be used. The principal defect of the method is that U may be very dense

BAND METHODS

241

compared with the L and U of other methods (see Exercise 11.6) and indeed it is always likely to be worse since its structure must accommodate all row orderings P. George and Ng (1985) suggest some ways of trying to overcome this deficiency, involving update schemes where not all of the matrix is used in (11.6.2). George and Ng (1987) suggest preordering the columns of A and at each stage of symbolic FACTORIZE they set the patterns of all the rows with entries in the pivot column equal to their union. The resulting data structure clearly includes the pattern of the L\U factorization that results from any choice of row interchanges. They show that the resulting data structure is usually smaller and is never larger than that of the George and Ng (1985) approach. A simple example of this is a matrix with entries only on the diagonal and all of the last row. Because the normal equations matrix is dense, the first strategy would use full matrices for holding L and U. However, the second strategy would give a pattern for L that is the same as that for A although the structure of U is still a full triangular matrix. It is still likely that the second strategy will seriously overestimate the storage needed for U.

11.7

Band methods

Fixed bandwidth methods are very straightforward to implement. We merely hold the matrix by rows or columns in a normal rectangular array. In Fortran, because arrays are stored by columns, it is generally best to take the array to have as many rows as the bandwidth and as many columns as the order, rather than vice-versa. In this way, the columns are contiguous in storage and the rows have a constant stride one less than the bandwidth. This is important for efficiency. This storage scheme is illustrated for a tridiagonal matrix in Figure 11.7.1. We may use a similar storage pattern for the computed L\U factorization. If the matrix is symmetric, the superdiagonal (or subdiagonal) part need not be stored. For a symmetric and positive-definite matrix, no interchanges are needed for numerical stability. Without interchanges, the symmetry is preserved so an m+1 by n array suffices for a matrix of bandwidth 2m + 1 and order n. A choice is available (see Section 3.9) between a Cholesky factorization a1 c1 b1 a2 c2 b2 a3 c3 b3 a4 c4 b4 a5 c5 b5 a6 c6 b6 a7



0 c1 c2 c3 c4 c5 c6 a1 a2 a3 a4 a5 a6 a7 b1 b2 b3 b4 b5 b6 0

Fig. 11.7.1. Storing a band matrix by columns.

242

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

¯L ¯T A=L

(11.7.1)

A = LDLT .

(11.7.2)

and the symmetric decomposition

We prefer the decomposition (11.7.2) because there is no need to calculate square roots and because the algorithm does not necessarily break down in the indefinite case although it is potentially unstable. Another choice lies in the order in which the elimination operations are performed. In right-looking Gaussian elimination (see Figure 3.9.1), step k of the factorization involves the following minor steps: (k)

dkk = akk ,

(11.7.3a)

and, for i = k + 1, ..., min(n, k + m), (k)

uki = aik , lik =

(11.7.3b)

(k) aik d−1 kk

(11.7.3c)

and (k+1)

aij

(k)

= aij − lik ukj , j = k + 1, ..., i.

(11.7.3d)

Step (11.7.3b) is included because the pivotal row must be stored temporarily for use in step (11.7.3d), although U = DLT as a whole is not stored. The most computationally intensive step is (11.7.3d), but it makes reasonable use of cache1 , since the entries of A that are accessed are nearby in memory, as are those of L and of U. In the left-looking version (Figure 3.9.2), the entries in the factors are found directly by the operations d11 = a11 and, for k = 2, 3, ..., n, ! j−1 X lkj = akj − lki dii lji d−1 (11.7.4) jj , j = mk , ..., k − 1 i=mk

and dkk = akk −

k−1 X

lki dii lki ,

(11.7.5)

i=mk

where mk = max (1, k − m). This is probably faster than right-looking Gaussian elimination since each matrix entry is altered only once, which is less demanding of the memory system. 1 Better

use of cache is made when blocking is used.

BAND METHODS

243

The SOLVE phase involves solving the triangular sets of equations Lc = b

(11.7.6)

LT x = D−1 c

(11.7.7)

and There are two alternative computational sequences for the forward substitution (11.7.6): k−1 X ck = bk − lkj cj , k = 1, 2, ..., n, (11.7.8) j=mk

where mk = max (1, k − m) and, for k = 1, 2, ..., n, (k)

ck = bk

(11.7.9a)

and (k+1)

bi

(k)

= bi

− lik ck , i = k + 1, ..., min(n, k + m),

(11.7.9b)

with b(1) = b. The first sequence may be preferable if L is stored by rows and the second form may be preferable if L is stored by columns. Similar considerations apply to the back-substitution (11.7.7). In the unsymmetric case, interchanges are not needed if the matrix is diagonally dominant and, in this case, the band form is preserved. If contiguously stored vectors are wanted for the inner loop of FACTORIZE, right-looking Gaussian elimination is suitable for storage by rows and left-looking Gaussian elimination is suitable for storage by columns. At the risk of instability, we may continue to use these methods for matrices that are symmetric but not definite, or are unsymmetric but not diagonally dominant. It is not difficult to check for instability by performing iterative refinement (Section 4.13). Alternatively, we may use row interchanges, thereby destroying symmetry in the symmetric case and increasing the stored bandwidth of U from m + 1 to 2m + 1 in both instances. Since these methods at no time require access to more than m+1 rows at once, they will automatically make good use of cache because the rows (or columns) are stored contiguously. Similarly, they are well-suited to working out of memory. We summarize the storage and computational requirements in Table 11.7.1. Table 11.7.1 Leading terms in storage and operation counts for factorization of band matrices of bandwidth 2m + 1 and order n (valid for m  n).

Storage Active main storage

Symmetric and positive definite (m + 1)n 1 m2 2

Diagonally General dominant (2m + 1)n (3m + 1) m2 2m2

Number of multiplications

1 m2 n 2

m2 n

2m2 n

244

11.8

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

Variable-band (profile) methods

As we explained in Section 8.2, a variable-band form (Jennings 1966), also called skyline (Felippa 1975), is preserved if no interchanges are performed. The important case is when the matrix is symmetric and positive definite. The lower triangular part may be stored by rows almost as conveniently as if the bandwidth were fixed. One possibility is to hold the positions of the diagonals. For example, the matrix whose lower triangular part is shown in Figure 11.8.1 might be stored in the arrays shown in Table 11.8.1 where the nine entries are stored by rows in the array value and the ith component of the array row end, i = 1, 5 points to the position in value of the last entry in row i. This is also the diagonal entry in row i. The computational variants (11.7.3) and (11.7.5) for the (fixed) band form both generalize to this case, but the left-looking variant (11.7.5) is more straightforward and probably faster. It is the one normally used; for example, it is used by SPARSPAK (Chu et al. 1984). If rows j and k have their first entries in columns mj and mk , respectively, then the sum in equation (11.7.4) commences at i = max (mj , mk ). Knowing the positions of the diagonal entries allows us readily to calculate the row starts mk when needed. For instance, in Table 11.8.1 row 4 starts at position row end(3)+1 = 5 and ends at position row end(4) = 7; the diagonal is in column 4 so the first entry is in column 4 − (7 − 5) = 2. Apart from this, the left-looking code is no more complicated than in the (fixed) band case and the inner loop is still an inner product between contiguously stored vectors. The problem with right-looking Gaussian elimination (11.7.3) is that there may be some sparsity within the pivot row and column (see, for example, column 2 in Figure 11.8.1). If the first entry in row i is in a column after column k, we have lik = 0 and it is easy enough to skip steps (11.7.3c) and (11.7.3d), but an explicit zero must be stored in uki and used in later steps (11.7.3d). We will be doing some multiplications involving explicit zeros in step (11.7.3d), something we normally try to avoid in sparse matrix computations.   1.7  0.6 3.2      4.5     2.3 4.7 5.7 3.1 6.9 Fig. 11.8.1. The lower triangular part of a symmetric variable-band matrix. Table 11.8.1 Arrays holding the Figure 11.8.1 matrix. Subscript 1 2 3 4 5 6 7 8 9 value 1.7 0.6 3.2 4.5 2.3 4.7 5.7 3.1 6.9 row end 1 3 4 7 9

FRONTAL METHODS: INTRODUCTION

245

If the maximum semibandwidth is m, at no time will access to more than m rows be needed at once. Thus, as for the (fixed) band case, it is straightforward to arrange for out-of-memory working. This facility is provided, for example, by the HSL Archive code MA36. The algorithms of this and the previous section readily generalize to block form. Each subscript in Section 11.7 now refers to a block row or column instead of a single row or column and each superscript refers to a block step. Each row block has the same size as the corresponding column block, so the diagonal blocks dii are all square. If all the blocks are full, the same number of floatingpoint operations will be performed but they will be faster because of less data movement through the caches. Indeed Level 3 BLAS may be employed. If the blocks are not full, more operations will be performed, but there will still be considerable advantages as long as the blocks are nearly full. The block form is advantageous for parallel working, too. Here, the rightlooking version is to be preferred because within a step the operations (11.7.3d) are independent. 11.9

Frontal methods: introduction

In the methods for factorization discussed earlier in this chapter, we have assumed that the matrix A is given. For some cases, however, the matrix is ‘assembled’ as the sum of submatrices of varying sizes. That is, each entry is computed as the sum of one or more values. In this case, we can simply compute the matrix A and then factorize it as before. We observe, however, that if the pivot row and column have been fully assembled, the factorization may begin before the assembly of the rest of the matrix has been completed. This observation may at first seem to be simply an algebraic oddity. The results would be algebraically the same, although if a particular entry were to have the results of the pivot step added before it was fully assembled, the additions would come in a different order, which would probably lead to different rounding errors. However, on further reflection, this idea provides the opportunity to compute and factorize the pieces, while never having the whole matrix available. This opens the door to solving very large problems. The need to assemble the matrix as a sum of pieces is precisely the issue encountered in finite-element problems. Not surprisingly, these methods were developed in the finite-element world before many of the sophisticated algorithms for solving sparse systems were developed. The first published paper for describing this frontal method was by Irons (1970) for the symmetric, positive-definite case. It was extended to unsymmetric cases by Hood (1976). Today, we find this ‘third’ approach to data structures for symbolic factorization to be of great interest in its own right, even beyond the origins of finite-element problems, and we now develop the basic ideas and outline their implementation. The extension of these ideas to multiple fronts regardless of the origin of the problem will offer great opportunity for parallelism, and that part of the discussion is reserved for Chapters 12 and 13.

246

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

11.10

Frontal methods: SPD finite-element problems

Frontal methods have their origin in the solution of finite-element problems in structural mechanics, but they are not restricted to this application nor need the matrix be symmetric and positive definite (SPD). It is, however, easiest to describe the method in terms of its use in this application, and we do this in the first instance. In a finite-element problem the matrix is a sum X A= A[l] , (11.10.1) l

where each A[l] has entries only in the principal submatrix corresponding to the variables in element l and represents the contributions from this element. It is normal to hold each A[l] in packed form as a small full matrix together with a list of indices of the variables that are associated with element l, which identifies where the entries belong in A (see Section 2.18). The formation of the sum (11.10.1) is called assembly and involves the elementary operation [l]

aij := aij + aij .

(11.10.2)

We have used the Algol symbol ‘:=’ here to avoid confusion with the superscript notation of (11.7.3). We call an entry fully summed when all contributions of the form (11.10.2) have been summed. It is evident that the basic operation of Gaussian elimination (k+1)

aij

(k)

(k) (k) −1 (k) akj

= aij − aik akk

(11.10.3)

may be performed before all the assemblies (11.10.2) are complete, provided only that the terms in the triple product in (11.10.3) are fully summed (otherwise we will be subtracting the wrong quantity). Each variable can be eliminated as soon as its row and column is fully summed, that is after its last occurrence in a matrix A[l] (a preliminary pass through the data may be needed to determine when this happens). If this is done, the elimination operations will be confined to the submatrix of rows and columns corresponding to variables that have not yet been eliminated, but are involved in one or more of the elements that have been assembled (we call these the active variables). This permits all intermediate working to be performed in a full matrix whose size increases when a variable appears for the first time and decreases when one is eliminated. The pivotal order is determined from the order of the assembly. If the elements are ordered systematically from one end of the region to the other, the active variables form a front that moves along it. For this reason, the full matrix in which all arithmetic is performed is called the frontal matrix and the technique is called the frontal method. It is perhaps easier to envisage this by examining the process of frontal elimination on a small example. We show such a problem in Figure 11.10.1, where there are three variables associated with each triangle (one at each vertex). To

FRONTAL METHODS: SPD FINITE-ELEMENT PROBLEMS

1

4

3

2

C

A

G

E

5

B

D

I

6

F

J

H

9

8

247

7

10

Fig. 11.10.1. A simple triangulated region.

4 8 1 5

4 × × × ×

8 1 5 ××× × × ×× ×××

4 8 1 5

4 u l l l

8 u × • ×

1 u • × ×

5 u × × ×

Fig. 11.10.2. The leading part of the matrix after the assembly of the first two triangles and after the elimination of variable 4. Entries of L and U are shown as l and u, and fill-ins are shown as •. keep the example simple, we suppose that the element matrices are not symmetric but any diagonal entry can be used as a pivot without fear of instability. We perform the assemblies from left to right (in the order shown lexicographically in Figure 11.10.1). After the assembly of the first two elements, the matrix has the form shown on the left of Figure 11.10.2 if the variables 4, 8, 1, 5 are ordered first. The remaining triangles will make no further contribution to the row and column corresponding to variable 4, which is why we permuted it to the leading position. We may immediately eliminate variable 4 to yield the matrix shown on the right of Figure 11.10.2. Next, we add the contribution from triangle C to give the pattern shown on the left of Figure 11.10.3. Now the row and column of variable 1 are fully summed, so we perform a symmetric permutation to bring it to the pivotal position, here (2,2), and perform the elimination step to give the pattern shown on the right of Figure 11.10.3. The next element is D with variables 5, 8, and 9, and now variable 8 becomes fully assembled and may be eliminated. The procedure continues similarly. The set of active variables (the front) at successive stages is (4,8,1,5), (8,1,5), (8,1,5,2), (8,5,2), (8,5,2,9), (5,2,9), (5,2,9,6), ... . For ordering the elements, the methods of Section 8.5 may be adapted to provide an ordering of the elements. A good element order may be found by first ordering the nodes for small wavefront and choosing an element order such that the elimination order is followed as closely as possible. Sloan (1986) suggests processing the elements in ascending sequence of their lowest numbered nodes. Note that the wavefront corresponds to the front size. In general, the partially processed matrix has the form illustrated in

248

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

4 8 1 5 2

4 u l l l

8 u × × ×

1 u × × × ×

5 u × × × ×

2

× × ×

4 1 8 5 2

4 u l l l

1 u u l l l

8 u u × × •

5 u u × × ×

2 u • × ×

Fig. 11.10.3. The leading part of the matrix after the assembly of triangle C and after the elimination of variable 1. Entries of L and U are shown as l and u, and fill-ins are shown as •. Eliminated Frontal

Eliminated

L11 \U11

U12

Frontal

L21

F(k)

Remainder

0

Remainder

0

Fig. 11.10.4. Matrix of partially processed finite-element problem, just before another assembly. Figure 11.10.4. We show the situation after a set of eliminations and just prior to another assembly. The fully-summed rows and columns in blocks (1,1), (1,2) and (2,1) contain the corresponding rows and columns of L and U. They will not be needed until the SOLVE stage, so may be stored on auxiliary storage as packed vectors. Blocks (3,1) and (1,3) are zero because the eliminated variables are fully summed. Block (2,2) is the frontal matrix, normally held in memory. Blocks (2,3), (3,2), and (3,3) as yet have no contributions, so require no storage. Thus, only the frontal matrix needs to be in memory and full matrix storage is suitable, although the frontal matrix need not be full. We illustrate a non-full frontal matrix with the example shown in Figure 11.10.5. A reasonable order for the elements is pagewise, that is, top to bottom and left to right. After the top two elements have been processed, we will have a front containing the two variables that join the top arm to the rest. We now

FRONTAL METHODS: SPD FINITE-ELEMENT PROBLEMS

249

Fig. 11.10.5. A cross-shaped region. assemble the leftmost element and we have a front, which is block diagonal with blocks of size four and two. As further assemblies are performed, permutations are needed to maintain the form shown in Figure 11.10.4. Variables involved for the first time move into the front and frontal variables involved for the last time (identified by a preliminary pass of the data) are eliminated and move out. The frontal matrix thus varies in size and variables rarely leave it in the same sequence as they enter it. It is obviously necessary to be able to accommodate the largest frontal matrix that occurs. Returning to the finite-element matrix (11.10.1), suppose there is a variable associated with element 1, but with no others. Then, the frontal method will begin with frontal matrix A[1] and will immediately eliminate this variable. The amount of arithmetic needed will depend solely on the size of A[1] itself and not on the size of the overall problem. A variable associated with a later element and no others will enter the front with the assembly of its element and will be immediately eliminated, but now the amount of arithmetic will depend on the current front size. However, since the variable is fully summed within its element, there is no reason why the operations (11.10.3) associated with its elimination should not be done in temporary storage before its assembly into the frontal matrix; then assembly consists of placing the pivot row and column with the eliminated rows and columns, and adding the remaining reduced submatrix (including the modifications caused by the elimination operations) into the appropriate positions in the frontal matrix. This elimination within the element is known as static condensation and means that the true cost of adding internal variables to finite elements is very slight indeed. For instance, the algebraic cost of working with 9-node elements as shown on the left of Figure 11.10.6 is virtually identical with that of working with the usual 8-node element (right of Figure 11.10.6). Adding this simple technique to a frontal code can substantially improve its performance on appropriate problems. We have remarked that variables usually leave the front in a different order from that in which they enter it. For example, a variable associated with a pair of elements enters the front with the first element and leaves it with the second.

250

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

Fig. 11.10.6. A 9-node element and an 8-node element. This property is important to the efficiency of the method, but does lead to slight programming complications. We have already noted that a preliminary pass through the index lists associated with the matrices A[l] is needed to determine when each variable appears for the last time (that is when it can be eliminated). There is also some complication in organizing the actual assemblies and eliminations within the frontal array. When the variables enter the front the corresponding rows and columns are normally placed at the end of the array and the eliminations are performed from the middle with the revised matrix closed up to fill the gaps. There is, of course, a need to maintain an index array to indicate the correspondence with the associated variables. Note that for frontal methods it is only the frontal matrix that need be in memory which is why the technique can be applied to problems where the whole matrix cannot fit in memory. Indeed, one major strength of frontal methods is that it is easy to use them in an out-of-core way where neither the entire input matrix nor the complete factors are held in memory. 11.11 Frontal methods: general finite-element problems The frontal technique can also be applied to systems which need numerical pivoting. To help visualize this, we show in Figure 11.11.1 the situation after an assembly, but prior to another set of eliminations. Blocks (1,1), (1,2), (2,1), and (2,2) correspond to the frontal matrix at this stage. Any entry whose row and column is fully summed (that is, any entry in block (1,1) of Figure 11.11.1) may be used as a pivot. For unsymmetric matrices, we may use the threshold criterion (k) (k) |aij | ≥ u max |alj | (11.11.1) l

where u is a given threshold, see Section 5.2. The pivot must be chosen from the (1,1) block of Figure 11.11.1, but the maximum is taken over blocks (1,1) and (2,1). Note that if unsymmetric interchanges are included, the index list for the rows in the front will differ from the index list of the columns and will require separate storage. In the symmetric case, we may maintain symmetry by using either a diagonal pivot that satisfies inequality (11.11.1) or a 2×2 pivot # " (k) (k) aii aij (11.11.2) Ek = (k) (k) , aji ajj

FRONTAL METHODS FOR NON-ELEMENT PROBLEMS

Front Front

Remainder

251

Remainder

fs

fs

fs

ps

0

0

Fig. 11.11.1. Reduced submatrix after an assembly and prior to another set of eliminations. The block of the front that is partially summed is labelled ps and the parts that are fully summed are labelled fs. all of whose coefficients are fully summed and that satisfies the inequality   (k)  −1  |ail | −1 max u E  l6=i,j  , (11.11.3) ≤ (k) k u−1 max |ajl | l6=i,j

see inequality (5.2.9). It is possible that we cannot choose pivots for the whole of the frontal matrix because of small entries in the front or large off-diagonal entries that lie outside the block of the fully-summed rows and columns (that is, because they lie in block (1,2) or block (2,1) of Figure 11.11.1). In this case, we simply leave the variables in the front, continue with the next assembly, and then try again. It is always possible to complete the factorization using this pivotal strategy because each entry eventually lies in a fully-summed row and in a fully-summed column, at which stage it may be chosen as a diagonal pivot or its row and column can define a 2×2 pivot. Naturally, if many eliminations are delayed, the order of the frontal matrix might be noticeably larger than it would have been without numerical pivoting. We will discuss this further in Section 12.8. Note that this is an example of a posteriori ordering for stability. We choose an ordering for the variables that keeps the front small without worrying about the possibility of instability, and we ensure the stability during the numerical factorization. This is an approach that we shall use extensively in the next chapter. 11.12

Frontal methods for non-element problems

Our discussion in the previous section was based on the application of the frontal method to finite-element problems, but the technique is not restricted to this

252

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE 2

4

6

8

−4

1

3

5

1

1

1 −4

1

1

0 −4

1

1

7

1

Fig. 11.12.1. A 2 × 4 grid and the first three rows of the corresponding matrix (all entries to the right of those shown are zero). Eliminated Remaining columns Eliminated

L11 \U11

U12

0

Frontal

L21

F(k)

0

To come

0

Fig. 11.12.2. Matrix of partially processed non-element problem, just prior to entering a row. application. Indeed, the frontal method applied to a general (assembled) problem can be viewed as a generalization of the variable-band solution (Section 11.8). For non-element problems, the rows may be taken in turn as if they were unsymmetric element matrices. The assembled rows are then always fully summed and a column becomes fully summed whenever the row containing its last nonzero is reached. This situation is illustrated in Figure 11.12.1, where we show the frontal matrix after the assembly of equation 3 from the five-point discretization of the Laplacian operator on a 2×4 grid. After this stage no more entries will appear in column 1, so this column is fully summed and can be eliminated. The counterpart to Figure 11.10.4 is Figure 11.12.2. We have ordered the columns so that the fully summed columns come first, followed by the partially summed columns, followed by those still without entries. After another row is entered, the reduced submatrix has the form shown in Figure 11.12.3, which is the counterpart of Figure 11.11.1. Eliminations may now take place in every fully-summed column and pivots may be chosen from anywhere in the column. Therefore, partial pivoting is possible, satisfying inequality (11.11.1) with u = 1. Without numerical pivoting, this form of frontal solution is very similar to

FRONTAL METHODS FOR NON-ELEMENT PROBLEMS

Front Front

fs

To come

0

fs

253

Remainder 0

Fig. 11.12.3. Reduced submatrix of non-element problem, just after entering a row. The blocks of the front that are fully summed are labelled f s. the variable-band or profile method discussed in Section 11.8. While the ordering of the elements constrains the pivotal sequence, it does not prescribe it. In particular, numerical pivoting is straightforward to incorporate, which gives the method a substantial advantage. As is the case for band techniques, frontal methods may be applied to any matrix. Of course, their success depends on the front size remaining small. In particular, they tend to produce more fill-in and require more arithmetic operations than the local methods of Chapter 7. For example, any zeros in the frontal matrix F(k) of Figure 11.10.4 are stored explicitly and there may be many of them. We illustrated how zeros can occur in frontal matrices while discussing Figure 11.10.5 in Section 11.10. Two of the main reasons favouring band methods also support frontal techniques. The first is that only the active part of the band or the frontal matrix need be held in memory. The second is that the arithmetic is performed with full-matrix code without any indirect addressing or complicated adjustment of data structures to accommodate fill-ins (although it is necessary to keep track of the correspondence between the rows and columns of the frontal matrix and those of the problem as a whole). The avoidance of indirect addressing facilitates vectorization or parallelization on machines capable of it. Both methods can take advantage of blocking. It is important to keep the front small since the work involved in an elimination is quadratic in the size of the front. It is particularly easy to keep the front small in a grid-based problem if the underlying geometry is that of a long thin region. When the region is closer to square, the frontal method will still work, but there is likely to be much more computation. We show how to use these frontal methods while exploiting sparsity in a more general way in Chapter 12. Automatic methods for bandwidth reduction can also be applied for frontal solution (see Section 8.5). For a finite-element problem, manual ordering of the elements for a small front size is much more natural than ordering the variables

254

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

for small bandwidth and is likely to be easier simply because there are usually less elements than variables; in general, however, people prefer this to be done automatically for them. It is interesting that the advantages associated with reversing the Cuthill– McKee ordering (see Section 8.3) for variable-band matrices are automatically obtained in the frontal method since each variable is eliminated immediately after it is fully summed. The HSL code MC43 uses Sloan’s algorithm but works on the elemental unassembled form of the matrix (Scott 1999). Irons (1970) was, to our knowledge, the first to publish a paper on the frontal method. His code was designed for symmetric and positive-definite matrices and was generalized to unsymmetric cases by Hood (1976). Harwell subroutine MA32 (Duff 1981b, 1983, 1984a) incorporates many improvements to Hood’s code and we believe it was the first to accept the data by equations as well as elements, which makes it very suitable for finite-difference discretizations of partial differential equations. Although the MA32 code combined two pivot steps to essentially perform a BLAS operation, it did not use any Level 3 BLAS because they were not developed at that time. The later HSL frontal codes MA42 (Duff and Scott 1996) for unsymmetric matrices and MA62 (Duff and Scott 1999) for symmetric positivedefinite matrices both make extensive use of the Level 3 BLAS. We illustrate the importance of using a good ordering for a frontal scheme in Table 11.12.1, where we show the huge reduction in maximum front size from using the ordering MC43 before running MA62. In most cases, there was not enough memory for MA62 to solve the problem if the preordering was not used. Table 11.12.1 Results of Duff and Scott (1999) showing effect of using MC43 ordering on MA62. Name

Order Number of Maximum front size elements Nat order With MC43 fcondp2 201 822 35 836 6 714 4 374 fullb 199 187 59 738 132 321 2 826 shipsec1 140 874 41 037 93 036 3 654 trdheim 22 098 813 276 324 In later chapters of this book, we will discuss multifrontal techniques, which can be considered as a generalization of frontal schemes. They, too, benefit from working with dense matrices and BLAS kernels, but avoid the large fronts that can occur with frontal methods. Frontal schemes are competitive only if the matrix can be ordered to keep the front size small.

EXPLOITATION OF PARALLELISM

11.13

255

Exploitation of parallelism

Standard implementations of band, variable-band, or frontal methods offer little scope for the exploitation of parallelism other than what is available at a finegrain level from computations on dense blocks. However, an obvious two-fold parallelism was observed in the factorization of tridiagonal matrices in Chapters 7.4 and 7.5 of the LINPACK User’s Guide (Dongarra, Bunch, Moler, and Stewart 1979) by choosing pivots simultaneously from the beginning and end of the matrix. The resulting algorithm is usually referred to as the the burn-at-both-ends or BABE algorithm. This does not entail any more work but, if more is allowed, more parallelism can be extracted by working independently on different partitions of the tridiagonal matrix and several of these attempts are chronicled in the book by Dongarra et al. (1998). We show how the nested dissection ordering trades fill-in for parallelism on a tridiagonal matrix in Section 9.10.4. More recent algorithms based on similar ideas have been developed to exploit parallelism in solvers for general band matrices, for example the SPIKE algorithm of Polizzi and Sameh (2007). We will see, in the next chapter, how the extension of frontal methods to multifrontal methods gives much greater scope for exploitation of parallelism. Exercises 11.1 Given the matrix whose pattern is 

×××  ××××  × × × ×  × × × ×  × × × × ×   ×     × ×



×

× × ××× ××× ××× ××

      ×    ×  × ×

show how the implementation of the minimum degree algorithm in Section 11.2 makes use of element absorption and recognition of identical rows. Assume that the matrix is in pivotal order. 11.2 Apply AMD (Section 11.3) to the matrix of Figure 11.2.1. 11.3 Table 11.3.1 shows minimum external degree and AMD producing comparable numbers of entries in L for the test matrices. However, AMD has dramatically smaller times. What are the factors that might explain this difference? 11.4 Write Fortran code to convert a row of A to a row of U using the strategy outlined in Section 11.5. Include code to readjust the linked lists of rows whose first entries lie in the same column. 11.5 Show that for any row permutation P and factorization PA = LU, the patterns

256

IMPLEMENTING ELIMINATION WITH SYMBOLIC FACTORIZE

T

of LT and U are contained in the pattern of U, where AT A = U U. Hint: Compare with the orthogonal factorization A = QU. 11.6 When using interchanges for numerical stability in the banded case, we state at the end of Section 11.7 that this will destroy ‘symmetry in the symmetric case and increase the stored bandwidth of U from m + 1 to 2m + 1 in both instances.’ Show that interchanges cannot cause the bandwidth to exceed 2m + 1. ¯ of (11.6.2) is significantly denser than the 11.7 Construct an example where the U L\U factors. 11.8 Create the row end array and the first 15 entries for the value array (see Table 11.8.1) for the matrix given in Figure 8.2.4. Use aij for the entry in the (i, j) position of Figure 8.2.4.

Fig. E11.1. A triangulated region. 11.9 Consider the triangulated region of Figure E11.1, where there are single variables associated with each vertex, edge, and interior. If the elements are assembled in the order shown and no numerical pivoting is required, what is the order of the frontal matrix after assembly of element 6 and after assembly of element 19? What is the maximum size of the frontal matrix and when is it achieved? 11.10 Indicate the amount of work and storage needed for the factorization of a symmetric and positive-definite matrix by the frontal method when the front is of uniform width d. 11.11 Use the grid from Exercise 11.9, and assume the elements are assembled in the order shown. When you are beginning to assemble element 7, what would the four entries of Figure 11.10.4 be?

Research exercise R11.1 At the end of Section 11.4, we described possible hybrid orderings which implement a few steps of dissection followed by one of three choices: (i) The direct use of a sparse code on this ordering.

EXPLOITATION OF PARALLELISM

257

(ii) Carrying out a variable-band solution on resulting diagonal blocks and using the implicit factorization methods from Section 9.2. (iii) Using approximate minimum degree on the resulting diagonal blocks. using the implicit factorization methods from Section 9.2. We expect the success of these methods to be domain specific. Pick a domain of interest, and make use of test matrices from that domain to compare and evaluate these alternatives.

12 GAUSSIAN ELIMINATION USING TREES We first show how the frontal technique described in the previous chapter can be generalized. We then discuss a framework for implementing sparse Gaussian elimination following a symbolic choice of pivot sequence. Both the frontal technique and the framework involve trees and we discuss the relationships between them, how they can be constructed, their properties, and changes that can be made to improve the performance. We introduce the additional complexity of accommodating numerical pivoting which we will discuss more in Chapter 13.

12.1 Introduction In this chapter, we continue the theme of matrix factorization using a pivotal sequence that has been chosen from the sparsity pattern alone. Here, we pay little attention to how it was chosen—we just assume that it is known. The band and frontal methods that were discussed in Chapter 11 are useful for matrices that have a small bandwidth or front size, usually because they originate from a long and thin region. Here, we discard this assumption. We begin (Section 12.2) by considering a finite-element problem that is presented as a collection of elements. Static condensation (Section 11.10) is performed on each element and they are then merged into structures that are essentially bigger elements. Static condensation is performed on these and then they are merged into even bigger structures. This continues until a single structure is formed. This merging process has a natural representation as a tree, the element assembly tree. There is considerable freedom for reordering the operations, as long as those at a node are performed before those at its parent. As for the frontal method, the operations at a node are performed on a full matrix and can make use of Level 3 BLAS (which may be executed in parallel). Another tree, the elimination tree, depends on the structure of the filled-in matrix. It is closely related to the element assembly tree, as we show in Section 12.3. As in the element case, there is much freedom to reorder the operations—the only constraint is that operations at a node must be performed before operations at its parent. Efficient ways to construct the elimination tree are considered in Section 12.4 and for the sparsity pattern of the filled-in matrix in Section 12.5. The patterns of data movement during factorization are considered in Section 12.6. For trees with more than two children at the nodes, memory for temporary storage may be saved by reordering their processing, see Section 12.7.1. Changes Direct Methods for Sparse Matrices, second edition. I. S. Duff, A. M. Erisman, and J. K. Reid. c Oxford University Press 2017. Published 2017 by Oxford University Press.

MULTIFRONTAL METHODS FOR FINITE-ELEMENT PROBLEMS

259

to trees known as tree rotations are considered in Section 12.7.2. These change the shape of the tree and lead to different operations being performed, but do not affect the pattern of the filled-in matrix. Amalgamating a node with its parent can lead to faster execution at the expense of more fill-ins, see Section 12.7.3. While most of this chapter is concerned with the symmetric positive-definite case, indefinite problems are considered in Section 12.8. 12.2

Multifrontal methods for finite-element problems

In this section, we assume that the assembled matrix is symmetric positive definite so that numerical pivoting is not needed for stability. We discuss a generalization of the frontal technique that is able to accommodate any ordering scheme based on the structure of the matrix, including minimum degree and nested dissection. Because more than one front is involved, it is called the multifrontal method. As for one front, it is easier to describe in the finiteelement case, which is the topic of this section. The general case is discussed in Section 12.3. The method is also known as substructuring, for reasons that we will explain. The concept of static condensation, discussed in Section 11.10, extends naturally to substructures consisting of groups of elements. Any variables that are internal to a substructure may be eliminated by operations involving only those matrices A[k] that are associated with the substructure. The rows and columns of the resulting reduced matrix are associated with variables that are also involved in elements outside the substructure. They usually correspond geometrically to variables on the boundary of the substructure and are called ‘boundary variables’ even if the geometric analogy is not applicable. The reduced matrix has a form just like that of a large element identified by these boundary variables. Indeed the term generated element is often used (Speelpenning 1978 used the term ‘generalized element’). The grouping of elements into substructures and of substructures into bigger substructures until the whole problem is obtained can be expressed as a tree that is known as an element assembly tree. Associated with each node is a list of indices of the variables of the substructure. This list includes the indices of all boundary variables of the children of the node, but not of their internal variables. The list distinguishes the indices of internal variables from those of boundary variables. Even the frontal method can be expressed this way by regarding each assembly of an element as constructing a new substructure. For the finite-element problem of Figure 12.2.1, the frontal method can be represented by the tree shown on the left of Figure 12.2.2 and the substructuring defined by nested dissection can be represented by the tree shown on the right of Figure 12.2.2. These nested groupings of elements into substructures can also be represented by different bracketings of the sum A=

X l

A[l] .

(12.2.1)

260

GAUSSIAN ELIMINATION USING TREES

1

2

5

6

3

4

7

8

Fig. 12.2.1. A finite-element problem. 15

15

14

8

13

7

12

6

14

13 11

5

10

4

9 1

9

3 2

1

11

10 2

3

4

5

12 6

7

8

Fig. 12.2.2. Element assembly trees for frontal elimination (summation (12.2.2)) and nested dissection (summation (12.2.3)) on the Figure 12.2.1 example. (The index lists associated with the nodes are not shown.)

For our example, the frontal method can be represented by left-to-right bracketing, ((((((A[1] + A[2] ) + A[3] ) + A[4] ) + A[5] ) + A[6] ) + A[7] ) + A[8] .

(12.2.2)

and nested dissection by the bracketing ((A[1] + A[2] ) + (A[3] + A[4] )) + ((A[5] + A[6] ) + (A[7] + A[8] )).

(12.2.3)

In an element assembly tree, each node corresponds to an assembly of a frontal matrix (nothing to do at a leaf node) and subsequent eliminations to form the generated element matrix (also known as the ‘contribution block’). For the element assembly tree shown on the right of Figure 12.2.2, we may proceed as follows: (i) Eliminate all variables that belong in only one element, storing the resulting pivotal rows and columns, and setting the resulting generated element matrices aside temporarily. (ii) Assemble the resulting matrices in pairs and eliminate any variable that is internal to a single pair; store the resulting generated element matrices A[9] , A[10] , A[11] , A[12] . (iii) Treat the pairs (A[9] , A[10] ) and (A[11] , A[12] ) similarly to produce A[13] and A[14] . (iv) Assemble A[13] and A[14] and complete the elimination.

MULTIFRONTAL METHODS FOR FINITE-ELEMENT PROBLEMS

261

Each stage corresponds to one depth of brackets in expression (12.2.3) or one level in the tree. We do the assemblies as indicated by the brackets or the tree, and do each set of eliminations as soon as the corresponding frontal matrix has been assembled. The result is an ordinary LDU factorization, where L = UT and the rows of U are generated as usual at each pivotal step. For simplicity, we begin by considering a very regular example. In general, we can expect the number of nodes in a tree level to increase as we move away from the root, but a tree node may have any number of children and there need be no particular relationship between the number of nodes at successive tree levels, as the examples in Figure 12.2.2 illustrate. At each node, the associated operations consist of assembling the generated element matrices of the children, eliminating any variables that do not appear elsewhere, and storing the resulting generated element. The assembly and elimination operations may be performed in a temporary full array labelled to indicate which variables are associated with its rows and columns. In fact, it is exactly like the frontal matrix of the frontal method. Essentially, we have a number of frontal matrices whose operations are independent. This is why the method is called the multifrontal method. Given an element assembly tree, there is still considerable choice for the order in which operations are performed. For the example on the right of Figure 12.2.2, we could visit the nodes in the natural order, 1, 2, 3, . . .; alternatively, the order 1, 2, 9, 3, 4, 10, 13, 5, 6, 11, 7, 8, 12, 14, 15 would result in the same arithmetic operations being performed. The only requirement that a reordering must satisfy is that every node is ordered ahead of its parent. This is termed a topological ordering of the tree. A topological ordering that leads to a simplification in the data structure for numerical factorization on a uniprocessor is to order the nodes of the tree by postordering (Aho, Hopcroft, and Ullman 1974), using a depthfirst search. In a postorder, the root of any subtree is ordered immediately after the nodes in the subtree. With a depth-first search, each node is ordered when a backtrack from it is performed in the depth-first search. For example, a depthfirst search of the tree on the right of Figure 12.2.2 with priority to the left leads to the second node order given at the start of this paragraph. We illustrate this in Figure 12.2.3, where each node of the tree in Figure 12.2.2 has been labelled 15

14

7

3 1

6 2

4

5

10 8

13 9

11 12

Fig. 12.2.3. The tree of Figure 12.2.2 with the nodes labelled in postorder.

262

GAUSSIAN ELIMINATION USING TREES

with its position in the new order. The advantage of such an ordering is that the generated elements required at each stage are those most recently generated and so far unused. Indeed, in the nodes in the subtree rooted at node k form a sequence i, i + 1, . . . , k − 1, as the example in Figure 12.2.3 illustrates. This means that a stack may be used on a uniprocessor for temporary storage (see Exercise 12.3). Indeed, the use of postordering gives a nested block structure for the U array, as can be seen by comparing Figure B.15 with Figures B.12–B.14 in Appendix B. If a postordering is used, it is sensible to calculate the node sequence and implied pivot sequence during ANALYSE. Given the tree and lists of variables associated with the original finite elements (leaf nodes), we proceed as follows: (i) Perform a depth-first search to establish the node order; for each variable, record which of the original elements with which it is associated is last in this ordering. (ii) Perform a second depth-first search in exactly the same order to establish the pivotal sequence. From the first search, we will know when a variable appears in this search in an original element for the last time. When a leaf node is reached, any variable that appears for the first and last time in the node can be added immediately to the pivotal sequence and labelled in the list as internal. This corresponds to performing a static condensation. Before backtracking from a node, the index lists of boundary variables of the children of the node are merged to make the list for the node. Any variable that occurs in this new list and has appeared for the last time in a visited original element is added to the pivotal sequence and marked as internal. Notice that this process is much quicker than the actual factorization because: (i) Merging a list of length l with a given list needs only O(l) operations (see Section 2.4) whereas assembling an element of order l requires O(l2 ) operations. (ii) Labelling an index in a list of length l needs O(1) operations, whereas one step of Gaussian elimination with a front size l requires O(l2 ) operations. The trees that we have considered so far are binary—every non-leaf node has two children. If the tree has nodes with more than two children, the order of assembly of their generated elements is likely to affect the rounding errors. It can also affect the amount of temporary memory needed. We will defer discussing this in detail until Section 12.7.1. On a parallel computer, different considerations apply. The operations at any two nodes are independent provided neither is an ancestor of the other in the tree. For our simple example, the operations at the leaves might be performed in parallel first, then the operations at nodes 9, 10, 11, 12, then those at 13 and 14, and finally those at the root. We have assumed so far that the element assembly tree is known. Sometimes, this is indeed the case. It might have come from nested dissection. It might have

ELIMINATION AND ASSEMBLY TREES

263

come from the natural substructuring of the underlying problem. However, an assembly tree can also be constructed from an ordering that has been obtained by some other means, as we will explain in Section 12.3. It is also possible to generate trees in a hybrid fashion where, for example, the user provides a coarse dissection as an element assembly tree and the software refines the tree by ordering within the nodes using some heuristic which generates subtrees. This approach was adopted by Reid (1984), who designed his finite-element package to allow the user to provide any number (including none) of substructures; the package then completes the element assembly tree automatically using the minimum degree algorithm. For a fuller discussion of the use of substructures, we refer the reader to Noor, Kamel, and Fulton (1977) and Dodds and Lopez (1980). 12.3 12.3.1

Elimination and assembly trees The elimination tree

In the previous section, we assumed that we started with an unassembled finiteelement problem and a known element assembly tree. We showed that many pivotal orderings were associated with the tree and that they all involved the same arithmetic operations, except for the effects of rounding errors caused by changing the order of additions and subtractions. In this section, we take as our starting point a given symmetric positivedefinite matrix of order n (which might be an assembled finite-element matrix) and a given pivotal order. Again, we consider what other orderings involve the same arithmetic operations, apart from the effects of rounding errors. We will find that this, too, can be represented by a tree. It is called the elimination tree and is used extensively in connection with solving symmetric positive-definite systems. It has a node for each row1 of the matrix. It does not normally have index lists associated with the nodes. To construct this tree, we permute the matrix so that the pivots are in order down the diagonal and we name each node with the position of its row in this permuted matrix. Suppose row k has one or more off-diagonal entries and the first is in the column l > k. In this case, row k must appear ahead of row l in a new pivotal order since otherwise the numerical values in row l when it becomes pivotal will be different. We represent this by an edge (k, l) and the graph consists of all edges corresponding to these first off-diagonal entries. For example, in the matrix of Figure 12.3.1, rows 1 and 2 may be interchanged without having any effect on the operations performed, but row 1 must precede row 3. The graph representing all such precedences for this matrix is shown on the right of this figure. Because each node k has a single node l with k < l to which is it connected by an edge, we can treat node l as its parent. From any node k1 , we can follow a sequence of edges to nodes k1 < k2 < . . . < kr , ending at a node kr corresponding to a row that has no entries to the right of the diagonal. In our example, only the final 1 We

work with rows here, but could have worked with columns, as the matrix is symmetric.

264

GAUSSIAN ELIMINATION USING TREES

×

×

×× × × × × × • ×× • ×

5

(5)

×

3

4

(3,5)

(4,5)

1

2

(1,3)

(2,4,5)

Fig. 12.3.1. A 5 × 5 matrix, with each fill-in as •, its elimination tree when the pivots are in natural order, and corresponding assembly tree.

row is such a row. The subgraph containing all such paths terminating at kr is a tree with root kr and the graph as a whole is a forest. One such root is node n, since there is no l > n. Often this is the only root and we have a tree rather than a forest. We note that for a given matrix and a given pivotal sequence, the elimination tree is uniquely determined. This is a direct consequence of the way we constructed it. It may be stored in a single vector of size n, with parent(k) holding the parent of node k or a special value, such as 0 if k is a root. We now show that, for any node, each entry of its pivotal row corresponds to an ancestor in the elimination tree. Assume that this is true for the parent of node k. The first entry, l, of row k corresponds to its parent. For any other entry m > l, the elimination operations at step k will fill entry (l, m), if not already filled (‘accidental’ zeros that occur when a nonzero entry becomes zero are retained as entries). From our assumption, it follows that m is an ancestor of l, so is an ancestor of k. The hypothesis is therefore true for node k. Since the hypothesis is true trivially for the children of a root, it is true throughout the graph. We next show that any topological ordering of the nodes (an ordering such that every child node is ordered ahead of its parent) corresponds to a pivotal order that gives the same intermediate results (apart from rounding errors). When row k is pivotal, suppose the other entries of the row are in columns j1 , j2 , . . . jl . The elimination operations will make additions to rows j1 , j2 , . . . jl , all ancestors of k. For example, in Figure 12.3.1, the first elimination made an addition to row 3 and the second elimination made additions to rows 4 and 5. In the general case, any one of these rows j1 , j2 , . . . jl may also receive additions from other descendants before it becomes pivotal. Different topological orders may result in the additions being made in a different order, but the result will be the same, apart from rounding errors. The result stated in the first sentence of this paragraph is true trivially at the leaf nodes. It follows by induction that it is true for all nodes. There is a simple relationship between the elimination tree of this section and the element assembly tree of Section 12.2, assuming that all the element matrices there are full. An elimination tree may be converted into an element assembly tree by adding to each node leaf children corresponding to the entries of the upper triangular part of the original row. This corresponds to treating each

ELIMINATION AND ASSEMBLY TREES

265

diagonal entry and each pair of off-diagonal entries as an element. In practice, this is not done. Instead, each node is associated with the corresponding row of the upper triangular part of the original matrix, which may be regarded as an implicit representation of the leaf nodes. Since this is used more than the element assembly tree, we call it the assembly tree. A simple example is shown in Figure 12.3.1. The index lists that are associated with the nodes of an assembly tree may be constructed recursively. For each node, we merge the original list for the row of the matrix with the lists of the generated elements at the children. At each node of the resulting assembly tree, a single variable will be eliminated. However, if the variables of a node of an assembly tree are those of a child’s generated element, the node and that child may be amalgamated without any loss of sparsity2 . Note that such amalgamations will not happen for a tree that has been constructed as in Section 12.2. For the reverse conversion, we discard any leaf nodes with no eliminations and insert k − 1 additional nodes as parent, grandparent, . . . of each node with k > 1 eliminations. Only the first variable eliminated at the node is now eliminated there; the second is eliminated at the new parent, the third at the new grandparent, . . . . The node-amalgamation process of the previous-but-one paragraph may be applied recursively to get back the structure of the assembly tree (Exercise 12.7). We cannot get the index lists of the assembly tree back without knowing the index lists of the rows of the original matrix. The significant differences between elimination trees and assembly trees are: (i) Only one elimination is associated with a node of an elimination tree. (ii) Normally, there is an index list associated with each node of an assembly tree. The index lists associated with an assembly tree mean that it implicitly defines the sparsity pattern of the filled-in matrix, but the lists need to be stored. An elimination tree is very economical in storage — only one integer, indexing the parent, is needed for each node. However, it does not define the sparsity pattern of the filled-in matrix. For example, the two sparsity patterns of Figure 12.3.2 have the same elimination tree. ××× ××× ×××

×× ××× ××

Fig. 12.3.2. Two sparsity patterns with the same elimination tree.

2 If two children each have this property, they cannot both be amalgamated with the parent without loss of sparsity.

266

GAUSSIAN ELIMINATION USING TREES

12.3.2 Using the assembly tree for factorization Having constructed the assembly tree, we may use it to factorize the matrix, or another matrix with the same sparsity pattern but different numerical values. We visit the nodes in some topological order. At each node, we add together the original ‘elements’ (rows of the upper triangular part of the matrix when the diagonal entries are in pivotal order) and the generated elements from the children (if any), perform the elimination operations with the packed matrix, store the calculated rows of U, and hold the generated element for use later. Note that changing the order of the assemblies can lead to slightly different results due to rounding. There are cases where it is required to produce precisely the same results on a later run using the same data. An example where this is useful is in the code development phase, when we want to ensure that changes to the code do not change the answer. Another is that an external regulatory requirement may make it important to produce reproducible results. Special care is needed in the implementation to ensure that the result is repeated, particularly when running in parallel. Hogg and Scott (2013a) achieve it by always assembling the original rows and the generated elements at a node of the tree in the same order. Just as for the frontal method, the original matrix need not be held in memory for the multifrontal approach. We can read a part of the original matrix from auxiliary storage only when the node at which that part is first needed becomes active. If many variables are eliminated at a node of the assembly tree, we may use Level 3 BLAS, which helps the execution speed. In Section 12.7.3, we will discuss the use of node amalgamations that lead to some loss of sparsity. This is tolerated for the sake of faster execution. It also provides some savings in integer storage for the factors. 12.4 The efficient generation of elimination trees The elimination tree can be generated from the collection of index sets of the rows of the upper triangular part of the matrix in pivotal order. At the beginning of stage k, the collection consists of the original index sets of rows k to n and index sets generated at previous stages. In stage k, we merge all the index lists in the collection that include index k and remove k itself to create a new index list. The lists that were used are discarded and the new list is added. For each old index list used, an edge from the corresponding node to node k is added to the graph. Note that since the merged list cannot be longer than the total length of all the lists merged, the amount of memory needed for the lists never exceeds that needed for the upper triangular part of the original matrix. Liu (1986a) showed that the tree can be generated more efficiently, essentially in time proportional to the number of entries in the original matrix. Code for this is shown in Figure 12.4.1, but we need to describe the algorithm to make this code understandable. We first consider what happens if we try to generate the elimination tree for a symmetric matrix that is reducible, that is, a permutation

THE EFFICIENT GENERATION OF ELIMINATION TREES

267

! Given the pattern of the upper triangular part of A in arrays ! col_start and row_index, construct the tree in array parent. do k = 1,n parent(k) = n + 1 ! Flag node k as a root root(k) = n + 1 do kk = col_start(k), col_start(k+1)-1 i = row_index(kk) do while (root(i)i such that aij is an entry of A row count(i) = row count(i) + 1 if last(j)/= then find root k of tree containing the node last(j) row count(k) = row count(k) - 1 end if last(j) = i end for add parent pi of i to tree containing i row count(pi) = row count(pi) + row count(i) - 1 end for

Fig. 12.5.1. Counting the number of entries in each row of U.

270

GAUSSIAN ELIMINATION USING TREES

To prove that this code calculates the counts correctly, we first note that when a leaf node is encountered, its row count will be found directly in the outer for loop. Now assume inductively that the row count for any node i will have been found correctly by the end of its iteration of the main loop. It will not be changed thereafter since all subsequent changes are made to nodes that appear later in the postorder. At this point, a partial count will have been made for each node that has not yet been reached in the postorder, but has a child that has been reached. At the end of the main loop for the first child of such a node, the count of the node will be given that of the generated element coming from this child. If an index of this list also appears in the generated element from the second child, that index must appear in an index list of a row of A that is reached from that child and the count will be decremented just once, the first time it occurs. Therefore, once the main loop for the second child is complete, the parent will have the correct count for the merged list of the two generated elements. A similar argument can be applied for each subsequent child so that after the last has been processed, the parent will have the count of the merging of the lists of all its children. Now the list of the parent node is processed, at which point it is the root of its own tree. It therefore adds one to its count for each j in its list, but subtracts one if that index has already appeared. Therefore, once the processing of the parent is complete it has the correct row count. This completes our inductive proof that the code finds all the counts correctly.

12.6

The patterns of data movement

It is interesting to consider the different patterns of data movement that result from the different orderings of the operations associated with a single tree. For simplicity, we first consider the case where one variable is eliminated at each node of the assembly tree, so that the elimination tree is the same. In the multifrontal method, illustrated in Figure 12.6.1, after processing at a node the generated element of that node is passed to its parent. In the right-looking method, illustrated in Figure 12.6.2, after each row of U has been found, all the rows that its pivotal operations affect are modified. These are its ancestors in the tree but not all ancestors need be involved. We could perhaps have called this ‘down looking’ (within U), but it is usually known as

Fig. 12.6.1. Data transfer in a multifrontal symmetric factorization.

MANIPULATIONS ON ASSEMBLY TREES

271

Fig. 12.6.2. Data transfer in a right-looking symmetric factorization.

Fig. 12.6.3. Data transfer in a left-looking symmetric factorization.

‘right-looking’ because this is how it looks when finding the columns of L, which is the same computation. In the left-looking method, illustrated in Figure 12.6.3, the updating of each row of U is delayed until just before it is pivotal. There are contributions from its descendants in the tree, but not all descendants need be involved. It is known as ‘left-looking’ because the equivalent updating of a column of L requires accessing data to its left. We looked at these operations from a matrix point of view in Figures 3.8.1 and 3.8.2. We have seen that, in general, the assembly tree has more than one elimination at each node and each of its nodes corresponds to a chain in the elimination tree. It is computationally more efficient to eliminate a block at a time. Figure 12.6.1 might represent such an assembly tree and we note that now each data movement follows many eliminations at the node. Each node of the graphs of Figures 12.6.2 and 12.6.3 might represent a chain of nodes in the elimination tree.

12.7 12.7.1

Manipulations on assembly trees Ordering of children

In Section 12.2, we explained that a stack may be used for temporary storage during multifrontal factorization on a uniprocessor. We now consider how to reduce the amount of stack space needed. This does not affect the number of floating-point operations performed, but it may affect the speed because of the overheads of moving data between caches and main memory, and may even allow the solution of a problem that would otherwise not be possible without going out of main memory.

272

GAUSSIAN ELIMINATION USING TREES

A simple strategy is to wait until all the children of a node have been processed, and then allocate the frontal matrix and assemble all the generated element matrices from its children into it, releasing their storage. Different orderings of the children can substantially affect the stack size needed. If node i has children cj , j = 1, 2, . . . ni , the size of the generated element matrix at node cj is gcj , and the stack storage needed to form that generated element matrix is scj , the stack storage needed to form the generated element matrices at c1 , . . . cni is j−1 X tcj = gck + scj , j = 1, 2, . . . , ni . (12.7.1) k=1 i We seek an order for the children that minimizes maxnj=1 tcj . Liu (1986b) showed that this is obtained if the children are ordered so that scj − gcj decreases monotonically. To show this, suppose an optimal ordering has a pair such that scm+1 − gcm+1 > scm − gcm and we swap children m + 1 and m. This does not Pm−1 affect tcj for j < m or j > m + 1. tcm+1 changes from k=1 gck + gcm + scm+1 Pm−1 to k=1 gck + gcm+1 + scm , which is a reduction in view of our supposition. The Pm−1 new value of tcm is k=1 gck + scm+1 , which is less than the old value of tcm+1 . It follows that the swap cannot increase the maximum value of tcj , so the new sequence is also optimal. The argument can be repeated until scj − gcj decreases monotonically. Pni For very wide trees, a node may have many children so that k=1 gck is large and reordering the children becomes less effective. Liu (1986b) therefore considered choosing the first child to maximize sc1 and allocating the frontal matrix after it has been processed. The generated element from each child can then be assembled directly into the frontal matrix for the parent, which avoids the potentially large number of stacked generated elements and takes advantage of overlaps in the index lists of the children. However, numerical experiments have shown that this can sometimes perform poorly because a chain of frontal matrices (at the active tree levels) must be stored. This led Guermouche and L’Excellent (2006) to propose computing, for each node, the optimal point at which to allocate the frontal matrix and start the assembly. Suppose the frontal matrix for node i has size fi . If it is allocated and the assembly started after pi children have been processed, Guermouche and L’Excellent show that the total storage ti needed to process node i satisfies the equation ! ! j−1 pi X X ti = max max gck + tcj , fi + gck , fi + max tcj . (12.7.2)

j=1,pi

k=1

k=1

j>pi

Their algorithm for finding the split point, that is, the pi that gives the smallest ti , then proceeds as follows: for each pi (1 ≤ pi ≤ ni ), order the children in decreasing order of tcj , then reorder the first pi children in decreasing order of tcj − gcj . Finally, compute the resulting ti and take the split point to be the pi

MANIPULATIONS ON ASSEMBLY TREES

273

Table 12.7.1 Comparison of the stack storage required for different strategies for summing and allocating the children. Matrix

Stack storage required In-place expansion Equation (12.7.2) Allocate at end Allocate at start audikw 1 1.12 1.43 1.13 gupta3 28.00 1.02 1.17 grid481 1.00 1.77 1.30 saylr1 1.15 1.23 1.06 thermal 1.00 1.63 1.30 vibrobox 1.14 1.17 1.16 1 Not in the Florida sparse matrix collection.

that gives the smallest ti . Guermouche and L’Excellent prove they obtain the optimal ti . Duff and Reid (1983) suggested that the generated element of the final child be expanded in-place to the storage of its parent, following which the generated elements of the other children are merged in. Guermouche and L’Excellent (2006) suggested that this be done at the split point, which reduces the total storage needed to process node i to ui satisfying the equation ! ! j−1 pX i −1 X ui = max max gck + ucj , fi + gck , fi + max ucj . (12.7.3) j=1,pi

k=1

k=1

j>pi

We illustrate the savings from using the assembly and allocation strategies of equations (12.7.2) and (12.7.3) in Table 12.7.1. We show the ratio of the stack memory required to the optimal strategy given in equation (12.7.3). We are grateful to Abdou Guermouche for giving us the data used to generate the results in his paper (Guermouche and L’Excellent 2006). We selected the results on six matrices out of his runs on 44 matrices, including matrices exhibiting the highest and lowest ratios in the three columns. The trees were generated from an AMD ordering of the matrices. We see that these strategies can reduce the storage required and that the optimal strategy in equation (12.7.3) is indeed the best, sometimes by a significant margin. Indeed, the matrix gupta3 has a tree where some nodes have children that overlap significantly with the parent leading to high storage requirements if they are all stacked before the parent node is allocated. 12.7.2

Tree rotations

Liu (1988) considered whether beneficial changes can be made to the ordering without altering the pattern of the filled-in matrix. While he worked with the elimination tree, it is much easier to understand the ideas in terms of the assembly tree. The assembly tree, with each node accompanied by its list of variables, defines the pattern of the filled-in matrix. It is just the sum of the patterns of the square

274

GAUSSIAN ELIMINATION USING TREES

1 3 2 19 10 12 11

7 8 9 20 16 17 18

4 43 22 28 25 6 44 24 29 27 5 45 23 30 26 21 46 40 41 42 13 47 31 37 34 15 48 33 38 36 14 49 32 39 35

Fig. 12.7.1. Nested dissection ordering of a 7 × 7 grid. 43

40

19 7

16

3 1

12

6 2

4

5

10

24

15 11

37

28

13 14

22

27 23

25

26

33 31 32

36 34 35

Fig. 12.7.2. Nested dissection assembly tree. Each node is labelled by the smallest index in the list of indices of variables at the node. full submatrices defined by the lists of variables. Some of the variables at a node are eliminated there. Those not eliminated there appear in the list of the parent node. No variable is eliminated at more than one node. Conversely, these two properties of lists at tree nodes are sufficient to identify the tree as an assembly tree. At any node, there is room to accommodate the generated element matrices of its children and the elimination can be performed with no fill-in. We illustrate this with the example of Liu (1988), where he used nested dissection to order a matrix whose graph is a 7×7 grid with every node connected to all its neighbours in 8 directions. The ordering is shown in Figure 12.7.1 and the assembly tree is shown in Figure 12.7.2. We use the positions in the ordering of Figure 12.7.1 to identify the variables. Part of this tree, with the index lists at the nodes, is shown in Figure 12.7.3. Not shown are the descendants of the nodes at which variables 19–21, 28–30, and 33 are eliminated. Of course, any topological ordering of the assembly tree leads to no further fill-ins. Liu goes further, looking for changes to the assembly tree that still lead to no further fill-ins and considers rotations where the order of the nodes on the path to the root from a given node is reversed, while all other nodes retain their existing parents. The nodes retain their lists of variables, so the same matrix pattern is represented, but different variables are eliminated at the nodes of the rotation path so that variables not eliminated at a node always appear in the list of its parent.

MANIPULATIONS ON ASSEMBLY TREES

275

Fig. 12.7.3. Part of the nested dissection assembly tree. Indices of eliminated variables are in italic.

Fig. 12.7.4. Nested dissection assembly tree after rotation. Each node is labelled by the smallest index in the list of indices of variables at the node.

Fig. 12.7.5. The part of the assembly tree after rotation that corresponds to the part in Figure 12.7.3.

Liu (1988) illustrates this by rotating the tree of Figures 12.7.2 and 12.7.3 about the node at variable 35. The result is shown in Figures 12.7.4 and 12.7.5. It may readily be verified that the two properties mentioned in the second paragraph of this section hold, so this is an assembly tree.

We now show that, in general, a tree rotation gives a new tree that is an assembly tree. A variable that appears in the lists of the nodes on the path from the given node to the root appears in a sequence of nodes from its first appearance at node i until the node j at which it is eliminated, and in no other nodes of the path. In the reversed path, it appears in nodes j to i and no others. We treat it as eliminated at node i instead of node j. It is still eliminated at only one node. Now consider a node p on the path and its original parent q. In the new tree, p is the parent of q and any variable of q that is not a variable of p will be eliminated at q. Therefore, all the uneliminated variables of q are variables of its parent p. Any node not on the path still has the same parent; any variable eliminated there is still eliminated there only and all its uneliminated variables are still variables of its parent. We have therefore shown that the new tree is an assembly tree.

Tree rotations can be used to change the shape of a tree. For example, the tree in Figure 12.7.2 is balanced and short, but after rotation it is unbalanced and long, see Figure 12.7.4. A short balanced tree is desirable for parallel programming, whereas a tall unbalanced tree may save temporary storage for sequential programming. Liu (1988) found memory gains of between 3% and 21% for the multifrontal method on k-by-k grid problems with different values of k. He rotated about a leaf node chosen in a search starting at the root and successively taking the child that required the most memory.

Another possible application for tree rotations is to increase the size of the root node of the assembly tree in the case where the root is not the node with the most eliminations. For example, MUMPS (Amestoy, Duff, L’Excellent, and Koster 2001), which is a distributed-memory multifrontal code for symmetric or unsymmetric matrices, performs parallel computation by subdividing the frontal matrices into block rows except at the root, where it uses a 2-D distribution by blocks.
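On a tree stored as a map from each node to its parent (None for the root), the path reversal itself is a few lines; the sketch below is our own illustration, not Liu's code, and it changes only the tree shape. Reassigning which variables are eliminated at the nodes of the path, as described above, is not shown.

```python
def rotate(parent, v):
    """Reverse the path from node v to the root: v becomes the new root and
    every node that is not on the path keeps its parent."""
    path, u = [], v
    while u is not None:               # walk from v up to the old root
        path.append(u)
        u = parent[u]
    new_parent = dict(parent)
    new_parent[v] = None               # v is the new root
    for child, par in zip(path[1:], path[:-1]):
        new_parent[child] = par        # reverse each edge on the path
    return new_parent
```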

12.7.3 Node amalgamation

We remarked near the end of Section 12.3.1 that, if the variables of the generated element at a node of an assembly tree are those of its parent, the node and its parent may be amalgamated without any loss of sparsity. There are two examples of this in Figure 12.7.5: nodes 36 and 19. Indeed, this condition will always hold for a node that is an only child. Also, a node with no eliminations can be amalgamated with its parent. This can occur through tree rotation and is illustrated by node 43 in Figure 12.7.5.

There are inefficiencies associated with eliminating a small number of variables at a node. Level 3 BLAS will be less effective in this case because the whole of the matrix at the node will need to be copied in and out of cache and registers for the sake of only a few arithmetic operations on each entry. Furthermore, the generated element matrices from the children must be added to the matrix at the parent before the actual elimination operations are commenced. Duff and Reid (1983) therefore suggested amalgamating a node with its parent if both involve less than a given number of eliminations, nemin. To retain the node order given by a depth-first search, they amalgamate a child with its parent only if it is the last of its siblings to be visited. The HSL code MA57 (Duff 2004) uses essentially the same strategy. Reid and Scott (2009) point out that better results may be achieved by looking for all possible amalgamations before performing the depth-first search.

We show, in Tables 12.7.2 and 12.7.3, results on the effect of varying nemin when running the HSL code MA57. We note that MA57 performs far fewer amalgamations because of its restricted choice. Note also that larger values of nemin are likely to be of benefit on a parallel computer.

Table 12.7.2 Comparison of the number of nodes with eliminations and the number of entries in L for MA57 for different values of the node amalgamation parameter nemin.

              Thousands of nodes              Millions of entries in L
nemin         1     4     8     16    32      1     4     8     16    32
pwtk         20.8  19.7  13.7  11.0   8.8     49    49    49    50    53
ship_003     12.0  12.0   7.6   6.1   4.7     60    60    61    61    63
bmwcra_1     16.6  10.6   8.6   6.5   5.5     70    70    71    72    73
nd12k         4.1   1.8   1.2   0.8   0.6    117   117   117   117   118

Table 12.7.3 Comparison of the factorize and solve phase times (single right-hand side) for different values of the node amalgamation parameter nemin. Runs of MA57 on a Dual-socket 3.10 GHz Xeon E5-2687W, using the Intel compiler with O2 optimization.

              Factorization times                 Solve times
nemin         1      4      8      16     32      1      4      8      16     32
pwtk          2.72   2.68   2.59   2.50   2.47    0.083  0.083  0.082  0.082  0.084
ship_003      8.88   9.02   8.16   7.32   6.97    0.080  0.080  0.078  0.078  0.079
bmwcra_1      5.73   5.63   5.59   5.41   5.39    0.095  0.095  0.095  0.095  0.096
nd12k        74.59  69.55  59.85  48.74  44.84    0.130  0.140  0.135  0.124  0.122
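The nemin rule is easy to express on a tree held as a parent array with the nodes numbered in postorder. The sketch below is a deliberately simplified version: it merges a node into its parent whenever both currently eliminate fewer than nemin variables, ignoring the restriction to the last-visited sibling used by Duff and Reid (1983) and MA57 and the more global search of Reid and Scott (2009). All names are our own.

```python
def amalgamate(parent, ne, nemin):
    """Greedy node amalgamation.  parent[v] is the parent of v (None for the
    root), nodes are numbered in postorder (children before parents) and ne[v]
    is the number of eliminations at node v.  Returns, for each node, the node
    into which it has finally been merged."""
    n = len(ne)
    ne = list(ne)                          # work on a copy
    target = list(range(n))                # target[v] == v means v survives

    def rep(v):                            # follow merges to the surviving node
        while target[v] != v:
            v = target[v]
        return v

    for v in range(n):
        if parent[v] is None:
            continue                       # never merge the root
        p = rep(parent[v])
        if ne[v] < nemin and ne[p] < nemin:
            target[v] = p                  # merge v into its (surviving) parent
            ne[p] += ne[v]
    return [rep(v) for v in range(n)]
```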

12.8 Multifrontal methods: symmetric indefinite problems

Just as the frontal approach was shown to be applicable to indefinite systems of equations, the same is true of the multifrontal technique. We already discussed in Section 12.3 how the assembly tree relates to the elimination tree and how it can be used to represent the factorization of a matrix in assembled form. In the multifrontal context, at any node of the assembly tree the contributions from the children are summed (assembled) with original rows of the matrix to produce a frontal matrix. We illustrate this frontal matrix by the form

\[
\begin{pmatrix} \mathrm{fs} & \mathrm{fs} \\ \mathrm{fs} & \mathrm{ps} \end{pmatrix}
\]

where blocks labelled fs are fully summed so that pivots can be chosen from anywhere within the leading block. The numerical factorization is particularly straightforward if numerical pivoting is not required. The fully summed variables at each stage are eliminated (that is, used as pivots) and the resulting Schur complement (equal to the updated ps block) is passed to the stage represented by the parent node. Since these elimination operations are performed on a dense matrix, blocking and higher-level BLAS can be used for efficiency. However, the situation in the indefinite case is more complicated and we may pay a high cost for numerical pivoting in terms of both storage and execution time.

As for the frontal method, the multifrontal approach can be extended to incorporate numerical pivoting. 1×1 pivots from the diagonal are used only if they satisfy inequality (11.11.1), and are supplemented by 2×2 pivots, (11.11.2), satisfying inequality (11.11.3). We assume that after pivoting using these criteria the frontal matrix will have been permuted to the form

\[
\begin{pmatrix} F_{11} & F_{12} & F_{13} \\ F_{12}^T & F_{22} & F_{23} \\ F_{13}^T & F_{23}^T & F_{33} \end{pmatrix} \tag{12.8.1}
\]

where we have chosen pivots for all of F11, but there are no entries in F22 that satisfy our threshold pivoting test because of the presence of relatively large entries in F23. At this point, no further pivots can be chosen and we update the trailing submatrix

\[
\begin{pmatrix} F_{22} & F_{23} \\ F_{23}^T & F_{33} \end{pmatrix}, \tag{12.8.2}
\]

which is then passed to the parent node. As a result, the frontal matrices stored during factorization may be larger than in the corresponding positive-definite case. It is usual to talk about delayed pivots because pivoting operations were not performed when they were anticipated. When counting the number of delayed pivots, it is usual to sum the differences between the expected and actual numbers of pivots over all the nodes because each delay affects the storage and operation counts. We show this count in Tables 13.4.1 and 13.5.1.
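Before looking at numerical evidence, the following sketch shows the essence of this process for a single dense symmetric frontal matrix, using 1×1 diagonal pivots and a simple relative threshold only (so the 2×2 pivots of inequalities (11.11.2)–(11.11.3) and blocked BLAS-3 updates are not modelled). Variables that fail the test are delayed and returned as part of the matrix that would be passed to the parent. The function and its argument names are ours, not those of any production code.

```python
import numpy as np

def eliminate_fully_summed(F, nfs, u=0.1):
    """Eliminate as many of the first nfs (fully summed) variables of the dense
    symmetric frontal matrix F as pass the threshold test
        |f_kk| >= u * max_i |f_ik|   (over the rows still in the front),
    and return the matrix passed to the parent together with the delayed
    (fully summed) variables that could not be eliminated."""
    F = np.array(F, dtype=float)
    remaining = np.ones(F.shape[0], dtype=bool)     # rows/columns still in the front
    candidates = set(range(nfs))                    # uneliminated fully summed variables
    progress = True
    while progress:                                 # keep sweeping until no pivot is found
        progress = False
        for k in sorted(candidates):
            rows = np.flatnonzero(remaining)
            colmax = np.abs(F[rows, k]).max()
            if colmax > 0.0 and abs(F[k, k]) >= u * colmax:
                others = rows[rows != k]            # eliminate k by a rank-1 update
                F[np.ix_(others, others)] -= np.outer(F[others, k], F[k, others]) / F[k, k]
                remaining[k] = False
                candidates.discard(k)
                progress = True
    delayed = sorted(candidates)                    # these pivots are delayed to the parent
    keep = np.flatnonzero(remaining)                # delayed + non-fully-summed variables
    return F[np.ix_(keep, keep)], delayed
```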

Table 12.8.1 Effect of numerical pivoting on five-diagonal matrix of order 400.

                              Diagonal entries
                           4.0      1.0      0.01     0.001
Nonzeros in factors         8 048    8 344   13 640   14 660
Total storage for factors  13 134   13 440   18 118   19 226
Multiplications            52 838   56 264  140 460  170 380

We illustrate delayed pivoting with the results of Duff (1984b) shown in Table 12.8.1. The example used in this table is a five-diagonal matrix of order 400 whose off-diagonals have value −1 and whose diagonal entries are as shown in the first row of the table. The matrix in the first column (diagonals equal to 4.0) is positive definite. Matrices in succeeding columns are not positive definite and have smaller and smaller diagonal entries, causing an increasing number of 2×2 pivots to be selected because of the threshold condition (11.11.1). The effect of this change in the pivotal order is quite clear.

Exercises

12.1 While the concept of an elimination tree in Section 12.2 appears to be new in this chapter, it is a representation of the doubly-bordered block diagonal form that we introduced in Section 5.3.4 and generalized to its nested form in Section 9.3. Create an assembly tree for the finite-element problem corresponding to Figure 5.3.4.

12.2 Figure 12.2.2 shows two element assembly trees corresponding to the finite-element problem of Figure 12.2.1. For large scale computation, what are the advantages and disadvantages of the two different trees?

12.3 Verify that a depth-first search postorder of the tree on the right-hand side of Figure 12.2.2 will produce the tree in Figure 12.2.3 with the labelling shown in the latter figure. If a stack is used to hold the intermediate results, what are its contents just before each time the top two elements are assembled? Show that the postordering of any tree will enable stack storage to be used for the generated elements.

12.4 If the number of off-diagonal entries in row i of U is ni, i = 1, 2, ..., n, compute a formula for the number of floating-point operations involved in the numerical factorization.

12.5 Using the sparse matrix pattern and pivotal order of the matrices in Figures 1.3.3 and 1.3.4, construct the elimination trees following the methods described in Section 12.3.1.

12.6 Assembly trees and the elimination trees were discussed in Section 12.3. Consider their advantages and disadvantages, and discuss how they might be used in a practical code.

12.7 Show that if an elimination tree is converted to an assembly tree, as explained in Section 12.3.1, it may be converted back to the elimination tree by recursively amalgamating nodes.

12.8 Use a matrix argument to prove the validity of the observation, given in Section 12.5, that only previous rows of U whose first off-diagonal entry is in column k are required when calculating fill-in for row k of U.


12.9 In the assembly tree in Figure 12.7.2, corresponding to the grid in Figure 12.7.1, how are the nodes whose numbers are not present (for example, 8, 9, 46) represented in the assembly tree?

12.10 Explain the remark at the end of Section 12.7.3, ‘the larger values of nemin are likely to be of benefit on a parallel computer.’

12.11 Look carefully at Table 12.8.1. How would you expect the numbers to change for the case where the diagonal entries are 0.0001? Why?

13 GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

Following on from the work of the previous chapter, we make extensive use of graphs for the efficient factorization of both symmetric and unsymmetric sparse matrices. It is in this area that most of the important recent developments concerning the factorization of large sparse matrices have occurred. We study the effect of numerical pivoting for stability, see how we can avoid it through static pivoting, and show how the pivoting is affected by scaling. We examine in some detail the factorization of matrices on parallel computers.

13.1 Introduction

In this chapter, we continue to consider the use of graphs for the efficient factorization of sparse matrices. We no longer restrict ourselves to the symmetric case. An unsymmetric matrix A may be handled by applying a symmetric ANALYSE (for example, the minimum degree ordering) to the pattern that is the union of the pattern of A with the pattern of A^T. We discuss this in Section 13.2. In Section 13.3, we consider the effect of numerical pivoting, which often involves some loss of sparsity. To avoid this, a perturbed matrix may be factorized and the solution corrected by iteration. This is known as static pivoting and is discussed in Section 13.4. The scaling of the matrix can have a big effect on the pivot choice and is discussed in Section 13.5. In Section 13.6, we introduce an alternative to the multifrontal method in which the tree is used as a framework for holding the factorized matrix as a set of blocks. This avoids the temporary storage needed by the multifrontal method. Directed acyclic graphs, discussed in Section 13.7, can be used to represent data dependencies in more detail than trees, which is useful for parallel programming. Parallel implementation is discussed in detail in Sections 13.8 and 13.9. In Section 13.10, we explain how the approximation of the blocks of a sparse matrix factorization by products of low-rank matrices can give worthwhile savings in both computer time and storage. All these methods are based on working within a symmetric structure. While this is often satisfactory, it is inefficient for seriously unsymmetric structures, which we consider in Sections 13.11–13.13.

13.2 Symbolic analysis on unsymmetric systems

It is possible to use all the apparatus that we discussed in the previous chapter in the solution of systems where the matrix is unsymmetric. If the matrix has a symmetric pattern, the structures generated by the symbolic analysis can be used for factorization, but the whole of each front must be stored because it is not symmetric and numerical pivoting will probably be needed. We will discuss numerical pivoting in Section 13.3. It leads to the need for separate row and column indices.

If the pattern of the matrix A is not symmetric, the ordering schemes of Chapter 11 and the symbolic tree-based analysis discussed in Chapter 12 can be used on the pattern that is the union of the pattern of A with the pattern of A^T. This was discussed in Section 9.10.2 and is the approach used by the HSL code MA41. If the pattern of A is nearly symmetric, this can be very successful and the front sizes are often only a little greater than in the symmetric case. However, this approach will perform less well when A has a pattern that is far from symmetric.

Amestoy and Puglisi (2002) have noted that when the pattern of the matrix is not symmetric, the frontal matrices may have rows or columns that are entirely zero (see Exercise 13.1) and can be removed from the front. This reduces the work needed in assembly and elimination and also reduces the storage required for the matrix factors and the stack. In extensive testing on a large range of matrices from different applications, using a modification of MA41 known as MA41_UNSYM, Amestoy and Puglisi (2002) found that significant savings can be made in this way. The benefit will of course depend on how unsymmetric the pattern is. For matrices in the test set that had a symmetry index (Section 1.7) less than 0.5, the median storage required by the modified algorithm compared to the original MA41 code was about 80% for LU factors and 35% for the stack. On average, the operations for elimination and assembly were reduced to 66% and 40%, respectively, and the time by 60%.

We show the effect of asymmetry in Table 13.2.1, where we compare the use of a symbolic analysis on the symmetrized pattern (code MA41) with the variant of Amestoy and Puglisi (2002) (code MA41_UNSYM), and MA48 (a Markowitz/threshold approach code) on a range of matrices that are increasingly symmetric. In Table 13.2.1, the results using MA41 and MA41_UNSYM were obtained from the paper of Amestoy and Puglisi (2002). We see here that the modification of MA41 generally outperforms the original code, but that MA48 is sometimes better.

Table 13.2.1 Effect of asymmetry on diagonal ordering.

                               av41092  lhr71c  twotone  onetone1  wang4
Order (×10³)                     41.1     70.3   120.7     36.1     26.1
Nonzeros (×10⁶)                  1.684    1.528   1.224     0.341    0.177
Symmetry index                   0.08     0.21    0.43      0.43     1.00
Entries in factors (×10⁶)
  MA41                           14.0     11.7    22.1       4.7     11.6
  MA41_UNSYM                     10.6      9.4    17.0       3.9     11.6
  MA48                           28.1      7.7     8.2       6.1     23.0
Entries in stack (×10⁶)
  MA41                           38.8      0.7    15.9       3.3      5.1
  MA41_UNSYM                     18.8      0.3     5.3       1.4      5.1

Rectangular frontal matrices are used in WSMP (Gupta 2002a, 2002b), which is a distributed-memory multifrontal code for symmetric or unsymmetric matrices. Using rectangular frontal matrices usually leads to better sparsity preservation, but at the expense of a more complicated data structure—a DAG has to be used instead of a tree. We discuss this work in Section 13.12.

We discussed in Section 7.4 using a symbolic analysis of the structure of the symmetric matrix A^T A to provide a column ordering for A and a fill-in pattern that can accommodate any row ordering. The main problem with using this is that the matrix A^T A can be very much denser than A, for example, if A has some rows that are quite dense. This is the approach adopted by SuperLU and SuperLU_MT (Section 13.6).
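Returning to the symmetrized-pattern approach used by MA41, forming the pattern of the union of A and A^T, and a simple symmetry index, is straightforward if the pattern is held row by row. The sketch below uses Python sets and our own definition of the symmetry index (the fraction of off-diagonal entries matched by an entry in the transposed position), which may differ in detail from the definition of Section 1.7.

```python
def symmetrized_pattern(rows):
    """Pattern of A union A^T; rows[i] is the set of column indices of row i."""
    pattern = [set(r) for r in rows]
    for i, cols in enumerate(rows):
        for j in cols:
            pattern[j].add(i)          # add the transposed entry (j, i)
    return pattern

def symmetry_index(rows):
    """Fraction of off-diagonal entries a_ij for which a_ji is also an entry."""
    matched = total = 0
    for i, cols in enumerate(rows):
        for j in cols:
            if j != i:
                total += 1
                matched += i in rows[j]
    return matched / total if total else 1.0
```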

13.3 Numerical pivoting using dynamic data structures

Numerical pivoting for symmetric indefinite systems was discussed in Section 12.8. In just the same way as numerical pivoting there led to delayed pivots and frontal matrices that are larger than forecast during the analysis, so it does in the unsymmetric case wherever there are no sufficiently large entries in the block of the frontal matrix that was expected to be pivotal. This can be ameliorated by scaling and reordering. Unfortunately, reordering an unsymmetric matrix with a symmetric pattern destroys the pattern symmetry. However, if the symmetric case is reordered with care (see Section 13.5.2), most of the benefits of unsymmetric scaling and reordering can be obtained while preserving the symmetry.

Delayed pivots and frontal matrices that are larger than was forecast during the analysis necessitate the use of dynamic data structures, that is, structures that change as the numerical factorization proceeds. Note that even in a non-pivoting case the extended-add required when assembling the frontal matrices of the children into the parent node requires some form of indirect addressing. The extra burden of determining the new structure at the parent is not great because the work is linear in the front size (see Exercise 13.3), whereas the actual assembly is quadratic in the front size.

The serial codes in HSL mostly use dynamic data structures, as do MUMPS and WSMP, which are distributed-memory multifrontal codes for symmetric or unsymmetric matrices. They are the only distributed-memory codes of which we are aware that use dynamic data structures. One reason is that most codes for indefinite symmetric or unsymmetric systems have been developed from codes for positive-definite systems and there is a big incentive to keep changes to a minimum. The other is the added complexity of having to perform dynamic scheduling that redistributes the data among the processes, but this is avoided by WSMP, see Section 13.12.

To maintain some form of stability while avoiding the need for dynamic data structures, most parallel codes either use static pivoting, which we discuss in Section 13.4, or generate a larger static structure that accommodates all choices of pivot. The codes SuperLU_DIST, PasTiX, and PARDISO use the former strategy while SuperLU_MT uses the latter.

13.4 Static pivoting

We illustrated in Section 12.8 the potential cost of numerical pivoting when using a multifrontal scheme. One way of avoiding this is to use static pivoting. As the name implies, static pivoting is a scheme that enables us to use the data structures generated in the symbolic factorization. We describe this strategy by using the frontal matrix

\[
\begin{pmatrix} F_{11} & F_{12} & F_{13} \\ F_{12}^T & F_{22} & F_{23} \\ F_{13}^T & F_{23}^T & F_{33} \end{pmatrix} \tag{13.4.1}
\]

previously shown in (12.8.1). As in Section 12.8, the rows and columns corresponding to F11 and F22 are fully summed. We assume that pivots have been chosen and used from all of F11, but there are no entries in F22 that satisfy our threshold pivoting test because of the presence of relatively large entries in F23. At this point, we have two choices. We either relax the threshold criterion to allow pivots to be chosen from F22 or change the diagonal of F22. In either case, the result is that pivots are chosen from all of F22, the matrix F33 is updated and passed to the next stage. This means that there are no delayed pivots and the data structures set up in the symbolic phase are suitable. However, the factorization is that of a perturbed matrix, so it is necessary to use some refinement algorithm to obtain the solution to the original system. Another possibility would be to use the matrix modification formula, see Section 15.2, although we are unaware of any code that takes this approach. We note also that, if the matrix is singular, either technique will be unreliable and will normally fail. Such failure is likely to be shown by the failure of iterative refinement to make the residuals small.

Static pivoting is described in detail in the papers of Duff and Pralet (2004, 2005) and Schenk and Gärtner (2006). Effectively, in static pivoting, we replace potentially small pivots pk by pk + τ sign(pk), so that we will have factorized A + E = LDL^T where |E| ≤ τI. The main issue in this approach is the choice of τ. As τ increases, the stability of the factorization improves but ‖E‖ increases, which is likely to make iterative refinement less successful. As τ decreases, ‖E‖ decreases, giving a better factorization of the original matrix, but this may be very unstable. There is thus a trade-off: a value of τ close to max_{i,j} |a_{ij}| can give a huge error E, whereas a value close to ε max_{i,j} |a_{ij}|, where ε is the relative precision, can result in large growth of entries in the factorized matrix. Conventional wisdom is to choose

\[
\tau = O(\sqrt{\varepsilon}) \max_{i,j} |a_{ij}| \tag{13.4.2}
\]

and this is supported theoretically in the paper by Arioli, Duff, Gratton, and Pralet (2007). The default in the HSL code MA57 is to scale the matrix so that max_{i,j} |a_{ij}| is equal to 1, so that equation (13.4.2) simplifies greatly.

A static pivoting strategy will result in there being no delayed pivots and will respect the number of entries forecast by the analysis phase. We show, in Table 13.4.1, the result of running MA57 on a selection of indefinite matrices with the static pivoting option switched off and switched on. As expected, the use of static pivoting invariably results in sparser factors and a lower execution time for the factorization, sometimes by a significant amount. Indeed, as we can see from the results for the dtoc matrix in Table 13.4.1, the savings in time and storage from using static pivoting can be considerable. The downside concerns the stability of the factorization and we give some indication of that by showing the number of perturbed pivots in Table 13.4.1 and the residuals in Table 13.4.2. The problem is that iterative refinement sometimes fails to converge, as we see in the CONT examples in Table 13.4.2.

The codes SuperLU, PARDISO, MA57, and MUMPS use static pivoting (or have an option for it), as do some of the codes in HSL. Hogg and Scott (2013b) have done extensive testing on a range of options for pivoting in cases where the matrices are highly indefinite and show that it is dangerous to rely solely on iterative refinement to obtain a solution when using static pivoting. Arioli et al. (2007) have shown that using the more powerful iterative method FGMRES (Saad 1993), rather than iterative refinement, results in a backward stable method that converges for really quite poor factorizations of A.
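Before turning to the numerical results, the sketch below shows the idea on a dense matrix: an LDL^T factorization with no interchanges in which any pivot smaller than τ is replaced by τ sign(p_k), followed by a few steps of iterative refinement against the original matrix, with τ chosen as in equation (13.4.2). It ignores everything that makes real codes such as MA57 or SuperLU_DIST robust (scaling, 1×1/2×2 pivot selection, sparsity), and all names are our own.

```python
import numpy as np

def ldlt_static(A, tau):
    """Dense LDL^T with no interchanges; a pivot smaller than tau in magnitude
    is replaced by tau*sign(pivot), so we factorize a perturbed matrix A + E."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, d = np.eye(n), np.zeros(n)
    for k in range(n):
        piv = A[k, k]
        if abs(piv) < tau:
            piv = tau if piv >= 0 else -tau       # static perturbation of the pivot
        d[k] = piv
        L[k+1:, k] = A[k+1:, k] / piv
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])
    return L, d

def solve_static(A, b, steps=3):
    """Solve Ax = b using the perturbed factorization plus iterative refinement."""
    tau = np.sqrt(np.finfo(float).eps) * np.abs(A).max()   # cf. equation (13.4.2)
    L, d = ldlt_static(A, tau)
    x = np.zeros_like(b, dtype=float)
    for _ in range(steps):
        r = b - A @ x                                      # residual of the ORIGINAL system
        y = np.linalg.solve(L, r)                          # forward substitution
        x = x + np.linalg.solve(L.T, y / d)                # diagonal and back substitution
    return x
```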

Table 13.4.1 Effect of static pivoting on factorization storage and time. Runs using HSL code MA57.

                                Number of pivots         Time (seconds)    Size of the factors (millions)
Matrix       Order    Entries   Delayed     Perturbed    Num      Static   Num       Static
             (×10³)   (×10³)    (Num)       (Static)
brainpc2       27.6     179.4    14 267      12 932       0.18     0.11     0.657     0.323
cont-300      180.9     988.2   183 306      67 864      21.1      6.08    23.839    10.714
dtoc           25.0      70.0    29 478       9 790      29.1      0.41     4.714     0.188
ncvxqp5        62.5     425.0    16 703       8 402      25.7     23.0     13.366    11.205
sit100         10.3      61.0     2 710       1 388       0.13     0.11     0.483     0.417
stokes128      49.7     558.6    18 056      12 738       1.14     1.06     3.437     2.754

Table 13.4.2 Residual norms when using iterative refinement after numeric and static pivoting. Runs using HSL code MA57.

              Num pivoting strategy         Static pivoting strategy
Matrix        Iteration 0   Iteration 1     It. 0      It. 1      It. 2
brainpc2      1.6e-15       1.0e-15         2.1e-08    5.7e-15    9.8e-16
cont-300      7.6e-11       1.9e-16         2.1e-05    2.7e-09    2.5e-09
dtoc          2.1e-16       2.7e-20         8.3e-07    2.1e-13    1.9e-15
ncvxqp5       2.0e-11       2.0e-16         2.0e-08    6.7e-11    2.7e-14
sit100        4.4e-15       1.4e-16         2.0e-08    5.8e-15    1.5e-16
stokes128     1.1e-14       5.5e-16         4.2e-14    2.0e-15    1.7e-15

Fig. 13.4.1. Restarted GMRES, iterative refinement (IR), and FGMRES on the CONT-201 test example with τ = 10⁻⁸. The norm of the residual scaled by ||A||∞ ||x||2 + ||b||2 is plotted against the number of iterations. We thank Mario Arioli for generating this figure for us.

FGMRES was developed from GMRES (Saad and Schultz 1986) to allow a different preconditioning matrix to be used on each iteration and Arioli et al. (2007) proved that the extra vector recurrence ensures greater stability. We show results of Arioli et al. (2007) in Figure 13.4.1, where we see that neither iterative refinement nor GMRES converges and the benefits of using FGMRES are clearly seen. However, we do not pursue this discussion further as the use of these more complicated iterative methods is not within the scope of this book.

13.5 Scaling and reordering

13.5.1 The aims of scaling

We have discussed scaling before, notably in Sections 4.14 and 4.15. One reason for scaling is to avoid difficulties caused by the use of widely varying units and the potential loss of information when performing arithmetic when values range over several orders of magnitude. In this section, we are not primarily concerned by the method used for scaling, but rather that the resulting matrix has numerical values that are not too disparate and that changes to the planned pivot order are limited. We identify a potential caution for scaling in Section 13.5.4.

If the matrix is scaled and reordered to put large entries on the diagonal prior to analysis and factorization, then it might be expected that it will retain this property during the factorization so that diagonal entries have a reasonable chance of satisfying a threshold criterion. In particular, the I-matrix scaling and permutation (see Sections 4.15.3 and 6.9), where the modulus of the diagonal entries are 1.0 and all off-diagonals are less than or equal to 1.0 in magnitude, might be expected to do particularly well, although it does not guarantee that diagonal pivoting will be satisfactory, as illustrated by the example

\[
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix}.
\]

13.5.2 Scaling and reordering a symmetric matrix

Symmetric scaling and reordering leaves the same entries on the diagonal. A large off-diagonal entry cannot be moved to the diagonal. One possibility for a symmetric matrix is to abandon symmetry and treat the matrix as unsymmetric. An alternative was constructed by Duff and Pralet (2005) and implemented in the code HSL MC64. They choose a transversal by applying I-matrix scaling and permutation to the matrix, ignoring its symmetry. Any diagonal entry that is in the transversal is regarded as defining a 1×1 pivot and any off-diagonal entry aij that is in the transversal is regarded as defining a candidate 2×2 pivot (see Section 5.2.3) in rows and columns i and j. If all such entries are in pairs aij and aji , they have a complete set of pivots. Otherwise, the permutation must have component cycles of length 2k or 2k + 1 with k ≥ 1, from which k 2×2 pivots in non-overlapping rows and columns may be chosen. Duff and Pralet (2005) take care to make this choice well, but Hogg and Scott (2013b) report that cycles of length 3 or more do not occur often and that the simple choice of k adjacent pairs from a cycle of length 2k or 2k + 1 is adequate. A compressed matrix pattern is constructed with a single row and column for each pair (i, j) that has the pattern of the union of the patterns of rows i and j. This compressed matrix is analysed to provide a pivot sequence of 1×1 and 2×2 pivots for the original matrix.
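The cycle-based construction is easy to sketch. Given a transversal described by a permutation (row i is matched to column perm[i]), the code below collects diagonal matches as 1×1 candidates and takes adjacent pairs from each longer cycle as candidate 2×2 pivots; the leftover index of an odd cycle is simply kept as a 1×1 candidate, which is our own simplification rather than necessarily what is done in HSL_MC64.

```python
def pivot_candidates(perm):
    """Split a matching permutation into 1x1 and 2x2 pivot candidates."""
    n = len(perm)
    ones, twos, seen = [], [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle, i = [], start
        while not seen[i]:                   # walk the cycle containing start
            seen[i] = True
            cycle.append(i)
            i = perm[i]
        if len(cycle) == 1:
            ones.append(cycle[0])            # matched on the diagonal
        else:
            for j in range(0, len(cycle) - 1, 2):
                twos.append((cycle[j], cycle[j + 1]))   # k adjacent pairs
            if len(cycle) % 2 == 1:
                ones.append(cycle[-1])       # leftover index of an odd cycle
    return ones, twos
```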

13.5.3 The effect of scaling

In the techniques of this chapter, we perform orderings and symbolic factorizations without consideration for the numerical values. If numerical pivoting is required, this is performed later and can change both the ordering and the data structures. Our concern is that the changes that we have to make to the initial ordering may be substantial. In symbolic analysis, it is common to choose pivots from the diagonal. If they are large relative to off-diagonal entries of the original matrix, then it may be hoped that they will be good candidates to choose later on numerical grounds.

In the context of our tree-based factorization, a measure of whether the original ordering is acceptable is given by the number of delayed pivots, as defined in Section 12.8. When a dynamic pivoting strategy is in use, as in MUMPS or MA57 for example, a reduction in the number of delayed pivots should also reduce the amount of fill-in. We show the effect of this in Table 13.5.1, where MA57 is run on a selection of symmetric indefinite matrices, both unscaled and using MC64 scaling that results in an I-matrix. It is clear from this table that there are sometimes far fewer delayed pivots after scaling and consequently fewer entries in the factors.

Table 13.5.1 Effect of scaling on factorization and number of delayed pivots. Runs using HSL code MA57. Hyphens indicate insufficient memory allocated.

                                 Entries in factors (×10⁶)    Delayed pivots (×10³)
Matrix       Order     Entries   Unscaled     With MC64       Unscaled     With MC64
             (×10³)    (×10³)
brainpc2       27.6      179.4     0.619        0.663           13.8         15.1
cont-300      180.9      988.2    22.536       22.540          160.5        160.5
dtoc           25.0       70.0     1.033        1.021           44.1         44.1
ncvxqp5        62.5      425.0       -         13.314             -          17.4
sit100         10.3       61.0     0.488        0.487            2.7          2.7
stokes128      49.7      558.6     4.814        3.214           46.0         20.0

Another place where scaling has a significant effect is when a static pivoting strategy is being used, as described in Section 13.4. As we remarked in that section, static pivoting is particularly useful when we wish to retain static data structures for efficiency in parallel computation. We show in Table 13.5.2 the effect of using scaling on the performance of static pivoting in the code SuperLU_DIST. In most cases, there is a slight increase in the number of entries in the factors after using MC64. The reason is that the matrices are nearly symmetric and the permutation associated with MC64 will make them less so, at the cost of more fill-in. The main point is that there are some matrices where iterative refinement fails to converge if the MC64 scaling is not used. It should be added that all matrices are already scaled using a simple row-norm equilibration and, if that were not done, the failure rate would be even higher.

Table 13.5.2 Effect of using MC64 in SuperLU_DIST with static pivoting.

                       Order     Entries   Entries (L+U) (×10⁶)    Rel BERR (its)
Matrix                 (×10³)    (×10⁶)    MC64       No-MC64      MC64          No-MC64
bbmat                    38.7      1.77      33.1       32.5       4.5e-16 (2)   4.0e-16 (4)
inv-extrusion-1¹         30.4      1.79      17.1       13.9       8.9e-16 (2)   9.4e-16 (3)
torso3                  259.2      4.43     233.3      226.3       3.5e-16 (1)   3.9e-16 (1)
matrix211¹              801.4    129.41   1 253.8    1 253.8       8.2e-16 (1)   1.3e-03 (-)
tdr190k¹              1 100.2     43.32     713.5      711.9       4.9e-16 (1)   4.3e-01 (-)

¹ Not in the Florida sparse matrix collection.

We are grateful to Sherry Li for doing the runs in this table.

13.5.4 Discussion of scaling strategies

In Section 4.14, we described three quite different strategies for scaling a sparse matrix: scaling the entries to make their absolute values close to one (Section 4.15.1), scaling the row and column norms (Section 4.15.2), and I-matrix scaling (Sections 4.15.3 and 13.5.2). They all seem entirely reasonable. So the question is: ‘which should be used and is one better than another?’. The answer depends on context and criteria.

As to context, Duff (2007) has compared these scaling algorithms when used with MA57 for symmetric indefinite matrices. In Figure 13.5.1, we show a performance profile (Dolan and Moré 2002) from the thesis of Pralet (2004), who ran the scaling algorithms on a set of augmented system matrices. For a given value x on the x-axis, the plots show the proportion of matrices for which the factorization time after scaling with each algorithm was within the factor x of the best of the five. We see that all do better than no scaling, that I-matrix scaling is best, and that scaling the rows and columns (in either norm) is between this and scaling the entries (Section 4.15.1). We believe that the success of I-matrix scaling is because it permutes entries onto the diagonal to maximize the size of their product and most of these entries are retained when symmetry is restored (Section 13.5.2). This is not true of the other scalings. Similar conclusions were made by Hogg and Scott (2008). Clearly, scaling is very helpful from this perspective.

Fig. 13.5.1. Figure from Duff and Pralet (2005). Performance profiles for factorization times on a set of augmented system matrices. MC30 scales the entries, MC77 scales the rows and columns in the infinity or one norm, and MC64 uses I-matrix scaling.

Note that we have not paid attention to the right-hand sides in this section. It is possible that scaling the matrix with right-hand sides appended may be desirable. We discussed this for full matrices at the start of Section 4.15, and the same considerations can apply in the large sparse case.

We note that the criterion being used here is the speed of the factorization. More difficult to assess but equally as important is whether the resulting factorization yields a solution to the original problem with high accuracy. The problem here is that different automatic scaling methods assign meaning to different small numbers. In making this assignment, they lead to different scaled matrices. Each resulting matrix has a stable factorization that can be achieved rapidly as shown. However, the solutions using these factors may look different from each other. Which one was the better solution to the problem intended by the modeller? It is easy to demonstrate this dilemma with a 2×2 matrix, which we invited the reader to explore in Research Exercise R.4.1. We invite the reader to explore the effect of different scalings on solution accuracy for larger, relevant problems in Research Exercise R13.2.
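For reference, the quantity plotted in Figure 13.5.1 is easy to compute from a table of factorization times; the following few lines give the value of the Dolan and Moré (2002) profile for one abscissa x (the data layout and names are our own).

```python
def performance_profile(times, x):
    """times: dict method -> list of times over the same problems.
    Returns, for each method, the fraction of problems on which its time is
    within a factor x of the best time recorded for that problem."""
    methods = list(times)
    nprob = len(next(iter(times.values())))
    best = [min(times[m][j] for m in methods) for j in range(nprob)]
    return {m: sum(times[m][j] <= x * best[j] for j in range(nprob)) / nprob
            for m in methods}
```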

13.6 Supernodal techniques using assembly trees

For most of this chapter, we have focused on the use of trees in a multifrontal factorization. However, as we showed in Section 12.6, the assembly tree can be used for other approaches and we examine, in this section, its use for supernodal schemes. To illustrate our discussion, we use the code SuperLU (Li 2005) as an exemplar. SuperLU solves systems with a square unsymmetric coefficient matrix. It can use the same assembly tree as the multifrontal approach but, instead of storing a frontal matrix at each node, it stores only the fully summed columns. The elimination operations are applied directly to these fully summed columns.

The algorithms and tree used by SuperLU are a little more complicated than this and depend on the version of SuperLU.

Those for sequential computation (Demmel, Eisenstat, Gilbert, Li, and Liu 1999) and shared memory (SuperLU_MT) (Li 2005) use one approach, while the code for distributed memory (SuperLU_DIST) uses quite another.

The sequential and shared memory versions compute a structure for L and U for the factorization

    PAQ = LU,    (13.6.1)

where P and Q are permutation matrices. As discussed in Sections 7.4 and 9.10.2, Q is obtained from an ordering for the factorization of the symmetric matrix A^T A and a tree is constructed for this. For any row permutation P, the structure of L and U will be contained in the structure of R^T and R, respectively, where R^T R is the Cholesky factorization of Q^T A^T AQ. The default is to use the AMD ordering, discussed in Section 11.3, on A^T A, which is effected through the COLAMD ordering on A (Davis et al. 2004a). This ordering gives a column ordering Q that preserves sparsity, while the row ordering P can be used for numerical pivoting knowing that the structure of R will accommodate any row permutation. Partial pivoting is used.

The assembly tree provides a grouping of the variables into the sets of variables eliminated at the nodes, known as supervariables. The corresponding block on the diagonal of L\U is unlikely to have many zeros, so it is held as a full matrix and any sparsity within it is ignored. Sparsity is exploited in the rest of the block column. If the block column is large, it is subdivided by columns into ‘panels’. Each update operation of Gaussian elimination takes the form

    A_ij = A_ij − L_ik U_kj,    (13.6.2)

where i, j, and k label sets of rows or columns that all lie within a block. The operations for a node are performed when the node is active, after all the operations at its descendants are complete. This means that the access pattern is that of a left-looking factorization, illustrated in Figure 12.6.3. This has an advantage for parallel computing because data is written only while the node is active. That data can be accessed without synchronization by parallel threads working at an ancestor node. Once all the operations coming from previous pivotal operations (from descendants in the tree) have been applied, row pivoting can be performed.

For SuperLU_DIST, which is designed for distributed-memory machines, the matrix is first permuted and scaled by MC64 to

    B = P_r D_r A D_c P_c,    (13.6.3)

where the permutations P_r and P_c and the diagonal scaling matrices D_r and D_c are chosen so that the entries on the diagonal are one and the off-diagonal entries have absolute values no greater than one. A symmetric ordering is used based on running a fill-reducing algorithm such as AMD or nested dissection on the symmetric matrix (B + B^T) and the matrix PBP^T is factorized as

    PBP^T = LU    (13.6.4)

without any pivoting. Because the sparsity pattern of B is contained in the sparsity pattern of (B + B^T), the sparsity patterns of L and U are contained in the sparsity patterns anticipated for the Cholesky factorization of (B + B^T). SuperLU_DIST can therefore use the assembly tree constructed for (B + B^T). It, too, stores only the fully summed columns at the nodes, but it is right-looking. The update operations (13.6.2) are performed when panel k is active. This right-looking algorithm has the data transfer pattern shown in Figure 12.6.2. It has been chosen mainly because of the inherently greater parallelism it provides. No further pivoting is performed for numerical stability but static pivoting (see Section 13.4) and iterative refinement (Section 4.13) are used to get an accurate solution (see Section 13.5.3, where we present a table on this from runs of SuperLU_DIST).

The Level 3 BLAS subroutine gemm is an obvious candidate for performing the matrix multiplication in the update (13.6.2), but the blocks U_kj will usually not be full. Each is held by columns, omitting the zeros ahead of the first entry (and omitting zero columns altogether), but may be unpacked into a temporary dense rectangular matrix that excludes all leading zero rows and all zero columns. Following the call to gemm, each column of the result is subtracted from the appropriate part of the stored submatrix A_ij.

The code PARDISO (Schenk, Gärtner, and Fichtner 2000) also uses a supernodal scheme and, similarly to SuperLU_DIST, uses right- and left-looking strategies. This code handles symmetric definite, symmetric indefinite, and unsymmetric matrices. By default, it uses static pivoting with iterative refinement.

13.7 Directed acyclic graphs

The assembly tree may be used to define task dependencies and thereby schedule tasks for parallel computing — operations at a node wait for those at its children to be complete. A more detailed structure for doing this is a DAG (directed acyclic graph). We define tasks, each of which acts on a block and is associated with a node, and use directed edges to define dependencies. An easy example of this is to consider the block Cholesky factorization of a dense matrix. We define three types of task:

(i) Factorize a block on the diagonal, A_kk^(k) = L_kk L_kk^T. This must wait for A_kk^(k) to be calculated.

(ii) Compute an off-diagonal block L_ik (i > k) by solving a set of equations L_kk L_ik^T = A_ik^(k)T. This must wait for L_kk and A_ik^(k) to be calculated.

(iii) Update a block of the remaining matrix, A_ij^(k+1) = A_ij^(k) − L_ik L_jk^T, with i ≥ j > k. This must wait for A_ij^(k), L_ik, and L_jk to be calculated.

We illustrate this for the case having three block rows and columns in Figure 13.7.1.

Fig. 13.7.1. DAG for Cholesky factorization of a dense matrix having three block rows and columns.

There are ten tasks: three factorizations of blocks on the diagonal, three computations of off-diagonal blocks, and four updates of blocks. For each node, we show the matrix that the task creates. The first task has to be the factorization of A11. The calculation of L12 must wait for this to be done, as must the calculation of L13. Updating A22 must wait for L12 to be known. The first update of A33 must wait for L13 to be known. Each of these relationships and the later ones are represented by a directed edge in the graph. To ensure that the two updates to A33 do not occur at the same time, we have shown a requirement for the second update to wait for the first. An alternative is to make the calculation of L33 wait for both updates, that is, to make the arrow from A33^(2) terminate at L33 instead of at A33^(3). This provides a little more freedom (at the expense of bitwise reproducibility), but would need to be accompanied by the use of some other mechanism to prevent the two updates occurring at the same time. This could be a lock, so that the whole of one update precedes the other, or could involve the use of atomics for each entry, so that the order of updates may vary from entry to entry.

Having thus set up such a DAG, the work at any node can be performed as soon as the work at all the nodes with edges ending at this node has completed. All that remains is to develop a scheduling algorithm for the available tasks during the factorization. Hogg (2008) tried several strategies for shared-memory machines and found that what is important is to keep all the threads busy. He targeted the fastest completion of the longest path (this critical path defines the minimum elapsed time for the computation on a parallel machine), but obtained little gain over earlier work on utilizing the DAG. Both papers use this model to implement algorithms for multicore architectures where finer granularity and the use of OpenMP (as opposed to MPI) enable excellent performance on eight-core desktop machines. Buttari, Langou, Kurzak, and Dongarra (2009) get over 40 Gflops for the Cholesky factorization of matrices of order 5 000 or more on a machine with peak performance for dgemm of 57.5 Gflops. Hogg (2008) gets over 60 Gflops for the Cholesky factorization of matrices of order 10 000 or more on a machine with peak performance for dgemm of 72.8 Gflops.
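To make the three task types concrete, here is a sequential blocked Cholesky sketch in which the three loop bodies correspond exactly to tasks (i)–(iii); a parallel implementation would instead submit each body as a task and let the DAG of Figure 13.7.1 (or locks/atomics, as discussed above) control when each may run. The code is our own illustration, not taken from any of the cited solvers.

```python
import numpy as np

def blocked_cholesky(A, nb):
    """Right-looking blocked Cholesky of a symmetric positive-definite matrix;
    returns the lower triangular factor L with A = L L^T."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        A[k0:k1, k0:k1] = np.linalg.cholesky(A[k0:k1, k0:k1])        # task (i)
        Lkk = A[k0:k1, k0:k1]
        for i0 in range(k1, n, nb):
            i1 = min(i0 + nb, n)
            # task (ii): L_ik = A_ik L_kk^{-T}, via L_kk L_ik^T = A_ik^T
            A[i0:i1, k0:k1] = np.linalg.solve(Lkk, A[i0:i1, k0:k1].T).T
        for j0 in range(k1, n, nb):
            j1 = min(j0 + nb, n)
            for i0 in range(j0, n, nb):
                i1 = min(i0 + nb, n)
                # task (iii): A_ij <- A_ij - L_ik L_jk^T (lower triangle only)
                A[i0:i1, j0:j1] -= A[i0:i1, k0:k1] @ A[j0:j1, k0:k1].T
    return np.tril(A)

# Example check: L = blocked_cholesky(M, 64) should satisfy np.allclose(L @ L.T, M).
```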


This DAG-based approach can be extended to the solution of sparse systems. For a shared-memory machine, Hogg, Reid, and Scott (2010) used a supernodal approach (see Section 13.6) to save storage (each update is applied directly to the memory that eventually holds the factors) and subdivided the computation into blocks of size nb × nb or less, where nb is a parameter. To allow the use of Level 3 BLAS subroutines, each block column is stored by rows in pivotal order. With this storage, it makes sense to treat the Cholesky factorization of a diagonal block and the subsequent calculation of the corresponding off-diagonal blocks as a single task to be performed by a single thread. However, the updates are treated as two tasks:

(i) update internal, which involves the update of a block from a block column of the same node;
(ii) update between, which involves the update of a block from a block column of a descendant node.

An update internal can be performed directly using Level 3 BLAS subroutines. An update between has the form L_ik L_jk^T, where both L_ik and L_jk are contiguous rows of a block column, which allows it to be calculated by gemm. However, gemm cannot be used to modify the destination block because this is likely to contain additional rows and columns (from other descendants). The update must therefore be placed in a work array (one for each thread) before its entries are added in a ‘scattered’ way (see the sketch below).

Of course, the scheduling of tasks is important and we refer the reader to the paper of Hogg et al. (2010) for details. They do not store the DAG explicitly and use locks to prevent two updates occurring at the same time rather than specifying an order for them. This, of course, gives more freedom to exploit parallelism. They have shown that their code (HSL_MA87) can attain about 5/8th of the peak dgemm speed (speed when large independent dgemm calls are made on each of the cores) on a 28-core machine.
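In outline, an update between forms its contribution in a dense buffer and then subtracts it from the destination block through index lists; the sketch below assumes the global row and column indices of the destination block are held in sorted arrays that contain those of the update. The names and storage layout are our assumptions, not the data structures of HSL_MA87.

```python
import numpy as np

def update_between(L_ik, rows, L_jk, cols, dest, dest_rows, dest_cols):
    """Subtract L_ik @ L_jk.T from the destination block in a scattered way.
    rows/cols: global indices of the update; dest_rows/dest_cols: sorted global
    indices of the destination block, assumed to contain rows/cols."""
    buf = L_ik @ L_jk.T                       # the dense product (as if by gemm)
    r = np.searchsorted(dest_rows, rows)      # local row positions in dest
    c = np.searchsorted(dest_cols, cols)      # local column positions in dest
    dest[np.ix_(r, c)] -= buf                 # 'scattered' subtraction
    return dest
```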

13.8 Parallel issues

The ANALYSE phase has proved to be the most difficult to parallelize. To generate an ordering in parallel, most approaches are based on nested dissection as that has a natural parallelism associated with it since, after a dissection, the two subgraphs split by the separator are independent. However, in a distributed memory environment, it may not be possible to hold even the structural information for the matrix on a single process and the matrix is usually condensed to a representation that can be. The initial ordering is then performed on this reduced matrix so that its quality may be inferior to that obtained by working on the original matrix. Both METIS and SCOTCH have versions that will generate orderings and data structures in this case, called ParMETIS (Karypis and Kumar 1998d; Sui, Nguyen, Burtscher, and Pingali 2011) and PT-SCOTCH (Chevalier and Pellegrini 2008; Pellegrini 2012), respectively, but the quality can be poor, particularly in the case of ParMETIS. In shared memory, where we assume we have access to the structure of the complete matrix, the situation is better, and LaSalle and Karypis (2015) have developed a shared memory code called mt-ND-Metis that achieves speedups of up to 10 on 16 cores and provides an ordering with only about 1.0% more fill-in and 0.7% more operations for the subsequent factorization than the serial ND-Metis.

For the other steps in ANALYSE, even less has been done to exploit parallelism, although the algorithms for generating and manipulating trees that we discussed in Chapter 12 are amazingly efficient.

In the next section, we consider parallel algorithms for numerical factorization. Historically, numerical factorization has been the most time consuming part of a direct solver, so this phase has attracted the most attention for developing parallel algorithms and software. Indeed this work has been so successful that the other phases are often the bottleneck to efficient solution. We discuss the parallelization of the SOLVE phase in Section 14.7.

13.9 Parallel factorization

13.9.1 Parallelization levels

Algorithms for the factorization of sparse matrices exhibit potential parallelism at three levels. The macro-level is at the top. For example, using the desirable forms of Chapter 9, the factorizations of the blocks on the diagonal of the block triangular form are independent and so can be performed in parallel. Duff and Scott (2004) show that the singly bordered block diagonal form can be used to good effect in parallelizing the MA48 code for the solution of highly unsymmetric systems.

The bottom level applies to computations on blocks that are dense matrices. We need to perform factorizations of blocks, solution of block equations, and multiplications of blocks. For these tasks, Level 3 BLAS that are multi-threaded to exploit fine-grain parallelism have been developed. At a node of the assembly tree, the frontal matrix has the form

\[
\begin{pmatrix} F_{11} & F_{12} \\ F_{21} & F_{22} \end{pmatrix} \tag{13.9.1}
\]

where pivots are chosen from F11 and the Schur complement matrix F22 − F21 F11^{-1} F12 is passed to the parent node. If the frontal matrix is large, it can be distributed in a one- or two-dimensional way, that is, as block rows (or block columns) or as subblocks. A one-dimensional distribution will be sufficient up to some size, but a two-dimensional distribution will be needed thereafter, particularly for fronts at the root (Schreiber 1993) or near the root. This level of parallelism is known as node parallelism.

The middle level arises from using the assembly tree. We illustrate this level of parallelism, which we call tree parallelism, using the tree shown in Figure 13.9.1. During the factorization, we process the nodes of the tree from the leaf nodes to the root (we assume the matrix is irreducible so there is only one root) and we use the property that the computations at any node can proceed as soon as all descendant nodes of that node have been processed (strictly, as soon as all the data from all the children are available). Thus, at the start of the factorization, illustrated in Figure 13.9.2, the parallelism from the tree is equal to the number of leaf nodes and this is illustrated by the encircled nodes in Figure 13.9.2. As the computation progresses, as shown in Figure 13.9.3, computations at some nodes are completed (indicated by a cross) and other nodes become available for computation (now encircled). We observe in Figure 13.9.3 that the number of active nodes (that is, those for which computations can be concurrently performed) has reduced from 13 to 7.

Fig. 13.9.1. Assembly tree.

Fig. 13.9.2. Available nodes at start of factorization. Work corresponding to leaf nodes can proceed immediately and independently.

Fig. 13.9.3. Situation part way through the elimination. When all children of a node complete, work can commence at the parent node.

When DAGs are in use (see Section 13.7), the DAG replaces the tree as the middle level of parallelization. It provides additional scope for parallelization, because updates to an ancestor do not have to wait for processing at the node to be complete.

Of course, the shape of the tree can have a profound influence on the amount of parallelism that can be obtained, and in turn this will depend on the matrix and the ordering used. In general, minimum degree or approximate minimum degree orderings, that were so successful for serial computation, tend to produce trees that are tall and skinny, and so not so good for parallel exploitation. A nested dissection based ordering will tend to give broader trees that yield more parallel processes.
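The availability pattern illustrated in Figures 13.9.2 and 13.9.3 can be simulated with a few lines of code: starting from the leaves, a node becomes ready once all of its children have been processed. The sketch below (our own, with the tree given as a children map in which every node appears as a key) returns the successive 'waves' of nodes that could be processed concurrently.

```python
def availability_waves(children):
    """children: dict node -> list of children (leaves map to empty lists).
    Returns a list of waves; wave 0 is the set of leaves, and each later wave
    contains the nodes that become available once the previous waves are done."""
    pending = {v: len(c) for v, c in children.items()}   # unprocessed children
    parent = {c: v for v, cs in children.items() for c in cs}
    ready = [v for v, k in pending.items() if k == 0]    # the leaf nodes
    waves = []
    while ready:
        waves.append(ready)          # these nodes could be processed concurrently
        nxt = []
        for v in ready:
            p = parent.get(v)
            if p is not None:
                pending[p] -= 1
                if pending[p] == 0:
                    nxt.append(p)
        ready = nxt
    return waves
```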

13.9.2 The balance between tree and node parallelism

As the elimination progresses from the leaves to the root, there are generally fewer and fewer nodes available for parallel tasks until, at the root, there is a single node and, hence, no parallelism from the tree. The saving grace is that the frontal matrices generally become larger the closer we are to the root, so that the node parallelism becomes more relevant and useful just as the tree parallelism decreases. We see this in Table 13.9.1, where the balance between tree parallelism and node parallelism is clearly seen.

Table 13.9.1 Statistics on front sizes in assembly tree.

                                           Leaf nodes          Top 3 levels
Matrix      Order      Tree nodes      No.        Av. size   No.    Av. size   % ops
bratu3d       27 792     12 663       11 132         8       296        37       56
cont-300     180 895     90 429       74 673         6        10       846       41
cvxqp3        17 500      8 336        6 967         4        48       194       70
mario001      38 434     15 480        8 520         4        10       131       25
ncvxqp7       87 500     41 714       34 847         4        91       323       61
bmw3_2       227 362     14 095        5 758        50        11     1 919       44

On a symmetric positive-definite matrix, WSMP (Gupta, Karypis, and Kumar 1997) achieves a good balance between node and tree parallelism by working with a binary tree on a number of processes p that is a power of 2. This means that there is a level ℓ0 at which there are exactly p nodes. Each process is associated with a node at this level and acts on the subtree rooted at this node on its own¹. No synchronization is needed at this level. At level ℓi, i > 0, each frontal matrix is partitioned into 2^i parts and each part is held on a single process. The 2^i processes that hold a frontal matrix collaborate to perform the elimination operations of the frontal matrix. Synchronization is needed within each set of 2^i processes, but not between the sets. If the parts are of similar size, there will be good node parallelism for the computations within the frontal matrices and good tree parallelism across the nodes.

To partition the frontal matrices and associate the partitions with processes, WSMP uses the positions in the pivot sequence expressed as a binary number in the range 0, 1, . . . , n − 1 and indexes the processes with binary numbers in the range 0, 1, . . . , p − 1. Each frontal matrix at level ℓ1 is partitioned by columns according to the final bit of the position of the column in the pivot sequence, that is, according to whether the position is odd or even. This may be expected to divide the columns into two parts of roughly equal size because all the variables of the front are eliminated at the node or an ancestor of the node and all the variables eliminated at a node form a contiguous set in the pivot sequence. The pair of processes that handle the frontal matrix have indices that differ only in the last bit. The process whose last bit is 0 holds the entries that lie in the columns with last bit 0 and the process whose last bit is 1 holds the entries that lie in the columns with last bit 1. Each assembles its part of the frontal matrix; then they work together to perform the elimination operations within the frontal matrix, leaving a generated element matrix that is partitioned in the same way.

Each frontal matrix at level ℓ2 is partitioned by rows, as well as columns. The part to which an entry belongs is determined by the final bits of the positions of its row and column in the pivot sequence. The four processes that handle the frontal matrix have indices that differ only in the last two bits. The two processes with final bit 0 continue to hold the columns with final bit 0; the process with final two bits 00 holds the rows with final bit 0 and the process with final two bits 10 holds the rows with final bit 1. They collaborate to assemble their parts of the frontal matrix without needing any data from other processes. The same applies to the two processes with final bit 1. Each frontal matrix is now stored in four processes, each holding about the same number of entries. The four then work together to perform the elimination operations within the frontal matrix. We illustrate this in Figure 13.9.4 for a case with exactly four processes.

At level ℓk, k > 2, pairs of processes again collaborate to assemble generated element matrices. They have indices that differ in the bit that is k-th from last. They hold entries from the same set of rows and columns at level ℓk−1 and now hold about half the rows (k even) or columns (k odd). Once the assembly is complete, the 2^k processes that hold the frontal matrix collaborate to perform eliminations there. Their indices are identical except in the last k bits.

¹ If the tree is not binary, the algorithm of Geist and Ng (1989) (Section 13.9.5) may be used to find a suitable set of p nodes and a binary subtree with these as leaves can be constructed from the given tree by adding extra nodes.
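Our reading of this bit-interleaving rule can be written down compactly: bit t of the owning process index (counting t = 1 as the last bit) is taken from the column position when t is odd and from the row position when t is even, using successively higher bits of those positions. The sketch below is an illustrative reconstruction from the description above, not WSMP source code.

```python
def owner_bits(row_pos, col_pos, k):
    """Last k bits of the index of the process holding entry (row_pos, col_pos)
    of a frontal matrix shared by 2**k processes at level k of the tree.
    row_pos and col_pos are positions in the pivot sequence."""
    bits = 0
    for t in range(1, k + 1):
        if t % 2 == 1:
            b = (col_pos >> ((t - 1) // 2)) & 1     # odd t: split by a column bit
        else:
            b = (row_pos >> (t // 2 - 1)) & 1       # even t: split by a row bit
        bits |= b << (t - 1)
    return bits

# For k = 2 this reproduces the layout of Figure 13.9.4: the last bit of the
# process index is the last bit of the column position and the second-from-last
# bit is the last bit of the row position.
```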

PARALLEL FACTORIZATION Cols ......0

Cols .....1

Rows ......0

Proc. 00

Proc. 01

Rows .....1

Proc. 10

Proc. 11

299

Proc. 00

Proc. 01

Proc. 10

Proc. 11

Cols ......0

Cols .....1

Cols ......0

Cols .....1

Proc. 00

Proc. 01

Proc. 10

Proc. 11

Fig. 13.9.4. The WSMP division of frontal matrices among processes. A disadvantage of this scheme is that adjacent pivot entries in a front are placed on different processes (except at level `0 ), which means that the block size for Level 3 BLAS subroutines will be small. Gupta et al. (1997) suggest starting from a bit other than the last in the index of the position in the pivot sequence. If the k-th from last is used, sequences of up to 2k pivots will be placed together in a process, thereby increasing the block size for Level 3 BLAS subroutines, but at the expense of load balance. Which value of k is best will depend on the hardware. A. Gupta (2015, private communication) has extended this approach to include pivoting when factorizing a symmetric indefinite matrix. Because of delayed pivots, the frontal matrices may be bigger than anticipated during analysis and the index lists need to be dynamic. This has been incorporated in WSMP and the load balance usually remains good. The approach could also be used for factorizing an unsymmetric matrix A using the pattern of A + AT but Gupta reports getting better sparsity preservation by using rectangular frontal matrices, see Section 13.12. 13.9.3

Use of memory

In a parallel computing context, memory use is much more difficult to control and limit than in the serial case. This is true for two related reasons. First, it is not possible to use a postorder traversal of the tree while exploiting the tree

300

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

parallelism, since processes work on different branches of the tree at the same time. Thus, the use of a simple stack to hold the generated element matrices as in the sequential case is not possible. Secondly, because many processes are active during the factorization, the global memory requirement will increase compared to a sequential execution; the memory per process should decrease, but the overall memory requirement will increase. 13.9.4

Static and dynamic mapping

The assembly tree gives a good indication of how the factorization can exploit parallelism. For good efficiency, how we distribute the work at the nodes of the tree to the different processes is very important and we will now discuss methods for doing this. These methods can be divided into two classes. The first is a mapping of nodes of the tree to processes done at the analysis and tree generation phase. We call this static mapping. This mapping is used to distribute the work and storage over the processes in a balanced manner and to establish the communications required for factorization. The phase is sometimes known as static scheduling since it determines how the work is scheduled. This phase is done without regard to the numerical values of the matrix entries. If we know exactly the amount of work at each node, for example when the matrix is positive definite and numerical pivoting is not required, then we can exploit node parallelism by distributing the computation within each large node over more than one process. If numerical pivoting is needed, we will not know a priori the sizes of the matrices at the nodes (see Section 13.3) or the work involved. We thus require some mapping and scheduling to be performed in the numerical factorization phase. This is known as dynamic scheduling. In MUMPS, static mapping defines a master process for each node and, before the factorization at a large node, a second mapping subdivides the work and distributes it to different processes, called workers. In addition to accommodating numerical pivoting, dynamic scheduling can be useful in a multiuser environment because it can take account of all loads on the processes. We discuss strategies for static and dynamic mapping in the next two subsections. 13.9.5

Static mapping and scheduling

For tree parallelism, one can assign nodes to processes as soon as the assembly tree is constructed. We described how WSMP does this in Section 13.9.2, but there are several other algorithms for doing this. Most are based on the concept of levels or layers in the tree. If we start at the root, the children of the root are at the first level and their children are at the second level. As we continue to descend the tree from the root, there is less and less work associated with the subtree rooted at a node. It is likely that there are more and more nodes at each successive level until at some point there are likely to be far more nodes in the level than needed for obtaining tree parallelism. Although the exact point at which one should stop descending the tree is both tree and machine dependent, a good algorithm for deciding when to do so

PARALLEL FACTORIZATION

a

b

301

c

Fig. 13.9.5. Geist–Ng algorithm for the construction of layer L0 . L3 L2 L1 L0

Subtree roots

We thank our colleagues from Toulouse-IRIT and LIP-ENS Lyon for this figure.

Fig. 13.9.6. Layers in the assembly tree. is that of Geist and Ng (1989), illustrated in Figure 13.9.52 . It determines a set of independent subtrees, known as a ‘layer’, each of which can be assigned to a separate process or set of processes. Starting at the root, successive sets of nodes are constructed. In each set, a node representing a subtree with most work is identified (figure b). The set is then updated by removing the node and adding its children, as shown in figure c. The algorithm terminates either when there are as many subtrees as processes available in the parallel computer or there is no node in the set with sufficient workload to justify further splitting. This then determines layer L0 . Having established layer L0 , we then assign to layers the nodes above this layer in the tree. For i = 1, 2, ... any node all of whose children are in previous layers is placed in layer i, continuing until the root is reached. We illustrate this algorithm in Figure 13.9.6. It is then quite common to assign the processes to the nodes by layers in a round-robin fashion. We now consider more sophisticated algorithms for mapping the tree nodes to processes. A subtree-to-subcube mapping was used to reduce the communication overhead in parallel sparse Cholesky factorization on hypercubes (George, Liu, and Ng 1987) and is used sometimes in a more general context. This mapping was originally developed to address the parallelization of the factorization of 2 Reprinted from Computer Methods in Applied Mechanics and Engineering, Vol 184, P. R. Amestoy, I. S. Duff, and J.-Y. L’Excellent, Multifrontal parallel distributed symmetric and unsymmetric solvers, Page 184, Copyright 2000, with permission from Elsevier.

302

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

matrices arising from regular meshes when using nested dissection. The basic idea of the algorithm is to build a set of p subtrees (where p is the number of processes) on the same level and then assign a subtree to each process. The nodes of the upper parts of the tree are then mapped using a simple rule: when two subtrees merge together into their parent, their set of processes are merged and assigned to the parent. The method reduces the communication cost, but can lead to load imbalance particularly on irregular assembly trees. A bin-packing based approach was then introduced to improve the load balance. In this scheme, starting from the root node, the assembly tree is explored until a set of subtrees on the same level, whose corresponding mapping on the processes produce a ‘good’ balance, is obtained (the balance criterion is computed using a first-fit decreasing bin-packing heuristic). Once the mapping of the set of subtrees at this level is done, all the processes are assigned to each remaining node (that is, the nodes in the upper parts of the tree). This method takes into account the workload to distribute processes to subtrees. However, it does not minimize communication in the sense that all processes are assigned to all the nodes of the upper parts of the tree. The proportional mapping algorithm improves upon these two algorithms. It uses the same ‘local’ mapping of the processes as that produced by the subtreeto-subcube algorithm with the difference that it takes into account workload information to build the layer of subtrees. To be more precise, starting from the root node to which all the processes are assigned, the algorithm splits the set of processes recursively among the branches of the tree according to their relative workload until all the processes are each assigned to a single subtree. This algorithm is characterized by both a good workload balance and a good reduction in communication. This technique is used by used by several codes and is examined in detail by Sid-Lakhdar (2014). It is used by WSMP for unsymmetic matrices and is described in Section 13.12. 13.9.6

Dynamic scheduling

Dynamic scheduling may be used to balance work and storage at execution time when the initial mapping is not appropriate for numerical or loading considerations (Amestoy et al. 2001; Amestoy, Guermouche, L’Excellent, and Pralet 2006). In order for each process to take appropriate dynamic scheduling decisions, an up-to-date view of the load and current memory usage of all the other processes must be maintained. This is done using asynchronous messages, as described by Guermouche and L’Excellent (2005). The nodes of the tree can be processed on a single process or can be split between several processes. In MUMPS, a one-dimensional splitting and mapping can be used within intermediate nodes and a two-dimensional splitting can be used at the root node. The root node is by default factorized using ScaLAPACK (Blackford, Choi, Cleary, D’Azevedo, Demmel, Dhillon, Dongarra, Hammarling, Henry, Petitet, Stanley, Walker, and Whaley 1997).

PARALLEL FACTORIZATION P0 P1

P0

303

P0

P2 P3

P2

P0 P1

P0

Type 3

Type 1 P0 Type 2

Type 2 P0 P1 P2

P0

Type 2

P0

P2 P3 P0

P2

P1

P3 P0 P1 P2

P3

P2

P3

P3

SUBTREES

Fig. 13.9.7. Splitting of nodes and assignment to processes. We show in Figure 13.9.73 how the nodes are split between processes in the MUMPS code. • Type 1: Parallelism of the tree. • Type 2: 1D partitioning of frontal matrices with distributed assembly process. • Type 3: 2D partitioning of root node (ScaLAPACK). Dynamic scheduling is very complicated and is difficult to implement efficiently, so even when numerical pivoting might be required in the factorization of indefinite or unsymmetric systems, there is an option in many codes to use a static pivoting approach (see Section 13.4). 13.9.7

Codes for shared and distributed memory computers

In the shared-memory case, a major issue is the management of the storage of the generated element matrices that can no longer be stored on a stack. If an allocatable or pointer array is used for each of them, the performance is dependent on the efficacy of the memory management system. An alternative is to use space within large arrays with clever re-use of freed space and occasional garbage collection. Duff (1986) discusses these issues and Duff (1989) examines the implementation of a multifrontal scheme on the Alliant FX/80. Towards the middle of the 1990s shared memory parallelism became less popular largely because of the problems of bus contention at even moderate numbers of processes, but there has been a recent revival of this mode with the advent of multi-core CPUs and GPUs. On a distributed memory architecture, generated element matrices may reside on different processes and the data management is much more complicated. A message passing system will be required with MPI being almost invariably the system of choice. 3 Reprinted from Computer Methods in Applied Mechanics and Engineering, Vol 184, P. R. Amestoy, I. S. Duff, and J.-Y. L’Excellent, Multifrontal parallel distributed symmetric and unsymmetric solvers, Page 184, Copyright 2000, with permission from Elsevier.

304

13.10

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

The use of low-rank matrices in the factorization

If a dense matrix A of order m × n has numerical rank k that is less than min {m, n}, it is possible to represent the matrix A as the product of two matrices X and Y of dimensions m × k and n × k as A = XYT .

(13.10.1)

This representation is called the low-rank form for A. We first focus on the use of this form before discussing methods for obtaining X and Y. We note that the storage for the low-rank form is k(m + n) which is often significantly less than the storage of mn as a full matrix. Additionally, matrices in low-rank form can exhibit reduced operation count for some basic operations. For example, given a lower triangular matrix L of dimension m, L−1 A can be expressed as (L−1 X)YT , which changes the number of operations from mn(m + 1) to mk(m+2n+1) and, thus, there are fewer operations if k < min(m, n)/3. The multiplication of two matrices A1 and A2 of dimensions m × r and r × n requires 2mnr operations, but if they are of rank k1 and k2 with low-rank forms X1 Y1T and X2 Y2T , respectively, the multiplication can be performed as (X1 (Y1T X2 ))Y2 or X1 ((Y1T X2 )Y2 ), and the number of operations is 2k1 k2 (r + m) + 2k2 mn or 2k1 k2 (r + n) + 2k1 mn. We may choose whichever needs fewer operations. This idea of a low-rank approximation can be used within the dense frontal matrices of a multifrontal scheme or, indeed, within any supernodal scheme. It has been observed (and can to some extent be shown theoretically) that discretizations of elliptic partial differential equations will result in frontal matrices that have many off-diagonal blocks that can be approximated by lowrank matrices. The low-rank form can be obtained by using a singular value decomposition USV. For a given threshold , we can obtain an approximation A such that kA− Ak2 ≤  by replacing all singular values (diagonal entries of the diagonal matrix S) that are less than  by zero. At much reduced computational cost, a rankrevealing QR factorization allows the same effect to be obtained approximately. Several schemes have been proposed to exploit this. Two of the more established approaches are using H-matrices (the H stands for Hierarchical) and Hierarchically SemiSeparable (HSS) matrices. In the H-matrix approach, each generated element matrix is analysed to decide whether to keep it as a full matrix, replace it with a low-rank approximation, or partition it. If it is partitioned, each of the resulting blocks is treated similarly, continuing until no further partitioning occurs. A detailed description of H-matrices can be found in the book of Bebendorf (2008). The HSS scheme (Xia 2013), uses similar partitions, but with a fixed number of recursions and a different algebraic scheme involving approximations for a block being formed after approximations for all its subblocks have been formed. Xia has shown that the work and storage for solving a three-dimensional Poisson or Helmholtz equation on a n × n × n grid, using a multifrontal method, can be reduced from O(n6 ) and O(n4 ) to

THE USE OF LOW-RANK MATRICES IN THE FACTORIZATION

305

Fig. 13.10.1. Illustration of the BLR blocks for the final frontal matrix of the discretized Laplacian on a 128 × 128 × 128 grid with blocks of size 512 × 512 and  = 10−14 . The darkness of each block shows its rank. Figure is from Amestoy, Ashcraft, Boiteau, Buttari, L’Excellent, and Weisbecker (2015a).

O(n4 log n) and O(n3 log n), respectively. However, Weisbecker (2013) reports that the multiplying factors are large. A much simpler approach to low-rank approximation has been developed by Amestoy et al. (2015a) and is discussed in detail in the thesis by Weisbecker (2013). The approach is called the Block Low-Rank or BLR format and is a non-hierarchical block matrix format. Essentially, the frontal matrix is just partitioned into blocks and a test is done on each off-diagonal block to see whether to represent it in low-rank form or to hold it in full form. Blocks on the diagonal are always held in full form. We show the illustration of this by Amestoy et al. (2015a) in Figure 13.10.1. Weisbecker (2013) performed extensive experiments on the choice of block size for the partitioning and the threshold  for replacing singular values by zero, and shows that his simple approach can produce compressions that are comparable with those of the H and HSS schemes (on regular problems, they are asymptotically slower but faster for modest mesh sizes). On some problems, good compression was obtained even with  = 10−14 . He found that small block sizes, such as 64 and 128, were undesirable because of poor full Level 3 BLAS efficiency. He also showed that pivoting can be incorporated, which would be very difficult for the H-matrix or HSS schemes. In Table 13.10.1, we show some of results from Amestoy et al. (2015a) using development versions of MUMPS running sequentially on three realistic test problems supplied by Electricit´e de France. The results show that with  = 10−14 the BLR method can get accurate answers in about half the storage for the factors and about 10% of the floating-point operations in about 1/3 of the time of the normal multifrontal code (shown as ‘full’). The storage, operations, and time can be reduced further by increasing , but then iterative

306

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

Table 13.10.1 Weisbecker’s results for three matrices using development versions of MUMPS. CSR stands for the |b−Aˆ x|i componentwise scaled residual maxi (|b|+|A| |ˆ x|)i and IR stands for iterative refinement.  full 1e-14 1e-06

|L|

Ops

2.6GB 4.7e11 86.3% 67.3% 65.8% 33.3%

full 200GB 1.5e14 1e-14 55.2% 8.9% 1e-08 48.7% 5.6% full 128GB 2.7e14 1e-14 48.2% 17.4% 1e-06 24.8% 3.9%

Secs pompe 89 81 67 amer12 16 747 6 260 5 792 dthr7 25 815 10 006 5 987

CSR CSR IR No.IR 2e-15 2e-10 7e-03

8e-16 7e-16 4e-13

1 1 5

8e-14 3e-14 4e-08

6e-16 6e-16 6e-16

1 1 4

5e-14 6e-13 1e-04

6e-16 5e-16 6e-16

1 1 8

refinement is necessary to recover a satisfactory solution. Greater values of  give an approximate factorization that is not suitable for simple iterative refinement, but can be used as a preconditioner for more powerful iterative methods, such as conjugate gradients. Note, however, that the time does not decrease as rapidly with increasing  values as the operation count; we believe that this is because the block sizes for Level 3 BLAS calls are smaller. 13.11 Using rectangular frontal matrices with local pivoting To address the disadvantages of working with the union of the patterns of A and AT , Davis and Duff (1997) suggested working with generated element matrices that are rectangular instead of square. Each reduced matrix is held as the sum of a set of rectangular element matrices and a matrix of original entries. Before the operations associated with a pivot can be performed, its row i and column j must be assembled into a frontal matrix. A complication is that an element matrix may involve row i without involving column j or vice-versa. A simple example is shown in Figure 13.11.1 where the pivot is an original entry a, entries only of the first element are shown as b, entries only of the second element are shown as c, and entries of both elements are shown as d. Here, the frontal matrix ab c0 cb b b

b c d b b

b ccc dcc b b

Fig. 13.11.1. A simple assembly.

USING RECTANGULAR FRONTAL MATRICES WITH LOCAL PIVOTING 307

consists of rows 1–3 and columns 1–4. The first element has rows 1 and 3–5, of which only rows 1 and 3 are assembled into the front. The second element has columns 1 and 3–6 of which only columns 1 and 3–4 are assembled into the front. The elements are therefore not fully assembled into the front, but remain active with reduced size. The pivotal strategy of Davis and Duff (1997) is based on the Markowitz count (ri − 1)(cj − 1) (13.11.1) (see Section 10.2), where ri and cj are the row and column counts, subject to the inequality (k) (k) |aij | ≥ u max |alj | (13.11.2) l

(see Section 11.11), but is relaxed by using approximations ri and cj for the counts when the exact values are not readily available. These approximations are upper bounds and are found by calculations that are analogous to the AMD bounds (Section 11.3). The pivots are chosen in groups, using a single frontal matrix and producing a single generated element matrix. The first pivot of a group is found by searching a small number of columns of smallest approximate column count (by default, four columns). Each candidate column has to be assembled because its numerical values are needed for the test (13.11.2). This also provides the exact column count. The pivot is chosen from the entries of the candidate columns to minimize (ri − 1)(cj − 1)

(13.11.3)

subject to the inequality (13.11.2). Once the pivot has been chosen, the assembled columns that were not chosen are discarded and the pivot row is assembled. We now have the pivot row and the pivot column. These are the first row and column of the frontal matrix, which is rectangular. For each element k whose rows overlap those of the front, we find the number of rows w(k) it has outside the front by using the algorithm shown in Figure 13.11.2. Note how similar this is to Figure 11.3.2. If w(k) is zero, all the rows of element k lie within the front. In this case, all the columns of the element that overlap the front can be assembled into the frontal matrix to leave a smaller do i = each row of the front do k = each element involving i if (first occurrence of k) w(k) = no. of rows in k w(k) = w(k) - 1 end do end do

Fig. 13.11.2. Pseudocode for counting the number of rows outside the front for each element that overlaps it.

308

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

Front

Front Element

Front

Front

Element

Element

Element

Fig. 13.11.3. The ways that an element and the front may overlap. rectangular element matrix or sometimes to totally absorb the element matrix (see the first two diagrams of Figure 13.11.3). For a non-pivotal column j of the front, the number of original entries outside the front is found and a bound for the total number of entries outside the front is computed by adding to it the sum of the values of w(k) for all the elements that contain column j. If this bound is zero, column j is fully summed. Otherwise, the new estimated column count cj is taken as the least of this bound plus the number of non-pivotal rows in the front, the number of pivotal steps remaining, and the previous estimate plus the number of non-pivotal rows in the front. Computing an exact column count would involve merging the index lists of all the elements that involve it, whereas computing the approximate count requires only adding the w(k) values. The calculation of Figure 13.11.2 involves a similar total amount of work. Therefore, we can expect the calculation of approximate column counts to be much more economical. A similar calculation is performed for each element k whose columns overlap those of the front to find the number of columns w(k) it has outside the front. If w(k) is zero, all the rows of element k that overlap the front can be assembled into the frontal matrix (see the third diagram of Figure 13.11.3). Note that an element that overlaps the front and has both rows and columns outside the front (see the fourth diagram of Figure 13.11.3) is not partially absorbed because of the complexity of handling non-rectangular elements. For each non-pivotal row i of the front, the values of w(k) are used to bound its number of entries outside the front. If this bound is zero, we can label row i as fully summed. Otherwise, the new estimate ri is taken as the least of this bound plus the number of nonpivotal columns in the front, the number of pivotal steps remaining, and the previous estimate plus the number of non-pivotal columns in the front. The reduced matrix then has the form shown in Figure 13.11.4 (after reordering) and any entry in a summed row and in a summed column has the same Markowitz cost as the first chosen pivot and is available as pivot, too. As many of these as possible are used successively as pivots, all subject to the threshold test (13.11.2) and with updating only of the fully-summed rows and columns. Further rows and columns may be added to the front to allow more pivots to be chosen and hence better use made of Level 3 BLAS when the generated element is eventually constructed. This is controlled by a parameter g with

USING RECTANGULAR FRONTAL MATRICES WITH LOCAL PIVOTING 309

Frontal Remainder sum’d other Frontal summed

0

Frontal other

Remainder

0

Fig. 13.11.4. Reduced matrix after assembling the front. default value 2.0. The number of rows is permitted to increase to g times its initial value and similarly for the number of columns. The non-pivotal columns of the front are treated in order of increasing cj . Each is assembled, the previous operations of this frontal step are applied, and a pivot is chosen, again minimizing the approximate Markowitz count (13.11.3) and subject to the stability condition (13.11.2). The pivot row is then assembled and the front is expanded to accommodate any extra rows in the pivot column and any extra columns in the pivot row. The frontal matrix is now processed in the same way as for the first pivot. The calculation of Figure 13.11.2 may be continued so that work is needed only for the new rows (a similar remark applies to the corresponding column calculation). The rows and columns are checked to see if they can be assembled into the front, as many pivots as possible are chosen from the submatrix of summed rows and columns, the pivotal operations are applied and the bounds ri and cj for the non-pivotal rows and columns of the front are revised. This continues until a pivot column or pivot row is encountered that has too many entries for the allocated size of the front. Finally, Level 3 BLAS are used to calculate the generated element matrix. Note that, unlike in the symmetric case, the rows and columns of the generated elements are not necessarily absorbed into the front at the same time. It is convenient to flag each row and column whenever it is absorbed and discard the element once it has been completely absorbed. The ideas explained in this section have been implemented in the HSL code MA38 and in UMFPACK. Some comparisons of UMFPACK with MA48 are shown in Table 13.11.1, where the results from the codes other than MA48 are from Amestoy and Puglisi (2002). UMFPACK is usually better if there is significant fill-in and MA48 is usually better if there is little fill-in. UMFPACK is the default code for unsymmetric sparse systems in MATLAB.

310

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

Table 13.11.1 Comparisons of UMFPACK with other codes. Matrix av41092 Order (×103 ) 41.1 Nonzeros (×106 ) 1.684 Symmetry index 0.08 Entries in factors (×106 ) MA41 14.0 10.6 MA41 UNSYM SuperLU 43.7 UMFPACK 33.0 MA48 28.1

13.12

lhr71c twotone onetone1 wang4 70.3 120.7 36.1 26.1 1.528 1.224 0.341 0.177 0.21 0.43 0.43 1.00 11.7 9.4 7.3 6.5 7.7

22.1 17.0 21.2 10.0 8.2

4.7 3.9 4.9 4.7 6.1

11.6 11.6 26.2 43.5 23.0

Rectangular frontal matrices with structural pivoting

Rectangular frontal matrices have also been used by Gupta (2002a) in WSMP for handling unsymmetric matrices. In common with several other codes, he uses a static data structure based on applying a fill-reducing algorithm to the pattern of A + AT to find a symmetric permutation of A that defines an initial pivot order. The matrix A is sorted so that the lower triangular part of the permuted matrix is held by columns and the upper triangular part of the permuted matrix is held by rows. This allows row and column i (in the permuted order) to be assembled into the front for pivot i. For the rest of this section, we will refer to the permuted matrix as A. Note that these fronts are usually smaller than those of codes that continue during factorization to work with the pattern of A + AT . There, the frontal matrix is square with a single index list that consists of the indices of all the rows in the pivotal column and all the columns in the pivotal row. Gilbert and Liu (1993) introduced the concept of a pair of DAGs to characterize the structure of the L and U factors of A. The DAGs are the transitive reductions4 of the graphs of L and U. We will call them the ‘L-DAG’ and the ‘U-DAG’. If front k contains row ` > k (that is, if L has an entry l`k ), row ` of its generated element must be calculated and added into front ` before the task of performing pivotal operations can take place there. This requirement is met by any ordering of the tasks which is such that for every edge i → j in the L-DAG or the U-DAG, task i precedes task j. Gupta therefore merges the L-DAG and the U-DAG into a single task-DAG. Edges from the L-DAG, the U-DAG, or both are called L-edges, U-edges, or LU-edges, respectively. For each edge i → j, j is called a ‘parent’ of node i. Note that a node may have more than one parent. Unfortunately, it can happen that some entries in a generated element are absent from the fronts of all its parents, which led Gupta (2002a) to add additional links to ancestors. He calls this new DAG the ‘data-DAG’. 4 The transitive reduction of a graph G is the subgraph that has an edge for every path in G and has no proper subgraph with this property.

RECTANGULAR FRONTAL MATRICES WITH STRUCTURAL PIVOTING 311

With pivoting in the symmetric case, if a pivot cannot be chosen in a front, it may be delayed to its parent and reconsidered there. Gupta shows that there is a similar property in the unsymmetric case. There is an ancestor to which the pivot may be delayed. The simplest case is where there is an LU-parent in the task DAG. In this case, if the pivot had been acceptable, all the rows and columns of the generated element would be present in the LU-parent. When it is not acceptable, all we need to do is add the rejected row and column to the parent and reconsider the rejected pivot there. Gupta shows that there is always an ancestor in the task-DAG that can play this role; all the rows and columns of the generated element that would have been generated are either pivotal at descendants of the ancestor or are present in the front of the ancestor, so the rejected pivot can be reconsidered there. The column of the rejected pivot must be added not only to this ancestor, but also at all nodes that are on a path to the ancestor in the data-DAG and lie beyond an L-edge; at these nodes, the rejected pivot is not available for reconsideration because the whole of its row cannot be added there. Similarly, the row of the rejected pivot must be added to all nodes that are on the path to the ancestor and lie beyond an U-edge; and the rejected pivot is not available for reconsideration because the whole of its column cannot be added. Unfortunately, this is not always sufficient. Sometimes, extra edges needed to be added to the data-DAG to hold some of the entries of the row or column of the rejected pivot. The full details of the extra edges that are needed in the data-DAG, both for the case without pivoting and for the case with pivoting, are quite complicated and are contained in the paper of Gupta (2002a). As in the symmetric case, pivots delayed from one front may be further delayed from the front in which they are reconsidered. For simplicity, we have not mentioned supervariables so far in this section, but the generalization is straightforward. In each front, any entry in the pivot block is a pivot candidate and as many as possible are chosen. Even without supervariables initially, this situation occurs once pivots are delayed. Gupta determines the structure of the LU factorization of A, including the determination of its supervariables. Two adjacent variables i and i+1 are treated as belonging to a supervariable if the patterns of columns i and i + 1 of L from row i + 1 onwards are identical or nearly so and the patterns of rows i and i + 1 of U from column i + 1 onwards are identical or nearly so. He uses the supervariables to condense the representation of the pattern of the factorized matrix and accelerate the actual factorization by working with blocks. Gupta (2002a) reports on runs on 25 real-life problems and finds that on average the data-DAG contained about 4% more edges than the task-DAG and that most of these were added to support the non-pivoting case. The extra time needed averaged about 12% without pivoting and an extra 1 21 % to support pivoting. He compared a version of MA41 that took account of zero rows and columns in the front with a version of WSMP that also used AMD ordering. In one case (pre2), WSMP succeeded when MA41 failed, presumably through lack of

312

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

memory. The largest ratio of operation counts was 21.3 and the median was 1.58. However, in five cases the ratio was less than 1.0, with minimum 0.37. The parallel version of WSMP is described by Gupta (2007). He uses a static mapping based on the work at supernodes when no interchanges are needed. A tree is constructed from the data-DAG by discarding all parents of a node except one with longest weighted path to the root. This choice is made in the hope of minimizing the time before processing the parent front can start. All the processes are first assigned to the root and then recursively the set of processes at a node is subdivided among its children roughly in proportion to the total work in the subtrees rooted at the children until nodes with a single process are reached. This mimics the load-balancing effect in the symmetric case of using a binary tree and always dividing the processes at a node equally between its two children. The number and distribution of processes at a node is limited for the sake of efficient execution of the operations within the front. Fronts are block-cyclically assigned to processes using an m × n process grid. m is 1 if the number of processes is less than 6, which means that no communication is needed when choosing a pivot from a column. Otherwise, m ≤ n, m ≈ n, and m is a power of 2. This choice of m is designed to make choosing a pivot efficient. The effect of all these requirements is that not all numbers of processes are possible at a node. Example of numbers not permitted are 7, 9, 13, 14, and 15. For the operations by a process within a front, shared-memory threading may be employed. Gupta reports on runs on 25 real-life problems, many those used in his sequential tests, and with up to 32 CPUs. On the whole, he found that the code works best with 2 threads for each process. In his experiments, WSMP compares favourably with MUMPS and SuperLU, although all codes have been upgraded since that time. 13.13

Trees for unsymmetric matrices

Eisenstat and Liu (2005) have generalized the definition of the elimination tree to accommodate the structure of unsymmetric matrices. They make use of the L- and U-DAGs of Gilbert and Liu (1993), see Section 13.12, using paths in these graphs. The elimination tree for an unsymmetric matrix is defined, as in the symmetric case, by identifying the parent node for any node k of the tree. In this case using the expression L

U

min{i|i ⇒ k ⇒ i},

(13.13.1)

L

where the expression i ⇒ k means that there exists a path between node i and node k in the L-DAG. If the matrix has symmetric structure, the tree is just the elimination tree, defined by the parent of node k being given by min{i|lik 6= 0}.

(13.13.2)

TREES FOR UNSYMMETRIC MATRICES

313

This is easy to show, and we leave this as an exercise for the reader (Exercise 13.10). Eisenstat and Liu (2005) show that using a particular postordering of this tree to reorder the unsymmetric matrix will give a reordered matrix in bordered block triangular form. They show how numerical pivoting can be later applied in a similar way to what we have already discussed in Section 13.3. Although it is some years since this work, we are unaware of any significant effort in implementing and comparing approaches based on this generalization and know of no available codes for using these structures. It would seem that the main problem is being able to exploit sparsity fully, while being able to maintain numerical stability through dynamic pivoting at the factorization stage. Exercises 13.1 Using the unsymmetric matrix 

×

×



 × ×      × × ×     ×   × × × × ×     × × × × from Amestoy and Puglisi (2002), show the assembly trees that would be obtained both when asymmetry is ignored and when it is exploited. 13.2 Create a plausible explanation for the disparity of results displayed in Table 13.2.1. 13.3 Justify the statement in Section 13.3 that the overhead of the extended-add operation is linear in the front size. 13.4 In Section 13.4, we discussed the effect of modifying a small pivot and correcting for the modification using iterative refinement. We stated that this will tend to fail when the matrix is singular. Demonstrate this for the singular matrix  A=

 1.0 1.0 , 0.0 0.0



 1.0 . 1.0 13.5 In Table 13.4.1, we see a dramatic reduction in time and size for the matrix dtoc using static data structures with perturbed small pivots. What might be going on here?  0 B [Hint: This symmetric matrix has the form , with B having 14 997 rows, 5 001 BT 0 singleton rows, and no singleton columns.] using right-hand side

13.6 At the end of Section 13.9.1, we stated ‘A nested dissection based ordering will tend to give broader trees that yield more parallel processes.’ Why would you expect this to be the case?

314

GRAPHS FOR SYMMETRIC AND UNSYMMETRIC MATRICES

13.7 Referring to Figure 13.9.4, how does the parallelism at the leaves (bottom of the figure) differ from the parallelism at the top of the figure? 13.8 Again, referring to Figure 13.9.4, for a very large problem, consider the possibility of a fourth hierarchy in the different parallel structures. What might this look like? 13.9 In Table 13.11.1, what factors might lead to the inconsistent results over the different algorithms across the different test problems? 13.10 Show that, if the matrix is symmetric, the expression (13.13.1) for trees for unsymmetric matrices will just give the elimination tree defined by expression (13.13.2).

Research exercises R13.1 When factorizing unsymmetric matrices, we showed a wide range of outcomes across the different test matrices and different algorithms (see Table 13.2.1). If you are working in a particular domain involving unsymmetric matrices, run tests on your class of problems to see if there is a clear best algorithm for this class of matrices. R13.2 To explore the issues of solution accuracy for different types of scaling for realistic-sized problems from your field, start with real cases or select matrices from the test collection that are well scaled. Then ‘unscale’ the matrix by replacing random rows and columns by small multiples of the original values. Then apply the automatic scaling methods to these poorly-scaled matrices and compare the results to solving the original problem, using a variety of right-hand sides. What do you conclude?

14 THE SOLVE PHASE We examine the SOLVE phase in the direct solution of sparse systems. Here we assume that the factors have been computed and we study the efficient use of these to determine the solution through forward and backsubstitution. We use the trees described in the previous chapters to study the efficient solution of sparse right-hand sides including the computation of null-space bases. We consider the parallelization of the SOLVE phase.

14.1

Introduction

When using direct methods to solve a dense linear system, the SOLVE phase, that is the use of the matrix factors to obtain the solution using forward and back-substitution, is of far lower complexity than the actual factorization, being of order O(n2 ) rather than O(n3 ). Thus, the SOLVE phase has not attracted as much attention as the factorization. In the case of sparse matrices, the SOLVE phase also normally has lower complexity than the factorization, with floating-point arithmetic for a single right-hand side proportional to the number of entries in the factors. For this reason, there has been relatively little effort to optimize this part of the solution process. There are three main reasons why we should be concerned with this phase. The first is that it is not always true that the complexity is much lower for SOLVE as the complexity of the factorization phase can be quite low, of order O(n2 ) or even O(n) (as, for example, for tridiagonal systems). The second is that so much effort has been spent in accelerating the factorization phase on both sequential and parallel computers that the time for factorization can sometimes be comparable or even less than the time for SOLVE. This problem is greater on parallel machines because there is less scope for exploiting parallelism than in the FACTORIZE phase. A third reason for the importance of the SOLVE phase, which we have already discussed in Section 10.5, is that there are many applications where many hundreds of calls are made to this phase for each factorization so that the efficiency of this phase can be crucial to the overall cost of the computation. A major issue when considering the efficient implementation of SOLVE is that the ratio of communication to computation is high since each entry of the factors is used only once for each right-hand side or set of right-hand sides. This means that there is the same amount of data movement as arithmetic and many of the techniques for accelerating the factorization are not available to us. Direct Methods for Sparse Matrices, second edition. I. S. Duff, A. M. Erisman, and J. K. Reid. c Oxford University Press 2017. Published 2017 by Oxford University Press.

316

THE SOLVE PHASE

We discuss the levels within SOLVE where we can obtain improvements in performance in Sections 14.2 and 14.3. We consider the situation for sparse righthand sides in Section 14.4 and for multiple right-hand sides in Section 14.5. We give an illustration of an application, which involves sparse right-hand sides in Section 14.6. Finally, we discuss the parallelization of SOLVE in Section 14.7. In this chapter, we focus on the case where the factors have already been computed. We noted in Section 7.9 that sparse right-hand-side vectors may be added as extra columns of the matrix and included in the forward elimination process, influencing the pivot choices. There will be efficiency gains from doing this if a solution is required for only a single right-hand side or set of right-hand sides. Also, it means that the advantages of parallelization can be obtained for forward substitution. 14.2

SOLVE at the node level

When we discussed the implementation of the SOLVE phase in Section 10.5, we were concerned with the detailed loop structure. However, in our later discussion of sparse factorization in Chapters 11–13 we have stressed the importance of using dense blocks as the kernel of our sparse factorization. Here, we study the SOLVE phase with the same orientation. We assume that the factors are stored in blocks corresponding to variables eliminated at the same node of the assembly tree. We first consider the solution of Ux = b. A block pivot row at a tree node for U is illustrated in Figure 14.2.1. We can view this as part of a block row of the block triangular form shown in equation (3.13.4), that we will denote as Ukk Uk∗

(14.2.1)

because, in this sparse case, many of the blocks Ukj may be zero or contain zero columns. One step of the block back-substitution (see (3.13.5)) consists of solving Ukk xk = bk − Uk∗ xk∗ , (14.2.2) where xk∗ is the part of x that corresponds to Uk∗ . Indexing vectors are held to identify the rows and columns of the block. We note that, when the matrix is NCOL

NPIV

Fig. 14.2.1. Block pivot row of U with columns from index set NCOL and rows from index set NPIV.

SOLVE AT THE NODE LEVEL

317

symmetric, the indices in NPIV are just the first size(NPIV) indices of NCOL and so they are not held in a separate array. We recall from Section 10.5 that we use the term active vector to mean the vector of length n that starts by being the vector for the right-hand side, b, and then is modified during the substitution process, finishing as the solution vector, x. There are two ways of performing the block back-substitution (14.2.2). In the first, each row is treated separately as a packed vector and indirect addressing is used in the innermost loop. The other possibility involves loading all the components of the active vector identified by the index set NCOL into a vector of length size(NCOL), that we call the node vector. We perform the operations using direct addressing on the node vector and then unload the size(NPIV) components of the node vector corresponding to xk into the active vector. The latter approach will be better if the savings in using direct over indirect addressing are greater than the cost of the loading and unloading. This will be true on most computers unless size(NPIV) is very small, and we would expect the use of direct addressing to be particularly efficacious on modern computers. We illustrate this by the runs in Table 14.2.1 where we see that using direct addressing is around 60% faster. The forward substitution using block columns of L proceeds in a similar fashion, but now we use the alternative scheme shown in equation (3.13.3), which here takes the form c(1) = b and for k = 1, 2, . . . , m: (k)

Lkk xk = ck (k+1)

cj

(k)

= cj

(14.2.3a) − L∗k x∗k .

(14.2.3b)

A block column of L is structurally the transpose of Figure 14.2.1 and is exactly the transpose if the matrix is symmetric. Now all the components of the active vector corresponding to those in the temporary node vector of length abs(NCOL) are altered and need to be copied back into the active vector. Using the temporary node vector aids efficiency by enabling the use of Level 2 BLAS for both forward and back-substitution, but the bad ratio of memory access to floating-point computation remains. Table 14.2.1 Difference between using direct and indirect addressing in SOLVE. Runs of MA57 on a Dual-socket 3.10GHz Xeon E5-2687W, using the Intel compiler with O2 optimization. Times in seconds. Matrix

Order Entries Solve times (103 ) (106 ) Direct Indirect pwtk 217.9 5.926 0.096 0.143 4.104 0.099 0.165 ship 003 121.7 bmwcra 1 148.8 5.396 0.122 0.201 bmw3 2 227.4 5.758 0.093 0.141 cfd2 123.4 1.606 0.088 0.145

318

14.3

THE SOLVE PHASE

Use of the tree by the SOLVE phase

In Section 14.2, we considered the operations at a single node without discussing the interaction between operations at different nodes. Indeed, if the node structure portrayed in Figure 14.2.1 is used without taking into account the structure of the tree, data must be loaded from the active vector to the working vector and back at each stage. By using the tree structure we can avoid some of this data movement. The pivotal order is based on visiting the nodes of the tree in a topological order. This means that during back-substitution the nodes are visited from the root with each node preceding its children and during forward substitution each node is visited before its parent. Note that the variables in the set NCOL-NPIV (that is the indices in NCOL corresponding to variables that are not eliminated) will also appear in the parent node. During back-substitution, once a node has been processed, its node vector will consist of components of x. If this vector is retained while processing a child, the child’s vector x∗k can be extracted from it. This is likely to be more efficient than accessing the active vector because the data will be less scattered. It will still, of course, be necessary to load bk and store xk . Forward substitution can be processed in a similar way, but a component of the active vector may be updated by more than one child of a node. Therefore, the updates at a node must be made to the vector at the node’s parent. For unassembled matrices, the right-hand side vector may be treated similarly to the matrix itself in a frontal fashion. No active vector is used during forward substitution. Instead, the contribution of an element is added into the node vector of the first node in which one of its variables is in NPIV. 14.4

Sparse right-hand sides

There are many cases when the right-hand-side vector is sparse or when only a subset of components of the solution vector is required. We have already shown a simple case of this in Section 7.9. It is easy to see that during forward substitution, the only nodes that need be accessed are those on the paths in the tree between the root node and nodes with variables in NPIV corresponding to nonzero entries of the right-hand side. This is because the vector is zero on other nodes, so does not need to be computed. For back-substitution, the relevant nodes are those on the path from the root to nodes corresponding to the components required. This is because beyond this all variables are unwanted and need not be computed. These properties are exploited in the papers by Amestoy et al. (2010) and Amestoy, Duff, L’Excellent, Robert, Rouet, and U¸car (2012). The identification of the nodes that are required to solve a lower triangular system with a sparse right-hand side is the same as using the Gilbert–Peierls algorithm to identify which previous columns of the matrix are required in the factorization phase, see Chapter 10.

MULTIPLE RIGHT-HAND SIDES

319

For sparse right-hand sides, it is important to use the technique explained in Section 14.3 to avoid copying data unnecessarily to and from the active vector. We already met sparse right-hand sides in Section 7.9 and also in Section 10.6 in the context of linear programming using the simplex method. Here, the current basis matrix is sparse as is the incoming column on which it operates. An added gain is found because the matrices involved are often reducible so operations are restricted to particular parts of the incoming vectors. Hall and McKinnon (2005) exploit sparsity in vectors in all parts of their simplex implementation and derive considerable benefits from so doing. 14.5

Multiple right-hand sides

There are many instances where we wish to solve simultaneously for a block of right-hand side vectors. This can happen in parametric studies or in methods for obtaining sets of eigenvalues. It can also happen when different right-hand sides correspond to computing the responses to different events. In this case, there can be more reuse of the factors and the ratio of data access to computation improves to the level of the number of right-hand sides. Level 3 BLAS can be used instead of Level 2 BLAS. We discussed this already in Section 5.6 with the special case of computing multiple entries of the inverse considered in Section 5.7. When using the methods of Section 14.4 for multiple sparse right-hand sides, we need the union of the subset of nodes corresponding to the constituent righthand sides so that often most or all of the tree nodes will need to be accessed. Also, if we have many right-hand sides then storage requirements might prevent us from solving for all of them simultaneously. It is then necessary to solve for blocks of right-hand sides at a time and the choice of blocking can have a significant bearing on efficiency, for example if the matrix factors are held out-of-core (Amestoy et al. 2012) or in a parallel environment (Amestoy, Duff, L’Excellent, and Rouet 2015b). We discuss this further in Section 14.7. One case with multiple right-hand sides is in the computation of entries of the inverse. This computation is needed in a range of applications, particularly in statistics. We discuss the parallelization of this application in Section 14.7.3. We first discuss another application that has many sparse right-hand sides. 14.6

Computation of null-space basis

For the computation of a null-space basis, we assume that the matrix, necessarily singular if the null-space is not null, has the factorization A = LU, where without loss of generality permutations are omitted for the sake of clarity. We further assume that det(L) = 1 and the upper triangular matrix U is of the form   U11 U12 , 0 0

320

THE SOLVE PHASE

with the upper triangular matrix U11 nonsingular of rank r, say. This means that, to obtain a basis for the null space, we need to solve  L

U11 U12 0 0



X1 X2

 = 0,

 X1 has n − r columns. where X2 As L is nonsingular, we solve 



U11

   X1 U12 = 0, X2

that is, U11 X1 = −U12 X2 . X2 should be of full rank but otherwise is arbitrary and can be taken as the identity matrix to give the equation U11 X1 = −U12 .

(14.6.1)

This set of equations has a sparse right-hand side and is also solved for multiple right-hand sides. 14.7

Parallelization of SOLVE

When discussing the parallelization of the SOLVE phase, we assume that the matrix has already been factorized in parallel and that the same target architecture is in use, be it shared or distributed memory. In the latter case, the entries of the factors will be distributed across the processes of the parallel machine. We have noted that the parallelization of ANALYSE has not received much attention; a comparable level of inattention has been paid to the parallelization of SOLVE. In addition to the motivations for improving SOLVE given in Section 14.1, there is interest in parallelizing this phase to avoid avoid a costly redistribution of vectors or matrix after the numerical factorization on distributed memory machines and the added complexity for an out-of-core parallel implementation, and the fact that some applications require a huge number of SOLVEs relative to the number of factorizations. Although the effect on efficiency will differ, there is little difference in implementation whether the target parallelism is for shared or distributed memory machines, where we assume in the latter case that the factors are distributed as computed during the parallel factorization. As in the case of sparse factorization, SOLVE can be parallelized at more than one level.

PARALLELIZATION OF SOLVE

14.7.1

321

Parallelization of dense solve

At first glance the SOLVE phase for dense systems seems horribly sequential as in both the forward and back-substitution we require the solution of one equation before proceeding to the next. However, some parallelism can be extracted by blocking the system, as we did in Section 3.13. With multiple right-hand sides, the equivalent of (3.13.2) is Lkk Xk = Ck −

k−1 X

Lkj Xj , k = 1, 2, . . . , m.

(14.7.1)

j=1

We have thus replaced a large (sequential) triangular solve by m smaller ones and some calls to gemm that we know can be parallelized. Indeed, this is just the technique used in most implementations of the triangular solve BLAS, trsm. Note that the gemm operations corresponding to Lkj Xj can be computed in parallel or it could be more efficient to combine these matrices and let the gemm subroutine partition the larger matrix according to the fine details of its implementation. Although this provides some parallelism, the smaller triangular solves are still sequential. One can remove this by computing an explicit inverse for these blocks so that only matrix-matrix multiplication is required. Raghavan (1998) has investigated the use of selective inversion within a sparse multifrontal code and has shown speedups of between six and sixteen times over not doing the inversion when running on the Intel Paragon and IBM-SP2. More recently, Hogg (2013) has implemented this idea in a CUDA code for solving dense triangular systems on GPUs and has obtained a speedup of up to 30% over the simple partitioned approach. 14.7.2

Order of access to the tree nodes

The SOLVE phase can also be executed in parallel by following the same computational pattern as the factorization phase, that is, the pattern induced by the elimination tree. The efficiency of the SOLVE phase is very dependent on the order in which the tree nodes are accessed during the solution. For the forward substitution, we can access nodes from leaves to the root in the same order as during the factorization. However, the only requirement is that the children should be processed before the parent. This leaves a lot of flexibility for the ordering so we need not follow that used to generate the tree in the factorization phase. For back-substitution, the access is from the root to the leaves and as soon as a node is processed all of its children can be. Thus, the potential for tree parallelism increases as the computation proceeds. Perhaps the most important aspect when considering the parallelization of SOLVE is to process the factors held in the assembly tree simultaneously on quite different branches of the tree. For the forward substitution, accessing the nodes in a postorder is preferable in the sequential case, but for exploiting parallelism we need to start by working on leaf nodes that are far apart in the tree, that is, having all common ancestors close to the root. Thus, we would commence by accessing several leaf nodes simultaneously. For back-substitution, we should access the nodes by a breadth-first search approach rather than by the depth-first search associated with a postordering.
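The following sketch (ours, not from MUMPS or any particular code) groups the nodes of an assembly tree into levels; processing the levels deepest-first respects the children-before-parent rule for forward substitution, the reverse order serves back-substitution, and every node within a level may be processed simultaneously.

```python
def solve_levels(parent):
    """Group tree nodes into levels of mutually independent nodes.

    parent[i] is the parent of node i (-1 for a root).  The returned list is
    ordered for forward substitution (leaves towards the roots); reversing it
    gives an order suitable for back-substitution."""
    n = len(parent)
    depth = [0] * n
    for i in range(n):
        j = i
        while parent[j] != -1:      # walk up to a root, counting steps
            depth[i] += 1
            j = parent[j]
    levels = {}
    for i, d in enumerate(depth):
        levels.setdefault(d, []).append(i)
    return [levels[d] for d in sorted(levels, reverse=True)]
```

A production code would use dynamic scheduling rather than strict level-by-level synchronization, but the level sets already show why starting from leaves that are far apart exposes tree parallelism.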

14.7.3 Experimental results

We show in Table 14.7.1 some results from Amestoy et al. (2010) of runs of a prototype parallel solve in a version of MUMPS with dense right-hand sides. We see that there is reasonable scalability given the amount of data movement relative to the arithmetic.

Table 14.7.1 Elapsed times for a parallel solve. Results from the paper by Amestoy et al. (2010). Times in seconds on a Cray XD1.

Matrix         Order    Entries            Number of processes
               (x10^6)  (x10^6)         2     4     8    16    24    32
CAS4R LR15¹     2.423    19.6   Fwd   334   220   118    64    52    45
                                Bwd   270   191   100    84    75    70
audikw_1        0.944    39.3   Fwd     -   218   148   118    63    73
                                Bwd     -   234   166   121    76    80
GRID5M¹         5.000    53.8   Fwd     -   348   187    84    77    45
                                Bwd     -   321   223   133   117    77
COR5HZ¹         2.233    90.2   Fwd     -     -   298   220   195   186
                                Bwd     -     -   351   303   256   218
AMANDE¹         6.995   584.8   Fwd     -     -     -     -   475   351
                                Bwd     -     -     -     -   629   565
NICE9HZ¹        5.141   215.5   Fwd     -     -     -     -   572   472
                                Bwd     -     -     -     -   685   605

¹ Not in the Florida sparse matrix collection.
- indicates not enough memory to run the factorization phase.

We discussed an application with multiple sparse right-hand sides in Section 14.4. We show the performance of a parallel SOLVE code for this application in Table 14.7.2, extracted from table 4 in the paper by Amestoy et al. (2015b). In this table, ‘Baseline’ uses a postordering of the tree and does not exploit sparsity within the block of right-hand sides. ‘Improved’ is when we order for parallelism and take advantage of the sparsity within the block of right-hand sides rather than working with the union of the pattern of all the sparse right-hand sides. We see that while there is some parallelism exhibited by the baseline code, speedups are only around 2–3 on 32 processes, whereas with code that uses the ideas described in this chapter the speedups increase to over 20.

Table 14.7.2 Time in seconds for the computation of a random 1% of the diagonal entries of the inverse of six large matrices. Results comparing the sequential performance with the parallel performance on 32 cores (4 nodes of the Hyperion system), using the baseline algorithm and an improved algorithm, are from Amestoy et al. (2015b).

Matrix         Order    Entries          Processes
               (x10^6)  (x10^6)     1       32 Baseline   32 Improved
                                            Time(s)       Time(s)
audikw_1        0.944    39.3     1143        380              75
NICE20MC¹       0.716    28.1      945        245              43
bone010         0.987    36.3      922        327              70
CONESHL¹        1.262    43.0      803        293              48
Hook_1498       1.498    31.2     1860        720             173
CAS4R LR15¹     2.423    19.6      882        482             116

¹ Not in the Florida sparse matrix collection.

Exercises

14.1 We showed in Table 14.2.1 that the use of direct addressing allows faster SOLVE times compared with using indirect addressing. Would the same result be expected if A were tridiagonal? Why? What other cases might lead to a different result?

14.2 Show that the only nodes of the assembly tree that are involved in the forward elimination of a system whose right-hand side has a single nonzero entry in position i are those between the root and the node in which variable i is pivotal. Also show that for back-substitution, when we only require component i of the solution, the nodes involved are only those between the corresponding node and the root.

14.3 In Section 14.7, we showed that we could improve parallelization in the SOLVE phase by computing the explicit inverses of the smaller triangular blocks and using these, rather than the LU factors, for part of the SOLVE steps. What is the additional cost of this computation? Sketch the argument for why we would expect this to improve parallelization.

14.4 In light of the effectiveness (on parallel computers) of the use of the explicit inverse, why is it NOT a good idea to simply create the explicit inverse for the entire matrix A?

Fig. E14.1. A tree with 5 nodes.

14.5 In Section 14.7.2, we state that ‘the only requirement is that the children should be processed before the parent. This leaves a lot of flexibility . . .’. Sketch the structure of the lower triangular matrix L associated with the tree of Figure E14.1 to illustrate why this statement is true.

Research exercise

R14.1 For the case of multiple right-hand sides with differing sparsity patterns, how might the use of Level 3 BLAS be accommodated?

15
OTHER SPARSITY-ORIENTED ISSUES

We consider solving a problem that is closely related to one that we have previously solved or whose parts have previously been solved. We study backward error analysis for sparse problems, obtaining entries of the inverse of a sparse matrix, sparsity in nonlinear problems, solution methods based on orthogonal transformations, and hybrid methods that combine iterative and direct methods.

15.1 Introduction

The focus of this book has been on the effective and efficient solution of the linear system of equations
\[
Ax = b \qquad (15.1.1)
\]
using Gaussian elimination, or a variant, where A is a given n×n sparse matrix. In this chapter, we introduce numerous closely related topics that play an important role in this kind of computation. In the next section, we introduce an old concept called the matrix modification formula. It was initially derived in the middle of the 20th century and shows the relationship between the solution of (15.1.1) and the closely related problem where A has been changed by a matrix of low rank. In Section 15.3, we develop a few applications of this formula:
• if we needed to modify the pivot size in factorizing A in order to preserve sparsity, how might we correct the solution to account for this change?
• if we have a parameter that represents an entry in A and is hard to measure, how sensitive is the solution to variations in this parameter?
In Section 15.4, we explore two other applications of this formula. We introduce several approaches to creating a matrix A made up of the subproblems plus a low-rank matrix that corresponds to those new entries tying the subproblems together. We also consider solving a sequence of problems like (15.1.1) where A changes over the sequence, but only in a small number of places. We see that this problem is closely related to the first in that the isolated changes in A over the sequence would also be a low-rank change for A. A parameter study is one example of this, but there are many others. We consider several approaches to solving such problems. We can make use of the matrix modification formula. We can constrain the ordering of the original matrix A to confine the parts that change (either the interconnections of subproblems or the changing parts of the matrix over the sequence) to be ordered last. This approach leads to a
partitioned form where only the lower corner would need to be refactorized as the entries change. As a side note here, in the earlier days it was standard to solve large problems in terms of their subproblems, but high speed computing and sparse matrix algorithms made this no longer necessary. However, the emergence of even larger problems and the success of dissection methods suggests that subproblem oriented dissection may offer a significant opportunity for a class of applications. In part, this is because other ordering algorithms are themselves heuristic and are not optimal. We showed in Chapter 4 that when Gaussian elimination is done in a stable way, we cannot necessarily say the answer is accurate. However, we can say the answer is accurate for the solution of (A + H)x = b,

(15.1.2)

where the norm of H is small. While this is true, in Section 15.5 we ask the extended question: is there a matrix H with the sparsity pattern of A that has a small norm, and what might that matrix be? In a related algebraic vein, we show in Section 15.6 that if the matrix A is irreducible, then the matrix A^{-1} is dense. It is not that we would want to compute A^{-1} in the solution of (15.1.1), but sometimes the entries of A^{-1} are useful for other things. This might make it seem that if we wanted the diagonal entries of A^{-1}, for example, we would need to compute all of the lower triangle of A^{-1}. We show in Section 15.7 that it is possible to compute only selected entries of A^{-1} without computing all of the entries below them as might be expected. In Sections 15.8 to 15.11, we consider the solution of nonlinear systems, where in the innermost loop we need to compute the solution of (15.1.1). Rather than simply factorizing each matrix without regard to this special context, we show how to take advantage of the nonlinear solution algorithm in updating and solving the various linear systems. In Sections 15.12 and 15.13, we look at a different type of variation on the solution of (15.1.1). In this case, we consider techniques other than Gaussian elimination for the solution. In Section 15.12, we consider using orthogonalization methods rather than Gaussian elimination, but still with the objective of preserving sparsity. In Section 15.13, we consider hybrid methods that combine direct algorithms with iterative methods. Of course, entire books are written on these subjects and it is not our place to try to repeat that work. However, we consider them briefly to sketch their relationship with what we have done in this book, and point the reader to other sources for more information.

15.2 The matrix modification formula

15.2.1 The basic formula

Suppose we have already solved Ax = b and wish to solve a perturbed system (A + ∆A)x = b

(15.2.1)

where ∆A has low rank. The matrix modification formula provides a tool for solving such systems. It was defined by Sherman and Morrison (1949) for rank-one changes and was generalized by Woodbury (1950). Expressed in terms of matrix inverses it has the form
\[
(A + VW^T)^{-1} = A^{-1} - A^{-1} V (I + W^T A^{-1} V)^{-1} W^T A^{-1}. \qquad (15.2.2)
\]

Here, ∆A = VW^T is a rank-k matrix, where V and W are n × k matrices. We are interested only in the case where k is much less than n. This formula may be readily verified and its verification is left to Exercise 15.1. To preserve sparsity, we will use the LU factors rather than computing inverses. Moreover, we may multiply both sides of equation (15.2.2) by the right-hand side vector b to yield the relation
\[
(A + VW^T)^{-1} b = A^{-1} b - A^{-1} V (I + W^T A^{-1} V)^{-1} W^T A^{-1} b. \qquad (15.2.3)
\]

We see that the solution to the modified problem (A + VW^T)^{-1} b is given as the solution to the original problem A^{-1} b and a correction term. While (15.2.3) may look intimidating, the computation is quite straightforward and cheap when k is much less than n. The steps are
(i) Solve AX = V using the LU factors of A.
(ii) Compute T = I + W^T X and its LU factors.
(iii) Solve Ay = b using the LU factors of A.
(iv) Solve Ts = W^T y using the LU factors of T.
(v) Compute x = y − Xs.
The first step is likely to be the most expensive step. It involves O(kτ′) operations, where τ′ is the number of entries in the LU factors of A. Forming W^T X requires at most O(k^2 n) operations but, if W is sparse, it will require far less, perhaps as few as k^2. The rest of step (ii) requires O(k^3) operations, so will be inexpensive. Step (iii) requires O(τ′) operations, step (iv) requires at most O(kn) operations, and step (v) requires O(kn) operations. Note that steps (i) and (ii) need to be performed only once if we have many vectors b. A common case is where ∆A has rank k and has nonzeros in only k rows or columns. If ∆A has nonzeros in only k columns, it can be expressed in the form ∆A = VW^T, where V consists of those nonzero columns and W^T is the submatrix of the n×n identity matrix that has 1s in the nonzero columns of ∆A. If ∆A has nonzeros in only k rows, we may take V to be the submatrix of I with 1s in the nonzero rows of ∆A and W^T to consist of the nonzero rows.
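As an illustration, the five steps can be coded directly on top of an existing sparse factorization. The sketch below (ours, not from the book) uses SciPy's splu for the factors of A; V, W, and b are dense arrays and k is assumed to be small.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def solve_modified(lu_A, V, W, b):
    """Solve (A + V W^T) x = b, re-using the factorization lu_A = splu(A),
    following steps (i)-(v) above."""
    X = lu_A.solve(V)                      # (i)   A X = V
    T = np.eye(V.shape[1]) + W.T @ X       # (ii)  T = I + W^T X
    y = lu_A.solve(b)                      # (iii) A y = b
    s = np.linalg.solve(T, W.T @ y)        # (iv)  T s = W^T y (T factorized here)
    return y - X @ s                       # (v)   x = y - X s

# Hypothetical use: A = csc_matrix(...); lu_A = splu(A); x = solve_modified(lu_A, V, W, b)
```

For repeated right-hand sides, X and T from steps (i) and (ii) would be kept and only steps (iii)-(v) repeated, as noted above.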

15.2.2 The stability of the matrix modification formula

The stability of the matrix modification formula has been studied by Yip (1986). She did not perform a full error analysis, but looked at the stability of each of the steps set out in Section 15.2.1. Steps 1 and 3 depend on the conditioning of A. Steps 2 and 5 depend on there being no severe cancellation in the additions.

Step 4 depends on the conditioning of T. Yip shows that in two important cases
\[
\kappa(T) \le \kappa(A)\,\kappa(A + \Delta A), \qquad (15.2.4)
\]
where κ denotes the 2-norm condition number, as defined in (4.11.5). The first case is when ∆A is nonzero in only k rows or columns and V and W are chosen as explained in the last paragraph of Section 15.2.1. The second case is when an SVD is performed on ∆A as PSQ^T, where P and Q have orthonormal columns and S is diagonal. In this case, V is set to P and W^T is set to SQ^T. Equation (15.2.4) makes it apparent that the conditioning of A and A + ∆A is important. While this result gives us some insight into where things might go wrong, making checks is impractical. We therefore recommend that iterative refinement be used to check and correct the result. The situation is more hopeful if ‖∆A‖ is much smaller than ‖A‖, which is the case with static pivoting, see Section 13.4, provided A is well conditioned. T will not be much different from I so there will be no severe cancellation during its formation and it will be well conditioned. Indeed, all the steps will be stable. Note, however, the vital importance of A being well conditioned. The important special case where a single diagonal entry is altered has been fully analysed by Stewart (1974).

15.3 Applications of the matrix modification formula

There are many applications for using the matrix modification formula in practical problems, some more promising than others. We will briefly describe a few.

15.3.1 Application to stability corrections

We noted in Section 10.3 that Stewart (1974) recommended the modification formula when some of the pivots selected for sparsity and stability for a first set of numerical values were small for a subsequent matrix. If the kth pivot is too small, simply add a suitably large a to a_{kk} and proceed with the factorization using that pivot. This would then require correcting for the modification. That is, we have successfully factorized the matrix
\[
A + a\, e_k e_k^T \qquad (15.3.1)
\]
and now must make a simple rank-one correction to solve the original problem. Since the stability is not guaranteed, iterative refinement is still needed.

15.3.2 Building a large problem from subproblems

When very large mathematical models are built, they are often assembled from a number of smaller models, which have been tested and validated individually. A large-scale circuit is generally made up of many smaller components. A large, complex structural analysis is created from its many substructures. Today's large power grids represent the interconnection of many regional power systems. This can readily be seen in Figure B.10. The diagonal blocks correspond to major independent power systems in the Western Region of USA, while the relatively few entries outside the blocks on the diagonal correspond to the interconnections between these power systems. Figure 9.7.1 can be seen in this way as well. B and W represent two independent models, which are brought together through the added edges and vertices shown in the diagram. Without the connections, the matrix has the form
\[
A = \begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix}. \qquad (15.3.2)
\]
There are two blocks that may be solved in two parts, with separate factors for A_{11} and A_{22}. Including the connections corresponds to adding the correction
\[
\Delta A = \begin{pmatrix} \Delta A_{11} & \Delta A_{12} \\ \Delta A_{21} & \Delta A_{22} \end{pmatrix}, \qquad (15.3.3)
\]
where ∆A_{ij} has nonzeros in only k_i rows and k_j columns, i = 1, 2, j = 1, 2, and k = k_1 + k_2 is small. The modification formula may be applied directly, which allows the known factors of A_{11} and A_{22} to be used again. Alternatively, we might treat the unmodified problem as
\[
A = \begin{pmatrix} A_{11} + \Delta A_{11} & \\ \Delta A_{21} & A_{22} + \Delta A_{22} \end{pmatrix} \qquad (15.3.4)
\]
and solve by block forward substitution, which requires a new factorization of A_{11} + ∆A_{11} and A_{22} + ∆A_{22}, but the code that was written for the original problem can be reused. Now the modification has the rank min(k_1, k_2). The astute reader will recognize another way of solving the larger connected problem taking advantage of the solved subproblems by using partitioning. We make a comparison with this approach in the next section.

15.3.3 Comparison with partitioning

If we go back to equation (15.2.2), there is a different approach. Suppose the modification can be permuted to have the form
\[
\Delta A = \begin{pmatrix} 0 & 0 \\ 0 & \Delta A_{22} \end{pmatrix}, \qquad (15.3.5)
\]
that is, the given problem can be permuted so that all the changed entries are confined to the lower right-hand corner of the matrix. If A is partitioned as in (15.3.5), the partitioned approach may work very well for this perturbed problem. All we have to do is to factorize the Schur complement, which is now A_{22} − A_{21} A_{11}^{-1} A_{12} + ∆A_{22}. If the changes cannot be constrained to the lower corner, but can be confined to one or both borders, for example
\[
\Delta A = \begin{pmatrix} 0 & \Delta A_{12} \\ 0 & \Delta A_{22} \end{pmatrix}, \qquad (15.3.6)
\]

this can also be exploited to simplify the cost of solving (A + ∆A)x = b, see Exercise 15.4. Note, however, that frequently a change in an off-diagonal entry is associated with changes in the diagonal entries in its row and column, and if this is the case for all changes, we have form (15.3.5) rather than (15.3.6). There are some hidden costs with this way of solving the perturbed problem. The perturbation ∆A constrains the choice of partition and this may well conflict with other objectives such as out-of-core treatment of large problems or improving efficiency in the solution of Ax = b. There is another distinction between the partitioned approach and using the matrix formula to solve the modified problem. In the partitioned approach, we assume the bordered form can be achieved through constraining the pivot order. Using the modification formula, a simpler problem may be achieved using a lowrank update, which is not achieved through reordering. The extreme example of this is the matrix A that has every entry equal to 1 except the diagonal, where every entry has value 2. This matrix is dense, but can be represented as A = I + eeT

(15.3.7)

where e is a vector of all 1s.

15.3.4 Application to sensitivity analysis

We consider one more application of the matrix modification formula. Suppose the entry a_{ij} corresponds to a parameter that is difficult to measure. A natural question for the modeller is, how sensitive is the solution to changes in a_{ij}? This question is readily answered using ∆A = e_i a_{ij} e_j^T, allowing us to explore how the solution changes with respect to changes in a_{ij}. More generally, a single or small number of parameters in the model may affect more than one entry in the matrix. When the affected entries together form a low-rank matrix, we can use (15.2.3) to explore the impact of changes in these parameters. We illustrate these ideas with the network shown in Figure 15.3.1. This could correspond to an electric power system network where we want to do a contingency analysis of modifying b_1 (changing matrix entries a_{66}, a_{77}, a_{67}, and a_{76}), b_2 (changing matrix entries a_{11}, a_{22}, a_{12}, and a_{21}), or b_3 (changing matrix entries a_{99}, a_{10,10}, a_{9,10}, and a_{10,9}). In this case, the matrix has a symmetric pattern whose graph is shown in Figure 15.3.1. Perturbing b_1 corresponds to changes in rows and columns 6 and 7, so a suitable choice is given by V being (e_6 e_7) and W^T being the submatrix of rows 6 and 7 of A. This is also the case in the admittance formulation of an electrical network problem. Similarly, perturbing b_2 requires the submatrix of rows 1 and 2 of A and perturbing b_3 requires the submatrix of rows 9 and 10. In view of symmetry we
therefore need inverse entries only in the (1,1), (1,2), (2,2), (6,6), (6,7), (7,7), (9,9), (9,10), and (10,10) positions, provided we never need to consider modifying more than one of b_1, b_2, and b_3 at once.

Fig. 15.3.1. Example network.

15.4 The model and the matrix

Throughout this book we have assumed we are given a large, sparse matrix A and want to solve the system of equations Ax = b using a direct method, taking advantage of sparsity. We noted in Section 1.6 that the construction of A is crucial to the solution process and can affect both the sparsity of A and the computational performance of the solution algorithm. Thus attention to the formulation of A, rather than simply taking it as given, can have a significant impact on the overall cost of the solution. An example is that of forming the matrix A using the normal equations A = B^T B, where B is a sparse matrix. This is likely to lead to a matrix A that is much less sparse and less well conditioned. Another example is the idea of what Hachtel et al. (1971) called the sparse tableau, or variability typing, see Section 7.10, where we can take advantage of solving a sequence of sparse matrix problems in which some of the matrix entries changed from problem to problem and some did not. In this section, we offer two more examples of the impact of problem formulation on the effective solution of Ax = b. These examples are by no means exhaustive, but we offer them for two reasons: they are useful in their own right in many modelling problems, and they invite a way to think about the connection between the construction of A and its subsequent factorization.

15.4.1 Model reduction

Consider the case where a model is made up of the interconnection of two submodels B and W as illustrated on the right of Figure 9.7.1. They are connected by adding the nodes of the vertex separator S and edges from it to nodes of B and W. We showed in Section 9.2 that we can number the nodes of S last to produce a bordered block diagonal matrix. Alternatively, we could
eliminate those edges that connect B and W to S, and account for them through the use of the matrix modification formula. A third, closely related approach, opens yet another way of thinking about this problem. We tentatively make two more assumptions: the part of the right-hand side that corresponds to nodes in B is zero and there is no interest in the solution of the final system of equations at any of the nodes in B. That is, we want to solve
\[
Ax = \begin{pmatrix} A_{11} & & A_{13} \\ & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 0 \\ b_2 \\ b_3 \end{pmatrix} \qquad (15.4.1)
\]

and we are not interested in x_1. It is a simple matter to show that the solution to this problem is the solution to the problem
\[
\begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33}^{(1)} \end{pmatrix}
\begin{pmatrix} x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} b_2 \\ b_3 \end{pmatrix} \qquad (15.4.2)
\]

where A_{33}^{(1)} = A_{33} − A_{31} A_{11}^{-1} A_{13}. That is, we factor A_{11} and form the Schur complement to get a reduced-sized problem without the variables at the nodes of B. The assumptions regarding this problem often are valid in a large-scale model (say a structural analysis or an electronic simulation) when B corresponds to a relatively large submodel, but there are no inputs to the internal nodes of B (the first block of b is zero), and the solution is of interest for the effects of the whole model, but not at internal nodes in B (x_1 is not of interest). If the first block of b is not zero, a forward substitution b_3^{(1)} = b_3 − A_{31} A_{11}^{-1} b_1 can be performed and if x_1 is of interest, it can be found by back-substitution, that is solving A_{11} x_1 + A_{13} x_3 = b_1. To see why this may be a good way to solve the overall problem, recall that our ordering algorithms are heuristic. What we have done is to take advantage of the submodel structure, rather than trying to find a cut using a graph theoretic algorithm (see Section 9.8). We make two further observations about this approach. First, it can be applied to multiple submodels in the overall problem to give a bordered block diagonal matrix with more than two blocks in the block diagonal part. In the spirit of the sparse tableau of the previous section, if some of the submodels are fixed while others change over a sequence of solutions, only those that change would need to be refactorized. Secondly, it addresses a dilemma we often face in large scale modelling. A submodel may need to be very large to be adequately accurate while it may need to be small for good solution times. This approach allows a very large submodel, with an accurate reflection of all of the model details in the input/output model created by the Schur complement. The internal reductions are independent of each other, and may be done in advance of solving the assembled problem, or may be done in parallel on a parallel architecture.
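A dense sketch of this reduction (ours; a sparse factorization of A_{11} would be used in practice) makes the bookkeeping explicit:

```python
import numpy as np

def reduce_submodel(A11, A13, A31, A33, b1, b3):
    """Eliminate the internal block 1 of (15.4.1), returning the reduced
    interface block A33^(1) and right-hand side b3^(1) of (15.4.2)."""
    Y = np.linalg.solve(A11, A13)                 # A11^{-1} A13
    A33_red = A33 - A31 @ Y                       # Schur complement
    b3_red = b3 - A31 @ np.linalg.solve(A11, b1)
    return A33_red, b3_red

def recover_internal(A11, A13, b1, x3):
    """Back-substitution for x1 when the internal solution is wanted."""
    return np.linalg.solve(A11, b1 - A13 @ x3)
```

Each submodel can be reduced independently in this way, which is what allows the reductions to be performed in advance or in parallel.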

15.4.2 Model reduction with a regular submodel

An important and practical special case of the problem described in the previous subsection arises when the submodel has a regular structure. This happens, for example, when the submodel is a long transmission line in a power system or a free-standing beam in a structural model. If this submodel is only connected to the rest of the model at its ends, then we have the same structure as in the previous section with an added bonus. If the matrix of the subproblem has the special form
\[
\begin{pmatrix}
A & E & & & & \\
F & B & E & & & \\
  & F & B & E & & \\
  &   & \ddots & \ddots & \ddots & \\
  &   &        & F & B & E \\
  &   &        &   & F & D
\end{pmatrix}, \qquad (15.4.3)
\]
where each block is of order m and the number of blocks is odd, and we pivot on the even blocks, the reduced matrix is
\[
\begin{pmatrix}
A^{(1)} & E^{(1)} & & & & \\
F^{(1)} & B^{(1)} & E^{(1)} & & & \\
        & F^{(1)} & B^{(1)} & E^{(1)} & & \\
        &         & \ddots & \ddots & \ddots & \\
        &         &        & F^{(1)} & B^{(1)} & E^{(1)} \\
        &         &        &         & F^{(1)} & D^{(1)}
\end{pmatrix}, \qquad (15.4.4)
\]

where A^{(1)} = A − EB^{-1}F, B^{(1)} = B − FB^{-1}E − EB^{-1}F, E^{(1)} = −EB^{-1}E, F^{(1)} = −FB^{-1}F, and D^{(1)} = D − FB^{-1}E. Note that B needs to be factorized only once and each of B^{(1)}, E^{(1)}, and F^{(1)} needs to be formed only once so the work is always O(m^3) regardless of the number of blocks. If the number of blocks is again odd, we can repeat the process. Indeed, if the number of blocks is 2^k + 1, we can repeat the process until we have a Schur complement consisting of the 2 × 2 block matrix corresponding to the end blocks. The total work is O(km^3). If one more block is added, we can exclude this from the cyclic reduction so that it produces a 3 × 3 block from which the middle block may be eliminated. It is interesting that this is as much work as increasing the number of blocks from 2^k + 1 to 2^{k+1} + 1. What we have essentially accomplished using the regular structure of the matrix is to reduce a very large model in logarithmic time and create an input/output model, where (x_2, x_3)^T is found from (b_2, b_3)^T, accounting for all of the interior details. We retain the ability, when required, to handle nonzero b_1 or wanted x_1.
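One step of this reduction is easy to express with dense blocks (a sketch under the stated assumptions of identical interior blocks and an odd number of blocks):

```python
import numpy as np

def cyclic_reduction_step(A, B, E, F, D, nblocks):
    """Eliminate the even-numbered blocks of the matrix (15.4.3) and return
    the blocks of the reduced matrix (15.4.4).  Only one set of solves with
    B is needed per step, whatever the number of blocks."""
    BinvE = np.linalg.solve(B, E)
    BinvF = np.linalg.solve(B, F)
    A1 = A - E @ BinvF
    B1 = B - F @ BinvE - E @ BinvF
    E1 = -E @ BinvE
    F1 = -F @ BinvF
    D1 = D - F @ BinvE
    return A1, B1, E1, F1, D1, (nblocks + 1) // 2
```

Applied repeatedly, k such steps reduce 2^k + 1 blocks to the two end blocks, in keeping with the O(km^3) total work noted above.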

15.5 Sparsity constrained backward error analysis

In Chapter 4, we discussed the backward error analysis of Gaussian elimination. The computation is considered stable if the computed factors L and U of A satisfy
\[
A + H = LU \qquad (15.5.1)
\]
and ‖H‖ is small compared with ‖A‖. For dense matrices, a bound for ‖H‖ was given in inequality (4.3.14). In Section 5.4, we showed that for sparse n × n matrices the corresponding bound depends on p rather than n, where p is the maximum number of entries in any column of U. A justification for backward error analysis is that H can be regarded as a perturbation of the original problem for which L and U are the exact factors. In the sparse matrix problem, this interpretation need not be valid. The reason is that small perturbations to zeros in A may not make sense physically. For example, if A is the matrix in a model of a water-distribution network, a zero may mean no direct connection between two points in the network; ‘small’ connections in this context make no sense. Since H is defined by equation (15.5.1), its nonzeros are not restricted to the sparsity pattern of A. We might consider determining a matrix F with the sparsity pattern of A, ‖F‖ small, and satisfying the equation
\[
(A + F)\hat{x} = b, \qquad (15.5.2)
\]

where x̂ is the computed solution. Gear (1975) provided a counter example to this objective and we present it here. Gear's example is the problem
\[
A = \begin{pmatrix} 1 & 1 & -1 & -1 \\ 0 & \delta & 0 & 0 \\ 0 & 0 & \delta & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix},
\qquad
b = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 2 \end{pmatrix}, \qquad (15.5.3)
\]
with solution (1, 1/δ, 1/δ, 1)^T. Consider the candidate solution
\[
\hat{x} = \begin{pmatrix} (\delta - \sigma)/\delta \\ 1/\delta \\ 1/\delta \\ (\delta - \sigma)/\delta \end{pmatrix}, \qquad (15.5.4)
\]
which satisfies the equation
\[
\begin{pmatrix} 1 & 1 & -1 & -1 \\ 0 & \delta & 0 & 0 \\ 0 & 0 & \delta & 0 \\ 1 & 2\sigma & 0 & 1 \end{pmatrix} \hat{x} =
\begin{pmatrix} 0 \\ 1 \\ 1 \\ 2 \end{pmatrix}. \qquad (15.5.5)
\]

For small σ, this represents a small perturbation of the original problem. It can be shown that the minimum perturbation within the sparsity pattern of A
requires adding σ/(δ − σ) to both the (4,1) and (4,4) entries of A. If δ is also small, then σ/(δ − σ) can be large, so that it is not possible to satisfy (15.5.2) with ‖F‖ small. Note, however, that if δ is small then A is badly scaled. The problem is tackled by Arioli et al. (1989) using the notion of componentwise relative error (Oettli and Prager 1964) and allowing b to be perturbed as well as A. They show that the computed solution x̂ is the exact solution of the perturbed problem
(A + F)x̂ = b + δb,

(15.5.6)

with perturbations satisfying the inequalities
\[
|f_{ij}| \le \omega |a_{ij}|, \quad i, j = 1, 2, \ldots, n, \qquad (15.5.7)
\]
\[
|\delta b_i| \le \omega f_i, \quad i = 1, 2, \ldots, n, \qquad (15.5.8)
\]
and where
\[
\omega = \max_i \frac{|A\hat{x} - b|_i}{(|A|\,|\hat{x}|)_i + f_i}. \qquad (15.5.9)
\]

Normally f_i is equal to |b_i|, but when the denominator of expression (15.5.9) is near zero, they choose f_i equal to Σ_j |a_{ij}| ‖x̂‖_∞. Thus ω represents a relative error for the entries of A and many of the entries of b, but an absolute error for the other entries of b (different for each component). It may be necessary to perform one or more iterations of iterative refinement to obtain an approximate solution x̂ for which ω, calculated by expression (15.5.9), is at the level of rounding. The experiments of Arioli et al. (1989) suggest that more than one iteration is rarely needed. For Gear's example, working with δ = 10^{-8} and σ = 10^{-15} on a machine having precision 16^{-13} ≈ 2 × 10^{-16}, they found one step of iterative refinement gave an improved x̂ for which ω ≤ 10^{-16}. In fact, this is a case where the condition number ‖A^{-1}‖ ‖A‖ (see Section 4.11) is large only because of poor scaling. A better condition number in such a case is that of Skeel (1979): ‖ |A^{-1}| |A| ‖. Here, the classical condition number is 1 + δ^{-1} whereas the Skeel condition number is 4. Unfortunately, discussion of Skeel's work is beyond the scope of this book.
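The measure ω is cheap to evaluate once x̂ is available. A small dense sketch (ours), using the default choice f_i = |b_i| and the fallback described above:

```python
import numpy as np

def componentwise_omega(A, x_hat, b):
    """The backward error omega of (15.5.9) with f_i = |b_i|, switching to
    f_i = sum_j |a_ij| * ||x_hat||_inf where the denominator (nearly) vanishes."""
    r = np.abs(A @ x_hat - b)
    denom = np.abs(A) @ np.abs(x_hat) + np.abs(b)
    small = denom <= np.finfo(float).tiny
    if small.any():
        alt = np.abs(A).sum(axis=1) * np.abs(x_hat).max()
        denom = np.where(small, np.abs(A) @ np.abs(x_hat) + alt, denom)
    return float(np.max(r / denom))
```

In a sparse code the same quantities would of course be accumulated from the stored entries of A rather than from a dense array.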

15.6 Why the inverse of a sparse irreducible matrix is dense

The structural inverse of a matrix is defined to be the pattern that is the union of all patterns for A−1 generated by choosing different values for entries of A. A structural LU and solution vector x are defined similarly. In this section, all references to L, U, A−1 , and x are to their structures in this sense. We show that, if A is irreducible, L has at least one entry beneath the diagonal in each column (except the last), U has at least one entry to the right of the diagonal in each row (except the last), x is dense, and A−1 is dense.

We first show that if nodes i and j in the digraph (Section 1.2) of the matrix are joined by a path (i, k_1), (k_1, k_2), ..., (k_s, j), where k_1, k_2, ..., k_s are all less than i and j (called a legal path by Richard Karp), then (i, j) will be an entry in L\U. To show this, suppose that k_m = min_i k_i. Just before elimination step k_m, there are entries in positions (k_{m−1}, k_m) and (k_m, k_{m+1}). Therefore, following it, there is an entry in position (k_{m−1}, k_{m+1}) so the digraph of the reduced submatrix contains the (legal) path (i, k_1), (k_1, k_2), ..., (k_{m−1}, k_{m+1}), ..., (k_s, j). Continuing the argument similarly, we eventually find that (i, j) is an entry in L\U. Next we show that there is at least one entry to the right of the diagonal in every row of U (except the last) and at least one entry beneath the diagonal in every column of L (except the last). We prove this by showing that if there is a row k of U, k < n, with no entries to the right of the diagonal, then A is reducible. The proof for L is similar. Suppose there is a path in the digraph of A from node k to a node after k and let l be the first such node on the path. The path from k to l is a legal path, so (k, l) must be an entry in L\U, contrary to our assumption. Therefore there can be no path in the digraph of A from k to a later node. Let S be the set consisting of node k and all nodes to which there is a path from k. The complement of S is not empty because it contains all the nodes after k. If we reorder the matrix so that the nodes of S precede those of its complement, we find a 2 × 2 block lower triangular form, that is A is reducible. This proves the statement in the first sentence of this paragraph. Notice that some permutations of a reducible matrix may give factors L and U possessing the property of the last paragraph. For example, interchanging the columns of the matrix
\[
\begin{pmatrix} \times & \times \\ 0 & \times \end{pmatrix} \qquad (15.6.1)
\]
leads to dense matrices L and U. On the other hand, any permutation of an irreducible matrix is irreducible, so the property of the last paragraph holds for all permutations of irreducible matrices. We next show that if L is a lower triangular factor of an irreducible matrix and b ≠ 0, the solution of the equation Ly = b

(15.6.2)

has an entry in its last position. To prove this, suppose b_k is the last nonzero of b. Since the corresponding component of the solution is found from the equation
\[
y_k = b_k - \sum_{s=1}^{k-1} l_{ks} y_s, \qquad (15.6.3)
\]

it is an entry in y. If k = n, this is our desired result. Otherwise, since L is a factor of an irreducible matrix, it has an entry ljk , j > k; by replacing k by j in equation (15.6.3), we conclude that the solution has an entry in position j.

If j = n, this is our desired result. If not we continue the argument, finding successive entries in the solution until position n is reached. If U is an upper triangular factor of an irreducible matrix and y_n is an entry, the solution of the equation
\[
Ux = y \qquad (15.6.4)
\]
is dense. To prove this, we remark that the equation for x_k is
\[
u_{kk} x_k = y_k - \sum_{j=k+1}^{n} u_{kj} x_j. \qquad (15.6.5)
\]

Clearly, xn is an entry in x. For k < n, suppose xk+1 , ..., xn are all entries; since U is a factor of an irreducible matrix, it has an entry ukj , j > k, so we deduce from equation (15.6.5) that the solution has an entry in position k, too. Hence, by induction, x is dense. Together, the last two paragraphs tell us that for any set of equations Ax = b

(15.6.6)

with an irreducible matrix A the solution vector x is dense if it is computed with Gaussian elimination and any cancellations are ignored. This is also true for A−1 if it is computed in the usual way by solving the equation AX = I

(15.6.7)

by Gaussian elimination. It might be thought that there could be some sparsity in the inverse that is not exposed by this treatment, but this is not the case. Duff, Erisman, Gear, and Reid (1988) show that for every position (i, j) there is a matrix with the given irreducible structure that has a nonzero in position (i, j) of its inverse.

15.7 Computing entries of the inverse of a sparse matrix

We have just shown that if A is irreducible, its inverse must be dense. Thus it would seem that if we wanted to compute the diagonal entries of A−1 , for example, we would need to solve the equation AX = I.

(15.7.1)

That is, in order to compute the diagonal entries of A−1 we must compute all of the entries of the lower triangular part of the inverse. However, we show here that some of this computation is not necessary, and we can compute just the entries in the sparsity pattern of (L\U)T . This algorithm is discussed further by Erisman and Tinney (1975). There are many applications where entries of the inverse are useful. In the least-squares data-fitting problem, the diagonal entries of the inverse of the normal matrix AT A have particular significance since they yield estimates of

the variances of the fitted parameters. Entries of the inverse can also be used to compute sensitivity information. Thus we are motivated to compute a portion of A−1 , not to use it as an operator, but for the entries themselves. Let Z = A−1 and suppose we have computed a sparse factorization of A, A = LDU,

(15.7.2)

where L is unit lower triangular, D is diagonal, and U is unit upper triangular. The formulae
\[
Z = D^{-1} L^{-1} + (I - U)Z \qquad (15.7.3a)
\]
and
\[
Z = U^{-1} D^{-1} + Z(I - L) \qquad (15.7.3b)
\]
of Takahashi, Fagan, and Chin (1973) may be readily verified (Exercise 15.7). Since (I − U) is strictly upper triangular and (I − L) is strictly lower triangular, we may conclude that the following relations are true:
\[
z_{ij} = [(I - U)Z]_{ij}, \quad i < j, \qquad (15.7.4a)
\]
\[
z_{ij} = [Z(I - L)]_{ij}, \quad i > j, \qquad (15.7.4b)
\]
\[
z_{ii} = d_{ii}^{-1} + [(I - U)Z]_{ii}, \qquad (15.7.4c)
\]
and
\[
z_{ii} = d_{ii}^{-1} + [Z(I - L)]_{ii}. \qquad (15.7.4d)
\]

Using the sparsity of U and L, these formulae provide a means of computing particular entries of Z from previously computed ones. In particular, all entries of Z in the sparsity pattern of (L\U)^T can be computed without calculating any entries outside this pattern. To see this, let z_{ij} be one such entry. Suppose i < j. Then from (15.7.4a)
\[
z_{ij} = - \sum_{k=i+1}^{n} u_{ik} z_{kj}. \qquad (15.7.5)
\]

The entries zkj that are needed to evaluate this formula are those corresponding to entries uik . However, if lji and uik are entries, so is (L\U)jk , which implies that zkj is an entry that has been computed (see also Exercise 15.8). This suggests that entries of the inverse may be computed starting with znn and working up through the sparsity pattern of (L\U)T . It shows, for example, that the diagonal entries of the inverse of a symmetric tridiagonal matrix may be computed with just the entries zii , i = 1, 2,... n, and zi,i+1 , i = 1, 2,..., n − 1.
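For the symmetric tridiagonal case just mentioned, the recurrences collapse to a single backward sweep. The sketch below (ours) uses an LDL^T factorization and returns only the entries of the inverse within the pattern of (L\U)^T:

```python
import numpy as np

def tridiag_inverse_entries(a, c):
    """Diagonal and first superdiagonal of the inverse of the symmetric
    tridiagonal matrix with diagonal a and off-diagonal c, via (15.7.4)."""
    n = len(a)
    d = np.empty(n)            # D in A = L D L^T
    l = np.empty(n - 1)        # subdiagonal of L (= superdiagonal of U)
    d[0] = a[0]
    for i in range(n - 1):
        l[i] = c[i] / d[i]
        d[i + 1] = a[i + 1] - l[i] * c[i]
    zdiag = np.empty(n)
    zoff = np.empty(n - 1)
    zdiag[-1] = 1.0 / d[-1]
    for i in range(n - 2, -1, -1):
        zoff[i] = -l[i] * zdiag[i + 1]            # z_{i,i+1} from (15.7.4a)
        zdiag[i] = 1.0 / d[i] - l[i] * zoff[i]    # z_{ii} from (15.7.4c), using symmetry
    return zdiag, zoff
```

For a small example these values can be checked directly against np.linalg.inv of the assembled tridiagonal matrix.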

In general, a computational sequence can be developed by calculating entries of Z lying in the pattern of (L\U)^T in reverse Crout order, starting with z_{nn}. Thus, when calculating entry z_{ij}, all entries z_{st} (s > i, t > j) have already been calculated and equation (15.7.5) and its counterpart for i > j (Exercise 15.8) allow z_{ij} to be computed immediately. While this computational sequence will always work, the following example shows that it may not be the most efficient. Let A have the pattern
\[
\begin{pmatrix}
\times & 0 & 0 & 0 & \times \\
0 & \times & \times & 0 & 0 \\
0 & \times & \times & 0 & \times \\
0 & \times & \times & \times & \times \\
\times & 0 & 0 & \times & \times
\end{pmatrix}. \qquad (15.7.6)
\]
In this case, L\U has the same pattern. The sequence
\[
z_{55} = d_{55}^{-1}, \quad \text{using (15.7.4c)} \qquad (15.7.7a)
\]
\[
z_{51} = -z_{55}\, l_{51}, \quad \text{using (15.7.4b) and (15.7.4a)} \qquad (15.7.7b)
\]
\[
z_{11} = d_{11}^{-1} - u_{15} z_{51}, \quad \text{using (15.7.4c) and (15.7.4b)} \qquad (15.7.7c)
\]

can be used to compute z_{11} without previously calculating all other z_{ij} in the pattern of (L\U)^T. The minimum number of computations required to compute any entry of A^{-1} in the sparsity pattern of (L\U)^T is unknown. Another open question is the best way of computing other entries of A^{-1}, though the formulae (15.7.4) still apply. Campbell and Davis (1995) construct additional tree structures to implement this algorithm and have written code, although it is only available by contacting the authors. Of course, we can solve (15.7.1) for entries of the inverse exploiting the fact that the right-hand side is sparse and only part of the solution is wanted, see Section 14.4. Furthermore, we can solve for multiple sparse right-hand sides when entries in more than one column of the inverse are required with further efficiencies, see Section 14.5. Amestoy et al. (2012) and Amestoy et al. (2015b) have looked at out-of-core and parallel implementations using the MUMPS package and we showed some results from their work in Section 14.7.3.

15.8 Sparsity in nonlinear computations

Sparsity plays a critical role in the computational methods of large-scale nonlinear optimization. A review of computational methods for nonlinear optimization is outside the scope of this book. We refer the reader to the books of Fletcher (1980, 1981), Gill, Murray, and Wright (1981); Dennis Jr. and Schnabel (1983); and (for the sparse case) Coleman (1984). It is clear that the sparse matrix solution techniques developed earlier in this volume are applicable to solving the sparse linear equations that arise in nonlinear optimization. Therefore, in this

and the next two sections, we confine our attention to the generation of the sparse matrices involved. In the solution of a nonlinear system of equations F(x) = 0,

(15.8.1)

where F is a vector-valued function of the vector variable x, an estimate of the Jacobian matrix
\[
J = \nabla F(x) = \left( \frac{\partial F_i(x)}{\partial x_j} \right) \qquad (15.8.2)
\]
is generally used. If x and F(x) are n-component vectors and n is large, J is usually sparse. This sparsity is useful not only in solving the equations efficiently, but also in reducing the number of function evaluations necessary to estimate J through finite differences. This can be accomplished by an algorithm due to Curtis, Powell, and Reid (1974), which is discussed in Section 15.9. For the case where the matrix is known to be symmetric, Powell and Toint (1979) have extended the algorithm and we describe this work, too, in Section 15.9. In the solution of a nonlinear optimization problem, we seek a point x* satisfying
\[
f(x^*) = \min_{x \in N(x^*)} f(x), \qquad (15.8.3)
\]
where N(x*) is some neighbourhood of x* and f is a twice-differentiable function. The point x* is a local minimum of f. The algorithm of Powell (1970) for computing x* makes use of an approximation B_k to the Hessian
\[
\nabla^2 f(x_k) = \left( \frac{\partial^2 f(x_k)}{\partial x_i\, \partial x_j} \right), \qquad (15.8.4)
\]
and, in the sparse case, the algorithm of Powell and Toint (1979) may be used to compute it by differencing the gradients. Rather than approximating ∇²f(x_k) afresh at every iteration, Powell updates the approximation at most iterations. The updated approximation B_{k+1} is required to be symmetric and satisfy the equation
B_{k+1}(x_{k+1} − x_k) = g(x_{k+1}) − g(x_k),

(15.8.5)

where g(x) is the gradient of f(x) at x. Subject to these conditions, B_{k+1} is chosen to minimize the quantity ‖B_k − B_{k+1}‖_F,

(15.8.6)

where ‖·‖_F is the Frobenius norm (possibly weighted). The condition (15.8.5) is known as the quasi-Newton or secant equation. Unfortunately, this update destroys the sparsity of the approximate Hessian. Toint (1977) proposes adding the constraint
[B_{k+1}]_{ij} = 0 if [∇²f(x)]_{ij} = 0.

(15.8.7)

The algorithm for preserving the sparsity in the update Bk+1 through the constraint (15.8.7) will be reviewed in Section 15.10.

15.9 Estimating a sparse Jacobian matrix

In this section we describe the CPR algorithm (Curtis et al. 1974) for approximating the n × n Jacobian (15.8.2) of the system (15.8.1) by finite differences and the extension of this algorithm by Powell and Toint (1979) to the symmetric case. The simplest way to approximate J is to use the finite difference
\[
\frac{\partial F_i}{\partial x_j} \simeq \frac{F_i(x + h_j e_j) - F_i(x)}{h_j}, \qquad (15.9.1)
\]
where e_j is column j of the identity matrix and h_j is a suitable step-length. A straightforward implementation of (15.9.1) requires computing F(x + h_j e_j) for each j; that is, n evaluations of F at displacements from x. Because of sparsity, each F_i depends on only a few x_j, and we use this fact to reduce the number of function evaluations. We begin by considering the case where the relations
\[
\frac{\partial F_i(x)}{\partial x_j} = 0, \quad |i - j| \ge 2 \qquad (15.9.2)
\]

hold, so that the matrix J is tridiagonal. In this case, we may compute the approximations to columns 1, 4, 7, ... simultaneously by using the displacement h_1 e_1 + h_4 e_4 + ... . Similarly, the columns 2, 5, 8, ... and 3, 6, 9, ... can be found together, so that only three function evaluations at displacements from x are needed. What this example shows for the more general sparse case is that we need to find a set of columns of J which are such that their patterns do not overlap. A set of such columns of J can be computed simultaneously with a single function evaluation. Finding an optimal set is an NP-complete problem (McCormick 1983), but the heuristic proposed by Curtis et al. (1974) for grouping the columns of J for any given sparsity pattern seems to perform well. The first group is formed by starting with the first column and combining with it the next column whose pattern does not overlap with its pattern. The pattern of this next column is merged with the first, and other columns are examined in turn seeking additional columns whose pattern does not overlap with the merged pattern. The first group is terminated when the merged pattern is dense or all columns have been examined. The second and successive groups are formed in the same way from remaining columns. Such an algorithm is called a greedy algorithm. There is an interesting connection with graph colouring. Consider the graph that has a node for each country and an edge for each pair of countries that border each other. If we have sets of disconnected nodes, we can use a different colour for the countries of each set, which allows us to construct a map that distinguishes between the countries. In our case, we seek to colour the column incidence graph, that is, the graph that has a node for each column of our given matrix and an edge for every overlapping pair of columns. The CPR algorithm that we have just described is a greedy graph colouring algorithm.
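The greedy grouping is a few lines of code. In this sketch (ours), a column pattern is simply the set of rows in which the column has an entry:

```python
def cpr_groups(col_patterns):
    """Greedy CPR grouping: columns in a group have disjoint row patterns,
    so one differenced evaluation of F estimates all of them."""
    remaining = list(range(len(col_patterns)))
    groups = []
    while remaining:
        group, merged, rest = [], set(), []
        for j in remaining:
            if col_patterns[j].isdisjoint(merged):
                group.append(j)
                merged |= col_patterns[j]
            else:
                rest.append(j)
        groups.append(group)
        remaining = rest
    return groups

# For a tridiagonal Jacobian of order 9 this returns the three groups
# [0, 3, 6], [1, 4, 7], [2, 5, 8] corresponding to the example above:
# cols = [set(range(max(0, j - 1), min(9, j + 2))) for j in range(9)]
# print(cpr_groups(cols))
```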

Fig. 15.9.1. Nonzero pattern where the CPR algorithm is ineffective.

The experience of Curtis et al. with this heuristic indicates that it can greatly reduce the number of function evaluations, though it need not be optimal. An a priori permutation of columns (to alter the order in which they are examined) or an application of the algorithm to J^T rather than J sometimes changes the number of groups slightly, but the significance of the change is minor. Coleman and Moré (1983) have considered graph-colouring algorithms and were able to improve on the CPR algorithm, but their further reduction in the number of function evaluations is often small. Software for their algorithms is given by Coleman, Garbow, and Moré (1984b). These schemes partition the columns of the Jacobian matrix and thus have great potential for exploiting parallel architectures, since each finite difference and each set of columns of J can be evaluated independently on separate processes. In general, the CPR algorithm is very effective, but one problem where the above algorithms are totally ineffective is shown in Figure 15.9.1 (see Exercise 15.9). Now consider the case where the matrix J is known to be symmetric. Once estimates for the coefficients in column j are available, they may be used for row j too. Not only does this ensure that the estimated J is symmetric, but also having some rows already known may allow more columns to be estimated at once. For example, if the matrix of Figure 15.9.1 is symmetric, we may first estimate the last column (and row) by changing x_n alone and then estimate the rest of the diagonal by changing x_1, x_2, ..., x_{n−1} all together. This is the basis for the ‘direct’ method of Powell and Toint (1979). In general, they group the columns as in the CPR algorithm except that they allow columns in a group to have entries in the same row if that row corresponds to a column that has already been treated. To enhance this effect they order the columns by decreasing numbers of entries when choosing which to place first in a group. Powell and Toint's second suggestion is to substitute an estimate for J_{ij} whenever J_{ji} is needed to calculate another coefficient. For example, this allows a symmetric tridiagonal matrix to be estimated with two instead of three groups of columns. The two groups consist of the odd-numbered and the even-numbered columns, respectively. The first group provides estimates for J_{11}, J_{21} + J_{23}, J_{33}, J_{43} + J_{45}, ... and the second for J_{12}, J_{22}, J_{32} + J_{34}, J_{44}, J_{54} + J_{56}, ... . The diagonal entries are thus all available directly, as is J_{12}. Using the relation J_{21} = J_{12} allows

us to calculate J_{23} from J_{21} + J_{23}, and we then calculate J_{34} from J_{32} + J_{34}, and so on. In general, Powell and Toint recommend applying the CPR algorithm to the lower triangular part of a permutation of the structure of J. Their algorithm again allows a Hessian matrix of the structure of Figure 15.9.1 to be estimated with 3 function evaluations. The substitution method of the last paragraph is usually more efficient than the direct method of the last-but-one paragraph. For example, it needs only r + 1 groups for a symmetric band matrix of bandwidth 2r + 1, whereas the direct method needs 2r + 1 groups. Admittedly it is less accurate, though this is unlikely to be worrying unless the quantities |J_{ij} h_j| vary widely. Both methods have been improved slightly by Coleman and Moré (1984), with corresponding software given by Coleman, Garbow, and Moré (1984a). So far, we have assumed that the sparsity pattern is known. With finite differences, determining the pattern requires n + 1 function evaluations. Automatic differentiation, see for example Griewank and Walther (2008), provides an alternative. In the forward method, the values of each variable and its derivatives are stored. The calculation is broken down into a sequence of unary and binary operations. For a unary operation, we have
\[
c = g(a); \qquad \frac{\partial c}{\partial x_i} = \frac{dg}{da}\, \frac{\partial a}{\partial x_i}
\]
and for a binary operation we have
\[
c = g(a, b); \qquad \frac{\partial c}{\partial x_i} = \frac{\partial g}{\partial a}\, \frac{\partial a}{\partial x_i} + \frac{\partial g}{\partial b}\, \frac{\partial b}{\partial x_i}.
\]

For a unary operation, the sparsity pattern of the derivatives of c is that of a. For a binary operation, it is the union of the patterns of a and b. Therefore, the sparsity pattern of the rows of J can be formed during the calculation. Where the sparsity pattern of J is known, automatic differentiation can be used to calculate non-overlapping columns using full vectors of derivatives, so the CPR algorithm and other graph colouring algorithms are useful here too. Working this way is likely to be faster than working with a dynamic sparsity pattern for every binary operation.
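A toy forward-mode sketch (ours) that propagates only the dependence sets shows how the row patterns of J emerge as a by-product of evaluating F:

```python
class Dep:
    """A value paired with the set of independent variables it depends on."""
    def __init__(self, value, deps):
        self.value, self.deps = value, frozenset(deps)
    def __add__(self, other):          # binary operation: union of the patterns
        return Dep(self.value + other.value, self.deps | other.deps)
    def __mul__(self, other):          # binary operation: union of the patterns
        return Dep(self.value * other.value, self.deps | other.deps)

def variable(i, value):
    return Dep(value, {i})

# F_1(x) = x_0 * x_2 + x_1 depends on variables {0, 1, 2}:
# x = [variable(i, v) for i, v in enumerate([1.0, 2.0, 3.0])]
# print(sorted((x[0] * x[2] + x[1]).deps))      # -> [0, 1, 2]
```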

15.10 Updating a sparse Hessian matrix

The sparsity constraint (15.8.7) on the updating of the Hessian subject to minimum-norm change (15.8.6) while maintaining symmetry and satisfying the quasi-Newton equation (15.8.5) represents a non-trivial modification to the standard update. Simply updating Bk to Bk+1 using a standard update and then imposing the sparsity pattern of the Hessian on Bk+1 (sometimes called the ‘gangster operator’ because it fills Bk+1 with holes) is not adequate because Bk+1 would no longer satisfy (15.8.5). It is interesting to note that the work associated with updating Bk+1 using the algorithm of this section is often increased rather

than reduced by sparsity. The saving comes from being able to operate with a sparse B_{k+1}, not in computing it. The algorithm of Toint (1977) for solving this problem is now described. Let X be the matrix with entries
\[
x_{ij} = \begin{cases} (x_{k+1} - x_k)_j & \text{if } b_{ij} \text{ is an entry} \\ 0 & \text{otherwise.} \end{cases} \qquad (15.10.1)
\]
Let Q be the matrix with entries
\[
q_{ij} = \begin{cases} x_{ij}\, x_{ji} & i \ne j \\ x_{ii}^2 + \sum_{k=1}^{n} x_{ki}^2 & i = j. \end{cases} \qquad (15.10.2)
\]

If no column of X is identically zero, then Q is symmetric and positive definite. Using Q, we solve the system of equations Qλ = g(xk+1 ) − g(xk ) − Bk (xk+1 − xk )

(15.10.3)

for λ. Then it can be shown that B_{k+1} defined by
\[
B_{k+1} = B_k + E, \qquad (15.10.4)
\]
where
\[
[E]_{ij} = \begin{cases} \lambda_i x_j + \lambda_j x_i & \text{if } b_{ij} \text{ is an entry} \\ 0 & \text{otherwise,} \end{cases} \qquad (15.10.5)
\]

satisfies all of the constraints, but unfortunately B_{k+1} is not necessarily positive definite, and methods based on this approach (for example, Toint (1981)) have often been found to be less satisfactory than fresh calculation of the approximation (see Thapa (1983)). Sorensen (1981) has found cases where imposing the additional constraint of positive definiteness leads to arbitrarily large entries in E and a poor approximate Hessian. The critical part of the computation lies in building and factorizing Q. However, since Q is a sparse positive-definite matrix of the same structure as B_k, the ANALYSE used for B_k can be used to solve this system. The proof that B_{k+1} satisfies all of the constraints is given by Toint (1977).
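For illustration, the update can be written down directly with dense arrays. This is a sketch (ours, not Toint's code); the diagonal of Q is formed here from the row sums of the squared entries of X, which is the convention under which the secant equation holds exactly in this sketch.

```python
import numpy as np

def toint_update(B, pattern, s, y):
    """Sparse symmetric update B_{k+1} = B + E in the spirit of (15.10.1)-(15.10.5).

    B       -- current approximation B_k (dense here for clarity)
    pattern -- symmetric boolean mask of permitted entries, diagonal included
    s, y    -- x_{k+1} - x_k and g(x_{k+1}) - g(x_k)"""
    n = len(s)
    X = np.where(pattern, s[np.newaxis, :], 0.0)        # x_ij = s_j on the pattern
    Q = X * X.T                                         # q_ij = x_ij x_ji, i != j
    Q[np.diag_indices(n)] = X.diagonal() ** 2 + (X ** 2).sum(axis=1)
    lam = np.linalg.solve(Q, y - B @ s)
    E = np.where(pattern, np.outer(lam, s) + np.outer(s, lam), 0.0)
    return B + E                                        # satisfies (B + E) s = y
```

In a sparse implementation Q would of course be stored and factorized with the same data structure and ANALYSE as B_k, as noted above.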

15.11 Approximating a sparse matrix by a positive-definite one

In this section, we consider another sparse matrix approximation problem: for a given sparse symmetric matrix, find the closest symmetric positive-definite matrix with the same sparsity pattern. There are at least three cases where a solution to this problem is important. The first case arises in the application that we were discussing in the previous section, namely sparse optimization. Here, it is normally desirable to keep Bk positive definite and one way of doing this would be to replace the matrix generated by equation (15.10.4) by a nearby positive-definite matrix of the same

sparsity pattern. We are unaware of work using this approach, although we believe the challenge to be a legitimate one. The second case is encountered in building a large circuit analysis model from measured data. The positive definiteness of the matrix is necessary to preserve the physical passiveness of the model: a negative eigenvalue would imply positive exponentials in the solution of the differential equations. Yet, since the measurements are subject to uncorrelated error, it frequently happens that the resulting matrix is indefinite. The problem is to find a nearby matrix with the same sparsity that gives a good representation of the measured data. As an aside, note that if there is a negative eigenvalue that is not comparable with the data errors for the largest entries of A this probably indicates a blunder (a wrong entry, not a rounded one) and this process is useful for identifying them. The problem can arise in a quite different way when ‘drop tolerances’ are used (see Section 10.10). One difficulty is that the tolerance is often based on the magnitude of uncertainty in the data. Thus a matrix can change from positive definite and reasonably well-conditioned to an indefinite or ill-conditioned one. Ignoring the sparsity constraint, the closest positive-semidefinite matrix (in the Frobenius norm) to a given symmetric one is known to be the one with the same eigenvectors, but with the negative eigenvalues changed to zero. Similarly the closest positive-definite matrix with minimum eigenvalue greater than or equal to σ is obtained in the same way except all eigenvalues less than σ are replaced by σ. Unfortunately this approximation usually destroys sparsity. An alternative is to select a value δ such that λmin + δ = σ, where λmin is the minimum eigenvalue. Then the matrix A + δI would have the same off-diagonal sparsity as the original matrix, would have its minimum eigenvalue σ, and would be close to A if A is ‘almost’ positive definite. Of course, we seldom know the leftmost eigenvalue of our matrix and the cost of computing it will be more than that for the matrix factorization. Thus, heuristics are used so that the factorization is that of a positive-definite matrix close to the original. Gill et al. (1981) propose, on page 111, an algorithm that modifies pivot entries to make them sufficiently positive together with bounding the size of the off-diagonal entries. One problem of modifying the matrix in this way is that the modification can be significantly larger than needed for obtaining a positive-definite matrix. Eskow and Schnabel (1991) have proposed an algorithm that looks ahead in the factorization to try and detect at an earlier stage that the matrix is not positive definite. They monitor the diagonal entries of the whole reduced matrix and act as soon as a pivot would cause one of these later entries to become negative. At this point an appropriate change is made to the pivot and the factorization proceeds. Eskow and Schnabel (1991) show that this strategy both yields a factorization of a positive-definite matrix and that the difference between this matrix and the original one is less than if potential negative pivots are just augmented. Their algorithm is implemented in the HSL code MA57.

Computing eigensystems would be impractical if we were working with very large matrices. However, where the matrix is a sum
\[
A = \sum_l A^{[l]} \qquad (15.11.1)
\]
of matrices with entries only in dense submatrices, this approach is practical for each A^{[l]} and does not destroy the sparsity of the overall matrix.

Solution methods based on orthogonalization

Most of this book is concerned with solving linear sets of equations using variants of Gaussian elimination. In this section, we briefly consider approaches based on the orthogonal factorization A = QU (15.12.1) where Q is orthogonal and U is upper triangular. A very desirable property of multiplying a matrix by an orthogonal matrix is that the 2-norm of each column is not altered. Therefore, provided the entries of the columns do not vary widely in size, we need not worry about the instability associated with growth in the size of the matrix entries. This is a decided advantage over Gaussian elimination and means that we are free to choose any row or column permutation. There are two main options for constructing Q. The first is to use Givens transformations of the form   cos θ sin θ (15.12.2) − sin θ cos θ with θ = arctan(aik /akk ). If the matrix (15.12.2) is expanded into an n×n matrix by placing its four entries in positions (k, k), (k, i), (i, k), and (i, i) of I, the updated entry ai,k is zero. These transformations are analogous to the elementary operations of Gaussian elimination inasmuch as a sequence of these may be applied in turn to each of the subdiagonal entries of column 1, then to those of column 2, etc. and each eliminated entry remains zero during the subsequent operations. We eventually obtain the upper triangular matrix U. As for Gaussian elimination, these elementary rotations can be omitted when the target entry is already zero. Unlike Gaussian elimination, fill-in can occur in both the pivot and the non-pivot row. Because of the extra fill-in to the pivot row, the factors Q and U are usually much denser than the LU factors produced by Gaussian elimination to the extent that a straightforward application of the factorization (15.12.1) to the solution of the equation Ax = b is generally not competitive. Premultiplication of both sides of equation (15.12.1) by their transposes shows that the matrix U is just the Cholesky factor of the normal matrix AT A (apart from possible sign changes in the rows of U). This allows a sparsity preserving ordering on AT A to provide both a good column order for the

SOLUTION METHODS BASED ON ORTHOGONALIZATION

347

orthogonal factorization and the sparsity pattern of U. The pattern of U is independent of the ordering of the transformations, but the amount of work required to construct it is not. If the column order is chosen beforehand, the orthogonal factorization may proceed row by row. In each row, the subdiagonal entries are processed from the left. It is easily verified that each eliminated entry remains zero during subsequent operations. Working by rows in this way is preferred since a dynamic data structure is required for only one row at a time. An alternative is to reduce each column k in turn by using an orthogonal Householder reflection. For the first column, this has the form I − 2vvT , where v is obtained by normalizing the vector that is equal to the first column a1 except that its first entry is a11 − ka1 k2 . In the sparse case, the fill-in when using Householder reflections will always be greater than or equal to that of using Givens transformations (Duff and Reid 1975) since the sparsity pattern of the rows involved becomes that of the union of the sparsity patterns of all those rows. Recent implementations of sparse QR (Matstoms 1994; Amestoy, Duff, and Puglisi 1996b; Davis 2011) are multifrontal in nature. Agullo, Buttari, Guermouche, and Lopez (2013) have developed a parallel code qr mumps for multicore architectures. An ordering of AT A is used to generate an assembly tree for AT A. The rows of A are grouped according to the column in which they have an entry that is earliest in the pivot sequence. At a leaf node of the tree, the submatrix containing the groups of the l columns eliminated there is packed into a full rectangular matrix and the l columns ordered to its front. A QR factorization is applied that is either full or that leaves the first l columns in upper triangular form. The leading l rows are part of U. The submatrix without the leading l rows and columns is set aside for processing at the parent node. At a non-leaf node, the submatrices from the children are packed into a full matrix together with any rows of A in the groups of the l columns eliminated at the node. The rows do not overlap, so there is no addition involved. However, the columns do overlap so each row must be expanded to the sparsity pattern of all the columns involved. There follows a QR factorization and storage of the results, just as for a leaf node. Zeros created at a node will remain as later nodes are processed. The storage needed for the sequence of Q matrices is far less than for the matrix Q when stored as a sparse matrix and this makes a sparse QR factorization more competitive for square systems and a good method for solving sparse overdetermined systems in the `2 norm. The code of Davis (2011) is used by the MATLAB QR command. It is possible to solve the system Ax = b without storing Q by performing the operations on the right-hand side at the same time as A is reduced to upper triangular form. It is also possible to solve for subsequent right-hand sides

348

OTHER SPARSITY-ORIENTED ISSUES

without having Q available. This requires multiplication of the right-hand side by AT and is not as stable as using Q. This approach is called the method of seminormal equations (Bj¨ orck 1987). Additional discussion of orthogonalization methods is outside the scope of this book. 15.13

Hybrid methods

In spite of all our best efforts as described in this book, it is possible that the set of sparse equations are too large for us to solve, usually because of memory constraints. Furthermore, as we have seen earlier, although we can exploit parallel architectures, there are limits to how effectively we can do this. In this section, we discuss a class of methods that attempts to overcome these limitations by the judicious use of iterative solution techniques in combination with our direct methods. We have not discussed iterative methods at all so far and will not do so here. If the reader wants to learn more about such approaches, there are several good books available, for example those of Greenbaum (1997), Saad (2003), and van der Vorst (2003). The essence of hybrid methods is that we will continue to use our sparse direct approach, in fact we will continue to use a sparse direct code, but we will not use it on the overall problem. Examples of publicly available hybrid codes are given in Table 15.13.1. For further information on these codes, we suggest the reader ‘googles’ the names of these packages. Table 15.13.1 Some publicly available hybrid codes. ShyLU MaPHYS PDSLin pARMS HIPS ABCD

Erik Boman, SANDIA Labs Luc Giraud, Bordeaux Sherry Li, LBNL Yousef Saad, Minnesota Pascal Henon, Bordeaux/Pau Mohamed Zenadi, Toulouse

We examine two common ways in which this can be done. In both cases, we will use the direct method on subproblems. 15.13.1

Domain decomposition

We first consider domain decomposition, which was historically used in the solution of elliptic partial differential equations. There are several texts that discuss this approach, for example books by Chan and Mathew (1994) and Smith, Bj¨orstad, and Gropp (1996). We illustrate this by the simple case in which a vertex separator Γ divides the region into four subdomains. The discretized version of a PDE on this domain can be represented by the matrix

HYBRID METHODS

     

A11 A22 A33 A44 AΓ1 AΓ2 AΓ3 AΓ4

349

 A1Γ A2Γ   A3Γ  , A4Γ  AΓΓ

where we have partitioned the variables according to those inside one of the P4 (i) regions or on the border Γ, the final block can be written as AΓΓ = i=1 AΓΓ , (i) and AΓΓ involves variables on Γ adjacent to region i. The matrices   Aii AiΓ (i) , i = 1, 2, 3, 4 AΓi AΓΓ can be factorized independently without pivoting in the second block, to yield the local Schur complements (i)

S(i) = AΓΓ − AΓi A−1 ii AiΓ ,

(15.13.1)

P4 whence the Schur complement of the overall system is i=1 S(i) . Note that this is exactly the same as we did in Section 15.4.1. In a hybrid method, the reduced system corresponding to the Schur complement is solved by an iterative method. One of the main issues is deciding whether to form the matrices S(i) explicitly or rely on holding the factorizations Aii = Li Ui and performing forward and back-substitutions for each iteration of the iterative method. Giraud, Marrocco, and Rioual (2005) have experimented with these two approaches. We show some of their results comparing the two strategies when run on 16 subdomains on a regular grid of size 400×400 in Table 15.13.2. This means Table 15.13.2 Comparison of implicit and explicit substructuring. Times in seconds on an Origin 2000. Matrix-vector is the time for computing the product of a vector with the Schur complement. Implicit Explicit

Preparation Matrix-vector 10.2 1.60 18.4 0.07

that the explicit formulation will be faster if there are more than six iterations. Additionally, in the explicit case, preconditioning and scaling are easier. In their hybrid approach they use MUMPS as the direct method to factorize the matrices Aii and the GMRES iterative method on the Schur complement with an additive Schwarz preconditioning, following the work of Carvalho, Giraud, and Meurant (2001). We show some numerical results on an industrial problem from Giraud et al. (2005) in Table 15.13.3. The problem was of size 1 214 758 and was run

350

OTHER SPARSITY-ORIENTED ISSUES

Table 15.13.3 Runs on an industrial problem on an Origin 2000.

MUMPS MUMPS MUMPS MUMPS

on on on on

Newton steps Simulation time (seconds) whole system, using AMD 166 3 995 whole system, using ND 166 3 250 local and on interface 166 2 527 local, GMRES on interface 175 1 654

on 32 subdomains. This problem was from the modelling of a semiconductor (heterojunction) device and was nonlinear. It was solved using Newton’s method with the linear system being solved at each step using an explicit domain decomposition approach. In Table 15.13.3, constructed from the results in Giraud et al. (2005), we give the number of Newton steps required and the overall time required for the simulation. The first two lines are for running a direct method on the whole problem using the direct solver, MUMPS, showing the advantage of the nested dissection ordering over AMD. In the third line, we use MUMPS on the subdomains but form the Schur complement and then solve it again using MUMPS and we can see further gains are obtained. Finally, we use GMRES on the Schur complement (interface) and show a further time reduction even though slightly more iterations of the Newton method are required. 15.13.2

Block iterative methods

The second approach that we discuss is a block iterative method. A very simple example of a block iterative method involves the matrix   A11 A12   A21 A22 A32 . A=  A32 A33 A34  A41 A43 A44 where the off-diagonal blocks are extremely sparse so that it is nearly block diagonal. It can be solved efficiently by using an iterative method (for example, GMRES) on the system DAx = Db where   D= 

A−1 11

 A−1 22

A−1 33

A−1 44

 . 

Because there are very few entries in the off-diagonal blocks of A, DA is a low-rank modification of I. It follows that an iterative method such as GMRES will be very efficient. Iterative people would call this process block Jacobi preconditioning, but we think of it as a way of extending the feasibility of direct

HYBRID METHODS

351

methods. Note that each iteration will involve forward and back-substitution through Lii Uii , i = 1, 2, 3, 4, perhaps with sparse right-hand sides. We now discuss a somewhat more powerful block iterative approach. We can partition the system Ax = b as  1   1  A b  A2   b2      ·     x = ·  ·  ·      ·  ·  AP bP and then the block Cimmino algorithm computes a solution iteratively from an initial estimate x(0) according to: Ai ui = ri = bi − Ai x(k) x(k+1) = x(k) + ω

P X

i = 1, ....P

ui

(15.13.2) (15.13.3)

i=1

where we note the independence of the set of P equations, which allows easy parallelization. The shape of these subproblems is as shown in Figure 15.13.1, that is, they are underdetermined systems. We choose to solve these underdetermined systems (15.13.2) using the augmented system 

I Ai Ai 0

T



ui vi



 =

 0 , ri

and we will solve these using a direct method. There are many aspects to the block Cimmino approach: • The partitioning of the original system (Drummond, Duff, Guivarch, Ruiz, and Zenadi 2015).

Fig. 15.13.1. Shape of subproblems.

352

OTHER SPARSITY-ORIENTED ISSUES

• The method of solving the rectangular subsystems. • Preconditioning and accelerating the iterative method. We show a comparison of a block Cimmino approach with a direct method (MUMPS) in Table 15.13.4 where some of the results are from Duff, Guivarch, Ruiz, and Zenadi (2015). The iterative method that they used is independent of the value of ω in (15.13.3), so they used the value 1. It should be added that in nearly all the test cases from the Florida collection, MUMPS does better so, as advertised at the beginning of this section, hybrid methods should be considered as a possible extension to direct methods only when they encounter problems. Table 15.13.4 Comparison of MUMPS and block Cimmino (BC). 64 mpi-processes and 16 thread per mpi-process. Times in seconds. Problem cage13 cage14 Hamrle3

Factorization BC iter MUMPS BC 76 2.8 4.1 F 5.8 14.5 208.4 4.7 300.5

F: insufficient memory for MUMPS.

Exercises 15.1 Verify the matrix modification formula (15.2.2).   A11 A12 15.2 For the symmetric matrix A = with A22 of order 1, determine rankAT12 A22 two matrices V and W such that   0  .    .   T A11 . A=   + VW   .   0 0.. . .01 15.3 Generalize the result of Exercise 15.2 for A22 of order k. 15.4 Illustrate how matrix partitioning can be used to advantage when solving a perturbed problem with changes confined to a border as in (15.3.6). 15.5 Let the n × n matrix A be given as aij = 1, i 6= j aii = 1 + p, p > 0. For any b, n, and p, show that the solution to

HYBRID METHODS

353

Ax = b is given by P

xi = b i −

i bi . p(1 + np)

[Hint: use the matrix modification formula (15.2.3) and a generalization of (15.3.7)] 15.6 Show that the minimum perturbation, F, to the matrix A of (15.5.3), where F lies within the sparsity pattern of A and is such that (15.5.4) is a solution to (A + F)x = b, is given by   0 00 0 0 0 0 0  F=  0 0 0 0 , α00α where α = σ/(δ − σ). 15.7 Verify the formulae (15.7.3). 15.8 Using the notation of Section 15.7, show that any entry zij of Z with i > j and uji an entry can be computed without calculating an inverse entry outside the sparsity pattern of (L\U)T . 15.9 Why is the CPR algorithm ineffective on the matrix in Figure 15.9.1? 15.10 Show that Q of (15.10.2) is symmetric and positive definite if no column of X defined by (15.10.1) is identically zero. 15.11 Suppose A is a 10 × 10 matrix given by the identity plus an entry 1 +  in both the (5,4) position and the (4,5) position. This is clearly an ill-conditioned matrix (singular for  = 0). Proceeding with Gaussian elimination would yield a very small pivot in the (5,5) location. By making a rank one-change (adding to the (5,5) position), the factorization would proceed normally. Would the matrix modification formula be successful in this case? If so, what about subsequent iterative refinement? 15.12 Give an example where no fill-in occurs when performing Gaussian elimination but fill-in is significant when Givens’ rotations are used to effect the QU factorization (15.12.1). 15.13 Give an example where the pattern of U generated by George and Heath’s technique (Section 15.12) severely overestimates the actual pattern for the factorization (15.12.1).

Research exercises R15.1 Pick a domain of interest to you (e.g., structural analysis, power systems simulation, . . .) where you have access to large models that are created from connecting two or more smaller models. Compare the performance of the following solution algorithms: (i) Solve the general larger system without regard for the subsystems, using a general solution method to factorize the matrix. (ii) Ignoring the subsystem structure, use one of the dissection methods to create a substructured problem and factorize the matrix taking advantage of the resulting partitioned form.

354

OTHER SPARSITY-ORIENTED ISSUES

(iii) Constrain the ordering to number the nodes connecting the smaller models to the end to create the partitioned form of the matrix, and factorize the matrix taking advantage of the resulting partitioned form. (iv) Use the matrix modification formula to account for the interconnections between the submodels. Solve the larger system using the known factors of the submodels, and correct for the modifications due to the added connections. Are there any clear trends in solution performance for your class of problems? R15.2 It would appear, in examining (15.2.2), that numerical stability in the use of the matrix modification formula fundamentally depends on the conditioning of S. If the motivation is based on taking advantage of the submodel structure to solve the large irreducible problem in terms of its subproblems, there are a variety of ways (and even of ranks) to select S, V, and W. Consider this example where the matrix A made up of two submodels as   B 0 A= . 0 C Suppose the connected problem is A + δA where δA has a single nonzero entry in both the 1,2 block and the 2,1 block. Then the modification formula can be applied in a number of ways. We can make a rank-2 modification by eliminating both of the new entries and work with A. We can make a rank-1 modification removing only the 1,2 entry (or the 2,1 entry) and then solve the block triangular system using the original blocks. If the addition of the connecting entries change the values of B and C (for example, on the diagonals), we can modify these entries as well to go back to the original A. Stability in the use of this model depends on the choices made, not simply on the condition of S. Explore this stability question, accounting for the freedom and flexibility that is represented in the modification formula.

APPENDIX A MATRIX AND VECTOR NORMS A norm provides a means of measuring the size of a vector or matrix. Formally, a norm ||x|| of the vector x is a non-negative real value having the properties ||x|| = 0 if and only if x = 0, ||αx|| = |α|||x|| for any scalar α, and ||x + y|| ≤ ||x|| + ||y|| for any two vectors x and y.

(A.1a) (A.1b) (A.1c)

Norms of interest in this book are ||.||p , for p = 1, 2 and ∞, defined for real or complex vectors of length n by the equations ||x||1 =

n X

|xi |

(A.2a)

i=1

v u n uX ||x||2 = t |xi |2

(A.2b)

i=1

||x||∞ = max |xi |

(A.2c)

i

It is straightforward to deduce from these definitions that the inequalities 1

1

0 ≤ n− 2 ||x||2 ≤ ||x||∞ ≤ ||x||2 ≤ ||x||1 ≤ n 2 ||x||2

(A.3)

are true for any vector x. These inequalities allow us to choose p on the basis of convenience in most instances. Since matrices are used as operators, a natural measure for the size of a matrix is the maximum ratio of the size of Ax to the size of x. This leads to the definition ||A|| = max ||Ax||. (A.4) ||x||=1

For the infinity norm and any vector x such that ||x||∞ = 1, we find the relations X X ||Ax||∞ = max | aij xj | ≤ max |aij |. (A.5) i

i

j

j

If the latter maximum is attained for i = k and v is the vector with components vj = sign(akj ), j=1, 2,..., n, then the following relations hold X ||v||∞ = 1, (Av)k = |akj |. (A.6) j

356

MATRIX AND VECTOR NORMS

In view of the definition (A.4), it follows that A has norm X ||A||∞ = max |aij | i

(A.7)

j

and that v is a vector attaining the maximum (A.4). In view of relation (A.7) the infinity norm is also known as the row norm. For the one norm and any vector x such that ||x||1 = 1, we find the relations X X aij xj | | ||Ax||1 = j

i



XX j



X

{max

j

= max j

|aij ||xj |

i j

X

X

|aij |}|xj |

(A.8)

i

|aij |.

i

If the latter maximum is attained for j = k and v is column k of I, the following equalities hold: X ||v||1 = 1 and ||Av||1 = |aik |. (A.9) i

It follows that A has norm X

||A||1 = max j

|aij |

(A.10)

i

and that v is the vector attaining the maximum (A.4). In view of relation (A.10), the one norm is also known as the column norm. For the two norm, maximizing ||Ax||2 /||x||2 corresponds to maximizing its square, whose value is xT AT Ax/xT x and, by the Rayleigh–Ritz principle, this has value equal to the largest eigenvalue of AT A. The maximum is attained for x equal to the corresponding eigenvector. If the eigenvalues of AT A are σi2 , i=1, 2,..., n, the norm is therefore ||A||2 = max σi . i

(A.11)

The numbers σi are known as the singular values of A. In connection with conditioning, the minimization of ||Ax||/||x|| is of interest. Writing Ax as v, we seek to minimize | v||/||A−1 v||. This minimum occurs for the vector v that maximizes ||A−1 v||/||v||. It follows that min

1 ||Ax|| = ||x|| ||A−1 ||

(A.12)

and that the minimum is attained for the vector x = A−1 v where v is the vector that attains max ||A

−1

v||/||v||.

(A.13)

MATRIX AND VECTOR NORMS

357

P In the case of the one norm, if column k attains maxj i |(A−1 )ij |, then v is column k of I and x is therefore the column of A−1 whose P norm is greatest. In the case of the infinity norm, if row k attains maxi j |(A−1 )ij | then v has components vj = sign{(A−1 )kj } and x is therefore the linear combination of columns of A−1 with multipliers ±1 that maximizes ||x||∞ . A more direct approach to minimizing ||Ax||/||x|| is available for the two norm since the Rayleigh–Ritz principle immediately yields the result min

||Ax||2 = min σi i ||x||2

(A.14)

by the same argument as that used to show the truth of (A.11). The minimum is attained for the eigenvector of AT A corresponding to its least eigenvalue. Corresponding to the properties (A.1) satisfied for vector norms, the matrix norms (A.4) all satisfy the equations ||A|| = 0 if and only if A = 0 ||αA|| = |α|||A|| for any scalar α, and ||A + B|| ≤ ||A|| + ||B|| for any two m × n matrices A and B.

(A.15a) (A.15b) (A.15c)

Furthermore, they satisfy the inequalities ||Ax|| ≤ ||A|| ||x||,

(A.16)

for any m × n matrix A and any vector x of length n, and ||AB|| ≤ ||A|| ||B||,

(A.17)

for any m × n matrix A and any n × p matrix B. Corresponding to inequalities (A.3) the inequalities 1

1

0 ≤ n−1 ||A||2 ≤ n− 2 ||A||∞ ≤ ||A||2 ≤ n 2 ||A||1 ≤ n||A||2

(A.18)

hold. They are a simple deduction from (A.3) and (A.4). Because of the difficulty of evaluating the two norm (see (A.11)), the alternative v uX n um X 2 |aij | , (A.19) ||A||F = t i=1 j=1

called the Frobenius norm, is often used. Note that, trivially, ||x||F = ||x||2 for vectors. It can be shown that ||.||F satisfies all the properties (A.15) - (A.17), but it is not so useful because the bounds (A.16) and (A.17) are not so tight.

358

MATRIX AND VECTOR NORMS

For proofs of these results and a more detailed treatment, we refer the reader to Chapter 4 of Stewart (1973). Note that scaling is important when norms are used. For example, for the matrices       0.00 0.00 1.00 2420 0.001 2.42 , (A.20) , and H = , B= A= 1.00 1.58

1.00 1.58

0.00 1.58

||H|| is small compared with ||B||, but is not small compared with ||A||. The scaling that produces B from A leaves H unchanged, so care must be exercised in the use of norms.

APPENDIX B PICTURES OF SPARSE MATRICES Throughout the book, we have illustrated important points by using matrices from a wide range of application areas, mostly from the University of Florida collection (Davis and Hu 2011). Here, we present the structure of some smaller matrices from this collection. The advantage of showing smaller matrices is that their structure is easier to see. We have intentionally chosen only one matrix from each of the major application areas in order to stress the quite different characteristics of sparse systems in each area. We present these patterns in Figures B.1 to B.11. In each case, the application area is given in the caption. Further details on each matrix are available on the web by searching for the matrix name. In Figures B.12 to B.15, we show the pattern of the L\U form obtained after using different minimum degree codes on a five-diagonal matrix arising from the five-point discretization of the Laplacian operator on a 20 × 20 grid. We do this to illustrate the effect of tie-breaking on the minimum degree ordering (Section 7.7) and to show the effect of postordering the assembly tree in the multifrontal approach (Section 12.2).

Fig. B.1. HB/str 0. Linear programming basis. Order of matrix is 363 and number of nonzeros is 2454.

360

PICTURES OF SPARSE MATRICES

Fig. B.2. HB/bcsstk13. Symmetric pattern of stiffness matrix from dynamic analysis. Order of matrix is 2003 and number of nonzeros is 83883.

Fig. B.3. HB/can 256. Symmetric pattern of structures matrix from aerospace. Order of matrix is 256 and number of nonzeros is 2916.

PICTURES OF SPARSE MATRICES

361

Fig. B.4. HB/impcole c. Unsymmetric pattern from hydrocarbon separation problem in chemical engineering. Order of matrix is 225 and number of nonzeros is 1308.

Fig. B.5. HB/west0156. Unsymmetric pattern from simple chemical plant model. Order of matrix is 156 and number of nonzeros is 371.

362

PICTURES OF SPARSE MATRICES

Fig. B.6. HB/fs 183 3. Unsymmetric pattern from solution of ordinary differential equations in photochemical smog. Order of matrix is 183 and number of nonzeros is 1069.

Fig. B.7. HB/fs 760 1. Unsymmetric pattern from solution of ordinary differential equations in ozone depletion studies. Order of matrix is 760 and number of nonzeros is 5976.

PICTURES OF SPARSE MATRICES

363

Fig. B.8. HB/orani678. Unsymmetric pattern from economic modelling. Order of matrix is 2529 and number of nonzeros is 90158.

Fig. B.9. HB/gre 1107. Unsymmetric pattern from simulation of computing systems. Order of matrix is 1107 and number of nonzeros is 5664.

364

PICTURES OF SPARSE MATRICES

Fig. B.10. HB/bcspwr08. Symmetric pattern from representation of Western USA power system. Order of matrix is 1624 and number of nonzeros is 6050.

Fig. B.11. HB/eris1176. Symmetric pattern from electric circuit modelling. Order of matrix is 1176 and number of nonzeros is 18552.

PICTURES OF SPARSE MATRICES

365

Fig. B.12. Pattern of L\LT factors with minimum degree ordering from SPARSPAK. Five-point discretization of the Laplacian operator on a 20 × 20 grid.

Fig. B.13. Pattern of L\LT factors with minimum degree ordering from YSMP. Five-point discretization of the Laplacian operator on a 20 × 20 grid.

366

PICTURES OF SPARSE MATRICES

Fig. B.14. Pattern of L\LT factors with minimum degree ordering from MA27 (prior to postorder of tree). Five-point discretization of the Laplacian operator on a 20 × 20 grid.

Fig. B.15. Pattern of L\LT factors from MA27 after postordering of assembly tree. Five-point discretization of the Laplacian operator on a 20 × 20 grid.

APPENDIX C SOLUTIONS TO SELECTED EXERCISES 1.2 1 2 3 4 5 6 7 8 1×× 2×××× × 3 ×× × 4 × ×× 5 ×××× 6 × ×× 7 × ×× 8 × ×× 1.5 No; any choice for the first elimination introduces a new entry. 1.6 The new matrix and its digraph are 

×

  

××   × × × ××

×



The (4,4) diagonal entry has been moved off the diagonal so now is represented in the digraph by an edge. Rows 2 and 3 are unchanged so the edges (2,3), (3,2), and (3,4) remain. 1.7 When reordering the rows (and corresponding columns to preserve symmetry) of a symmetric matrix, this corresponds to renumbering the nodes of the graph. The structure of the graph remains the same. However, when reordering rows of an unsymmetric matrix, the corresponding digraph is different from the original digraph. In Exercise 1.6, it even has one additional branch, because we have one additional off-diagonal term in the resulting matrix.

368

SOLUTIONS TO SELECTED EXERCISES

2.2

do k = 1,tauy i = index_y(k) iw(k) = itemp(i) itemp(i) = -k end do do k = 1,taux i = index_x(k) if(itemp(i)0) j = col_index(k) col_index(k) = col_start(j) col_start(j) = k if (link(k) j

p=1

uij = (aij −

i−1 X p=1

lip dpp upj )/dii , i < j

SOLUTIONS TO SELECTED EXERCISES

371

3.7 The steps are forward substitution c 1 = b1 , P ck = bk − k−1 j=1 lkj cj , k = 2, 3, . . . , n, scaling ek = ck /dkk , k = 1, 2, ..., n, and back-substitution x n = en P x k = ek − n j=k+1 ukj xj , k = n − 1, n − 2, . . . , 1. 3.8 Suppose on the contrary that the inequality djj ≤ 0 is true for some j. Let x be the vector such that LT x = ej , where ej is zero except for component j, which is one. In this case the relations xT Ax = xT LDLT x = eTj Dej = djj ≤ 0 are true, which contradicts the positive-definiteness of A. 3.9 This is  1  −2 3.10   −3 −4

(k)

an immediate deduction from Exercise 3.8 since akk = dkk .  0 0 0 1 0 0  0 1 0 0 0 1

3.11 Only one operation, division by lnn , is needed. If L is unit lower triangular, no operations are required. 3.12 The forward substitution phase for column n − j + 1 requires 2(j − 1) + 2(j − 2) + . . . + 2 = j(j − 1) operations. Therefore the whole forward substitution phase needs n X j=1

j(j − 1) =

1 1 (n + 1)n(n − 1) = n3 + O(n2 ) 3 3

operations in all. Both phases therefore need 1 3 4 n + n3 + O(n2 ) = n3 + O(n2 ) 3 3 operations. 3.13 2n3 . ¯U ¯ and solve Ax = b by the solution of 3.14 If we write A = L ¯ =b Ly

(C.2)

372

SOLUTIONS TO SELECTED EXERCISES

and then ¯ =y Ux ¯ U, ¯ we get then with partitioned form (3.12.3) for A, L, L11 y1 L22 y2 U22 x2 U11 x1

= = = =

(C.3)

b1 b2 − L21 y1 y2 y1 − U12 x2 ,

which is simply effected through the solution of four triangular systems and the substitution operations shown above. 4.2 This is an immediate consequence of the fact that the eigenvalues of AT A are the squares of the singular values of A. 4.3 Choosing the (1,1) pivot involves the multipliers 491 and –149 and produces the 2 × 2 system 688x2 − 464x3 = 737 −212x2 + 141x3 = −224 and choosing the (2,2) pivot involves the multiplier 0.308 and produces the 1×1 system −2x3 = 3.0. This gives the computed solution x1 = 0.0 x2 = 0.0596 x3 = −1.5. The instability was manifest in the large growth that took place when –0.0101 was used as a pivot. 4.4 (i) Using the given formulae for H and r gives the relations (A + H)ˆ x= = = =

ˆ /kˆ Aˆ x + rˆ xT x xk22 Aˆ x+r Aˆ x + b − Aˆ x b.

(ii) For the second part we find the relations kHk2 ≤

krk2 krk2 kˆ xk2 = = αkAk2 . kˆ xk22 kˆ xk2

(iii) If krk2 is small compared with kAk2 kˆ xk2 , then α is small so kHk2 is small compared with kAk2 , that is we have solved a nearby problem. Conversely, assume that the computed solution satisfies the equation (A + H)ˆ x=b

SOLUTIONS TO SELECTED EXERCISES

373

with kHk2 = αkAk2 , α 0, i = 1, 2, . . . , n then xT Ax > 0 unless αi = 0, i = 1, 2, . . . , n, that is unless x = 0. Conversely if λk < 0, we may choose x = vk and find xT Ax = λk < 0.

374

SOLUTIONS TO SELECTED EXERCISES

4.8 If A is diagonally dominant by rows, the following relations are true. (2)

|aii | −

Pn

j=2,j6=i

P (2) |aij | = |aii − ai1 a1i /a11 | − n j=2,j6=i |aij − ai1 a1j /a11 | Pn P |a | − ≥ |aii | − n ij j=2 |ai1 ||a1j |/|a11 | j=2,j6=i Pn ≥ |aii | − j=2.j6=i |aij | − |ai1 | P = |aii | − n j=1,j6=i |aij | ≥0

Thus, A(2) is diagonally dominant. Further the relations Pn

(2)

j=2

|aij | =

Pn



Pn

j=2

|aij − ai1 a1j /a11 | Pn j=2 |ai1 ||a1j |/|a11 |

j=2 |aij | + Pn ≤ j=1 |aij |

(k) all hold, which is the next result required. It follows by induction are Pn that all A Pn (k) diagonally dominant and that the inequalities j=1 |aij | are all true. j=k |aij ≤ The inequalities n n X X (k) (k) |aij | ≤ |aij | ≤ 2|aii | |aij | ≤ j=1

j=k

all hold, which is the required result. (k) For diagonal dominance by columns we remark that the numbers aij (i ≥ k, j ≥ k) generated when Gaussian elimination is applied to AT are exactly the same as the numbers generated when Gaussian elimination is applied to A (it is just that each number is stored in the transposed position). Therefore, the result is also true for matrices that are diagonally dominant by columns. 4.10 There is a nonzero vector x such that the equation (k)

(A − ak eTk )x = 0 holds. This may be rearranged to the equation x = A−1 ak eTk x. (k)

Taking norms, we find the inequality kxkp ≤ kA−1 kp kak kp kxkp , (k)

from which the result follows. 4.11 κ∞ (A) = kAk∞ kA−1 k∞ ≈ 2.58 × 1.65 ≈ 4.26, and " B = D1 A =

#

103

" A=

1

1.00 2420 1.00 1.58

#

SOLUTIONS TO SELECTED EXERCISES

375

has inverse " B−1 ≈

−0.0007

1.0007

#

0.0004 −0.0004

so the condition number is κ∞ (B) ≈ 2421 × 1.0011 ≈ 2424. 4.15 For the forward substitution " 0.287

#"

0.512 −0.001

z1

#

" =

z2

b1

#

" =

b2

±1

#

±1

we take b1 = 1 without loss of generality to give z1 ≈ 3.48 and z2 either z2+ = (1 − 0.512 × 3.48)/(−0.001) ≈ 782 or z2− = (−1 − 0.512 × 3.48)/(−0.001) ≈ 2780 and therefore choose z2− . Next, by solving the equation "

1.0 0.631 1.0

" we find x ≈

−1750

#

" x=

3.48

#

2780

# . Finally solving Ay = x yields

2780 " y≈

6.92 × 106 −3.88 × 106

# .

Thus, kA−1 k1 ≈ kyk1 /kxk1 ≈ 2380 and κ(A) = kA−1 k1 ×kAk1 ≈ 2380×0.834 ≈ 1980. 4.16 For the first problem we find the relations # # " #" # " " −0.0107 −0.863 0.001 2.42 3.57 = − r= −0.0054 1.48 1.00 1.58 1.47 and krk∞ = 0.0107. For the second problem we find the relations # # " #" # " " 0.00098 3.59 0.001 2.42 0.00215 = − r= 0.00158 −0.001 1.00 1.58 3.59 and krk∞ = 0.00158. Since in both cases krk∞ is small compared with kbk∞ , both can be considered to be solutions to nearby problems. # # " " 3.59 −0.863 . and 4.18 The rescaled solutions are −1.00 1480 For the first problem we find the relations # # " #" # " " −10.7 −0.863 1.00 2.42 3570 = − r= −0.0054 1480 1.00 0.00158 1.47

376

SOLUTIONS TO SELECTED EXERCISES

and krk∞ = 10.7. For the second problem we find the relations " r=

2.15

#

" −

3.59

1.00

2.42

#"

1.00 0.00158

3.59 −1.00

#

" =

−1.02

#

0.00158

and krk∞ = 1.02. The first krk∞ is small compared with the corresponding kbk∞ , but the second krk∞ is of comparable size with kAk∞ kˆ xk∞ and kbk∞ . Hence the first can be considered to be a solution to a nearby problem while the second cannot. 5.1 Making the substitution yields the equation 

 I

 L1 U1 A12

   A21 U−1 L−1 I 1 1   −1  A32 U−1 I  2 L2 A=  .     

. .

. . I

             

L2 U2 A23 L3 U3 A34 .

. .

. L N UN

       .      

On postmultiplying the first matrix by diag(Li ) and pre-multiplying the second by diag (L−1 i ), we find the required factorization     L1   A21 U−1 L2 1    A32 U−1 L3  2 A=  . .    .  

. . LN

−1   U1 L1 A12   U2 L−1 2 A23    U3 L−1  3 A34   . .    .  

       .     .  

UN

5.2 The form (3.2.3) is suitable if access to U is by rows. With access by columns, we use the steps, for k = n, n − 1, . . . , 1: xk = ck /ukk and ci := ci − uik xk , i = 1, 2, . . . , k − 1. For forward substitution, the form (3.2.4) is suitable if L is stored by rows. With L stored by columns, we use the steps, for k = 1, 2, . . . , n: ck = bk /lkk

SOLUTIONS TO SELECTED EXERCISES

377

and bi := bi − lik ck , i = k + 1, k + 2, . . . , n. 5.4 We need the assumption that there is limited fill-in, say that the number of entries in the rows of U averages less than c (that is, that the number of fill-ins does not exceed the number of entries eliminated). The number of operations at a single step of the elimination then averages less than 2c2 , and the total number of operations is less than 2c2 n. 5.6 When a model is first put together, the goal is to make sure it represents reality. A trigger event of multiple sizes and in multiple locations in the model would correspond to different right-hand sides with different entries. The modeller can design and experiment and then use the model to understand its various responses. Doing these all at once, rather than one at a time, both creates faster solutions and aids in the understanding of the behaviour of the model. 6.1 Reversing the order of the rows and columns has the desired effect. This corresponds to   1   1     .   . . P=Q=     .     1 1 6.3 

×  |   |  ×    

− × × −

−− × × −× | × ×

−××



    ×    ×  − × × ×

The paths taken by the algorithm are shown in the figure above and indicate that columns 1, 4, 6, and 7 have all their entries in only 3 rows and are therefore linearly dependent. 6.4 Let m = n/3. The first 2m steps are trivial because the first 2m diagonal entries are already nonzero. If each column is always searched in the natural order and no explicit permutations are performed, then for each of the last m steps all the nonzeros in the block F must be accessed before a new assignment can be made. Since there are m2 nonzeros in F, the transversal algorithm will require m × m2 = O(n3 ) operations. The reordered system requires O(n2 ) operations only.

378

SOLUTIONS TO SELECTED EXERCISES

6.5 Using the notation of Exercise 6.4, the matrix 

 F 0 F  0 I I  0 F 0 has columns ordered, but requires O(n3 ) operations. 6.6 Suppose the algorithm makes k assignments. Following this, none of the columns k + 1 to n can have an entry in rows k + 1 to n and the matrix has the block form 

AB C 0



where A is square and of order k. Such a matrix has rank at most 2k, so is singular if k < n/2. It follows that for a nonsingular matrix, the inequality k ≥ n/2 holds. 6.11

Stack

                      

Step

3 2 2 1 1 1 1 2 3

4 3 2 1 4

5 4 3 2 1 5

6 5 4 3 2 1 6

7 6 5 4 3 2 1 7

81 7 6 5 4 3 2 1 8

8 8 8 8 8 8 8 71 7 7 7 7 7 7 6 61 6 6 6 6 6 5 5 51 5 5 5 5 4 4 4 41 4 4 4 3 3 3 3 31 3 3 2 2 2 2 2 21 2 1 1 1 1 1 1 1 9 10 11 12 13 14 15

The excessive relabelling is completely avoided.   01 6.12 The matrix cannot be reduced by a symmetric permutation, but reordering 11   10 the columns gives the trivially reducible matrix . 11 6.14 If C is changed to the matrix with entries P Pn |ai,j |, the maximization of n |(PA) | is equivalent to the maximization of ii i=1 i=1 |(PC)ii | or the minimization P of n i=1 |(PB)ii | and the same algorithm is applicable. 6.15 A suitable choice for C is the matrix with entries |ai,j |. The objective function needs to be changed to mini (PC)ii . We can retain maximization by making B the same as C. A similar structured search can be used, discarding any partial path whose objective function is less than that of the best full path found so far. 7.1 A simple example is the following pattern: ×× ××× ××× 7.2 Any leaf node has minimum degree (unity). Eliminating the corresponding variable gives no fill-in and the graph of the remaining submatrix is again a tree; it is the original

SOLUTIONS TO SELECTED EXERCISES

379

tree less the chosen terminal node and its connecting edge. The root may also have degree unity and in this case it may be eliminated without fill-in and the new graph will again be a tree. Interior nodes must have degree at least two and so are not chosen. Thus minimum degree yields a subtree at every stage and there is no fill-in. 7.3 Since ck is calculated from the formula lkk ck = bk −

k−1 X

lik ck , k = 1, 2, . . . ,

i=1

we find successively c1 = 0, c2 = 0, . . . until bk 6= 0 is encountered. 7.4 Since A is structurally irreducible, its digraph contains a path from any node to any other. If such a path passes through node 1, say i → 1 → j, there are nonzeros in (2) the positions ai1 and a1j and so after the first elimination there will be a nonzero aij , which means that the corresponding new path will have a direct link i → j. Thus, the active part of A(2) is irreducible. Similarly, the active path of each successive A(k) is irreducible. If row k of U has no off-diagonal nonzeros, the active part of A(k) must have its first row zero except for the first entry, that is A(k) must be reducible. This contradicts what we have just proved. 7.6 Nested dissection essentially orders interior, separated nodes first. Once ties are encountered in minimum degree with nested dissection as its given order, the algorithm is biased toward these same entries. Hence it is not surprising to find that minimum degree has similar results. A pagewise given ordering biases the algorithm to sweep through the grid row by row. Pagewise ordering of the variables is the equivalent of a banded method with bandwidth N , yielding a significant addition of fill-ins compared with nested dissection. In a similar way, the spiral-in given ordering biases toward the edges similarly to pagewise, whereas the spiral-out biases to pivot selection on the interior, closer to nested dissection. 8.2 By hypothesis, a21 and a12 are nonzero. Therefore, entries l21 and u12 of the LU factors of A are nonzero. Assume inductively that li,i−1 and ui−1,i are nonzero for i = 2, 3, ..., k − 1. If akj is the first nonzero in row k, elimination of this will produce a fill-in in position (k, j + 1) if this is zero since uj,j+1 is nonzero. Elimination of the resulting entry will fill in position (k, j + 2) if it is zero because uj+1,j+2 is nonzero. Continuing, we find that the whole of row k from its first nonzero fills in. A similar argument shows that column k fills in from its first nonzero. Thus, our hypothesis is true also for k and the result is proved. 8.3 Consider a node i in level set Sk . It is in Sk and in its position within it because of being a neighbour of a node of Sk−1 , say node j, and not of any node in an earlier level set or earlier in Sk−1 . The first nonzero of row i is thus aij . A node ` after node i will be in its position because of being a neighbour of node j or a node after node j. Therefore the first nonzero in row ` will be in a column after j. Thus the leading nonzeros of the rows are in columns that form a monotone increasing sequence. 8.4 The last node numbered is always a leaf node, hence will cause no fill-in when eliminated first. Remove this node and edge. The second last numbered node is a leaf node of the resulting tree. We continue similarly to establish the result.

380

SOLUTIONS TO SELECTED EXERCISES

8.5 The level sets on successive passes are: (i) (ii) (iii) (iv) (v)

(10), (7,8,9,11,12,13), (4,5,6,14,15,16), (1,2,3,17,18,19) (1), (2,3,4), (5,6,7), (8,9,10), (11,12,13), (14,15,16), (17,18,19) (19), (18,17,16), (15,14,13), (12,11,10), (9,8,7), (6,5,4), (3,2,1) (18), (19,16,15), (17,14,13,12), (11,10,9), (8,7,6), (5,4,3), (2,1) (17), (19,16,14), (18,15,13,11), (12,10,8),(9,7,5), (6,4,2), (3,1)

8.7 If the first step involves exchanging rows 1 and k then after the exchange row 1 will extend to column ku + k. After the first set of eliminations rows 1 to m + 1 may extend to column m + k too. Thus, there are no fill-ins outside the band form in the lower triangle and none outside the extended band in the upper triangle. Similarly, at later stages the interchange and elimination cannot introduce fill-ins outside the new form. 9.1 Figure 9.2.2 is a matrix with the structure  A11 0 0 0  0 A22 0 0   0 0 A33 0   0 0 0 A44 A51 A52 A53 A54

 A15 A25   A35   A45  A55

where all the blocks on the diagonal are square. When applying Gaussian elimination, none of the blocks A22 , A33 , and A44 is altered by prior operations. Therefore, the factorizations Aii = Lii Uii , i = 1, . . . 4 can proceed in parallel. 9.2 The asymptotic performance of nested dissection was defined in terms of its order of magnitude. No claim was made that other orderings could not have the same order of magnitude asymptotic performance. The spiral-out ordering appears to have the same asymptotic performance for this particular problem, but with a smaller multiplier. 9.4 We can take advantage of the structure of the problem for the case of the righthand side of Figure 9.7.1 (vertex separators) by structuring the matrix for B in block form as   B11 B12 B= B21 B22 where B22 represents those variables in B that are connected to S. If we create a similar block structure for W , the overall matrix for the problem looks like this   B11 B12  B21 B22 A25     . W11 W12 A=   W21 W22 A45  A52 A54 S If we had factorized the matrices B and W previously, these factors can be reused in the factorization of A. For the case on the left side of the Figure 9.7.1 (edge separators), we have the matrix

SOLUTIONS TO SELECTED EXERCISES

381



 B11 B12  B21 B22 A24   A=  W11 W12  A42 W21 W22

and the leading parts of the factorizations can be reused. We explore another option in Chapter 15 where we can remove the edges connecting the two independent subproblems, solve the independent problems, and then bring them back using the matrix modification formula. 9.5 The separator set in Figure E9.2 has only one element. In block form, the matrix represented by the connector variables is only 1×1 compared with 4×4 using the separator set in Figure E9.1. While the block corresponding to the variables on the left side will be slightly larger, we would expect less overall computation based on the reduction of the size of the separator block. This trade-off becomes more problematic the further you move from the centre to find a small separator set, however, due to the increase in the size of the separated block. 9.6 Using the same algorithm we described to find the centre separator, test the adjacent level sets on either side of the centre to see if the level set is smaller in size than the centre set. 9.11 Let B = AAT . We see that bi,j is the inner product of rows i and j of A. By definition, bi,j 6= 0 if and only if there is at least one column (say k) with entries in both row i and row j. This corresponds to the definition of the row graph of A. 9.13 First, dissection methods create the opportunity for parallelism. At the extreme of one-dimensional problems, dissection methods require log n steps while other methods require n steps, since the problem is fundamental sequential for banded or minimum degree. Secondly, it may also be advantageous to use a dissection scheme in the case of parametric studies, if the ‘centre’ part of the problem changes from case to case, but the rest of the problem remains fixed. Thirdly, if the ‘one-dimensional’ model becomes a part of a much larger problem, and if the connections to other parts of the problem are only at the ‘ends’, then we may want to reduce the one-dimensional problem to a Schur complement corresponding to the variables at either end. This would require ordering the two ends last, and introduce fill-in to the factorization. 10.1 ! ! ! !

Row counts are in array lenrow. Link together rows with the same number of nonzeros. head(k) is the first row with k nonzeros. link(i) is the row after row i, or zero if i is last. head(1:n) = 0 do i = 1,n lenr = lenrow(i) j = head(lenr) head(lenr) = i link(i) = j end do

382

SOLUTIONS TO SELECTED EXERCISES

! Loop through the rows in order of increasing row count, ! storing the position of row i in the ordered sequence. k = 1 do lenr = 1,n i = head(lenr) do while (i>0) next = link(i) position(i) = k k = k + 1 i = next end do end do

10.2 The rows and columns with i nonzeros are treated as a single linked list with a single header. To indicate the switch from rows to columns or vice-versa, a link is negated whenever it points from a column to a row or vice-versa. Similarly, the header is negated if it points to a column (this happens only if there are no rows with a particular number of nonzeros). 10.3 Let i be the index of the row being moved and let head, linkfd, and linkbk be the arrays holding the headers and forward and backward links, respectively. ! Remove from chain next = linkfd(i) last = linkbk(i) if (next/=0) linkbk(next) = last if (last==0) head(tau) = next if (last/=0) linkfd(last) = next ! Add to new chain next = head(tau1) head(tau1) = i linkbk(i) = 0 linkfd(i) = next if(next/=0)linkbk(next) = i 10.4 The matrix with the following pattern ××× ××× ×××××× ×× ××××× ××××× has lowest Markowitz count 4, which is attained by the entry (1,1). Entries in row 4 have row count 2 but Markowitz count 5.

SOLUTIONS TO SELECTED EXERCISES

383

10.5 The 5 by 5 matrix



1 1

    

1 9

9

9

1 9

9 9 9 9 1

     

has diagonal elements, which are unacceptable on stability grounds if u > 19 , where u is the threshold in inequality (10.2.2). Rows 1 to 4 and columns 1 to 4 will all be searched before one of the off-diagonal entries is accepted as the first pivot. 10.6

! Place column k of A in array w and make a ! linked list of its row indices head = n+1 do kk = a_col_start(k+1)-1,a_col_start(k),-1 i = row_index(kk) w(i) = value(kk) link(i) = head head = i end do ! Scan linked list adding multiples of appropriate ! columns of U to column k of A. The list is ordered. do while (ij (xji yi

P

+

i>j

xij xji yi yj

x2ij yj2

+ xij yj )

+ 2xij xji yi yj ) 2

2 2 i=1 xii yi

≥ 0. It follows that if no column of X is identically zero, no xii is zero and yT Qy 6= 0 unless y = 0. Therefore Q is positive definite. 15.13 ××××××× × × × × × ×

REFERENCES Agullo, E., Buttari, A., Guermouche, A., and Lopez, F. (2013), ‘Multifrontal QR factorization for multicore architectures over runtime systems’. In: Proceedings of Euro-Par 2013 Parallel Processing. Springer-Verlag, Berlin, pp. 521–532. Aho, A. V., Hopcroft, J. E., and Ullman, J. D. (1974), The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, Boston. Amestoy, P., Ashcraft, C., Boiteau, O., Buttari, A., L’Excellent, J.-Y., and Weisbecker, C. (2015a), ‘Improving multifrontal methods by means of block low-rank representations’, SIAM Journal on Scientific Computing 37(3), A1451–A1474. Amestoy, P., Duff, I. S., Guermouche, A., and Slavova, Tz. (2010), ‘Analysis of the solution phase of a parallel multifrontal approach’, Parallel Computing 36(1), 3–15. Amestoy, P. R. and Puglisi, C. (2002), ‘An unsymmetrized multifrontal LU factorization’, SIAM Journal on Matrix Analysis and Applications 24, 553– 569. Amestoy, P. R., Davis, T. A., and Duff, I. S. (1996a), ‘An approximate minimum degree ordering algorithm’, SIAM Journal on Matrix Analysis and Applications 17(4), 886–905. Amestoy, P. R., Duff, I. S., and Puglisi, C. (1996b), ‘Multifrontal QR factorization in a multiprocessor environment’, Numerical Linear Algebra with Applications 3(4), 275–300. Amestoy, P. R., Duff, I. S., Giraud, L., L’Excellent, J.-Y., and Puglisi, C. (2004), ‘GRID-TLSE: A Web Site for Experimenting with Sparse Direct Solvers on a Computational Grid’. In: SIAM Conference on Parallel Processing for Scientific Computing 2004 (PP04), San Francisco, California. Amestoy, P. R., Duff, I. S., L’Excellent, J.-Y., and Koster, J. (2001), ‘A fully asynchronous multifrontal solver using distributed dynamic scheduling’, SIAM Journal on Matrix Analysis and Applications 23(1), 15–41. Amestoy, P. R., Duff, I. S., L’Excellent, J.-Y., and Rouet, F.-H. (2015b), ‘Parallel computation of entries of A−1 ’, SIAM Journal on Scientific Computing 37(2), C268 – C284. Amestoy, P. R., Duff, I. S., L’Excellent, J.-Y., Robert, Y., Rouet, F.-H., and U¸car, B. (2012), ‘On computing inverse entries of a sparse matrix in an outof-core environment’, SIAM Journal on Scientific Computing 34(4), A1975 – A1999. Amestoy, P. R., Guermouche, A., L’Excellent, J.-Y., and Pralet, S. (2006), ‘Hybrid scheduling for the parallel solution of linear systems’, Parallel

REFERENCES

391

Computing 32(2), 136–156. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. (1999), LAPACK Users’ Guide, 3rd edn. SIAM Press, Philadelphia. Arioli, M. and Duff, I. S. (2008), ‘Using FGMRES to obtain backward stability in mixed precision’, Electronic Transactions on Numerical Analysis 33, 31–44. Special issue commemorating 60th anniversary of Gerard Meurant. Arioli, M., Demmel, J. W., and Duff, I. S. (1989), ‘Solving sparse linear systems with sparse backward error’, SIAM Journal on Matrix Analysis and Applications 10, 165–190. Arioli, M., Duff, I. S., Gratton, S., and Pralet, S. (2007), ‘A note on GMRES preconditioned by a perturbed LDLT decomposition with static pivoting’, SIAM Journal on Scientific Computing 29(5), 2024–2044. Ashcraft, C. and Grimes, R. (1999), ‘SPOOLES: an object-oriented sparse matrix library’. In: Proceedings of the Ninth SIAM Conference on Parallel Processing 1999. SIAM Press, Philadelphia. Ashcraft, C. and Liu, J. W. H. (1998), ‘Robust ordering of sparse matrices using multisection’, SIAM Journal on Matrix Analysis and Applications 19, 816– 832. Ashcraft, C., Eisenstat, S. C., and Liu, J. W. H. (1999), ‘A fan-in algorithm for distributed sparse numerical factorization’, SIAM Journal on Scientific and Statistical Computing 11, 593–599. Aspen Technology Inc. (1995), SPEEDUP, User Manual, Library Manual, Cambridge, Massachusetts, USA. ¨ V. (2004), ‘Permuting sparse Aykanat, C., Pinar, A., and C ¸ ataly¨ urek, U. rectangular matrices into block-diagonal form’, SIAM Journal on Scientific Computing 25(6), 1860–1879. Barnard, S. T. and Simon, H. D. (1994), ‘A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems’, Concurrency: Practice and Experience 6(2), 101–117. Barnard, S. T., Pothen, A., and Simon, H. (1995), ‘A spectral algorithm for envelope reduction of sparse matrices’, Numerical Linear Algebra with Applications 2(4), 317–334. Barwell, V. and George, A. (1976), ‘A comparison of algorithms for solving symmetric indefinite systems of linear equations’, ACM Transations on Mathematical Software 2, 242–251. Bauer, F. L. (1963), ‘Optimally scaled matrices’, Numerische Mathematik 5, 73– 87. Bebendorf, M. (2008), Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value Problems, Lecture Notes in Computational Science and Engineering. Springer-Verlag, Berlin. Berge, C. (1957), ‘Two theorems in graph theory’, Proceedings of the National Academy of Sciences of the United States of America 43(9), 842–844. Bhat, M. V., Habashi, W. G., Liu, J. W. H., Nguyen, V. N., and Peeters, M. F.

392

REFERENCES

(1993), ‘A note on nested dissection for rectangular grids’, SIAM Journal on Matrix Analysis and Applications 14, 253–258. Bj¨orck, ˚ A. (1987), ‘Stability analysis of the method of seminormal equations for least squares problems’, Linear Algebra and its Applications 88/89, 31–48. Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. (1997), ScaLAPACK Users’ Guide. SIAM Press, Philadelphia. Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., and Whaley, R. C. (2002), ‘An updated set of basic linear algebra subprograms (BLAS)’, ACM Transations on Mathematical Software 28(2), 135–151. Bondy, A. and Murty, U. (2008), Graph Theory. Springer-Verlag, Berlin. Bunch, J. R. (1974), ‘Partial pivoting strategies for symmetric matrices’, SIAM Journal on Numerical Analysis 11, 521–528. Bunch, J. R. and Parlett, B. N. (1971), ‘Direct methods for solving symmetric indefinite systems of linear equations’, SIAM Journal on Numerical Analysis 8, 639–655. Bunch, J. R., Kaufman, L., and Parlett, B. N. (1976), ‘Decomposition of a symmetric matrix’, Numerische Mathematik 27, 95–110. Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. (2009), ‘A class of parallel tiled linear algebra algorithms for multicore architectures’, Parallel Computing 35(6–8), 38–53. Calahan, D. A. (1982), ‘Vectorized direct solvers for 2-D grids’. In: Proceedings of the 6th Symposium on Reservoir Simulation, New Orleans, Feb. 1–2, 1982. Paper SPE 10522. Campbell, Y. E. and Davis, T. A. (1995), Computing the sparse inverse subset: an inverse multifrontal approach, Technical Report TR-95-021. Computer and Information Sciences Department, University of Florida. Carvalho, L. M., Giraud, L., and Meurant, G. (2001), ‘Local preconditioners for two-level non-overlapping domain decomposition methods’, Numerical Linear Algebra with Applications 8, 201–227. Cataly¨ urek, U. V. and Aykanat, C. (1999), ‘Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication’, IEEE Transactions on Parallel and Distributed Systems 10(7), 673–693. Chan, T. F. and Mathew, T. P. (1994), Domain Decomposition Algorithms, Vol. 3 of Acta Numerica. Cambridge University Press, Cambridge, pp. 61–143. Chang, A. (1969), ‘Application of sparse matrix methods in electric power system analysis’. In: R. A. Willoughby (ed), Sparse Matrix Proceedings, Vol. RA1(#11707). IBM T. J. Watson Research Center, Yorktown Heights, NY, pp. 113–121. Chevalier, C. and Pellegrini, F. (2008), ‘PT-Scotch: a tool for efficient parallel graph ordering’, Parallel Computing 34(6-8), 318–331.

Chu, E., George, A., Liu, J. W. H., and Ng, E. G. (1984), Waterloo sparse matrix package user’s guide for SPARSPAK-A, Technical Report CS-84-36. University of Waterloo, Canada. Cline, A. K. and Rew, R. K. (1983), ‘A set of counter examples to three condition number estimators’, SIAM Journal on Scientific and Statistical Computing 4, 602–611. Cline, A. K., Conn, A. R., and Van Loan, C. F. (1982), ‘Generating the LINPACK condition estimator’. In: J. P. Hennart (ed), Numerical Analysis. Springer-Verlag, Berlin, pp. 73–83. Cline, A. K., Moler, C. B., Stewart, G. W., and Wilkinson, J. H. (1979), ‘An estimate for the condition number of a matrix’, SIAM Journal on Numerical Analysis 16, 368–375. Coleman, T. F. (1984), Large Sparse Numerical Optimization. Springer-Verlag, Berlin. Coleman, T. F. and Mor´e, J. J. (1983), ‘Estimation of sparse Jacobian matrices and graph coloring problems’, SIAM Journal on Numerical Analysis 20, 187–209. Coleman, T. F. and Mor´e, J. J. (1984), ‘Estimation of sparse Hessian matrices and graph coloring problems’, Mathematical Programming 28, 243–270. Coleman, T. F., Edenbrandt, A., and Gilbert, J. R. (1986), ‘Predicting fill for sparse orthogonal factorization’, Journal of the ACM 33, 517–532. Coleman, T. F., Garbow, B. S., and Mor´e, J. J. (1984a), ‘Software for estimating sparse Hessian matrices’, ACM Transations on Mathematical Software 11, 363–378. Coleman, T. F., Garbow, B. S., and Mor´e, J. J. (1984b), ‘Software for estimating sparse Jacobian matrices’, ACM Transations on Mathematical Software 10, 329–347. Collins, R. J. (1973), ‘Bandwidth reduction by automatic renumbering’, International Journal of Numerical Methods in Engineering 6, 345–356. Curtis, A. R. and Reid, J. K. (1971), ‘The solution of large sparse unsymmetric systems of linear equations’, Journal of the Institute of Mathematics and its Applications 8, 344–353. Curtis, A. R. and Reid, J. K. (1972), ‘On the automatic scaling of matrices for Gaussian elimination’, Journal of the Institute of Mathematics and its Applications 10, 118–124. Curtis, A. R., Powell, M. J. D., and Reid, J. K. (1974), ‘On the estimation of sparse Jacobian matrices’, Journal of the Institute of Mathematics and its Applications 13, 117–120. Cuthill, E. and McKee, J. (1969), ‘Reducing the bandwidth of sparse symmetric matrices’. In: Proceedings 24th National Conference of the Association for Computing Machinery. ACM, New York, pp. 157–172. Davis, T. A. (2004), ‘A column pre-ordering strategy for the unsymmetricpattern multifrontal method’, ACM Transations on Mathematical Software 30(2), 165–195.

Davis, T. A. (2011), ‘Algorithm 915, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse QR factorization’, ACM Transations on Mathematical Software 38(1), 8:1–8:22. Davis, T. A. and Duff, I. S. (1997), ‘An unsymmetric-pattern multifrontal method for sparse LU factorization’, SIAM Journal on Matrix Analysis and Applications 18(1), 140–158. Davis, T. A. and Hu, Y. (2011), ‘The University of Florida sparse matrix collection’, ACM Transations on Mathematical Software 38(1), 1–25. Davis, T. A. and Yew, P. C. (1990), ‘A nondeterministic parallel algorithm for general unsymmetric sparse LU factorization’, SIAM Journal on Matrix Analysis and Applications 11, 383–402. Davis, T. A., Gilbert, J. R., Larimore, S. I., and Ng, E. G. (2004a), ‘Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm’, ACM Transations on Mathematical Software 30(3), 377–380. Davis, T. A., Gilbert, J. R., Larimore, S. I., and Ng, E. G. (2004b), ‘A column approximate minimum degree ordering algorithm’, ACM Transations on Mathematical Software 30(3), 353–376. Dembart, B. and Erisman, A. M. (1973), ‘Hybrid sparse matrix methods’, IEEE Transactions on Circuit Theory 20, 641–649. Demmel, J. W., Eisenstat, S. C., Gilbert, J. R., Li, X. S., and Liu, J. W. H. (1999), ‘A supernodal approach to sparse partial pivoting’, SIAM Journal on Matrix Analysis and Applications 20, 720–755. Dennis Jr., J. E. and Schnabel, R. B. (1983), Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Upper Saddle River. Devine, K., Boman, E., Heaphy, R., Hendrickson, B., and Vaughan, C. (2002), ‘Zoltan data management services for parallel dynamic applications’, Computing in Science and Engineering 4(2), 90–97. Dijkstra, E. W. (1959), ‘A note on two problems connected with graphs’, Numerische Mathematik 1, 269–271. Dodds, R. H. and Lopez, L. A. (1980), ‘Substructuring in linear and nonlinear analysis’, International Journal of Numerical Methods in Engineering 15, 583–597. Dolan, E. D. and Mor´e, J. J. (2002), ‘Benchmarking optimization software with performance profiles’, Mathematical Programming 91(2), 201–213. Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. (1979), LINPACK User’s Guide. SIAM Press, Philadelphia. Dongarra, J. J., Du Croz, J., Duff, I. S., and Hammarling, S. (1990), ‘A set of Level 3 Basic Linear Algebra Subprograms.’, ACM Transations on Mathematical Software 16, 1–17. Dongarra, J. J., Du Croz, J. J., Hammarling, S., and Hanson, R. J. (1988), ‘An extented set of Fortran Basic Linear Algebra Subprograms’, ACM Transations on Mathematical Software 14, 1–17. Dongarra, J. J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A. (1998),

Numerical Linear Algebra for High-Performance Computers. SIAM Press, Philadelphia. Drummond, L. A., Duff, I. S., Guivarch, R., Ruiz, D., and Zenadi, M. (2015), ‘Partitioning strategies for the block Cimmino algorithm’, Journal of Engineering Mathematics 91(2), 21–39. Duff, I. S. (1972), ‘Analysis of Sparse Systems’, D.Phil Thesis, Oxford University, UK. Duff, I. S. (1977), ‘On permutations to block triangular form’, Journal of the Institute of Mathematics and its Applications 19, 339–342. Duff, I. S. (1979), ‘Practical comparisons of codes for the solution of sparse linear systems’. In: I. S. Duff and G. W. Stewart (eds), Sparse Matrix Proceedings, 1978. SIAM Press, Philadelphia, pp. 107–134. Duff, I. S. (1981a), ‘Algorithm 575: Permutations for a zero-free diagonal [F1]’, ACM Transations on Mathematical Software 7(3), 387–390. Duff, I. S. (1981b), MA32 – A package for solving sparse unsymmetric systems using the frontal method, Technical Report AERE R10079. Her Majesty’s Stationery Office, London. Duff, I. S. (1981c), ‘On algorithms for obtaining a maximum transversal’, ACM Transations on Mathematical Software 7(3), 315–330. Duff, I. S. (1983), Enhancements to the MA32 package for solving sparse unsymmetric equations, Technical Report AERE R11009. Her Majesty’s Stationery Office, London. Duff, I. S. (1984a), ‘Design features of a frontal code for solving sparse unsymmetric linear systems out-of-core’, SIAM Journal on Scientific and Statistical Computing 5, 270–280. Duff, I. S. (1984b), ‘The solution of nearly symmetric sparse linear systems’. In: R. Glowinski and J.-L. Lions (eds), Computing Methods in Applied Sciences and Engineering, VI. North Holland, Amsterdam, pp. 57–74. Duff, I. S. (1984c), ‘The solution of sparse linear equations on the CRAY−1’. In: J. S. Kowalik (ed), Proceedings of the NATO Advanced Research Workshop on High-Speed Computation, held at J¨ ulich, Germany, June 20–22, 1983, NATO ASI series. Series F, Computer and Systems Sciences, Vol. 7. Springer-Verlag, Berlin, pp. 293–309. Duff, I. S. (1986), ‘Parallel implementation of multifrontal schemes’, Parallel Computing 3, 193–204. Duff, I. S. (1989), ‘Multiprocessing a sparse matrix code on the Alliant FX/8’, Journal of Computational and Applied Mathematics 27, 229–239. Duff, I. S. (2004), ‘MA57—a new code for the solution of sparse symmetric definite and indefinite systems’, ACM Transations on Mathematical Software 30(2), 117–144. Duff, I. S. (2007), ‘Developments in matching and scaling algorithms’, Proceedings in Applied Mathematics and Mechanics 7(1), 1010801–1010802. Proceedings of Sixth International Congress on Industrial and Applied Mathematics (ICIAM07) and GAMM Annual Meeting, Zurich 2007.

Duff, I. S. and Koster, J. (1999), ‘The design and use of algorithms for permuting large entries to the diagonal of sparse matrices’, SIAM Journal on Matrix Analysis and Applications 20(4), 889–901. Duff, I. S. and Koster, J. (2001), ‘On algorithms for permuting large entries to the diagonal of a sparse matrix’, SIAM Journal on Matrix Analysis and Applications 22(4), 973–996. Duff, I. S. and Nowak, U. (1987), ‘On sparse solvers in a stiff integrator of extrapolation type’, IMA Journal of Numerical Analysis 7, 391–405. Duff, I. S. and Pralet, S. (2004), Experiments in preprocessing and scaling symmetric problems for multifrontal solutions, Technical Report WN/PA/04/17. CERFACS, Toulouse, France. Duff, I. S. and Pralet, S. (2005), ‘Strategies for scaling and pivoting for sparse symmetric indefinite problems’, SIAM Journal on Matrix Analysis and Applications 27(2), 313–340. Duff, I. S. and Reid, J. K. (1974), ‘A comparison of sparsity orderings for obtaining a pivotal sequence in Gaussian elimination’, Journal of the Institute of Mathematics and its Applications 14, 281–291. Duff, I. S. and Reid, J. K. (1975), ‘On the reduction of sparse matrices to condensed forms by similarity transformations’, Journal of the Institute of Mathematics and its Applications 15, 217–224. Duff, I. S. and Reid, J. K. (1976), ‘A comparison of some methods for the solution of sparse overdetermined systems of linear equations’, Journal of the Institute of Mathematics and its Applications 17, 267–280. Duff, I. S. and Reid, J. K. (1978a), ‘Algorithm 529: permutations to block triangular form [F1]’, ACM Transations on Mathematical Software 4(2), 189–192. Duff, I. S. and Reid, J. K. (1978b), ‘An implementation of Tarjan’s algorithm for the block triangularization of a matrix’, ACM Transations on Mathematical Software 4(2), 137–147. Duff, I. S. and Reid, J. K. (1979), ‘Performance evaluation of codes for sparse matrix problems’. In: L. D. Fosdick (ed), Performance Evaluation of Numerical Software: Proceedings of the IFIP TC 2.5 Working Conference on Performance Evaluation of Numerical Software. North Holland, Amsterdam, pp. 121–135. Duff, I. S. and Reid, J. K. (1982), MA27 – A set of Fortran subroutines for solving sparse symmetric sets of linear equations, Technical Report AERE R10533. Her Majesty’s Stationery Office, London. Duff, I. S. and Reid, J. K. (1983), ‘The multifrontal solution of indefinite sparse symmetric linear systems’, ACM Transations on Mathematical Software 9, 302–325. Duff, I. S. and Reid, J. K. (1996a), ‘The design of MA48, a code for the direct solution of sparse unsymmetric linear systems of equations’, ACM Transations on Mathematical Software 22(2), 187–226. Duff, I. S. and Reid, J. K. (1996b), ‘Exploiting zeros on the diagonal in the direct

solution of indefinite sparse symmetric linear systems’, ACM Transations on Mathematical Software 22(2), 227–257. Duff, I. S. and Scott, J. A. (1996), ‘The design of a new frontal code for solving sparse unsymmetric systems’, ACM Transations on Mathematical Software 22(1), 30–45. Duff, I. S. and Scott, J. A. (1999), ‘A frontal code for the solution of sparse positive-definite symmetric systems arising from finite-element applications’, ACM Transations on Mathematical Software 25(4), 404–424. Duff, I. S. and Scott, J. A. (2004), ‘A parallel direct solver for large highly unsymmetric linear systems’, ACM Transations on Mathematical Software 30(2), 95–117. Duff, I. S. and U¸car, B. (2010), ‘On the block triangular form of symmetric matrices’, SIAM Review 52(3), 455–470. Duff, I. S., Erisman, A. M., and Reid, J. K. (1976), ‘On George’s nested dissection method’, SIAM Journal on Numerical Analysis 13, 686–695. Duff, I. S., Erisman, A. M., and Reid, J. K. (1986), Direct Methods for Sparse Matrices. Oxford University Press, Oxford. Duff, I. S., Erisman, A. M., and Reid, J. K. (2016), The Hellerman-Rarick algorithm., Technical report. Rutherford Appleton Laboratory, Oxfordshire, UK (in press). Duff, I. S., Erisman, A. M., Gear, C. W., and Reid, J. K. (1988), ‘Sparsity structure and Gaussian elimination’, SIGNUM Newsletter 23(2), 2–8. Duff, I. S., Grimes, R. G., and Lewis, J. G. (1989), ‘Sparse matrix test problems’, ACM Transations on Mathematical Software 15(1), 1–14. Duff, I. S., Grimes, R. G., and Lewis, J. G. (1997), The Rutherford-Boeing Sparse Matrix Collection, Technical Report RAL-TR-97-031. Rutherford Appleton Laboratory, Oxfordshire, UK. Also Technical Report ISSTECH-97-017 from Boeing Information & Support Services, Seattle and Report TR/PA/97/36 from CERFACS, Toulouse. Duff, I. S., Guivarch, R., Ruiz, D., and Zenadi, M. (2015), ‘The Augmented Block Cimmino Distributed method’, SIAM Journal on Scientific Computing 37(3), A1248–A1269. Duff, I. S., Heroux, M. A., and Pozo, R. (2002), ‘An overview of the Sparse Basic Linear Algebra Subprograms: the new standard from the BLAS Technical Forum’, ACM Transations on Mathematical Software 28(2), 239–267. Duff, I. S., Kaya, K., and U¸car, B. (2011), ‘Design, implementation, and analysis of maximum transversal algorithms’, ACM Transations on Mathematical Software 38(2), 13, 1–31. Duff, I. S., Reid, J. K., Munksgaard, N., and Neilsen, H. B. (1979), ‘Direct solution of sets of linear equations whose matrix is sparse, symmetric and indefinite’, Journal of the Institute of Mathematics and its Applications 23, 235–250. Dulmage, A. L. and Mendelsohn, N. S. (1959), ‘A structure theory of bipartite graphs of finite exterior dimension’, Transations of the Royal Society of

Canada Section III 53, 1–13. Eisenstat, S. C. and Liu, J. W. H. (1993), ‘Exploiting structural symmetry in a sparse partial pivoting code’, SIAM Journal on Scientific and Statistical Computing 14, 253–257. Eisenstat, S. C. and Liu, J. W. H. (2005), ‘The theory of elimination trees for sparse unsymmetric matrices’, SIAM Journal on Matrix Analysis and Applications 26, 686–705. Eisenstat, S. C., Gursky, M. C., Schultz, M. H., and Sherman, A. H. (1982), ‘Yale sparse matrix package, I: The symmetric codes’, International Journal of Numerical Methods in Engineering 18, 1145–1151. Erisman, A. M. (1972), ‘Sparse matrix approach to the frequency domain analysis of linear passive electrical networks’. In: D. J. Rose and R. A. Willoughby (eds), Sparse Matrices and their Applications. Plenum Press, New York, pp. 31–40. Erisman, A. M. (1973), ‘Decomposition methods using sparse matrix techniques with application to certain electrical network problems’. In: D. M. Himmelblau (ed), Decomposition of Large-scale Problems. North Holland, Amsterdam, pp. 69–80. Erisman, A. M. and Reid, J. K. (1974), ‘Monitoring the stability of the triangular factorization of a sparse matrix’, Numerische Mathematik 22, 183–186. Erisman, A. M. and Spies, G. E. (1972), ‘Exploiting problem characteristics in the sparse matrix approach to frequency domain analysis’, IEEE Transactions on Circuit Theory 19, 260–269. Erisman, A. M. and Tinney, W. F. (1975), ‘On computing certain elements of the inverse of a sparse matrix’, Communications of the ACM 18, 177–179. Erisman, A. M., Grimes, R. G., Lewis, J. G., and Poole Jr., W. G. (1985), ‘A structurally stable modification of Hellerman-Rarick’s P4 algorithm for reordering unsymmetric sparse matrices’, SIAM Journal on Numerical Analysis 22, 369–385. Erisman, A. M., Grimes, R. G., Lewis, J. G., Poole Jr., W. G., and Simon, H. D. (1987), ‘Evaluation of orderings for unsymmetric sparse matrices’, SIAM Journal on Scientific and Statistical Computing 7, 600–624. Eskow, E. and Schnabel, R. B. (1991), ‘Algorithm 695: Software for a new modified Cholesky factorization’, ACM Transations on Mathematical Software 17, 306–312. Everstine, G. C. (1979), ‘A comparison of three resequencing algorithms for the reduction of matrix profile and wavefront’, International Journal of Numerical Methods in Engineering 14, 837–853. Felippa, C. A. (1975), ‘Solution of linear equations with skyline-stored symmetric matrix’, Computers and Structures 5, 13–29. Feurzeig, W. (1960), ‘Algorithm 23, MATH SORT’, Communications of the ACM 5, 13–29. Fiduccia, C. M. and Mattheyses, R. M. (1982), ‘A linear-time heuristic for improving network partitions’. In: Proceedings 19th ACM IEEE Design

Automation Conference. IEEE Press, New York, pp. 175–181. Fiedler, M. (1975), ‘Algebraic connectivity of graphs’, Czechoslovak Mathematical Journal 25, 619–633. Fletcher, R. (1980), Practical Methods of Optimization. Volume 1. Unconstrained Optimization. J. Wiley and Sons, Chichester. Fletcher, R. (1981), Practical Methods of Optimization. Volume 2. Constrained Optimization. J. Wiley and Sons, Chichester. Ford, L. and Fulkerson, D. (1962), Flows in Networks. Princeton University Press, Princeton. Forsythe, G. and Moler, C. B. (1967), Computer Solution of Linear Algebraic Equations. Prentice-Hall, Upper Saddle River. Forth, S. A., Tadjouddine, M., Pryce, J. D., and Reid, J. K. (2004), ‘Jacobian code generated by source transformation and vertex elimination is as efficient as hand coding’, ACM Transations on Mathematical Software 30(3), 266–299. Foster, L. V. (1997), ‘The growth factor and efficiency of Gaussian elimination with rook pivoting’, Journal of Computational and Applied Mathematics 86, 177–194. Gallivan, K., Hansen, P. C., Ostromsky, T., and Zlatev, Z. (1995), ‘A locally optimized reordering algorithm and its application to a parallel sparse linear system solver’, Computing 54, 39–67. Gear, C. W. (1975), Numerical errors in sparse linear equations, Technical Report UIUCDCS-F-75-885. Department of Computer Science, University of Illinois, Urbana. Geist, G. A. and Ng, E. G. (1989), ‘Task scheduling for parallel sparse Cholesky factorization’, International Journal of Parallel Programming 18(4), 291– 314. George, A. (1971), ‘Computer implementation of the finite-element method’, PhD Thesis, Stanford University, California, USA. George, A. (1973), ‘Nested dissection of a regular finite-element mesh’, SIAM Journal on Numerical Analysis 10, 345–363. George, A. (1977), ‘Solution of linear systems of equations: direct methods for finite-element problems’. In: V. A. Barker (ed), Sparse Matrix Techniques, Lecture notes in mathematics, Vol. 572. Springer-Verlag, Berlin, pp. 52–101. George, A. (1980), ‘An automatic one-way dissection algorithm for irregular finite-element problems’, SIAM Journal on Numerical Analysis 17, 740– 751. George, A. and Liu, J. W. H. (1975), ‘Some results on fill for sparse matrices’, SIAM Journal on Numerical Analysis 12, 452–455. George, A. and Liu, J. W. H. (1978a), ‘Algorithms for matrix partitioning and the numerical solution of finite-element systems’, SIAM Journal on Numerical Analysis 15, 297–327. George, A. and Liu, J. W. H. (1978b), ‘An automated nested dissection algorithm for irregular finite element problems’, SIAM Journal on Numerical Analysis

15, 1053–1069. George, A. and Liu, J. W. H. (1979), ‘An implementation of a pseudoperipheral node finder’, ACM Transations on Mathematical Software 5, 284–295. George, A. and Liu, J. W. H. (1981), Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs. George, A. and Ng, E. (1985), ‘An implementation of Gaussian elimination with partial pivoting for sparse systems’, SIAM Journal on Scientific and Statistical Computing 6, 390–409. George, A. and Ng, E. (1987), ‘Symbolic factorization for sparse Gaussian elimination with partial pivoting’, SIAM Journal on Scientific and Statistical Computing 8, 877–898. George, A., Heath, M. T., Liu, J. W. H., and Ng, E. (1988), ‘Sparse Cholesky factorization on a local-memory multiprocessor’, SIAM Journal on Scientific and Statistical Computing 9, 327–340. George, A., Liu, J., and Ng, E. (1987), ‘Communication reduction in parallel sparse Cholesky factorization on a hypercube’. In: M. T. Heath (ed), Hypercube Multiprocessors. SIAM Press, Philadelphia, pp. 576–586. George, A., Poole Jr., W. G., and Voigt, R. (1978), ‘Incomplete nested dissection for solving n×n grid problems’, SIAM Journal on Numerical Analysis 15, 663–673. Gibbs, N. E., Poole Jr., W. G., and Stockmeyer, P. K. (1976), ‘An algorithm for reducing the bandwidth and profile of a sparse matrix’, SIAM Journal on Numerical Analysis 13, 236–250. Gilbert, J. R. and Liu, J. W. H. (1993), ‘Elimination structures for unsymmetric sparse LU factors’, SIAM Journal on Matrix Analysis and Applications 14, 334–354. Gilbert, J. R. and Ng, E. (1993), ‘Predicting structure in nonsymmetric sparse matrix factorizations’. In: A. George, J. R. Gilbert, and J. W. H. Liu (eds), Graph Theory and Sparse Matrix Computation, The IMA Volumes in Mathematics and its Applications, Volume 56. Springer-Verlag, Berlin, pp. 107–139. Gilbert, J. R. and Peierls, T. (1988), ‘Sparse partial pivoting in time proportional to arithmetic operations’, SIAM Journal on Scientific and Statistical Computing 9, 862–874. Gilbert, J. R. and Schreiber, R. (1982), ‘Nested dissection with partial pivoting’. In: Sparse Matrix Symposium 1982: program and abstracts, Fairfield Glade, TN. Gilbert, J. R., Moler, C., and Schreiber, R. (1992), ‘Sparse matrices in MATLAB: design and implementation’, SIAM Journal on Matrix Analysis and Applications 13(1), 333–356. Gilbert, J. R., Ng, E. G., and Peyton, B. W. (1994), ‘An efficient algorithm to compute row and column counts for sparse Cholesky factorization’, SIAM Journal on Matrix Analysis and Applications 15, 1075–1091. Gill, P. E., Murray, W., and Wright, M. H. (1981), Practical Optimization.

Academic Press, London. Gill, P. E., Murray, W., Saunders, M. A., and Wright, M. H. (1990), ‘A Schurcomplement method for sparse quadratic programming’. In: M. G. Cox and S. J. Hammarling (eds), Reliable Scientific Computation. Oxford University Press, Oxford, pp. 113–138. Gill, P. E., Saunders, M. A., and Shinnerl, J. R. (1996), ‘On the stability of Cholesky factorization for quasi-definite systems’, SIAM Journal on Matrix Analysis and Applications 17(1), 35–46. Giraud, L., Marrocco, A., and Rioual, J.-C. (2005), ‘Iterative versus direct parallel substructuring methods in semiconductor device modelling’, Numerical Linear Algebra with Applications 12(1), 33–53. Golub, G. H. and van Loan, C. (2012), Matrix Computations, 4th edn. John Hopkins University Press, Baltimore. Gould, N. I. M. (1991), ‘An algorithm for large scale quadratic programming’, IMA Journal of Numerical Analysis 11, 299–324. Gould, N. I. M. and Toint, Ph. L. (2002), ‘An iterative working-set method for large-scale nonconvex quadratic programmimg’, Applied Numerical Mathematics 43, 109–128. Greenbaum, A. (1997), Iterative Methods for Solving Linear Systems, Frontiers in Applied Mathematics. SIAM Press, Philadelphia. Griewank, A. and Walther, A. (2008), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM Press, Philadelphia. Grund, F. (1999), ‘Direct linear solvers for vector and parallel computers’. In: J. M. L. M. Palma, J. Dongarra, and V. Hern´andez (eds), Vector and Parallel Processing - VECPAR’98, Third International Conference, Porto, Portugal, June 1998, Lecture Notes in Computer Science, Vol. 1573. Springer-Verlag, Berlin, pp. 114–127. Grund, F. (2003), Solution of linear systems with sparse matrices, Technical Report 816. Weierstraß-Institut f¨ ur Angewandte Analysis und Stochastik, Berlin. Guermouche, A. and L’Excellent, J.-Y. (2005), ‘A study of various load information exchange mechanisms for a distributed application using dynamic scheduling’. In: 19th International Parallel and Distributed Processing Symposium (IPDPS’05). Guermouche, A. and L’Excellent, J.-Y. (2006), ‘Constructing memoryminimizing schedules for multifrontal methods’, ACM Transations on Mathematical Software 32(1), 17–32. Gupta, A. (2002a), ‘Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices’, SIAM Journal on Matrix Analysis and Applications 24, 529–552. Gupta, A. (2002b), ‘Recent advances in direct methods for solving unsymmetric sparse systems of linear equations’, ACM Transations on Mathematical Software 28, 301–324. Gupta, A. (2007), ‘A shared- and distributed-memory parallel general sparse

direct solver’, Applicable Algebra in Engineering, Communication, and Computing 18, 263–277. Gupta, A., Karypis, G., and Kumar, V. (1997), ‘Highly scalable parallel algorithms for sparse matrix factorization’, IEEE Transactions on Parallel and Distributed Systems 8(5), 502–520. Gustavson, F. G. (1972), ‘Some basic techniques for solving sparse systems of linear equations’. In: D. J. Rose and R. A. Willoughby (eds), Sparse Matrices and their Applications. Plenum Press, New York, pp. 41–52. Gustavson, F. G. (1976), ‘Finding the block lower triangular form of a sparse matrix’. In: J. R. Bunch and D. J. Rose (eds), Sparse Matrix Computations. Academic Press, London, pp. 275–289. Gustavson, F. G. (1978), ‘Two fast algorithms for sparse matrices: multiplication and permuted transposition’, ACM Transations on Mathematical Software 4, 250–269. Gustavson, F. G., Liniger, W. M., and Willoughby, R. A. (1970), ‘Symbolic generation of an optimal Crout algorithm for sparse systems of linear equations’, Journal of the ACM 17, 87–109. Hachtel, G. D. (1972), ‘Vector and matrix variability type in sparse matrix algorithms’. In: D. J. Rose and R. A. Willoughby (eds), Sparse Matrices and their Applications. Plenum Press, New York, pp. 53–64. Hachtel, G. D. (1976), ‘The sparse tableau approach to finite-element assembly’. In: J. R. Bunch and D. J. Rose (eds), Sparse Matrix Computations. Academic Press, London, pp. 349–363. Hachtel, G. D., Brayton, R. K., and Gustavson, F. G. (1971), ‘The sparse tableau approach to network analysis and design’, IEEE Transactions on Circuit Theory 18, 101–113. Hager, W. W. (1984), ‘Condition estimates’, SIAM Journal on Scientific and Statistical Computing 5, 311–316. Hager, W. W. (2002), ‘Minimizing the profile of a symmetric matrix’, SIAM Journal on Scientific Computing 23(5), 1799–1816. Hall, J. and McKinnon, K. (2005), ‘Hyper-sparsity in the revised simplex method and how to exploit it’, Computational Optimization and Applications 32(3), 259–283. Hall, M. (1956), ‘An algorithm for distinct representatives’, Amererican Mathematical Monthly 63, 716–717. Hall, P. (1935), ‘On representatives of subsets’, Journal of the London Mathematical Society 10(1), 26–30. Hamming, R. W. (1971), Introduction to Applied Numerical Analysis. McGrawHill, New York. Harary, F. (1969), Graph Theory. Addison-Wesley Publishing Company, Boston. Harary, F. (1971), ‘Sparse matrices and graph theory’. In: J. K. Reid (ed), Large Sparse Sets of Linear Equations. Academic Press, London, pp. 139–150. Hellerman, E. and Rarick, D. C. (1971), ‘Reinversion with the preassigned pivot procedure’, Mathematical Programming 1, 195–216.

Hellerman, E. and Rarick, D. C. (1972), ‘The partitioned preassigned pivot procedure (P4 )’. In: D. J. Rose and R. A. Willoughby (eds), Sparse Matrices and their Applications. Plenum Press, New York, pp. 67–76. Hendrickson, B. and Kolda, T. G. (2000a), ‘Graph partitioning models for parallel computing’, Parallel Computing 26(12), 1519–1534. Hendrickson, B. and Kolda, T. G. (2000b), ‘Partitioning rectangular and structurally unsymmetric sparse matrices for parallel processing’, SIAM Journal on Scientific Computing 21(6), 2048–2072. Hendrickson, B. and Rothberg, E. (1998), ‘Improving the runtime and quality of nested dissection ordering’, SIAM Journal on Scientific Computing 20, 468– 489. Higham, N. J. (1987), ‘A survey of condition number estimation for triangular matrices’, SIAM Review 29, 575–596. Higham, N. J. (1988), ‘FORTRAN codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation (Algorithm 674)’, ACM Transations on Mathematical Software 14(4), 381–396. Higham, N. J. (2002), Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM Press, Philadelphia. Hoffman, A. J., Martin, M. S., and Rose, D. J. (1973), ‘Complexity bounds for regular finite difference and finite element grids’, SIAM Journal on Numerical Analysis 10, 364–369. Hogg, J. (2008), A DAG-based parallel Cholesky factorization for multicore systems, Technical Report RAL-TR-2008-029. Rutherford Appleton Laboratory, Oxfordshire, UK. Hogg, J. (2013), ‘A fast dense triangular solve in CUDA’, SIAM Journal on Scientific Computing 35, C303–C322. Hogg, J. and Scott, J. (2008), The effects of scalings on the performance of a sparse symmetric indefinite solver, Technical Report RAL-TR-2008-007. Rutherford Appleton Laboratory, Oxfordshire, UK. Hogg, J. and Scott, J. (2013a), Achieving bit compatibility in sparse direct solvers (corrected), Technical Report RAL-P-2012-005. Rutherford Appleton Laboratory, Oxfordshire, UK. Hogg, J. and Scott, J. (2013b), ‘Pivoting strategies for tough sparse indefinite systems’, ACM Transations on Mathematical Software 40(1), 4:1–4:19. Hogg, J., Reid, J., and Scott, J. (2010), ‘Design of a multicore sparse Cholesky factorization using DAGs’, SIAM Journal on Scientific Computing 32, 3627–3649. Hood, P. (1976), ‘Frontal solution program for unsymmetric matrices’, International Journal of Numerical Methods in Engineering 10, 379–400. Hopcroft, J. E. and Karp, R. M. (1973), ‘An n5/2 algorithm for maximum matchings in bipartite graphs’, SIAM Journal on Computing 2, 225–231. HSL (2016), ‘The HSL mathematical software library’. Available at: http://www.hsl.rl.ac.uk/index.html (accessed 10 June 2016).

Hu, Y. F. and Scott, J. A. (2001), ‘A multilevel algorithm for wavelength reduction’, SIAM Journal on Scientific Computing 23(4), 1352–1375. Hu, Y. F., Maguire, K. C. F., and Blake, R. J. (2000), ‘A multilevel unsymmetric matrix ordering for parallel process simulation’, Computers in Chemical Engineering 23, 1631–1647. IBM (1976), IBM system/360 and system/370 IBM 1130 and IBM 1800 subroutine library - mathematics. User’s guide, Program Product 5736XM7. IBM catalogue # SH12-5300-1. IEEE (1985), 754-1985 IEEE Standard for Binary Floating-Point Arithmetic. IEEE Computer Society. Irons, B. M. (1970), ‘A frontal solution program for finite-element analysis’, International Journal of Numerical Methods in Engineering 2, 5–32. Jennings, A. (1966), ‘A compact storage scheme for the solution of symmetric linear simultaneous equations’, Computer Journal 9, 351–361. Jennings, A. (1977), Matrix Computation for Engineers and Scientists. J. Wiley and Sons, Chichester. Jennings, A. and Malik, G. M. (1977), ‘Partial elimination’, Journal of the Institute of Mathematics and its Applications 20, 307–316. Karp, R. M. (1986), ‘Combinatorics, complexity, and randomness’, Communications of the ACM 29, 98–109. Karp, R. M. and Sipser, M. (1981), ‘Maximum matching in sparse random graphs’. In: 22nd Annual IEEE Symposium on Foundations of Computer Science (FOCS 1981). IEEE Computer Society, Los Alamitos, CA, USA, pp. 364–375. Karypis, G. and Kumar, V. (1998a), ‘A fast and high quality multilevel scheme for partitioning irregular graphs’, SIAM Journal on Scientific Computing 20(1), 359–392. Karypis, G. and Kumar, V. (1998b), hMeTiS – A Hypergraph Partitioning Package, Version 1.5.3. University of Minnesota, Twin Cities. Karypis, G. and Kumar, V. (1998c), METIS – A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0. University of Minnesota, Twin Cities. Karypis, G. and Kumar, V. (1998d), ‘Parallel algorithm for multilevel graph partitioning and sparse matrix ordering’, Journal of Parallel and Distributed Computing 48, 71–95. Kernighan, B. and Lin, S. (1970), ‘An efficient heuristic procedure for partitioning graphs’, Bell System Technical Journal 49(2), 291–307. Knight, P. A., Ruiz, D., and U¸car, B. (2014), ‘A symmetry preserving algorithm for matrix scaling’, SIAM Journal on Matrix Analysis and Applications 35(3), 931–955. Knuth, D. E. (1969), Fundamental Algorithms, The Art of Computer Programming I. Addison-Wesley Publishing Company, Boston.

K¨onig, D. (1931), ‘Gr´ afok ´es m´ atrixok’, Matematikai ´es Fizikai Lapok 38, 116– 119. K¨onig, D. (1950), Theorie der Endlichen und Unendlichen Graphen. AMS, Chelsea, New York. Koster, J. (1997), ‘On the parallel solution and the reordering of unsymmetric sparse linear systems’, PhD Thesis, Institut National Polytechnique de Toulouse. CERFACS Technical Report, TH/PA/97/51. Koster, J. and Bisseling, R. H. (1994), ‘An improved algorithm for parallel sparse LU decomposition on a distributed-memory multiprocessor’. In: J. G. Lewis (ed), Proceedings 5th SIAM Conference on Linear Algebra. SIAM Press, Philadelphia, pp. 397–401. Kuhn, H. W. (1955), ‘The Hungarian method for solving the assignment problem’, Naval Research Logistics Quarterly 2, 83–97. Kumfert, G. and Pothen, A. (1997), ‘Two improved algorithms for envelope and wavefront reduction’, BIT 37(3), 559–590. LaSalle, D. and Karypis, G. (2015), ‘Efficient nested dissection for multicore architectures’. In: J. L. Tr¨ ass, S. Hunold, and F. Versaci (eds), Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24–28, 2015, Proceedings. Springer-Verlag, Berlin, pp. 467–478. Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T. (1979), ‘Basic linear algebra subprograms for Fortran use’, ACM Transations on Mathematical Software 5, 308–325. Lewis, J. G. (1982), ‘Implementation of the Gibbs–Poole–Stockmeyer and Gibbs– King algorithms’, ACM Transations on Mathematical Software 8, 180–189 and 190–194. Li, X. S. (2005), ‘An overview of SuperLU: Algorithms, implementation, and user interface’, ACM Transations on Mathematical Software 31(3), 302–325. Liu, J. W. H. (1985), ‘Modification of the minimum degree algorithm by multiple elimination’, ACM Transations on Mathematical Software 11(2), 141–153. Liu, J. W. H. (1986a), ‘A compact row storage scheme for Cholesky factors using elimination trees’, ACM Transations on Mathematical Software 12, 127– 148. Liu, J. W. H. (1986b), ‘On the storage requirement in the out-of-core multifrontal method for sparse factorization’, ACM Transations on Mathematical Software 12, 249–264. Liu, J. W. H. (1988), ‘Equivalent sparse matrix reordering by elimination tree rotations’, SIAM Journal on Scientific and Statistical Computing 9, 424– 444. Liu, J. W. H. (1989), ‘A graph partitioning algorithm by node separators’, ACM Transations on Mathematical Software 15(3), 198–219. Liu, J. W. H. (1990), ‘The role of elimination trees in sparse factorization’, SIAM Journal on Matrix Analysis and Applications 11(1), 134–172.

Liu, J. W. H. and Sherman, A. H. (1976), ‘Comparative analysis of the Cuthill– McKee and the reverse Cuthill–McKee ordering algorithms for sparse matrices’, SIAM Journal on Numerical Analysis 13, 198–213. Magun, J. (1998), ‘Greeding matching algorithms, an experimental study’, Journal of Experimental Algorithmics 3, 6. Markowitz, H. M. (1957), ‘The elimination form of the inverse and its application to linear programming’, Management Science 3, 255–269. MatrixMarket (2000), ‘Matrix Market, NIST, Gaithersburg, MD’. Available at: http://math.nist.gov/MatrixMarket/ (accessed 10 June 2016). Matstoms, P. (1994), ‘Sparse QR factorization in MATLAB’, ACM Transations on Mathematical Software 20, 136–159. Mayoh, B. (1965), ‘A graph technique for inverting certain matrices’, Mathematics of Computation 19, 644–646. McCormick, S. T. (1983), ‘Optimal approximation of sparse Hessians and its equivalence to a graph coloring problem’, Mathematical Programming 26, 153–171. Meijerink, J. A. and van der Vorst, H. A. (1977), ‘An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix’, Mathematics of Computation 31(137), 148–162. Munro, I. (1971a), ‘Efficient determination of the transitive closure of a directed graph’, Information Processing Letters 1, 55–58. Munro, I. (1971b), ‘Some results in the study of algorithms’, PhD Thesis, University of Toronto, Ontario. Neal, L. and Poole, G. (1992), ‘A geometric analysis of Gaussian elimination II’, Linear Algebra and its Applications 173, 239–264. Ng, E. G. and Peyton, B. W. (2014), ‘Fast implementation of the minimum local fill ordering heuristic’. Presented at CSC14, The Sixth SIAM Workshop on Combinatorial Scientific Computing, 21-23 June 2014, Lyon, France. Noor, A. K., Kamel, H. A., and Fulton, R. E. (1977), ‘Substructuring techniques — status and projections’, Computers and Structures 8, 621–632. Oettli, W. and Prager, W. (1964), ‘Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides’, Numerische Mathematik 6, 405–409. Ogbuobiri, E. C., Tinney, W. F., and Walker, J. W. (1970), ‘Sparsity directed decomposition for Gaussian elimination on matrices’, IEEE Transactions on Power Systems PAS-89, 141–150. O’Leary, D. P. (1980), ‘Estimating condition numbers’, SIAM Journal on Scientific and Statistical Computing 1, 205–209. Olschowka, M. and Neumaier, A. (1996), ‘A new pivoting strategy for Gaussian elimination’, Linear Algebra and its Applications 240, 131–151. Østerby, O. and Zlatev, Z. (1983), Direct Methods for Sparse Matrices. Lecture notes in Computer Science, Vol. 157. Springer-Verlag, Berlin. Paige, C. C. and Saunders, M. A. (1975), ‘Solution of sparse indefinite systems of linear equations’, SIAM Journal on Numerical Analysis 12, 617–629.

Pan, V. (1984), ‘How can we speed up matrix multiplication?’, SIAM Review 26, 393–415. Parter, S. V. (1961), ‘The use of linear graphs in Gaussian elimination’, SIAM Review 3, 119–130. Pellegrini, F. (2012), ‘Scotch and PT-Scotch graph partitioning software: an overview’. In: U. Naumann and O. Schenk (eds), Combinatorial Scientific Computing. CRC Press, Taylor & Francis Group, Boca Raton, pp. 373–406. Peyton, B. W. (2001), ‘Minimal orderings revisited’, SIAM Journal on Matrix Analysis and Applications 23(1), 271–294. Polizzi, E. and Sameh, A. H. (2007), ‘SPIKE: a parallel environment for solving banded linear systems’, Computers & Fluids 36(1), 113–120. Pothen, A. and Fan, C. (1990), ‘Computing the block triangular form of a sparse matrix’, ACM Transations on Mathematical Software 16(4), 303–324. Pothen, A., Simon, H. D., and Liou, K. P. (1990), ‘Partitioning sparse matrices with eigenvectors of graphs’, SIAM Journal on Matrix Analysis and Applications 11(3), 430–452. Powell, M. J. D. (1970), ‘On the estimation of sparse Hessian matrices’. In: J. B. Rosen, O. L. Mangasarian, and K. Ritter (eds), Nonlinear Programming. Academic Press, London, pp. 31–65. Powell, M. J. D. and Toint, Ph. L. (1979), ‘On the estimation of sparse Hessian matrices’, SIAM Journal on Numerical Analysis 16, 1060–1074. Pralet, S. (2004), ‘Constrained orderings and scheduling for parallel sparse linear algebra’, PhD Thesis, Institut National Polytechnique de Toulouse. CERFACS Technical Report, TH/PA/04/105. Raghavan, P. (1998), ‘Efficient parallel sparse triangular solution using selective inversion’, Parallel Processing Letters 8(1), 29–40. Rajamanickam, S., Boman, E. G., and Heroux, M. A. (2012), ‘ShyLU:A hybridhybrid solver for multicore platforms’. In: IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), 2012. IEEE Press, New York, pp. 631–643. Reid, J. K. (1971), ‘A note on the stability of Gaussian elimination’, Journal of the Institute of Mathematics and its Applications 8, 374–375. Reid, J. K. (1984), TREESOLVE. a Fortran package for solving large sets of linear finite-element equations, Technical Report CSS 155. AERE Harwell Laboratory. Reid, J. K. and Scott, J. A. (1999), ‘Ordering symmetric sparse matrices for small profile and wavefront’, International Journal of Numerical Methods in Engineering 45, 1737–1755. Reid, J. K. and Scott, J. A. (2002), ‘Implementing Hager’s exchange methods for matrix profile reduction’, ACM Transations on Mathematical Software 28(4), 377–391. Reid, J. K. and Scott, J. A. (2009), ‘An out-of-core sparse Cholesky solver’, ACM Transations on Mathematical Software 36(2), 1–33. Rose, D. J. (1972), ‘A graph-theoretic study of the numerical solution of sparse

positive definite systems of linear equations’. In: R. C. Read (ed), Graph Theory and Computing. Academic Press, London, pp. 183–217. Rose, D. J. and Tarjan, R. E. (1978), ‘Algorithm aspects of vertex elimination on directed graphs’, SIAM Journal on Applied Mathematics 34, 176–197. Rothberg, E. and Eisenstat, S. C. (1998), ‘Node selection strategies for bottom-up sparse matrix ordering’, SIAM Journal on Matrix Analysis and Applications 19, 682–695. Saad, Y. (1993), ‘A flexible inner-outer preconditioned GMRES algorithm’, SIAM Journal on Scientific and Statistical Computing 14, 461–469. Saad, Y. (1996), Iterative Methods for Sparse Linear Systems. PWS Publishing, New York. Saad, Y. (2003), Iterative Methods for Sparse Linear Systems, 2nd edn. SIAM Press, Philadelphia. Saad, Y. and Schultz, M. H. (1986), ‘GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.’, SIAM Journal on Scientific and Statistical Computing 7, 856–869. Sargent, R. W. H. and Westerberg, A. W. (1964), ‘Speed-up in chemical engineering design’, Transactions of the Institution of Chemical Engineers 42, 190–197. Schenk, O. and G¨artner, K. (2006), ‘On fast factorization pivoting methods for sparse symmetric indefinite systems’, Electronic Transactions on Numerical Analysis 23, 158–179. Schenk, O., G¨artner, K., and Fichtner, W. (2000), ‘Efficient sparse LU factorization with left–right looking strategy on shared memory multiprocessors’, BIT 40(1), 158–176. Schneider, H. (1977), ‘The concepts of irreducibility and full indecomposability of a matrix in the works of Frobenius, K¨ onig, and Markov’, Linear Algebra and its Applications 18, 139–162. Schreiber, R. (1993), ‘Scalability of sparse direct solvers’. In: A. George, J. R. Gilbert, and J. W. H. Liu (eds), Graph Theory and Sparse Matrix Computation, The IMA Volumes in Mathematics and its Applications, Volume 56. Springer-Verlag, Berlin, pp. 191–209. Schulze, J. (2001), ‘Towards a tighter coupling of bottom-up and top-down sparse matrix ordering methods’, BIT 41(4), 800–841. Scott, J. A. (1999), ‘On ordering elements for a frontal solver’, Communications in Numerical Methods in Engineering 15(5), 309–323. Sherman, A. H. (1975), ‘On the efficient solution of sparse systems of linear and non-linear equations’, PhD Thesis, Dept. of Computer Science, Yale University, New Haven. Available as Tech. Report #46. Sherman, A. H. (1978), ‘Algorithm 533. NSPIV, A Fortran subroutine for sparse Gaussian elimination with partial pivoting’, ACM Transations on Mathematical Software 4, 391–398. Sherman, J. and Morrison, W. J. (1949), ‘Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given

row of the original matrix (abstract)’, Annals of Mathematical Statistics 20(4), 621. Sid-Lakhdar, W. M. (2014), ‘Scaling the solution of large sparse linear systems using multifrontal methods on hybrid shared-distributed memory architecture’, PhD Thesis, ENS, Lyon, France. Skeel, R. D. (1979), ‘Scaling for numerical stability in Gaussian elimination’, Journal of the ACM 26, 494–526. Sloan, S. W. (1986), ‘An algorithm for profile and wavefront reduction of sparse matrices’, International Journal of Numerical Methods in Engineering 23, 239–251. Smith, B., Bj¨orstad, P., and Gropp, W. (1996), Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, Cambridge. Sorensen, D. C. (1981), ‘An example concerning quasi-Newton estimation of a sparse Hessian’, SIGNUM Newsletter 16(2), 8–10. Speelpenning, B. (1978), The generalized element method, Technical Report UIUCDCS-R-78-946. Department of Computer Science, University of Illinois, Urbana. Stewart, G. W. (1973), Introduction to Matrix Computations. Academic Press, London. Stewart, G. W. (1974), ‘Modifying pivot elements in Gaussian elimination’, Mathematics of Computation 28, 537–542. Strang, G. (1980), Linear Algebra and its Applications, 2nd edn. Academic Press, London. Strassen, V. (1969), ‘Gaussian elimination is not optimal’, Numerische Mathematik 13, 354–356. Sui, X., Nguyen, D., Burtscher, M., and Pingali, K. (2011), Parallel graph partitioning on multicore architectures, Lecture Notes in Computer Science 6548. Springer-Verlag, Berlin, pp. 246–260. Szyld, D. B. (1981), ‘Using sparse matrix techniques to solve a model of the world economy’. In: I. S. Duff (ed), Sparse Matrices and their Uses. Academic Press, London, pp. 357–365. Takahashi, K., Fagan, J., and Chin, M. (1973), ‘Formation of a sparse bus impedance matrix and its application to short circuit study’. In: Proceedings 8th PICA Conference, Minneapolis, Minnesota. Tarjan, R. E. (1972), ‘Depth-first search and linear graph algorithms’, SIAM Journal on Computing 1, 146–160. Tarjan, R. E. (1975), ‘Efficiency of a good but not linear set union algorithm’, Journal of the ACM 22, 215–225. Tewarson, R. P. (1973), Sparse Matrices. Academic Press, London. Thapa, M. N. (1983), ‘Optimization of unconstrained functions with sparse Hessian matrices — quasi-Newton methods’, Mathematical Programming 25, 158–182. Tinney, W. F. and Walker, J. W. (1967), ‘Direct solutions of sparse network

equations by optimally ordered triangular factorization’, Proceedings of the IEEE 55, 1801–1809. Tinney, W. F., Powell, W. L., and Peterson, N. M. (1973), ‘Sparsity-oriented network reduction’. In: Proceedings 8th PICA Conference, Minneapolis, Minnesota. Toint, Ph. L. (1977), ‘On sparse and symmetric matrix updating subject to a linear equation’, Mathematics of Computation 31, 954–961. Toint, Ph. L. (1981), ‘A note about sparsity exploiting quasi-Newton updates’, Mathematical Programming 21, 172–181. Tomlin, J. A. (1972), ‘Pivoting for size and sparsity in linear programming inversion routines’, Journal of the Institute of Mathematics and its Applications 10, 289–295. Tosovic, L. B. (1973), ‘Some experiments on sparse sets of linear equations’, SIAM Journal on Applied Mathematics 25, 142–148. van der Vorst, H. A. (2003), Iterative Krylov Methods for Large Linear Systems. Cambridge University Press, Cambridge. Varga, R. S. (1962), Matrix Iterative Analysis. Prentice-Hall, Upper Saddle River. Vastenhouw, B. and Bisseling, R. H. (2005), ‘A two-dimensional data distribution method for parallel sparse matrix-vector multiplication’, SIAM Review 47, 67–95. Weisbecker, C. (2013), ‘Improving multifrontal solvers by means of algebraic Block Low-Rank representations’, PhD Thesis, Institut National Polytechnique de Toulouse. Westerberg, A. W. and Berna, T. J. (1979), ‘LASCALA — a language for large scale linear algebra’. In: I. S. Duff and G. W. Stewart (eds), Sparse Matrix Proceedings, 1978. SIAM Press, Philadelphia, pp. 90–106. Whaley, R. C., Petitet, A., and Dongarra, J. J. (2000), LAPACK Working Note 147 : Automated empirical optimization of software and the ATLAS project, Technical Report CS-94-234. Department of Computer Sciences, University of Tennessee, Knoxville, Tennessee. Wilkinson, J. H. (1961), ‘Error analysis of direct methods of matrix inversion’, Journal of the ACM 8, 281–330. Wilkinson, J. H. (1965), The Algebraic Eigenvalue Problem. Oxford University Press, Oxford. Willoughby, R. A. (1971), ‘Sparse matrix algorithms and their relation to problem classes and computer architecture’. In: J. K. Reid (ed), Large Sparse Sets of Linear Equations. Academic Press, London, pp. 255–277. Woodbury, M. (1950), Inverting modified matrices, Technical Report Memorandum 42. Princeton University Press, Princeton. Xia, J. (2013), ‘Efficient structured multifrontal factorization for general large sparse matrices’, SIAM Journal on Scientific Computing 35, A832–A860. Yamamoto, F. and Takahashi, S. (1985), ‘Vectorized LU decomposition algorithms for large-scale circuit simulation’, IEEE Transactions on

Computer Aided Design of Integrated Circuits and Systems 4, 231–239. Yannakakis, M. (1981), ‘Computing the minimum fill-in is NP-complete’, SIAM Journal on Algebraic and Discrete Methods 2, 77–79. Yip, E. L. (1986), ‘A note on the stability of solving a rank-p modification of a linear system by the Sherman–Morrison–Woodbury formula’, SIAM Journal on Scientific and Statistical Computing 7(3), 507–513. Zlatev, Z. (1980), ‘On some pivotal strategies in Gaussian elimination by sparse technique’, SIAM Journal on Numerical Analysis 17, 18–30. Zlatev, Z., Barker, V. A., and Thomsen, P. G. (1978), SLEST: a FORTRAN IV subroutine for solving sparse systems of linear equations, Technical Report NI-78-01. Bell Laboratories, Murray Hill, New Jersey. Zlatev, Z., Waśniewski, J., Hansen, P. C., and Ostromsky, T. (1995), PARASPAR: a package for the solution of large linear algebraic equations on parallel computers with shared memory, Technical Report 95-10. Numerisk Institut, Lyngby, Denmark.

AUTHOR INDEX Agullo, Buttari, Guermouche, and Lopez (2013), 347 Aho, Hopcroft, and Ullman (1974), 261 Amestoy and Puglisi (2002), 282, 283, 309, 313 Amestoy, Ashcraft, Boiteau, Buttari, L’Excellent, and Weisbecker (2015a), 305 Amestoy, Davis, and Duff (1996a), 140, 234, 237 Amestoy, Duff, and L’Excellent (2000), 301, 303 Amestoy, Duff, and Puglisi (1996b), 347 Amestoy, Duff, Giraud, L’Excellent, and Puglisi (2004), 14 Amestoy, Duff, Guermouche, and Slavova (2010), 152, 318, 322 Amestoy, Duff, L’Excellent, and Koster (2001), 276, 302 Amestoy, Duff, L’Excellent, and Rouet (2015b), 319, 322, 339 Amestoy, Duff, L’Excellent, Robert, Rouet, and U¸car (2012), 318, 319, 339 Amestoy, Guermouche, L’Excellent, and Pralet (2006), 302 Anderson, Bai, Bischof, Blackford, Demmel, Dongarra, Du Croz, Greenbaum, Hammarling, McKenney, and Sorensen (1999), 80 Arioli and Duff (2008), 227 Arioli, Demmel, and Duff (1989), 94, 227, 335

Arioli, Duff, Gratton, and Pralet (2007), 285, 286 Ashcraft and Grimes (1999), 192 Ashcraft and Liu (1998), 184 Ashcraft, Eisenstat, and Liu (1999), 215 Aspen Technology Inc. (1995), 224 Aykanat, Pinar, and C ¸ ataly¨ urek (2004), 199 Barnard and Simon (1994), 166 Barnard, Pothen, and Simon (1995), 163, 164 Barwell and George (1976), 74 Bauer (1963), 82 Bebendorf (2008), 304 Berge (1957), 129 Bhat, Habashi, Liu, Nguyen, and Peeters (1993), 184 Bj¨ orck (1987), 348 Blackford, Choi, Cleary, D’Azevedo, Demmel, Dhillon, Dongarra, Hammarling, Henry, Petitet, Stanley, Walker, and Whaley (1997), 302 Blackford, Demmel, Dongarra, Duff, Hammarling, Henry, Heroux, Kaufman, Lumsdaine, Petitet, Pozo, Remington, and Whaley (2002), 36 Bondy and Murty (2008), 2 Bunch (1974), 74 Bunch and Parlett (1971), 74 Bunch, Kaufman, and Parlett (1976), 74 Buttari, Langou, Kurzak, and Dongarra (2009), 293

Calahan (1982), 223 Campbell and Davis (1995), 339 Carvalho, Giraud, and Meurant (2001), 349 Catalyurek and Aykanat (1999), 199 Chan and Mathew (1994), 348 Chang (1969), 211, 222 Chevalier and Pellegrini (2008), 294 Chu, George, Liu, and Ng (1984), 223, 244 Cline and Rew (1983), 80 Cline, Conn, and Van Loan (1982), 80 Cline, Moler, Stewart, and Wilkinson (1979), 79, 80 Coleman (1984), 339 Coleman and Mor´e (1983), 342 Coleman and Mor´e (1984), 343 Coleman, Edenbrandt, and Gilbert (1986), 109 Coleman, Garbow, and Mor´e (1984a), 343 Coleman, Garbow, and Mor´e (1984b), 342 Collins (1973), 162 Curtis and Reid (1971), 34, 147, 206 Curtis and Reid (1972), 84 Curtis, Powell, and Reid (1974), 340, 341 Cuthill and McKee (1969), 159, 162 Davis (2004), 224 Davis (2011), 347 Davis and Duff (1997), 306, 307 Davis and Hu (2011), 14, 107, 359 Davis and Yew (1990), 228, 229 Davis, Gilbert, Larimore, and Ng (2004a), 141, 291 Davis, Gilbert, Larimore, and Ng (2004b), 141, 195 Dembart and Erisman (1973), 144, 151, 219, 222

Demmel, Eisenstat, Gilbert, Li, and Liu (1999), 291 Dennis Jr. and Schnabel (1983), 339 Devine, Boman, Heaphy, Hendrickson, and Vaughan (2002), 192 Dijkstra (1959), 130 Dodds and Lopez (1980), 263 Dolan and Mor´e (2002), 289 Dongarra, Bunch, Moler, and Stewart (1979), 255 Dongarra, Du Croz, Duff, and Hammarling (1990), 13 Dongarra, Du Croz, Hammarling, and Hanson (1988), 13 Dongarra, Duff, Sorensen, and van der Vorst (1998), 228, 255 Drummond, Duff, Guivarch, Ruiz, and Zenadi (2015), 351 Duff (1972), 111, 117, 144 Duff (1977), 126 Duff (1979), 142, 147, 208 Duff (1981a), 111 Duff (1981b), 117, 129, 254 Duff (1981c), 111, 115, 116 Duff (1983), 254 Duff (1984a), 254 Duff (1984b), 279 Duff (1984c), 220 Duff (1986), 303 Duff (1989), 303 Duff (2004), 90, 104, 277 Duff (2007), 289 Duff and Koster (1999), 130 Duff and Koster (2001), 85, 130 Duff and Nowak (1987), 216 Duff and Pralet (2004), 284 Duff and Pralet (2005), 284, 287, 290 Duff and Reid (1974), 144 Duff and Reid (1975), 347 Duff and Reid (1976), 151, 152 Duff and Reid (1978a), 125 Duff and Reid (1978b), 125 Duff and Reid (1979), 14

Duff and Reid (1982), 153 Duff and Reid (1983), 41, 153, 154, 235, 273, 277 Duff and Reid (1996a), 35, 90, 141, 220 Duff and Reid (1996b), 93, 154 Duff and Scott (1996), 254 Duff and Scott (1999), 254 Duff and Scott (2004), 200, 295 Duff and U¸car (2010), 127, 135 Duff, Erisman, and Reid (1976), 145 Duff, Erisman, and Reid (1986), 193 Duff, Erisman, and Reid (2016), 193 Duff, Erisman, Gear, and Reid (1988), 337 Duff, Grimes, and Lewis (1989), 14 Duff, Grimes, and Lewis (1997), 14 Duff, Guivarch, Ruiz, and Zenadi (2015), 352 Duff, Heroux, and Pozo (2002), 36 Duff, Kaya, and U¸car (2011), 111, 115–118, 129 Duff, Reid, Munksgaard, and Nielsen (1979), 153 Dulmage and Mendelsohn (1959), 133, 134 Eisenstat and Liu (1993), 214 Eisenstat and Liu (2005), 312, 313 Eisenstat, Gursky, Schultz, and Sherman (1982), 236 Erisman (1972), 106, 150, 222, 223 Erisman (1973), 150 Erisman and Reid (1974), 72, 73, 93 Erisman and Spies (1972), 73, 106, 150 Erisman and Tinney (1975), 337 Erisman, Grimes, Lewis, and Poole Jr. (1985), 193 Erisman, Grimes, Lewis, Poole Jr., and Simon (1987), 15, 127 Eskow and Schnabel (1991), 345 Everstine (1979), 163 Felippa (1975), 244

Feurzeig (1960), 30 Fiduccia and Mattheyses (1982), 189, 191, 192, 199 Fiedler (1975), 163, 165 Fletcher (1980), 339 Fletcher (1981), 339 Ford and Fulkerson (1962), 191 Forsythe and Moler (1967), 43, 82 Forth, Tadjouddine, Pryce, and Reid (2004), 222 Foster (1997), 69 Gallivan, Hansen, Ostromsky, and Zlatev (1995), 194 Gear (1975), 91, 92, 334 Geist and Ng (1989), 298, 301 George (1971), 159 George (1973), 180, 181, 202 George (1977), 97 George (1980), 178, 179 George and Liu (1975), 157 George and Liu (1978a), 171, 172 George and Liu (1978b), 188, 192, 235 George and Liu (1979), 188 George and Liu (1981), 2, 233, 235 George and Ng (1985), 140, 141, 195, 240, 241 George and Ng (1987), 241 George, Heath, Liu, and Ng (1988), 215 George, Liu, and Ng (1987), 301 George, Poole Jr., and Voigt (1978), 183 Gibbs, Poole Jr., and Stockmeyer (1976), 162, 188 Gilbert and Liu (1993), 310, 312 Gilbert and Ng (1993), 141, 195 Gilbert and Peierls (1988), 142, 213 Gilbert and Schreiber (1982), 194 Gilbert, Moler, and Schreiber (1992), 140 Gilbert, Ng, and Peyton (1994), 269

Gill, Murray, and Wright (1981), 339, 345 Gill, Murray, Saunders, and Wright (1990), 216 Gill, Saunders, and Shinnerl (1996), 147 Giraud, Marrocco, and Rioual (2005), 349, 350 Golub and Van Loan (2012), 43 Gould (1991), 69 Gould and Toint (2002), 147, 216 Greenbaum (1997), 348 Griewank and Walther (2008), 343 Grund (1999), 223 Grund (2003), 224 Guermouche and L’Excellent (2005), 302 Guermouche and L’Excellent (2006), 272, 273 Gupta (2002a), 283, 310, 311 Gupta (2002b), 283 Gupta (2007), 312 Gupta, Karypis, and Kumar (1997), 297, 299 Gustavson (1972), 34, 211, 239 Gustavson (1976), 111, 117 Gustavson (1978), 37, 106 Gustavson, Liniger, and Willoughby (1970), 221 Hachtel (1972), 153 Hachtel (1976), 151 Hachtel, Brayton, and Gustavson (1971), 153, 331 Hager (1984), 80 Hager (2002), 168 Hall (1935), 115 Hall (1956), 111 Hall and McKinnon (2005), 152, 218, 319 Hamming (1971), 84 Harary (1969), 2 Harary (1971), 109 Hellerman and Rarick (1971), 109

Hellerman and Rarick (1972), 193 Hendrickson and Kolda (2000a), 197 Hendrickson and Kolda (2000b), 197 Hendrickson and Rothberg (1998), 189, 190, 192, 193 Higham (1987), 80 Higham (1988), 80, 94 Higham (2002), 43, 57, 63, 80 Hoffman, Martin, and Rose (1973), 181 Hogg (2008), 293 Hogg (2013), 321 Hogg and Scott (2008), 289 Hogg and Scott (2013a), 266 Hogg and Scott (2013b), 285, 287 Hogg, Reid, and Scott (2010), 293, 294 Hood (1976), 245, 254 Hopcroft and Karp (1973), 116 HSL (2016), 84 Hu and Scott (2001), 167, 168 Hu, Maguire, and Blake (2000), 197, 200 IBM (1976), 219 IEEE (1985), 63 Irons (1970), 245, 254 Jennings (1966), 244 Jennings (1977), 160 Jennings and Malik (1977), 227 Karp (1986), 137 Karp and Sipser (1981), 111, 117 Karypis and Kumar (1998a), 188, 189, 191 Karypis and Kumar (1998b), 199 Karypis and Kumar (1998c), 189, 192 Karypis and Kumar (1998d), 294 Kernighan and Lin (1970), 189, 192, 197, 199 Knight, Ruiz, and U¸car (2014), 84 Knuth (1969), 28 K¨ onig (1931), 135

König (1950), 2 Koster (1997), 229 Koster and Bisseling (1994), 229 Kuhn (1955), 111, 130 Kumfert and Pothen (1997), 167 LaSalle and Karypis (2015), 294 Lawson, Hanson, Kincaid, and Krogh (1979), 13 Lewis (1982), 162 Li (2005), 290, 291 Liu (1985), 235, 237 Liu (1986a), 266 Liu (1986b), 272 Liu (1988), 273–276 Liu (1989), 188, 190, 192 Liu (1990), 268 Liu and Sherman (1976), 160 Magun (1998), 111, 117 Markowitz (1957), 96, 138, 139, 143, 201, 204, 205 Matrix Market (2000), 14 Matstoms (1994), 347 Mayoh (1965), 195 McCormick (1983), 341 Meijerink and van der Vorst (1977), 228 Munro (1971a), 122 Munro (1971b), 122 Neal and Poole (1992), 69 Ng and Peyton (2014), 144 Noor, Kamel, and Fulton (1977), 263 O’Leary (1980), 80 Oettli and Prager (1964), 335 Ogbuobiri, Tinney, and Walker (1970), 144 Olschowka and Neumaier (1996), 85 Østerby and Zlatev (1983), 227 Paige and Saunders (1975), 166 Pan (1984), 57

Parter (1961), 4 Pellegrini (2012), 189, 192, 294 Peyton (2001), 144 Polizzi and Sameh (2007), 255 Pothen and Fan (1990), 116, 133 Pothen, Simon, and Liou (1990), 188, 192 Powell (1970), 340 Powell and Toint (1979), 340–342 Pralet (2004), 289 Raghavan (1998), 321 Rajamanickam, Boman, and Heroux (2012), 195 Reid (1971), 66 Reid (1984), 263 Reid and Scott (1999), 163, 167 Reid and Scott (2002), 168, 169 Reid and Scott (2009), 35, 277 Rose (1972), 144, 233 Rose and Tarjan (1978), 4, 137 Rothberg and Eisenstat (1998), 144 Saad (1993), 285 Saad (1996), 228 Saad (2003), 348 Saad and Schultz (1986), 286 Sargent and Westerberg (1964), 110, 120 Schenk and Gärtner (2006), 284 Schenk, Gärtner, and Fichtner (2000), 292 Schneider (1977), 109 Schreiber (1993), 295 Schulze (2001), 184 Scott (1999), 254 Sherman (1975), 25 Sherman (1978), 213 Sherman and Morrison (1949), 327 Sid-Lakhdar (2014), 302 Skeel (1979), 79, 335 Sloan (1986), 162, 163, 247 Smith, Björstad, and Gropp (1996), 348

Sorensen (1981), 344 Speelpenning (1978), 259 Stewart (1973), 43, 358 Stewart (1974), 213, 240, 328 Strang (1980), 43 Strassen (1969), 57 Sui, Nguyen, Burtscher, and Pingali (2011), 294 Szyld (1981), 109 Takahashi, Fagan, and Chin (1973), 338 Tarjan (1972), 122, 125 Tarjan (1975), 122, 268 Tewarson (1973), 195 Thapa (1983), 344 Tinney and Walker (1967), 139, 143, 201 Tinney, Powell, and Peterson (1973), 151 Toint (1977), 340, 344 Toint (1981), 344 Tomlin (1972), 147 Tosovic (1973), 142 van der Vorst (2003), 228, 348 Varga (1962), 3 Vastenhouw and Bisseling (2005), 199 Weisbecker (2013), 305 Westerberg and Berna (1979), 109 Whaley, Petitet, and Dongarra (2000), 51 Wilkinson (1961), 62, 69, 72 Wilkinson (1965), 43, 66, 70 Willoughby (1971), 221 Woodbury (1950), 327 Xia (2013), 304 Yamamoto and Takahashi (1985), 225 Yannakakis (1981), 137 Yip (1986), 327

Zlatev (1980), 143, 208 Zlatev, Barker, and Thomsen (1978), 227 Zlatev, Waśniewski, Hansen, and Ostromsky (1995), 194

SUBJECT INDEX

a posteriori, 71 a priori column ordering, 140–142 AᵀA, 140–141, 194, 240, 283, 337–356 A + Aᵀ, 194, 282 AAᵀ, 194, 195 access by rows and columns, 34 whole symmetric matrix, 233 accidental zeros, 210 active entries, 50, 239 active node, 162 active variables, 246 acyclic graph, 119 adding packed vectors, 19 aerospace, 360 algorithm instability, 65 algorithm stability controlling, 67 definition, 62 allocation problem, 129 alternating path, 129 amalgamation of nodes, 276–277 parameter nemin, 277 AMD, 140, 184, 192, 236–238, 291 vs dissection algorithms, 192 vs minimum degree, 238 AMF, 144, 184 ANALYSE, 100, 101, 204, 210, 232, 262 ANALYSE-FACTORIZE, 101, 204 approximate minimum degree, see AMD approximate minimum fill-in, see AMF arithmetic error, 63 assembly, 246 by rows, 252

assembly tree, 258, 264 supernodal techniques, 290–292 assignment, 112 cheap, 112 problem, 129 ATLAS, 51 atmospheric pollution, 362 augmented matrix, 151 augmenting path, 129 automatic differentiation, 222, 343 auxiliary storage, 248 back-substitution, 44, 103, 149, 215 cost, 56 backward error, 66 backward error analysis sparse case, 334–335 band matrix, 96, 156, 157, 174, 241, 343 blocking, 169, 174 computational requirements, 243 row interchanges, 243 bandwidth, 157 bi-irreducibility, 109 binary tree, 32 bipartite graph, 4, 133 bireducibility, 109 BLAS, 13 Level 1, 13 Level 2, 13, 59, 60, 102, 105 Level 3, 13, 51, 102, 105, 254 sparse, 36 block back-substitution, 316 block Cimmino algorithm, 351 block forward substitution, 109, 317 block iterative method, 350 block matrix, 6 block pivot, 74 block pivot row, 316

block triangular matrix, 95, 108, 118 benefits from, 127 essential uniqueness, 125 experience, 127 lower triangular form, 108 performance of, 127 rectangular, 133 symmetric permutation, 118 upper triangular form, 108 block triangularization computational cost, 127 timing, 127 block tridiagonal matrix, 97, 159, 172 bordered matrix block triangular, 99 boundary row, 197 boundary variables, 259 breadth-first search, 111 cache, 10, 243 latency, 57 misses, 40 cancellation exact, 210 chain, 171 cheap assignments maximize number, 117 Chebyshev acceleration, 227 chemical engineering, 127, 361 chemical plant, 361 child node, 3 Cholesky, see factorization circuit analysis, 345 circuit design, 15 cliques, 233 artificial generation of, 41 pivotal, 233 storage, 39, 41, 233 closed path, 120 COLAMD, 141, 195, 291 collapsing nodes, 120 collection of packed vectors, 34, 234 column count, 205

column singleton, 110, 138 column-linked list, 29, 32, 34 complete pivoting, see full pivoting complex matrix, 106 componentwise relative error, 335 composite node, 120 computational cost, 6, 55 band matrix, 243 Tarjan’s algorithm, 125 gain, 59 performance, 12–13 savings, 9 sequence, 50–52, 54, 339 compute-bound, 105 condition number, 76, 78, 328 estimating, 94 Hager’s method, 80 LAPACK estimate, 80 LINPACK estimate, 79 Skeel, 79, 335 conditioning, 62, 94 automatic detection, 79 ill-conditioned, 62, 71, 74–80 theoretical discussion, 75 well-conditioned, 62 conjugate gradients, 227 connectivity, 196 contiguous storage, 241 contingency analysis, 330 convergence, 226 coordinate scheme, 22, 32, 33, 99 counting sort, 30 covering of entries, 135 CPR algorithm, 341 cubic grid, 186 Cuthill–McKee algorithm, 158–161 cuts, see dissection cycle, 3 DAG, 3, 292–294 data-DAG, 310 L-DAG and U-DAG, 310 task-DAG, 310

data movement, 270 data structures comparisons, 40 compression, 40 coordinate scheme, 22, 32, 33, 99 doubly-linked list, 27 dynamic, 18, 283 linked list, see linked list selection of pivots, 207 static, 18, 205, 239 updating, 207 data-fitting problem, 337 delayed pivots, 278, 283 dense matrix, 43, 292 cost, 55 definition, 1 multiplication, 12 parallelization of SOLVE, 321 depth-first search, 111, 113–116, 122, 261 dgemm, 293, 294 diagonal blocks, 58 diagonally-dominant matrix, 73 digraph, 2, 118–120, 336 direct addressing, 219, 253, 317 vs indirect addressing, 317 directed acyclic graph, see DAG directed graph, see digraph dissection, 98 choosing dissection sets, 188 comparison, 192 implementing, 238 nested, see nested dissection obtaining initial separator, 188 one-way, 177–180 packages, 192 refining, 189 subproblem oriented, 326 unsymmetric matrices, 193 vs AMD, 192 distributed memory, 12, 303 domain decomposition, 348–350 doubly-bordered block diagonal matrix, 98, 99, 178

doubly-linked list, 27, 206, 234 drop tolerances, 225, 345 Dulmage–Mendelsohn decomposition, 133 dynamic analysis, 360 dynamic data structures, 18, 283 dynamic scheduling, 284, 300 economic modelling, 363 edge separator, 186, 194 edges, 2 electric circuit modelling, 364 electrical networks, 222 element assembly tree, 259 elementary matrix lower triangular, 47 permutation, 38 elimination implementation, 209 elimination digraph, 4 elimination tree, 258, 263–268 efficient generation, 266 entry definition, 2 of the inverse, 56 equilibration, 82 error absolute error, 78 arithmetic error, 63 backward error, 66 estimate of, 93 factorization error, 66 relative error, 78 estimating condition number, 94 explicit zeros, 244 external degree, 235 factorization block Cholesky, 292 Cholesky, 9, 53, 72, 241 implicit block, 58 LDLᵀ, 54, 73, 74 LDU, 53 LU, 47, 240, 337

orthogonal, 70, 346–348 partitioned, 57, 70 rectangular matrix, 49 recursive, 52 sequence of, 153 triangular, 48 use of low-rank matrices, 304–306 factorization-error matrix, 66, 71 FACTORIZE, 100, 101–102, 204, 210–215, 232, 239 fast, 211, 215 first, 210, 215 with pivoting, 213 without pivoting, 210 fan in, 52, 215 fan out, 52, 215 fast analysis, 240 features of a sparse code, 99–103 FGMRES, 286 Fiduccia and Mattheyses algorithm, 189 Fiedler vector, 163, 188 calculating it, 166 fill-in, 4, 7, 8, 212 confined, 109, 156 inserting in data structure, 24 minimum, see minimum fill-in none, 139 total, 157 finite differences, 15, 254, 340, 341 finite-element problems, 15, 232, 246, 259 reverse Cuthill–McKee, 160 storage, 39 floating-point IEEE standard, 63 operations, 1, 7, 10, 55 number, 7 Florida Collection, 14 formulation, see problem formulation forward substitution, 44, 48, 103, 149, 215 cost, 56

Frobenius norm, 357 front, 246 frontal matrix, 246 parallelism, 298 partitioning, 298–299 rectangular, 306–312 size, 253, 297 with zero rows, 282 frontal method, 232, 246 data by equations, 254 example, 246 fully summed, 246 general finite-element problems, 250 introduction, 245 non-element problems, 251 permutations, 249 pivoting, 250 preliminary pass, 250 reordering elements, 254 right-hand side, 318 SPD finite-element problems, 246 full pivoting, 69 full-length vector, 22, 40 full-matrix code switching to, 219 fully indecomposable, 109 fully-summed, 246 gangster operator, 343 gather, 19, 212 Gaussian elimination, 4, 7, 43, 44, 55 density increase, 151 ordering, 138 stability in positive-definite case, 72 gbtrf, 173 gemm, 51, 170, 292, 294, 321 generated element, 259 Gflops, 10 Gibbs–Poole–Stockmeyer algorithm, 162 Gilbert–Peierls algorithm, 213, 214, 218, 318

global ordering, 156, 177 GMRES, 227, 286 graph, 3, 281 coarsening, 189 partitioning, 193 software, 191 theory, 2–5 unsymmetric matrix, 193, 194 greedy algorithm, 341 greedy matching, 117 growth, 66 bound on growth, 91 Hall property, 115 hashing function, 234 heap sort, 31–32 Hellerman–Rarick algorithm, 193 Hessian matrix, 340 update, 343–344 HSL, 84 HSL MA48, 90, 128 HSL MA55, 170 HSL MA57, 90 HSL MA87, 294 HSL MC64, 287 HSL MC66, 198, 200 hybrid methods, 348, 352 hybrid multilevel algorithm, 168 hydrocarbon separation, 361 hyper-sparsity, 218 hyperedge, 196 hypergraph, 196 I-matrix scaling, 85, 130 ICCG, 227 IEEE arithmetic, 63 ILU, 228 implicit block factorization, 58, 98, 179 in-place permuting, 38 in-place sorting, 31, 39 incomplete factorization, 226 indefinite matrix, 74, 93, 153, 345 stability, 74

indirect addressing, 40, 219, 317 vs direct addressing, 317 inner product, 36 packed vectors, 20 input of data, 101 instability, 65 interchanges, 38, 217 interpretative code, 222 inverse matrix, 55, 335, 337 computation of, 105 cost of computing, 56 entries of, 335, 337 explicit use of, 56 irreducibility, 109 irreducible matrix, 109, 149 has dense inverse, 335–337 iterative refinement, 80, 213, 226, 286 Jacobian matrix, 99, 340, 341 Kernighan–Lin algorithm, 189 in pseudo-algol, 198 LAPACK, 80, 173 LAPACK condition estimator, 80 Laplacian matrix, 163 latency, 10 leaf node, 3 least squares, 63, 152, 337 left-looking, 50, 54, 211, 271 legal path, 336 level of independence, 225 level sets, 159, 179 generation of, 162 linear programming, 15, 127, 218 basis, 359 linked list, 26, 29, 34, 40, 206, 234, 239 LINPACK condition estimator, 79 local minimum, 340 local nested dissection, 184 local ordering, 137 difficulties, 144 strategies, 138

look ahead, 114 loop-free code, 221 low-rank matrices, 304 lower triangular matrix, 43 submatrix, 58 unit, 44, 47, 48 LU factorization, see factorization MA17, 236 MA27, 236, 366 MA28, 220, 227 MA32, 254 MA36, 245 MA38, 309 MA41, 194, 282 vs other codes, 309 MA41 UNSYM, 282 vs other codes, 309 MA42, 254 MA48, 90, 127, 143, 209, 215, 220, 227, 283, 295 timing, 103 vs other codes, 309 MA57, 90, 285, 288 timing, 104 MA62, 254 MA65, 174 Markowitz count, 138, 205, 307 Markowitz ordering, 96, 138, 142, 205 2×2 pivot, 154 not optimal, 140 Markowitz search restriction of search, 142, 208 termination of, 208 mass node elimination, 235 matching, 129 problem, 129 weighted, 130 matrix approximation of Hessian, 340 band, 96 block triangular, see block triangular matrix

block tridiagonal, 97 bordered block triangular, 98 by matrix product, 36 by vector product, 36, 40 complex, 106 diagonally dominant, 73 doubly-bordered block diagonal, 98, 178 inverse, 105 L stored by columns, 217 L stored by rows, 216 lower triangular, see lower triangular matrix multiplication, 57 norm inequalities, 357 orthogonal, 346 positive-definite, see positive-definite matrix rectangular, see rectangular matrix singly bordered block diagonal, 197 symmetric, see symmetric matrix triangular, see triangular matrix unassembled, 254 upper triangular, see upper triangular matrix variable-band, see variable-band matrix Matrix Market, 14 matrix modification formula, 213, 240, 284, 326–331 applications, 328–331 building a large problem, 328–329 compared with partitioning, 329–330 for sensitivity analysis, 330–331 stability, 327–328 stability corrections, 328 matrix norm, 355–358 maxflow algorithm, 191 maximal independent set, 167 maximal transversal, 133 maximum transversal, 128, 129

MC21, 111, 116 MC29, 84 MC30, 84, 290 MC43, 254 MC47, 140, 238 MC61, 168 MC64, 130, 288–291 MC67, 168 MC71, 94 MC73, 168 MC77, 85, 290 memory hierarchies, 10 memory-bound, 105 merging linked lists, 234 lists, 262 METIS, 185, 192, 199, 294 Mflops, 10 min. col. in min. row, 142 min. row in min. col., 142 minimum deficiency, 143 minimum degree, 139, 142, 365, 366 analysis, 145 experimental data, 145 finite-element case, 236 implementation, 233 improvements, 235 not optimal, 139 ordering, 138, 233 regular grid, 145 tie-breaking, 145, 359 vs AMD, 238 vs nested dissection, 185 minimum fill-in, 137, 143, 144 not optimal, 144 MMD, 235 model reduction, 331–333 with a regular submodel, 331, 333 MONET ordering, 197 performance, 200 monitoring stability, 71 mt-ND-Metis, 295 multifrontal method, 259, 303 example, 277

indefinite problems, 277 storage for factors, 25 unsymmetric matrices, 310 multilevel algorithm coarsening algorithm, 199 multiple minimal degree, see MMD multiple right-hand sides, 54, 319 multiplier, 45, 48, 55 multiply-add pairs, 55 multisection, 182 multisector, 184 MUMPS, 145, 184, 276, 283, 285, 288, 300, 302, 305, 306, 322 ND-Metis, 295 nemin, 277 nested dissection, 98, 180, 238 complexity, 181 finding dissection cuts, 182 local, 184 vs minimum degree, 185 net, 196 net-cut, 197 node parallelism, 295 nodes, 2 nonlinear optimization, 340 nonlinear equations, 340 nonlinear optimization, 339 normal equations, 63, 331 normal matrix, see AᵀA norms, 355 Frobenius, 357 inequalities, 355 infinity norm, 355 one norm, 356 two norm, 356 NP complete, 137 null-space basis, 319–320 numerical stability, 204, 205 O(n²) traps avoidance of, 106 one-way dissection, 98, 177–180

automatic, 179 finding cuts, 179 ordered vector, 212 ordering a posteriori for stability, 251 a priori column, 140–142 constrained, 150 for small bandwidth, 158 global, 156, 177 introduction, 94–95 local, see local ordering Markowitz, see Markowitz ordering min. col. in min. row, 142 min. row in min. col., 142 minimum deficiency, 143 minimum degree, see minimum degree minimum fill-in, see minimum fill-in of children, 271 pagewise, 145 refined quotient tree, 170–173 simple strategies, 142 spectral, 163–166 spiral, 145 Tinney scheme 2, see minimum degree Tinney–Walker scheme 3, 143 topological, see topological ordering unsymmetric on symmetric matrix, 152 using AᵀA, see AᵀA variability-type, 153 within columns, 32 ordinary differential equations, 216 orthogonal factorization, 70, 346–348 out-of-memory, 243, 245 outer product, 36, 37 ozone depletion, 362 packed

form, 19, 48, 212 vectors, 22 adding, 19, 20, 21, 22 inner product, 20 page-thrashing, 40 pagewise ordering, 145 parallel architecture, 10, 342 parallel computer, 262 parallel ANALYSE, 294 parallel FACTORIZE, 295–303 parallel SOLVE, 320–322 parallelism, 57, 228, 255 balance between tree and node, 297 dynamic scheduling, 302–303 memory use, 299 static and dynamic mapping, 300 static mapping and scheduling, 300 parallelization levels, 295 PARDISO, 284, 285, 292 parent node, 3, 263 ParMETIS, 294 partial differential equations, 15, 254 partial pivoting, 67, 252 partial solution, 149–152 partitioned factorization, 57, 70 partitioned matrix, 57 partitioning software, 191 unsymmetric matrices, 193, 197 PasTiX, 284 PDE grids, 127 performance profile, 289 permutation matrix, 37, 49 row and column, 49 permutations in frontal method, 249 in solutions of equations, 217 permuting in place, 38 perturbations

in entries of A, 330 small, 334 Pflops, 10 photochemical smog, 362 pictures of sparse matrices, 359–366 pins, 196 pipelining, 10, 11 pivot, 45 2×2, 74, 93, 250, 278 oxo, 154 pivotal clique, 233 pivoting, 46, 67 2×2, 153 biasing pivot choice, 152 complete, 69 full, 69 none, 232 numerical, 147 partial, 67, 240 rook, 69, 92 static, 284–286 threshold, see threshold pivoting threshold rook, 92 pord, 184 positive-definite matrix, 53, 70, 72, 227, 344 approximating, 344–346 positive-semidefinite matrix, 345 postordering, 261 postordering assembly tree, 366 power systems, 52, 364 precision, 64 preconditioner, 227, 228 problem formulation, 13–14, 63, 331–333 process, 12 product matrix by full vector, 36 matrix by matrix, 36 matrix by vector, 40 profile, 157, 244 profile reduction Fiedler vector, 168 Hager’s exchange method, 168

hybrid orderings, 167 spectral ordering, 163 pseudodiameter, 162 pseudoperipheral node, 162 PT-SCOTCH, 294 QR factorization, 347 quasi-Newton equation, 340 quotient graph, 171, 235 quotient tree, 178 random number generators, 16 Rayleigh–Ritz principle, 356 Rayleigh-quotient iteration, 166 rectangular matrix block triangular, 133 LU factorization, 49 recursive factorization, 52 reducibility, 109 measure of, 127 reducible matrix, 108 refined quotient tree, 170–173 relabelling, 120, 158 complexity, 121 eliminated, 122 of nodes, 118 reduced, 124 relative precision, 64 residual scaled, 72 vector interpretation of large, 71 interpretation of small, 71 reverse Cuthill–McKee algorithm, 159–161, 180 ordering, 171, 254 right-hand sides multiple, 100, 104 right-looking, 50, 54, 206, 270 rook pivoting, 69, 92 root node, 3 rooted tree, 3 row count, 205 row graph, 195

row interchanges, 46, 49, 53 band, 243 row singleton, 110, 138 Sargent and Westerberg algorithm, 119 ScaLAPACK, 302 scaled residual, 72, 149 scaling, 81–85, 92, 287, 358 and reordering, 287–290 symmetric matrix, 287 automatic, 83 effect of, 66, 288 entries close to one, 84 I-matrix scaling, 85 norms of rows and columns, 84 poorly scaled, 82 strategies, 289 scatter, 19, 212 Schur complement, 51 SCOTCH, 192, 294 semibandwidth, 157 sensitivity, 63, 338 separator edge, 186, 194 initial, 188 refining, 189 set, 180 vertex, 186 wide, 194 serial computing, 10 shared memory, 11, 293, 303 Sherman–Morrison formula, 327 simulation of computing systems, 363 simulations, 127 singleton, 110, 138 singly-bordered block diagonal form, 197 singular block, 59 singular matrix, 46 structurally, 112, 115 symbolically, 112, 115 singular values, 356

SLMATH, 219 Sloan algorithm, 162 Sloan’s algorithm, 254 software for sparse matrices features, 89 graph partitioning, 191 input, 101 interface, 99 output, 103 relative work in phases, 103 using, 106 writing, 106 solution using LU factorization, 49 SOLVE, 100, 102, 204, 215, 240, 243, 315 at the node level, 316–317 complexity, 315 experimental results, 322 order of access to tree nodes, 321–322 parallelization, 320–322 using the tree, 318 sorting counting sort, 30 heap sort, 31 in place, 31, 39 sparse BLAS, 36 sparse column vectors, 33, 34 sparse matrix characteristics, 15 definition, 1 ordering, 8 sparse right-hand sides, 318–319 sparse tableau, 153 sparsity constructing pattern of U, 269 in inverse, 337 in right-hand side, 149–152, 216 pattern, 8, 335 trade-off with stability, 90 trade-off with structure, 156 sparsity pattern, 338, 359–366 SPARSPAK, 223, 236, 244, 365

SPD, 246 spectral ordering, 163–166 spectral radius, 226, 227 spiral ordering, 145 SPOOLES, 192 SSLEST, 227 stability assessment in sparse case, 92 bound on stability, 92 monitoring, 71, 73, 94 trade-off with sparsity, 90 stable algorithm, 62 stack, 122, 125, 262 starting node influence of, 124 static condensation, 249, 258, 259 static data structures, 18, 205, 239 static mapping, 300 static pivoting, 284–286 static scheduling, 300 storage collection of packed vectors, 23 compressed index, 25 compression, 25 coordinate scheme, 22 full-length vector, 18 gain, 59 linked list, 26 reuse of, 210 savings, 210 sparse vectors, 18 two-dimensional array, 48 streaming, 12 strong component, 119, 120, 122 strong Hall, 109, 141 structural analysis, 15, 52 structural inverse, 335 structural mechanics, 246 structurally singular, 112, 115 structurally symmetric matrix, 277 structure identical, 35 submatrices, 58 substitution method, 343

substructuring, 259 super-row, 234 SuperLU, 283, 285, 290–291 vs other codes, 309 SuperLU DIST, 284, 289, 291–292 SuperLU MT, 283, 284, 291 supervariable, 35, 234, 291, 311 identification, 35 switching to full form, 219 symbolic analysis unsymmetric matrices, 282 symbolic rank, 129 symbolically singular, see structurally singular symmetric matrix, 52, 56, 277, 342 approximation, 340 Gaussian elimination, 54 ignoring symmetry, 152 indefinite, 74, 93, 153, 345 stability, 74 nearly symmetric, 282 positive-definite, 53, 70, 72, 227 positive-semidefinite, 345 scaling and reordering, 287 tridiagonal, 338 symmetric permutation, 53, 158 symmetry index, 15, 127 SYMMLQ, 166 Tarjan’s algorithm, 122 computational cost, 125 description, 124 illustration, 124 implementation, 125 proof of validity, 124 test collection, 14 Tflops, 10 thread, 11 threshold parameter, 68, 92, 147–149 effect of varying, 148 recommended value, 148

threshold pivoting, 68, 90, 147, 205, 250 tie-breaking, see minimum degree Tinney scheme 2 ordering, see minimum degree Tinney–Walker scheme 1, 143 topological ordering, 171, 264, 318 trade-off sparsity with stability, 90 transversal, 110, 111–118 algorithm analysis, 115 algorithm implementation, 116 computational complexity, 115 extension, 112, 113 maximum, 128 tree, 3, 139, 160, 171 tree parallelism, 295 tree rotation, 273–276 triangular factorization, see factorization triangular form permutation to, 119 illustration, 121 triangular matrix, 43, 243 tridiagonal matrix, 341 trsm, 321 true degree, 235 UMFPACK, 309 vs other codes, 309 undirected graph, 3 unit triangular matrix, 48 unstable FACTORIZE, 213 unsymmetric interchanges, 250 unsymmetric matrices, 282 dissection, 193 multifrontal, 310 partitioning, 193 symbolic analysis, 282 trees for, 312 unsymmetric ordering on symmetric matrix, 152 upper triangular matrix, 43, 45 submatrix, 58

unit, 44 variability type, 153 variable-band matrix, 96, 156, 174, 244 blocking, 169 dynamic allocation of storage, 174 variable-band ordering, 252 variances, 337 vector architecture, 317 vector norm, 355–358 vertex separator, 186 vertices, 2 water-distribution network, 334 wavefront, 162 weighted matching, 130 algorithm, 130–133 algorithm performance, 132 wide separator, 194 Woodbury formula, 327 WSMP parallel version, 297, 312 symmetric case, 283, 297–299, 300 unsymmetric case, 283, 310–312 Y12M, 227 YSMP, 236, 365 zeros accidental, 210 explicit storage of, 157, 253 ZOLTAN, 192